
A SELF-STEERING DIGITAL MICROPHONE ARRAY
Walter Kellermann *
Acoustics Research Department, AT&T Bell Laboratories, Murray Hill, NJ, USA

ABSTRACT

A self-steering microphone array for teleconferencing is presented in which the digitally implemented steering algorithm consists of two parts. The first part, the beamforming, is based on already known concepts. The second part, a novel voting algorithm, integrates elements of pattern classification and exploits temporal characteristics of speech signals. It also accounts for perceptual criteria and the acoustic environment. A real-time implementation is outlined and results are discussed.


1 INTRODUCTION

Steerable microphone arrays for audio applications aim at picking up sound emitted by a distant source while suppressing signals arriving from other directions. This is achieved by directing a beam of increased sensitivity towards the source so that, ideally, the output signal of the steered array sounds similar to that of a microphone placed next to the source ('presence effect'). Applications for such devices include hands-free telephony systems (e.g., mobile telephony, teleconferencing) and the more general situation where speech signals must be picked up out of a noisy environment and where the talker should not be required to carry a personal microphone (as, e.g., in many speech recognition applications). Here, we focus on the application to teleconferencing, where the 'presence effect', noise reduction, and suppression of interfering sources are desirable at the same time. Moreover, for conferences connecting several teleconferencing rooms, steerable arrays can relieve the echo cancellation problem by attenuating the local feedback path. The main difficulty for the beamforming is given by the bandwidth of the audio signal, which even for telephony extends over more than three octaves. Furthermore, for the intended application, the microphone array must accommodate several simultaneously active sources while maintaining good spatial selectivity, and it must be able to track a moving source, e.g., a talker walking around in the room. We approached this problem with a two-stage strategy: first, fixed beams are formed whose superposition covers the entire space of interest, and second, a voting algorithm selects the beam(s) that should contribute to the output signal. This idea was implemented earlier using mostly analog hardware [1]. The motivation for the work presented here was to explore the possibilities offered by digital signal processing algorithms and hardware and, thereby, to improve functionality, reduce hardware cost, and increase flexibility. In the following sections we briefly outline the system architecture and then discuss the beamforming and the voting algorithm, with emphasis on the novel voting algorithm. Finally, an implementation and some results are briefly reviewed.
* Now with Philips Kommunikations Industrie, Nürnberg, Germany.

Figure 1. Structure of the digitally steered microphone array.


2 GENERAL STRUCTURE AND DESIGN CONSIDERATIONS

The basic structure of the system is shown in Fig. 1. The outputs of a linear microphone array are conveyed to the digital signal processing hardware after preamplification and A/D conversion. Here, beamforming and voting are performed to produce an output signal which may be used for transmission, speech recognition, or other purposes. The key parameters of such a system, determining both performance and processing load, are the number of sensors and the number of beams. Considering a prototype for teleconferencing rooms or office-like environments, the number of microphones and their spacings are a compromise between a large aperture for good spatial resolution and the demand for a small aperture to ensure the validity of the far-field assumption on which the beamforming is based. Moreover, the spacings must be smaller than half a wavelength to preclude spatial aliasing. Aiming at telephone bandwidth (sampling rate 8 kHz), which covers approximately three octaves, arrays for a low-frequency (LF), a mid-frequency (MF), and a high-frequency (HF) section were realized, each consisting of 11 (first-order differential) microphones with spacings of 16 cm, 8 cm, and 4 cm, respectively. (As some of the microphones can be used for several frequency sections, the entire array consists of 23 microphones.) Assuming that the array will be mounted on a wall and, therefore, should cover an angular range of somewhat less than 180°, the number of beams was chosen to be 7, with 'look directions' of 0°, ±20°, ±40°, and ±60° off broadside¹. Unlike the earlier analog implementation [1], where a 'track while scan' method used only two beams simultaneously, the digital implementation allows us to form all seven beams at each sampling instant.


¹ The broadside axis is defined as being perpendicular to the array axis and originating at the center of the array (cf. [2]).
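To make the spacing constraint concrete, the following minimal sketch (Python, assuming a speed of sound of 343 m/s) computes the alias-free upper frequency c/(2d) for the three section spacings given above; the printed numbers are illustrative, not taken from the paper.

sections = {"LF": 0.16, "MF": 0.08, "HF": 0.04}  # spacings in meters, from the text
C = 343.0  # speed of sound in m/s (assumed room value)

for name, d in sections.items():
    # spatial aliasing is precluded for d < lambda/2, i.e. f < c / (2 d)
    f_max = C / (2.0 * d)
    print(f"{name}: d = {100 * d:.0f} cm -> alias-free below {f_max:.0f} Hz")

# prints approximately:
#   LF: d = 16 cm -> alias-free below 1072 Hz
#   MF: d = 8 cm -> alias-free below 2144 Hz
#   HF: d = 4 cm -> alias-free below 4288 Hz

These limits lie safely above the crossover frequencies of 760 Hz and 1680 Hz used for the band-splitting filters (see Section 3), and the HF section remains alias-free over the full telephone band.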





3 BEAMFORMING
The beamforming comprises a sequence of linear signal processing operations, as shown in Fig. 2. First, the microphone signals are assigned to the frequency sections they contribute to (grouping). The subsequent steps are then performed independently for each section, with the processing being essentially the same. The aperture shading stage multiplies each of the sensor signals by a weighting factor and thereby determines the balance between beamwidth and sidelobe attenuation for each beam. The weights are chosen according to the Dolph-Chebyshev design method [3], which, for the broadside beam, yields the minimum beamwidth for a prescribed sidelobe attenuation and a given number of sensors. Since higher sidelobe attenuation increases the sensitivity to calibration errors of the microphone array, the weights were chosen to yield a sidelobe attenuation of 25 dB. In the wavefront reconstruction stage, a wavefront is reconstructed for each beam as it would be received if the array were rotated by the respective 'steering angle'. This can be achieved by delaying the sensor signals appropriately. Here, we assume far-field conditions and, therefore, reconstruct planar wavefronts for each of the 7 beam directions within each frequency section. Delays that are non-integer multiples of the sampling interval are realized by interpolation. Using 8 neighboring samples for the interpolation of a delayed sample and designing the interpolation filter according to [4], the maximum interpolation error is, in our case, less than −34 dB for all beam directions. An efficient implementation circumvents upsampling before the interpolation filtering by using only subsets of the filter coefficients [5]. In the final stage of the beamforming, the three sections are summed after frequency-selective filtering. For the LF, MF, and HF sections, a lowpass, a bandpass, and a highpass filter are employed, respectively. Each filter is designed as an elliptic IIR filter of order 6, 12, and 6, respectively. The sum of these filters approximates unity over the entire frequency range. The crossover points were chosen to be at 760 Hz and 1680 Hz. In combination with the given shading coefficients and steering angles, this choice was found to be a good compromise, as it provides sufficient spatial coverage at the high frequencies of each frequency section while at the low end the beams do not become unnecessarily wide (see also [2]).

Figure 2. Signal processing for beamforming.
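As an illustration of this chain, here is a minimal delay-and-sum sketch for one frequency section. It uses SciPy's Dolph-Chebyshev window for the aperture shading; the 8-tap windowed-sinc fractional-delay interpolator is a generic stand-in for the filter design of [4], and the circular shift for the integer delay part is a toy simplification, not the paper's implementation.

import numpy as np
from scipy.signal.windows import chebwin

C = 343.0    # speed of sound (m/s), assumed
FS = 8000.0  # sampling rate (Hz)

def delay_and_sum(x, spacing, angle_deg, n_taps=8):
    """Steer one frequency section towards angle_deg off broadside:
    Dolph-Chebyshev shading (25 dB sidelobes), far-field plane-wave
    delays, and a windowed-sinc fractional-delay interpolator.

    x: (n_mics, n_samples) array of sensor signals."""
    n_mics, n_samples = x.shape
    w = chebwin(n_mics, at=25)
    w = w / w.sum()  # unity gain for the broadside look direction
    # far-field delay of each sensor, relative to the array center
    pos = (np.arange(n_mics) - (n_mics - 1) / 2) * spacing
    d = pos * np.sin(np.radians(angle_deg)) / C * FS  # delay in samples
    y = np.zeros(n_samples)
    for m in range(n_mics):
        n0 = int(np.floor(d[m]))  # integer part of the delay
        frac = d[m] - n0          # fractional part, in [0, 1)
        k = np.arange(n_taps) - n_taps // 2 + 1
        h = np.sinc(k - frac) * np.hamming(n_taps)
        h = h / h.sum()           # enforce unity DC gain
        # toy handling of the integer delay via a circular shift
        y += w[m] * np.convolve(np.roll(x[m], -n0), h, mode="same")
    return y

In the full system, such a routine would run once per beam direction and per section, with the three section outputs combined through the crossover filters described above.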

4 VOTING

The purpose of the voting stage is to select those beam signals which provide the best coverage of the currently active signal sources (talkers) and to form the corresponding output signal. The strategy developed here is outlined in Fig. 3 and operates on a frame-by-frame basis (no overlap, frame length 16 ms). The task of the voting algorithm is to derive suitable weights for the incoming beam signals, which are weighted and summed to give the output signal. This is achieved by a procedure consisting of four stages (see Fig. 3). First, the analysis extracts a feature vector from each beam signal frame. In a second step, the feature vector is used to decide whether the current frame is part of a noise signal or part of a speech signal. This decision requires an estimation of the noise characteristics. Analysis, speech/noise discrimination, and background noise estimation are performed independently and identically for each beam signal. The assignment of the beam weights, however, must consider all beams as an ensemble. In the following sections we describe the various steps in more detail. The values given for various parameters are derived from experiments, so that no optimality can be claimed.

Figure 3. Structure of the voting algorithm.

4.1 Analysis

The choice of the features that are extracted from a signal frame is based on the results documented in [6], where speech detection schemes for satellite communication systems were investigated. With regard to computational complexity, we limit ourselves to 3 features. The most important feature is the logarithm of the signal energy, while the first two PARCOR coefficients [7] were found to be a reasonable choice for the two other features [6].
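A minimal sketch of the per-frame feature extraction follows. The paper does not specify the computation; here the PARCOR (reflection) coefficients are assumed to come from a second-order Levinson-Durbin recursion on the frame autocorrelation (cf. [7]).

import numpy as np

def frame_features(frame):
    """Feature vector per 16 ms frame: log energy plus the first two
    PARCOR coefficients from a 2nd-order Levinson-Durbin recursion."""
    r = np.array([frame[: len(frame) - k] @ frame[k:] for k in range(3)])
    log_energy = np.log(r[0] + 1e-12)
    k1 = r[1] / (r[0] + 1e-12)              # first PARCOR coefficient
    e1 = r[0] * (1.0 - k1 * k1)             # order-1 prediction error
    k2 = (r[2] - k1 * r[1]) / (e1 + 1e-12)  # second PARCOR coefficient
    return np.array([log_energy, k1, k2])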
4.2 Speech/Noise Discrimination

The decision whether a given frame at time k should be considered speech or noise is made on the basis of the following discriminant function:

D(k) = \log_e \det \mathbf{C}_{vv}(k) + \big(\mathbf{v}(k) - \mathbf{m}_v(k)\big)^{T} \, \mathbf{C}_{vv}^{-1}(k) \, \big(\mathbf{v}(k) - \mathbf{m}_v(k)\big),

where C_vv(k) and m_v(k) represent the covariance matrix and the mean vector of the estimated feature vectors v for background noise, respectively. The second term of the sum corresponds to the Mahalanobis distance [8] between the current feature vector v(k) and the estimated background noise features. The first term is constant as long as the background noise features do not vary. (This term makes D(k), for our discrimination task, equivalent to an estimate of the conditional probability density function for the current feature vector, assuming it is background noise with normally distributed, wide-sense stationary features [9].) The discriminant function is used for two decisions. First, it is decided whether the current signal frame is to be regarded as speech or noise. Using two thresholds D1, D2 (D1 > D2), a hysteresis over time is realized: to enter the 'speech' state, D(k) must exceed D1, and only if D(k) < D2 is the 'noise' state re-entered. The hysteresis is especially effective in noisy environments when unvoiced sounds are embedded in voiced sounds; these sounds would be classified as noise if only one threshold were provided. As an additional requirement for entering the 'speech' state, the energy of the current frame must exceed the mean noise energy. The second decision derived from D(k) can be viewed as part of the background noise estimation procedure and states whether or not a frame may be used to update the knowledge about the background noise: if D(k) is smaller than a threshold D3 (which is usually somewhat smaller than D2), the corresponding feature vector becomes a candidate for updating the background noise estimates.
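The discriminant function and the two-threshold hysteresis can be sketched as follows; the threshold values are placeholders, since the paper derives all thresholds experimentally.

import numpy as np

D1, D2, D3 = 12.0, 8.0, 7.0  # placeholder thresholds (D1 > D2 > D3)

def discriminant(v, m_noise, C_noise):
    """D(k): log det of the noise covariance plus the Mahalanobis
    distance of the feature vector v from the noise mean."""
    diff = v - m_noise
    _, logdet = np.linalg.slogdet(C_noise)
    return logdet + diff @ np.linalg.solve(C_noise, diff)

class SpeechNoiseState:
    """Two-threshold hysteresis over successive frames."""
    def __init__(self):
        self.speech = False

    def update(self, D, frame_energy, mean_noise_energy):
        if not self.speech:
            # entering 'speech' additionally requires above-noise energy
            self.speech = D > D1 and frame_energy > mean_noise_energy
        elif D < D2:
            self.speech = False
        return self.speech

def is_noise_candidate(D):
    """Frames with D below D3 may update the background noise estimates."""
    return D < D3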



4.3 Background Noise Estimation

As a basis for the speech/noise discrimination, the mean vector m_v(k) and the covariance matrix C_vv(k) of the feature vectors for background noise must be ascertained. This requires an estimation procedure which produces useful initial noise estimates as soon as possible after starting the system, so that discrimination between speech and noise can be performed. This estimate must then be consolidated before entering a steady-state phase where the estimation procedure remains invariant. At all times, the estimation should be able to follow both abrupt and smooth changes of the background noise. In order to meet these requirements, a scheme was developed consisting of two adaptation procedures that run in parallel: one estimates m_v(k) and C_vv(k) on the basis of the discriminant function values and a given threshold D3, while the other adapts the threshold D3 itself.

Estimation of Mean Vector and Covariance Matrix: For the adaptation of m_v(k) and C_vv(k), a three-stage strategy modeled after [6] is adopted. During the first stage, the startup phase (which extends over typically 50 frames), no previous knowledge about the background noise is available. Thus, each frame is considered to be background noise and is used for the estimates that are formed at the end of this phase. For the second and the third phase, the discriminant function can now be computed, and the speech/noise discrimination is performed for each frame. For updating the noise estimates, only those frames are considered which not only are candidates according to the discriminant function value D(k) but also are embedded in a contiguous block of candidates. When choosing the number of candidates that must precede the first one accepted for the noise update, we take into account that reverberation in the acoustic environment might cause noise-like frames which actually are reverberated speech. At the end of such a contiguous block of noise candidates, one may find frames which are detected as noise although they contain unvoiced sounds of emerging speech. Consequently, we decided to discard 50 frames (corresponding to 800 ms) at the beginning of the block and 12 frames (192 ms) at the end of the block. During the second phase, the consolidation phase, the averaging is extended to typically 500 frames (8 s), and during the third phase, the steady-state phase, the estimates are recursively updated using a fixed time constant of 8 s. For computational efficiency, during the consolidation and the steady-state phases the estimates are not updated for each newly accepted noise candidate, but only when a block of typically 10 such frames is complete.

Threshold Adaptation for the Discriminant Function: The motivation for adapting the thresholds of the discriminant function arises from the time-variance of the background noise: after a change of the noise environment, it may occur that feature vectors which describe the new background noise are not accepted as noise because D(k) exceeds the current threshold D3. Thus, the background noise estimation may never adapt to the new noise situation. Obviously, in this case, the threshold D3 must be raised in order to incorporate the feature vectors of the new noise background into the estimation. On the other hand, D3 should be kept as low as possible to avoid acceptance of speech segments as background noise, which would result in a degraded speech/noise discrimination.

As a first step, the concept for the adaptation of the threshold requires a noise detection which is independent of the discriminant function and the associated thresholds. We use as a criterion the dynamic range of the energy of the signal frames within a time interval of typically 100 frames. If the dynamic range of the frame energies is below a given threshold, this signal segment is considered to be background noise. The underlying assumption is that background noise exhibits a distinctly smaller dynamic range of frame energy than speech. Once a signal segment is decided to be background noise, a series of tests is performed which use quantities derived from the frame energy and from the discriminant function. These tests determine whether or not the threshold D3 should be changed, and if so, they also determine the amount of change. Accordingly, D3 is increased if the noise features changed abruptly. A corresponding mechanism decreases D3 again if the averaged discriminant function value stays much smaller than the threshold D3 over a certain time interval. As for the thresholds D1 and D2, it was found that changing these along with D3 by the same amount yields good results.
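A sketch of the steady-state part of the noise estimation follows: a contiguous block of candidates is stripped of its head and tail, and the survivors are folded into recursive mean/covariance estimates. The per-block batch update stands in for the paper's update every 10 accepted frames; the startup and consolidation phases and the D3 adaptation are omitted for brevity.

import numpy as np

HEAD, TAIL = 50, 12  # frames discarded at block start / end (see text)
ALPHA = 1.0 / 500.0  # recursive gain, roughly the 8 s time constant

class NoiseModel:
    """Steady-state background-noise estimation: collect a contiguous
    block of noise candidates, strip its head (possible reverberated
    speech) and tail (possible unvoiced onsets of emerging speech),
    then update the recursive mean/covariance estimates."""

    def __init__(self, dim=3):
        self.m = np.zeros(dim)  # mean vector m_v
        self.C = np.eye(dim)    # covariance matrix C_vv
        self.block = []         # pending contiguous candidates

    def frame(self, v, is_candidate):
        if is_candidate:        # D(k) < D3 for this frame
            self.block.append(v)
            return
        # block ended: accept only frames safely inside the block
        for u in self.block[HEAD : len(self.block) - TAIL]:
            d = u - self.m
            self.m = self.m + ALPHA * d
            self.C = self.C + ALPHA * (np.outer(d, d) - self.C)
        self.block = []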
4.4 Beam Weight Assignment

The assignment of the weights to the individual beams must satisfy two requirements: first, it must select those beams which provide the best coverage of the currently active sources, and second, it must account for perceptual criteria, such as the unpleasantness of switching effects. The algorithm consists of two stages, which are explained below. At the first stage, we assign a potential to each beam, thereby determining which beams are activated. Here, activation means that a beam is considered as pointing to an active talker and should obtain the corresponding weight. Based on the potentials, the second stage assigns the actual weight values.

Potential Assignment: The potential assignment method was motivated mainly by the inability of the speech/noise discrimination to always find the correct beam for an active talker: speech will usually be detected for several beams as long as the environment is reverberant. Choosing the beam whose discriminant function D(k) is maximum is prone to errors, too, since D(k) is measured relative to the background noise estimates, which are not equal for all beams. Therefore, a strategy was developed that essentially exploits the energy bursts that are characteristic of voiced sounds in speech signals. The idea is that an energy burst in the beam signal generates a positive potential, which corresponds to the number of future frames for which this beam should remain activated following the current frame. This potential should only be assigned to bursts arriving via the shortest path, not to reflections, and should not be eroded completely by time before the next burst in continuous speech can be expected. For each time frame, we select as candidates for new potential those beams that are maximum with respect to either the instantaneous (i.e., current-frame) energy or a lowpass-filtered energy. The potential for the maximum of the instantaneous energy lasts for typically 5 frames, while for the maximum of the averaged energy, 20 frames proved to be reasonable. Using two different criteria allows fast detection of emerging speech as well as bridging unvoiced sounds between bursts, and rapidly discards impulsive noises. Provided that the candidates were recognized as speech by the speech/noise discrimination, some more tests are performed before they obtain new nonzero potential. First, an estimate of the signal-to-noise ratio (SNR), formed from the instantaneous or the lowpass-filtered energy and the mean background noise energy, must exceed a given threshold (e.g., 3 dB). This prevents beams from being activated by background noise that is not well represented by the estimates. Two more criteria are introduced to account for the interactions between already active beams (having nonzero potential from a previous assignment) and potentially newly activated beams. Both are applied only to those candidates which had zero potential in the previous time frame. The first criterion realizes a burst echo suppression and should prevent a candidate from being activated if the corresponding beam signal is an echoed version of a burst which already caused another beam to be activated. Thus, if there are already activated beams, the candidate's energy must exceed the attenuated maximum of all beams over a certain number of preceding frames; the attenuation factor and the number of frames correspond to the reflectiveness and the reverberation time of the acoustic environment, respectively. The second criterion is a neighbor inhibition mechanism and prevents two neighboring beams from being activated at the same time. This avoids cancellation effects in the directivity pattern caused by the interference of two neighboring beams. As the directivity patterns of neighboring beams overlap for most frequencies, this procedure does not significantly impair the coverage of possibly active sources, but it retains the spatial selectivity achieved by a single beam. The algorithm for the neighbor inhibition proceeds as follows: if a beam is a candidate for new potential and its neighboring beam is already active, the candidate is discarded if it is not the maximum of both the instantaneous and the lowpass-filtered energy. To obtain the new potential, the candidate must also exceed its neighbor's lowpass-filtered energy by a prescribed amount (typically 3 dB). If the candidate meets these conditions, the potential of the previously activated adjacent beam is set to zero, so that only one of the two beams remains active. The required excess energy causes a hysteresis and is very useful, as it prevents the algorithm from alternately assigning the potential to two beams while the source is actually located between their 'look directions'. Thus, undesirable switching of the background noise while the same talker is active does not occur. On the other hand, the ability of the algorithm to track a moving source is not affected as long as the required excess energy is not too large. Finally, each beam is assigned a potential that is the newly acquired potential or the decremented previously assigned potential, whichever is larger. The potential assignment method proved to be very efficient in keeping the number of activated beams minimal while still covering all active sources. Although at most two beams can have new potentials assigned within a time frame (16 ms), experiments showed that three simultaneously active talkers are covered without a beam being 'lost'.

Computation of Weights: The weights for the individual beam signals range between 0 and 1 and are mainly determined by the corresponding potential and the previous weight. For a newly activated beam, it was found to be perceptually very important that the transition of the weight from 0 to 1 has a sigmoid character, while for the transition from 1 to 0, initiated when a previously activated beam runs out of potential, a simple exponentially decaying weight yields satisfactory behavior. However, the case that no beam has nonzero potential has to be treated separately: to avoid the 'dead channel' phenomenon, at least one weight should not decay exponentially. Thus, for the beam which most recently had nonzero potential, the weight is kept constant until a beam is reactivated.
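A compact sketch of both stages follows: the potential bookkeeping (each beam keeps the larger of its new and decremented old potential) and the weight computation with sigmoid attack, exponential release, and the 'dead channel' rule. The ramp length and decay factor are assumed values, and a raised-cosine step stands in for the unspecified sigmoid.

import numpy as np

P_INST, P_AVG = 5, 20  # new potential for winning the instantaneous /
                       # averaged energy test (frames, from the text)
ATTACK_FRAMES = 8      # assumed length of the sigmoid 0 -> 1 ramp
DECAY = 0.7            # assumed per-frame factor for the 1 -> 0 release

def update_potentials(p_old, p_new):
    """Keep the larger of the newly acquired potential and the
    decremented previous one, per beam."""
    return np.maximum(p_new, np.maximum(p_old - 1, 0))

class WeightStage:
    """Maps per-beam potentials to weights in [0, 1]."""

    def __init__(self, n_beams=7):
        self.w = np.zeros(n_beams)
        self.ramp = np.zeros(n_beams, dtype=int)
        self.last_active = 0  # beam that most recently held potential

    def frame(self, potential):
        active = potential > 0
        if active.any():
            self.last_active = int(np.argmax(potential))
        for b in range(len(self.w)):
            if active[b]:
                # raised-cosine attack; never step a weight down
                self.ramp[b] = min(self.ramp[b] + 1, ATTACK_FRAMES)
                rise = 0.5 * (1 - np.cos(np.pi * self.ramp[b] / ATTACK_FRAMES))
                self.w[b] = max(self.w[b], rise)
            elif not active.any() and b == self.last_active:
                pass  # 'dead channel' rule: hold this weight constant
            else:
                self.ramp[b] = 0
                self.w[b] *= DECAY  # exponential release
        return self.w

# the output frame is the weighted sum of the beam signal frames:
# y = (weights[:, None] * beam_frames).sum(axis=0)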

5 IMPLEMENTATION AND RESULTS

The proposed system was implemented in real-time hardware and tested in different environments. For the sensor array, first-order differential microphones were employed. The A/D interface provides a linear 16-bit converter for each sensor signal and forms a serial bitstream. The digital signal processing is performed by a cascade of 4 AT&T DSP32C processors, with 3 of them being used for the beamforming and the fourth realizing the voting algorithm. A total of 27.2 MIPS ('million instructions per second') are executed with a required storage of 57.4 kByte for both program and data. The DSPs are monitored and controlled via a personal computer. Menu-driven user interface software allows control of all parameters of the voting algorithm using a mouse.

Measurements in an anechoic chamber confirmed that the beamforming performance of the implemented system is in good agreement with theory. The functionality of the voting algorithm was examined by careful listening tests in a teleconferencing room and in an office environment. It was found that the reaction time of the array to emerging speech is short enough to avoid noticeable chopping of speech and that no switching noise is heard when beams are activated or deactivated. As intended, the number of activated beams is always kept minimal. Interestingly, violation of the far-field condition by the source does not degrade the system performance as long as the background noise sources (including reverberation) meet this condition. Moreover, the implemented system showed good 'self-healing' capability when recovering from the most problematic situation, i.e., when the noise estimation could not follow a changing background noise because of an intensive conversation. In both environments it could be verified that the good functionality is quite robust to parameter variations.

6 CONCLUSION

The results obtained by real-time experiments confirm that the proposed concept deals successfully with teleconferencing environments and that it yields substantially better performance than earlier concepts based on analog hardware. Future work could aim at larger bandwidth and larger rooms, e.g., auditoria. Conceptually, these extensions are straightforward, as the proposed voting algorithm can be applied without major alterations, and for the beamforming the processing remains essentially the same, although the numbers of sensors and beams will increase.

ACKNOWLEDGEMENT

The author wishes to thank Gary Elko, Jim Snyder, and Bob Kubli for their guidance and support, as well as many other individuals of the Information Principles Research Lab for providing an inspiring and creative environment for this work.

REFERENCES

[1] J.L. Flanagan, J.D. Johnston, R. Zahn, and G.W. Elko. Computer-steered microphone arrays for sound transduction in large rooms. J. Acoust. Soc. Am., 78(5):1508-1518, November 1985.
[2] J.L. Flanagan. Beamwidth and useable bandwidth of delay-steered microphone arrays. AT&T Technical Journal, 64(4):983-995, April 1985.
[3] C.L. Dolph. A current distribution for broadside arrays which optimizes the relationship between beamwidth and sidelobe level. Proceedings of the IRE, 34:335-348, June 1946.
[4] G. Oetken, T.W. Parks, and H.W. Schüßler. A computer program for digital interpolator design. In Digital Signal Processing Committee of the IEEE ASSP Society, editor, Programs for Digital Signal Processing, chapter 8.1. IEEE Press, 1979.
[5] R.G. Pridham and R.A. Mucci. Digital interpolation beamforming for low-pass and bandpass signals. Proceedings of the IEEE, 67(6):904-919, June 1979.
[6] H. Schramm. Untersuchungen an Sprachdetektoren für digitale Sprachinterpolationsverfahren. PhD thesis, Universität Erlangen-Nürnberg, Erlangen, FRG, 1987. (In German.)
[7] L.R. Rabiner and R.W. Schafer. Digital Processing of Speech Signals. Prentice Hall, Englewood Cliffs, NJ, 1978.
[8] R.O. Duda and P.E. Hart. Pattern Classification and Scene Analysis. John Wiley & Sons, New York, NY, 1973.
[9] A. Papoulis. Probability, Random Variables, and Stochastic Processes. McGraw-Hill, New York, NY, 2nd edition, 1984.


