
Methods of Formant Tracking

Muhammad Fadhli Mustaffa


Identification and Tracking of Speech Formants in Noise

1. Formants
When the frequency spectrum of a speech signal is computed, peaks can be observed. The narrow, regularly spaced peaks are harmonics at integer multiples of the fundamental frequency of the sampled voice, while the broader peaks in the spectral envelope are the formants. An example is shown in Figure 1.

Figure 1: Spectrum of the vowel "ah" showing three formant regions. Obtained from http://www.sfu.ca/sonic-studio/handbook/Formant.html

A formant is a peak in an acoustic frequency spectrum which results from the resonant frequencies of any acoustical system. These formants describe the spectral structure of voiced speech. They are the characteristic partials that enable us to identify the type of sound produced, especially the vowels.

2. Methods of Formant Tracking


The identification and tracking of formants are important tasks in speech analysis. Several methods and algorithms have been proposed and implemented. Four of them are discussed here, each based on one research paper.

2.1 Multiband Energy Demodulation


A. Potamianos, P. Maragos, Speech Formant Frequency and Bandwidth Tracking Using Multiband Energy Demodulation, 1995

A filtering scheme is used to isolate a single resonance signal from the spectrum. The scheme used is a bank of Gabor bandpass filters which are uniformly spaced in frequency. The time-frequency distribution obtained will have a time resolution equal to the step of the short-time window used. A demodulation algorithm is then used to obtain the amplitude envelope and the instantaneous frequency signals. Two kinds of algorithms were used:

1. Energy separation algorithm (ESA): A simple algorithm which is computationally efficient and has excellent time resolution.
2. Hilbert transform demodulation (HTD): An alternative way to find the estimates. It can be implemented in the frequency domain as a 90° phase splitter.

ESA is mostly used because of its advantages over HTD.

The advantages of this algorithm are that it is conceptually simple, is easy to implement in parallel, behaves well in the presence of nasalization, and provides realistic formant estimates.
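The energy separation step of Section 2.1 can be illustrated with a minimal sketch. It assumes the DESA-2 variant of the discrete energy separation algorithm, built on the Teager-Kaiser energy operator; whether the cited paper uses exactly this variant is not stated in this report. The function names `teager_energy` and `desa2` are hypothetical, and a pure tone stands in for the output of one Gabor filter.

```python
import math


def teager_energy(x):
    """Teager-Kaiser operator: psi[x](n) = x(n)^2 - x(n-1) * x(n+1)."""
    return [x[n] ** 2 - x[n - 1] * x[n + 1] for n in range(1, len(x) - 1)]


def desa2(x, fs):
    """Estimate instantaneous frequency (Hz) and amplitude with DESA-2.

    The returned lists are aligned with samples x[2] .. x[len(x) - 3].
    Valid for frequencies below fs / 4.
    """
    # Symmetric difference y(n) = x(n+1) - x(n-1), defined for n = 1 .. N-2.
    y = [x[n + 1] - x[n - 1] for n in range(1, len(x) - 1)]
    psi_x = teager_energy(x)  # psi_x[i] belongs to sample n = i + 1
    psi_y = teager_energy(y)  # psi_y[j] belongs to sample n = j + 2
    freqs, amps = [], []
    for j, py in enumerate(psi_y):
        px = psi_x[j + 1]  # same sample index n = j + 2
        if px <= 0 or py <= 0:
            # The operator can go non-positive on noisy or non-AM-FM input.
            freqs.append(float("nan"))
            amps.append(float("nan"))
            continue
        c = max(-1.0, min(1.0, 1.0 - py / (2.0 * px)))
        omega = 0.5 * math.acos(c)  # radians per sample
        freqs.append(omega * fs / (2 * math.pi))
        amps.append(2.0 * px / math.sqrt(py))
    return freqs, amps


# Demo: a 500 Hz tone of amplitude 0.8 (assumed stand-in for a filter output).
fs = 8000.0
tone = [0.8 * math.cos(2 * math.pi * 500.0 * n / fs) for n in range(64)]
freqs, amps = desa2(tone, fs)
print(round(freqs[10], 3), round(amps[10], 3))
```

Because the estimate at each sample uses only a five-sample neighbourhood, the sketch shows why the ESA is described as having excellent time resolution and low computational cost.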

2.2 Pole Interaction


Y. S. Hsiao, D. G. Childers, A New Approach to Formant Estimation and Modification Based on Pole Interaction, 1997

In the root-finding formant estimation method, the linear prediction (LP) polynomial is factorised and the appropriate roots are assigned to simulate the resonances of the vocal tract. This method discards the roots it judges to be spurious, even though these roots actually help reinforce the formant peaks and narrow the bandwidths. The motivation behind the new algorithm is that the conventional root-finding estimation algorithm is less reliable because of this problem, which is caused by pole interaction. Hence, an algorithm was introduced that takes pole interaction into account in order to produce a more reliable formant spectrum.

In speech perception, formant energy is more important than formant bandwidth. Therefore, in this method, the speech formants are characterised not only by the formant frequencies but also by the formant spectral densities. The formant bandwidths are modified to reduce the degradation caused by pole interaction.

We would want to synthesize a specific formant structure by modifying a known formant polynomial, but directly shifting formant poles affects the entire formant spectrum, including the spectral densities at the other frequencies. This may result in a formant not appearing in the spectrum. To reduce the effect, the radii of the poles are modified such that the spectral energy of the modified formant polynomial is equal to a specified spectral value. All the equations used take into account the effect of the poles interacting. The process is iterated until the total distortion between the specified and modified power spectra is below a certain threshold. The final formant polynomial is then constructed from the modified poles and their complex conjugates.
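The idea of pole interaction and radius adjustment can be made concrete with a small sketch. It uses the standard relations between a z-plane pole and its formant frequency and bandwidth, and a simple bisection on one pole radius as a stand-in for the paper's own equations, which are not reproduced in this report. The function names, the sampling rate and the pole values are all assumptions chosen for illustration.

```python
import cmath
import math


def pole_to_formant(pole, fs):
    """Convert a z-plane pole to (frequency in Hz, 3 dB bandwidth in Hz)."""
    freq = abs(cmath.phase(pole)) * fs / (2 * math.pi)
    bandwidth = -math.log(abs(pole)) * fs / math.pi
    return freq, bandwidth


def power_spectrum(poles, freq_hz, fs):
    """Power |1 / A(e^{jw})|^2 of an all-pole model at one frequency.

    `poles` holds the upper-half-plane poles; each is paired with its conjugate.
    """
    z = cmath.exp(1j * 2 * math.pi * freq_hz / fs)
    a = 1.0 + 0j
    for p in poles:
        a *= (1 - p / z) * (1 - p.conjugate() / z)
    return 1.0 / abs(a) ** 2


def match_power(poles, index, target, fs, lo=0.5, hi=0.999, iters=60):
    """Bisect the radius of poles[index] until the model power at that pole's
    own frequency equals `target`, holding the other poles fixed.

    Assumes the power grows monotonically with the radius on [lo, hi].
    """
    theta = cmath.phase(poles[index])
    freq = theta * fs / (2 * math.pi)
    trial = list(poles)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        trial[index] = cmath.rect(mid, theta)
        if power_spectrum(trial, freq, fs) < target:
            lo = mid  # peak too weak: move the pole toward the unit circle
        else:
            hi = mid
    trial[index] = cmath.rect(0.5 * (lo + hi), theta)
    return trial


# Demo with two assumed formant poles at 500 Hz and 1500 Hz.
fs = 8000.0
p1 = cmath.rect(0.97, 2 * math.pi * 500.0 / fs)
p2 = cmath.rect(0.95, 2 * math.pi * 1500.0 / fs)
alone = power_spectrum([p1], 500.0, fs)
both = power_spectrum([p1, p2], 500.0, fs)
print(alone, both)  # the second pole changes the power at the first formant
adjusted = match_power([p1, p2], 0, alone, fs)
print(pole_to_formant(adjusted[0], fs))
```

The two printed powers differ, which is the interaction effect in miniature: the level at 500 Hz depends on the pole at 1500 Hz as well. Adjusting only the radius keeps the formant frequency fixed while changing its bandwidth, in line with the statement that bandwidth matters less perceptually than formant energy.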

2.3 Parameter-Free Non-Linear Predictor


I. Bazzi, A. Acero, L. Deng, An Expectation Maximization Approach for Formant Tracking Using a Parameter-Free Non-Linear Predictor, 2003

This approach to formant tracking uses a parameter-free non-linear predictor that maps formant frequencies and bandwidths into the acoustic feature space. It uses a model that decomposes the signal into two components. The first captures the mapping from the formant space into the acoustic measurement space under the assumption of an all-pole model. The second component captures the residual in the speech signal.

The mapping is built by quantizing the formant frequency and bandwidth space and creating a predictor codebook. Let x denote the vocal tract resonance frequencies and their bandwidths. By quantizing x over some range of frequencies and bandwidths and computing the corresponding Mel Frequency Cepstral Coefficients (MFCCs), the predictor F(x) can be constructed. The resulting mapping has two important properties: it is analytical in nature and it is independent of any speech data. The codebook can, however, be large.

Formant tracking is then achieved by searching the codebook for the most suitable set of formant values. There are two methods of formant tracking:

1. Frame-by-frame formant tracking: The formants are estimated for each frame independently. The MAP estimate reduces to the ML estimate.
2. Formant tracking with continuity constraints: Continuity constraints are added in the form of formant transition probabilities. Tracking is performed using a Viterbi search.

This method has two advantages over other approaches. First, the relationship between formant values and their contribution to the acoustic measurement is explicitly represented through the predictor codebook. Second, it explores the complete formant space, thus avoiding errors due to premature elimination of formant candidates during the analysis step.
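The codebook construction and the frame-by-frame search can be sketched as follows. This is a toy version under loud assumptions: the predictor maps a pair (F1, F2) with a fixed bandwidth to the log power of an all-pole model at a few probe frequencies, as a stand-in for the MFCC features used in the paper, and the residual is assumed Gaussian with identity covariance so that the ML estimate is the codebook entry with the smallest squared error. The grid ranges, the names `predictor`, `build_codebook` and `estimate_frame`, and the constants are all hypothetical.

```python
import cmath
import math

FS = 8000.0
PROBE_HZ = [250.0 * k for k in range(1, 16)]  # 250 Hz .. 3750 Hz


def predictor(formants, bandwidth=80.0):
    """Toy stand-in for F(x): log power of an all-pole model at PROBE_HZ."""
    radius = math.exp(-math.pi * bandwidth / FS)
    poles = [cmath.rect(radius, 2 * math.pi * f / FS) for f in formants]
    feats = []
    for f_probe in PROBE_HZ:
        z = cmath.exp(1j * 2 * math.pi * f_probe / FS)
        a = 1.0 + 0j
        for p in poles:
            a *= (1 - p / z) * (1 - p.conjugate() / z)
        feats.append(-2.0 * math.log(abs(a)))
    return feats


def build_codebook():
    """Quantize the (F1, F2) space and store the predicted features."""
    book = []
    for f1 in range(200, 901, 50):
        for f2 in range(800, 2501, 100):
            if f2 - f1 >= 300:  # keep F2 clearly above F1
                book.append(((f1, f2), predictor((f1, f2))))
    return book


def estimate_frame(observed, codebook):
    """Return the formant pair whose predicted features are closest
    (least squares) to the observed feature vector."""
    def sq_error(entry):
        return sum((o - f) ** 2 for o, f in zip(observed, entry[1]))
    return min(codebook, key=sq_error)[0]


# Demo: an observation generated from the toy model itself.
codebook = build_codebook()
observed = predictor((500, 1500))
print(estimate_frame(observed, codebook))
```

Every codebook entry is scored for every frame, so no formant candidate is eliminated before the search. The cost is that the codebook grows quickly once more formants and bandwidths are quantized.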

2.4 Context-Dependent Phonemic Information


M. Lee, J. v. Santen, B. Mobius, J. Olive, Formant Tracking Using Context-Dependent Phonemic Information, 2004

The formant tracking algorithm in this paper uses phoneme information. Given the phoneme identity, the algorithm has a better clue of where to look for formants. Using this method, the error rate can be significantly reduced. The algorithm consists of three phases:

1. Analysis phase: Linear predictive coding (LPC) analysis and root-solving. Formant candidates are obtained by solving the LP polynomials from LPC analysis on pre-emphasized speech.
2. Segmentation phase: Segmentation and alignment using a Hidden Markov Model (HMM)-based forced-alignment algorithm. Input text is converted into a sequence of phoneme signals. The phoneme signals are then time-aligned with the acoustic speech utterance.
3. Formant-tracking phase: Formant tracking by the Viterbi search algorithm. The best set of formant frequencies is selected from the candidates based on a minimum-cost criterion. For each analysis frame, the set of formant candidates closest to the nominal tracks is chosen, while satisfying the continuity constraints.
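The formant-tracking phase can be sketched as a small dynamic-programming search. The version below is a simplification under stated assumptions: it tracks a single formant rather than a set, uses the absolute distance to the nominal frequency as the local cost and the absolute frame-to-frame jump as the continuity cost, and takes the candidates and nominal values as given. The cost functions, the weight `w_cont` and the name `viterbi_track` are hypothetical and are not taken from the paper.

```python
def viterbi_track(candidates, nominal, w_cont=1.0):
    """Pick one candidate per frame with minimum total cost.

    candidates[t]: candidate frequencies (Hz) for frame t, e.g. from LPC roots.
    nominal[t]:    nominal frequency (Hz) for the phoneme aligned to frame t.
    """
    # cost[i]: best cumulative cost of any path ending at candidate i
    cost = [abs(c - nominal[0]) for c in candidates[0]]
    back = []  # back[t - 1][i]: best predecessor of candidate i in frame t
    for t in range(1, len(candidates)):
        prev = candidates[t - 1]
        new_cost, pointers = [], []
        for c in candidates[t]:
            # Cheapest way to reach candidate c from the previous frame.
            steps = [cost[j] + w_cont * abs(c - prev[j]) for j in range(len(prev))]
            best_j = min(range(len(prev)), key=steps.__getitem__)
            new_cost.append(steps[best_j] + abs(c - nominal[t]))
            pointers.append(best_j)
        cost = new_cost
        back.append(pointers)
    # Trace the cheapest path backwards from the last frame.
    i = min(range(len(cost)), key=cost.__getitem__)
    path = [i]
    for pointers in reversed(back):
        i = pointers[i]
        path.append(i)
    path.reverse()
    return [candidates[t][i] for t, i in enumerate(path)]


# Demo: first-formant candidates for four frames, nominal F1 of 500 Hz.
candidates = [[480, 1500], [510, 1450, 2400], [90, 530, 1480], [520, 1500]]
nominal = [500, 500, 500, 500]
print(viterbi_track(candidates, nominal))
```

In this sketch the phoneme information enters only through `nominal`, which would change from frame to frame according to the forced alignment. Spurious candidates far from the nominal track, such as 90 Hz in the third frame of the demo, are penalised by both cost terms.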

3. Brief Conclusion
The majority of these papers are motivated by the error rate or other disadvantages of using LP polynomials to perform formant tracking. I do not yet have a thorough understanding of these methods, and I have not yet researched the LP methods themselves. This is what I will be working on over the next few weeks.
