Outline
Abstract
Chapter 1 Introduction
Chapter 2 Short Time Fourier Transform
Chapter 3 Gabor Wavelet Transform
Chapter 4 Gabor Feature in Speech Recognition
Chapter 5 Gabor Feature for Face Recognition
Chapter 6 Conclusion
References
Abstract
The Fourier transform is a common method for signal analysis. However, the Fourier transform describes a signal only in the frequency domain and cannot show how its frequency content varies with time. The short time Fourier transform (STFT) was therefore proposed to analyze frequency as a function of time. The best-known short time Fourier transform is the Gabor transform, which uses a Gaussian function as its window function and is commonly used in signal analysis today. The Gabor transform has been applied to speech recognition, acoustics, signal sampling, image recognition, and other areas. In this paper we introduce the Gabor transform and explain how it differs from the traditional Fourier transform. We also cover how Gabor filters can be used to extract Gabor features for speech recognition and face recognition.
Chapter 1
Introduction
Although the Fourier transform of an entire time series does contain information about the spectral components of the series, it cannot show how the different frequencies are distributed in time, so the Fourier transform is unsuitable for a large class of practical applications. Time-frequency analysis was proposed for these situations, and the STFT is the most often used method. This paper is organized as follows. Chapter 2 gives a brief introduction to the short time Fourier transform and its advantages in signal analysis. Chapter 3 introduces the Gabor transform and its applications. Chapter 4 gives an overview of how Gabor filters can be used in speech recognition, which is the key point of this paper. Chapter 5 introduces the use of Gabor features in image recognition. The last part gives the conclusion and some future work for Gabor features.
Instead of representing a signal as a function of either time or frequency separately, Dennis Gabor proposed the Gabor expansion in 1946: any signal can be expressed as a summation of time-shifted and frequency-shifted Gaussian functions. The Gabor expansion is one kind of sampled short time Fourier transform (STFT). The STFT, the earliest time-frequency representation (TFR) [3-5], uses a sliding window to select a local portion of the signal and then transforms the masked signal into the frequency domain. The best-known TFRs are the STFT, the Wigner transform, the wavelet transform, Cohen's class, and the S-transform. The STFT is the simplest of these: a two-dimensional representation created by computing the Fourier transform within a sliding temporal window. By using the STFT we can observe how the frequency of the signal changes with time.
(2.2)
where the frequency $f$ ranges from $-\infty$ to $\infty$. The conventional Fourier transform does a good job when analyzing a stationary or periodic signal. Because a continuous signal cannot be analyzed directly, we have to sample the signal with sampling frequency $f_s = 1/T$ and implement the Fourier transform in discrete form. The relation and formulas are
$$x[n] = x(nT) \tag{2.3}$$
(2.4)
To avoid aliasing, the sampling frequency should be more than twice the signal's bandwidth. If the signal is time-limited, we can use the FFT to implement (2.4) and reduce the computing time. For non-stationary signals, such as a chirp signal, the conventional Fourier transform is less informative. Consider a non-stationary signal $x_1(t)$ that has two frequency components, 0.5 Hz and 1.0 Hz, which occur at different times, as shown in Figure 2.1(a):
(2.5)
The power spectrum shown in Figure 2.1(b) has two peaks at 0.5 Hz and 1.0 Hz.
Figure 2.1: (a) The non-stationary signal $x_1(t)$ in the time domain. (b) The power spectrum of $x_1(t)$.
Several problems of the conventional Fourier transform should be noted. The first is that we cannot obtain any information about where in time these frequency components occur. The second is that there is some leakage in the frequency domain: the spectrum shows frequency components that do not appear in (2.5). For stationary or periodic signals the conventional Fourier transform does a good job, but the above example shows that it is not a suitable tool for analyzing non-stationary and non-periodic signals. Since almost all real-world signals are non-stationary and non-periodic, the conventional Fourier transform is of limited use in real-world signal analysis. Fortunately, we have another tool for non-stationary signals, the short time Fourier transform, which is introduced in the next section.
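The first limitation can be illustrated with a small numerical sketch. The signal below is an assumption modeled on the example in the text (a 0.5 Hz burst during 10 s to 20 s and a 1.0 Hz burst during 30 s to 40 s); the sampling rate, the cosine form, and the helper name `two_strongest_peaks` are illustrative choices, not taken from the original. The power spectrum reports both frequencies, yet it reports the same two peaks when the bursts are swapped in time.

```python
import numpy as np

fs = 20.0                     # assumed sampling frequency in Hz
t = np.arange(1000) / fs      # 50 s of samples
first = (t >= 10) & (t < 20)  # first burst interval
second = (t >= 30) & (t < 40) # second burst interval


def burst_signal(f_first, f_second):
    """Signal that is zero except for one cosine burst in each interval."""
    x = np.zeros_like(t)
    x[first] = np.cos(2 * np.pi * f_first * t[first])
    x[second] = np.cos(2 * np.pi * f_second * t[second])
    return x


def two_strongest_peaks(x):
    """Frequencies (Hz) of the two strongest local maxima of the power spectrum."""
    power = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1 / fs)
    is_peak = (power[1:-1] > power[:-2]) & (power[1:-1] > power[2:])
    peak_bins = np.where(is_peak)[0] + 1
    strongest = peak_bins[np.argsort(power[peak_bins])[-2:]]
    return np.sort(freqs[strongest])


x1 = burst_signal(0.5, 1.0)          # 0.5 Hz first, then 1.0 Hz
x1_swapped = burst_signal(1.0, 0.5)  # same bursts in the opposite order

peaks = two_strongest_peaks(x1)
peaks_swapped = two_strongest_peaks(x1_swapped)
```

Both `peaks` and `peaks_swapped` contain the same two frequencies, so the spectrum alone cannot tell which burst came first.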
The STFT uses a fixed sliding window to mask the signal and then transforms each masked portion to the frequency domain. The STFT of $x_1(t)$ is shown in Figure 2.2, where we can observe the two frequency components: 0.5 Hz appears around 10 s to 20 s, and 1.0 Hz appears around 30 s to 40 s. The time-frequency location is almost the same as defined in (2.5). With the help of the STFT, we can observe how the frequency of the signal changes with time. Therefore, TFRs are more suitable than the conventional Fourier transform for analyzing real-world signals.
Figure 2.2: The STFT amplitude of $x_1(t)$ as defined in (2.5) and shown in Figure 2.1(a).
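The sliding-window idea can be sketched in a few lines. This is a minimal illustration, not the implementation behind Figure 2.2: the Hamming window, the 4 s window length, the 1 s hop, the sampling rate, and the cosine test signal are all assumptions, and the names `stft` and `dominant_at` are hypothetical helpers.

```python
import numpy as np


def stft(x, fs, win_len, hop):
    """Windowed FFT of successive frames; returns (times, freqs, S) with S[k, m]."""
    window = np.hamming(win_len)
    starts = np.arange(0, len(x) - win_len + 1, hop)
    frames = np.stack([x[s:s + win_len] * window for s in starts])
    spectra = np.fft.rfft(frames, axis=1).T        # rows: frequency, columns: frame
    times = (starts + win_len / 2) / fs            # frame centers in seconds
    freqs = np.fft.rfftfreq(win_len, d=1 / fs)
    return times, freqs, spectra


fs = 20.0
t = np.arange(1000) / fs
# Assumed test signal: 0.5 Hz during 10-20 s, 1.0 Hz during 30-40 s, zero elsewhere.
x1 = np.zeros_like(t)
x1[(t >= 10) & (t < 20)] = np.cos(2 * np.pi * 0.5 * t[(t >= 10) & (t < 20)])
x1[(t >= 30) & (t < 40)] = np.cos(2 * np.pi * 1.0 * t[(t >= 30) & (t < 40)])

times, freqs, S = stft(x1, fs, win_len=80, hop=20)
magnitude = np.abs(S)
dominant = freqs[np.argmax(magnitude, axis=0)]     # strongest frequency per frame


def dominant_at(seconds):
    """Dominant frequency of the frame whose center is closest to `seconds`."""
    return dominant[np.argmin(np.abs(times - seconds))]
```

Here `dominant_at(15.0)` and `dominant_at(35.0)` pick out the frequency of each burst, which is exactly the time localization the plain Fourier transform cannot provide.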
Figure 2.3: Block diagram illustrates how STFT works. 1 , 2 ..... n are the discrete
Unlike the conventional Fourier transform, which maps a one-dimensional signal to another one-dimensional signal and has a unique inverse formula, the STFT maps a one-dimensional signal to a two-dimensional one and is highly redundant. The inverse of the STFT is therefore not unique, which matters especially when we want to use the STFT for time-frequency filtering. In general form, we can express the inverse of the STFT as (2.7)
where $w_i(t)$ is the inverse weighting function, which has to fulfill the following relation
(2.8)
Having shown the inverse formulas, one important thing remains: the choice of the sliding window $w(t)$. To give the transform a local property, the window has to fulfill the following restriction
(2.9)
The earliest and most common window is the rectangular window, which has value 1 inside some range and 0 outside it. Figure 2.4 shows the rectangular window and its amplitude response.
Figure 2.4: (a) The rectangular analysis window function $w_{rect}(t)$. (b) The amplitude response of $w_{rect}(t)$.
From (2.6), we can regard the upper equation as the Fourier transform of the masked signal, that is, the signal $x(t)$ multiplied by the shifted window. If the window narrows, the time resolution gets better. On the other hand, we can regard the lower equation of (2.6) as the inverse Fourier transform of the product of the shifted signal spectrum $X$ and the window spectrum $W$; therefore, the frequency resolution gets worse as the window becomes narrower. A narrow window gives better time resolution and worse frequency resolution. Hence the width of the window function balances the time resolution against the frequency resolution, a tradeoff governed by the uncertainty principle, which states that
$$\sigma_t \sigma_f \ge \frac{1}{4\pi} \tag{2.10}$$
where $\sigma_t$ denotes the standard deviation of the signal in the time domain (uncertainty in time), and $\sigma_f$ denotes the standard deviation of the signal in the frequency domain (uncertainty in frequency). It is impossible to have an STFT that provides arbitrarily fine resolution in both time and frequency. Figure 2.5 and Figure 2.6 demonstrate the tradeoff between time resolution and frequency resolution with different window sizes for the signal in (2.1).
Figure 2.5: Demonstration of the tradeoff between time resolution and frequency resolution. (a) The time function of the wide Hamming window. (b) The Fourier transform of the wide Hamming window. (c) The amplitude of the STFT.
In Figure 2.5 we can observe poor time resolution but good frequency resolution, while Figure 2.6 shows good time resolution but poor frequency resolution. This result agrees with the discussion above. Besides the rectangular window, two other common windows are the Gaussian window and the Hamming window. Their continuous formulas are
(2.11)
(2.12)
If the Gaussian window is used in the STFT, the transform is also called the Gabor transform, which is widely used because it has less leakage in the time-frequency domain. In speech analysis, people usually prefer the Hamming window, which will be introduced in the next chapter.
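The tradeoff between time and frequency resolution can be checked numerically for Gaussian windows. The sketch below assumes a window of the form $\exp(-t^2 / (2 s^2))$ with width parameter $s$, and measures the standard deviations of the energy densities $|w(t)|^2$ and $|W(f)|^2$; the function name `spreads`, the grid sizes, and the two widths are illustrative assumptions. Under the usual form of the uncertainty bound, $\sigma_t \sigma_f \ge 1/(4\pi)$, a Gaussian window should sit on the bound for every width.

```python
import numpy as np


def spreads(width, fs=100.0, duration=40.0):
    """Return (sigma_t, sigma_f) of a Gaussian window exp(-t^2 / (2 width^2))."""
    n = int(fs * duration)
    t = (np.arange(n) - n // 2) / fs                       # time axis centered on 0
    w = np.exp(-t ** 2 / (2 * width ** 2))
    W = np.fft.fftshift(np.fft.fft(w))                     # spectrum, centered on 0 Hz
    f = np.fft.fftshift(np.fft.fftfreq(n, d=1 / fs))

    def std(axis, density):
        # Standard deviation of `axis` under the normalized energy density.
        p = density / density.sum()
        mean = (axis * p).sum()
        return np.sqrt((((axis - mean) ** 2) * p).sum())

    return std(t, np.abs(w) ** 2), std(f, np.abs(W) ** 2)


sigma_t_narrow, sigma_f_narrow = spreads(0.5)   # narrow window
sigma_t_wide, sigma_f_wide = spreads(2.0)       # wide window
bound = 1 / (4 * np.pi)
```

The narrow window has the smaller `sigma_t` and the larger `sigma_f`, the wide window the opposite, and both products stay at `bound`.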
1. Signal analysis: By using a TFR we can learn about the signal's time-frequency components and obtain information that cannot be observed directly in the time or frequency domain alone. Examples include the analysis of medical, geologic, power quality, optical, speech, and image signals.
2. Time-frequency filtering: If two chirp signals that occupy different locations in the time-frequency domain are mixed, we can use a time-frequency mask to filter the mixture and recover the desired signal. Time-frequency filtering is now widely used, especially with the wavelet transform and the STFT.
3. Pattern recognition: Many signals have their own time-frequency pattern. For example, each musical instrument usually has a time-frequency pattern unlike the others, and different words and different speakers have their own voice patterns. We can recognize these time-frequency patterns to decide which entry in our database a signal belongs to.
In this chapter, we have introduced the conventional Fourier transform and its limitations when dealing with non-stationary signals, for which the STFT is more suitable. We have also introduced the framework of the STFT, which will be extended to the Gabor transform in the next chapter.
(3.1)
The Gabor transform is a special case of the short time Fourier transform. Its kernel is the Fourier transform kernel multiplied by a Gaussian function, so many similar transforms can be constructed by choosing other window functions. Since the Gaussian function is more concentrated than the rectangular function in the frequency domain, the frequency resolution of the Gabor transform is much better than that of the short time Fourier transform with a rectangular window (Fig. 3-1).
[Figure: time-frequency plot with axes Time (Sec) and Frequency (Hz)]
(3.2)
$$x(t) = \begin{cases} \cos(2\pi t), & t < 10 \\ \cos(6\pi t), & 10 \le t < 20 \\ \cos(4\pi t), & t \ge 20 \end{cases} \tag{3.3}$$
Figure 3-2: (a) The short time Fourier transform of (3.2). (b) The short time Fourier transform of (3.3).
From the above figure, we can easily see that the frequency resolution of the Gabor transform is much better than that of the short time Fourier transform.
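A direct numerical sketch of a Gaussian-windowed transform may help here. It assumes the window $e^{-\pi (t - \tau)^2}$ and a three-segment cosine test signal with frequencies 1 Hz, 3 Hz, and 2 Hz (one reading of the piecewise example above); the sampling rate, the evaluation grid, and the function name `gabor_transform` are illustrative assumptions rather than the settings used for the figures.

```python
import numpy as np


def gabor_transform(x, t, taus, freqs):
    """Approximate G(tau, f) = integral of exp(-pi (t - tau)^2) exp(-j 2 pi f t) x(t) dt."""
    dt = t[1] - t[0]
    # window[i, n]: Gaussian centered on taus[i], evaluated at t[n]
    window = np.exp(-np.pi * (t[None, :] - taus[:, None]) ** 2)
    # kernel[k, n]: complex exponential at freqs[k], evaluated at t[n]
    kernel = np.exp(-2j * np.pi * freqs[:, None] * t[None, :])
    return (window * x[None, :]) @ kernel.T * dt   # shape (len(taus), len(freqs))


fs = 50.0
t = np.arange(1500) / fs                           # 30 s of samples
x = np.where(t < 10, np.cos(2 * np.pi * t),
             np.where(t < 20, np.cos(6 * np.pi * t), np.cos(4 * np.pi * t)))

taus = np.array([5.0, 15.0, 25.0])                 # one instant inside each segment
freqs = np.linspace(0.0, 5.0, 51)                  # 0.1 Hz grid
G = gabor_transform(x, t, taus, freqs)
dominant = freqs[np.argmax(np.abs(G), axis=1)]     # strongest frequency at each instant
```

Each entry of `dominant` is the frequency of the segment that contains the corresponding instant in `taus`.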
concentrated in time. Sinusoids are useful in analyzing periodic and time-invariant phenomena, while wavelets are well suited to the analysis of transient, time-varying signals. Most standard wavelets are based on one wavelet function, $\psi(x)$, which has some special properties. The wavelet function has an oscillation property, expressed mathematically by an integral equal to zero:
$$\int_{-\infty}^{\infty} \psi(x)\, dx = 0 \tag{3.4}$$
A wavelet basis is a two-parameter family of functions derived from $\psi(x)$. It is defined by the set $\{\psi_{j,k}(x)\}$ of wavelets given by
$$\psi_{j,k}(x) = 2^{j/2}\, \psi(2^j x - k) \tag{3.5}$$
The variables $j$ and $k$ are integers that scale and displace the function $\psi(x)$ to generate a succession of wavelets. The scale index $j$ indicates the wavelet's width, and the location index $k$ gives its position. Notice that the function $\psi(x)$ is rescaled, or dilated, by powers of two and translated by integers. Once we know the function $\psi(x)$, we know everything about the basis. The wavelet functions $\psi_{j,k}(x)$ for all $k \in Z$ span a subspace called $W_j$. That is,
$$W_j = \overline{\operatorname{span}}_k \{\psi_{j,k}(x)\} \tag{3.6}$$
and if $f(x) \in W_j$, it can be expanded as a weighted sum of the $\psi_{j,k}(x)$
(3.7)
In real applications the wavelet coefficients are computed by the discrete wavelet transform (DWT), an implementation of the wavelet transform using a filter bank. The discrete sequences produced by the DWT constitute a multi-resolution representation. As shown in Fig. 3-3, $W_\psi(j,k)$ and $W_\varphi(j,k)$ are the detail and approximation coefficients at scale $j$, and $W_\varphi(j+1,k)$ are the approximation coefficients at scale $j+1$. $h_\varphi(-n)$ and $h_\psi(-n)$ are the time-reversed low-pass and high-pass filters associated with $\varphi$ and $\psi$, respectively.
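One analysis step of such a filter bank can be sketched with the Haar wavelet. The Haar filters are an assumption chosen because they are the shortest possible low-pass and high-pass pair; the text does not say which wavelet is used, and the function names are hypothetical. The step splits the approximation at one scale into a coarser approximation and a detail sequence, each half as long, and the inverse step recovers the input exactly.

```python
import numpy as np


def haar_dwt_step(finer_approx):
    """Split an approximation sequence into coarser approximation and detail."""
    a = np.asarray(finer_approx, dtype=float)
    if len(a) % 2:
        raise ValueError("input length must be even")
    even, odd = a[0::2], a[1::2]
    approx = (even + odd) / np.sqrt(2)   # low-pass filter, then keep every 2nd sample
    detail = (even - odd) / np.sqrt(2)   # high-pass filter, then keep every 2nd sample
    return approx, detail


def haar_idwt_step(approx, detail):
    """Invert haar_dwt_step."""
    out = np.empty(2 * len(approx))
    out[0::2] = (approx + detail) / np.sqrt(2)
    out[1::2] = (approx - detail) / np.sqrt(2)
    return out


signal = np.array([4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0])
approx, detail = haar_dwt_step(signal)
restored = haar_idwt_step(approx, detail)
```

Applying `haar_dwt_step` again to `approx` gives the next coarser scale, which is how the multi-resolution representation is built up.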
We can easily extend the one-dimensional transform to the two-dimensional case. In two dimensions, an image is filtered and decomposed into an approximation image and detail images by applying a separable filter bank. The original image is split into the approximation $W_\varphi(j, m, n)$ and the details $W_\psi^H(j, m, n)$, $W_\psi^V(j, m, n)$, and $W_\psi^D(j, m, n)$ at level $j$, in the horizontal, vertical, and diagonal directions. Like the one-dimensional discrete wavelet transform, the two-dimensional wavelet transform can be implemented using digital filters and down-sampling, as shown in Fig. 3-4. A resulting decomposition is shown in Fig. 3-5, in which the wavelet sub-images consist of three sub-bands in the vertical, horizontal, and diagonal directions, respectively.
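The separable two-dimensional decomposition can be sketched by filtering along one image axis and then the other. As before, the Haar filters are an assumption, and the names `haar_pairs` and `haar_dwt2` are hypothetical. The sub-bands are labeled here by the edge direction they respond to; other texts may attach the horizontal and vertical labels the other way round.

```python
import numpy as np


def haar_pairs(a, axis):
    """Haar low-pass and high-pass outputs along one axis, down-sampled by two."""
    even = np.take(a, np.arange(0, a.shape[axis], 2), axis=axis)
    odd = np.take(a, np.arange(1, a.shape[axis], 2), axis=axis)
    return (even + odd) / np.sqrt(2), (even - odd) / np.sqrt(2)


def haar_dwt2(image):
    """One level of a separable 2-D Haar decomposition."""
    img = np.asarray(image, dtype=float)
    if img.shape[0] % 2 or img.shape[1] % 2:
        raise ValueError("image height and width must be even")
    low, high = haar_pairs(img, axis=1)        # filter each row (across columns)
    approx, horiz = haar_pairs(low, axis=0)    # horiz: changes between rows (horizontal edges)
    vert, diag = haar_pairs(high, axis=0)      # vert: changes between columns (vertical edges)
    return approx, horiz, vert, diag


# Test image with a single horizontal edge between row 0 and row 1.
image = np.array([[0.0, 0.0, 0.0, 0.0],
                  [1.0, 1.0, 1.0, 1.0],
                  [1.0, 1.0, 1.0, 1.0],
                  [1.0, 1.0, 1.0, 1.0]])
approx, horiz, vert, diag = haar_dwt2(image)
```

Only `horiz` is non-zero among the detail images for this input, because the image contains no vertical or diagonal structure.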
Fig. 3-11 Gabor function: phase part. (The phase is in fact a plane; here it is mapped to $(-\pi, \pi]$.)
The term $\exp(-\sigma^2/2)$ in the kernel is the DC component; subtracting it makes the filter free of DC. $k_{\mu,\nu}$ is the wave vector of the filter corresponding to orientation $\mu$ and scale $\nu$. By choosing a series of $k_{\mu,\nu}$, a set of Gabor filters is obtained. $\sigma$ is a constant that, together with $k_{\mu,\nu}$, relates the width of the Gaussian window to the wavelength. Here we choose $\sigma = 2\pi$. $k_{\mu,\nu}$ can be further written as
$$k_{\mu,\nu} = k_\nu e^{i \varphi_\mu} \tag{3.9}$$
where $k_\nu = k_{max} / f^\nu$ and $\varphi_\mu = \pi\mu/8$. $k_{max}$ is the maximum frequency, and $f$ is the spacing factor between kernels in the frequency domain. Different values of $\nu$ describe different wavelengths of the Gaussian window and thus control the scale of sampling; in other words, $\nu$ controls frequency. Different values of $\mu$ describe oscillation functions with different directions and thus control the direction of sampling. In this thesis, we use Gabor wavelets at five different scales, $\nu \in \{0, \ldots, 4\}$, and eight orientations, $\mu \in \{0, \ldots, 7\}$. The morphology of the 40 Gabor filters is shown in Fig. 3-12.
Fig. 3-12 Morphology of 40 Gabor filters.
Five different scales and eight orientations generate 40 filters. Fig. 3-13 shows the real part of the 40 Gabor kernels at scales $\nu = 0, \ldots, 4$ and orientations $\mu = 0, \ldots, 7$ with the following parameters: $\sigma = 2\pi$, $k_{max} = \pi/2$, and $f = 2^{1/2}$. The filters demonstrate the desirable properties of spatial locality and orientation selectivity.
Fig. 3-13 Gabor wavelets. (a) The real part of the Gabor kernels at five scales and eight orientations for $\sigma = 2\pi$, $k_{max} = \pi/2$, and $f = 2^{1/2}$. (b) Magnitudes of the Gabor kernels at five different scales.
Other uses of Gabor features are explained in Chapter 5.
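A bank of 40 kernels with these parameters can be generated as follows. The kernel formula used in the sketch is an assumption: it is the commonly used form $\frac{\|k\|^2}{\sigma^2} \exp\left(-\frac{\|k\|^2 \|z\|^2}{2\sigma^2}\right) \left[\exp(i\, k \cdot z) - \exp(-\sigma^2/2)\right]$ with wave vector magnitude $k_{max}/f^\nu$ and angle $\pi\mu/8$. The 33 by 33 pixel support and the function name `gabor_kernel` are likewise illustrative choices.

```python
import numpy as np


def gabor_kernel(mu, nu, size=33, sigma=2 * np.pi, k_max=np.pi / 2, f=np.sqrt(2)):
    """Complex Gabor kernel for orientation index mu and scale index nu."""
    k = k_max / f ** nu                      # wave vector magnitude at scale nu
    phi = np.pi * mu / 8                     # wave vector angle at orientation mu
    kx, ky = k * np.cos(phi), k * np.sin(phi)
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    envelope = (k ** 2 / sigma ** 2) * np.exp(-k ** 2 * (x ** 2 + y ** 2) / (2 * sigma ** 2))
    # Oscillating carrier minus the DC term, so the ideal kernel has zero mean.
    carrier = np.exp(1j * (kx * x + ky * y)) - np.exp(-sigma ** 2 / 2)
    return envelope * carrier


# bank[nu, mu] is the kernel at scale nu and orientation mu: 5 x 8 = 40 kernels.
bank = np.array([[gabor_kernel(mu, nu) for mu in range(8)] for nu in range(5)])
```

Convolving an image with every kernel in `bank` and taking the magnitude gives 40 response maps of the kind shown in the figures; larger scale indices need a larger `size` to hold the whole Gaussian envelope.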
Thirdly, each frame signal is multiplied by a window function and then transformed into the spectral domain with the fast Fourier transform (FFT). The resulting magnitude spectrum (the phase part is discarded) is further processed with a mel-frequency filter bank, where each filter output is the weighted sum of the magnitude values within its pass-band. Finally, all the filter outputs are processed by the logarithm operation and the discrete cosine transform (DCT). The resulting parameters are the mel-frequency cepstral coefficients (MFCCs) for that frame. In addition to the MFCCs for a single frame, we often combine the MFCCs of several adjacent frames to obtain the delta and delta-delta MFCCs, which are used together with the original MFCCs as the final feature vector for that frame.
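The per-frame steps described above (window, FFT magnitude, mel filter bank, logarithm, DCT) can be sketched as follows. Several details are assumptions not stated in the text: the Hamming window, the mel mapping $2595 \log_{10}(1 + f/700)$, 23 triangular filters, 13 coefficients, an unnormalized DCT-II, and all function names.

```python
import numpy as np


def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_fft, fs):
    """Triangular filters with centers equally spaced on the mel scale."""
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2), n_filters + 2))
    fft_freqs = np.fft.rfftfreq(n_fft, d=1 / fs)
    bank = np.zeros((n_filters, len(fft_freqs)))
    for i in range(n_filters):
        lo, mid, hi = edges[i:i + 3]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank


def dct2(v, n_out):
    """First n_out coefficients of the unnormalized DCT-II of v."""
    n = len(v)
    k = np.arange(n_out)[:, None]
    m = np.arange(n)[None, :]
    return (np.cos(np.pi * k * (2 * m + 1) / (2 * n)) * v).sum(axis=1)


def mfcc_frame(frame, fs, n_filters=23, n_ceps=13):
    """MFCCs of one frame: window, FFT magnitude, mel filter bank, log, DCT."""
    windowed = frame * np.hamming(len(frame))
    magnitude = np.abs(np.fft.rfft(windowed))           # phase is discarded
    energies = mel_filterbank(n_filters, len(frame), fs) @ magnitude
    return dct2(np.log(energies + 1e-10), n_ceps)       # small offset avoids log(0)


fs = 16000
frame = np.sin(2 * np.pi * 1000 * np.arange(400) / fs)  # 25 ms of a 1 kHz tone
coeffs = mfcc_frame(frame, fs)
```

Delta and delta-delta features would be computed afterwards from the `coeffs` vectors of neighboring frames.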
Figure 4-1: The flowchart of MFCC feature extraction.
Although MFCC performs quite well and is thus widely used for speech recognition, we believe it can be further enhanced by taking some points into consideration. First, since the speech signal is a non-stationary random process, dividing it into frames and applying the short time Fourier transform (STFT) to obtain MFCCs may provide only an estimate of the underlying characteristics. As noted earlier, the discrete Fourier transform gives a better analysis for periodic signals than for signals containing sudden bursts. Secondly, using the overlapped "triangular-shaped" mel filters in deriving MFCCs is computationally efficient, but is not optimal in any sense.
function (with the Gabor function centered on the current frame and desired frequency channel) and a subsequent summation over frequency. This yields one output value per frame per Gabor function (we call these output values the Gabor features) and is equivalent to a 2-D correlation of the input representation with the complete filter function followed by a selection of the desired frequency channel of the output. In this study, log mel-spectrograms serve as input features for Gabor feature extraction. This representation was chosen for its widespread use in ASR and because the logarithmic compression and mel-frequency scale might be considered a very simple model of peripheral auditory processing. Any other spectro-temporal representation of speech could be used instead, and more sophisticated auditory models in particular might be a good choice for future experiments.
The two-dimensional complex Gabor function $g(t,f)$ is defined as the product of a Gaussian envelope $n(t,f)$ and the complex Euler function $e(t,f)$. The envelope width is defined by the standard deviation values $\sigma_f$ and $\sigma_t$, while the periodicity is defined by the radian frequencies $\omega_f$ and $\omega_t$, with $f$ and $t$ denoting the frequency and time axis, respectively. The two independent parameters $\omega_f$ and $\omega_t$ allow the Gabor function to be tuned to particular directions of spectro-temporal modulation, including diagonal modulations. Further parameters are the centers of mass of the envelope in time and frequency, $t_0$ and $f_0$. In this notation the Gaussian envelope $n(t,f)$ is defined as
$$n(t,f) = \frac{1}{2\pi\sigma_f\sigma_t} \exp\left[ -\frac{(f-f_0)^2}{2\sigma_f^2} - \frac{(t-t_0)^2}{2\sigma_t^2} \right] \tag{4.1}$$
and the complex Euler function $e(t,f)$ as
$$e(t,f) = \exp\left[ i\omega_f (f-f_0) + i\omega_t (t-t_0) \right] \tag{4.2}$$
It is reasonable to set the envelope width depending on the modulation frequencies $\omega_f$ and $\omega_t$ to keep the same number of periods $T$ in the filter function for all frequencies. Here, the spread of the Gaussian envelope in dimension $x$ was set to $\sigma_x = \pi / |\omega_x| = T_x / 2$. The infinite support of the Gaussian envelope is cut off at between $\sigma_x$ and $2\sigma_x$ from the center. For time-dependent features, $t_0$ is set to the current frame, leaving $f_0$, $\omega_f$, and $\omega_t$ as free parameters.
From the complex results of the filter operation, real-valued features may be obtained by using the real or imaginary part only. In this case, the overall DC bias is removed from the template. The magnitude of the complex output can also be used. Special cases are temporal filters ($\omega_f = 0$) and spectral filters ($\omega_t = 0$).
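The feature computation can be sketched as a correlation of a spectrogram patch with a complex Gabor template. The template below assumes a Gaussian envelope normalized by $2\pi\sigma_f\sigma_t$ multiplied by a complex exponential in $(f - f_0)$ and $(t - t_0)$; the envelope is not truncated and no DC removal is applied, which simplifies the description in the text. The synthetic input and the names `gabor_2d` and `gabor_feature` are illustrative assumptions.

```python
import numpy as np


def gabor_2d(n_f, n_t, f0, t0, omega_f, omega_t, sigma_f, sigma_t):
    """Complex spectro-temporal Gabor template of shape (n_f, n_t)."""
    f = np.arange(n_f)[:, None]
    t = np.arange(n_t)[None, :]
    envelope = np.exp(-(f - f0) ** 2 / (2 * sigma_f ** 2)
                      - (t - t0) ** 2 / (2 * sigma_t ** 2)) / (2 * np.pi * sigma_f * sigma_t)
    carrier = np.exp(1j * (omega_f * (f - f0) + omega_t * (t - t0)))
    return envelope * carrier


def gabor_feature(log_mel, f0, t0, omega_f, omega_t, sigma_f, sigma_t):
    """Correlate the spectrogram with the template centered on (f0, t0)."""
    n_f, n_t = log_mel.shape
    g = gabor_2d(n_f, n_t, f0, t0, omega_f, omega_t, sigma_f, sigma_t)
    return np.sum(log_mel * np.conj(g))     # one complex value per center and filter


# Synthetic "spectrogram": 23 channels, 100 frames, temporal modulation at 0.5 rad/frame.
frames = np.arange(100)
log_mel = np.tile(np.cos(0.5 * frames), (23, 1))

# Temporal filters (omega_f = 0) tuned to the modulation and away from it.
matched = gabor_feature(log_mel, f0=11, t0=50, omega_f=0.0, omega_t=0.5,
                        sigma_f=3.0, sigma_t=np.pi / 0.5)
mismatched = gabor_feature(log_mel, f0=11, t0=50, omega_f=0.0, omega_t=1.5,
                           sigma_f=3.0, sigma_t=np.pi / 0.5)
```

The magnitude of `matched` is large and that of `mismatched` is close to zero, which is the selectivity for a particular modulation frequency that the Gabor features rely on.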
An image can be represented by the Gabor wavelet transform, which allows the description of both the spatial frequency structure and the spatial relations. Convolving the image with complex Gabor filters at 5 spatial frequencies ($\nu = 0, \ldots, 4$) and 8 orientations ($\mu = 0, \ldots, 7$) captures the whole frequency spectrum, both amplitude and phase (Figure 5-1). Figure 5-2 shows an input face image and the amplitude of the Gabor filter responses.
Figure 5-2: Example of a facial image response to the above Gabor filters. (a) Original face image (from the Stirling database). (b) Filter responses.
One of the techniques used in the literature for Gabor-based face recognition codes the face using the responses on a grid representing the facial topography. Instead of using the graph nodes, high-energized points can be used in comparisons, which forms the basis of this work. This approach not only reduces computational complexity but also improves performance in the presence of occlusions.
number of facial characteristics of different faces, such as dimples, moles, etc., which are also the features that people might use for recognizing faces (Figure 5-3).
Figure 5-3: Facial feature points found as the high-energized points of Gabor wavelet responses.
From the responses of the face image to the Gabor filters, peaks are found by searching the locations in a window $W_0$ of size $W \times W$ by the following procedure. A feature point is located at $(x_0, y_0)$ if
$$R_j(x_0, y_0) = \max_{(x,y) \in W_0} R_j(x, y) \tag{5.1}$$
$$R_j(x_0, y_0) > \frac{1}{N_1 N_2} \sum_{x=1}^{N_1} \sum_{y=1}^{N_2} R_j(x, y) \tag{5.2}$$
where $R_j$ is the response of the face image to the $j$th Gabor filter, $N_1 \times N_2$ is the size of the face image, and the center of the window $W_0$ is at $(x_0, y_0)$. The window size $W$ is one of the important parameters of the proposed algorithm: it must be chosen small enough to capture the important features and large enough to avoid redundancy. Equation (5.2) is applied in order not to get stuck on a local maximum, instead of finding the peaks of the responses.
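The peak search can be sketched for a single response map. The sketch assumes the two conditions are that a point is the maximum of the response inside the $W \times W$ window centered on it and that its response exceeds the mean response over the whole image; the function name `feature_points`, the border handling, and the toy response map are illustrative assumptions.

```python
import numpy as np


def feature_points(response, window=9):
    """Return (row, col) positions that are window maxima and exceed the image mean."""
    r = np.abs(np.asarray(response)).astype(float)   # magnitude of the filter response
    half = window // 2
    n1, n2 = r.shape
    # Pad with -inf so that windows near the border only compare real pixels.
    padded = np.pad(r, half, mode="constant", constant_values=-np.inf)
    local_max = np.full(r.shape, -np.inf)
    for dy in range(window):
        for dx in range(window):
            local_max = np.maximum(local_max, padded[dy:dy + n1, dx:dx + n2])
    is_peak = (r == local_max) & (r > r.mean())
    return np.argwhere(is_peak)


# Toy response map with two isolated peaks.
response = np.zeros((20, 20))
response[5, 7] = 3.0
response[14, 12] = 2.0
points = feature_points(response, window=5)
```

In a full system the same search would be run on each of the 40 response maps, and the resulting point sets combined.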
Figure 5-4: Flowchart of the feature extraction stage of the facial images
5.3.2 Feature vector generation
Feature vectors are generated at the feature points as a composition of Gabor wavelet transform coefficients. The $k$th feature vector of the $i$th reference face is defined as
(5.3)
Since there are 40 Gabor filters, the feature vectors have 42 components. The first two components represent the location of the feature point by storing its $(x, y)$ coordinates. Since we have no other information about the locations of the feature vectors, the first two components are very important during the matching (comparison) process. The remaining 40 components are the samples of the Gabor filter responses at that point. Although one may use some edge information for feature point selection, here it is important to construct feature vectors as the coefficients of the Gabor wavelet transform. Feature vectors, as the samples of the Gabor wavelet transform at the feature points, allow representing both the spatial frequency structure and the spatial relations of the local image region around the corresponding feature point.
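Assembling the 42-component vectors can be sketched as follows. The layout (x, y, then the 40 response magnitudes in filter order) follows the description above, but the use of magnitudes, the array shapes, the random stand-in responses, and the function name `feature_vectors` are assumptions made for illustration.

```python
import numpy as np


def feature_vectors(responses, points):
    """Build one vector [x, y, |R_1|, ..., |R_n|] per feature point.

    responses: array of shape (n_filters, N1, N2) with the filter responses.
    points:    array of shape (K, 2) holding (row, col) positions.
    """
    points = np.asarray(points)
    rows, cols = points[:, 0], points[:, 1]
    samples = np.abs(responses[:, rows, cols]).T        # shape (K, n_filters)
    return np.hstack([cols[:, None], rows[:, None], samples])


# Stand-in data: random complex "responses" of 40 filters on a 16 x 16 image.
rng = np.random.default_rng(0)
responses = rng.normal(size=(40, 16, 16)) + 1j * rng.normal(size=(40, 16, 16))
points = np.array([[3, 5], [10, 2]])                    # (row, col) of two feature points
vectors = feature_vectors(responses, points)
```

Each row of `vectors` has 42 components, and the first two keep the location that the matching stage relies on.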
Chapter 6 Conclusion
This tutorial report introduces the well-known Gabor wavelet transform and its applications. The multi-resolution and multi-orientation properties of the Gabor wavelet transform make it a popular method for feature extraction despite its intrinsic non-orthogonality. Among all the works based on Gabor wavelets, face recognition and speech recognition are the most notable applications, and other research has used Gabor wavelets mainly for feature extraction. Several Matlab implementations are presented in this tutorial and show both the theoretical and application aspects of Gabor wavelets. There seems to be no further need to modify the formula of the Gabor wavelets, while feature representation and further applications remain open for future work.
References
[1] F. Smeraldi and J. Bigun, "Facial feature detection by saccadic exploration of the Gabor decomposition," Proc. Int'l Conf. Image Processing, 163-167.
[2] F. Samaria and F. Fallside, "Face identification and feature extraction using Hidden Markov Models," Image Processing: Theory and Applications, 1993.
[3] M. Kleinschmidt, "Methods for capturing spectro-temporal modulations in ASR," Acustica united with acta acustica, 2002.
[4] M. Kleinschmidt, "Spectro-temporal Gabor features as a front end for ASR," in Proc. Forum Acusticum Sevilla, 2002.
[5] T. S. Lee, "Image representation using 2D Gabor wavelets," IEEE Trans. Pattern Analysis and Machine Intelligence, 18(10), 1996.
[6] L. Shen and L. Bai, "A review of Gabor wavelets for face recognition," Patt. Anal. Appl. 9: 273-292, 2006.
[7] B. S. Manjunath, R. Chellappa, and C. von der Malsburg, "A feature based approach to face recognition," Proc. IEEE Conf. CVPR'92: 373-378, 1992.
[8] M. Lyons, S. Akamatsu, M. Kamachi, and J. Gyoba, "Coding facial expressions with Gabor wavelets," Proc. Int'l Conf. Automatic Face and Gesture Recognition, 200-205, 1998.
[9] F. Smeraldi and J. Bigun, "Facial feature detection by saccadic exploration of the Gabor decomposition," Proc. Int'l Conf. Image Processing, 163-167.