
Feature Extraction from Speech Spectrograms using Multi-Layered Network Models

Mathew J. Palakal and Michael J. Zoran
Department of Computer and Information Science
Indiana University - Purdue University at Indianapolis
1201 East 38th Street, AD 135, Indianapolis, Indiana 46205-2868

Abstract
The speech spectrogram is an invaluable tool in Automatic Speech Recognition (ASR) research, and it contains rich acoustic and phonetic knowledge about speech. Expert human spectrogram readers are able to interpret speech spectrograms by visual examination. The interpretation is usually based on the expert's linguistic knowledge and on correlating this knowledge with the characteristic patterns of speech. Machines can have a similar capability if patterns of various speech units can be collected, described, and learned statistically. Recently, researchers in the ASR area have been investigating the use of artificial neural network models for learning inter- and intra-speaker variations, as well as using them as a recognition tool. In this paper we propose a method to capture speaker-invariant features from speech spectrograms using Artificial Neural Network (ANN) models.
1. Introduction

Speaker-independent recognition of large or difficult vocabularies by computers is still an unsolved task, even if the words are pronounced in an isolated manner. Using existing knowledge about the production and perception of speech, phonemes, diphones, and syllables can be useful for conceiving prototypes of Speech Units. Speech Unit prototypes can be characterized by a redundant set of Acoustic Properties. Automatic Speech Recognition systems based on acoustic property descriptors are not very efficient if the set of properties used and the algorithms for their extraction are not well chosen and conceived.

According to early work on speech spectrograms, the underlying phonetic information can be recovered entirely by visually examining the spectrogram. An experienced spectrogram reader can correctly identify close to 90% of phonetic segments by visual examination. Using spectrograms, one can work in the visual domain rather than in the conventional audio domain. Working in the visual domain may be better because it is easier to "verbalize" the spectrogram reading process than to verbalize the hearing process. Speech spectrogram readers interpret spectrograms based on several kinds of a priori knowledge, namely, (a) by considering pictorial scene changes, (b) by relating acoustic and phonetic knowledge to these scene changes, and (c) by associating prior linguistic knowledge with such scene changes. In our approach, the 3-dimensional speech spectrograms are treated as images, and well-known image processing (pattern recognition) techniques are applied to these images. Contextual variations in speech images are somewhat similar to real-world changes in scene analysis. Therefore, by considering speech spectrograms as images and by applying image processing techniques to these patterns, one should be able to interpret and capture variations in a better way.

There are several reasons for taking such an approach. Firstly, the speech signal is highly variable. Existing ASR systems lack the power to understand unexpected variations and absent or hidden features caused by inter- and intra-speaker variations. However, contextual variations can be observed on speech spectrograms, and therefore, by treating speech spectrograms as images, these variations can be better represented and recognized. Secondly, existing ASR systems tend to look for specific features and take action based on the presence or absence of the expected features. These features could be at the acoustic level or at the phonetic level. It is very likely that a "misplaced" feature is then treated as the absence of that particular feature. In this approach, by keeping redundant information together with significant information, one does not throw away certain contextually significant features. Finally, knowledge about spectrograms is incomplete. We know that some properties that can be visually detected are relevant for perception. The same property may vary from pattern to pattern of the same word because of inter- and intra-speaker variations. It is important to characterize knowledge about such variations. This characterization has to be statistical because we do not have other types of knowledge on how basic word pattern prototypes are distorted when different speakers pronounce the same word. On the other hand, it is very important to characterize word prototypes in terms of properties that are relevant for speech production and/or perception.
Based on the above considerations, in our previous work along this line [1, 2, 3, 4], we proposed a method to derive morphological features from speech patterns. Such features were then statistically learned using Hidden Markov Models. Even though the results were encouraging, a perfect recognition rate was not achieved. In this work we propose to use an artificial neural network model, the Neocognitron model, to learn and extract features from speech spectrograms. The Neocognitron network model is a biologically based pattern recognition system capable of recognizing images that are distorted or shifted in position. It is a 3-layer network system: the initial layer of the network extracts small features, and each advancing layer looks for larger and larger features. As pattern information progresses through the network, slight distortions and shifts in position are allowed.


© 1989 IEEE

Early work on neocognitron models by Fukushima [5] shows that these models are capable of capturing features that are distorted in position and in size. Speech spectrograms contain features that are usually distorted in position due to inter- and intra-speaker variations. For example, the formant positions of vowels may change slightly from speaker to speaker or within the same speaker. Similarly, the duration of the features may vary depending on how long the sound was sustained. These variations are usually visible in speech spectrograms. We intend to use neocognitron models to capture such feature variations. Even though neocognitron models can be used to recognize patterns, our goal here is to use this model as a feature extractor and not as a classifier. In the following sections of this paper we describe how speech patterns are obtained from speech spectrograms, the development of a neocognitron model for extracting features, and some results on feature extraction using the neocognitron network model.

2. Generating Speech Patterns


The time-frequency-energy patterns of speech segments are obtained by considering the 0-4 kHz portions of spectra computed with the Fast Fourier Transform (FFT) algorithm applied to the pre-emphasized speech signal. Fig. 1 shows an example of such a pattern. In Fig. 1, frequency from 0 to 3.5 kHz is shown along the horizontal axis, and each printed line corresponds to a centisecond interval. Also in Fig. 1, letters indicate the peak relative energy. Letter "B" represents twice the energy represented by letter "A"; digit "0" represents twice the energy represented by letter "Z", and so on. Digit "9" is the strongest energy point. The speech pattern is then skeletonized and preprocessed as explained below.
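The pattern generation described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: the 8 kHz sampling rate, the pre-emphasis coefficient of 0.95, the non-overlapping 10 ms frames, the energy floor, and all function names are hypothetical. A naive DFT stands in for the FFT so that the sketch needs no external libraries.

```python
import cmath
import math
import string

# 'A'..'Z' then '0'..'9': each symbol stands for twice the energy of the one before.
SYMBOLS = string.ascii_uppercase + string.digits


def preemphasize(signal, coeff=0.95):
    """First-order pre-emphasis y[n] = x[n] - coeff * x[n-1]; coeff is an assumed value."""
    return [signal[0]] + [signal[n] - coeff * signal[n - 1] for n in range(1, len(signal))]


def energy_spectrum(frame, sample_rate, max_hz):
    """Squared DFT magnitudes for the bins from 0 Hz up to max_hz.

    A naive O(N^2) DFT keeps the sketch dependency-free; an FFT yields the same values.
    """
    n = len(frame)
    top_bin = min(int(max_hz * n / sample_rate), n // 2)
    spectrum = []
    for k in range(top_bin + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def energy_symbol(energy, floor_energy):
    """Map an energy to a symbol on a doubling scale; '.' marks energy below the floor."""
    if energy < floor_energy:
        return "."
    level = int(math.log2(energy / floor_energy))
    return SYMBOLS[min(level, len(SYMBOLS) - 1)]


def speech_pattern(signal, sample_rate=8000, frame_ms=10, max_hz=4000.0, floor_energy=1.0):
    """Return one text line per centisecond frame, with one symbol per frequency bin."""
    emphasized = preemphasize(signal)
    frame_len = sample_rate * frame_ms // 1000
    lines = []
    for start in range(0, len(emphasized) - frame_len + 1, frame_len):
        frame = emphasized[start:start + frame_len]
        spectrum = energy_spectrum(frame, sample_rate, max_hz)
        lines.append("".join(energy_symbol(e, floor_energy) for e in spectrum))
    return lines
```

Calling `speech_pattern(samples)` on a list of samples returns one string per 10 ms frame, which can be printed line by line to obtain a character display in the spirit of Fig. 1. The doubling scale is a base-2 logarithm of the energy, so each step from one symbol to the next corresponds to about 3 dB.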

It is important to notice at this point that the smallest significant elements considered in speech patterns are lines and curves, not isolated points. Lines and curves represent perceptual information, while locational information about lines and curves represents spatial correspondence. Skeletonized patterns are easier to use for spatial description and for perceptual grouping. Fig. 2 shows the skeletonized pattern of the original pattern in Fig. 1.

Preprocessing of the skeletonized pattern is carried out to discard all isolated, weak, and scattered points in the pattern. Such points do not contribute to any meaningful assumptions at either the grouping level or the hypothesis level. Preprocessing is carried out by applying an algorithm based on the strategy of [2]. The line tracing algorithm, LTA, retains properties such as collinearity, curvilinearity, and continuity that are present in the pattern. The non-accidental, or significant, lines in a speech pattern are usually surrounded by lines which are less significant. The human vision system is capable of identifying such "biological camouflage" to a great extent. Similarly, when spectrogram experts read spectrograms, their vision system is capable of discarding non-significant (accidental) lines and emphasizing significant (non-accidental) lines. Skeletonization and preprocessing are the two techniques that give the machine, to a certain extent, a capability similar to that of a human in this context. Fig. 3 shows the preprocessed pattern of Fig. 2.

Thinning and preprocessing surface all significant and non-significant lines in the pattern and discard all scattered points. From an acoustic point of view, some of these lines are the formants. Some of the lines are strong and some are weak. The strength and weakness of lines are measured in terms of energy and duration. Even though non-significant points are eliminated, non-significant lines are still present, and keeping these lines is important in order to detect certain linguistic properties of speech.

Fig. 1 Example of a speech pattern of letter "a" spoken in isolation.

Fig. 2 Skeletonized pattern of Fig. 1 for the letter "a".

Fig. 3 Preprocessed pattern of Fig. 2 for the letter "a".

3. Artificial Neural Network Models

An artificial neural network model is a highly parallel dynamical system with the topology of a directed graph that can carry out information processing by means of its state response to continuous or discrete valued inputs. Recently, artificial neural network models have begun to emerge as powerful tools for learning and recognizing patterns with great variability (similar to speech patterns). We propose to use artificial neural network models to solve mainly two problems: (a) extraction of speaker-invariant features from speech patterns, and (b) learning the features to recognize vowels and diphthongs in English sounds. To solve both problems, we propose to use layered network models similar to neocognitron and perceptron models. A brief description of a layered network model is considered now.

The neocognitron model, developed by Fukushima [5], is a pattern recognition system modeled after the visual nervous system. The network is self-organizing, with learning occurring without a teacher (unsupervised learning). The pattern recognition mechanism employed by the neocognitron is feature based: features extracted from the patterns presented during training are used to identify a test pattern. Presently the main application of the neocognitron is visual pattern recognition; however, in this work we show the use of the neocognitron model in auditory information processing, such as speech recognition. Since there is often considerable variance between spectrograms from different speakers, a recognition system must be equipped to tolerate such variance. The neocognitron recognizes patterns by looking for distinguishing features, not by performing a pixel-by-pixel comparison. As a result, the neocognitron can recognize patterns correctly even if the patterns are shifted in position or distorted in shape. Thus the powerful feature-based recognition mechanism of the neocognitron makes it an excellent candidate for a speech recognition system.

The cell is the smallest unit in the network. Cells may be either excitatory (they excite the cells that they connect to) or inhibitory. The neocognitron contains four types of cells: S-Cells, C-Cells, VS-Cells, and VC-Cells. S-Cells are primarily responsible for feature extraction and pattern recognition, while C-Cells allow the network to deal with distortion in patterns. VS-Cells and VC-Cells are inhibitory, allowing the network to respond selectively to learned patterns. The next unit in the network is the plane, a matrix of cells (usually a square matrix). A plane contains either all S-Cells (an S-Plane) or all C-Cells (a C-Plane). All cells within the same plane have identically weighted connections. As a result, each cell in the plane looks for the same feature, but at a different location in the preceding layer. Each layer of the neocognitron consists of one module (an array) of S-Cells, followed by one module of C-Cells. Fig. 4 shows the overall organization of a neocognitron model as proposed by Fukushima [5, 6, 7].

Fig. 4 Organization of a Neocognitron model [5]

VS-Cells: The first type of inhibitory cell in the network is the VS-Cell. VS-Cells receive input from S-Cells and lead into (and inhibit) C-Cells. The purpose of the VS-Cell is to shunt the output of the C-Cell if the inputs are too small.

VC-Cells: The other kind of inhibitory cell in the network is the VC-Cell. VC-Cells receive input from C-Cells and lead into (and inhibit) S-Cells. The purpose of the VC-Cell is to enable the S-Cell to be selective to a particular feature.
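The statement that all cells in a plane share identically weighted connections, so that each cell looks for the same feature at a different location, can be illustrated with a small sketch. It is a simplification under stated assumptions: inhibition and the output nonlinearity are left out, only window positions that fit entirely inside the input are evaluated, and the function name and the tiny example image are hypothetical.

```python
def plane_response(image, weights):
    """Apply ONE shared weight window at every position of the input.

    image and weights are lists of rows. Each output value plays the role of
    one cell in the plane: same weights, different location.
    """
    rows, cols = len(image), len(image[0])
    win_rows, win_cols = len(weights), len(weights[0])
    plane = []
    for r in range(rows - win_rows + 1):
        row_out = []
        for c in range(cols - win_cols + 1):
            total = sum(
                weights[i][j] * image[r + i][c + j]
                for i in range(win_rows)
                for j in range(win_cols)
            )
            row_out.append(total)
        plane.append(row_out)
    return plane


# The same horizontal-line detector fires wherever the line occurs.
image = [
    [0, 0, 0, 0, 0],
    [1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 1, 1, 1],
]
horizontal_line = [[1, 1, 1]]
plane = plane_response(image, horizontal_line)
```

In this example the strongest responses sit at the two places where the three-point horizontal line appears, which is the sense in which a plane detects one feature independently of its position.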

S-Cells: The S-Cell is an excitatory cell which has variably weighted incoming connections. Most connections into an S-Cell come from a fixed region of C-Cells in the preceding layer (or, for S1, the input layer). If the preceding layer contains multiple planes, each S-Cell has connections from every plane. C-Cell connections serve to excite the S-Cell. In addition to the excitatory connections, each S-Cell has one inhibitory connection (from a VC-Cell). S-Cells are the primary units of learning in the network. An S-Cell learns to detect a certain feature from the presented pattern. Suppose that a given S-Cell responds to feature f. As learning progresses in time, this S-Cell yields stronger and stronger outputs when presented with f, and weaker outputs if presented with a different pattern. Eventually, the S-Cell will yield 1.0 when presented with f, and less than 1.0 when presented with anything other than f. Essentially, S-Cells become selectively responsive to a certain feature. Thus the S-Cell is the basic mechanism for feature detection (and ultimately pattern detection) in the network.

C-Cells: C-Cells are the other type of excitatory cell in the network. Unlike S-Cells, C-Cells do not have variably weighted incoming connections. The connections are fixed according to a monotonically decreasing function, d(v). The primary purpose of the C-Cell is to allow distortion and positional shifts in the image. The function d(v) controls the amount of distortion permitted. If d(v) decreases gradually (linearly), a great deal of distortion is allowed. If it decreases rapidly (exponentially), only slight distortion is allowed. Most input to a C-Cell comes from a region of S-Cells in the preceding layer. The amount of influence that an S-Cell (from the region) has on the C-Cell is a result of the function d(v). In addition to the excitatory inputs from S-Cells, the C-Cell has one inhibitory input from a VS-Cell.

As described above, a plane is a matrix of cells containing either all S-Cells or all C-Cells (the inhibitory cells may be considered part of the plane as well). The two remaining units of structure in the network are the module and the layer. A module is an array of S or C planes. A layer in the network consists of one S-Module and one C-Module. Fig. 5 shows an example of the outputs of a multi-layer neocognitron model for an input pattern of digit 1. This diagram shows the output generated by a neocognitron used to recognize digits. Cells are represented by the individual pixels within the boxes. If the output of a cell exceeds a threshold value, the pixel is shown as black; otherwise it is left white. Planes are represented by rectangles, while modules are shown as groups of rectangles. A layer is represented by one S-Module and a corresponding C-Module (the first S-Module and first C-Module comprise the first layer). The cell shown by the arrow is an S-Cell in the second plane of the S-Module of the first layer.

Fig. 5 Outputs of a multi-layer neocognitron model for an input pattern of digit 1 (panels: Input Image, First Layer Connections, First S Module, First C Module, Output Module).

4. Feature Extraction using Neocognitron Model

The multi-layered neocognitron actually performs two main tasks: feature extraction and pattern recognition. The first layer of the network extracts small features from a presented image, and each advancing layer extracts larger and larger features. Feature extraction and pattern recognition are tied closely together, with recognition occurring implicitly as features are extracted. One problem with the multi-layered neocognitron is that the number of cells and connections (and hence the run time) tends to increase greatly as more layers are added. One way around this difficulty is to use a single-layer neocognitron to extract features and to use some other classification scheme (a different neural net or a conventional vector comparison algorithm) to perform the recognition task. The first layer of a trained neocognitron breaks the image down into a feature vector, with the output of each S-Plane being a component of the vector. The resulting feature vector can be compared to a series of stored feature vectors, with the closest vector producing the match.

For this experiment (due to the large size of the input image), the single-layer model of the neocognitron will be used. This model contains only S-Cells and VC-Cells. Connections leading into the S-Cells are of variable strength, while connections leading into VC-Cells are fixed according to a monotonically decreasing function. The single-layer neocognitron, unlike the multi-layer version, does not recognize the input image (although it does extract features). The output generated by this neocognitron is an array of S-Planes (the first S-Module of the network).

Algorithm for a Single Layer Network

The algorithm to learn the features of the pattern using a single layer network model is given below:

for each pattern do
  for layer := 1 to NumberOfLayers do
    1) Calculate all the VC-Cell outputs for each layer according to the following equation [5]:

       [equation not recovered; see [5]]

       where $c_{l-1}(v)$ are fixed interconnections determined to decrease monotonically with respect to $|v|$ and to satisfy a condition on their sum over the $K_{l-1}$ planes [equation not recovered; see [5]].

    2) Calculate all the S-Cell outputs for each layer according to equation [5]:

       [equation not recovered; see [5]]

       where

       $$\varphi[x] = \begin{cases} x & (x \ge 0) \\ 0 & (x < 0) \end{cases}$$

    3) if training then adjust the weights leading into the S-Cell module of the layer according to the following procedure:
       a) Construct S-Columns and select certain representative cells from the S-Module [5].
       b) Adjust the weights of all representatives according to the following equations [5]. If $u_{Sl}(\hat{k}_l, \hat{n})$ is the chosen representative S-Cell, then the weights are reinforced according to [5]:

          [equation not recovered; see [5]]
  endfor
endfor

Description of variables used:

r: inhibition intensity
K: number of cell planes
a: excitatory interconnecting coefficients
b: inhibitory interconnecting coefficients
S: number of cells in the connectable area
n: location of cell in the S-Plane
u: output of the S-Cell
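To make the three steps of the algorithm concrete, the following sketch computes one VC-Cell output, one S-Cell output, and one reinforcement step for a single connectable area. The formulas are an assumption: they follow the general form commonly given for the neocognitron in Fukushima's papers and are not taken from the equations of this paper, which could not be recovered here. The uniform coefficients `c`, the reinforcement rate `q`, the initial weights, and all function names are hypothetical, and the selection of representatives from S-Columns is omitted.

```python
import math


def phi(x):
    """phi[x] = x for x >= 0 and 0 otherwise."""
    return x if x >= 0 else 0.0


def vc_output(window, c):
    """Inhibitory VC-Cell output: weighted root mean square of the window inputs."""
    return math.sqrt(sum(cv * u * u for cv, u in zip(c, window)))


def s_output(window, a, b, c, r):
    """S-Cell output for one connectable area, flattened to a list (assumed form)."""
    excitation = 1.0 + sum(av * u for av, u in zip(a, window))
    inhibition = 1.0 + (r / (1.0 + r)) * b * vc_output(window, c)
    return r * phi(excitation / inhibition - 1.0)


def reinforce(window, a, b, c, q):
    """Strengthen the representative cell's weights in proportion to its inputs."""
    new_a = [av + q * cv * u for av, cv, u in zip(a, c, window)]
    new_b = b + q * vc_output(window, c)
    return new_a, new_b


# Hypothetical 4-point connectable area; uniform c is a simplification of the
# monotonically decreasing interconnections described in the text.
c = [0.25, 0.25, 0.25, 0.25]
r, q = 4.0, 10.0
a, b = [0.1, 0.1, 0.1, 0.1], 0.0

feature_f = [1.0, 1.0, 0.0, 0.0]   # the pattern used for training
other_g = [0.0, 0.0, 1.0, 1.0]     # a different pattern

a, b = reinforce(feature_f, a, b, c, q)
response_f = s_output(feature_f, a, b, c, r)
response_g = s_output(other_g, a, b, c, r)
```

After the single reinforcement step the cell still responds to `feature_f` while its response to `other_g` is suppressed to zero by the inhibitory term, which is the selective behaviour the S-Cell description relies on.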


5. Experimental Results
The proposed network model was used to learn vowel features from six different vowel and diphthong sounds in English, namely, /a/, /e/, /i/, /o/, /u/, and /y/. The quasi-stationary vocalic region is first extracted from the speech signal using total-energy and zero-crossing information. A spectrogram of the vocalic region is then generated as described in Section 2. From the speech spectrogram, the speech pattern is obtained. The speech pattern is then skeletonized and preprocessed. The preprocessed pattern is normalized to 40 time frames, each with 120 spectral points. The input to the neocognitron model is thus a pattern of size 40 x 120.

The neocognitron parameters used in this experiment are as follows. Every S-Plane in the first layer is equivalent in dimensions to the input layer. Thus each plane contains 120 x 40 = 4800 cells. There are 50 such planes in the network, so the network attempts to detect 50 features. Altogether there are 240,000 S-Cells and 50 VC-Cells. The connectable area of each S-Plane is 20 x 20, so the network attempts to extract features that are 20 x 20 in size. The 20-point horizontal window divides the spectral pattern into 6 equal regions of 600 Hz; i.e., the first region covers 0-600 Hz, the second region 600-1200 Hz, and so on. For vowel sounds, significant spectral changes occur within these frequency ranges. The 20 time-frame vertical window divides the spectral pattern into two regions in time. For certain diphthong sounds, for example /i/, spectral changes occur in time as well.

In order to test the learning property of the network model, two tests were carried out. First, the network was trained on ten different utterances of the vowel /a/, pronounced by five different male and five different female speakers. The features extracted for the sound /a/ are shown in Fig. 6. Another test was done using utterances of all the sounds from a single speaker. Fig. 7 shows sample input patterns of all six vowels used for this test. The features learned by the network for all six sounds are shown in Fig. 8. In Fig. 8, the sixteen features extracted for the six vowel patterns are numbered one through sixteen. Feature (1), for example, is a unique feature for /a/. Feature (4) is unique for /i/, and so on. By comparing these features against the patterns given in Fig. 7, one can see that sufficient unique features are extracted for each pattern category. Altogether there are 50 features extracted; however, Fig. 8 shows only the strongest features. Our initial test results show that the network model is capable of learning all important features that are present in the pattern.
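The network dimensions quoted above follow from a few multiplications and divisions, which the sketch below reproduces. The function name and dictionary keys are hypothetical; the resolution of 30 Hz per spectral point is derived here from the stated 600 Hz regions and the 20-point window and is not stated in the text.

```python
def network_dimensions(time_frames=40, spectral_points=120, planes=50,
                       window=20, region_hz=600):
    """Reproduce the size figures of the single-layer network from its parameters."""
    cells_per_plane = time_frames * spectral_points
    freq_regions = spectral_points // window
    return {
        "cells_per_plane": cells_per_plane,          # one S-Cell per input point
        "total_s_cells": cells_per_plane * planes,   # all S-Planes together
        "freq_regions": freq_regions,                # windows along the frequency axis
        "time_regions": time_frames // window,       # windows along the time axis
        "covered_hz": freq_regions * region_hz,      # total band spanned by the regions
        "hz_per_point": region_hz / window,          # derived spectral resolution
    }


dims = network_dimensions()
```

With the default parameters the result matches the figures in the text: 4800 cells per plane, 240,000 S-Cells, six frequency regions, and two time regions.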

Fig. 6 Pattern of sound /a/ and corresponding 10 features extracted by the network

Fig. 7 Patterns of six vowel sounds used as input to the single layer network

Fig. 8 Sixteen features extracted by the network for six different vowel patterns (panels numbered (1) through (16))

6. Conclusion
The main goal of this research work is to explore the possibility of obtaining robust, speaker-independent properties from speech spectrograms using multi-layered network models. Preliminary work carried out along this direction showed interesting results. Only preliminary results are reported in this paper, and extensive training and various parameter adjustments are needed in order to exploit the network's capability to its maximum. Once the features are extracted properly and completely, the next step will be to design a classifier system to recognize the sound classes. This classifier system could be another neural network model, such as a perceptron model, or it could be any other conventional classification scheme. Even though our ultimate goal is speaker-independent recognition of connected or continuous speech, our immediate focus is on recognizing specific sound classes. In our future work we will extend this technique to classify other speech sounds, such as plosives, fricatives, nasals, etc. At present we are working on vowel and diphthong sounds.
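As one possible reading of the "conventional classification scheme" mentioned above, the sketch below compares a feature vector against stored vectors and returns the label of the closest one. It is an illustration only: reducing each S-Plane to its maximum output, the use of Euclidean distance, the function names, and the stored vectors are all assumptions and not results from this work.

```python
import math


def feature_vector(s_planes):
    """Reduce each S-Plane (a list of rows) to one component: its strongest output."""
    return [max(max(row) for row in plane) for plane in s_planes]


def nearest_label(vector, stored):
    """Return the label whose stored feature vector is closest in Euclidean distance."""
    return min(stored, key=lambda label: math.dist(vector, stored[label]))


# Hypothetical stored vectors for two sound classes (made-up numbers).
stored_vectors = {
    "/a/": [0.9, 0.1, 0.0],
    "/i/": [0.1, 0.8, 0.3],
}

# Hypothetical outputs of three small S-Planes for a test pattern.
test_planes = [
    [[0.0, 0.8], [0.2, 0.1]],
    [[0.1, 0.0], [0.2, 0.1]],
    [[0.0, 0.0], [0.0, 0.1]],
]
label = nearest_label(feature_vector(test_planes), stored_vectors)
```

A nearest-vector rule of this kind keeps the feature extractor and the classifier separate, so the classifier could later be replaced by a perceptron without changing the single-layer network.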

References
[1] R. DeMori, M. Palakal, "On The Use Of A Taxonomy of Time-Frequency Morphologies for Automatic Speech Recognition", International Joint Conference on Artificial Intelligence, Los Angeles, California, 1985.
[2] M. Palakal, "Morphological Representation of Speech Knowledge for Automatic Speech Recognition Systems", in Recent Advances in Speech Understanding Systems, NATO Advanced Study Institute, Bad Windsheim, W. Germany, 1987.
[3] E. Merlo, R. DeMori, M. Palakal, and G. Mercier, "A continuous Parameter and frequency domain based Markov Model", Proceedings of International Conference on Acoustics, Speech, and Signal Processing, Tokyo, 1986.
[4] R. DeMori, E. Merlo, M. Palakal, J. Rouat, "Use of Procedural Knowledge for Automatic Speech Recognition", International Joint Conference on Artificial Intelligence, Milano, Italy, 1987.
[5] K. Fukushima, "A Neocognitron: A new Algorithm for Pattern Recognition tolerant of deformations and shifts in position", Pattern Recognition, Vol. 15, No. 6, 1982, pp. 455-469.
[6] K. Fukushima, "A Neural Network For Visual Pattern ,", IEEE Computers, March 1988, pp. 65-74.
[7] K. Fukushima, "Neocognitron: A Self-Organizing Neural Network Model for a Mechanism of Pattern Recognition Unaffected by Shift in Position", Biological Cybernetics, Vol. 36, No. 4, April 1980, pp. 193-202.
