
The Comparison and Combination of Genetic and Gradient Descent Learning in Recurrent Neural Networks: An Application to Speech Phoneme Classification

Rohitash Chandra
School of Science and Technology
The University of Fiji
rohitashc@unifiji.ac.fj

Christian W. Omlin
The University of Western Cape

Abstract

We present a training approach for recurrent neural networks that combines evolutionary and gradient descent learning. We train the weights of the network using genetic algorithms. We then apply gradient descent learning to the knowledge acquired by genetic training in order to refine it further. We also train the same network topology with genetic neural learning alone and with gradient descent learning alone for comparison. We apply these training methods to speech phoneme classification. We use Mel frequency cepstral coefficients for feature extraction of phonemes read from the TIMIT speech database. Our results show that combined genetic and gradient descent learning can train recurrent neural networks for phoneme classification; however, its generalization performance does not differ significantly from that of genetic neural learning or gradient descent alone. Genetic neural learning has shown the best training performance in terms of training time.

1. Introduction

Recurrent neural networks have been an important focus of research as they can be applied to difficult problems involving time-varying patterns. Their applications range from speech recognition and financial prediction to gesture recognition [1]-[3]. Recurrent neural networks are capable of modeling complicated recognition tasks. They have shown more accuracy in speech recognition in cases of low quality noisy data compared to hidden Markov models [4]. Recurrent neural networks are dynamical systems, and it has been shown that they can represent deterministic finite automata in their internal weight representations [5].

Backpropagation through time employs gradient descent learning and is popular for training recurrent neural networks. The goal of gradient descent learning is to minimize the network's output error by adjusting the weights of the network upon the presentation of training samples. Backpropagation through time is an extension of the backpropagation algorithm used for training feedforward networks. The algorithm unfolds a recurrent neural network in time and views it as a deep multilayer feedforward network. Gradient descent learning faces the problem of the network getting trapped in local minima. A momentum term is usually used to address this learning difficulty. In some cases, the network topology is pruned to improve the network's generalization by deleting some neurons from the network [6]. In this paper, we will use gradient descent learning to train recurrent neural networks for phoneme classification.

Knowledge based neurocomputing is a paradigm which incorporates expert knowledge into neural networks prior to training for improved training and generalization performance [7]. Expert knowledge provides the network with hints during training. This paradigm is limited to applications where expert knowledge is available. Evolutionary optimization techniques such as genetic algorithms have been a popular alternative to gradient descent learning for training neural networks [8]. It has been observed that genetic algorithms overcome the problem of local minima, whereas in gradient descent search for the optimal solution it may be difficult to drive the network out of a local minimum, which in turn proves costly in terms of training time.

In this paper, we will show how genetic algorithms can be applied to train recurrent neural networks and compare their performance with gradient descent learning. We will combine genetic and gradient descent learning to train the network architecture for classification of two phonemes extracted from the TIMIT speech database. After successful genetic training, we will use gradient descent learning to train further on the knowledge acquired in the genetic training process. In this way, we will combine both training paradigms. Gradient descent learning will be used to further refine the knowledge acquired by genetic training.

2. Definition and Methods

2.1 Recurrent Neural Networks

Neural networks are loosely modeled on the brain. They learn by training on past experience and generalize well to unseen instances. Neural networks are categorized into feedforward and recurrent neural networks. Feedforward networks are used in applications where the data does not contain time-variant information, while recurrent neural networks model time series sequences and possess dynamical characteristics. Recurrent neural networks contain feedback connections. They have the ability to maintain information from past states for the computation of future state outputs.

Backpropagation employs gradient descent learning and is the most popular algorithm used for training neural networks. One limitation of training neural networks using gradient descent learning is the tendency to get trapped in local minima, resulting in poor training and generalization performance. Evolutionary optimization methods such as genetic algorithms are also used for neural network training; they do not face the local minima problem encountered in gradient descent learning.

Research has been done to improve the training performance of neural networks, which has a bearing on their generalization. Symbolic or expert knowledge is inserted into neural networks prior to training for better training and generalization performance. It has been shown that deterministic finite-state automata can be directly encoded into recurrent neural networks prior to training [6]. Until recently, neural networks were viewed as black boxes as they could not explain the knowledge learnt in the training process. The extraction of rules from neural networks shows how they arrived at a particular solution after training. The extraction of finite-state automata from trained recurrent neural networks shows that they have characteristics for modeling dynamical systems.

Recurrent neural networks are composed of an input layer, a context layer which provides state information, a hidden layer and an output layer, as shown in Figure 1. Each layer contains one or more processing units called neurons, which propagate information from one layer to the next by computing a non-linear function of their weighted sum of inputs. Popular architectures of recurrent neural networks include first-order recurrent networks [9], second-order recurrent networks [10], NARX networks [11] and LSTM recurrent networks [12]. A detailed study of the vast variety of recurrent neural networks is beyond the scope of this paper. We will use first-order recurrent neural networks to show the combination of evolutionary and gradient descent learning. Their dynamics are given by equation 1.

$$S_i(t) = g\left( \sum_{k=1}^{K} V_{ik} S_k(t-1) + \sum_{j=1}^{J} W_{ij} I_j(t-1) \right) \tag{1}$$

where $S_k(t)$ and $I_j(t)$ represent the outputs of the state neurons and input neurons, respectively. $V_{ik}$ and $W_{ij}$ represent their corresponding weights. $g(\cdot)$ is a sigmoidal discriminant function.

Figure 1: First-order recurrent neural network architecture. The recurrence from the hidden to the context layer is shown. Dashed lines indicate that more neurons can be used in each layer depending on the application.

2.2 Backpropagation Through-Time

Backpropagation is the most widely applied learning algorithm for both feedforward and recurrent neural networks. It learns the weights for a multilayer network, given a network with a fixed set of units and interconnections. Backpropagation employs gradient descent to minimize the squared error between the network's output values and the desired values for those outputs. The learning problem faced by backpropagation is to search a large hypothesis space defined by the weight values for all the units of the network.
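As a concrete reference point for the unfolding discussed in this section, the following is a minimal sketch of the first-order dynamics of equation 1, applied step by step over an input sequence. The function names, the layer sizes and the choice of the logistic function for g(.) are illustrative assumptions; the paper only states that g(.) is sigmoidal.

```python
import math
import random


def sigmoid(x):
    """Logistic sigmoid, one common choice for g(.) (an assumption here)."""
    return 1.0 / (1.0 + math.exp(-x))


def rnn_step(V, W, state, inputs):
    """One application of equation 1.

    V[i][k] weights state neuron k into neuron i; W[i][j] weights input
    neuron j into neuron i. `state` is S(t-1) and `inputs` is I(t-1).
    """
    new_state = []
    for i in range(len(V)):
        net = sum(V[i][k] * state[k] for k in range(len(state)))
        net += sum(W[i][j] * inputs[j] for j in range(len(inputs)))
        new_state.append(sigmoid(net))
    return new_state


def run_sequence(V, W, sequence, initial_state):
    """Apply equation 1 once per time step and return every state S(t)."""
    states = [initial_state]
    for inputs in sequence:
        states.append(rnn_step(V, W, states[-1], inputs))
    return states


if __name__ == "__main__":
    rng = random.Random(0)
    K, J = 3, 2  # hypothetical sizes: 3 state neurons, 2 input neurons
    V = [[rng.uniform(-1, 1) for _ in range(K)] for _ in range(K)]
    W = [[rng.uniform(-1, 1) for _ in range(J)] for _ in range(K)]
    sequence = [[0.5, -0.2], [0.1, 0.9], [0.0, 0.3]]
    for t, s in enumerate(run_sequence(V, W, sequence, [0.0] * K)):
        print(t, [round(x, 3) for x in s])
```

Each entry of the returned list corresponds to one layer of the network once it is unfolded in time, which is the view that backpropagation through time takes of the same computation.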
Backpropagation through time (BPTT) is a gradient descent learning algorithm used for training first-order recurrent neural networks [13]. BPTT is an extension of the backpropagation algorithm. The general idea behind BPTT is to unfold the recurrent neural network in time so that it becomes a deep multilayer feedforward network. When unfolded in time, the network has the same behavior as a recurrent neural network for a finite number of time steps.

The goal of gradient descent learning is to minimize the sum of squared errors by propagating error signals backward through the network architecture upon the presentation of training samples from the training set. These error signals are used to calculate the weight updates, which represent the knowledge learnt in neural networks. In gradient descent search for a solution, the network searches through a weight space of errors. Therefore it may easily get trapped in a local minimum. This may prove costly in terms of network training and generalization performance. In time-varying sequences, longer patterns represent long time dependencies. Gradient descent has difficulties in learning long time dependencies, as the error gradient vanishes with increasing duration of dependencies [14].

Given below are the training equations unfolded in time; hence time t becomes the layer L. For each training example d, every weight $w_{ji}$ is updated by adding $\Delta w_{ji}$ to it.

$$\Delta w_{ji} = -\alpha \frac{\partial E_d}{\partial w_{ji}} \tag{2}$$

where $E_d$ is the error on training example d, summed over all output units in the network:

$$E_d = \frac{1}{2} \sum_{j=1}^{m} \left( d_j - S_j^L \right)^2 \tag{3}$$

Here $d_j$ is the desired output for neuron j in the output layer, which contains m neurons, and $S_j^L$ is the network output of neuron j in the output layer L. After computing the derivative, the weight update is given by:

$$\Delta w_{ji}^{L} = \alpha \, \delta_j^{L} \, S_i^{L-1} \tag{4}$$

where $w_{ji}^{L}$ connects neuron i in layer L-1 to neuron j in layer L, and $\alpha$ is the learning rate constant. The learning rate determines how fast the weights are updated in the direction of the gradient. The error gradient for neuron j, $\delta_j^L$, for the output layer is given by:

$$\delta_j^L = \left( d_j - S_j^L \right) S_j^L \left( 1 - S_j^L \right) \tag{5}$$

The error gradient for the hidden layers is given by:

$$\delta_j^L = S_j^L \left( 1 - S_j^L \right) \sum_{k=1}^{m} \delta_k^{L+1} w_{kj}^{L+1} \tag{6}$$

Heuristics to improve the performance of backpropagation include adding a momentum term and training multiple networks with the same data but different small random initializations prior to training.

2.3 Evolutionary Training of Recurrent Neural Networks

2.3.1 Genetic Algorithms

Genetic algorithms provide a learning method motivated by biological evolution. They are search techniques that can be used both for solving problems and for modeling evolutionary systems [15]. The problem faced by genetic algorithms is to search a space of candidate hypotheses and find the best hypothesis. The hypothesis fitness is a numerical measure used to identify the "best hypothesis", that is, the one that optimizes the problem. The algorithm operates by iteratively updating a pool of hypotheses, called the population. The population consists of many individuals called chromosomes. All members of the population are evaluated by the fitness function on each iteration. A new population is then generated by probabilistically selecting the most fit chromosomes from the current population. Some of the selected chromosomes are added to the new generation while others are selected as parent chromosomes. Parent chromosomes are used for creating new offspring by applying genetic operators such as crossover and mutation. Traditionally, the chromosomes represent bit strings; however, real number representation is possible.

2.3.2 Genetic Algorithms for Training Neural Networks

In order to use genetic algorithms for training neural networks, we need to represent the problem as chromosomes. Real-valued weights, rather than binary values, must be encoded in the chromosome. This is done by altering the crossover and mutation operators. A crossover operator takes two parent chromosomes and creates a single child chromosome by randomly selecting corresponding genetic material from both parents. The mutation operator adds a small random real number between -1 and 1 to a randomly selected gene in the chromosome.
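The two real-valued operators just described can be sketched as follows. This is a minimal illustration under stated assumptions: each gene is taken from either parent with equal probability (the text does not specify the selection scheme), and the function names `crossover` and `mutate` are hypothetical.

```python
import random


def crossover(parent_a, parent_b, rng):
    """Create one child by taking each gene from one of the two parents.

    Picking either parent with probability 0.5 per gene is an assumption.
    """
    return [a if rng.random() < 0.5 else b for a, b in zip(parent_a, parent_b)]


def mutate(chromosome, rng):
    """Add a random real number in [-1, 1] to one randomly selected gene."""
    child = list(chromosome)  # copy so the parent is left unchanged
    index = rng.randrange(len(child))
    child[index] += rng.uniform(-1.0, 1.0)
    return child


if __name__ == "__main__":
    rng = random.Random(42)
    # Each chromosome is a flat list of network weights (hypothetical values).
    mother = [0.10, -0.40, 0.75, 0.20]
    father = [0.90, 0.30, -0.55, -0.80]
    child = crossover(mother, father, rng)
    print("child: ", child)
    print("mutant:", mutate(child, rng))
```

Because both operators work directly on lists of real numbers, a chromosome can hold the network weights without any binary encoding or decoding step.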
In evolutionary neural learning, the task of genetic algorithms is to find the optimal set of weights in a network topology which minimizes the error function. The fitness function must define the performance of the neural network. Thus, the fitness function is the reciprocal of the sum of squared errors of the neural network. To evaluate the fitness function, each weight encoded in the chromosome is assigned to the respective weight link of the network. The training set of examples is then presented to the network, which propagates the information forward, and the sum of squared errors is calculated. In this way, genetic algorithms attempt to find a set of weights which minimizes the error function of the network. Recurrent neural networks have been trained by evolutionary computation methods such as genetic algorithms, which optimize the weights in the network architecture for a particular problem. Compared to gradient descent learning, genetic algorithms can help the network to escape from local minima.

2.4 Knowledge Based Neurocomputing

The general paradigm of knowledge based neurocomputing includes the combination of symbolic knowledge with neural networks for better training and generalization performance [7]. The fidelity of the mapping of the prior knowledge is very important, since the network may not take advantage of poorly encoded knowledge. Poorly encoded knowledge may hinder the learning process. Good prior knowledge encoding may provide the network with beneficial features such as: 1) the learning process may converge faster to a solution, meaning better training performance, 2) networks trained with prior knowledge may generalize better than networks trained with no prior knowledge, and 3) the rules in the prior knowledge may help to generate additional training data which are not present in the original data set.

Prior knowledge, usually represented as explicit rules in symbolic form, is encoded in neural networks by programming some weights prior to training [7]. In feedforward neural networks, prior knowledge is encoded in propositional logic expression form by programming a subset of weights. Prior knowledge also determines the topology of the network, i.e. the number of neurons and hidden layers appropriate for encoding the knowledge. The paradigm has been successfully applied to real world problems including bio-conservation [16] and molecular biology [17]. The prior or expert knowledge helps the network to achieve better generalization and training performance when compared with a network architecture without prior knowledge encoding.

For recurrent neural networks, finite-state automata are the basis for knowledge insertion. It has been shown that deterministic finite-state automata can be encoded in discrete-time second-order recurrent neural networks by directly programming a small subset of the available weights [6]. For first-order recurrent neural networks, a method for encoding finite-state automata has been proposed and demonstrated [18].

2.5 Speech Phoneme Classification

A speech sequence contains a huge amount of irrelevant information. In order to model speech sequences, feature extraction is necessary. In feature extraction, useful information is extracted from speech sequences and then used for modeling. Recurrent neural networks and hidden Markov models have been successfully applied to modeling speech sequences [1,4]. They have been applied to recognize words and phonemes. The performance of a speech recognition system can be measured in terms of accuracy and speed. Recurrent neural networks are capable of modeling complicated recognition tasks. They have shown more accuracy in recognition in cases of low quality noisy data compared to hidden Markov models. However, hidden Markov models have been shown to perform better when it comes to large vocabularies. Extensive research on speech recognition has been done for more than forty years; however, scientists have been unable to implement systems which show excellent performance in environments with background noise.

Mel frequency cepstral coefficients (MFCC) are a useful feature extraction technique, as the Mel filter has characteristics similar to the human auditory system [19]. The human ear performs similar processing before presenting information to the brain. We will apply MFCC feature extraction to extract features from phonemes in the TIMIT speech database. MFCC feature extraction is done by applying the following procedure. A frame of the speech signal is obtained by windowing and is presented to the discrete Fourier transform to change the signal from the time domain to the frequency domain. The resulting spectrum is then mapped onto the Mel scale using triangular overlapping windows. Finally, we compute the log energy at the output of each filter and apply a discrete cosine transform to the Mel amplitudes, from which we obtain a vector of MFCC features.
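The MFCC procedure described above can be sketched for a single frame as follows. This is a minimal illustration, not the authors' implementation: the Hamming window, the 26 Mel filters, the Mel scale formula, the bin mapping and the choice to skip the zeroth cepstral coefficient are all assumptions. The 16 kHz default matches the TIMIT sampling rate, and a plain DFT is used so that only the standard library is needed.

```python
import cmath
import math


def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def power_spectrum(frame):
    """Hamming-window the frame and return |DFT|^2 for bins 0..N/2."""
    n = len(frame)
    windowed = [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(windowed[i] * cmath.exp(-2j * math.pi * k * i / n)
                  for i in range(n))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def mel_filterbank(num_filters, n_fft, sample_rate):
    """Triangular, overlapping filters spaced evenly on the Mel scale."""
    top = hz_to_mel(sample_rate / 2.0)
    points = [mel_to_hz(top * i / (num_filters + 1))
              for i in range(num_filters + 2)]
    bins = [int(math.floor((n_fft + 1) * hz / sample_rate)) for hz in points]
    bank = []
    for m in range(1, num_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        row = [0.0] * (n_fft // 2 + 1)
        for k in range(left, center):      # rising edge of the triangle
            row[k] = (k - left) / (center - left)
        for k in range(center, right):     # falling edge of the triangle
            row[k] = (right - k) / (right - center)
        bank.append(row)
    return bank


def mfcc(frame, sample_rate=16000, num_filters=26, num_coeffs=12):
    """Return coefficients 1..num_coeffs of the DCT-II of the log energies."""
    spectrum = power_spectrum(frame)
    bank = mel_filterbank(num_filters, len(frame), sample_rate)
    # Floor the energy so that an empty filter does not produce log(0).
    log_energies = [math.log(max(sum(w * p for w, p in zip(row, spectrum)),
                                 1e-10)) for row in bank]
    return [sum(e * math.cos(math.pi * n * (m + 0.5) / num_filters)
                for m, e in enumerate(log_energies))
            for n in range(1, num_coeffs + 1)]


if __name__ == "__main__":
    # Hypothetical test signal: a 1 kHz tone, one frame of 512 samples.
    tone = [math.sin(2 * math.pi * 1000.0 * i / 16000.0) for i in range(512)]
    print([round(c, 2) for c in mfcc(tone)])
```

In practice a fast Fourier transform would replace the direct DFT loop; the slower form is kept here only to make each step of the procedure explicit.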
3. Combining Evolutionary and Gradient Descent Learning

A vast amount of research has been done to improve the training performance of neural networks. Some of the proposed methods include adding a momentum term during training, pruning the network architecture, and combining expert knowledge with neural network training. In the previous sections, we have discussed the major training methods for recurrent neural networks and outlined how their training performance can be improved. Knowledge based neurocomputing has proven successful, as the paradigm uses symbolic knowledge for programming a subset of weights into the network architecture prior to training. However, it cannot be applied in applications where expert knowledge is not available.

We have discussed how recurrent neural networks are trained using genetic algorithms. Genetic algorithms search for the optimal set of weights for a given network architecture. We need to investigate what sets of weight initializations are best for training using genetic algorithms. We have addressed the problem that in gradient descent learning the network may become trapped in local minima, resulting in poor training and generalization performance. However, in training using genetic algorithms, there is no such problem. In this paper, we will combine both genetic and gradient descent learning. We will train a recurrent neural network architecture using genetic neural learning; after successful training, we will apply gradient descent learning to further refine the knowledge acquired by genetic neural learning. We will train recurrent neural networks using gradient descent learning and record their training performance for classification of speech phonemes obtained from the TIMIT database. We will then train recurrent neural networks for phoneme classification using genetic neural learning. We will use different sets of weight initializations. We will define a genetic population size and train the network architecture until it can learn the training dataset.

4. Empirical Results and Discussion

4.1 MFCC Feature Extraction

We used Mel frequency cepstral coefficients for feature extraction from two phonemes read from the TIMIT speech database. We read phonemes ‘b’ and ‘d’ from the training and testing sets in the TIMIT database. The training dataset contained 645 samples while the testing set consisted of 238 samples. For each phoneme read, we applied a window of 512 samples every 256 sample points. We then windowed the signal and presented it to the discrete Fourier transform. Furthermore, we mapped the spectrum onto the Mel scale using triangular filters and then applied a discrete cosine transform to the Mel amplitudes. In this way we obtained a vector of 12 MFCC features for each frame of the phoneme.

4.2 Training Recurrent Neural Networks using Gradient Descent for Phoneme Classification

We used the training and testing data sets of features from the two phonemes ‘b’ and ‘d’ as discussed in Section 4.1. We used the following recurrent neural network topology: 12 neurons in the input layer, which represent the speech feature input, and 2 neurons in the output layer, one for each phoneme. We used a learning rate of 0.2 and ran experiments with 12, 14, 16 and 18 neurons in the hidden layer. We used the backpropagation through-time algorithm, which employs gradient descent for training. We ran two major experiments; Table 1 shows illustrative results for experiment 1, which uses small random initial weights in the range of -1 to 1, while Table 2 shows results for experiment 2, which uses larger initial weights in the range of -5 to 5. We terminated training once the network had learned 88% of the training samples, and tested the network's generalization performance with a data set not included in the training set. For both experiments, we set a maximum training time of 100 epochs. The results show that gradient descent has been successful in training recurrent neural networks for phoneme classification. An excessive number of neurons in the hidden layer makes training difficult. The two different sets of weight initializations have no major effect on the generalization performance. Upon successful training, the best generalization performance recorded was 82.6% on the presentation of a data set which was not included in the training set.

Table 1: Gradient descent learning

No. of Hidden   No. of Training   Training      Generalization
Neurons         Epochs            Performance   Performance
12              100               82.8%         82.6%
14              100               87.5%         82.6%
16              100               88%           82.6%
18              100               0.2%          0.4%

Small random weights initialised in the range of -1 to 1
Table 2: Gradient descent learning

No. of Hidden   No. of Training   Training      Generalization
Neurons         Epochs            Performance   Performance
12              100               88%           82.6%
14              100               88%           82.6%
16              100               88%           82.6%
18              100               0.2%          0.4%

Large random weights initialised in the range of -5 to 5

4.3 Training Recurrent Neural Networks using Genetic Algorithms

We obtained the training and testing data sets of phonemes ‘b’ and ‘d’ as discussed in Section 4.1. We used the following recurrent neural network topology: 12 neurons in the input layer, which represent the speech frame input, and 2 neurons in the output layer, one for each phoneme. We experimented with different numbers of neurons in the hidden layer.

Table 3: Genetic neural learning

No. of Hidden   No. of Training   Training      Generalization
Neurons         Generations       Performance   Performance
12              2                 88%           82.6%
14              4                 88%           82.6%
16              7                 88%           82.6%
18              3                 87.5%         82.6%

Small random weights initialised in the range of -1 to 1

Table 4: Genetic neural learning

No. of Hidden   No. of Training   Training      Generalization
Neurons         Generations       Performance   Performance
12              4                 88%           82.6%
14              4                 88%           82.6%
16              3                 88%           82.6%
18              2                 88%           82.6%

Large random weights initialised in the range of -5 to 5

We ran some sample experiments and found that a population size of 40, a crossover probability of 0.7 and a mutation probability of 0.1 gave good genetic training performance. Therefore, we used these values for all our experiments. We ran two major experiments with different weight initializations prior to training and trained the network until it could learn at least 88% of the samples in the training set. Illustrative results for each experiment are shown in Table 3 and Table 4, respectively. The results show 82.6 percent generalization performance on unseen samples which were not included in the training process. Genetic neural learning has shown better training performance compared to gradient descent learning; however, their generalization performance is the same.

4.4 Combined Genetic and Gradient Descent Learning

We applied the combined genetic and gradient descent learning to phoneme classification using recurrent neural networks. We classified two phonemes from the features extracted in Section 4.1. We used the network topology discussed in Section 4.3. We applied genetic algorithms for training; once the network had learned 88% of the samples, we terminated the genetic training and applied gradient descent to further refine the knowledge learnt by genetic training. We trained for 100 epochs. Tables 5 and 6 show illustrative results of the two major experiments initialized with two different sets of weights, respectively.

The results show that the generalization performance of combined genetic and gradient descent learning does not improve significantly when compared to the performance of genetic neural learning alone.

Table 5: Genetic and gradient descent learning

No. of Hidden   No. of Training   Training      Generalization
Neurons         Generations       Performance   Performance
12              100               81.8%         82.6%
14              100               81.1%         82.6%
16              100               81.3%         82.6%
18              100               81.3%         82.6%

Small random weights initialised in the range of -1 to 1

Table 6: Genetic and gradient descent learning

No. of Hidden   No. of Training   Training      Generalization
Neurons         Generations       Performance   Performance
12              100               78.1%         82.6%
14              100               85.6%         82.6%
16              100               81.3%         82.6%
18              100               81.5%         82.6%

Large random weights initialised in the range of -5 to 5
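The two-stage procedure of Section 4.4 can be sketched on a toy problem as follows. This is a hedged illustration only: a single sigmoid neuron and a synthetic two-input data set stand in for the recurrent network and the TIMIT features, and truncation selection with elitism replaces the probabilistic selection described in Section 2.3.1. The population size of 40, crossover probability of 0.7, mutation probability of 0.1, 88% stopping criterion, learning rate of 0.2 and 100 epochs are taken from the text; every function name is hypothetical.

```python
import math
import random


def predict(weights, x):
    """Sigmoid output of one neuron; weights[-1] is the bias."""
    net = sum(w * xi for w, xi in zip(weights, x)) + weights[-1]
    net = max(-60.0, min(60.0, net))  # guard against overflow in exp
    return 1.0 / (1.0 + math.exp(-net))


def sse(weights, data):
    """Sum of squared errors; fitness is its reciprocal, so lower is fitter."""
    return sum((d - predict(weights, x)) ** 2 for x, d in data)


def accuracy(weights, data):
    hits = sum((predict(weights, x) >= 0.5) == (d >= 0.5) for x, d in data)
    return hits / len(data)


def genetic_stage(data, n_weights, rng, pop_size=40, p_cross=0.7,
                  p_mut=0.1, target=0.88, max_generations=200):
    """Evolve weight vectors until the best one learns `target` of the data."""
    population = [[rng.uniform(-1, 1) for _ in range(n_weights)]
                  for _ in range(pop_size)]
    for _ in range(max_generations):
        population.sort(key=lambda w: sse(w, data))
        if accuracy(population[0], data) >= target:
            break
        parents = population[:pop_size // 2]   # simplified selection
        children = [list(population[0])]       # elitism: keep the best
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = list(a)
            if rng.random() < p_cross:
                child = [x if rng.random() < 0.5 else y for x, y in zip(a, b)]
            if rng.random() < p_mut:
                child[rng.randrange(n_weights)] += rng.uniform(-1, 1)
            children.append(child)
        population = children
    return min(population, key=lambda w: sse(w, data))


def gradient_stage(weights, data, rate=0.2, epochs=100):
    """Refine the evolved weights by gradient descent; keep the best seen."""
    current = list(weights)
    best, best_err = list(weights), sse(weights, data)
    for _ in range(epochs):
        for x, d in data:
            out = predict(current, x)
            delta = (d - out) * out * (1.0 - out)  # output-layer error gradient
            for i, xi in enumerate(x):
                current[i] += rate * delta * xi
            current[-1] += rate * delta
        err = sse(current, data)
        if err < best_err:
            best, best_err = list(current), err
    return best


if __name__ == "__main__":
    rng = random.Random(1)
    points = [(rng.random(), rng.random()) for _ in range(60)]
    data = [(p, 1.0 if p[0] + p[1] > 1.0 else 0.0) for p in points]
    evolved = genetic_stage(data, 3, rng)
    refined = gradient_stage(evolved, data)
    print("after genetic stage: ", accuracy(evolved, data), sse(evolved, data))
    print("after gradient stage:", accuracy(refined, data), sse(refined, data))
```

Keeping the best weights seen during the gradient stage is a design choice of this sketch: it ensures the refined weights never have a higher training error than the evolved ones, although, as the tables above indicate, a lower training error need not translate into better generalization.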
5. Conclusions

We have discussed the popular recurrent neural network training paradigms and outlined their strengths and limitations. We discussed the application of recurrent neural networks to speech phoneme classification using Mel frequency cepstral coefficients for feature extraction. We have successfully trained recurrent neural networks to classify phonemes ‘b’ and ‘d’ extracted from the TIMIT speech database by gradient descent and genetic training methods.

We have compared and combined genetic and gradient descent learning in recurrent neural networks. Our results demonstrate that genetic neural learning has better training performance than gradient descent; however, their generalization performance is the same. We combined genetic and gradient descent learning for recurrent network training and found that the generalization performance of the combined method does not improve significantly when compared to genetic neural learning alone. The application of the combined training method to other real world problems remains an open question.

6. References

[1] A. J. Robinson, "An application of recurrent nets to phone probability estimation," IEEE Transactions on Neural Networks, vol. 5, no. 2, 1994, pp. 298-305.

[2] C. L. Giles, S. Lawrence and A. C. Tsoi, "Rule inference for financial prediction using recurrent neural networks," Proc. of the IEEE/IAFE Computational Intelligence for Financial Engineering, New York City, USA, 1997, pp. 253-259.

[3] K. Murakami and H. Taguchi, "Gesture recognition using recurrent neural networks," Proc. of the SIGCHI Conference on Human Factors in Computing Systems: Reaching Through Technology, Louisiana, USA, 1991, pp. 237-242.

[4] M. J. F. Gales, "Maximum likelihood linear transformations for HMM-based speech recognition," Computer Speech and Language, vol. 12, 1998, pp. 75-98.

[5] C. L. Giles, C. W. Omlin and K. Thornber, "Equivalence in knowledge representation: automata, recurrent neural networks, and dynamical systems," Proc. of the IEEE, vol. 87, no. 9, 1999, pp. 1623-1640.

[6] C. W. Omlin and C. L. Giles, "Pruning recurrent neural networks for improved generalization performance," IEEE Transactions on Neural Networks, vol. 5, no. 5, 1994, pp. 848-851.

[7] G. Towell and J. W. Shavlik, "Knowledge based artificial neural networks," Artificial Intelligence, vol. 70, no. 4, 1994, pp. 119-166.

[8] C. Kim Wing Ku, M. Wai Mak and W. Chi Siu, "Adding learning to cellular genetic algorithms for training recurrent neural networks," IEEE Transactions on Neural Networks, vol. 10, no. 2, 1999, pp. 239-252.

[9] P. Manolios and R. Fanelli, "First order recurrent neural networks and deterministic finite state automata," Neural Computation, vol. 6, no. 6, 1994, pp. 1154-1172.

[10] R. L. Watrous and G. M. Kuhn, "Induction of finite-state languages using second-order recurrent networks," Proc. of Advances in Neural Information Processing Systems, California, USA, 1992, pp. 309-316.

[11] T. Lin, B. G. Horne, P. Tino and C. L. Giles, "Learning long-term dependencies in NARX recurrent neural networks," IEEE Transactions on Neural Networks, vol. 7, no. 6, 1996, pp. 1329-1338.

[12] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, 1997, pp. 1735-1780.

[13] P. J. Werbos, "Backpropagation through time: what it does and how to do it," Proc. of the IEEE, vol. 78, no. 10, 1990, pp. 1550-1560.

[14] Y. Bengio, P. Simard and P. Frasconi, "Learning long-term dependencies with gradient descent is difficult," IEEE Transactions on Neural Networks, vol. 5, no. 2, 1994, pp. 157-166.

[15] T. M. Mitchell, Machine Learning, McGraw Hill, 1997.

[16] R. Chandra, R. Knight and C. W. Omlin, "Combining expert knowledge and ground truth: a knowledge based neurocomputing paradigm for bio-conservation decision support systems," Proc. of the International Conference on Environmental Management, Hyderabad, India, 2005.

[17] C. W. Omlin and S. Snyders, "Inductive bias strength in knowledge-based neural networks: application to magnetic resonance spectroscopy of breast tissues," Artificial Intelligence in Medicine, vol. 28, no. 2, 2003.

[18] P. Frasconi, M. Gori, M. Maggini and G. Soda, "Unified integration of explicit rules and learning by example in recurrent networks," IEEE Transactions on Knowledge and Data Engineering, vol. 7, no. 2, 1995, pp. 340-346.

[19] I. Potamitis, N. Fakotakis and G. Kokkinakis, "Improving the robustness of noisy MFCC features using minimal recurrent neural networks," Proc. of the IEEE International Joint Conference on Neural Networks, vol. 5, 2000, p. 5271.
