Ajith Abraham
Oklahoma State University, Stillwater, OK, USA
Handbook of Measuring System Design, edited by Peter H. Sydenham and Richard Thorn.
2005 John Wiley & Sons, Ltd. ISBN: 0-470-02143-8.
902 Elements: B – Signal Conditioning
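[Figure: a multilayer network with input, hidden, and output layers, and a single neuron with inputs x1 to x4, weights w1 to w4, threshold θ, activation function f, and output o.]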
The best-known examples of this technique occur in the backpropagation algorithm, the delta rule, and the perceptron rule. In unsupervised learning (or self-organization), an (output) unit is trained to respond to clusters of patterns within the input. In this paradigm, the system is supposed to discover statistically salient features of the input population. Unlike the supervised learning paradigm, there is no a priori set of categories into which the patterns are to be classified; rather, the system must develop its own representation of the input stimuli. Reinforcement learning is learning what to do, that is, how to map situations to actions, so as to maximize a numerical reward signal. The learner is not told which actions to take, as in most forms of machine learning, but instead must discover which actions yield the most reward by trying them. In the most interesting and challenging cases, actions may affect not only the immediate reward, but also the next situation and, through that, all subsequent rewards. These two characteristics, trial-and-error search and delayed reward, are the two most important distinguishing features of reinforcement learning.

3 NEURAL NETWORK LEARNING

3.1 Hebbian learning

The learning paradigms discussed above result in an adjustment of the weights of the connections between units, according to some modification rule. Perhaps the most influential work in connectionism's history is the contribution of Hebb (1949), where he presented a theory of behavior based, as much as possible, on the physiology of the nervous system.

The most important concept to emerge from Hebb's work was his formal statement (known as Hebb's postulate) of how learning could occur. Learning was based on the modification of synaptic connections between neurons. Specifically, when an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased. The principles underlying this statement have become known as Hebbian learning. Most neural network learning techniques can be considered variants of the Hebbian learning rule. The basic idea is that if two neurons are active simultaneously, their interconnection must be strengthened. If we consider a single layer net, one of the interconnected neurons will be an input unit and one an output unit. If the data are represented in bipolar form, it is easy to express the desired weight update as

$$w_i(\text{new}) = w_i(\text{old}) + x_i\,o, \qquad i = 1, \dots, n,$$

where o is the desired output and n is the number of inputs.

Unfortunately, plain Hebbian learning continually strengthens its weights without bound (unless the input data is properly normalized).

3.2 Perceptron learning rule

The perceptron is a single layer neural network whose weights and biases can be trained to produce a correct target vector when presented with the corresponding input vector. The training technique used is called the perceptron-learning rule. Perceptrons are especially suited for simple problems in pattern classification.

Suppose we have a set of learning samples, each consisting of an input vector x and a desired output d(k). For a classification task, d(k) is usually +1 or −1. The perceptron-learning rule is very simple and can be stated as follows:

1. Start with random weights for the connections.
2. Select an input vector x from the set of training samples.
3. If the output $y(k) \neq d(k)$ (the perceptron gives an incorrect response), modify all connections $w_i$ according to $\Delta w_i = \eta\,(d(k) - y(k))\,x_i$, where η is the learning rate.
4. Go back to step 2.

Note that the procedure is very similar to the Hebb rule; the only difference is that when the network responds correctly, no connection weights are modified.

4 BACKPROPAGATION LEARNING

The simple perceptron can handle only linearly separable or linearly independent problems. By taking the partial derivative of the error of the network with respect to each weight, we learn a little about the direction in which the error of the network is moving.

In fact, if we take the negative of this derivative (i.e. the rate of change of the error as the value of the weight increases) and then add it to the weight, the error will decrease until it reaches a local minimum. This makes sense because if the derivative is positive, the error is increasing when the weight is increasing. The obvious thing to do then is to add a negative value to the weight, and vice versa if the derivative is negative. Because the taking of these partial derivatives and then applying them to each of the weights takes place starting from the output layer to hidden layer weights, then the hidden layer to input layer weights (as it turns out, this is necessary since
changing this set of weights requires that we know the partial derivatives calculated in the layer downstream), this algorithm has been called the backpropagation algorithm.

A neural network can be trained in two different modes: online and batch. The number of weight updates of the two methods for the same number of data presentations is very different.

In the online method, weight updates are computed for each input data sample, and the weights are modified after each sample.

An alternative solution is to compute the weight update for each input sample, but store these values during one pass through the training set, which is called an epoch. At the end of the epoch, all the contributions are added, and only then are the weights updated with the composite value. This method adapts the weights with a cumulative weight update, so it follows the gradient more closely. It is called the batch-training mode.

Training basically involves feeding training samples as input vectors through a neural network, calculating the error of the output layer, and then adjusting the weights of the network to minimize the error.

The average of all the squared errors (E) for the outputs is computed to make the derivative easier. Once the error is computed, the weights can be updated one by one. In the batched mode variant, the descent is based on the gradient ∇E for the total training set:

$$\Delta w_{ij}(n) = -\eta\,\frac{\partial E}{\partial w_{ij}} + \alpha\,\Delta w_{ij}(n-1) \qquad (4)$$

where η and α are the learning rate and momentum respectively.

The momentum term determines the effect of past weight changes on the current direction of movement in the weight space. A good choice of both η and α is required for the success of training and the speed of neural-network learning.

It has been proven that backpropagation learning with sufficient hidden layers can approximate any nonlinear function to arbitrary accuracy. This makes a neural network trained by backpropagation a good candidate for signal prediction and system modeling.

[...] data to get the network familiarized with noise and natural variability in real data. Poor training data inevitably leads to an unreliable and unpredictable network. Usually, the network is trained for a fixed number of epochs or until the output error falls below a particular error threshold.

Special care is to be taken not to overtrain the network. An overtrained network may become too adapted to the samples in the training set, and thus may be unable to accurately classify samples outside of the training set.

Figure 3 illustrates the classification results of an overtrained network. The task is to correctly classify two patterns X and Y. Training patterns and test patterns are drawn with different symbols. The test patterns were not shown during the training phase.

As shown in Figure 3 (left side), each class of test data has been classified correctly, even though the test patterns were not seen during training. The trained network is said to have good generalization performance. Figure 3 (right side) illustrates some misclassification of the test data. The network initially learns to detect the global features of the input and, as a consequence, generalizes very well. But after prolonged training, the network starts to recognize individual input/output pairs rather than settling for weights that generally describe the mapping for the whole training set (Fausett, 1994).

5.1 Choosing the number of neurons

The number of hidden neurons affects how well the network is able to separate the data. A large number of hidden neurons will ensure correct learning, and the network is able to correctly predict the data it has been trained on, but its performance on new data, that is, its ability to generalize, is compromised. With too few hidden neurons, the network may be unable to learn the relationships amongst the data and the error will fail to fall below an acceptable level. Thus, selection of the number of hidden neurons is a crucial decision.
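The batch-training mode and the momentum update of equation (4) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: a network with one input, a few tanh hidden units and one linear output, the error E taken as half the mean squared output error, a toy sine-fitting task, and the names `train_batch`, `params`, and `prev_delta`. It is not the chapter's own implementation.

```python
import math
import random


def train_batch(samples, n_hidden=4, eta=0.05, alpha=0.1, epochs=200, seed=0):
    """Batch-mode backpropagation with momentum for a 1-H-1 network.

    Assumed setup: tanh hidden units, a linear output unit, and
    E = (1 / 2N) * sum of squared output errors over the N samples.
    Returns the final parameters and the value of E at each epoch.
    """
    rng = random.Random(seed)
    H = n_hidden
    # Flat layout: [hidden weights | hidden biases | output weights | output bias]
    params = [rng.uniform(-0.5, 0.5) for _ in range(3 * H + 1)]
    prev_delta = [0.0] * len(params)  # previous weight changes, Delta w(n - 1)
    errors = []

    for _ in range(epochs):
        grad = [0.0] * len(params)  # dE/dw accumulated over the whole epoch
        sq_err = 0.0
        for x, d in samples:
            # Forward pass through hidden and output layers
            h = [math.tanh(params[j] * x + params[H + j]) for j in range(H)]
            y = sum(params[2 * H + j] * h[j] for j in range(H)) + params[3 * H]
            e = y - d
            sq_err += e * e
            # Backward pass: output-layer gradients first, then hidden-layer ones
            grad[3 * H] += e
            for j in range(H):
                grad[2 * H + j] += e * h[j]
                back = e * params[2 * H + j] * (1.0 - h[j] ** 2)
                grad[j] += back * x
                grad[H + j] += back
        n = len(samples)
        errors.append(0.5 * sq_err / n)
        # Equation (4): a single update per epoch, after all samples are seen
        for k in range(len(params)):
            delta = -eta * grad[k] / n + alpha * prev_delta[k]
            params[k] += delta
            prev_delta[k] = delta
    return params, errors


# Toy usage: fit sin(x) on [-1, 1] (illustrative data, not from the chapter)
samples = [(i / 10.0, math.sin(i / 10.0)) for i in range(-10, 11)]
params, errors = train_batch(samples)
print(f"E at first epoch: {errors[0]:.4f}, E at last epoch: {errors[-1]:.4f}")
```

Moving the update loop inside the per-sample loop (and dropping the division by N) would turn this into the online mode, at the cost of one weight update per sample instead of one per epoch.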
5.2 Choosing the initial weights

The learning algorithm uses a steepest descent technique, which rolls straight downhill in weight space until the first valley is reached. This makes the choice of the initial starting point in the multidimensional weight space critical. However, there are no recommended rules for this selection except trying several different starting weight values to see if the network results are improved.

5.3 Choosing the learning rate

The learning rate effectively controls the size of the step that is taken in multidimensional weight space when each weight is modified. If the selected learning rate is too large, the local minimum may be overstepped constantly, resulting in oscillations and slow convergence to the lower error state. If the learning rate is too low, the number of iterations required may be too large, resulting in slow performance.

6 HIGHER ORDER LEARNING ALGORITHMS

Backpropagation (BP) often gets stuck at a local minimum, mainly because of the random initialization of weights. For some initial weight settings, BP may not be able to reach a global minimum of weight space, while for other initializations the same network is able to reach an optimal minimum.

A long recognized bane of analysis of the error surface and the performance of training algorithms is the presence of multiple stationary points, including multiple minima.

Empirical experience with training algorithms shows that different initializations of weights yield different resulting networks. Hence, multiple minima not only exist, but there may be huge numbers of them.

In practice, there are four types of optimization algorithms that are used to optimize the weights. The first three methods, gradient descent, conjugate gradients, and quasi-Newton, are general optimization methods whose operation can be understood in the context of minimization of a quadratic error function.

Although the error surface is surely not quadratic, for differentiable node functions it will be so in a sufficiently small neighborhood of a local minimum, and such an analysis provides information about the behavior of the training algorithm over the span of a few iterations and also as it approaches its goal.

The fourth method, that of Levenberg and Marquardt, is specifically adapted to the minimization of an error function that arises from a squared error criterion of the form we are assuming. A common feature of these training algorithms is the requirement of repeated efficient calculation of gradients. The reader can refer to Bishop (1995) for an extensive coverage of higher-order learning algorithms.

Even though artificial neural networks are capable of performing a wide variety of tasks, in practice they sometimes deliver only marginal performance. Inappropriate selection of topology and learning algorithm is frequently blamed. There is little reason to expect that one can find a uniformly best algorithm for selecting the weights in a feed-forward artificial neural network. This is in accordance with the no free lunch theorem, which explains that for any algorithm, any elevated performance over one class of problems is exactly paid for in performance over another class (Macready and Wolpert, 1997).

The design of artificial neural networks using evolutionary algorithms has been widely explored. Evolutionary algorithms are used to adapt the connection weights, network architecture, and so on, according to the problem environment.

A distinct feature of evolutionary neural networks is their adaptability to a dynamic environment. In other words, such neural networks can adapt to an environment as well as to changes in the environment. The two forms of adaptation, evolution and learning, make the adaptation of evolutionary artificial neural networks to a dynamic environment much more effective and efficient than the conventional learning approach. Refer to Abraham (2004) for more technical information related to evolutionary design of neural networks.

7 DESIGNING ARTIFICIAL NEURAL NETWORKS

To illustrate the design of artificial neural networks, the Mackey-Glass chaotic time series (Box and Jenkins, 1970) benchmark is used. The performance of the designed neural network is evaluated for different architectures and activation functions. The Mackey-Glass differential equation produces a chaotic time series for some values of the parameters x(0) and τ:

$$\frac{dx(t)}{dt} = \frac{0.2\,x(t-\tau)}{1 + x^{10}(t-\tau)} - 0.1\,x(t) \qquad (5)$$

We used the values x(t − 18), x(t − 12), x(t − 6), x(t) to predict x(t + 6). The fourth-order Runge-Kutta method was used to generate 1000 data points. The time step used in the method is 0.1 and the initial conditions were x(0) = 1.2, τ = 17, x(t) = 0 for t < 0. The first 500 data sets were used for training and the remaining data for testing.

Table 1. Training and test performance for Mackey-Glass Series for different architectures.
Columns: Hidden neurons; Root mean-squared error.

Table 2. Mackey-Glass time series: training and generalization performance for different activation functions.
Columns: Activation function; Root mean-squared error.

[Figure 5 plot labels: Hidden neurons 20, 18, 16, 14 paired with 0.89, 0.8, 0.71, 0.62.]
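The data generation described for the Mackey-Glass benchmark can be sketched as follows. The equation, the step of 0.1, x(0) = 1.2, τ = 17, the zero history for t < 0, the lags 18, 12, 6, 0, the horizon 6, and the 500-pattern training split come from the text. Two things are assumptions: the series is sampled once per unit of time (every tenth integration step), and the delayed term is held constant within each Runge-Kutta step, a common simplification for delay equations. The function names `mackey_glass` and `make_patterns` are hypothetical.

```python
def mackey_glass(n_points=1000, tau=17.0, x0=1.2, dt=0.1, sample_every=10):
    """Integrate equation (5) with fourth-order Runge-Kutta steps of size dt.

    Assumptions: x(t) = 0 for t < 0, the delayed value x(t - tau) is held
    fixed during each step, and one point is kept every `sample_every` steps.
    """
    delay_steps = int(round(tau / dt))
    history = [x0]  # x at times 0, dt, 2*dt, ...

    def rhs(x, x_delayed):
        return 0.2 * x_delayed / (1.0 + x_delayed ** 10) - 0.1 * x

    total_steps = (n_points - 1) * sample_every
    for step in range(total_steps):
        x = history[-1]
        idx = step - delay_steps
        x_tau = history[idx] if idx >= 0 else 0.0  # zero history before t = 0
        k1 = rhs(x, x_tau)
        k2 = rhs(x + 0.5 * dt * k1, x_tau)
        k3 = rhs(x + 0.5 * dt * k2, x_tau)
        k4 = rhs(x + dt * k3, x_tau)
        history.append(x + dt * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0)
    return history[::sample_every]


def make_patterns(series, lags=(18, 12, 6, 0), horizon=6):
    """Build (inputs, target) pairs: x(t-18), x(t-12), x(t-6), x(t) -> x(t+6)."""
    max_lag = max(lags)
    inputs, targets = [], []
    for t in range(max_lag, len(series) - horizon):
        inputs.append([series[t - lag] for lag in lags])
        targets.append(series[t + horizon])
    return inputs, targets


series = mackey_glass()
inputs, targets = make_patterns(series)
train_x, train_y = inputs[:500], targets[:500]  # first 500 patterns for training
test_x, test_y = inputs[500:], targets[500:]    # remaining patterns for testing
print(len(series), len(train_x), len(test_x))
```

Sampling at a different interval, or interpolating the delayed term inside each step, would give a slightly different series, so results obtained with this sketch are not expected to reproduce the chapter's tables exactly.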
A feed-forward neural network with four input neurons, one hidden layer and one output neuron is used. Weights were randomly initialized and the learning rate and momentum were set at 0.05 and 0.1 respectively. The number of hidden neurons is varied (14, 16, 18, 20, 24) and the generalization performance is reported in Table 1. All networks were trained for an identical number of stochastic updates (2500 epochs).

7.2 Role of activation functions

The effect of two different node activation functions in the hidden layer (log-sigmoidal activation function, LSAF, and tanh-sigmoidal activation function, TSAF), keeping 24 hidden neurons for the backpropagation learning algorithm, is illustrated in Figure 4. Table 2 summarizes the empirical results for training and generalization for the two node transfer functions. The generalization looks better with TSAF.

Figure 5 illustrates the computational complexity in billion flops for different numbers of hidden neurons. At present, neural network design relies heavily on human experts who have sufficient knowledge about the different aspects of the network and the problem domain. As the complexity of the problem domain increases, manual design becomes more difficult.

Figure 5. Computational complexity for different architectures.

8 SELF-ORGANIZING FEATURE MAP AND RADIAL BASIS FUNCTION NETWORK

8.1 Self-organizing feature map
[...] algorithm is that it allows neurons that are neighbors to the winning neuron to produce output values. Thus, the transition of output vectors is much smoother than that obtained with competitive layers, where only one neuron has an output at a time.

The problem that data visualization attempts to solve is that humans simply cannot visualize high-dimensional data. The SOFM reduces dimensions by producing a map of usually 1 or 2 dimensions, which plots the similarities of the data by grouping similar data items together (data clustering). In this process, the SOFM accomplishes two things: it reduces dimensions and displays similarities.

It is important to note that while a self-organizing map does not take long to organize itself so that neighboring neurons recognize similar inputs, it can take a long time for the map to finally arrange itself according to the distribution of input vectors.

8.2 Radial basis function network

The Radial Basis Function (RBF) network is a three-layer feed-forward network that uses a linear transfer function for the output units and a nonlinear transfer function (normally the Gaussian) for the hidden layer neurons (Chen, Cowan and Grant, 1991). Radial basis networks may require more neurons than standard feed-forward backpropagation networks, but they can often be designed in less time. They perform well when many training data are available.

Much of the inspiration for RBF networks has come from traditional statistical pattern classification techniques. The input layer is simply a fan-out layer and does no processing. The second or hidden layer performs a nonlinear mapping from the input space into a (usually) higher dimensional space; its activation function is selected from a class of functions called basis functions.

The final layer performs a simple weighted sum with a linear output. Contrary to BP networks, the weights of the hidden layer basis units (input to hidden layer) are set using some clustering technique. The idea is that the patterns in the input space form clusters. If the centers of these clusters are known, then the Euclidean distance from the cluster center can be measured. As the input data moves away from the connection weights, the activation value reduces. This distance measure is made nonlinear in such a way that input data close to a cluster center gets a value close to 1. Once the hidden layer weights are set, a second phase of training (usually backpropagation) is used to adjust the output weights.

9 RECURRENT NEURAL NETWORKS AND ADAPTIVE RESONANCE THEORY

9.1 Recurrent neural networks

Recurrent networks are the state of the art in nonlinear time series prediction, system identification, and temporal pattern classification. As the output of the network at time t is used along with a new input to compute the output of the network at time t + 1, the response of the network is dynamic (Mandic and Chambers, 2001).

Time Lag Recurrent Networks (TLRN) are multilayered perceptrons extended with short-term memory structures that have local recurrent connections. The recurrent neural network is a very appropriate model for processing temporal (time-varying) information.

Examples of temporal problems include time-series prediction, system identification, and temporal pattern recognition. A simple recurrent neural network can be constructed by modifying the multilayered feed-forward network with the addition of a 'context layer', which retains information between observations. At each time step, new inputs are fed to the network. The previous contents of the hidden layer are passed into the context layer. These then feed back into the hidden layer in the next time step. Initially, the context layer contains nothing, so the output from the hidden layer after the first input to the network will be the same as if there were no context layer. Weights are calculated in the same way for the new connections between the context layer and the hidden layer.

The training algorithm used in TLRN (backpropagation through time) is more advanced than the standard backpropagation algorithm. Very often, a TLRN requires a smaller network to learn temporal problems than an MLP that uses extra inputs to represent the past samples. TLRN is biologically more plausible and computationally more powerful than other adaptive models such as the hidden Markov model.

Some popular recurrent network architectures are the Elman recurrent network, in which the hidden unit activation values are fed back to an extra set of input units, and the Jordan recurrent network, in which output values are fed back into hidden units.

9.2 Adaptive resonance theory

Adaptive Resonance Theory (ART) was initially introduced by Grossberg (1976) as a theory of human information processing. ART neural networks are extensively used for
supervised and unsupervised classification tasks and function approximation.

There exist many different variations of ART networks today (Carpenter and Grossberg, 1998). For example, ART1 performs unsupervised learning for binary input patterns, ART2 is modified to handle both analog and binary input patterns, and ART3 performs parallel searches of distributed recognition codes in a multilevel network hierarchy. Fuzzy ARTMAP represents a synthesis of elements from neural networks, expert systems, and fuzzy logic.

10 SUMMARY

This section presented the biological motivation and fundamental aspects of modeling artificial neural networks. Performance of feed-forward artificial neural networks for a function approximation problem is demonstrated. Advantages of some specific neural network architectures and learning algorithms are also discussed.

REFERENCES

Abraham, A. (2004) Meta-Learning Evolutionary Artificial Neural Networks, Neurocomputing Journal, Vol. 56c, Elsevier Science, Netherlands, (1–38).

Bishop, C.M. (1995) Neural Networks for Pattern Recognition, Oxford University Press, Oxford, UK.

Box, G.E.P. and Jenkins, G.M. (1970) Time Series Analysis, Forecasting and Control, Holden Day, San Francisco, CA.

Carpenter, G. and Grossberg, S. (1998) in Adaptive Resonance Theory (ART), The Handbook of Brain Theory and Neural Networks, (ed. M.A. Arbib), MIT Press, Cambridge, MA, (pp. 79–82).

Chen, S., Cowan, C.F.N. and Grant, P.M. (1991) Orthogonal Least Squares Learning Algorithm for Radial Basis Function Networks. IEEE Transactions on Neural Networks, 2(2), 302–309.

Fausett, L. (1994) Fundamentals of Neural Networks, Prentice Hall, USA.

Grossberg, S. (1976) Adaptive Pattern Classification and Universal Recoding: Parallel Development and Coding of Neural Feature Detectors. Biological Cybernetics, 23, 121–134.

Hebb, D.O. (1949) The Organization of Behavior, John Wiley, New York.

Kohonen, T. (1988) Self-Organization and Associative Memory, Springer-Verlag, New York.

Macready, W.G. and Wolpert, D.H. (1997) The No Free Lunch Theorems. IEEE Transactions on Evolutionary Computing, 1(1), 67–82.

Mandic, D. and Chambers, J. (2001) Recurrent Neural Networks for Prediction: Learning Algorithms, Architectures and Stability, John Wiley & Sons, New York.

McCulloch, W.S. and Pitts, W.H. (1943) A Logical Calculus of the Ideas Immanent in Nervous Activity. Bulletin of Mathematical Biophysics, 5, 115–133.