$\mu_1$ and $\mu_2$ are inversely related to the size of the curvature along each axis. Using the above
learning rate matrix has the effect of scaling the gradient differently along each axis to make the
surface "look" spherical.
If the axes are not aligned with the coordinate axes, then we need a full matrix of learning rates.
This matrix is just the inverse Hessian. In general, $H^{-1}$ is not diagonal. We can obtain the
curvature along each axis, however, by computing the eigenvalues of H. Anyone remember
what eigenvalues are??
Taking a Step Back
We have been spending a lot of time on some pretty tough math. Why? Because training a
network can take a long time if you just blindly apply the basic algorithms. There are
techniques that can improve the rate of convergence by orders of magnitude. However,
understanding these techniques requires a deep understanding of the underlying characteristics
of the problem (i.e. the mathematics). Knowing which speed-up techniques to apply can make
the difference between a net that takes 100 iterations to train and one that takes 10000
(assuming it trains at all).
The previous slides are trying to make the following point for linear networks (i.e. those
networks whose cost function is a quadratic function of the weights):
1. The shape of the cost surface has a significant effect on how fast a net can learn.
Ideally, we want a spherically symmetric surface.
2. The correlation matrix is defined as the average over all inputs of $xx^T$.
3. The Hessian is the second derivative of E with respect to w.
For linear nets, the Hessian is the same as the correlation matrix.
4. The Hessian tells you about the shape of the cost surface.
5. The eigenvalues of H are a measure of the steepness of the surface along the curvature
directions.
6. A large eigenvalue => steep curvature => need a small learning rate.
7. The learning rate should be proportional to 1/eigenvalue.
8. If we are forced to use a single learning rate for all weights, then we must use a learning
rate that will not cause divergence along the steep directions (large eigenvalue
directions). Thus, we must choose a learning rate that is on the order of $1/\lambda_{max}$,
where $\lambda_{max}$ is the largest eigenvalue.
9. If we can use a matrix of learning rates, this matrix is proportional to $H^{-1}$.
10. For real problems (i.e. nonlinear), you don't know the eigenvalues so you just have to
guess. Of course, there are algorithms that will estimate $\lambda_{max}$. We won't be
considering these here.
11. An alternative solution to speeding up learning is to transform the inputs (that is, x ->
Px, for some transformation matrix P) so that the resulting correlation matrix, the average of
$(Px)(Px)^T$, is equal to the identity.
12. The above suggestions are only really true for linear networks. However, the cost
surface of nonlinear networks can be modeled as a quadratic in the vicinity of the
current weight. We can then apply similar techniques, but they will
only be approximations.
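Points 6 to 9 can be checked numerically. The sketch below is only an illustration: it assumes a made-up quadratic cost E = 0.5*(lam1*w1^2 + lam2*w2^2), whose Hessian is diagonal with eigenvalues 1 and 10. For such a cost each update multiplies w_i by (1 - rate*lam_i), so a shared rate must stay below 2/lam_max to avoid divergence.

```python
# Gradient descent on E(w) = 0.5 * (lam1*w1^2 + lam2*w2^2), whose Hessian is
# diag(lam1, lam2). The eigenvalues 1 and 10 are made-up illustration values.

def descend(rates, eigenvalues, steps, start=(1.0, 1.0)):
    """Run `steps` updates w_i <- w_i - rate_i * dE/dw_i and return the final w."""
    w = list(start)
    for _ in range(steps):
        grad = [lam * wi for lam, wi in zip(eigenvalues, w)]
        w = [wi - r * g for wi, r, g in zip(w, rates, grad)]
    return w

eigenvalues = (1.0, 10.0)
lam_max = max(eigenvalues)

# One shared rate of 1/lam_max: safe, but slow along the shallow axis.
safe = descend((1 / lam_max, 1 / lam_max), eigenvalues, steps=20)

# One shared rate above 2/lam_max: the steep axis diverges.
unstable = descend((2.5 / lam_max, 2.5 / lam_max), eigenvalues, steps=20)

# A separate rate 1/lam_i per axis (the diagonal of H^-1): one step suffices.
per_axis = descend(tuple(1 / lam for lam in eigenvalues), eigenvalues, steps=1)

print(safe, unstable, per_axis)
```

With the shared safe rate the steep axis is solved immediately while the shallow axis is still far from zero after 20 steps, which is the slowdown a non-spherical surface causes.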
3.2 Summary of Linear Nets
Characteristics of Networks
- number of layers
- number of nodes per layer
- activation function (linear, binary, softmax)
- error function (mean squared error (MSE), cross entropy)
- type of learning algorithms (gradient descent, perceptron, delta rule)
Types of Applications and Associated Nets
- Regression:
o uses a one-layer linear network (activation function is identity)
o uses MSE cost function
o uses gradient descent learning
- Classification - Perceptron Learning
o uses a one-layer network with a binary step activation function
o uses MSE cost function
o uses the perceptron learning algorithm (identical with gradient descent when
targets are +1 and -1)
- Classification - Delta Rule
o uses a one-layer network with a linear activation function
o uses MSE cost function
o uses gradient descent
o the network chooses the class by picking the output node with the largest output
- Classification - Gradient Descent (the right way)
o uses a one-layer network with a softmax activation function
o uses the cross entropy error function
o outputs are interpreted as probabilities
o the network chooses the class with the highest probability
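For the last case (softmax outputs with cross entropy), a minimal sketch of the output stage might look like the following. The net inputs are made-up numbers and the function names are only illustrative.

```python
import math

def softmax(net):
    """Turn a list of net inputs into probabilities that sum to 1."""
    shift = max(net)                      # subtract the max for numerical stability
    exps = [math.exp(n - shift) for n in net]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target_class):
    """Cross entropy error for a single pattern whose correct class is known."""
    return -math.log(probs[target_class])

net = [2.0, 1.0, 0.1]                    # made-up net inputs for three classes
probs = softmax(net)
predicted = probs.index(max(probs))      # choose the class with the highest probability
error = cross_entropy(probs, target_class=0)
```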
Modes of Learning for Gradient Descent
- Batch
o At each iteration, the gradient is computed by averaging over all inputs
- Online (stochastic)
o At each iteration, the gradient is estimated by picking one (or a small number) of
inputs.
o Because the gradient is only being estimated, there is a lot of noise in the weight
updates. The error comes down quickly but then tends to jiggle around. To
remove this noise one can either switch to batch at the point where the error levels out,
or continue to use online but decrease the learning rate (called
annealing the learning rate). One way of annealing is to use $\mu = \mu_0/t$, where $\mu_0$ is the
original learning rate and t is the number of timesteps after annealing is turned
on.
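To see what annealing buys you, here is a small sketch under simplifying assumptions: a single weight w, the per-sample cost E = 0.5*(w - x)^2, and annealing switched on from the very first step. With mu0 = 1 the annealed update reduces exactly to a running average of the samples, so the noise averages out instead of jiggling the weight around.

```python
import random

def online_mean(samples, mu0, anneal):
    """Online gradient descent on E = 0.5*(w - x)^2, one sample per update."""
    w = 0.0
    for t, x in enumerate(samples, start=1):
        mu = mu0 / t if anneal else mu0   # annealed rate: mu = mu0 / t
        w -= mu * (w - x)                 # dE/dw = w - x for the current sample
    return w

random.seed(0)
samples = [3.0 + random.gauss(0.0, 1.0) for _ in range(2000)]  # noisy targets around 3

w_fixed = online_mean(samples, mu0=0.5, anneal=False)    # keeps following the noise
w_annealed = online_mean(samples, mu0=1.0, anneal=True)  # settles on the sample mean
```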
Picking Learning Rates
- Learning rates that are too big cause the algorithm to diverge
- Learning rates that are too small cause the algorithm to converge very slowly.
- The optimal learning rate for linear networks is the inverse Hessian, $H^{-1}$, where H is the
Hessian, defined as the second derivative of the cost function with respect to the weights.
Unfortunately, H is a matrix whose inverse can be costly to compute.
- The best learning rate for batch is the inverse Hessian.
- More details if you are interested:
o The next best thing is to use a separate learning rate for each weight. If the
Hessian is diagonal these learning rates are just one over the eigenvalues of the
Hessian. Fat chance that the Hessian is diagonal though!
o If using a single scalar learning rate then the best one to use is 1 over the largest
eigenvalue of the Hessian. There are fairly inexpensive algorithms for estimating
this. However, many people just use the ol' brute force method of picking the
learning rate - trial and error.
o For linear networks the Hessian is $\langle xx^T \rangle$ and is independent of the weights. For
nonlinear networks (i.e. any network that has an activation function that isn't the
identity), the Hessian depends on the value of the weights and so changes
every time the weights are updated - arrgh! That is why people love the trial and
error approach.
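One inexpensive way to estimate the largest eigenvalue is power iteration. The notes do not say which algorithm they have in mind, so treat this as one possible choice rather than the intended one. The sketch builds the Hessian of a linear net as the average of $xx^T$ over some made-up inputs and then derives a single learning rate from it.

```python
import math

def correlation_matrix(inputs):
    """Average of x x^T over all input vectors (the Hessian of a linear net)."""
    n = len(inputs[0])
    H = [[0.0] * n for _ in range(n)]
    for x in inputs:
        for i in range(n):
            for j in range(n):
                H[i][j] += x[i] * x[j] / len(inputs)
    return H

def largest_eigenvalue(H, iterations=100):
    """Estimate the largest eigenvalue of a symmetric matrix by power iteration."""
    n = len(H)
    v = [1.0 / math.sqrt(n)] * n                  # unit-length starting vector
    lam = 0.0
    for _ in range(iterations):
        Hv = [sum(hij * vj for hij, vj in zip(row, v)) for row in H]
        lam = math.sqrt(sum(c * c for c in Hv))   # length of Hv, with v of length 1
        v = [c / lam for c in Hv]                 # renormalize for the next round
    return lam

inputs = [(1.0, 0.0), (-1.0, 0.0), (0.0, 3.0), (0.0, -3.0)]  # made-up data
H = correlation_matrix(inputs)          # diagonal here, with entries 0.5 and 4.5
lam_max = largest_eigenvalue(H)
learning_rate = 1.0 / lam_max
```

Each iteration costs one matrix-vector product, which is far cheaper than inverting H.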
Limitations of Linear Networks
- For regression, we can only fit a straight line through the data points. Many problems
are not linear.
- For classification, we can only lay down linear boundaries between classes. This is
often inadequate for real world problems.
Where do we go next - Multilayer Nonlinear Networks!!!
Chapter 4 - The Backprop Toolbox
4.1 Multilayer Networks and Backpropagation
Introduction
Much early research in networks was abandoned because of the severe limitations of single
layer linear networks. Multilayer networks were not "discovered" until much later but even
then there were no good training algorithms. It was not until the 1980s that backpropagation
became widely known.
People in the field joke about this because backprop is really just applying the chain rule to
compute the gradient of the cost function. How many years should it take to rediscover the
chain rule?? Of course, it isn't really this simple. Backprop also refers to the very efficient
method that was discovered for computing the gradient.
Note: Multilayer nets are much harder to train than single layer networks. That is,
convergence is much slower and speed-up techniques are more complicated.
Method of Training: Backpropagation
Define a cost function (e.g. mean square error)
where the activation y at the output layer is given by
and where
- z is the activation at the hidden nodes
- f2 is the activation function at the output nodes
- f1 is the activation function at the hidden nodes.
Written out more explicitly, the cost function is
or all at once:
Computing the gradient: for the hidden-to-output weights:
the gradient: for the input-to-hidden weights:
Summary of Gradients
hidden-to-output weights:
where
input-to-hidden:
where
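As a sketch, the quantities named in this section can be written out for a single pattern as follows. This assumes a squared error cost, input index n, hidden index i and output index j. The index names and the sign convention for the $\delta$'s are assumptions chosen for illustration.

```latex
\begin{align*}
E &= \tfrac{1}{2} \sum_j (t_j - y_j)^2, &
y_j &= f_2\Big(\sum_i W_{ji}\, z_i\Big), &
z_i &= f_1\Big(\sum_n w_{in}\, x_n\Big) \\
\frac{\partial E}{\partial W_{ji}} &= -\,\delta_{2,j}\, z_i, &
\delta_{2,j} &= (t_j - y_j)\, f_2'(net_j) \\
\frac{\partial E}{\partial w_{in}} &= -\,\delta_{1,i}\, x_n, &
\delta_{1,i} &= f_1'(net_i) \sum_j W_{ji}\, \delta_{2,j}
\end{align*}
```

Here $net_j$ and $net_i$ are the arguments of $f_2$ and $f_1$. The hidden-layer $\delta_1$ reuses the output-layer $\delta_2$, which is what makes the computation efficient.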
Implementing Backprop
Create variables for:
- the weights W and w,
- the net input to each hidden and output node, $net_i$
- the activation of each hidden and output node, $y_i = f(net_i)$
- the "error" at each node, $\delta_i$
For each input pattern k:
Step 1: Forward Propagation
Compute $net_i$ and $y_i$ for each hidden node, i=1,..., h:
Compute $net_j$ and $y_j$ for each output node, j=1,...,m:
Step 2: Backward Propagation
Compute the $\delta_2$'s for each output node, j=1,...,m:
Compute the $\delta_1$'s for each hidden node, i=1,...,h:
Step 3: Accumulate gradients over the input patterns (batch)
Step 4: After doing steps 1 to 3 for all patterns, we can now update the weights:
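The four steps can be sketched in plain Python as below. This is a minimal illustration, not a reference implementation. It assumes tanh hidden nodes, linear output nodes, a squared error cost, and no separate bias weights (a constant input of 1 stands in for the bias). The data, network size, and learning rate are made up.

```python
import math
import random

def forward(x, w, W):
    """Step 1: hidden activations z (tanh) and outputs y (linear)."""
    z = [math.tanh(sum(wi * xn for wi, xn in zip(row, x))) for row in w]
    y = [sum(Wj * zi for Wj, zi in zip(row, z)) for row in W]
    return z, y

def batch_gradients(patterns, w, W):
    """Steps 1-3: accumulate dE/dw and dE/dW over all (x, t) patterns."""
    gw = [[0.0] * len(row) for row in w]
    gW = [[0.0] * len(row) for row in W]
    for x, t in patterns:
        z, y = forward(x, w, W)
        # Step 2: deltas. Output units are linear, so f2' = 1.
        d2 = [tj - yj for tj, yj in zip(t, y)]
        d1 = [(1 - z[i] ** 2) * sum(W[j][i] * d2[j] for j in range(len(W)))
              for i in range(len(w))]
        # Step 3: accumulate (dE/dW_ji = -d2_j * z_i, dE/dw_in = -d1_i * x_n).
        for j in range(len(W)):
            for i in range(len(w)):
                gW[j][i] -= d2[j] * z[i]
        for i in range(len(w)):
            for n in range(len(x)):
                gw[i][n] -= d1[i] * x[n]
    return gw, gW

def total_error(patterns, w, W):
    """Sum over patterns of 0.5 * squared output error."""
    return sum(0.5 * sum((tj - yj) ** 2 for tj, yj in zip(t, forward(x, w, W)[1]))
               for x, t in patterns)

def train(patterns, w, W, rate, epochs):
    """Step 4, repeated: one weight update per pass through all patterns."""
    for _ in range(epochs):
        gw, gW = batch_gradients(patterns, w, W)
        for i, row in enumerate(w):
            for n in range(len(row)):
                row[n] -= rate * gw[i][n]
        for j, row in enumerate(W):
            for i in range(len(row)):
                row[i] -= rate * gW[j][i]

random.seed(1)
# Made-up data: the last input is a constant 1 that plays the role of a bias.
patterns = [((x, 1.0), (math.sin(x),)) for x in (-2.0, -1.0, 0.0, 1.0, 2.0)]
w = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(3)]   # 3 hidden nodes
W = [[random.uniform(-1, 1) for _ in range(3)]]                     # 1 output node
before = total_error(patterns, w, W)
train(patterns, w, W, rate=0.02, epochs=500)
after = total_error(patterns, w, W)
```

A quick sanity check for any backprop code is to compare one entry of the computed gradient with a finite-difference estimate of the same derivative.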
Networks with more than 2 layers
The above learning procedure (backpropagation) can easily be extended to networks with any
number of layers.
4.2 Online vs Batch for Non-Linear Networks
Making a Lot of Noise
Disadvantage of Noise in Online Updates
We have seen that online can often be much faster than batch early in the training process.
However, the noise in the updates causes the network to bounce around near the minimum
and never converge to the very bottom.
Solution:
The Advantage of Noise
In linear networks the cost function is in the nice shape of a bowl. There is a single minimum.
In nonlinear networks, however, the cost surface can be very complex. There can be many
minima, valleys, and plateaus, which make training very difficult. Batch gradient descent will
simply move to the bottom of the local minimum it randomly starts in. If it is on a plateau, the
gradient may be very small and so learning takes a very long time.
Valleys are common when using sigmoids. Consider what happens when sigmoids are added.
Below, the green sigmoid is added to the blue to obtain the red.
Now, look what can happen in 2 dimensions. We obtain a valley that can be difficult to escape
from:
The noise in online makes it possible to escape from local minima and plateaus. It can help
somewhat with valleys as well.
Too Much of a Good Thing: OverTraining
The good news is that multilayer networks can approximate any smooth function as long as
you have enough hidden nodes. The bad news is that this added flexibility can cause the
network to learn the noise in the data. Consider regression and classification problems where
you have a collection of noisy data. The solid line is the "true" function or class boundary and
the +'s and o's are the data:
If you have lots of hidden nodes you may find that the network "discovers" the function
(dotted lines) given below:
In the above example, the network has not only learned the function but it has also learned the
noise present in the data. When the net has learned the noise, we say it has overtrained. The
reason for this name is that as a net trains it first learns the rough structure of the data. As it
continues to learn, it will pick up the details (i.e. the noise).
Generalization
Why is overtraining a problem? The whole purpose of training these nets is to be able to
predict the function output (regression) or class (classification) for inputs that the net has
never seen before (i.e. was not trained on).
A network is said to generalize well if it can accurately predict the correct output on data it
has never seen.
Preventing Overtraining
There are several ways to prevent overtraining:
- Training for less time. The method for doing this is called early stopping.
- Reducing the number of hidden nodes. This reduces the number of parameters (weights)
so that the net is not able to learn as much detail. Problems are
* what is the right number of nodes?
* there is reason to believe that better solutions can be found with too
many hidden nodes rather than too few.
- Often it is better to start with a big net, train, and then carefully prune the net so that it
is smaller (one version of pruning is called optimal brain damage).
- Instead of reducing the number of weights, people put constraints on the
weights so that there are effectively fewer parameters. One example of this is weight
decay.
Weight decay pushes the weights toward zero. Note that this corresponds to the linear region
of the sigmoid.
4.3 Momentum
We saw that if the cost surface is not spherical, learning can be quite slow because the
learning rate must be kept small to prevent divergence along the steep curvature directions.
One way to solve this is to use the inverse Hessian (the Hessian is the correlation matrix for
linear nets) as the learning rate matrix. This can be problematic because the Hessian can be a
large matrix that is difficult to invert. Also, for multilayer networks, the Hessian is not
constant (i.e. it changes as the weights change). Recomputing the inverse Hessian at each
iteration would be prohibitively expensive and not worth the extra computation. However, a much
simpler approach is to add a momentum term:
$$w(t+1) = w(t) - \mu \frac{\partial E}{\partial w} + \beta\,(w(t) - w(t-1))$$
where w(t) is the weight at the tth iteration. Written another way,
$$\Delta w(t+1) = -\mu \frac{\partial E}{\partial w} + \beta\,\Delta w(t)$$
where $\Delta w(t) = w(t) - w(t-1)$. Thus, the amount you change the weight is proportional to the
negative gradient plus the previous weight change.
$\beta$ is called the momentum parameter and must satisfy $0 \le \beta < 1$.
Momentum Example
Consider the oscillatory behavior shown above. The gradient changes sign at each step. By
adding in a small amount of the previous weight change, we can lessen the oscillations.
Suppose $\mu = .8$, w(0) = 10, and $E = w^2$, so that $w_{min} = 0$ and dE/dw = 2w.
No Momentum ($\beta = 0$):
t = 0: $\Delta w(1) = -\mu\,dE/dw$ = -.8 (20) = -16, w(1) = 10 - 16 = -6
t = 1: $\Delta w(2) = -\mu\,dE/dw$ = -.8 (-12) = 9.6, w(2) = -6 + 9.6 = 3.6
t = 2: $\Delta w(3) = -\mu\,dE/dw$ = -.8 (7.2) = -5.76, w(3) = 3.6 - 5.76 = -2.16
With Momentum ($\beta = .1$):
t = 0: $\Delta w(1) = -\mu\,dE/dw + \beta\,\Delta w(0)$ = -.8 (20) + .1 (0) = -16, w(1) = 10 - 16 = -6
t = 1: $\Delta w(2) = -\mu\,dE/dw + \beta\,\Delta w(1)$ = -.8 (-12) + .1 (-16) = 8, w(2) = -6 + 8 = 2
t = 2: $\Delta w(3) = -\mu\,dE/dw + \beta\,\Delta w(2)$ = -.8 (4) + .1 (8) = -2.4, w(3) = 2 - 2.4 = -.4
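The arithmetic in this example is easy to reproduce. The short sketch below assumes the same cost E = w^2, the same learning rate of 0.8 and the same starting weight of 10. The function name `run` is just illustrative.

```python
def run(mu, beta, w0, steps):
    """Gradient descent with momentum on E = w^2 (so dE/dw = 2w)."""
    w, dw = w0, 0.0            # dw holds the previous weight change, Delta w(0) = 0
    path = [w]
    for _ in range(steps):
        dw = -mu * 2 * w + beta * dw   # new change = -mu * gradient + beta * old change
        w += dw
        path.append(w)
    return path

plain = run(mu=0.8, beta=0.0, w0=10.0, steps=3)      # about 10, -6, 3.6, -2.16
momentum = run(mu=0.8, beta=0.1, w0=10.0, steps=3)   # about 10, -6, 2, -0.4
```

After three steps the weight with momentum is already much closer to the minimum at 0, because each overshoot is partly cancelled by the previous step in the opposite direction.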
4.4 Delta-Bar-Delta (Jacobs)
Since the cost surface for multi-layer networks can be complex, choosing a learning rate can
be difficult. What works in one location of the cost surface may not work well in another
location. Delta-Bar-Delta is a heuristic algorithm for modifying the learning rate as training
progresses:
- Each weight has its own learning rate.
- For each weight: the gradient at the current timestep is compared with the gradient at
the previous step (actually, previous gradients are averaged)
- If the gradient is in the same direction the learning rate is increased
- If the gradient is in the opposite direction the learning rate is decreased
- Should be used with batch only.
Let $g_{ij}(t)$ = gradient of E wrt $w_{ij}$ at time t
then define
Then the learning rate $\mu_{ij}$ for weight $w_{ij}$ at time t+1 is given by
where the parameters (among them $\beta$ and k) are chosen by hand.
Downsides:
- Knowing how to choose these parameters is not easy.
- Doesn't work for online.
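A sketch of the rule in code may help. It follows the commonly stated form of Jacobs' delta-bar-delta: increase the rate additively when the current gradient agrees in sign with the averaged past gradient, and decrease it multiplicatively when they disagree. The parameter names kappa, phi and theta are illustrative and may not match the symbols used above. The cost function and all constants are made up.

```python
def delta_bar_delta_step(rate, g, g_bar_prev, kappa, phi, theta):
    """Return the new learning rate and new averaged gradient for one weight.

    g          -- current gradient dE/dw for this weight
    g_bar_prev -- exponential average of earlier gradients
    kappa, phi, theta -- hand-chosen constants (additive increase,
                         multiplicative decrease, averaging factor)
    """
    agreement = g * g_bar_prev
    if agreement > 0:          # same direction as before: increase additively
        rate += kappa
    elif agreement < 0:        # direction flipped: decrease multiplicatively
        rate *= (1 - phi)
    g_bar = (1 - theta) * g + theta * g_bar_prev
    return rate, g_bar

# Batch descent on E = 0.5 * (w1^2 + 25 * w2^2), a made-up elongated bowl.
w = [5.0, 5.0]
rates = [0.01, 0.01]           # one learning rate per weight
g_bars = [0.0, 0.0]
curvature = [1.0, 25.0]
for _ in range(200):
    grads = [c * wi for c, wi in zip(curvature, w)]
    for i in range(2):
        rates[i], g_bars[i] = delta_bar_delta_step(
            rates[i], grads[i], g_bars[i], kappa=0.005, phi=0.3, theta=0.7)
        w[i] -= rates[i] * grads[i]
```

The rate for the shallow direction keeps growing while its gradient keeps the same sign, while the rate for the steep direction is cut back as soon as the weight starts to overshoot.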
4.5 Error Backpropagation
We have already seen how to train linear networks by gradient descent. In trying to do the
same for multi-layer networks we encounter a difficulty: we don't have any target values for
the hidden units. This seems to be an insurmountable problem - how could we tell the hidden
units just what to do? This unsolved question was in fact the reason why neural networks fell
out of favor after an initial period of high popularity in the 1950s. It took 30 years before
the error backpropagation (or in short: backprop) algorithm popularized a way to train
hidden units, leading to a new wave of neural network research and applications.
(Fig. 1)
In principle, backprop provides a way to train networks with any number of hidden units
arranged in any number of layers. (There are clear practical limits, which we will discuss
later.) In fact, the network does not have to be organized in layers - any pattern of connectivity
that permits a partial ordering of the nodes from input to output is allowed. In other words,
there must be a way to order the units such that all connections go from "earlier" (closer to the
input) to "later" ones (closer to the output). This is equivalent to stating that their connection
pattern must not contain any cycles. Networks that respect this constraint are
called feedforward networks; their connection pattern forms a directed acyclic
graph or DAG.
The Algorithm
We want to train a multi-layer feedforward network by gradient descent to approximate an
unknown function, based on some training data consisting of pairs (x,t). The
vector x represents a pattern of input to the network, and the vector t the
corresponding target (desired output). As we have seen before, the overall gradient with
respect to the entire training set is just the sum of the gradients for each pattern; in what
follows we will therefore describe how to compute the gradient for just a single training
pattern. As before, we will number the units, and denote the weight from unit j to unit i by $w_{ij}$.
1. Definitions:
o the error signal for unit j: $\delta_j = -\partial E / \partial net_j$
o the (negative) gradient for weight $w_{ij}$: $\Delta w_{ij} = -\partial E / \partial w_{ij}$
o the set of nodes anterior to unit i: $A_i = \{ j : \exists\, w_{ij} \}$
o the set of nodes posterior to unit j: $P_j = \{ i : \exists\, w_{ij} \}$
2. The gradient. As we did for linear networks before, we expand the gradient into two
factors by use of the chain rule:
$$\Delta w_{ij} = \left(-\frac{\partial E}{\partial net_i}\right) \frac{\partial net_i}{\partial w_{ij}}$$
The first factor is the error of unit i. The second is
$$\frac{\partial net_i}{\partial w_{ij}} = \frac{\partial}{\partial w_{ij}} \sum_{k \in A_i} w_{ik}\, y_k = y_j$$
Putting the two together, we get
$$\Delta w_{ij} = \delta_i\, y_j$$
To compute this gradient, we thus need to know the activity and the error for all
relevant nodes in the network.
3. Forward activation. The activity of the input units is determined by the network's
external input x. For all other units, the activity is propagated forward:
$$y_i = f_i\Big(\sum_{j \in A_i} w_{ij}\, y_j\Big)$$
Note that before the activity of unit i can be calculated, the activity of all its anterior
nodes (forming the set $A_i$) must be known. Since feedforward networks do not contain
cycles, there is an ordering of nodes from input to output that respects this condition.
4. Calculating output error. Assuming that we are using the sum-squared loss
$$E = \frac{1}{2} \sum_o (t_o - y_o)^2$$
the error for output unit o (with a linear activation function) is simply
$$\delta_o = t_o - y_o$$
5. Error backpropagation. For hidden units, we must propagate the error back from the
output nodes (hence the name of the algorithm). Again using the chain rule, we can
expand the error of a hidden unit in terms of its posterior nodes:
$$\delta_j = \sum_{i \in P_j} \left(-\frac{\partial E}{\partial net_i}\right) \frac{\partial net_i}{\partial y_j} \frac{\partial y_j}{\partial net_j}$$
Of the three factors inside the sum, the first is just the error of node i. The second is
$$\frac{\partial net_i}{\partial y_j} = \frac{\partial}{\partial y_j} \sum_{k \in A_i} w_{ik}\, y_k = w_{ij}$$
while the third is the derivative of node j's activation function:
$$\frac{\partial y_j}{\partial net_j} = f_j'(net_j)$$
For hidden units h that use the tanh activation function, we can make use of the special
identity $\tanh(u)' = 1 - \tanh(u)^2$, giving us
$$f_h'(net_h) = 1 - y_h^2$$
Putting all the pieces together we get
$$\delta_j = f_j'(net_j) \sum_{i \in P_j} \delta_i\, w_{ij}$$
Note that in order to calculate the error for unit j, we must first know the error of all its
posterior nodes (forming the set $P_j$). Again, as long as there are no cycles in the
network, there is an ordering of nodes from the output back to the input that respects
this condition. For example, we can simply use the reverse of the order in which
activity was propagated forward.
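The graph form carries over to code almost line by line. The sketch below assumes a small made-up network that is not strictly layered (it has a connection between the two hidden units and one that skips from an input straight to the output), tanh hidden units, and a linear output unit with the sum-squared loss. The weights and the unit numbering are invented for the example.

```python
import math

# A made-up feedforward net: units 0 and 1 are inputs, 2 and 3 are tanh hidden
# units, 4 is a linear output. weights[(i, j)] is w_ij, the weight from j to i.
weights = {(2, 0): 0.5, (2, 1): -0.3, (3, 0): 0.8, (3, 2): 0.4,
           (4, 2): 1.2, (4, 3): -0.7, (4, 1): 0.1}
inputs, hidden, outputs = [0, 1], [2, 3], [4]

def anterior(i):
    return [j for (a, j) in weights if a == i]      # the set A_i

def posterior(j):
    return [i for (i, b) in weights if b == j]      # the set P_j

def backprop(x, t):
    """Return the negative gradients Delta w_ij = delta_i * y_j for one pattern."""
    y = dict(zip(inputs, x))
    # Forward activation in an order that respects the connections.
    for i in hidden + outputs:
        net = sum(weights[(i, j)] * y[j] for j in anterior(i))
        y[i] = math.tanh(net) if i in hidden else net
    # Output error for the sum-squared loss with linear output units.
    delta = {o: t_o - y[o] for o, t_o in zip(outputs, t)}
    # Error backpropagation in the reverse order.
    for j in reversed(hidden):
        back = sum(delta[i] * weights[(i, j)] for i in posterior(j))
        delta[j] = (1 - y[j] ** 2) * back            # tanh' = 1 - tanh^2
    return {(i, j): delta[i] * y[j] for (i, j) in weights}, y

def loss(x, t):
    """Sum-squared loss for one pattern."""
    _, y = backprop(x, t)
    return 0.5 * sum((t_o - y[o]) ** 2 for o, t_o in zip(outputs, t))

step, _ = backprop([1.0, -2.0], [0.5])
```

Because `step` holds negative gradients, adding a small multiple of it to the weights moves downhill on the loss.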
Matrix Form
For layered feedforward networks that are fully connected - that is, each node in a given layer
connects to every node in the next layer - it is often more convenient to write the backprop
algorithm in matrix notation rather than in the more general graph form given above. In this
notation, the bias weights, net inputs, activations, and error signals for all units in a layer are
combined into vectors, while all the non-bias weights from one layer to the next form a matrix
W. Layers are numbered from 0 (the input layer) to L (the output layer). The backprop
algorithm then looks as follows:
1. Initialize the input layer:
$$y_0 = x$$
2. Propagate activity forward: for l = 1, 2, ..., L,
$$net_l = W_l\, y_{l-1} + b_l, \qquad y_l = f_l(net_l)$$
where $b_l$ is the vector of bias weights.
3. Calculate the error in the output layer:
$$\delta_L = t - y_L$$
4. Backpropagate the error: for l = L-1, L-2, ..., 1,
$$\delta_l = (W_{l+1}^T\, \delta_{l+1}) \odot f_l'(net_l)$$
where T is the matrix transposition operator and $\odot$ is element-wise multiplication.
5. Update the weights and biases:
$$\Delta W_l = \delta_l\, y_{l-1}^T, \qquad \Delta b_l = \delta_l$$
You can see that this notation is significantly more compact than the graph form, even though
it describes exactly the same sequence of operations.
4.6 Backpropagation of error: an example
We will now show an example of a backprop network as it learns to model the highly
nonlinear data we encountered before.
The left hand panel shows the data to be modeled. The right hand panel shows a network with
two hidden units, each with a tanh nonlinear activation function. The output unit computes a
linear combination of the two functions
(1)
Where
(2)
and
(3)
To begin with, we set the weights, a..g, to random initial values in the range [-1,1]. Each
hidden unit is thus computing a random tanh function. The next figure shows the initial two
activation functions and the output of the network, which is their sum plus a negative
constant. (If you have difficulty making out the line types, the top two curves are the tanh
functions, the one at the bottom is the network output).
We now train the network (learning rate 0.3), updating the weights after each pattern (online
learning). After we have been through the entire dataset 10 times (10 training epochs), the
functions computed look like this (the output is the middle curve):
After 20 epochs, we have (output is the humpbacked curve):
and after 27 epochs we have a pretty good fit to the data:
As the activation functions are stretched, scaled and shifted by the changing weights, we hope
that the error of the model is dropping. In the next figure we plot the total sum squared error
over all 88 patterns of the data as a function of training epoch. Four training runs are shown,
with different weight initialization each time:
You can see that the path to the solution differs each time, both because we start from a
different point in weight space, and because the order in which patterns are presented is
random. Nonetheless, all training curves go down monotonically, and all reach about the same
level of overall error.
4.7 Overfitting
In the previous example we used a network with two hidden units. Just by looking at the data,
it was possible to guess that two tanh functions would do a pretty good job of fitting the data.
In general, however, we may not know how many hidden units, or equivalently, how many
weights, we will need to produce a reasonable approximation to the data. Furthermore, we
usually seek a model of the data which will give us, on average, the best possible predictions
for novel data. This goal can conflict with the simpler task of modelling a specific training set
well. In this section we will look at some techniques for preventing our model becoming too
powerful (overfitting). In the next, we address the related question of selecting an appropriate
architecture with just the right amount of trainable parameters.
Bias-Variance trade-off
Consider the two fitted functions below. The data points (circles) have all been generated
from a smooth function, h(x), with some added noise. Obviously, we want to end up with a
model which approximates h(x), given a specific set of data y(x) generated as:
$$y(x) = h(x) + \epsilon \qquad (1)$$
In the left hand panel we try to fit the points using a function g(x) which has too few
parameters: a straight line. The model has the virtue of being simple; there are only two free
parameters. However, it does not do a good job of fitting the data, and would not do well in
predicting new data points. We say that the simpler model has a high bias.
The right hand panel shows a model which has been fitted using too many free parameters. It
does an excellent job of fitting the data points, as the error at the data points is close to zero.
However it would not do a good job of predicting h(x) for new values of x. We say that the
model has a high variance. The model does not reflect the structure which we expect to be
present in any data set generated by equation (1) above.
Clearly what we want is something in between: a model which is powerful enough to
represent the underlying structure of the data (h(x)), but not so powerful that it faithfully
models the noise associated with this particular data sample.
The bias-variance trade-off is most likely to become a problem if we have relatively few data
points. In the opposite case, where we have essentially an infinite number of data points (as in
continuous online learning), we are not usually in danger of overfitting the data, as the noise
associated with any single data point plays a vanishingly small role in our overall fit. The
following techniques therefore apply to situations in which we have a finite data set, and,
typically, where we wish to train in batch mode.
Preventing overfitting
Early stopping
One of the simplest and most widely used means of avoiding overfitting is to divide the data
into two sets: a training set and a validation set. We train using only the training data. Every
now and then, however, we stop training, and test network performance on the independent
validation set. No weight updates are made during this test! As the validation data is
independent of the training data, network performance is a good measure of generalization,
and as long as the network is learning the underlying structure of the data (h(x) above),
performance on the validation set will improve with training. Once the network stops learning
things which are expected to be true of any data sample and learns things which are true only
of this sample (epsilon in Eqn 1 above), performance on the validation set will stop
improving, and will typically get worse. Schematic learning curves showing error on the
training and validation sets are shown below. To avoid overfitting, we simply stop training at
time t, where performance on the validation set is optimal.
One detail of note when using early stopping: if we wish to test the trained network on a set of
independent data to measure its ability to generalize, we need a third, independent, test set.
This is because we used the validation set to decide when to stop training, and thus our trained
network is no longer entirely independent of the validation set. The requirement of
independent training, validation and test sets means that early stopping can only be used in a
data-rich situation.
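In code, early stopping usually amounts to remembering the best weights seen so far. The sketch below is one way to arrange it. The callbacks `train_epoch` and `validation_error` are hypothetical placeholders for whatever training and evaluation routines you have. The `patience` parameter (how many epochs to wait for an improvement before giving up) is an added practical assumption, since in practice the validation curve is noisy and the optimal time t is only known in hindsight.

```python
import copy

def train_with_early_stopping(weights, train_epoch, validation_error,
                              max_epochs, patience):
    """Keep the weights that gave the lowest validation error.

    train_epoch(weights)      -- hypothetical callback: one pass over the training set
    validation_error(weights) -- hypothetical callback: error on the validation set
    patience                  -- epochs to wait for an improvement before stopping
    """
    best_error = validation_error(weights)
    best_weights = copy.deepcopy(weights)
    waited = 0
    for _ in range(max_epochs):
        train_epoch(weights)                    # weights change only here
        error = validation_error(weights)       # no weight updates during the test
        if error < best_error:
            best_error, best_weights, waited = error, copy.deepcopy(weights), 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_weights, best_error

# Toy stand-ins: "training" nudges one weight upward, and the validation error
# is lowest at w = 3, so continuing past that point makes it worse.
weights = {"w": 0.0}

def train_epoch(ws):
    ws["w"] += 0.5

def validation_error(ws):
    return (ws["w"] - 3.0) ** 2

best, err = train_with_early_stopping(weights, train_epoch, validation_error,
                                      max_epochs=50, patience=3)
```

The returned weights are the ones from the best validation epoch, not the ones training ended with. A final, unbiased estimate of generalization still needs the separate test set.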
Weight decay
The over-fitted function above shows a high degree of curvature, while the linear function is
maximally smooth. Regularization refers to a set of techniques which help to ensure that the
function computed by the network is no more curved than necessary. This is achieved by
adding a penalty to the error function, giving:
(2)
One possible form of the regularizer comes from the informal observation that an over-fitted
mapping with regions of large curvature requires large weights. We thus penalize large
weights by choosing
(3)
Using this modified error function, the weights are now updated as
(4)
where the right hand term causes the weight to decrease as a function of its own size. In the
absence of any input, all weights will tend to decrease exponentially, hence the term "weight
decay".
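The exponential decay is easy to see in a few lines. The sketch assumes the penalty is half the sum of squared weights, scaled by a coefficient called `nu` here, and a learning rate `mu`. Both names and all numbers are assumptions made for illustration.

```python
def decay_update(w, grad, mu, nu):
    """One gradient step on E = E0 + nu * 0.5 * sum(w_i^2).

    grad holds dE0/dw_i; the extra -mu * nu * w_i term is the weight decay.
    """
    return [wi - mu * gi - mu * nu * wi for wi, gi in zip(w, grad)]

w = [2.0, -1.0]
history = [w]
for _ in range(3):
    w = decay_update(w, grad=[0.0, 0.0], mu=0.5, nu=0.2)   # no error gradient at all
    history.append(w)
# Each step multiplies every weight by (1 - mu * nu) = 0.9.
```

With a nonzero error gradient the two terms compete: a weight stays large only if the data keep pushing it there.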
Training with noise
A final method which can often help to reduce the importance of the specific noise
characteristics associated with a particular data sample is to add an extra small amount of
noise (a small random value with mean value of zero) to each input. Each time a specific input
pattern x is presented, we add a different random number, and use the perturbed pattern instead.
At first, this may seem a rather odd thing to do: to deliberately corrupt one's own data.
However, perhaps you can see that it will now be difficult for the network to approximate any
specific data point too closely. In practice, training with added noise has indeed been shown to
reduce overfitting and thus improve generalization in some situations.
If we have a finite training set, another way of introducing noise into the training process is to
use online training, that is, updating weights after every pattern presentation, and to randomly
reorder the patterns at the end of each training epoch. In this manner, each weight update is
based on a noisy estimate of the true gradient.
4.8 Growing and Pruning Networks
The neural network modeler is faced with a huge array of models and training regimes from
which to select. This course can only serve to introduce you to the most common and general
models. However, even after deciding, for example, to train a simple feed forward network,
using some specific form of gradient descent, with tanh nodes in a single hidden layer, an
important question remains: how big a network should we choose? How
many hidden units, or, relatedly, how many weights?
By way of an example, the nonlinear data which formed our first example can be fitted very
well using 40 tanh functions. Learning with 40 hidden units is considerably harder than
learning with 2, and takes significantly longer. The resulting fit is no better (as measured by
the sum squared error) than the 2-unit model.
The most usual answer is not necessarily the best: we guess an appropriate number (as we did
above).
Another common solution is to try out several network sizes, and select the most promising.
Neither of these methods is very principled.
Two more rigorous classes of methods are available, however. We can either start with a
network which we know to be too small, and iteratively add units and weights, or we can train
an oversized network and remove units/weights from the final network. We will look briefly
at each of these approaches.
Growing networks
The simplest form of network growing algorithm starts with a small network, say one with
only a single hidden unit. The network is trained until the improvement in the error over one
epoch falls below some threshold. We then add an additional hidden unit, with weights from
inputs and to outputs. We initialize the new weights randomly and resume training. The
process continues until no significant gain is achieved by adding an extra unit. The process is
illustrated below.
Cascade correlation
Beyond simply having too many parameters (danger of overfitting), there is a problem with
large networks which has been called the herd effect. Imagine we have a task which is
essentially decomposable into two sub-tasks A and B. We have a number of hidden units and
randomly weighted connections. If task A is responsible for most of the error signal arriving at
the hidden units, there will be a tendency for all units to simultaneously try to solve A. Once
the error attributable to A has been reduced, error from subtask B will predominate, and all
units will now try to solve that, leading to an increase again in the error from A. Eventually,
due mainly to the randomness in the weight initialization, the herd will split and different
units will address different sub-problems, but this may take considerable time.
To get around this problem, Fahlman (1991) proposed an algorithm called cascade
correlation which begins with a minimal network having just input and output units. Training
a single layer requires no back-propagation of error and can be done very efficiently. At some
point further training will not produce much improvement. If network performance is
satisfactory, training can be stopped. If not, there must be some remaining error which we
wish to reduce some more. This is done by adding a new hidden unit to the network, as
described in the next paragraph. The new unit is added, its input weights are frozen (i.e. they
will no longer be changed) and all output weights are once again trained. This is repeated until
the error is small enough (or until we give up).
To add a hidden unit, we begin with a candidate unit and provide it with incoming connections
from the input units and from all existing hidden units. We do not yet give it any outgoing
connections. The new unit's input weights are trained by a process similar to gradient descent.
Specifically, we seek to maximize the covariance between v, the new unit's value, and $E_o$, the
output error at output unit o.
We define S as:
$$S = \sum_o \left| \sum_p (v_p - \bar{v})(E_{p,o} - \bar{E}_o) \right| \qquad (1)$$
where o ranges over the output units and p ranges over the input patterns. The terms $\bar{v}$
and $\bar{E}_o$ are the mean values of v and $E_o$ over all patterns. Performing gradient ascent
on S, using its partial derivative with respect to each of the unit's input weights (we will skip
the explicit formula here), ensures that we end up with a unit
whose activation is maximally correlated (positively or negatively) with the remaining error.
Once we have maximized S, we freeze the input weights, and install the unit in the network as
described above. The whole process is illustrated below.
In (1) we train the weights from input to output. In (2), we add a candidate unit and train its
weights to maximize the correlation with the error. In (3) we retrain the output layer, (4) we
train the input weights for another hidden unit, (5) retrain the output layer, etc. Because we
train only one layer at a time, training is very quick. What is more, because the weights
feeding into each hidden unit do not change once the unit has been added, it is possible to
record and store the activations of the hidden units for each pattern, and reuse these values
without recomputation in later epochs.
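The score S from equation (1) is cheap to evaluate once the candidate's activations and the residual output errors have been recorded for every pattern. The following is a minimal sketch, assuming S is the sum over output units of the absolute covariance between v and E_o; the function name `candidate_score` and the list-of-lists data layout are illustrative choices, not part of the algorithm as published.

```python
def candidate_score(v, errors):
    """Sum over outputs of |sum over patterns of (v_p - mean v)(E_po - mean E_o)|.

    v      -- candidate unit activations, one per pattern
    errors -- residual errors, errors[p][o] for pattern p and output unit o
    """
    n = len(v)
    v_mean = sum(v) / n
    score = 0.0
    for o in range(len(errors[0])):
        e_mean = sum(errors[p][o] for p in range(n)) / n
        cov = sum((v[p] - v_mean) * (errors[p][o] - e_mean) for p in range(n))
        score += abs(cov)  # positive and negative correlation count equally
    return score
```

Because of the absolute value, a candidate that is strongly anti-correlated with the error scores as well as one that is strongly correlated; the output weights trained afterwards can pick up either sign.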
Pruning networks
An alternative approach to growing networks is to start with a relatively large network and
then remove weights so as to arrive at an optimal network architecture. The usual procedure is
as follows:
1. Train a large, densely connected, network with a standard training algorithm
2. Examine the trained network to assess the relative importance of the weights
3. Remove the least important weight(s)
4. Retrain the pruned network
5. Repeat steps 2-4 until satisfied
Deciding which are the least important weights is a difficult issue for which several heuristic
approaches are possible. We can estimate the amount by which the error function E changes
for a small change in each weight. The computational form for this estimate would take us a
little too far here. Various forms of this technique have been called optimal brain damage,
and optimal brain surgeon.
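To give a flavour of step 2, here is a hedged sketch of the saliency measure used by optimal brain damage: each weight w_k is scored by s_k = h_kk * w_k^2 / 2, where h_kk is the corresponding diagonal entry of the Hessian. How the diagonal Hessian is obtained is outside this sketch; the values are simply passed in, and the function names are hypothetical.

```python
def obd_saliencies(weights, hessian_diag):
    """Estimated increase in error if each weight were set to zero."""
    return [0.5 * h * w * w for w, h in zip(weights, hessian_diag)]


def least_important(weights, hessian_diag):
    """Index of the weight whose removal is estimated to hurt least."""
    saliencies = obd_saliencies(weights, hessian_diag)
    return min(range(len(saliencies)), key=saliencies.__getitem__)
```

Note that the smallest weight is not necessarily the least important one: a small weight in a direction of high curvature can have a larger saliency than a large weight in a flat direction.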
4.9 Preconditioning the Network
Ill-Conditioning
In the preceding section on overfitting, we have seen what can happen when the network
learns a given set of data too well. Unfortunately a far more frequent problem encountered by
backpropagation users is just the opposite: that the network does not learn well at all! This is
usually due to ill-conditioning of the network.
(Fig. 1a)
Recall that gradient descent requires a reasonable learning rate to work well: if it is too low
(Fig. 1a), convergence will be very slow; set it too high, and the network will diverge (Fig.
1b).
(Fig. 1b)
Unfortunately the best learning rate is typically different for each weight in the network!
Sometimes these differences are small enough for a single, global compromise learning rate to
work well - other times not. We call a network ill-conditioned if it requires learning rates for
its weights that differ by so much that there is no global rate at which the network learns
reasonably well. The error function for such a network is characterized by long, narrow
valleys:
(Fig. 2)
(Mathematically, ill-conditioning is characterized by a high condition number. The condition
number is the ratio between the largest and the smallest eigenvalue of the network's Hessian.
The Hessian is the matrix of second derivatives of the loss function with respect to the
weights. Although it is possible to calculate the Hessian for a multi-layer network and
determine its condition number explicitly, it is a rather complicated procedure, and rarely
done.)
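For a network with only two weights the condition number can be worked out by hand. The sketch below assumes a symmetric, positive definite 2x2 Hessian [[a, b], [b, c]] and uses the closed-form eigenvalues of such a matrix; the function name is illustrative.

```python
import math


def condition_number_2x2(a, b, c):
    """Ratio of largest to smallest eigenvalue of [[a, b], [b, c]].

    Assumes the matrix is positive definite, so both eigenvalues are > 0.
    """
    mean = (a + c) / 2.0
    radius = math.sqrt(((a - c) / 2.0) ** 2 + b * b)
    return (mean + radius) / (mean - radius)
```

A spherical error surface gives a condition number of 1; a valley that is 100 times steeper in one direction than the other gives 100.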
Ill-conditioning in neural networks can be caused by the training data, the network's
architecture, and/or its initial weights. Typical problems are: having large inputs or target
values, having both large and small layers in the network, having more than one hidden
layer, and having initial weights that are too large or too small. This should make it clear that
ill-conditioning is a very common problem indeed! In what follows, we look at each possible
source of ill-conditioning, and describe a simple method to remove the problem. Since these
methods are all used before training of the network begins, we refer to them
as preconditioning techniques.
Normalizing Inputs and Targets
(Fig. 3)
Recall the simple linear network (Fig. 3) we first used to learn the car data set. When we
presented the best linear fit, we had rescaled both the x (input) and y (target) axes. Why did we
do this? Consider what would happen if we used the original data directly instead: the input
(weight of the car) would be quite large - over 3000 (pounds) on average. To map such large
inputs onto the far smaller targets, the weight from input to output must become quite small -
about -0.01. Now assume that we are 10% (0.001) away from the optimal value. This would
cause an error of (typically) 3000*0.001 = 3 at the output. At learning rate μ, the weight
change resulting from this error would be μ*3*3000 = 9000μ. For stable convergence, this
should be smaller than the distance to the weight's optimal value: 9000μ < 0.001, giving us
μ < 10^-7, a very small learning rate. (And this is for online learning - for batch learning, where
the weight changes for several patterns are added up, the learning rate would have to be even
smaller!)
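The arithmetic above can be replayed in a few lines. This is only a sketch of the argument for a single linear weight; the function name and parameter names are made up for the illustration.

```python
def max_stable_rate(x, weight_error):
    """Largest rate at which one online step does not overshoot the optimum."""
    output_error = x * weight_error   # error seen at the output
    gradient = output_error * x       # gradient magnitude for this weight
    return weight_error / gradient    # step == distance at this rate (= 1/x^2)
```

For the car data, `max_stable_rate(3000.0, 0.001)` comes out at roughly 1.1e-7, in line with the bound derived in the text. The result depends only on the size of the input, which is why rescaling the input fixes the problem.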
Why should such a small learning rate be a problem? Consider that the bias unit has a constant
output of 1. A bias weight that is, say, 0.1 away from its optimal value would therefore have a
gradient of 0.1. At a learning rate of 10^-7, however, it would take 10 million steps to move the
bias weight by this distance! This is a clear case of ill-conditioning caused by the vastly
different scale of input and bias values. The solution is simple: normalize the input, so that it
has an average of zero and a standard deviation of one. Normalization is a two-step process:
To normalize a variable, first
1. (centering) subtract its average, then
2. (scaling) divide by its standard deviation.
Note that for our purposes it is not really necessary to calculate the mean and standard
deviation of each input exactly - approximate values are perfectly sufficient. (In the case of
the car data, the "mean" of 3000 and "standard deviation" of 1000 were simply guessed after
looking at the data plot.) This means that in situations where the training data is not known in
advance, estimates based on either prior knowledge or a small sample of the data are usually
good enough. If the data is a time series x(t), you may also want to consider using the first
differences x(t) - x(t-1) as network inputs instead; they have zero mean as long as x(t) is
stationary. Whichever way you do it, remember that you should always
- normalize the inputs, and
- normalize the targets.
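The two normalization steps can be sketched as follows. The helper accepts guessed values for the mean and standard deviation, as suggested above, and falls back to computing them from the data; its name and signature are illustrative.

```python
import math


def normalize(values, mean=None, std=None):
    """Center and scale a list of numbers.

    mean and std may be rough guesses; if omitted they are computed
    from the data (population standard deviation).
    """
    n = len(values)
    if mean is None:
        mean = sum(values) / n
    if std is None:
        std = math.sqrt(sum((x - mean) ** 2 for x in values) / n)
    return [(x - mean) / std for x in values]
```

With the guessed car-data statistics, `normalize([2000.0, 3000.0, 4000.0], mean=3000.0, std=1000.0)` maps the weights to -1, 0 and 1. The same mean and standard deviation must of course be reused for any data presented to the network later.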
To see why the target values should also be normalized, consider the network we've used to fit
a sigmoid to the car data (Fig. 4). If the target values were those found in the original data,
the weight from hidden to output unit would have to be 10 times larger. The error signal
propagated back to the hidden unit would thus be multiplied by 17 along the way. In order to
compensate for this, the global learning rate would have to be lowered correspondingly, slowing
down the weights that go directly to the output unit. Thus while large inputs cause
ill-conditioning by leading to very small weights, large targets do so by leading to very
large weights.
Finally, notice that the argument for normalizing the
inputs can also be applied to the hidden units (which after all look like inputs to their posterior
nodes). Ideally, we would like hidden unit activations as well to have a mean of zero and a
standard deviation of one. Since the weights into hidden units keep changing during training,
however, it would be rather hard to predict their mean and standard deviation accurately!
Fortunately we can rely on our tanh activation function to keep things reasonably well-
conditioned: its range from -1 to +1 implies that the standard deviation cannot exceed 1, while
its symmetry about zero means that the mean will typically be relatively small. Furthermore,
its maximum derivative is also 1, so that backpropagated errors will be neither magnified nor
attenuated more than necessary.
Note: For historic reasons, many people use the logistic sigmoid f(u) = 1/(1 + e^-u) as
activation function for hidden units. This function is closely related to tanh (in fact, f(u) =
tanh(u/2)/2 + 0.5) but has a smaller, asymmetric range (from 0 to 1), and a maximum
derivative of 0.25. We will later encounter a legitimate use for this function, but as activation
function for hidden units it tends to worsen the network's conditioning. Thus
- do not use the logistic sigmoid f(u) = 1/(1 + e^-u) as activation function for hidden
units.
Use tanh instead: your network will be better conditioned.
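The relation between the two activation functions, and the difference in their maximum slopes, can be checked numerically. The helper names below are illustrative.

```python
import math


def logistic(u):
    return 1.0 / (1.0 + math.exp(-u))


def logistic_via_tanh(u):
    # The identity quoted in the note: f(u) = tanh(u/2)/2 + 0.5
    return math.tanh(u / 2.0) / 2.0 + 0.5


def logistic_slope(u):
    f = logistic(u)
    return f * (1.0 - f)               # peaks at 0.25 for u = 0


def tanh_slope(u):
    return 1.0 - math.tanh(u) ** 2     # peaks at 1 for u = 0
```

Both slopes are largest at u = 0, where tanh passes error signals through unchanged while the logistic sigmoid attenuates them by a factor of four.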
Initializing the Weights
Before training, the network weights are initialized to small random values. The random
values are usually drawn from a uniform distribution over the range [-r,r]. What should r be?
If the initial weights are too small, both activation and error signals will die out along their
way through the network. Conversely, if they are too large, the tanh function of the hidden
units will saturate - be very close to its asymptotic value of +/-1. This means that its
derivative will be close to zero, blocking any backpropagated error signals from passing
through the node; this is sometimes called paralysis of the node.
(Fig. 4)
To avoid either extreme, we would initially like the hidden units' net input to be approximately
normalized. We do not know the inputs to the node, but we do know that they're
approximately normalized - that's what we ensured in the previous section. It seems
reasonable then to model the expected inputs as independent, normalized random variables.
This means that their variances add, so we can write
$$\mathrm{Var}(net_i) = \sum_{j \in A_i} w_{ij}^2 \, \mathrm{Var}(x_j) = \sum_{j \in A_i} w_{ij}^2 \le |A_i| \, r^2$$
since the initial weights are in the range [-r,r]. Here A_i denotes the set of nodes feeding into
node i. To ensure that Var(net_i) is at most 1, we can
thus set r to the inverse of the square root of the fan-in |A_i| of the node - the number of
weights coming into it:
- initialize weight w_ij to a uniformly random value in the range [-r_i, r_i],
where $r_i = 1 / \sqrt{|A_i|}$
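A minimal sketch of this initialization, assuming the range is r = 1/sqrt(fan-in) as argued above. The function name and the optional `rng` argument (any object with a `uniform` method, such as `random.Random`) are illustrative.

```python
import math
import random


def init_weights(fan_in, rng=random):
    """Draw the incoming weights of one node uniformly from [-r, r]."""
    r = 1.0 / math.sqrt(fan_in)
    return [rng.uniform(-r, r) for _ in range(fan_in)]
```

For a node with 16 incoming weights this gives r = 0.25, so the sum of the squared weights, and hence the variance of the net input under the independence assumption, cannot exceed 1.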
Setting Local Learning Rates
Above we have seen that the architecture of the network - specifically: the fan-in of its nodes
- determines the range within which its weights should be initialized. The architecture also
affects how the error signal scales up or down as it is backpropagated through the network.
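Modelling the error signals as independent random variables, we have
$$\mathrm{Var}(\delta_j) \le \sum_{i \in P_j} w_{ij}^2 \, \mathrm{Var}(\delta_i) \le \sum_{i \in P_j} r_i^2 \, \mathrm{Var}(\delta_i) = \sum_{i \in P_j} \frac{\mathrm{Var}(\delta_i)}{|A_i|}$$
where δ_i is the error signal at node i and P_j is the set of posterior nodes of node j (the nodes
that j feeds into).
Let us define a new variable v for each hidden or output node, proportional to the (estimated)
variance of its error signal divided by its fan-in. We can calculate all the v by a
backpropagation procedure:
- for all output nodes o, set $v_o = 1 / |A_o|$
- backpropagate: for all hidden nodes j, calculate $v_j = \frac{1}{|A_j|} \sum_{i \in P_j} v_i$
Since the activations in the network are already normalized, we can expect the gradient for
weight w_ij to scale with the square root of the corresponding error signal's variance, v_i |A_i|. The
resulting weight change, however, should be commensurate with the characteristic size of the
weight, which is given by r_i. To achieve this,
- set the learning rate μ_i (used for all weights w_ij into node i) to
$\mu_i = \mu \, r_i / \sqrt{v_i |A_i|} = \mu / (|A_i| \sqrt{v_i})$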
If you follow all the points we have made in this section before the start of training, you
should have a reasonably well-conditioned network that can be trained effectively. It remains
to determine a good global learning rate μ. This must be done by trial and error; a good first
guess (on the high side) would be the inverse of the square root of the batch size (by a similar
argument as we have made above), or 1 for online learning. If this leads to divergence, reduce
μ and try again.
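The local learning rates can be computed once, before training, from the layer sizes alone. The sketch below is one reading of the procedure, under stated assumptions: the network is layered and fully connected, bias weights are ignored (so the fan-in of a node is the size of the layer below), v_o = 1/|A_o| for output nodes, v_j is the sum of v over the posterior nodes of j divided by |A_j|, and the local rate is μ / (|A_i| sqrt(v_i)). The function name is hypothetical.

```python
import math


def local_learning_rates(layer_sizes, global_rate=1.0):
    """Return one learning rate per non-input layer, inputs listed first."""
    last = len(layer_sizes) - 1
    v = [0.0] * len(layer_sizes)             # v[0] (input layer) is unused
    v[last] = 1.0 / layer_sizes[last - 1]    # output nodes: v_o = 1/|A_o|
    for layer in range(last - 1, 0, -1):     # backpropagate v to hidden layers
        fan_in = layer_sizes[layer - 1]
        n_posterior = layer_sizes[layer + 1]
        v[layer] = n_posterior * v[layer + 1] / fan_in
    return [global_rate / (layer_sizes[layer - 1] * math.sqrt(v[layer]))
            for layer in range(1, last + 1)]
```

For a 4-3-2 network this gives roughly 0.61 times the global rate for weights into the hidden layer and 0.58 times the global rate for weights into the output layer.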
4.10 Momentum and Learning Rate Adaptation
Local Minima
In gradient descent we start at some point on the error function defined over the weights, and
attempt to move to the global minimum of the function. In the simplified function of Fig 1a
the situation is simple. Any step in a downward direction will take us closer to the global
minimum. For real problems, however, error surfaces are typically complex, and may more
resemble the situation shown in Fig 1b. Here there are numerous local minima, and the ball is
shown trapped in one such minimum. Progress here is only possible by climbing higher before
descending to the global minimum.
(Fig. 1a) (Fig. 1b)
We have already mentioned one way to escape a local minimum: use online learning. The
noise in the stochastic error surface is likely to bounce the network out of local minima as long
as they are not too severe.
Momentum
Another technique that can help the network out of local minima is the use of
a momentum term. This is probably the most popular extension of the backprop algorithm; it
is hard to find cases where this is not used. With momentum m, the weight update at a given
time t becomes
$$\Delta w_{ij}(t) = -\mu \, \frac{\partial E}{\partial w_{ij}}(t) + m \, \Delta w_{ij}(t-1) \qquad (1)$$
where 0 < m < 1 is a new global parameter which must be determined by trial and error.
Momentum simply adds a fraction m of the previous weight update to the current one. When
the gradient keeps pointing in the same direction, this will increase the size of the steps taken
towards the minimum. It is therefore often necessary to reduce the global learning rate μ
when using a lot of momentum (m close to 1). If you combine a high learning rate with a lot
of momentum, you will rush past the minimum with huge steps!
When the gradient keeps changing direction, momentum will smooth out the variations. This
is particularly useful when the network is not well-conditioned. In such cases the error surface
has substantially different curvature along different directions, leading to the formation of
long narrow valleys. For most points on the surface, the gradient does not point towards the
minimum, and successive steps of gradient descent can oscillate from one side to the other,
progressing only very slowly to the minimum (Fig. 2a). Fig. 2b shows how the addition of
momentum helps to speed up convergence to the minimum by damping these oscillations.
(Fig. 2a) (Fig. 2b)
To illustrate this effect in practice, we trained 20 networks on a simple problem (4-2-4
encoding), both with and without momentum. The mean training times (in epochs) were
momentum    training time (epochs)
0           217
0.9         95
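The effect of momentum in a long narrow valley can be reproduced on a toy problem. The sketch below is not the 4-2-4 encoder experiment; it applies the momentum update (new step = -rate * gradient + m * previous step) to an assumed quadratic error E = (w0^2 + 25 w1^2)/2, whose curvature differs by a factor of 25 between the two directions. All names are illustrative.

```python
import math


def descend(grad, w, rate, momentum, steps):
    """Gradient descent with momentum; returns the final weight vector."""
    step = [0.0] * len(w)
    for _ in range(steps):
        g = grad(w)
        # a fraction `momentum` of the previous step is added to the new one
        step = [-rate * gi + momentum * si for gi, si in zip(g, step)]
        w = [wi + si for wi, si in zip(w, step)]
    return w


def valley_grad(w):
    """Gradient of E = 0.5 * (w0**2 + 25 * w1**2)."""
    return [w[0], 25.0 * w[1]]


def distance(w):
    """Euclidean distance from the minimum at the origin."""
    return math.sqrt(sum(wi * wi for wi in w))
```

Starting from (1, 1) with a rate of 0.07, fifty steps without momentum still leave the slow direction far from converged, while a momentum of 0.5 ends up much closer to the minimum in the same number of steps.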
Learning Rate Adaptation
In the section on preconditioning, we have employed simple heuristics to arrive at reasonable
guesses for the global and local learning rates. It is possible to refine these values significantly
once training has commenced, and the network's response to the data can be observed. We
will now introduce a few methods that can do so automatically by adapting the learning rates
during training.
Bold Driver
A useful batch method for adapting the global learning rate μ is the bold driver algorithm. Its
operation is simple: after each epoch, compare the network's loss E(t) to its previous value,
E(t-1). If the error has decreased, increase μ by a small proportion (typically 1%-5%). If the
error has increased by more than a tiny proportion (say, 10^-10), however, undo the last weight
change, and decrease μ sharply - typically by 50%. Thus bold driver will keep growing μ
slowly until it finds itself taking a step that has clearly gone too far up onto the opposite slope
of the error function. Since this means that the network has arrived in a tricky area of the error
surface, it makes sense to reduce the step size quite drastically at this point.
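A minimal sketch of bold driver for batch gradient descent follows. The `loss` and `grad` callables, the default growth factor of 5% and the handling of the borderline case (error unchanged or increased by less than the tolerance: accept the step, leave the rate alone) are assumptions of this sketch.

```python
def bold_driver(loss, grad, w, rate, epochs, grow=1.05, shrink=0.5, tol=1e-10):
    """Batch gradient descent with bold driver rate adaptation."""
    prev = loss(w)
    for _ in range(epochs):
        candidate = [wi - rate * gi for wi, gi in zip(w, grad(w))]
        new = loss(candidate)
        if new > prev * (1.0 + tol):
            rate *= shrink       # went too far: undo the step, cut the rate
            continue
        if new < prev:
            rate *= grow         # error decreased: grow the rate slightly
        w, prev = candidate, new
    return w, rate
```

Started with a rate that is far too high, the algorithm first halves the rate until steps stop increasing the error, and from then on lets it creep up again until the next rejection.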
Annealing
Unfortunately bold driver cannot be used in this form for online learning: the stochastic
fluctuations in E(t) would hopelessly confuse the algorithm. If we keep μ fixed, however,
these same fluctuations prevent the network from ever properly converging to the minimum -
instead we end up randomly dancing around it. In order to actually reach the minimum, and
stay there, we must anneal (gradually lower) the global learning rate. A simple, non-adaptive
annealing schedule for this purpose is the search-then-converge schedule
$$\mu(t) = \frac{\mu(0)}{1 + t/T} \qquad (2)$$
Its name derives from the fact that it keeps μ nearly constant for the first T training patterns,
allowing the network to find the general location of the minimum, before annealing it at a
(very slow) pace that is known from theory to guarantee convergence to the minimum. The
characteristic time T of this schedule is a new free parameter that must be determined by trial
and error.
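Equation (2) translates directly into code. The function and argument names are illustrative.

```python
def search_then_converge(rate0, t, T):
    """Learning rate after t training patterns, with characteristic time T."""
    return rate0 / (1.0 + t / T)
```

With an initial rate of 0.1 and T = 100, the rate is still 0.1 at t = 0, has dropped to 0.05 at t = T, and decays roughly like 1/t from then on.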
Local Rate Adaptation
If we are willing to be a little more sophisticated, we can go a lot further than the above global
methods. First let us define an online weight update that uses a local, time-varying learning
rate for each weight:
$$w_{ij}(t+1) = w_{ij}(t) - \mu_{ij}(t) \, g_{ij}(t), \qquad g_{ij}(t) = \frac{\partial E(t)}{\partial w_{ij}(t)} \qquad (3)$$
The idea is to adapt these local learning rates by gradient descent, while simultaneously
adapting the weights. At time t, we would like to change the learning rate (before changing
the weight) such that the loss E(t+1) at the next time step is reduced. The gradient we need is
$$\frac{\partial E(t+1)}{\partial \mu_{ij}(t)} = \frac{\partial E(t+1)}{\partial w_{ij}(t+1)} \cdot \frac{\partial w_{ij}(t+1)}{\partial \mu_{ij}(t)} = -g_{ij}(t+1) \, g_{ij}(t) \qquad (4)$$
Ordinary gradient descent in μ_ij, using the meta-learning rate q (a new global parameter),
would give (shifting the time index back by one step so that both gradients are available)
$$\mu_{ij}(t) = \mu_{ij}(t-1) + q \, g_{ij}(t) \, g_{ij}(t-1) \qquad (5)$$
We can already see that this would work in a similar fashion to momentum: increase the
learning rate as long as the gradient keeps pointing in the same direction, but decrease it when
you land on the opposite slope of the loss function.
Problem: μ_ij might become negative! Also, the step size should be proportional to μ_ij so that it
can be adapted over several orders of magnitude. This can be achieved by performing the
gradient descent on log(μ_ij) instead:
$$\log \mu_{ij}(t) = \log \mu_{ij}(t-1) + q \, g_{ij}(t) \, g_{ij}(t-1) \qquad (6)$$
Exponentiating this gives
$$\mu_{ij}(t) = \mu_{ij}(t-1) \, e^{q \, g_{ij}(t) \, g_{ij}(t-1)} \approx \mu_{ij}(t-1) \, \max\big(0.5,\; 1 + q \, g_{ij}(t) \, g_{ij}(t-1)\big) \qquad (7)$$
where the approximation serves to avoid an expensive exp function call. The multiplier is
limited below by 0.5 to guard against very small (or even negative) factors.
Problem: the gradient is noisy; the product of two of them will be even noisier - the learning
rate will bounce around a lot. A popular way to reduce the stochasticity is to replace the
gradient at the previous time step (t-1) by an exponential average of past gradients. The
exponential average of a time series u(t) is defined as
$$\bar{u}(t) = m \, \bar{u}(t-1) + (1 - m) \, u(t) \qquad (8)$$
where 0 < m < 1 is a new global parameter.
Problem: if the gradient is ill-conditioned, the product of two gradients will be even worse -
the condition number is squared. We will need to normalize the step sizes in some way. A
radical solution is to throw away the magnitude of the step, and just keep the sign, giving
$$\mu_{ij}(t) = \mu_{ij}(t-1) \, r^{\,\mathrm{sign}(g_{ij}(t) \, \bar{g}_{ij}(t-1))} \qquad (9)$$
where r = e^q. This works fine for batch learning, but...
(Fig. 3)
Problem: Nonlinear normalizers such as the sign function lead to systematic errors in
stochastic gradient descent (Fig. 3): a skewed but zero-mean gradient distribution (typical for
stochastic equilibrium) is mapped to a normalized distribution with non-zero mean. To avoid
the problems this is causing, we need a linear normalizer for online learning. A good method
is to divide the step by $\overline{g_{ij}^2}$, an exponential average of the squared gradient. This gives
$$\mu_{ij}(t) = \mu_{ij}(t-1) \, \max\left(0.5,\; 1 + q \, \frac{g_{ij}(t) \, \bar{g}_{ij}(t-1)}{\overline{g_{ij}^2}(t)}\right) \qquad (10)$$
Problem: successive training patterns may be correlated, causing the product of stochastic
gradients to behave strangely. The exponential averaging does help to get rid of short-term
correlations, but it cannot deal with input that exhibits correlations across long periods of
time. If you are iterating over a fixed training set, make sure you permute (shuffle) it before
each iteration to destroy any correlations. This may not be possible in a true online learning
situation, where training data is received one pattern at a time.
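The building blocks of the normalized update can be sketched for a single weight. This is one reading of equation (10), under the assumption that the rate is multiplied by max(0.5, 1 + q * g * g_avg / g2_avg), where g is the current gradient, g_avg the exponential average of past gradients and g2_avg the exponential average of squared gradients. The function names and default parameter values are illustrative.

```python
def exp_average(avg, value, m=0.9):
    """One step of the exponential average: m * old average + (1 - m) * value."""
    return m * avg + (1.0 - m) * value


def adapt_rate(rate, g, g_avg, g2_avg, q=0.1):
    """Multiplicative update of one local learning rate."""
    if g2_avg == 0.0:
        return rate                      # nothing to normalize by yet
    factor = 1.0 + q * g * g_avg / g2_avg
    return rate * max(0.5, factor)       # never shrink by more than half
```

When the current gradient agrees in sign with the averaged past gradients the rate grows; when it disagrees the rate shrinks, but never by more than a factor of two in a single step.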
To show that all these equations actually do something useful, here is a typical set of online
learning curves (in postscript) for a difficult benchmark problem, given
either uncorrelated training patterns, or patterns with strong short-term or long-
term correlations. In these figures "momentum" corresponds to using equation (1) above, and
"s-ALAP" to equation (10). "ALAP" is like "s-ALAP" but without the exponential averaging
of past gradients, while "ELK1" and "SMD" are more advanced methods (developed by one
of us).
4.11 Delta-Bar-Delta (Jacobs)
Since the cost surface for multi-layer networks can be complex, choosing a learning rate can
be difficult. What works in one location of the cost surface may not work well in another
location. Delta-Bar-Delta is a heuristic algorithm for modifying the learning rate as training
progresses:
- Each weight has its own learning rate.
- For each weight: the gradient at the current timestep is compared with the gradient at
the previous step (actually, previous gradients are averaged)
- If the gradient is in the same direction the learning rate is increased
- If the gradient is in the opposite direction the learning rate is decreased
- Should be used with batch only.
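Let
g_ij(t) = gradient of E wrt w_ij at time t
then define the exponential average of past gradients
$$\bar{g}_{ij}(t) = (1 - \theta) \, g_{ij}(t) + \theta \, \bar{g}_{ij}(t-1)$$
Then the learning rate μ_ij for weight w_ij at time t+1 is given by
$$\mu_{ij}(t+1) = \begin{cases} \mu_{ij}(t) + \kappa & \text{if } \bar{g}_{ij}(t-1) \, g_{ij}(t) > 0 \\ (1 - \phi) \, \mu_{ij}(t) & \text{if } \bar{g}_{ij}(t-1) \, g_{ij}(t) < 0 \\ \mu_{ij}(t) & \text{otherwise} \end{cases}$$
where θ, φ, and κ are chosen by hand.
Downsides:
- Knowing how to choose the parameters θ, φ, and κ is not easy.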
- Doesn't work for online.
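The rule can be sketched for a single weight as follows, assuming the usual form of Delta-Bar-Delta: the rate grows additively by κ when the current gradient agrees in sign with the averaged past gradient, and shrinks multiplicatively by the factor (1 - φ) when it disagrees. Function names and default parameter values are illustrative.

```python
def average_gradient(g_bar_prev, g, theta=0.7):
    """Exponential average of gradients: (1 - theta) * g + theta * previous."""
    return (1.0 - theta) * g + theta * g_bar_prev


def delta_bar_delta_rate(rate, g, g_bar_prev, kappa=0.01, phi=0.1):
    """New learning rate for one weight after seeing gradient g."""
    agreement = g * g_bar_prev
    if agreement > 0.0:
        return rate + kappa          # same direction: additive increase
    if agreement < 0.0:
        return rate * (1.0 - phi)    # opposite direction: multiplicative decrease
    return rate
```

The asymmetry is deliberate: rates grow slowly and linearly but are cut back quickly, which keeps them from running away.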
Chapter 5 Unsupervised Learning
5.1 Unsupervised Learning
Up until now we have discussed how to train nets given a training set of input and target
values. The target value is often called the teacher signal because it represents the "right
answer", i.e. what the output of the net should be. Training with a teacher signal is
called supervised learning.
We can also train nets on inputs where there is no teacher signal. The purpose might be to
- discover underlying structure of the data
- encode the data
- compress the data
- transform the data
This kind of learning is called unsupervised learning because there is no explicit teacher
signal.
Examples of unsupervised learning
- Hebbian learning
w(t+1) = w(t) + μ y(t) x(t)
This moves w toward infinity in the direction of the eigenvector with the largest eigenvalue
of the correlation matrix.
A more stable version is Oja's rule
w(t+1) = w(t) + μ (x(t) - y(t) w(t)) y(t)
- principal component analysis
- competitive learning
- vector quantization
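The difference between plain Hebbian learning and Oja's rule shows up on even a tiny data set. The sketch below assumes a single linear unit y = w . x and a small learning rate; the `train` helper and its `rule` argument are made up for the illustration.

```python
import math


def train(data, w, rate, epochs, rule):
    """Train one linear unit with the Hebbian rule or with Oja's rule."""
    for _ in range(epochs):
        for x in data:
            y = sum(wi * xi for wi, xi in zip(w, x))   # unit output
            if rule == "hebb":
                w = [wi + rate * y * xi for wi, xi in zip(w, x)]
            else:  # Oja's rule adds a decay term -y^2 * w
                w = [wi + rate * y * (xi - y * wi) for wi, xi in zip(w, x)]
    return w


def norm(w):
    return math.sqrt(sum(wi * wi for wi in w))
```

On data lying along the direction (1, 1), the Hebbian weight vector keeps growing without bound, whereas under Oja's rule it settles at unit length, pointing along that direction.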
5.2 Linear Data Compression
Goal: To find a low dimensional representation of the data
Example: Saving Space
In general, the data does not lie perfectly on a linear subspace. In this case, some information
is lost when the data is compressed. The problem here is to find the compression direction that
loses the least information.
Principal Component Analysis (PCA)
The first principal direction corresponds to
- the direction of largest variance of the data.
- the eigenvector associated with the largest eigenvalue of the correlation matrix <x x^T>.
If we have n-dimensional data, we can compress it down to m dimensions by projecting it
onto the space spanned by the eigenvectors of the m largest eigenvalues.
The method used for finding these directions is called Principal Component
Analysis (PCA).
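For small problems the leading principal direction can be found without any library support. The sketch below builds the correlation matrix as the average of x x^T and applies power iteration, a standard technique that is swapped in here for illustration; it assumes the start vector is not orthogonal to the leading eigenvector. All function names are hypothetical.

```python
import math


def correlation_matrix(data):
    """Average of the outer products x x^T over all data points."""
    n, d = len(data), len(data[0])
    return [[sum(x[i] * x[j] for x in data) / n for j in range(d)]
            for i in range(d)]


def principal_direction(matrix, iterations=100):
    """Power iteration: unit eigenvector of the largest eigenvalue."""
    w = [1.0] + [0.0] * (len(matrix) - 1)
    for _ in range(iterations):
        w = [sum(row[j] * w[j] for j in range(len(w))) for row in matrix]
        length = math.sqrt(sum(wi * wi for wi in w))
        w = [wi / length for wi in w]
    return w


def project(x, direction):
    """One-dimensional code for x: its coordinate along the direction."""
    return sum(xi * di for xi, di in zip(x, direction))
```

Each data point is then stored as a single number, its projection; multiplying that number by the direction vector gives the reconstruction.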
Finding the Principal Components using an Autoassociative Network
An autoassociative network is a network whose inputs and targets are the same. That is, the
net must find a mapping from an input to itself.
Why do this? Well, when the number of hidden nodes is smaller than the number of input
nodes, the network is forced to learn an efficient low dimensional representation of the data.
See Maple example of the above network.
Example: Image Compression (Cottrell et al, 87)
- 64 inputs: 8x8 pixel regions of an image specified to 8 bit precision
- 16 hidden units
- 64 outputs: targets = inputs
The network was trained on randomly selected patches of an image (150,000 training steps). It was
then tested on the entire image, patch by patch, using the entire set of non-overlapping patches. See
"Fundamentals of Artificial Neural Networks", Hassoun, pp. 247-253.
They found that nonlinearity in the hidden units gave no advantage (this was later confirmed
theoretically).
5.3 Nonlinear Compression Techniques
Two layer networks perform a projection of the data onto a linear subspace. In this case, the
encoding and decoding portions of the network are really single layer linear networks.
This works well in some cases. However, many datasets lie on lower dimensional subspaces
that are not linear.
Example:
A helix is 1-D; however, it does not lie on a 1-D linear subspace.
To solve this problem we can let the encoding and decoding portions each be multilayer
networks. In this way we obtain nonlinear projections of the data.
5-Layer Networks:
Example: Hemisphere
(from Fast Nonlinear Dimension Reduction, Nanda Kambhatla, NIPS93)
Compressing a hemisphere onto 2 dimensions
Example: Faces
(from Fast Nonlinear Dimension Reduction, Nanda Kambhatla, NIPS93)
In the examples below, the original images consisted of 64x64 8-bit/pixel grayscale images.
The first 50 principal components were extracted to form the image you see on the left. This
was reduced to 5 dimensions using linear PCA to obtain the image in the center. The same
image on the left was also reduced to 5 dimensions using a 5-layer (50-40-5-40-50) network to
produce the image on the right.
Face 1:
50 principal components 5 principal components 5 nonlinear components
Face 2:
50 principal components 5 principal components 5 nonlinear components
5.4 Simple Competitive Learning
In competitive networks, output units compete for the right to respond.
Goal: method of clustering - divide the data into a number of clusters such that the inputs in
the same cluster are in some sense similar.
A basic competitive learning network has one layer of input nodes and one layer of output
nodes. Binary valued outputs are often (but not always) used. There are as many output nodes
as there are classes.
Often (but not always) there are lateral inhibitory connections between the output nodes. (In
simulations, the function of the lateral connections can be replaced with a different algorithm.)
The output units are also often called grandmother cells. The term grandmother cell comes
from discussions as to whether your brain might contain cells that fire only when you
encounter your maternal grandmother, or whether such higher level concepts are more
distributed.
Vector Quantization (VQ)
Vector quantization is one example of competitive learning.
The goal here is to have the network "discover" structure in the data by finding how the data is
clustered. The results can be used for data encoding and compression. One such method for
doing this is called vector quantization.
In vector quantization, we assume there is a codebook which is defined by a set of M
prototype vectors. (M is chosen by the user and the initial prototype vectors are chosen
arbitrarily).
An input belongs to cluster i if i is the index of the closest prototype (closest in the sense of
the normal euclidean distance). This has the effect of dividing up the input space into
a Voronoi tessellation.
Implementing Vector Quantization with a Network
Algorithm:
- Choose the number of clusters M
- Initialize the prototypes w*1, ..., w*M (one simple method for doing this is to randomly
choose M vectors from the input data)
- Repeat until stopping criterion is satisfied:
o Randomly pick an input x
o Determine the "winning" node k by finding the prototype vector that satisfies
| w*k - x | <= | w*i - x | ( for all i )
note: if the prototypes are normalized, this is equivalent to maximizing w*i · x
o Update only the winning prototype weights according to
w*k(new) = w*k(old) + μ ( x - w*k(old) )
This is called the standard competitive learning rule.
See Maple Example.
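One iteration of the loop above can be sketched as follows, assuming the update moves the winning prototype a fraction `rate` of the way towards the input. Function names are illustrative, and prototypes are stored as a list of lists that is modified in place.

```python
def nearest(prototypes, x):
    """Index of the prototype closest to x (squared Euclidean distance)."""
    def dist2(i):
        return sum((p - xi) ** 2 for p, xi in zip(prototypes[i], x))
    return min(range(len(prototypes)), key=dist2)


def vq_step(prototypes, x, rate):
    """Move only the winning prototype towards x; return the winner's index."""
    k = nearest(prototypes, x)
    prototypes[k] = [p + rate * (xi - p) for p, xi in zip(prototypes[k], x)]
    return k
```

Repeating `vq_step` over randomly drawn inputs, typically with a slowly decreasing rate, lets each prototype drift towards the centre of the inputs it wins.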
VQ and Data Compression
Vector quantization can be used for (lossy) data compression. If we are sending information
over a phone line, we
- initially send the codebook vectors
- for each input, we send the index of the class that the input belongs to
For a large amount of data, this can be a significant reduction. If M=64, then it takes only 6
bits to encode the index. If each input is a single floating point number (4 bytes = 32 bits), there is
a reduction of about 81% ( 100*(1 - 6/32) = 81.25 ).
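The saving can be computed for other codebook sizes and input dimensions as well. This sketch ignores the one-off cost of sending the codebook itself and assumes 32-bit floating point values; the function names are illustrative.

```python
def index_bits(codebook_size):
    """Number of bits needed to address one of codebook_size prototypes."""
    return max(1, (codebook_size - 1).bit_length())


def reduction_percent(codebook_size, values_per_input=1, bits_per_value=32):
    """Percentage saved by sending an index instead of the raw input."""
    raw_bits = values_per_input * bits_per_value
    return 100.0 * (1.0 - index_bits(codebook_size) / raw_bits)
```

For M = 64 and scalar inputs this reproduces the 100*(1 - 6/32) figure; for vector-valued inputs the saving is larger, since the index still needs only 6 bits.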
Learning Vector Quantization (LVQ)
This is a supervised version of vector quantization. Classes are predefined and we have a set
of labelled data. The goal is to determine a set of prototypes that best represent each class.
5.6 Kohonen's Self-Organizing Map (SOM)
Kohonen's SOMs are a type of unsupervised learning. The goal is to discover some
underlying structure of the data. However, the kind of structure we are looking for is very
different than, say, PCA or vector quantization.
Kohonen's SOM is called a topology-preserving map because there is a topological structure
imposed on the nodes in the network. A topological map is simply a mapping that preserves
neighborhood relations.
In the nets we have studied so far, we have ignored the geometrical arrangements of output
nodes. Each node in a given layer has been identical in that each is connected with all of the
nodes in the upper and/or lower layer. We are now going to take into consideration that
physical arrangement of these nodes. Nodes that are "close" together are going to interact
differently than nodes that are "far" apart.
What do we mean by "close" and "far"? We can think of organizing the output nodes in a line
or in a planar configuration.
The goal is to train the net so that nearby outputs correspond to nearby inputs.
E.g. if x1 and x2 are two input vectors and t1 and t2 are the locations of the corresponding
winning output nodes, then t1 and t2 should be close if x1 and x2 are similar. A network that
performs this kind of mapping is called a feature map.
In the brain, neurons tend to cluster in groups. The connections within the group are much
greater than the connections with the neurons outside of the group. Kohonen's network tries to
mimic this in a simple way.
Algorithm for Kohonen's Self Organizing Map
- Assume output nodes are connected in an array (usually 1 or 2 dimensional)
- Assume that the network is fully connected - all nodes in input layer are connected to
all nodes in output layer.
- Use the competitive learning algorithm as follows:
- Randomly choose an input vector x
- Determine the "winning" output node i, where w_i is the weight vector connecting
the inputs to output node i:
| w_i - x | <= | w_k - x | ( for all k )
Note: the above equation is equivalent to w_i · x >= w_k · x only if the weights are
normalized.
- Given the winning node i, the weight update is
w_k(new) = w_k(old) + μ Λ(i,k) ( x - w_k(old) )
where Λ(i,k) is called the neighborhood function that has value 1 when i=k and
falls off with the distance |r_k - r_i| between units i and k in the output array. Thus,
units close to the winner, as well as the winner itself, have their weights updated
appreciably. Weights associated with far away output nodes do not change
significantly. It is here that the topological information is supplied. Nearby units
receive similar updates and thus end up responding to nearby input patterns.
The above rule drags the weight vector w_i and the weights of nearby units
towards the input x.
An example of the neighborhood function is:
$$\Lambda(i,k) = \exp\left(-\frac{|r_k - r_i|^2}{2\sigma^2}\right)$$
where σ^2 is the width parameter that can gradually be decreased over time.
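One training step of the map can be sketched as follows, assuming a Gaussian neighborhood function exp(-d^2 / (2 sigma^2)) of the distance d between output positions, and an update that moves every unit towards the input in proportion to its neighborhood value. The names `som_step`, `positions` and `sigma` are illustrative.

```python
import math


def neighborhood(r_i, r_k, sigma):
    """Gaussian neighborhood: 1 for the winner, falling off with distance."""
    d2 = sum((a - b) ** 2 for a, b in zip(r_i, r_k))
    return math.exp(-d2 / (2.0 * sigma ** 2))


def som_step(weights, positions, x, rate, sigma):
    """Update all weight vectors in place; return the winner's index."""
    def dist2(k):
        return sum((w - xi) ** 2 for w, xi in zip(weights[k], x))
    winner = min(range(len(weights)), key=dist2)
    for k in range(len(weights)):
        h = neighborhood(positions[winner], positions[k], sigma)
        weights[k] = [w + rate * h * (xi - w) for w, xi in zip(weights[k], x)]
    return winner
```

Here `positions[k]` is the location of unit k in the output array (a 1-tuple for a line, a 2-tuple for a grid). Shrinking `sigma` over time lets the map first organize globally and then fine-tune locally.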
Chapter 6 Reinforcement Learning
6.1 Reinforcement Learning
Learning with a Critic
In supervised learning we have assumed that there is a target output value for each input
value. However, in many situations, there is less detailed information available. In extreme
situations, there is only a single bit of information after a long sequence of inputs telling
whether the output is right or wrong. Reinforcement learning is one method developed to deal
with such situations.
Reinforcement learning (RL) is a kind of supervised learning in that some feedback from the
environment is given. However the feedback signal is only evaluative, not instructive.
Reinforcement learning is often called learning with a critic as opposed to learning with a
teacher.
Learning from Interaction
Humans learn by interacting with the environment. When a baby plays, it waves its arms
around, touches things, tastes things, etc. There is no explicit teacher but there is a sensorimotor
connection to its environment. Such a connection provides information about cause and
effect, the consequence of actions, and what to do to achieve goals.
Learning from interaction with our environment is a fundamental idea underlying most
theories of learning.
RL has rich roots in the psychology of animal learning, from where it gets its name.
The growing interest in RL comes in part from the desire to build intelligent systems that must
operate in dynamically changing real-world environments. Robotics is the common example.
Environment
In RL, it is common to think explicitly of a network functioning in an environment. The
environment supplies inputs to the network, receives output, and then provides a
reinforcement signal.
In the most general case, the environment may itself be governed by a complicated dynamical
process. Both reinforcement signals and input patterns may depend arbitrarily on the past
history of the network's output.
The classic problem is in game theory, where the "environment" is actually another player or
players.
Temporal Credit Assignment Problem
A network designed to play chess would receive a reinforcement signal (win or lose) after a
long sequence of moves. The question that arises is: How do we assign credit or blame
individually to each move in a sequence that leads to an eventual victory or loss?
This is called the temporal credit assignment problem, in contrast with the structural credit
assignment problem, where we must attribute network error to different weights.
Learning and Planning
So far in this course we have not discussed the issue of planning. The networks we have seen
are simply learning a direct relationship between an input and an output. RL is our first look at
networks that in some sense decide a course of action by considering possible future actions
before they are actually experienced.
Related Work
RL is closely related to
- dynamic programming methods
- state-space planning methods used in AI
Exploration vs Exploitation
RL is learning what to do - how to map situations to actions - so as to maximize a scalar
reward signal.
There are two important features:
- trial-and-error search:
the learner is not told what actions to take
- delayed reward:
actions can affect not only the immediate reward but also all subsequent rewards
There is always a trade-off in
- exploration: discovering new actions, and
- exploitation: using what it currently knows to obtain a reward
6.2 Components of Reinforcement Learning
Reinforcement learning has 3 basic components:
- agent: the learner or the decision maker
- environment: everything the agent interacts with, i.e. everything outside the agent
- actions: what the agent can do.
Each action is associated with a reward. The objective is for the agent to choose actions so as
to maximize the expected reward over some period of time.
Example: The n-Armed Bandit
Java Simulation
There are n levers that can be pulled.
The action at each step is to choose a lever to pull.
The rewards are the payoffs for hitting the jackpot. Each arm has some average reward, called
its value. If you know the value then the solution is trivial: always pick the lever with the
largest value.
What if you don't know the values of any of the arms? What is the best approach for
estimating the value while at the same time maximizing your reward?
Greedy Approach: Policy: Always pick the arm with the largest estimated value. This is
called exploiting your current knowledge.
Non-Greedy Approach: If you select an arm other than the one with the largest estimated
value, you are said to be exploring.
Balanced Approach: Choose a balance between exploration and exploitation. The balance
partly depends on how many plays you get. If you have 1 play then the best approach is
exploitation. However, if there are many plays you will need some combination. The reward
will be lower in the short term but higher in the long run.
Let:
$Q^*(a)$ = true (actual) value of taking an action a
$Q_t(a)$ = estimated value of taking an action a = (sum of rewards received from a)/(number of times a was taken)
As $t \to \infty$, $Q_t(a) \to Q^*(a)$, provided every action keeps being tried.
Example: A simple policy would be to take the greedy choice most of the time but every now
and then (with probability $\epsilon$), randomly select an action. How do we choose $\epsilon$?
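The policy just described (greedy most of the time, random with a small probability) can be sketched as follows. This is a minimal illustration, not the Java simulation mentioned earlier: the function name `run_bandit`, the Gaussian reward noise, and the zero initial estimates are assumptions made for the example. The value estimate is kept as a running sample mean of the rewards seen for each arm.

```python
import random

def run_bandit(true_values, epsilon, steps, rng, noise=1.0):
    """Play an n-armed bandit, exploring with probability epsilon.

    true_values[a] plays the role of Q*(a); rewards are drawn around it
    with standard deviation `noise` (an assumption of this sketch).
    Returns (estimates, counts, total_reward).
    """
    n = len(true_values)
    estimates = [0.0] * n          # Q_t(a), the running sample means
    counts = [0] * n               # how often each arm was pulled
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            a = rng.randrange(n)                             # explore
        else:
            a = max(range(n), key=lambda i: estimates[i])    # exploit
        reward = rng.gauss(true_values[a], noise)
        counts[a] += 1
        # Incremental form of (sum of rewards from a) / (times a was taken).
        estimates[a] += (reward - estimates[a]) / counts[a]
        total += reward
    return estimates, counts, total
```

With `epsilon=0.0` the learner can lock onto the first arm it tries and never discover a better one, which is exactly the risk of pure exploitation.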
Components of the Agent
A reinforcement learning agent generally has 4 basic components:
- a policy,
- a reward function,
- a value function, and
- a model of the environment
Policy
The policy is the decision making function of the agent. It specifies what action the agent
should take in any of the situations it might encounter. This is the core of the agent. The other
components serve only to change and improve the policy.
Reward Function
The reward function defines the goal of the RL agent. It maps the state of the environment to a
single number, a reward, indicating the intrinsic desirability of the state. The agent's objective
is to maximize the total reward it receives in the long run.
Value function
The value function specifies what is good in the long run. Roughly speaking, the value of a
state is the total amount of reward the agent can expect to accumulate over the future when
starting from the current state.
Rewards determine immediate desirability while value indicates the long term desirability.
In analogy to humans, rewards are immediate pleasure (if high reward) or pain (if low)
whereas values correspond to more refined far-sighted judgement of how pleased or
displeased we are that our environment is in a particular state.
Most of the methods we will discuss are centered around forming and improving approximate
value functions.
Model
The model of the environment or external world should mimic the behavior of the
environment. For example, given a situation and action, the model might predict the resultant
next state and next reward. The model often takes up the largest storage space. If there are S
states and A actions then a complete model will take up a space proportional to S x S x A
because it maps state-action pairs to probability distributions over states. By contrast, the
reward and value functions might just map states to real numbers and thus be of size S.
6.3 Terminology
Reinforcement Learning is about learning a mapping from states to a probability distribution
over actions. This is called the policy.
Policy = $\pi(s,a)$ = probability of taking action a when in state s
S = set of all states (assume finite)
$s_t$ = state at time t
$A(s_t)$ = set of all possible actions given agent is in state $s_t \in S$
$a_t$ = action at time t
$r_t \in \mathbb{R}$ (reals) = reward at time t
At each timestep t=1,2,3,...
- the agent finds itself in a state $s_t \in S$ and
- on that basis chooses an action $a_t \in A(s_t)$.
- One timestep later, the agent receives a reward $r_{t+1}$ and
- finds itself in a new state $s_{t+1}$.
The return, $ret_t$, is the total reward received starting at time t+1:
$ret_t = r_{t+1} + r_{t+2} + r_{t+3} + \dots + r_f$
where $r_f$ is the reward at the final time step (which can be infinite),
and the discounted return is
$ret_t = r_{t+1} + \gamma\, r_{t+2} + \gamma^2 r_{t+3} + \dots$
where $0 \le \gamma \le 1$ is called the discount factor.
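As a small worked illustration of the discounted return, the sum can be accumulated from the last reward backwards. The function name and the convention that the list holds the rewards from time t+1 onward are assumptions of this sketch.

```python
def discounted_return(rewards, gamma):
    """Return r_{t+1} + gamma*r_{t+2} + gamma^2*r_{t+3} + ... for a finite reward list."""
    ret = 0.0
    # Work backwards: each earlier reward adds itself plus gamma times what follows.
    for r in reversed(rewards):
        ret = r + gamma * ret
    return ret
```

Setting `gamma` to 1 gives the plain (undiscounted) return, while `gamma` equal to 0 keeps only the immediate reward.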
We assume that the number of states and actions is finite. We then define the state transition
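probabilities to be:
$P^a_{ss'} = \Pr\{ s_{t+1} = s' \mid s_t = s,\ a_t = a \}$
This is just the probability of transitioning from state s to state s' when action a has been
taken.
Expected Rewards
$R^a_{ss'} = E\{ r_{t+1} \mid s_t = s,\ a_t = a,\ s_{t+1} = s' \}$
The value function for policy $\pi$ is
$V^\pi(s) = E_\pi\{ ret_t \mid s_t = s \}$
where $ret_t$ is the discounted return with discount factor $\gamma$.
The action-value function for policy $\pi$ is
$Q^\pi(s,a) = E_\pi\{ ret_t \mid s_t = s,\ a_t = a \}$
Bellman's Equation for $V^\pi(s)$ (recursion on $V^\pi(s)$) is
$V^\pi(s) = \sum_a \pi(s,a) \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^\pi(s') \right]$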
Bellman Optimality Equations
Goal: Find the policy that gives the greatest return over the long run. We say a policy $\pi$ is
better than or equal to policy $\pi'$ if $V^\pi(s) \ge V^{\pi'}(s)$ for all s. There is always at least one
policy that is better than or equal to all others. Such a policy is called an optimal policy and is
denoted by $\pi^*$. Its corresponding value function is called V*:
$V^*(s) = V^{\pi^*}(s) = \max_\pi V^\pi(s)$, for all s
and the optimal action-value function
$Q^*(s,a) = Q^{\pi^*}(s,a) = \max_\pi Q^\pi(s,a)$, for all s, a
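The Bellman optimality equation is then
$V^*(s) = \max_a \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^*(s') \right]$
where $P^a_{ss'}$ are the state transition probabilities, $R^a_{ss'}$ the expected rewards, and $\gamma$ the discount factor.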
This equation has a unique solution. It is a system of equations with |S| equations and |S|
unknowns. If P and R were known then, in principle, it can be solved using some method for
solving systems of nonlinear equations. Once V* is known, the optimal policy is determined
by always choosing the action that produces the largest V*.
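When P and R are known, one common way to solve the optimality equation numerically is value iteration, a standard dynamic-programming method that the passage above does not itself describe. The sketch below is illustrative only: the data layout (`P[s][a]` as a list of `(probability, next_state, reward)` triples, with an empty dict marking a terminal state) and the function names are assumptions of this example.

```python
def value_iteration(P, gamma, tol=1e-10):
    """Repeatedly apply the Bellman optimality backup until V stops changing."""
    V = {s: 0.0 for s in P}
    while True:
        biggest_change = 0.0
        for s, actions in P.items():
            if not actions:            # terminal state: value stays 0
                continue
            # max over actions of the expected (reward + discounted next value)
            best = max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                for outcomes in actions.values()
            )
            biggest_change = max(biggest_change, abs(best - V[s]))
            V[s] = best
        if biggest_change < tol:
            return V

def greedy_policy(P, V, gamma):
    """Pick, in each state, the action with the largest one-step lookahead value."""
    policy = {}
    for s, actions in P.items():
        if actions:
            policy[s] = max(
                actions,
                key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in actions[a]),
            )
    return policy
```

For example, a hypothetical three-state problem could be written as `P = {"A": {"stay": [(1.0, "A", 0.0)], "go": [(1.0, "B", 1.0)]}, "B": {"finish": [(1.0, "end", 10.0)]}, "end": {}}` and solved with `value_iteration(P, 0.9)`.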
Chapter 7 Advanced Topics
7.1 Momentum and Learning Rate Adaptation
Local Minima
In gradient descent we start at some point on the error function defined over the weights, and
attempt to move to the global minimum of the function. In the simplified function of Fig 1a
the situation is simple. Any step in a downward direction will take us closer to the global
minimum. For real problems, however, error surfaces are typically complex, and may more
resemble the situation shown in Fig 1b. Here there are numerous local minima, and the ball is
shown trapped in one such minimum. Progress here is only possible by climbing higher before
descending to the global minimum.
(Fig. 1a) (Fig. 1b)
We have already mentioned one way to escape a local minimum: use online learning. The
noise in the stochastic error surface is likely to bounce the network out of local minima as long
as they are not too severe.
Momentum
Another technique that can help the network out of local minima is the use of
a momentum term. This is probably the most popular extension of the backprop algorithm; it
is hard to find cases where this is not used. With momentum m, the weight update at a given
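time t becomes
$\Delta w(t) = -\mu\,\frac{\partial E}{\partial w}(t) + m\,\Delta w(t-1)$ (1)
where $\mu$ is the global learning rate and 0 < m < 1 is a new global parameter which must be determined by trial and error.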
Momentum simply adds a fraction m of the previous weight update to the current one. When
the gradient keeps pointing in the same direction, this will increase the size of the steps taken
towards the minimum. It is therefore often necessary to reduce the global learning rate
when using a lot of momentum (m close to 1). If you combine a high learning rate with a lot
of momentum, you will rush past the minimum with huge steps!
When the gradient keeps changing direction, momentum will smooth out the variations. This
is particularly useful when the network is not well-conditioned. In such cases the error surface
has substantially different curvature along different directions, leading to the formation of
long narrow valleys. For most points on the surface, the gradient does not point towards the
minimum, and successive steps of gradient descent can oscillate from one side to the other,
progressing only very slowly to the minimum (Fig. 2a). Fig. 2b shows how the addition of
momentum helps to speed up convergence to the minimum by damping these oscillations.
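The damping effect can be tried out on a toy error surface. The sketch below is illustrative only: it assumes a quadratic error $E(w) = \frac{1}{2}\sum_k c_k w_k^2$ whose curvatures $c_k$ differ strongly (a long narrow valley), and the function name, the symbol `mu` for the learning rate, and the particular numbers are assumptions of this example rather than the 4-2-4 encoder experiment reported in the notes.

```python
def descend(curvatures, w, mu, m, steps):
    """Gradient descent with momentum m on E(w) = 0.5 * sum_k c_k * w_k**2.

    Returns the error after `steps` weight updates; m = 0 gives plain gradient descent.
    """
    dw = [0.0] * len(w)                       # previous weight update
    for _ in range(steps):
        grad = [c * wk for c, wk in zip(curvatures, w)]
        # Current step = gradient step plus a fraction m of the previous step.
        dw = [-mu * g + m * d for g, d in zip(grad, dw)]
        w = [wk + d for wk, d in zip(w, dw)]
    return 0.5 * sum(c * wk * wk for c, wk in zip(curvatures, w))

# Curvatures 1 and 20: the steep direction oscillates, the flat one crawls.
error_plain = descend([1.0, 20.0], [1.0, 1.0], mu=0.09, m=0.0, steps=50)
error_momentum = descend([1.0, 20.0], [1.0, 1.0], mu=0.09, m=0.5, steps=50)
```

On this surface the run with momentum ends at a much smaller error after the same number of steps, because the oscillation across the valley is damped while progress along it is sped up.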
(Fig. 2a) (Fig. 2b)
To illustrate this effect in practice, we trained 20 networks on a simple problem (4-2-4
encoding), both with and without momentum. The mean training times (in epochs) were
Momentum    Training time (epochs)
0           217
0.9         95
Learning Rate Adaptation
In the section on preconditioning, we have employed simple heuristics to arrive at reasonable
guesses for the global and local learning rates. It is possible to refine these values significantly
once training has commenced, and the network's response to the data can be observed. We
will now introduce a few methods that can do so automatically by adapting the learning rates
during training.
Bold Driver
A useful batch method for adapting the global learning rate is the bold driver algorithm. Its
operation is simple: after each epoch, compare the network's loss E(t) to its previous value,
E(t-1). If the error has decreased, increase the global learning rate $\mu$ by a small proportion (typically 1%-5%). If the
error has increased by more than a tiny proportion (say, $10^{-10}$), however, undo the last weight
change, and decrease $\mu$ sharply - typically by 50%. Thus bold driver will keep growing $\mu$
slowly until it finds itself taking a step that has clearly gone too far up onto the opposite slope
of the error function. Since this means that the network has arrived in a tricky area of the error
surface, it makes sense to reduce the step size quite drastically at this point.
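The bold driver rule can be sketched as follows. This is a minimal illustration under stated assumptions: the function signature, the parameter names `grow`, `shrink` and `tol`, and the use of generic `loss` and `grad` callables in place of a real network are all choices made for this example.

```python
def bold_driver(loss, grad, w, mu, epochs, grow=1.05, shrink=0.5, tol=1e-10):
    """Batch gradient descent whose global learning rate mu adapts after every epoch.

    loss(w) returns the error, grad(w) its gradient (both hypothetical callables).
    Returns (weights, final mu, final loss).
    """
    prev = loss(w)
    for _ in range(epochs):
        candidate = [wi - mu * gi for wi, gi in zip(w, grad(w))]
        cur = loss(candidate)
        if cur > prev * (1.0 + tol):
            # Error went up by more than a tiny proportion:
            # undo the step (keep the old w) and cut mu sharply.
            mu *= shrink
        else:
            if cur < prev:
                mu *= grow             # error decreased: grow mu a little
            w, prev = candidate, cur   # accept the step
    return w, mu, prev
```

Starting with a learning rate that is far too large is harmless here: the first few epochs are rejected and `mu` is halved until a step finally lowers the error.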
Annealing
Unfortunately bold driver cannot be used in this form for online learning: the stochastic
fluctuations in E(t) would hopelessly confuse the algorithm. If we keep the learning rate $\mu$ fixed, however,
these same fluctuations prevent the network from ever properly converging to the minimum -
instead we end up randomly dancing around it. In order to actually reach the minimum, and
stay there, we must anneal (gradually lower) the global learning rate. A simple, non-adaptive
annealing schedule for this purpose is the search-then-converge schedule
$\mu(t) = \mu(0)/(1 + t/T)$ (2)
Its name derives from the fact that it keeps $\mu$ nearly constant for the first T training patterns,
allowing the network to find the general location of the minimum, before annealing it at a
(very slow) pace that is known from theory to guarantee convergence to the minimum. The
characteristic time T of this schedule is a new free parameter that must be determined by trial
and error.
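The schedule is a one-line function. The name `search_then_converge` and the argument names are assumptions of this sketch; `mu0` stands for the initial learning rate and `T` for the characteristic time.

```python
def search_then_converge(mu0, t, T):
    """Learning rate after t training patterns: roughly mu0 while t << T, then decaying like 1/t."""
    return mu0 / (1.0 + t / T)
```

At t = T the rate has fallen to half its initial value, and at t = 9T to one tenth.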
Local Rate Adaptation
If we are willing to be a little more sophisticated, we can go a lot further than the above global
methods. First let us define an online weight update that uses a local, time-varying learning
rate for each weight:
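$\Delta w_{ij}(t) = -\mu_{ij}(t)\,\frac{\partial E(t)}{\partial w_{ij}(t)}$ (3)
where $\mu_{ij}(t)$ is the learning rate of weight $w_{ij}$ at time t.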
The idea is to adapt these local learning rates by gradient descent, while simultaneously
adapting the weights. At time t, we would like to change the learning rate (before changing
the weight) such that the loss E(t+1) at the next time step is reduced. The gradient we need is
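$\frac{\partial E(t+1)}{\partial \mu_{ij}(t)} = \frac{\partial E(t+1)}{\partial w_{ij}(t+1)} \cdot \frac{\partial w_{ij}(t+1)}{\partial \mu_{ij}(t)} = -\frac{\partial E(t+1)}{\partial w_{ij}(t+1)} \cdot \frac{\partial E(t)}{\partial w_{ij}(t)}$ (4)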
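Ordinary gradient descent in $\mu_{ij}$, using the meta-learning rate q (a new global parameter),
would give
$\Delta\mu_{ij}(t) = -q\,\frac{\partial E(t+1)}{\partial \mu_{ij}(t)} = q\,\frac{\partial E(t+1)}{\partial w_{ij}(t+1)} \cdot \frac{\partial E(t)}{\partial w_{ij}(t)}$ (5)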
We can already see that this would work in a similar fashion to momentum: increase the
learning rate as long as the gradient keeps pointing in the same direction, but decrease it when
you land on the opposite slope of the loss function.
Problem: $\mu_{ij}$ might become negative! Also, the step size should be proportional to $\mu_{ij}$ so that it
can be adapted over several orders of magnitude. This can be achieved by performing the
gradient descent on $\log(\mu_{ij})$ instead:
(6)
Exponentiating this gives
(7)
where the approximation serves to avoid an expensive exp function call. The multiplier is
limited below by 0.5 to guard against very small (or even negative) factors.
Problem: the gradient is noisy; the product of two of them will be even noisier - the learning
rate will bounce around a lot. A popular way to reduce the stochasticity is to replace the
gradient at the previous time step (t-1) by an exponential average of past gradients. The
exponential average of a time series u(t) is defined as
$\bar{u}(t) = m\,\bar{u}(t-1) + (1-m)\,u(t)$ (8)
where 0 < m < 1 is a new global parameter.
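The two ingredients described so far, an exponential average of past gradients and a multiplicative learning-rate update whose factor is limited below by 0.5, can be sketched as follows. This is one plausible reading of the text, not the authors' exact equations: the function names, the weighting convention of the average (m on the old average, 1-m on the new value), and the form 1 + q·(current gradient)·(averaged past gradient) of the multiplier are assumptions of this example.

```python
def exp_average(prev_avg, value, m):
    """Exponential average of a time series: keep a fraction m of the old average."""
    return m * prev_avg + (1.0 - m) * value

def adapt_rate(mu, q, grad_now, grad_avg_prev):
    """Multiplicative update of one local learning rate.

    The rate grows when the current gradient agrees in sign with the averaged
    past gradient and shrinks when they disagree; the multiplier never drops
    below 0.5, so the rate stays positive.
    """
    factor = max(0.5, 1.0 + q * grad_now * grad_avg_prev)
    return mu * factor
```

Because the update is multiplicative, a rate that starts positive can change over several orders of magnitude without ever becoming negative.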
Problem: if the gradient is ill-conditioned, the product of two gradients will be even worse -
the condition number is squared. We will need to normalize the step sizes in some way. A
radical solution is to throw away the magnitude of the step, and just keep the sign, giving
(9)
where $r = e^q$. This works fine for batch learning, but...
(Fig. 3)
Problem: Nonlinear normalizers such as the sign function lead to systematic errors in
stochastic gradient descent (Fig. 3): a skewed but zero-mean gradient distribution (typical for
stochastic equilibrium) is mapped to a normalized distribution with non-zero mean. To avoid
the problems this is causing, we need a linear normalizer for online learning. A good method
is to divide the step by an exponential average of the squared gradient. This gives
(10)
Problem: successive training patterns may be correlated, causing the product of stochastic
gradients to behave strangely. The exponential averaging does help to get rid of short-term
correlations, but it cannot deal with input that exhibits correlations across long periods of
time. If you are iterating over a fixed training set, make sure you permute (shuffle) it before
each iteration to destroy any correlations. This may not be possible in a true online learning
situation, where training data is received one pattern at a time.
To show that all these equations actually do something useful, here is a typical set of online
learning curves (in postscript) for a difficult benchmark problem, given
either uncorrelated training patterns, or patterns with strong short-term or long-
term correlations. In these figures "momentum" corresponds to using equation (1) above, and
"s-ALAP" to equation (10). "ALAP" is like "s-ALAP" but without the exponential averaging
of past gradients, while "ELK1" and "SMD" are more advanced methods (developed by one
of us).
7.2 Classification
Discriminants
Neural networks can also be used to classify data. Unlike regression problems, where the goal
is to produce a particular output value for a given input, classification problems require us to
label each data point as belonging to one of n classes. Neural networks can do this by learning
a discriminant function which separates the classes. For example, a network with a single
linear output can solve a two-class problem by learning a discriminant function which is
greater than zero for one class, and less than zero for the other. Fig. 6 shows two such two-
class problems, with filled dots belonging to one class, and unfilled dots to the other. In each
case, a line is drawn where a discriminant function that separates the two classes is zero.
(Fig. 6)
On the left side, a straight line can serve as a discriminant: we can place the line such that all
filled dots lie on one side, and all unfilled ones lie on the other. The classes are said to
be linearly separable. Such problems can be learned by neural networks without any hidden
units. On the right side, a highly non-linear function is required to ensure class separation.
This problem can be solved only by a neural network with hidden units.
Binomial
To use a neural network for classification, we need to construct an equivalent function
approximation problem by assigning a target value for each class. For a binomial (two-class)
problem we can use a network with a single output y, and binary target values: 1 for one class,
and 0 for the other. We can thus interpret the network's output as an estimate of the
probability that a given pattern belongs to the '1' class. To classify a new pattern after training,
we then employ the maximum likelihood discriminant, y > 0.5.
A network with linear output used in this fashion, however, will expend a lot of its effort on
getting the target values exactly right for its training points - when all we actually care about is
the correct positioning of the discriminant. The solution is to use an activation function at the
output that saturates at the two target values: such a function will be close to the target value
for any net input that is sufficiently large and has the correct sign. Specifically, we use
the logistic sigmoid function
$f(net) = \frac{1}{1 + e^{-net}}$
Given the probabilistic interpretation, a network output of, say, 0.01 for a pattern that is
actually in the '1' class is a much more serious error than, say, 0.1. Unfortunately the sum-
squared loss function makes almost no distinction between these two cases. A loss function
that is appropriate for dealing with probabilities is the cross-entropy error. For the two-class
case, it is given by
$E = -t \ln y - (1 - t) \ln(1 - y)$
where t is the target and y is the network output.
When logistic output units and cross-entropy error are used together in backpropagation
learning, the error signal for the output unit becomes just the difference between target and
output:
$\delta = t - y$
In other words, implementing cross-entropy error for this case amounts to nothing more than
omitting the f'(net) factor that the error signal would otherwise get multiplied by. This is not
an accident, but indicative of a deeper mathematical connection: cross-entropy error and
logistic outputs are the "correct" combination to use for binomial probabilities, just like linear
outputs and sum-squared error are for scalar values.
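The claims in this section can be checked numerically. The sketch below is illustrative: the function names are assumptions, t denotes the binary target and y the network output. It compares how the two loss functions treat outputs of 0.01 and 0.1 for a pattern in the '1' class, and verifies by finite differences that the negative derivative of the cross-entropy with respect to the net input of a logistic unit is just t - y.

```python
import math

def logistic(net):
    """Logistic sigmoid: squashes any net input into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-net))

def cross_entropy(t, y):
    """Two-class cross-entropy error for target t in {0, 1} and output 0 < y < 1."""
    return -(t * math.log(y) + (1 - t) * math.log(1 - y))

def sum_squared(t, y):
    """Sum-squared error for a single output."""
    return 0.5 * (t - y) ** 2

def output_error_signal(t, y):
    """Error signal at a logistic output unit trained with cross-entropy."""
    return t - y

# Finite-difference check of -dE/dnet = t - y at an arbitrary net input.
net, t, eps = 0.3, 1.0, 1e-6
numeric = (cross_entropy(t, logistic(net + eps))
           - cross_entropy(t, logistic(net - eps))) / (2 * eps)
analytic = output_error_signal(t, logistic(net))
```

Here `-numeric` and `analytic` agree to within the accuracy of the finite difference, and `cross_entropy(1, 0.01)` is about twice `cross_entropy(1, 0.1)` while the two sum-squared errors differ by only about 20%.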
Multinomial
If we have multiple independent binary attributes by which to classify the data, we can use a
network with multiple logistic outputs and cross-entropy error. For multinomial classification
problems (1-of-n, where n > 2) we use a network with n outputs, one corresponding to each
class, and target values of 1 for the correct class, and 0 otherwise. Since these targets are not
independent of each other, however, it is no longer appropriate to use logistic output units.
The correct generalization of the logistic sigmoid to the multinomial case is
the softmax activation function:
$y_i = \frac{e^{net_i}}{\sum_o e^{net_o}}$
where o ranges over the n output units. The cross-entropy error for such an output layer is
given by
$E = -\sum_o t_o \ln y_o$
Since all the nodes in a softmax output layer interact (the value of each node depends on the
values of all the others), the derivative of the cross-entropy error is difficult to calculate.
Fortunately, it again simplifies to
$\delta_i = t_i - y_i$
so we don't have to worry about it.
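A short numerical check makes the simplification concrete. In this sketch the function names are assumptions, and subtracting the largest net input before exponentiating is a common numerical-stability trick that does not change the softmax values.

```python
import math

def softmax(nets):
    """Softmax over a list of net inputs; outputs are positive and sum to 1."""
    biggest = max(nets)                        # shift for numerical stability
    exps = [math.exp(n - biggest) for n in nets]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(targets, outputs):
    """Multinomial cross-entropy; terms with a zero target contribute nothing."""
    return -sum(t * math.log(y) for t, y in zip(targets, outputs) if t > 0)

def output_error_signals(targets, outputs):
    """Error signal at each softmax output unit: target minus output."""
    return [t - y for t, y in zip(targets, outputs)]

def numeric_error_signal(targets, nets, i, eps=1e-6):
    """Finite-difference estimate of -dE/dnet_i, for comparison."""
    up = list(nets)
    up[i] += eps
    down = list(nets)
    down[i] -= eps
    slope = (cross_entropy(targets, softmax(up))
             - cross_entropy(targets, softmax(down))) / (2 * eps)
    return -slope
```

For any 1-of-n target vector, `numeric_error_signal` and `output_error_signals` agree up to finite-difference error, even though every output depends on every net input.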
7.3 Non-Supervised Learning
It is possible to use neural networks to learn about data that contains neither target outputs nor
class labels. There are many tricks for getting error signals in such non-supervised settings;
here we'll briefly discuss a few of the most common approaches: autoassociation, time series
prediction, and reinforcement learning.
Autoassociation
Autoassociation is based on a simple idea: if you have inputs but no targets, just use the inputs
as targets. An autoassociator network thus tries to learn the identity function. This is only non-
trivial if the hidden layer forms an information bottleneck, that is, contains fewer units than the input
(output) layer, so that the network must perform dimensionality reduction (a form of data
compression).
A linear autoassociator trained with sum-squared error in effect performs principal
component analysis (PCA), a well-known statistical technique. PCA extracts the subspace
(directions) of highest variance from the data. As was the case with regression, the linear
neural network offers no direct advantage over known statistical methods, but it does suggest
an interesting nonlinear generalization:
This nonlinear autoassociator includes a hidden layer in both the encoder and the decoder
part of the network. Together with the linear bottleneck layer, this gives a network with at
least 3 hidden layers. Such a deep network should be preconditioned if it is to learn
successfully.
Time Series Prediction
When the input data x forms a temporal series, an important task is to predict the next point:
the weather tomorrow, the stock market 5 minutes from now, and so on. We can (attempt to)
do this with a feedforward network by using time-delay embedding: at time t, we give the
network x(t), x(t-1), ... x(t-d) as input, and try to predict x(t+1) at the output. After
propagating activity forward to make the prediction, we wait for the actual value of x(t+1) to
come in before calculating and backpropagating the error. Like all neural network architecture
parameters, the dimension d of the embedding is an important but difficult choice.
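Time-delay embedding amounts to slicing the series into overlapping windows. The helper below is a minimal sketch: the name `delay_embed` and the choice to return (input window, target) pairs are assumptions of this example.

```python
def delay_embed(series, d):
    """Build training pairs for one-step-ahead prediction.

    Each input is [x(t), x(t-1), ..., x(t-d)] and its target is x(t+1).
    """
    pairs = []
    for t in range(d, len(series) - 1):
        window = [series[t - k] for k in range(d + 1)]
        pairs.append((window, series[t + 1]))
    return pairs
```

For instance, `delay_embed([0, 1, 2, 3, 4], 2)` yields the pairs `([2, 1, 0], 3)` and `([3, 2, 1], 4)`; a larger d gives the network more history per pattern but fewer patterns and more inputs.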
A more powerful (but also more complicated) way to model a time series is to
use recurrent neural networks.
Reinforcement Learning
Sometimes we are faced with the problem of delayed reward: rather than being told the
correct answer for each input pattern immediately, we may only occasionally get a positive or
negative reinforcement signal to tell us whether the entire sequence of actions leading up to
this was good or bad. Reinforcement learning provides ways to get a continuous error signal
in such situations.
Q-learning associates an expected utility (the Q-value) with each action possible in a
particular state. If at time t we are in state s(t) and decide to perform action a(t), the
corresponding Q-value is updated as follows:
$Q(s(t), a(t)) = r(t) + \gamma \max_a Q(s(t+1), a)$
where r(t) is the instantaneous reward resulting from our action, s(t+1) is the state that it led
to, a are all possible actions in that state, and $\gamma \le 1$ is a discount factor that leads us to
prefer instantaneous over delayed rewards.
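For a small problem the Q-value update can be written against a lookup table. This is a sketch under stated assumptions: the table is a `defaultdict` keyed by (state, action), the function name is made up for the example, and the step size `alpha` is an addition not mentioned in the text (with `alpha=1.0` the old value is simply overwritten by the new target).

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions_next, gamma, alpha=1.0):
    """Move Q[(s, a)] toward r + gamma * max over a' of Q[(s_next, a')].

    actions_next lists the actions available in s_next; an empty list means
    s_next is terminal, so no future value is added.
    """
    best_next = max((Q[(s_next, a2)] for a2 in actions_next), default=0.0)
    target = r + gamma * best_next
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q[(s, a)]

Q = defaultdict(float)          # unseen state/action pairs start at 0
```

Values smaller than 1 for `alpha` blend the new target with the old estimate, which is a common choice when rewards or transitions are noisy.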
A common way to implement Q-learning for small problems is to maintain a table of Q-values
for all possible state/action pairs. For large problems, however, it is often impossible to keep
such a large table in memory, let alone learn its entries in reasonable time. In such cases a
neural network can provide a compact approximation of the Q-value function. Such a network
takes the state s(t) as its input, and has an output $y_a$ for each possible action. To learn the
Q-value Q(s(t), a(t)), it uses the right-hand side of the above Q-iteration as a target:
$r(t) + \gamma \max_a y_a(t+1)$
Note that since we require the network's outputs at time t+1 in order to calculate its error
signal at time t, we must keep a one-step memory of all input and hidden node activity, as
well as the most recent action. The error signal is applied only to the output corresponding to
that action; all other output nodes receive no error (they are "don't cares").
TD-learning is a variation that assigns utility values to states alone rather than state/action
pairs. This means that search must be used to determine the value of the best successor state.
TD($\lambda$) replaces the one-step memory with an exponential average of the network's gradient;
this is similar to momentum, and can help speed the transport of delayed reward signals across
large temporal distances.
One of the most successful applications of neural networks is TD-Gammon, a network that
used TD($\lambda$) to learn the game of backgammon from scratch, by playing only against itself.
TD-Gammon is now the world's strongest backgammon program, and plays at the level of
human grandmasters.
7.4 Learning Time Sequences
There are many tasks that require learning a temporal sequence of events. These problems can
be broken into 3 distinct types of tasks:
- Sequence Recognition: Produce a particular output pattern when a specific input
sequence is seen. Applications: speech recognition
- Sequence Reproduction: Generate the rest of a sequence when the network sees only
part of the sequence. Applications: Time series prediction (stock market, sun spots, etc)
- Temporal Association: Produce a particular output sequence in response to a specific
input sequence. Applications: speech generation
Some of the methods that are used include
- Tapped Delay Lines (time delay networks)
- Context Units (e.g. Elman Nets, Jordan Nets)
- Back propagation through time (BPTT)
- Real Time Recurrent Learning (RTRL)
Tapped Delay Lines / Time Delay Neural Networks
Tapped delay lines are one of the simplest ways of performing sequence recognition, because
conventional backpropagation algorithms can be used.
Downsides: Memory is limited by length of tapped delay line. If a large number of input units
are needed then computation can be slow and many examples are needed.
A simple extension to this is to allow non-uniform sampling:
where $e_i$ is the integer delay associated with component i. Thus if there are n input units, the
memory is not limited simply to the previous n timesteps.
Another extension is for each "input" to really be a convolution of the original input
sequence.
In the case of the delay line memories:
Other variations for c are shown graphically below:
This figure is taken from "Neural Net Architectures for Temporal Sequence Processing", by
Mike Mozer.
7.5 Recurrent Networks I
Consider the following two networks:
(Fig. 1)
The network on the left is a simple feed forward network of the kind we have already met.
The right hand network has an additional connection from the hidden unit to itself. What
difference could this seemingly small change to the network make?
Each time a pattern is presented, the unit computes its activation just as in a feed forward
network. However its net input now contains a term which reflects the state of the network
(the hidden unit activation) before the pattern was seen. When we present subsequent patterns,
the hidden and output units' states will be a function of everything the network has seen so far.
The network behavior is based on its history, and so we must think of pattern presentation as
it happens in time.
Network topology
Once we allow feedback connections, our network topology becomes very free: we can
connect any unit to any other, even to itself. Two of our basic requirements for computing
activations and errors in the network are now violated. When computing activations, we
required that before computing $y_i$, we had to know the activations of all units in the posterior
set of nodes, $P_i$. For computing errors, we required that before computing $\delta_i$, we had to know
the errors of all units in its anterior set of nodes, $A_i$.
For an arbitrary unit in a recurrent network, we now define its activation at time t as:
$y_i(t) = f_i(net_i(t-1))$
At each time step, therefore, activation propagates forward through one layer of connections
only. Once some level of activation is present in the network, it will continue to flow around
the units, even in the absence of any new input whatsoever. We can now present the network
with a time series of inputs, and require that it produce an output based on this series. These
networks can be used to model many new kinds of problems; however, these nets also present
us with many new difficult issues in training.
Before we address the new issues in training and operation of recurrent neural networks, let us
first look at some sample tasks which have been attempted (or solved) by such networks.
- Learning formal grammars
Given a set of strings S, each composed of a series of symbols, identify the strings
which belong to a language L. A simple example: $L = \{a^n b^n\}$ is the language
composed of strings of any number of a's, followed by the same number of b's. Strings
belonging to the language include aaabbb, ab, aaaaaabbbbbb. Strings not belonging to
the language include aabbb, abb, etc. A common benchmark is the language defined by
the Reber grammar. Strings which belong to a language L are said to
be grammatical and are ungrammatical otherwise.
- Speech recognition
In some of the best speech recognition systems built so far, speech is first presented as
a series of spectral slices to a recurrent network. Each output of the network represents
the probability of a specific phone (speech sound, e.g. /i/, /p/, etc), given both present
and recent input. The probabilities are then interpreted by a Hidden Markov Model
which tries to recognize the whole utterance. Details are provided here.
- Music composition
A recurrent network can be trained by presenting it with the notes of a musical score.
Its task is to predict the next note. Obviously this is impossible to do perfectly, but the
network learns that some notes are more likely to occur in one context than another.
Training, for example, on a lot of music by J. S. Bach, we can then seed the network
with a musical phrase, let it predict the next note, feed this back in as input, and repeat,
generating new music. Music generated in this fashion typically sounds fairly
convincing at a very local scale, i.e. within a short phrase. At a larger scale, however,
the compositions wander randomly from key to key, and no global coherence arises.
This is an interesting area for further work.... The original work is described here.
The Simple Recurrent Network
One way to meet these requirements is illustrated below in a network known variously as
an Elman network (after Jeff Elman, the originator), or as a Simple Recurrent Network. At
each time step, a copy of the hidden layer units is made to a copy layer. Processing is done as
follows:
1. Copy inputs for time t to the input units
2. Compute hidden unit activations using net input from input units and from copy layer
3. Compute output unit activations as usual
4. Copy new hidden unit activations to copy layer
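The four processing steps listed above can be sketched as a single function. This is a minimal illustration, not Elman's original implementation: the function name, the weight layout (one row of weights per receiving unit), the tanh hidden units, the linear output units, and the absence of bias terms are all assumptions of this example.

```python
import math

def elman_step(x, context, W_in, W_ctx, W_out):
    """Process one time step of a simple recurrent network.

    x       - input vector for time t (step 1)
    context - copy of the hidden activations from time t-1
    Returns (output activations, new context).
    """
    # Step 2: hidden units see the current input and the copy layer.
    hidden = [
        math.tanh(sum(w * xi for w, xi in zip(W_in[j], x))
                  + sum(w * c for w, c in zip(W_ctx[j], context)))
        for j in range(len(W_in))
    ]
    # Step 3: output units are computed from the hidden layer as usual.
    output = [sum(w * h for w, h in zip(row, hidden)) for row in W_out]
    # Step 4: the new hidden activations become the next context.
    return output, list(hidden)
```

Calling the function repeatedly and feeding each returned context into the next call lets an earlier input influence later outputs, even when the later input is zero.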
In computing the activation, we have eliminated cycles, and so our requirement that the
activations of all posterior nodes be known is met. Likewise, in computing errors, all trainable
weights are feed forward only, so we can apply the standard backpropagation algorithm as
before. The weights from the copy layer to the hidden layer play a special role in error
computation. The error signal they receive comes from the hidden units, and so depends on
the error at the hidden units at time t. The activations in the copy layer, however, are just the
activations of the hidden units at time t-1. Thus, in training, we are considering a gradient of an
error function which is determined by the activations at the present and the previous time
steps.
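The four processing steps listed above can be sketched in a few lines of NumPy. This is only a minimal illustration under stated assumptions: the layer sizes, the tanh hidden units, the logistic output unit, the random weights, and the names srn_step, W_in, W_ctx and W_out are all invented for the example and do not come from the original notes.

```python
import numpy as np

def srn_step(x, context, W_in, W_ctx, W_out):
    """One time step of a Simple Recurrent (Elman) Network.

    Returns the output activations and the new copy-layer contents.
    """
    # Step 2: hidden activations from the input units and the copy layer
    hidden = np.tanh(W_in @ x + W_ctx @ context)
    # Step 3: output activations as usual (logistic units assumed here)
    output = 1.0 / (1.0 + np.exp(-(W_out @ hidden)))
    # Step 4: the new hidden activations become the next copy layer
    return output, hidden.copy()

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 2, 3, 1          # hypothetical layer sizes
W_in = rng.normal(scale=0.5, size=(n_hid, n_in))
W_ctx = rng.normal(scale=0.5, size=(n_hid, n_hid))
W_out = rng.normal(scale=0.5, size=(n_out, n_hid))

context = np.zeros(n_hid)             # copy layer starts empty
sequence = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 0.0])]
outputs = []
for x in sequence:                    # Step 1: present the input for time t
    y, context = srn_step(x, context, W_in, W_ctx, W_out)
    outputs.append(y)
```

Because the copy layer is treated as just another set of inputs, each call is an ordinary feedforward pass, which is why standard backpropagation applies.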
A generalization of this approach is to copy the input and hidden unit activations for a number
of previous timesteps. The more context (copy layers) we maintain, the more history we are
explicitly including in our gradient computation. This approach has become known as Back
Propagation Through Time. It can be seen as an approximation to the ideal of computing a
gradient which takes into consideration not just the most recent inputs, but all inputs seen so
far by the network. The figure below illustrates one version of the process:
The inputs and hidden unit activations at the last three time steps are stored. The solid arrows
show how each set of activations is determined from the input and hidden unit activations on
the previous time step. A backward pass, illustrated by the dashed arrows, is performed to
determine separate values of delta (the error of a unit with respect to its net input) for each
unit and each time step separately. Because each earlier layer is a copy of the layer one level
up, we introduce the new constraint that the weights at each level be identical. Then the partial
derivative of the negative error with respect to $w_{i,j}$ is simply the sum of the partials calculated
for the copy of $w_{i,j}$ between each two layers.
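The weight-sharing rule can be illustrated with a deliberately tiny case. The sketch below assumes a single tanh unit with one recurrent weight and an error defined only at the final time step; these simplifications and the names forward, bptt_grad_w_rec, w_rec and w_in are assumptions for illustration. It computes the gradient of the error itself, so negate the result to get the derivative of the negative error.

```python
import numpy as np

def forward(w_rec, w_in, xs, h0=0.0):
    """Unrolled scalar RNN: h(t) = tanh(w_rec * h(t-1) + w_in * x(t))."""
    hs = [h0]
    for x in xs:
        hs.append(np.tanh(w_rec * hs[-1] + w_in * x))
    return hs

def bptt_grad_w_rec(w_rec, w_in, xs, target):
    """Gradient of E = 0.5 * (h(T) - target)^2 with respect to w_rec."""
    hs = forward(w_rec, w_in, xs)
    dE_dh = hs[-1] - target                 # error at the final time step
    grad = 0.0
    for t in range(len(xs), 0, -1):         # backward pass through the copies
        delta = dE_dh * (1.0 - hs[t] ** 2)  # dE/dnet for the copy at step t
        grad += delta * hs[t - 1]           # partial for this copy of w_rec
        dE_dh = delta * w_rec               # pass the error one copy back
    return grad

xs = [0.5, -0.3, 0.8]
grad = bptt_grad_w_rec(0.9, 0.7, xs, target=0.2)
```

The running sum in grad is the point of the example: every unrolled copy of the weight contributes one partial, and the shared weight receives their total.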
Elman networks and their generalization, Back Propagation Through Time, both seek to
approximate the computation of a gradient based on all past inputs, while retaining the
standard back prop algorithm. BPTT has been used in a number of applications (e.g. ECG
modeling). The main task is to produce particular output sequences in response to specific
input sequences. The downside of BPTT is that it requires a large amount of storage,
computation, and training examples in order to work well. In the next section we will see how
we can compute the true temporal gradient using a method known as Real Time Recurrent
Learning.
7.6 Real Time Recurrent Learning
In deriving a gradient-based update rule for recurrent networks, we now leave network
connectivity almost completely unconstrained. We simply suppose that we have a set of input units,
$I = \{x_k(t),\ 0<k<m\}$, and a set of other units, $U = \{y_k(t),\ 0<k<n\}$, which can be hidden or output
units. To index an arbitrary unit in the network we can use
$$z_k(t) = \begin{cases} x_k(t) & \text{if } k \in I \\ y_k(t) & \text{if } k \in U \end{cases} \qquad (1)$$
Let W be the weight matrix with n rows and n+m columns, where $w_{i,j}$ is the weight to
unit i (which is in U) from unit j (which is in I or U). Units compute their activations in the
now familiar way, by first computing the weighted sum of their inputs:
$$\mathrm{net}_k(t) = \sum_{l \in U \cup I} w_{kl}\, z_l(t) \qquad (2)$$
where the only new element in the formula is the introduction of the temporal index t. Units
then compute some non-linear function of their net input
$$y_k(t+1) = f_k(\mathrm{net}_k(t)) \qquad (3)$$
Usually, both hidden and output units will have non-linear activation functions. Note that
external input at time t does not influence the output of any unit until time t+1. The network is
thus a discrete dynamical system.
Some of the units in U are output units, for which a target is defined. A target may not be
defined for every single input however. For example, if we are presenting a string to the
network to be classified as either grammatical or ungrammatical, we may provide a target
only for the last symbol in the string. In defining an error over the outputs, therefore, we need
to make the error time dependent too, so that it can be undefined (or 0) for an output unit for
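which no target exists at present. Let T(t) be the set of indices k in U for which there exists a
target value $d_k(t)$ at time t. We are forced to use the notation $d_k$ instead of t here, as t now refers
to time. Let the error at the output units be
$$e_k(t) = \begin{cases} d_k(t) - y_k(t) & \text{if } k \in T(t) \\ 0 & \text{otherwise} \end{cases} \qquad (4)$$
and define our error function for a single time step as
$$E(\tau) = \frac{1}{2} \sum_{k \in U} [e_k(\tau)]^2 \qquad (5)$$
The error function we wish to minimize is the sum of this error over all past steps of the
network
$$E_{total}(t_0, t_1) = \sum_{\tau = t_0+1}^{t_1} E(\tau) \qquad (6)$$
Now, because the total error is the sum of all previous errors and the error at this time step, so
also, the gradient of the total error is the sum of the gradient for this time step and the gradient
for previous steps
$$\nabla_W E_{total}(t_0, t+1) = \nabla_W E_{total}(t_0, t) + \nabla_W E(t+1) \qquad (7)$$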
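As a time series is presented to the network, we can accumulate the values of the gradient, or
equivalently, of the weight changes. We thus keep track of the value
$$\Delta w_{ij}(t) = -\mu \frac{\partial E(t)}{\partial w_{ij}} \qquad (8)$$
where $\mu$ is the learning rate. After the network has been presented with the whole series, we
alter each weight $w_{ij}$ by
$$\Delta w_{ij} = \sum_{t = t_0+1}^{t_1} \Delta w_{ij}(t) \qquad (9)$$
We therefore need an algorithm that computes
$$-\frac{\partial E(t)}{\partial w_{ij}} = \sum_{k \in U} e_k(t)\, \frac{\partial y_k(t)}{\partial w_{ij}} \qquad (10)$$
at each time step t. Since we know $e_k(t)$ at all times (the difference between our targets and
outputs), we only need to find a way to compute the second factor, $\partial y_k(t) / \partial w_{ij}$.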
IMPORTANT
The key to understanding RTRL is to appreciate what this factor expresses. It is essentially a measure of
the sensitivity of the value of the output of unit k at time t to a small change in the value of $w_{ij}$, taking
into account the effect of such a change in the weight over the entire network trajectory from $t_0$ to t.
Note that $w_{ij}$ does not have to be connected to unit k. Thus this algorithm is non-local, in that we need
to consider the effect of a change at one place in the network on the values computed at an entirely
different place. Make sure you understand this before you dive into the derivation given next.
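Derivation of $\partial y_k(t) / \partial w_{ij}$
This is given here for completeness, for those who wish perhaps to implement RTRL. Make
sure you at least know what role the factor $\partial y_k(t) / \partial w_{ij}$ plays in computing the gradient.
From Equations 2 and 3, we get
$$\frac{\partial y_k(t+1)}{\partial w_{ij}} = f_k'(\mathrm{net}_k(t)) \left[ \sum_{l \in U \cup I} w_{kl} \frac{\partial z_l(t)}{\partial w_{ij}} + \delta_{ik}\, z_j(t) \right] \qquad (11)$$
where $z_l(t)$ is the activation of unit l as indexed in Equation 1, and $\delta_{ik}$ is the Kronecker delta
$$\delta_{ik} = \begin{cases} 1 & \text{if } i = k \\ 0 & \text{otherwise} \end{cases} \qquad (12)$$
[Exercise: Derive Equation 11 from Equations 2 and 3]
Because input signals do not depend on the weights in the network,
$$\frac{\partial z_l(t)}{\partial w_{ij}} = 0 \quad \text{for } l \in I \qquad (13)$$
Equation 11 becomes:
$$\frac{\partial y_k(t+1)}{\partial w_{ij}} = f_k'(\mathrm{net}_k(t)) \left[ \sum_{l \in U} w_{kl} \frac{\partial y_l(t)}{\partial w_{ij}} + \delta_{ik}\, z_j(t) \right] \qquad (14)$$
This is a recursive equation. That is, if we know the value of the left hand side for time 0, we
can compute the value for time 1, and use that value to compute the value at time 2, etc.
Because we assume that our starting state (t = 0) is independent of the weights, we have
$$\frac{\partial y_k(t_0)}{\partial w_{ij}} = 0 \qquad (15)$$
These equations hold for all $k \in U$, $i \in U$ and $j \in U \cup I$.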
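We therefore need to define the values
$$p^k_{ij}(t) = \frac{\partial y_k(t)}{\partial w_{ij}} \qquad (16)$$
for every time step t and all appropriate i, j and k. We start with the initial condition
$$p^k_{ij}(t_0) = 0 \qquad (17)$$
and compute at each time step
$$p^k_{ij}(t+1) = f_k'(\mathrm{net}_k(t)) \left[ \sum_{l \in U} w_{kl}\, p^l_{ij}(t) + \delta_{ik}\, z_j(t) \right] \qquad (18)$$
where $z_j(t)$ is the activation of unit j as indexed in Equation 1. The algorithm then consists of
computing, at each time step t, the quantities $p^k_{ij}(t)$ using
equations 17 and 18, and then using the differences between targets and actual outputs to
compute weight changes
$$\Delta w_{ij}(t) = \mu \sum_{k \in U} e_k(t)\, p^k_{ij}(t) \qquad (19)$$
and the overall correction to be applied to $w_{ij}$ is given by
$$\Delta w_{ij} = \sum_{t = t_0+1}^{t_1} \Delta w_{ij}(t) \qquad (20)$$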
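For those who do want to implement RTRL, the recursion can be sketched as follows. This is a hedged illustration rather than a reference implementation: it assumes tanh for every activation function, orders the columns of W as [units in U | inputs in I], marks missing targets with NaN, and uses the names rtrl_run, mu, p and dW, none of which come from the notes.

```python
import numpy as np

def rtrl_run(W, xs, targets, mu=0.1):
    """Run RTRL over one sequence and return the accumulated weight change.

    W       : (n, n+m) weight matrix, columns ordered [U | I] (assumed)
    xs      : list of length-m input vectors x(t)
    targets : list of length-n target vectors for the next step;
              np.nan marks "no target defined" for that unit
    """
    n, nm = W.shape
    y = np.zeros(n)              # starting state, independent of the weights
    p = np.zeros((n, n, nm))     # p[k, i, j] = dy_k / dw_ij, initially zero
    dW = np.zeros_like(W)
    for x, d in zip(xs, targets):
        z = np.concatenate([y, x])          # all unit activations at time t
        y_next = np.tanh(W @ z)             # net input, then f = tanh
        fprime = 1.0 - y_next ** 2          # f'(net) for tanh
        # sum over l in U of w_kl * p^l_ij, for every (k, i, j)
        inner = np.einsum('kl,lij->kij', W[:, :n], p)
        # Kronecker delta term: add z_j wherever k == i
        inner[np.arange(n), np.arange(n), :] += z
        p = fprime[:, None, None] * inner   # sensitivities at time t+1
        y = y_next
        e = np.where(np.isnan(d), 0.0, d - y)       # zero error if no target
        dW += mu * np.einsum('k,kij->ij', e, p)     # accumulate weight change
    return dW

rng = np.random.default_rng(1)
n, m, T = 2, 1, 4                                   # hypothetical sizes
W = rng.normal(scale=0.5, size=(n, n + m))
xs = [rng.normal(size=m) for _ in range(T)]
targets = [np.array([np.nan, 0.5]) for _ in range(T)]  # target for unit 1 only
dW = rtrl_run(W, xs, targets, mu=1.0)
```

The array p holds one sensitivity for every (unit, weight) pair, so it has n * n * (n+m) entries. This makes the non-local character of the algorithm concrete: every weight's effect on every unit is tracked at every time step.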
7.7 Dynamics and RNNs
Consider the recurrent network illustrated below. A single input unit is connected to each of
the three "hidden" units. Each hidden unit in turn is connected to itself and the other hidden
units. As in the RTRL derivation, we do not distinguish now between hidden and output units.
Any activation which enters the network through the input node can flow around from one
unit to the other, potentially forever. Weights less than 1.0 will exponentially reduce the
activation, weights larger than 1.0 will cause it to increase. The non-linear activation functions
of the hidden units will hopefully prevent it from growing without bound.
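The effect of the recurrent weight is easy to see in the simplest possible case. The sketch below assumes a single unit connected only to itself, with no further input after an initial activation; the function name run_unit and the particular weights 0.5 and 1.5 are chosen purely for illustration.

```python
import numpy as np

def run_unit(w, steps=20, a0=0.5, squash=None):
    """Iterate a(t+1) = squash(w * a(t)) and return the whole trajectory."""
    a = a0
    trace = [a]
    for _ in range(steps):
        a = w * a
        if squash is not None:
            a = squash(a)
        trace.append(a)
    return np.array(trace)

decay = run_unit(0.5)                     # linear unit, weight below 1.0
growth = run_unit(1.5)                    # linear unit, weight above 1.0
bounded = run_unit(1.5, squash=np.tanh)   # same weight, tanh activation
```

With the linear unit the activation shrinks or grows geometrically. With the tanh unit it can never leave the interval (-1, 1), and in this particular case it settles at a fixed value.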
As we have three hidden units, their activation at any given time t describes a point in a
3-dimensional state space. We can visualize the temporal evolution of the network state by
plotting the trajectory of this point over time.
In the absence of input, or in the presence of a steady-state input, a network will usually
approach a fixed point attractor. Other behaviors are possible, however. Networks can be
trained to oscillate in regular fashion, and chaotic behavior has also been observed. The
development of architectures and algorithms to generate specific forms of dynamic behavior
is still an active research area.
Some limitations of gradient methods and RNNs
The simple recurrent network computed a gradient based on the present state of the network
and its state one time step ago. Using Back Prop Through Time, we could compute a gradient
based on some finite n time steps of network operation. RTRL provided a way of computing
the true gradient based on the complete network history from time 0 to the present. Is this
perfection?
Unfortunately not. With feedforward networks which have a large number of layers, the
weights which are closest to the output are the easiest to train. This is no surprise, as their
contribution to the network error is direct and easily measurable. Every time we back
propagate an error one layer further back, however, our estimate of the contribution of a
particular weight to the observed error becomes more indirect. You can think of error flowing
in the top of the network in distinct streams. Each back propagation step dilutes the error, mixing
up error from distinct sources, until, far back in the network, it becomes virtually impossible
to tell who is responsible for what. The error signal has become completely diluted.
With RTRL and BPTT we face a similar problem. Error is now propagated back in time, but
each time step is exactly equivalent to propagating through an additional layer of a feed
forward network. The result, of course, is that it becomes very difficult to assess the
importance of the network state at times which lie far back in the past. Typically, gradient
based networks cannot reliably use information which lies more than about 10 time steps in
the past. If you now imagine an attempt to use a recurrent neural network in a real life
situation, e.g. monitoring an industrial process, where data are presented as a time series at
some realistic sampling rate (say 100 Hz, so that 10 time steps span only 0.1 seconds), it
becomes clear that these networks are of limited use. The next section shows a recent model
which tries to address this problem.
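The dilution of the error signal over time can be made visible numerically. The following sketch assumes a small tanh network whose recurrent weight matrix has been rescaled to spectral norm 0.9; that scaling, the network size, and the variable names are assumptions chosen so the decay is guaranteed, and with much larger weights the same quantity can grow instead of shrink.

```python
import numpy as np

rng = np.random.default_rng(0)
n, steps = 10, 50                      # hypothetical network size and horizon
W = rng.normal(size=(n, n))
W *= 0.9 / np.linalg.norm(W, 2)        # rescale to spectral norm 0.9 (assumed)

h = rng.uniform(-1.0, 1.0, size=n)     # arbitrary starting state h(0)
J = np.eye(n)                          # Jacobian dh(t)/dh(0), identity at t = 0
norms = []
for _ in range(steps):
    h = np.tanh(W @ h)
    # chain rule through one more time step: dh(t)/dh(t-1) = diag(1 - h^2) W
    J = np.diag(1.0 - h ** 2) @ W @ J
    norms.append(np.linalg.norm(J, 2))
```

Each entry of norms measures how strongly the state t steps later still depends on the starting state. Under the assumption above it is bounded by 0.9 raised to the power t, so the influence of events far in the past shrinks at least geometrically.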
7.8 Long Short-Term Memory
In a recurrent network, information is stored in two distinct ways. The activations of the units
are a function of the recent history of the model, and so form a short-term memory. The
weights too form a memory, as they are modified based on experience, but the timescale of
the weight change is much slower than that of the activations. We call this a long-term
memory. The Long Short-Term Memory model [1] is an attempt to allow the unit activations
to retain important information over a much longer period of time than the 10 to 12 time steps
which is the limit of RTRL or BPTT models.
The figure below shows a maximally simple LSTM network, with a single input, a single
output, and a single memory block in place of the familiar hidden unit. Each block has two
associated gate units (details below). Each layer may, of course, have multiple units or blocks.
In a typical configuration, the first layer of weights is provided from input to the blocks and
gates. There are then recurrent connections from one block to other blocks and gates. Finally
there are weights from the blocks to the outputs. The next figure shows the memory block in
more detail.
The hidden units of a conventional recurrent neural network have now been replaced by
memory blocks, each of which contains one or more memory cells. At the heart of the cell is
a simple linear unit with a single self-recurrent connection with weight set to 1.0. In the
absence of any other input, this connection serves to preserve the cell's current state from
one moment to the next. In addition to the self-recurrent connection, cells receive input from
input units and other cells and gates. While the cells are responsible for maintaining
information over long periods of time, the responsibility for deciding what information to
store, and when to apply that information lies with an input and output gating unit,
respectively.
The input to the cell is passed through a non-linear squashing function (g(x), typically the
logistic function, scaled to lie within [-2,2]), and the result is then multiplied by the output of
the input gating unit. The activation of the gate ranges over [0,1], so if its activation is near
zero, nothing can enter the cell. Only if the input gate is sufficiently active is the signal
allowed in. Similarly, nothing emerges from the cell unless the output gate is active. As the
internal cell state is maintained in a linear unit, its activation range is unbounded, and so the
cell output is again squashed when it is released (h(x), typical range [-1,1]). The gates
themselves are nothing more than conventional units with sigmoidal activation functions
ranging over [0,1], and they each receive input from the network input units and from other
cells.
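Thus we have:
- Cell output: $y^{c_j}(t)$ is
$$y^{c_j}(t) = y^{out_j}(t)\, h(s_{c_j}(t))$$
- where $y^{out_j}(t)$ is the activation of the output gate, and the state $s_{c_j}(t)$ is given by
$$s_{c_j}(0) = 0, \quad \text{and}$$
$$s_{c_j}(t) = s_{c_j}(t-1) + y^{in_j}(t)\, g(\mathrm{net}_{c_j}(t)) \quad \text{for } t > 0,$$
with $y^{in_j}(t)$ the activation of the input gate.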
This division of responsibility---the input gates decide what to store, the cell stores
information, and the output gate decides when that information is to be applied---has the
effect that salient events can be remembered over arbitrarily long periods of time. Equipped
with several such memory blocks, the network can effectively attend to events at multiple
time scales.
Network training uses a combination of RTRL and BPTT, and we won't go into the details
here. However, consider an error signal being passed back from the output unit. If it is
allowed into the cell (as determined by the activation of the output gate), it is now trapped,
and it gets passed back through the self-recurrent connection indefinitely. It can only affect
the incoming weights, however, if it is allowed to pass by the input gate.
On selected problems, an LSTM network can retain information over arbitrarily long periods
of time; over 1000 time steps in some cases. This gives it a significant advantage over RTRL
and BPTT networks on many problems. For example, a Simple Recurrent Network can learn
the Reber Grammar, but not the Embedded Reber Grammar. An RTRL network can
sometimes, but not always, learn the Embedded Reber Grammar after about 100 000 training
sequences. LSTM always solves the Embedded problem, usually after about 10 000 sequence
presentations.
One of us is currently training LSTM networks to distinguish between different spoken
languages based on speech prosody (roughly: the melody and rhythm of speech).
References
[1] Hochreiter, Sepp and Schmidhuber, Juergen (1997). "Long Short-Term Memory". Neural
Computation, Vol. 9 (8), pp. 1735-1780.
Appendix
Summary of Linear Nets
Characteristics of Networks
- number of layers
- number of nodes per layer
- activation function (linear, binary, softmax)
- error function (mean squared error (MSE), cross entropy)
- type of learning algorithm (gradient descent, perceptron, delta rule)
Types of Applications and Associated Nets
- Regression:
o uses a one-layer linear network (activation function is identity)
o uses MSE cost function
o uses gradient descent learning
- Classification - Perceptron Learning
o uses a one-layer network with a binary step activation function
o uses MSE cost function
o uses the perceptron learning algorithm (identical with gradient descent when
targets are +1 and -1)
- Classification - Delta Rule
o uses a one-layer network with a linear activation function
o uses MSE cost function
o uses gradient descent
o the network chooses the class by picking the output node with the largest output
- Classification - Gradient Descent (the right way)
o uses a one-layer network with a softmax activation function
o uses the cross entropy error function
o outputs are interpreted as probabilities
o the network chooses the class with the highest probability
Modes of Learning for Gradient Descent
- Batch
o At each iteration, the gradient is computed by averaging over all inputs
- Online (stochastic)
o At each iteration, the gradient is estimated by picking one (or a small number) of
inputs.
o Because the gradient is only being estimated, there is a lot of noise in the weight
updates. The error comes down quickly but then tends to jiggle around. To
remove this noise one can either switch to batch at the point where the error levels
out, or continue to use online but decrease the learning rate (called
annealing the learning rate). One way of annealing is to use $\mu = \mu_0/t$, where $\mu_0$ is the
original learning rate and t is the number of timesteps after annealing is turned
on.
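A small worked case shows why a 1/t schedule removes the noise. The sketch assumes the simplest possible online problem, estimating a single value w from noisy samples with cost 0.5 * (x - w)^2, and an initial rate of 1.0; the data, the seed and the variable names are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
samples = 3.0 + rng.normal(size=1000)   # noisy observations of the value 3.0

mu0 = 1.0                               # original learning rate (assumed)
w = 0.0
for t, x in enumerate(samples, start=1):
    mu = mu0 / t                        # annealed learning rate
    w += mu * (x - w)                   # online step on E = 0.5 * (x - w)^2
```

With this particular choice the update reduces to a running average, so after the last sample w equals the mean of all samples seen. A fixed learning rate would instead keep w jiggling around that mean indefinitely.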
Picking Learning Rates
- Learning rates that are too big cause the algorithm to diverge
- Learning rates that are too small cause the algorithm to converge very slowly.
- The optimal learning rate for linear networks is $H^{-1}$, where H is the Hessian, which is
defined as the second derivative of the cost function with respect to the weights.
Unfortunately, this is a matrix whose inverse can be costly to compute.
- The best learning rate for batch is the inverse Hessian.
- More details if you are interested:
o The next best thing is to use a separate learning rate for each weight. If the
Hessian is diagonal these learning rates are just one over the eigenvalues of the
Hessian. Fat chance that the Hessian is diagonal though!
o If using a single scalar learning rate, then the best one to use is 1 over the largest
eigenvalue of the Hessian. There are fairly inexpensive algorithms for estimating
this. However, many people just use the ol' brute force method of picking the
learning rate - trial and error.
o For linear networks the Hessian is $\langle x x^T \rangle$ and is independent of the weights. For
nonlinear networks (i.e. any network that has an activation function that isn't the
identity), the Hessian depends on the value of the weights and so changes
every time the weights are updated - arrgh! That is why people love the trial and
error approach.
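For a linear network the recipe above can be tried directly. This sketch assumes a noise-free regression problem with two inputs of very different scale, batch gradient descent on the MSE cost, and a fixed number of 50 iterations; the data, the target weights w_true and the helper gradient_descent are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2)) * np.array([3.0, 0.5])  # unequal input scales
w_true = np.array([1.0, -2.0])                        # hypothetical target weights
d = X @ w_true                                        # noise-free targets

H = X.T @ X / len(X)               # Hessian of a linear net: <x x^T>
eigvals = np.linalg.eigvalsh(H)    # eigenvalues in ascending order
mu_best = 1.0 / eigvals[-1]        # one over the largest eigenvalue

def gradient_descent(mu, steps=50):
    """Batch gradient descent on the MSE; returns final distance to w_true."""
    w = np.zeros(2)
    for _ in range(steps):
        grad = -X.T @ (d - X @ w) / len(X)   # gradient of 0.5 * <(d - w.x)^2>
        w = w - mu * grad
    return np.linalg.norm(w - w_true)

dist_best = gradient_descent(mu_best)
dist_too_big = gradient_descent(2.5 * mu_best)   # beyond 2 / lambda_max
```

With mu_best the weights move steadily toward the target, although progress along the low-curvature direction is slow. With the larger rate the component along the steepest direction overshoots further on every step and the weights diverge.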
Limitations of Linear Networks
- For regression, we can only fit a straight line through the data points. Many problems
are not linear.
- For classification, we can only lay down linear boundaries between classes. This is
often inadequate for real world problems.
Summary of Nonlinear Networks and Applications
Backpropagation
- Implementing backprop
- characteristics of cost surfaces
Activation Functions
- linear
- threshold: binary, bipolar
- sigmoid: bipolar (symmetric), sigmoid
- softmax
Cost Functions
- Mean Squared Error (MSE)
- Cross Entropy
Improving Generalization
- using noise to improve learning, annealing
- what does it mean to overtrain?
- early stopping
- weight decay
- pruning (e.g. optimal brain damage)
Speed-up Techniques
- momentum
- delta-bar-delta
Unsupervised Learning
- Dimension Reduction for Compression using Autoassociative Networks
o Principal Component Analysis (PCA) using 3 layer nets
o Nonlinear PCA using 5-layer nets
- Clustering for Compression
- Kohonen's Self-Organizing Maps (SOMs)
Misc Terminology
- correlation matrix vs Hessian
- linear separability
- bias
- decision boundary
- clustering
- dimension reduction
- overtraining
Experimental Design
- What techniques would you use to understand the data? (graphing data, examining
correlation matrix, dimension reduction,...)
- What type of architecture would you use? (number of layers, number of nodes,
activation functions) Why?
- What learning algorithm would you use (speed-up technique)? Why?
- What do you do to ensure the net is trained adequately (but not overtrained)?