There are many methods for multivariate statistical analysis, function fitting and
prediction tasks, and ANNs represent only a small subset of these. From a statistical
modeling point of view, ANN models belong to the general class of non-parametric
methods, which make no assumption about the parametric form of the function they
model. In this sense they are more powerful than parametric methods, which try to fit
reality into a specific parametric form. However, non-parametric methods like ANNs
contain more free parameters and hence require more training data than parametric
ones in order to achieve good generalization performance (Geman et al., 1992)1. Due to
their generality, ANN methods also have some drawbacks, the most prominent being
long training times.
Working with the Matlab NN toolbox
Visit http://www.mathworks.com/products/neuralnet/index.html
The overall objective of training an MLP network on a prediction task is that the network
so designed will generalize: the input/output mapping computed by the network should be
correct (or nearly so) for test data.
The critical issue in developing a neural network is generalization: how well will the
network make predictions for cases that are not in the training set? NNs can suffer from
either underfitting or overfitting. A network that is not sufficiently complex can fail to
detect fully the signal in a complicated data set, leading to underfitting. A network that is
too complex may fit the noise, not just the signal, leading to overfitting.
A model designed to generalize well will produce a correct input-output mapping
even when the input is slightly different from the examples used to train the network.
When, however, a neural network learns too many input-output examples, it
may end up memorizing the training data. It may do so by finding a feature (e.g. noise)
1 Geman, S., Bienenstock, E. and Doursat, R., 1992. "Neural Networks and the Bias/Variance Dilemma", Neural Computation 4(1).
that is present in the training data but not true of the underlying function that is to be
modeled. Such a phenomenon is referred to as overfitting.
Overfitting is especially dangerous because, with many common types of NNs, it can
easily lead to predictions that are far outside the range of the training data. Overfitting
can produce wild predictions in multilayer perceptrons even with noise-free data.
The best way to avoid overfitting is to use plenty of training data. Given a fixed amount of
training data, there are various approaches to avoiding underfitting and overfitting, and
hence to obtaining good generalization: Model selection, Jittering, Early stopping, Network
pruning, etc.
Model Selection
The complexity of a network is related to both the number of weights and the size of the
weights. Model selection is concerned with the number of weights, and hence the number
of hidden units and layers. The more weights there are, relative to the number of training
cases, the more overfitting amplifies noise in the targets (Moody, 1992). The other
approaches are concerned, directly or indirectly, with the size of the weights.
A standard tool for selecting a network (model) from a set of candidate model structures
(parameterizations) is cross-validation. In the cross-validation approach to model selection,
the available data set is first randomly partitioned into a training set and a test set. The
training set is further partitioned into two disjoint subsets:
Estimation subset, used to select the model.
Validation subset, used to test or validate the model.
The motivation here is to validate the model on a data set different from the one used for
parameter estimation. In this way the training set can be used to assess the performance
of various candidate models and thereby choose the best one. There is, however, a
distinct possibility that the model with the best-performing parameter values so selected may
end up overfitting the validation subset. To guard against this possibility, the
generalization performance of the selected model is measured on the test set, which is
different from the validation subset. The objective is to select the model that
minimizes the generalization error. The validation data set is also used to determine the
termination point for training: it is not used directly in training, i.e. not
presented to the network, but is used indirectly to monitor the performance on unseen
data. Deteriorating performance on the validation set signals that the ANN is
overlearning the training data and that training should be stopped. Once training is
stopped, the test set can be used to estimate the generalization performance.
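The partitioning described above can be sketched in a few lines of Python (an illustrative sketch; the function name and split fractions are assumptions, not part of the source text):

```python
import numpy as np

def partition(n_samples, test_frac=0.2, val_frac=0.2, seed=0):
    """Randomly split sample indices into estimation, validation
    and test subsets.  The test set is held out first; the remaining
    training set is further split into an estimation subset (parameter
    fitting) and a validation subset (model selection / stopping)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    n_test = int(n_samples * test_frac)
    test, train = idx[:n_test], idx[n_test:]
    n_val = int(train.size * val_frac)
    val, est = train[:n_val], train[n_val:]
    return est, val, test

est, val, test = partition(100)   # 64 / 16 / 20 samples
```

The three subsets are disjoint by construction, so performance measured on `test` is untouched by both parameter estimation and model selection.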
Early stopping method of training
Here, training is performed by periodically pausing the training session. After each
period of training, the performance is measured on the validation data set, and
training is then resumed as needed. In cases where data is scarce and the use of a validation set
is too costly, one can instead use a threshold value on the training error.
The figure (Fig. 1) shows the conceptual forms of the two learning curves. The estimation
(training) learning curve decreases monotonically for an increasing number of epochs in
the usual manner. In contrast, the validation learning curve decreases monotonically to a
minimum and then starts to increase as training continues. In reality, what the network
is learning beyond the minimum point is essentially the noise contained in the training data.
This heuristic suggests that the minimum point on the validation learning curve be used
as a sensible criterion for stopping the training session (the early-stopping method of
training).
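The rule can be sketched in a few lines of Python (a hedged sketch: the patience parameter and the synthetic validation curve are illustrative assumptions):

```python
def early_stop(val_errors, patience=2):
    """Return the epoch index at the minimum of the validation curve,
    confirmed once `patience` consecutive epochs fail to improve it."""
    best_err, best_epoch, bad = float("inf"), 0, 0
    for epoch, err in enumerate(val_errors):
        if err < best_err:
            best_err, best_epoch, bad = err, epoch, 0
        else:
            bad += 1
            if bad >= patience:
                break          # validation error is rising: stop training
    return best_epoch

# Conceptual validation curve: falls to a minimum, then rises (overlearning).
curve = [0.90, 0.60, 0.40, 0.35, 0.38, 0.45, 0.60]
stop_at = early_stop(curve)    # -> 3, the epoch at the minimum
```

In a real training loop one would run this check after every epoch (or every few epochs) and restore the weights saved at epoch `stop_at`.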
Fig. 1 Illustration of the early-stopping rule based on cross-validation (Haykin, 1999)3
Jittering
Jittering is the technique of adding a noise term to the input data during training; it is
found empirically to improve network generalization. The noise smears out each data
point and makes it difficult for the network to fit the individual data points precisely,
and consequently reduces overfitting.
[Note: Detail is not covered].
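A minimal sketch of jittering, assuming zero-mean Gaussian noise (the noise level `sigma` is an illustrative choice, not from the source text):

```python
import numpy as np

def jitter(inputs, sigma=0.05, seed=None):
    """Return a noisy copy of the training inputs: x + N(0, sigma^2).
    Drawing fresh noise on every presentation smears each data point,
    so the network cannot fit any individual point exactly."""
    rng = np.random.default_rng(seed)
    return inputs + rng.normal(scale=sigma, size=inputs.shape)

x = np.zeros((4, 3))          # stand-in for a batch of training inputs
x_noisy = jitter(x, seed=1)   # same shape, slightly perturbed values
```

The noise level must be chosen with care: too little has no regularizing effect, while too much drowns the signal itself.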
Network Pruning
As the complexity of a system increases, modeling it with an ANN requires the
use of highly structured networks of rather large size. A practical issue that arises in
such situations is minimizing the size of the network while maintaining good
performance. A neural network of minimum size is less likely to learn the idiosyncrasies
or noise in the training data, and may thus generalize better to new data. One way of
achieving this design objective is Network Pruning.

3 Haykin, S., 1999. Neural Networks: A Comprehensive Foundation, Second Edition. Pearson Education.
In Network Pruning, one starts with a multilayer perceptron that has adequate
performance for the problem at hand, and the network is then pruned by weakening or
eliminating certain synaptic weights in a selective and orderly fashion.
Complexity-Regularization Network Pruning Approach
In the Complexity-Regularization approach, the total risk of the network model of a
complex system is expressed as:

R(w) = E_s(w) + λ E_c(w)    (1)

The term E_s(w) is the standard performance measure, which depends on both the network
model and the input data, e.g. the least-mean-square (LMS) error in the case of
back-propagation learning. The second term E_c(w) is the complexity penalty, which
depends on the network (model) alone. λ is a regularization parameter that represents the
relative importance of the complexity-penalty term with respect to the
performance-measure term.

When λ = 0, the network is completely determined by the training examples (input).
When λ → ∞, the complexity penalty by itself is sufficient to specify the network,
implying that the training examples are unreliable.
In a general setting, one choice of the complexity-penalty term E_c(w) is the kth-order
smoothing integral

E_c(w, k) = (1/2) ∫ ‖∂^k F(x, w)/∂x^k‖² ψ(x) dx    (2)

where F(x, w) is the input-output mapping performed by the model, and ψ(x) is some
weighting function that determines the region of the input space over which the function
F(x, w) is required to be smooth. The higher the value of k, the smoother (less complex)
the function F(x, w) will be.
There are three different complexity-regularization procedures:

a) Weight Decay
In the weight-decay procedure, the complexity penalty is defined as the squared
norm of the weight vector:

E_c(w) = ‖w‖² = Σ_i w_i²    (3)
This procedure operates by forcing some of the synaptic weights in the network to
take values close to zero while permitting other weights to retain their relatively
large values. Consequently, the weights of the network are grouped roughly into
two categories: those that have a large influence on the network (model), and
those that have little or no influence on it. The weights in the latter category are
referred to as excess weights. In the absence of complexity regularization, these excess
weights result in poor generalization by virtue of their high likelihood of taking
on completely arbitrary values, causing the network to overfit the data in order to
produce a slight reduction in the training error. The use of complexity
regularization encourages the excess weights to assume values close to zero and
thereby improves generalization.
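Eq. (3) combines with the total risk of Eq. (1) into a one-line penalty; the sketch below is illustrative (the λ value and the weight values are made up):

```python
import numpy as np

def total_risk(weights, standard_error, lam=0.01):
    """R(w) = E_s(w) + lambda * E_c(w), with the weight-decay penalty
    E_c(w) = ||w||^2 = sum_i w_i^2 of Eq. (3)."""
    penalty = float(np.sum(np.square(weights)))
    return standard_error + lam * penalty

w = np.array([0.5, -0.2, 3.0])             # mix of small and large weights
risk = total_risk(w, standard_error=0.10)  # 0.10 + 0.01 * 9.29
```

Minimizing this risk by gradient descent adds a term 2λw_i to each weight's gradient, which shrinks ("decays") every weight toward zero at each update.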
b) Weight Elimination

In the weight-elimination procedure of complexity regularization, the
complexity penalty is defined as

E_c(w) = Σ_i (w_i / w_0)² / (1 + (w_i / w_0)²)    (4)

where w_0 is a preassigned parameter, and w_i refers to the weight of synapse i
in the network.

When |w_i| >> w_0, the penalty term for that weight approaches 1.
When |w_i| << w_0, the penalty term for that weight approaches 0.
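Eq. (4) can be computed directly; the Python sketch below is illustrative (the weight values are made up):

```python
import numpy as np

def weight_elimination_penalty(weights, w0=1.0):
    """E_c(w) = sum_i (w_i/w0)^2 / (1 + (w_i/w0)^2), Eq. (4).
    Each term approaches 1 when |w_i| >> w0 and 0 when |w_i| << w0,
    so the penalty roughly counts the weights that matter."""
    r = np.square(np.asarray(weights, dtype=float) / w0)
    return float(np.sum(r / (1.0 + r)))

penalty = weight_elimination_penalty([100.0, 0.001])  # close to 1: one large weight
```

Unlike weight decay, which shrinks all weights uniformly, this penalty saturates for large weights, so important weights can stay large while near-zero weights are driven out.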
c) Approximate Smoother

In the approximate-smoother procedure, the complexity penalty is defined as

E_c(w) = Σ_j w_oj² ‖w_j‖^p    (5)

where w_oj are the weights in the output layer, and w_j is the weight vector of the jth
neuron in the hidden layer. The power p is defined by (6).