Lecturer: Primož Potočnik, University of Ljubljana, Faculty of Mechanical Engineering, Laboratory of Synergetics, www.neural.si, primoz.potocnik@fs.uni-lj.si, +386-1-4771-167
#1
TABLE OF CONTENTS
0. Organization of the Study
1. Introduction to Neural Networks
2. Neuron Model, Network Architectures and Learning
3. Perceptrons and linear filters
4. Backpropagation
5. Dynamic Networks
6. Radial Basis Function Networks
7. Self-Organizing Maps
8. Practical Considerations
#2
Learning outcomes
Understand the concept of nonparametric modelling by NN
Explain the most common NN architectures:
Feedforward networks
Dynamic networks
Radial Basis Function Networks
Self-organized networks
#4
2. Teaching methods
Teaching methods:
1. Lectures: 4 hours weekly, classical & practical (MATLAB)
   Tuesday 9:15 - 10:45, Friday 9:15 - 10:45
2. Homeworks: home projects
3. Consultations with the lecturer
Location
Institute for Sustainable Innovative Technologies, (Pot za Brdom 104, Ljubljana)
#5
3. Assessment
ECTS credits:
EURHEO (II): 6 ECTS
Final mark:
Homework ............ 50% of final mark
Written exam ........ 50% of final mark
Important dates
Homework presentations: Tue, 8 Jan 2013 and Fri, 11 Jan 2013
Written exam: Fri, 18 Jan 2013
#6
#7
4. Backpropagation
4.1 Multilayer feedforward networks
4.2 Backpropagation algorithm
4.3 Working with backpropagation
4.4 Advanced algorithms
4.5 Performance of multilayer perceptrons
#8
#9
#10
8. Practical considerations
8.1 Designing the training data
8.2 Preparing data
8.3 Selection of inputs
8.4 Data encoding
8.5 Principal component analysis
8.6 Invariances and prior knowledge
8.7 Generalization
#11
5. Books
1. Neural Networks and Learning Machines, 3/E, Simon Haykin (Pearson Education, 2009)
2. Neural Networks: A Comprehensive Foundation, 2/E, Simon Haykin (Pearson Education, 1999)
3. Neural Networks for Pattern Recognition, Chris M. Bishop (Oxford University Press, 1995)
4. Practical Neural Network Recipes in C++, Timothy Masters (Academic Press, 1993)
5. Advanced Algorithms for Neural Networks, Timothy Masters (John Wiley and Sons, 1995)
6. Signal and Image Processing with Neural Networks, Timothy Masters (John Wiley and Sons, 1994)
#12
6. SLO Books
1. Nevronske mreže, Andrej Dobnikar (Didakta, 1990)
2. Modeliranje dinamičnih sistemov z umetnimi nevronskimi mrežami in sorodnimi metodami, Juš Kocijan (Založba Univerze v Novi Gorici, 2007)
#13
7. E-Books (1/2)
List of links at www.neural.si
An Introduction to Neural Networks, Ben Kröse & Patrick van der Smagt, 1996
Recommended as an easy introduction
Neural Networks - Methodology and Applications, Gérard Dreyfus, 2005
Metaheuristic Procedures for Training Neural Networks, Enrique Alba & Rafael Martí (Eds.), 2006
FPGA Implementations of Neural Networks, Amos R. Omondi & Jagath C. Rajapakse (Eds.), 2006
Trends in Neural Computation, Ke Chen & Lipo Wang (Eds.), 2007
2012 Primož Potočnik NEURAL NETWORKS (0) Organization of the Study #14
7. E-Books (2/2)
Neural Preprocessing and Control of Reactive Walking Machines, Poramate Manoonpong, 2007
Artificial Neural Networks for the Modelling and Fault Diagnosis of Technical Processes, Krzysztof Patan, 2008
Speech, Audio, Image and Biomedical Signal Processing using Neural Networks [only two chapters], Bhanu Prasad & S.R. Mahadeva Prasanna (Eds.), 2008
#15
8. Online resources
List of links at www.neural.si
Neural FAQ, by Warren Sarle, 2002
How to measure importance of inputs, by Warren Sarle, 2000
MATLAB Neural Networks Toolbox (User's Guide), latest version
Artificial Neural Networks on Wikipedia.org
Neural Networks online book by StatSoft
Radial Basis Function Networks, by Mark Orr
Principal components analysis on Wikipedia.org
libsvm - Support Vector Machines library
#16
9. Simulations
Recommended computing platform
MATLAB R2010b (or later) & Neural Network Toolbox 7
http://www.mathworks.com/products/neuralnet/
Acceptable older MATLAB release: MATLAB 7.5 & Neural Network Toolbox 5.1 (Release 2007b)
Introduction to Matlab
Get familiar with MATLAB M-file programming Online documentation: Getting Started with MATLAB
#17
10. Homeworks
EURHEO students (II)
1. Practical oriented projects
2. Based on UC Irvine Machine Learning Repository data: http://archive.ics.uci.edu/ml/
3. Select a data set and discuss with the lecturer
4. Formulate the problem
5. Develop your solution (concept & Matlab code)
6. Describe the solution in a short report
7. Submit results (report & Matlab source code)
8. Present results and demonstrate the solution
Presentation (~10 min) Demonstration (~20 min)
#18
Video links
Robots with Biological Brains: Issues and Consequences, Kevin Warwick, University of Reading
http://videolectures.net/icannga2011_warwick_rbbi/
Computational Neurogenetic Modelling: Methods, Systems, Applications, Nikola Kasabov, University of Auckland
http://videolectures.net/icannga2011_kasabov_cnm/
#19
#20
10
#21
Artificial neurons
Simple mathematical approximations of biological neurons
#22
11
#23
Zurada (1992)
Artificial neural systems, or neural networks, are physical cellular systems which can acquire, store, and utilize experiential knowledge.
Pinkus (1999)
The question 'What is a neural network?' is ill-posed.
#24
12
0.1 mm
This complex network forms the nervous system, which relays information through the body
#25
Biological neuron
#26
13
Interaction of neurons
Action potentials arriving at the synapses stimulate currents in the neuron's dendrites. These currents depolarize the membrane at the axon, provoking an action potential. The action potential propagates down the axon to its synaptic knobs, releasing neurotransmitter and stimulating the post-synaptic neuron.
#27
Synapses
Elementary structural and functional units that mediate the interaction between neurons
Chemical synapse: pre-synaptic electrical signal → chemical neurotransmitter → post-synaptic electrical signal
#28
14
Action potential
Spikes or action potential
Neurons encode their outputs as a series of voltage pulses
The axon is very long, with high resistance & high capacitance
Frequency modulation improves the signal/noise ratio
#29
Stimulus → Receptors → neural net → Effectors → Response
Receptors
collect information from environment (photons on retina, tactile info, ...)
Effectors
generate interactions with the environment (muscle activation, ...)
Flow of information
feedforward & feedback
#30
15
Human brain
Human activity is regulated by a nervous system: Central nervous system
Brain Spinal cord
10^10 neurons in the brain
10^4 synapses per neuron
~1 ms processing speed of a neuron
Slow rate of operation, but an extreme number of processing units & interconnections: massive parallelism
#31
16
Network of neurons
#33
In practice
NN are especially useful for classification and function approximation problems that are tolerant of some imprecision. Almost any finite-dimensional vector function on a compact set can be approximated to arbitrary precision by a feedforward NN. They need a lot of training data, and suit problems where hard rules (such as those used in an expert system) are difficult to apply.
#34
17
2. Adaptivity
Neural networks have natural capability to adapt to the changing environment Train neural network, then retrain Continuous adaptation in nonstationary environment
#35
4. Fault tolerance
Capable of robust computation
Graceful degradation rather than catastrophic failure
#36
18
6. Neurobiological analogy
NN design is motivated by analogy with brain NN are research tool for neurobiologists Neurobiology inspires further development of artificial NN
#37
www.stanford.edu/group/brainsinsilicon/
#38
19
1943 McCulloch & Pitts: the idea of neural networks as computing machines
1949 Hebb: published his book The Organization of Behavior, introduced the Hebbian learning rule
1958 Rosenblatt: proposed the perceptron as the first supervised learning model
1969 Minsky & Papert: demonstrated the limits of single-layer perceptrons
#39
1982 Hopfield: published a series of papers on Hopfield networks
1982 Kohonen: developed the Self-Organising Maps
1990s Radial Basis Function Networks were developed
2000s The power of Ensembles of Neural Networks and Support Vector Machines becomes apparent
#40
20
Current NN research
Topics for the 2013 International Joint Conference on NN
Neural network theory and models Computational neuroscience Cognitive models Brain-machine interfaces Embodied robotics Evolutionary neural systems Self-monitoring neural systems Learning neural networks Neurodynamics Neuroinformatics Neuroengineering Neural hardware Neural network applications Pattern recognition Machine vision Collective intelligence Hybrid systems Self-aware systems Data mining Sensor networks Agent-based systems Computational biology Bioinformatics Artificial life
#41
Automotive
Automobile automatic guidance systems, warranty activity analyzers
Banking
Check and other document readers, credit application evaluators
Defense
Weapon steering, target tracking, object discrimination, facial recognition, new kinds of sensors, sonar, radar and image signal processing including data compression, feature extraction and noise suppression, signal/image identification
Electronics
Code sequence prediction, integrated circuit chip layout, process control, chip failure analysis, machine vision, voice synthesis, nonlinear modeling
#42
21
Manufacturing
Manufacturing process control, product design and analysis, process and machine diagnosis, real-time particle identification, visual quality inspection systems, welding quality analysis, paper quality prediction, computer chip quality analysis, analysis of grinding operations, chemical product design analysis, machine maintenance analysis, project planning and management, dynamic modelling of chemical process systems
Medical
Breast cancer cell analysis, EEG and ECG analysis, prothesis design, optimization of transplant times, hospital expense reduction, hospital quality improvement, emergency room test advisement
#43
Speech
Speech recognition, speech compression, vowel classification, text to speech synthesis
Securities
Market analysis, automatic bond rating, stock trading advisory systems
Telecommunications
Image and data compression, automated information services, real-time translation of spoken language, customer payment processing systems
Transportation
Truck brake diagnosis systems, vehicle scheduling, routing systems
#44
22
Notation:
iteration, time step ............ n
input ........................... p
network output .................. a
desired (target) output ......... t
activation function ............. f
induced local field ............. n
synaptic weight ................. w
bias ............................ b
error ........................... e
#45
#46
23
#47
Adjustable parameters
synaptic weight w bias b
y = f(wx + b)
#48
24
Weight vector
w = [w1, w2, ... wR ]
Activation potential:
v = wx + b ... inner product of the weight vector and the input vector, plus bias

Neuron output:
y = f(wx + b) = f(w1x1 + w2x2 + ... + wRxR + b)
#49
Threshold activation function:
y(v) = 1 if v ≥ 0, 0 if v < 0

Linear activation function:
y(v) = v

Logistic (sigmoid) activation function:
y(v) = 1 / (1 + exp(−v))
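These activation functions can be sketched in plain Python (an illustrative sketch with hypothetical names; the course itself uses MATLAB):

```python
import math

def neuron_output(x, w, b, activation="threshold"):
    """Single neuron: y = f(w.x + b)."""
    v = sum(wi * xi for wi, xi in zip(w, x)) + b   # induced local field
    if activation == "threshold":
        return 1 if v >= 0 else 0                  # hard limiter
    return 1.0 / (1.0 + math.exp(-v))              # logistic sigmoid
```

With w = [1, 1] and b = −1, the input [0.5, 0.5] gives v = 0, so the threshold neuron fires and the sigmoid neuron outputs 0.5.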
#50
25
#51
y(v) = sgn(wx + b)

The output is binary, depending on whether the input meets a specified threshold:
y = 1 if wx ≥ −b
y = 0 if wx < −b

y = f(wx + b)
#52
26
Matlab notation
Presentation of more complex neurons and networks
Input vector p is represented by the solid dark vertical bar. The weight vector is shown as a single-row, R-column matrix W. W [1 x R] and p [R x 1] multiply into the scalar Wp.
#53
Matlab Demos
nnd2n1 One input neuron nnd2n2 Two input neuron
#54
27
3. Recurrent networks
Contains at least one feedback loop
Powerful temporal learning capabilities
#55
#56
28
#57
#58
29
Delay
Feedback loop
#59
#60
30
Learning process
1. Neural network is stimulated by an environment
2. Neural network undergoes changes in its free parameters as a result of this stimulation
3. Neural network responds in a new way to the environment because of its changed internal structure
Learning algorithm
Prescribed set of defined rules for the solution of a learning problem
1. Error-correction learning
2. Memory-based learning
3. Hebbian learning
4. Competitive learning
#61
1. Neural network is driven by input x(t) and responds with output y(t)
2. Network output y(t) is compared with target output d(t)
Error signal = difference of network output and target output:
e(t) = y(t) − d(t)
#62
31
Instantaneous error energy:
E(t) = ½ e²(t)
Weight update (delta rule):
Δw(t) = η e(t) x(t)

Comments:
Error signal must be directly measurable
Key parameter: learning rate η
Closed-loop feedback system; stability determined by the learning rate
#63
Memory-based learning
All (or most) past experiences are stored in a memory of input-output pairs (inputs and target classes)
{(xi, yi)}, i = 1 ... N
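A minimal Python sketch of the memory-based idea (hypothetical names; a nearest-neighbor rule is the simplest instance of recalling from stored input-output pairs):

```python
def nearest_neighbor_classify(memory, x):
    """Memory-based learning: classify x by the label of its closest
    stored input. `memory` is a list of (input_vector, label) pairs."""
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    _, label = min(memory, key=lambda pair: dist2(pair[0], x))
    return label

# Assumed toy memory of two stored experiences
memory = [([0.0, 0.0], 0), ([1.0, 1.0], 1)]
```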
#64
32
Hebbian learning
The oldest and most famous learning rule (Hebb, 1949)
Formulated as associative learning in a neurobiological context
When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic changes take place in one or both cells such that A's efficiency, as one of the cells firing B, is increased.
Strong physiological evidence for Hebbian learning in hippocampus, important for long term memory and spatial navigation
Hebbian weight update:
Δw(t) = η y(t) x(t)
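The Hebbian update Δw(t) = η y(t) x(t) as a one-line Python sketch (illustrative; names are hypothetical):

```python
def hebbian_update(w, x, y, eta=0.1):
    """Hebb's rule: strengthen each weight by
    eta * (postsynaptic activity y) * (presynaptic activity x)."""
    return [wi + eta * y * xi for wi, xi in zip(w, x)]
```

Note that the rule has no error term: weights grow whenever pre- and post-synaptic activity coincide, which is why practical variants add normalization or decay.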
#65
Competitive learning
Inputs
Individual neurons specialize on ensembles of similar patterns and become feature detectors for different classes of input patterns
#66
33
Learning paradigm
Manner in which a neural network relates to its environment 1. Supervised learning 2. Unsupervised learning 3. Reinforcement learning
#67
Supervised learning
Learning with a teacher
Teacher has a knowledge of the environment Knowledge is represented by a set of input-output examples Environment Teacher
Target response = optimal action
+ Learning system
Error signal
Learning algorithm
Error-correction learning Memory-based learning
#68
34
Unsupervised learning
Unsupervised or self-organized learning
No external teacher to oversee the learning process Only a set of input examples is available, no output examples Learning system
Environment
Unsupervised NNs usually perform some kind of data compression, such as dimensionality reduction or clustering
Learning algorithms
Hebbian learning Competitive learning
#69
Reinforcement learning
No teacher, environment only offers primary reinforcement signal System learns under delayed reinforcement
Temporal sequence of inputs which result in the generation of a reinforcement signal
Goal is to minimize the expectation of the cumulative cost of actions taken over a sequence of steps RL is realized through two neural networks: Critic and Learning system
Primary reinforcement
Environment
Critic
Heuristic reinforcement
Critic network converts primary reinforcement signal (obtained directly from environment) into a higher quality heuristic reinforcement signal which solves temporal credit assignment problem
Actions
Learning system
#70
35
Autoassociation
Neural network stores a set of patterns by repeatedly presenting them to the network. Then, when presented with a distorted pattern, the neural network is able to recall the original pattern. Unsupervised learning algorithms.
Heteroassociation
Set of input patterns is paired with arbitrary set of output patterns Supervised learning algorithms
#71
#72
36
Neural network mapping F(x) can be realized by supervised learning (error-correction learning algorithm). Important function approximation tasks: system identification, inverse system.
#73
+ -
Inverse system
Inputs from the environment
#74
37
#75
#76
38
1. Filtering
Extraction of information at discrete time n by using measured data up to and including time n Examples: Cocktail party problem, Blind source separation
2. Smoothing
Differs from filtering in: a) Data need not be available at time n b) Data measured later than n can be used to obtain this information
3. Prediction
Deriving information about the quantity in the future at time n+h, h>0, by using data measured up to including n Example: Forecasting of energy consumption, stock market prediction
#77
#78
39
Adaptation
Learning has spatio-temporal nature
Space and time are fundamental dimensions of learning (control, beamforming)
1. Stationary environment
Learning under the supervision of a teacher, weights then frozen Neural network then relies on memory to exploit past experiences
2. Nonstationary environment
Statistical properties of environment change with time Neural network should continuously adapt its weights in real-time Adaptive system continuous learning
3. Pseudostationary environment
Changes are slow over a short temporal window
Speech stationary in interval 10-30 ms Ocean radar stationary in interval of several seconds
#79
#80
40
Knowledge representation
Good solution depends on a good representation of knowledge. Knowledge of the world consists of:
1. Prior information: facts about what is and what has been known
2. Observations of the world: measurements, obtained through sensors designed to probe the environment
Observations can be:
1. Labeled: input signals are paired with a desired response
2. Unlabeled: input signals only
#81
Knowledge representation in NN
Design of neural networks based directly on real-life data
Examples to train the neural network are taken from observations
#82
41
#84
42
Original
Size
Rotation
Shift
Incomplete image
Techniques
1. Invariance by neural network structure
2. Invariance by training
3. Invariant feature space
#85
Features
Characterize the essential information content of an input data Should be invariant to transformations of the input
Benefits
1. Dimensionality reduction: the number of features is small compared to the original input space
2. Relaxed design requirements for a neural network
3. Invariances for all objects can be assured (for known transformations); prior knowledge is required!
#86
43
Classifier design
Invariant feature extractor → neural network classifier → class estimate: A, B
Image representation
Grid of pixels (typically 256x256) with gray level [0..1] (typically 8-bit coding)
#87
Curse of dimensionality: increasing input dimensionality leads to sparse data, which provides a very poor representation of the mapping; problems with correct classification and generalization
Possible solution
Combining inputs into features Goal is to obtain just a few features instead of 65536 inputs
44
[Figure: decision boundary in the feature space (F1, F2)]
Neural network can be used for classification in the feature space (F1, F2)
2 inputs instead of 65536 original inputs Improved generalization and classification ability
#90
45
Optimal classifier ?
Best generalization is achieved by a model whose complexity is neither too small nor too large. Occam's razor principle: we should prefer simpler models to more complex models. Tradeoff: modeling simplicity vs. modeling capacity.
#91
Most NN that can learn to generalize effectively from noisy data are similar or identical to statistical methods
Single-layered feedforward nets are basically generalized linear models Two-layer feedforward nets are closely related to projection pursuit regression Probabilistic neural nets are identical to kernel discriminant analysis Kohonen nets for adaptive vector quantization are similar to k-means cluster analysis Kohonen self-organizing maps are discrete approximations to principal curves and surfaces Hebbian learning is closely related to principal component analysis
46
Statistical Jargon
Generalizing from noisy data .................................... Statistical inference Neuron, unit, node .................................................... A simple linear or nonlinear computing element that accepts one or more inputs and computes a function thereof Neural networks ....................................................... A class of flexible nonlinear regression and discriminant models, data reduction models, and nonlinear dynamical systems Architecture .............................................................. Model Training, Learning, Adaptation ................................. Estimation, Model fitting, Optimization Classification ............................................................ Discriminant analysis Mapping, Function approximation ............................ Regression Competitive learning ................................................. Cluster analysis Hebbian learning ...................................................... Principal components Training set ............................................................... Sample, Construction sample Input ......................................................................... Independent variables, Predictors, Regressors, Explanatory variables, Carriers Output ....................................................................... Predicted values Generalization .......................................................... Interpolation, Extrapolation, Prediction Prediction ................................................................. Forecasting
#94
47
MATLAB example
nn02_neuron_output
#95
MATLAB example
nn02_custom_nn
#96
48
MATLAB example
nnstart
#97
#98
49
#99
Introduction
Pioneering neural network contributions
McCulloch & Pitts (1943): the idea of neural networks as computing machines
Rosenblatt (1958): proposed the perceptron as the first supervised learning model
Widrow and Hoff (1961): least-mean-square learning as an important generalization of perceptron learning
Perceptron
Layer of McCulloch-Pitts neurons with adjustable synaptic weights
Simplest form of a neural network for classification of linearly separable patterns
Perceptron convergence theorem for two linearly separable classes
Adaline
Similar to perceptron, trained with LMS learning Used for linear adaptive filters
#100
50
Threshold activation:
y(v) = 1 if v ≥ 0, 0 if v < 0

y = f(wx + b) = f(w1x1 + w2x2 + b)

Geometric representation:
Decision boundary is the line w1x1 + w2x2 + b = 0, i.e. x2 = −(w1/w2) x1 − b/w2
#102
51
#103
Perceptron learning rule:
wj(n+1) = wj(n) + e(n) xj(n)
b(n+1) = b(n) + e(n)
#104
52
Training set: (xi, di), di ∈ {0, 1}
Objective:
Reduce error e between target class d and neuron response y (error-correction learning)
e=d-y
Learning procedure
1. Start with random weights for the connections
2. Present an input vector xi from the set of training samples
3. If the perceptron response is wrong (y ≠ d, e ≠ 0), modify all connections w
4. Go back to 2
#105
CASE 3: Neuron output is 1 instead of 0 (y=1, d=0, e=d-y=-1) Input x is subtracted from weight vector w
This makes the weight vector point farther away from the input vector, increasing the chance that the input vector will be classified as a 0 in the future.
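The learning procedure above can be sketched as a Python loop (illustrative code with hypothetical names; trained here on the linearly separable AND problem as an assumed toy example):

```python
def perceptron_predict(w, b, x):
    """Threshold neuron: y = 1 if w.x + b >= 0 else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else 0

def train_perceptron(samples, n_inputs, epochs=20):
    """Perceptron learning rule: w <- w + e*x, b <- b + e, with e = d - y.
    Converges only for linearly separable problems (convergence theorem)."""
    w = [0.0] * n_inputs
    b = 0.0
    for _ in range(epochs):
        for x, d in samples:
            e = d - perceptron_predict(w, b, x)   # error-correction signal
            if e != 0:                            # wrong response: adapt
                w = [wi + e * xi for wi, xi in zip(w, x)]
                b += e
    return w, b

# Linearly separable AND problem (assumed toy data)
and_data = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
```

Note how the sign of e implements both cases from the slides: e = +1 adds the input to the weight vector, e = −1 subtracts it.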
#106
53
#107
Convergence theorem
For the perceptron learning rule there exists a convergence theorem:
Theorem 1: If there exists a set of connection weights w which is able to perform the transformation d = y(x), the perceptron learning rule will converge to some solution in a finite number of steps for any initial choice of the weights.
Comments
Theorem is only valid for linearly separable classes. Outliers can cause long training times. If classes are linearly separable, the perceptron offers a powerful pattern recognition tool.
#108
54
#109
#110
55
#111
#112
56
#113
#114
57
#116
58
3.4 Adaline
ADALINE = ADAptive LINear Element
Widrow and Hoff, 1961: LMS (least mean square) learning, also called the Delta rule
Important generalization of the perceptron learning rule
Main difference from the perceptron: the activation function
Perceptron: Threshold activation function ADALINE: Linear activation function
Both Perceptron and ADALINE can only solve linearly separable problems
#117
Linear neuron
Basic ADALINE element
Linear transfer function
y = wx + b
y(v) = v
#118
59
Simple ADALINE
Simple ADALINE with two inputs
y = f(wx + b) = w1x1 + w2x2 + b
Decision boundary: w1x1 + w2x2 + b = 0 (see Perceptron decision boundary)
Training set: (xi, di)
Objective: reduce the error e between target class d and neuron response y (error-correction learning):
e(n) = d(n) − y(n),  n = 1 ... N
#120
60
Idea of LMS learning: express the mean square error over N training samples
mse = (1/N) Σ_{n=1..N} e²(n),  e(n) = d(n) − y(n)
and change the network weights proportional to the negative derivative of the error
Δwj(n) ∝ −∂e²(n)/∂wj

∂e²(n)/∂wj = 2 e(n) ∂e(n)/∂wj = 2 e(n) ∂(d(n) − y(n))/∂wj = −2 e(n) xj(n)

and we finally obtain the weight change at step n:

Δwj(n) = η e(n) xj(n)
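The resulting delta rule Δwj(n) = η e(n) xj(n) can be sketched in Python (an illustration; the linear target below is an assumed toy example, not from the slides):

```python
def lms_train(samples, n_inputs, eta=0.2, epochs=200):
    """ADALINE / LMS (delta) rule: linear output y = w.x + b,
    weights nudged along the negative error gradient: w <- w + eta*e*x."""
    w = [0.0] * n_inputs
    b = 0.0
    for _ in range(epochs):
        for x, d in samples:
            y = sum(wi * xi for wi, xi in zip(w, x)) + b  # linear activation
            e = d - y                                     # error signal
            w = [wi + eta * e * xi for wi, xi in zip(w, x)]
            b += eta * e
    return w, b

# Assumed toy linear target: d = 2*x1 - x2 + 0.5
data = [([x1, x2], 2 * x1 - x2 + 0.5) for x1 in (0.0, 1.0) for x2 in (0.0, 1.0)]
```

Unlike the perceptron rule, the update fires on every sample, since e is a continuous residual rather than a binary mistake indicator.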
#122
61
Learning is regulated by the learning rate η. For stable learning, the learning rate must be less than the reciprocal of the largest eigenvalue of the correlation matrix xᵀx of the input vectors.
Limitations
Linear network can only learn linear input-output mappings Proper selection of learning rate
#123
#124
62
#125
Learning rules
LMS learning:
wj(n+1) = wj(n) + η e(n) xj(n),  b(n+1) = b(n) + η e(n)
Perceptron learning:
wj(n+1) = wj(n) + e(n) xj(n),  b(n+1) = b(n) + e(n)
#126
63
Input
#128
64
Adaptive filter
Adaptive filter = ADALINE combined with TDL
a(k) = Wp + b = Σi wi p(k − i + 1) + b
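The tapped-delay-line output a(k) = Σi wi p(k − i + 1) + b can be sketched in Python with 0-based indexing (a hypothetical helper; taps before the start of the sequence are assumed to read zero, i.e. zero initial state):

```python
def adaptive_filter_output(p, w, b, k):
    """ADALINE combined with a tapped delay line (TDL):
    tap i reads the input delayed by i time steps."""
    a = b
    for i, wi in enumerate(w):
        t = k - i                    # delayed time index
        if 0 <= t < len(p):          # taps outside the sequence contribute 0
            a += wi * p[t]
    return a
```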
#129
a(t )
#130
65
Operation
p(t-2) p(t-1) p(t) p(t+1) Time
Learning
#131
#132
66
#133
#134
67
#135
Discriminant function:
w1x1 + w2x2 + b = 0, i.e. x2 = −(w1/w2) x1 − b/w2
#136
68
XOR solution
Extending single-layer perceptron to multi-layer perceptron by introducing hidden units
[Network diagram: inputs x1 and x2 feed a hidden unit and the output unit through weights w1,1, w1,2, w2,1, w2,2, w2,3 with biases b1, b2, realizing XOR]
The XOR problem can be solved, but we no longer have a learning rule to train the network. Multilayer perceptrons can do everything. But how to train them?
#137
Homework
Create a two-layer perceptron to solve XOR problem
Create a custom network Demonstrate solution
#138
69
4. Backpropagation
Multilayer feedforward networks Backpropagation algorithm Working with backpropagation Advanced algorithms Performance of multilayer perceptrons
#139
Introduction
Single-layer networks have severe restrictions
Only linearly separable tasks can be solved
Werbos (1974), Parker (1985), Le Cun (1985), Rumelhart (1986)
Solved the problem of training multi-layer networks by back-propagating the output errors through the hidden layers of the network
70
f(v) = 1 / (1 + exp(−v))
3. Massive connectivity
Neurons in successive layers are fully interconnected
#142
71
Matlab demo
nnd11nf Response of the feedforward network with one hidden layer
#143
About backpropagation
Multilayer perceptrons can be trained by the backpropagation learning rule
Based on the error-correction learning rule
A generalization of the LMS learning rule (used to train ADALINE)
2. Backward pass
Error signal is propagated backwards from output to input Synaptic weights are adjusted according to the error gradient
#144
72
Training set: {(xn, dn)}, n = 1 ... N
Instantaneous error energy: E(n) = ½ Σj ej²(n)
Average error energy: E_av = (1/N) Σ_{n=1..N} E(n)
#145
Learning objective is to minimize average error energy E by minimizing free network parameters We use an approximation: pattern-by-pattern learning instead of epoch learning
Parameter adjustments are made for each pattern presented to the network Minimizing instantaneous error energy at each step instead of average error energy
#146
73
Neuron j: yj = f(vj), error ej = dj − yj
Error energy: E = ½ Σ_{j=1..R} ej²
#147
∂E(n)/∂wji(n) = ∂E(n)/∂ej(n) · ∂ej(n)/∂yj(n) · ∂yj(n)/∂vj(n) · ∂vj(n)/∂wji(n)

yj(n) = f(vj(n)),  vj(n) = Σ_{i=0..R} wji(n) yi(n),  ∂vj(n)/∂wji(n) = yi(n)
#148
74
Weight correction (delta rule):
Δwji(n) = η δj(n) yi(n)
(η ... learning rate, δj(n) ... local gradient)
#149
∂E(n)/∂wji(n) = −δj(n) yi(n)

Local gradient:
δj(n) = −∂E(n)/∂vj(n) = −∂E(n)/∂yj(n) · ∂yj(n)/∂vj(n) = −∂E(n)/∂yj(n) · f′(vj(n))
#150
75
For a hidden neuron j, the error is formed at the output neurons k:
E(n) = ½ Σk ek²(n),  ek(n) = dk(n) − yk(n) = dk(n) − f(vk(n))
vk(n) = Σ_{j=0..M} wkj(n) yj(n)
∂E(n)/∂yj(n) = Σk ek(n) ∂ek(n)/∂yj(n) = −Σk ek(n) f′(vk(n)) wkj(n) = −Σk δk(n) wkj(n)
#151
Local gradient of hidden neuron j:
δj(n) = −∂E(n)/∂yj(n) · f′(vj(n)) = f′(vj(n)) Σk δk(n) wkj(n)
#152
76
Generalized delta rule:
Δwji(n) = η δj(n) yi(n)
(weight correction = learning rate × local gradient × input of neuron j)

Hidden-layer local gradient: δj(n) = f′(vj(n)) Σk δk(n) wkj(n)
#153
1. Forward pass
Input signals propagate through the network layer by layer, with synaptic weights fixed:
yj = f(Σi wji xi),  yk = f(Σj wkj yj)
2. Backward pass
Recursive computing of local gradients:
Output local gradients: δk(n) = ek(n) f′(vk(n))
Hidden layer local gradients: δj(n) = f′(vj(n)) Σk δk(n) wkj(n)
Weight updates: Δwkj(n) = η δk(n) yj(n),  Δwji(n) = η δj(n) xi(n)
#154
77
3. Forward pass
Propagate training sample from network input to the output Calculate the error signal
4. Backward pass
Recursive computation of local gradients from output layer toward input layer Adaptation of synaptic weights according to generalized delta rule
5. Iteration
Iterate steps 2-4 until stopping criterion is met
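The forward and backward passes can be sketched for a one-hidden-layer logistic network in Python (an illustrative pattern-by-pattern implementation with hypothetical names; not the MATLAB toolbox code):

```python
import math

def forward(net, x):
    """Forward pass through a one-hidden-layer logistic MLP.
    net = (W1, b1, W2, b2); returns hidden and output activations."""
    W1, b1, W2, b2 = net
    v1 = [sum(w * xi for w, xi in zip(row, x)) + bb for row, bb in zip(W1, b1)]
    y1 = [1.0 / (1.0 + math.exp(-v)) for v in v1]
    v2 = [sum(w * h for w, h in zip(row, y1)) + bb for row, bb in zip(W2, b2)]
    y2 = [1.0 / (1.0 + math.exp(-v)) for v in v2]
    return y1, y2

def backprop_step(net, x, d, eta=0.1):
    """One pattern-by-pattern update (generalized delta rule).
    Output deltas: e * f'(v); hidden deltas: f'(v) * sum_k delta_k * w_kj,
    with the logistic derivative written as f'(v) = y * (1 - y)."""
    W1, b1, W2, b2 = net
    y1, y2 = forward(net, x)
    d2 = [(dk - yk) * yk * (1.0 - yk) for dk, yk in zip(d, y2)]
    d1 = [y1[j] * (1.0 - y1[j]) * sum(d2[k] * W2[k][j] for k in range(len(W2)))
          for j in range(len(y1))]
    for k in range(len(W2)):              # output layer: dw = eta * delta_k * y_j
        for j in range(len(y1)):
            W2[k][j] += eta * d2[k] * y1[j]
        b2[k] += eta * d2[k]
    for j in range(len(W1)):              # hidden layer: dw = eta * delta_j * x_i
        for i in range(len(x)):
            W1[j][i] += eta * d1[j] * x[i]
        b1[j] += eta * d1[j]

def pattern_error(net, x, d):
    """Instantaneous error energy (up to the 1/2 factor) for one pattern."""
    _, y2 = forward(net, x)
    return sum((dk - yk) ** 2 for dk, yk in zip(d, y2))
```

For a small learning rate, each step moves the weights down the instantaneous error gradient, so the error on the presented pattern decreases.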
#155
Matlab demo
nnd11bc Backpropagation calculation
#156
78
Matlab demo
nnd12sd1 Steepest descent
#157
For a linear output neuron: f(v(n)) = v(n), f′(v(n)) = 1
Backpropagation rule:
Δwi(n) = η δ(n) yi(n),  yi = xi
δ(n) = e(n) f′(v(n)) = e(n)
Δwi(n) = η e(n) xi(n)
79
#159
Batch training
Weight updating after the presentation of a complete epoch
Sequential training
Weight updating after the presentation of each training example Stochastic nature of learning, faster convergence Important practical reasons for sequential learning:
Algorithm is easy to implement Provides effective solution to large and difficult problems
Therefore sequential training is preferred training mode Good practice is random order of presentation of training examples
#160
80
Activation function
Derivative of the activation function, f′(vj(n)), is required for the computation of local gradients
Only requirement for activation function: differentiability Commonly used: logistic function
f(vj(n)) = 1 / (1 + exp(−a vj(n))),  a > 0
f′(vj(n)) = a yj(n) [1 − yj(n)]
Local gradient can be calculated without explicit knowledge of the activation function
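The identity f′(v) = a y (1 − y) can be checked numerically in a short sketch (illustrative helper names):

```python
import math

def logistic(v, a=1.0):
    """Logistic activation f(v) = 1 / (1 + exp(-a*v))."""
    return 1.0 / (1.0 + math.exp(-a * v))

def logistic_deriv_from_output(y, a=1.0):
    """f'(v) = a * y * (1 - y): the derivative computed from the neuron's
    output alone, without evaluating the activation function again."""
    return a * y * (1.0 - y)
```

This is exactly why backpropagation code stores the layer outputs: the derivative falls out of values already computed in the forward pass.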
#161
Example: approximating a periodic function by a sum of terms ck sin(kx + ...)
Equivalent to traditional Fourier analysis; a network with sin() activation functions can be trained by backpropagation
#162
81
Learning rate
Learning procedure requires
Change in the weight space to be proportional to error gradient True gradient descent requires infinitesimal steps
Learning in practice
Δwji(n) = η δj(n) yi(n), where the factor of proportionality η is the learning rate
Choose a learning rate as large as possible without leading to oscillations
#163
Stopping criteria
Generally, backpropagation cannot be shown to converge
No well defined criteria for stopping its operation
#164
82
2. Activation function
Faster learning with antisymmetric sigmoid activation functions; a popular choice is the hyperbolic tangent f(v) = a tanh(b v)
#165
4. Preprocessing inputs
a) Normalizing mean to zero b) Decorrelating input variables (by using principal component analysis) c) Scaling input variables (variances should be approx. equal)
Original
a) Zero mean
b) Decorrelated
c) Equalized variance
#166
83
#167
Generalization
Neural network is able to generalize:
Input-output mapping computed by the network is correct for test data
Test data were not used during training Test data are from the same population as training data
Correct response even if the input is slightly different from the training examples
Overfitting
Good generalization
#168
84
Improving generalization
Methods to improve generalization
1. Keeping the network small 2. Early stopping 3. Regularization
Early stopping
Available data are divided into three sets:
1. Training set used to train the network 2. Validation set used for early stopping, when the error starts to increase 3. Test set used for final estimation of network performance and for comparison of various models
Early stopping
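The early-stopping policy above can be sketched as a validation-watching loop (a hypothetical callback-style helper; the `patience` parameter is an assumption of this sketch, not from the slides):

```python
def train_with_early_stopping(max_steps, val_error, patience=3):
    """Stop when the validation error has not improved for `patience`
    consecutive checks; return the best step and its validation error.
    `val_error(step)` is a hypothetical callback that performs one
    training step and returns the current validation-set error."""
    best_err, best_step, waited = float("inf"), 0, 0
    for step in range(max_steps):
        err = val_error(step)
        if err < best_err:
            best_err, best_step, waited = err, step, 0
        else:
            waited += 1
            if waited >= patience:
                break          # validation error keeps increasing: stop
    return best_step, best_err
```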
#169
Regularization
Improving generalization by regularization
Modifying performance function
mse = (1/N) Σ_{n=1..N} (dj(n) − yj(n))²
msw = (1/M) Σ_{j=1..M} wj²
msereg = γ mse + (1 − γ) msw
Using this performance function, network will have smaller weights and biases, and this forces the network response to be smoother and less likely to overfit
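The regularized index can be sketched as an illustrative helper (hypothetical names):

```python
def msereg(errors, weights, gamma=0.9):
    """Regularized performance index: msereg = gamma*mse + (1-gamma)*msw,
    where mse is the mean squared error and msw the mean squared weight."""
    mse = sum(e * e for e in errors) / len(errors)
    msw = sum(w * w for w in weights) / len(weights)
    return gamma * mse + (1.0 - gamma) * msw
```

With gamma = 1 the index reduces to the plain mse; smaller gamma penalizes large weights more strongly and so favors smoother network responses.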
#170
85
Deficiencies of backpropagation
Some properties of backpropagation do not guarantee the algorithm to be universally useful: 1. Long training process
Possibly due to non-optimum learning rate (advanced algorithms address this problem)
2. Network paralysis
Combination of sigmoidal activation and very large weights can decrease gradients almost to zero, so training almost stops
3. Local minima
Error surface of a complex network can be very complex, with many hills and valleys Gradient methods can get trapped in local minima Solutions: probabilistic learning methods (simulated annealing, ...)
#171
#172
86
Momentum
A simple method of increasing learning rate yet avoiding the danger of instability Modified delta rule by adding momentum term
Δwji(n) = η δj(n) yi(n) + α Δwji(n−1)
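The momentum-modified delta rule as a one-line Python sketch (illustrative; `grad_term` stands for the product δj(n)·yi(n)):

```python
def momentum_update(prev_delta_w, grad_term, eta=0.1, alpha=0.9):
    """Delta rule with momentum:
    delta_w(n) = eta * delta_j(n) * y_i(n) + alpha * delta_w(n-1)."""
    return eta * grad_term + alpha * prev_delta_w
```

For a persistently constant gradient the update accumulates toward eta*grad/(1 − alpha), an effectively 10x larger step here, while sign-alternating gradients largely cancel, which damps oscillations.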
#173
Adaptive learning rate heuristic: if the error increases by more than 4%, discard the new weights and multiply the learning rate by 0.7; otherwise accept the weights and multiply the learning rate by 1.05.
#174
87
Resilient backpropagation
Slope of sigmoid functions approaches zero as the input gets large
This causes a problem when using steepest descent to train a network: the gradient can have a very small magnitude, so changes in weights are small, even though the weights are far from their optimal values
Resilient backpropagation
Eliminates these harmful effects of the magnitudes of the partial derivatives Only sign of the derivative is used to determine the direction of weight update, size of the weight change is determined by a separate update value Resilient backpropagation rules:
1. Update value for each weight and bias is increased by a factor inc if derivative of the performance function with respect to that weight has the same sign for two successive iterations 2. Update value is decreased by a factor dec if derivative with respect to that weight changes sign from the previous iteration 3. If derivative is zero, then the update value remains the same 4. If weights are oscillating, the weight change is reduced
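The rules above can be sketched for a single weight (an illustrative sketch; the increase/decrease factors and step bounds are assumed typical values, not from the slides):

```python
def rprop_step(step, grad, prev_grad, inc=1.2, dec=0.5,
               step_min=1e-6, step_max=50.0):
    """One resilient-backpropagation update for a single weight:
    only the *sign* of the gradient sets the direction; the step
    size adapts by inc/dec factors depending on sign agreement."""
    if grad * prev_grad > 0:        # same sign twice: accelerate
        step = min(step * inc, step_max)
    elif grad * prev_grad < 0:      # sign change: overshoot, back off
        step = max(step * dec, step_min)
    # zero derivative: step size stays the same
    if grad > 0:
        delta_w = -step             # move against the gradient
    elif grad < 0:
        delta_w = step
    else:
        delta_w = 0.0
    return step, delta_w
```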
#175
[Figure: error surface E(w(n)) = E(w1, w2) over the weight space]
#176
88
Second-order methods expand the error E(w) around the current point w(n):
Local gradient: g(n) = ∂E(w)/∂w evaluated at w = w(n)
Hessian matrix: H(n) = ∂²E(w)/∂w² evaluated at w = w(n)
#177
Gradient descent: Δw(n) = −η g(n)
Newton's method: Δw(n) = −H⁻¹(n) g(n)
#178
89
Quasi-Newton algorithms
Problems with the calculation of Hessian matrix
Inverse Hessian H⁻¹ is required, which is computationally expensive
Hessian has to be nonsingular, which is not guaranteed
Hessian of a neural network can be rank deficient
No convergence guarantee for non-quadratic error surfaces
Quasi-Newton method
Only requires calculation of the gradient vector g(n) The method estimates the inverse Hessian directly without matrix inversion Quasi-Newton variants:
Davidon-Fletcher-Powell algorithm Broyden-Fletcher-Goldfarb-Shanno algorithm ... best form of Quasi-Newton algorithm!
#179
#180
90
Levenberg-Marquardt algorithm
Levenberg-Marquardt algorithm (LM)
Like the quasi-Newton methods, LM algorithm was designed to approach second-order training speed without having to compute the Hessian matrix When the performance function has the form of a sum of squares (typical in neural network training), then the Hessian matrix H can be approximated by Jacobian matrix J
H ≈ Jᵀ J
where Jacobian matrix contains first derivatives of the network errors with respect to the weights Jacobian can be computed through a standard backpropagation technique that is much less complex than computing the Hessian matrix
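One damped Gauss-Newton (Levenberg-Marquardt) step can be sketched in plain Python (illustrative helper; the damping parameter mu and the small Gaussian-elimination solver, without pivoting, are assumptions of this sketch):

```python
def lm_step(J, e, mu=0.01):
    """One Levenberg-Marquardt step: dw = -(J^T J + mu*I)^(-1) J^T e,
    using J^T J as the Gauss-Newton approximation of the Hessian.
    J has one row per error term, one column per weight."""
    n = len(J[0])
    # A = J^T J + mu*I (approximate, damped Hessian), g = J^T e (gradient)
    A = [[sum(J[r][i] * J[r][j] for r in range(len(J))) + (mu if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    g = [sum(J[r][i] * e[r] for r in range(len(J))) for i in range(n)]
    b = [-gi for gi in g]
    for i in range(n):                     # forward elimination (sketch, no pivoting)
        for r in range(i + 1, n):
            f = A[r][i] / A[i][i]
            A[r] = [arj - f * aij for arj, aij in zip(A[r], A[i])]
            b[r] -= f * b[i]
    dw = [0.0] * n
    for i in range(n - 1, -1, -1):         # back substitution
        dw[i] = (b[i] - sum(A[i][j] * dw[j] for j in range(i + 1, n))) / A[i][i]
    return dw
```

For a linear model the errors are linear in the weights, so a single step with mu → 0 lands exactly on the least-squares solution; larger mu blends toward a small gradient-descent step.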
#181
#182
91
#183
#184
92
A learning set with 4 samples has small training error but gives very poor generalization. A learning set with 20 samples has higher training error but generalizes well. Low training error is no guarantee of good network performance!
#185
A large number of hidden units leads to a small training error but not necessarily to a small test error. Adding hidden units always leads to a reduction of the training error. However, adding hidden units will first lead to a reduction of the test error but then to an increase of the test error (peaking effect; early stopping can be applied).
#186
93
#187
Matlab demo
nnd11fa Function approximation, variable number of hidden units
#188
94
Matlab demo
nnd11gn Generalization, variable number of hidden units
#189
#190
95