
Artificial Neural Networks

Dan Simon
Cleveland State University

1
Neural Networks
Artificial Neural Network (ANN): An
information processing paradigm that is
inspired by biological neurons
Distinctive structure: Large number of
simple, highly interconnected processing
elements (neurons); parallel processing
Inductive learning, that is, learning by
example; an ANN is configured for a
specific application through a learning
process
Learning involves adjustments to
connections between the neurons
2
Inductive Learning
Sometimes we can't explain how we know
something; we rely on our experience
An ANN can generalize from expert
knowledge and re-create expert behavior
Example: An ER doctor considers a
patient's age, blood pressure, heart rate,
ECG, etc., and makes an educated guess
about whether or not the patient had a
heart attack

3
The Birth of ANNs
The first artificial neuron was
proposed in 1943 by
neurophysiologist Warren McCulloch
and the psychologist/logician Walter
Pitts
No computing resources at that time

4
Biological Neurons

5
A Simple Artificial Neuron

6
A Simple ANN
Pattern recognition: T versus H
[Figure: a 3x3 pixel grid; each pixel xij (row i, column j) feeds the hidden neurons f1(.), f2(.), f3(.) (one per row), whose outputs feed a single output neuron g(.) with outputs 1 and 0 for the two letters]
7
Examples:

[Figure: the letters T and H as 3x3 pixel patterns, with example pixel rows such as 1, 1, 1 and 0, 0, 0; a truth table lists the neuron outputs f1, f2, f3, g for the eight input combinations x1 x2 x3, with '?' marking don't-care entries]
8
Feedforward ANN

How many hidden layers should we use?
How many neurons should we use in each
hidden layer?
9
Recurrent ANN

10
Perceptrons
A simple ANN introduced by Frank
Rosenblatt in 1958
Discredited by Marvin Minsky and
Seymour Papert in 1969:
"Perceptrons have been widely publicized
as 'pattern recognition' or 'learning
machines' and as such have been discussed
in a large number of books, journal articles,
and voluminous 'reports'. Most of this
writing ... is without scientific value"
11
Perceptrons
[Figure: a single-layer perceptron with bias input x0 = 1 and inputs x1, x2, x3, weighted by w0, w1, w2, w3]

f(x) = 1 if w · x ≥ 0, 0 otherwise

Three-dimensional single-layer perceptron


Problem: Given a set of training data (i.e.,
(x, y) pairs), find the weight vector {w}
that correctly classifies the inputs.
12
The Perceptron Training Rule
t = target output, o = perceptron output
Training rule: Δwi = η e xi, where e = t − o,
and η is the step size.
Note that e = 0, +1, or −1.
If e = 0, then don't update the weight.
If e = +1, then t = 1 and o = 0, so we need to
increase wi if xi > 0, and decrease wi if xi < 0.
Similar logic applies when e = −1.
η is often initialized to 0.1 and decreases as
training progresses.

13
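The training rule above can be sketched in a few lines of Python. This is a minimal sketch under assumptions not in the slides: the AND function as training data, η fixed at 0.1 (the slides decrease η over time), and 20 passes through the data.

```python
# Perceptron training rule: w_i <- w_i + eta * e * x_i, with e = t - o.
# Assumptions (not from the slides): AND training data, fixed eta = 0.1.
def train_perceptron(samples, eta=0.1, epochs=20):
    w = [0.0, 0.0, 0.0]           # w[0] is the bias weight (x0 = 1)
    for _ in range(epochs):
        for x, t in samples:
            x = [1.0] + list(x)   # prepend the bias input x0 = 1
            o = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else 0
            e = t - o             # e is 0, +1, or -1
            w = [wi + eta * e * xi for wi, xi in zip(w, x)]
    return w

# AND is linearly separable, so the rule converges to a separating w:
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w = train_perceptron(data)
```

A decaying η, as the slide suggests, mainly helps when the data are noisy; for separable noise-free data a fixed step converges by the perceptron convergence theorem.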
From Perceptrons to
Backpropagation
Perceptrons were dismissed because of:
Limitations of single layer perceptrons
The threshold function is not differentiable
Multi-layer ANNs with differentiable
activation functions allow much richer
behaviors.
A multi-layer perceptron
(MLP) is a feedforward
ANN with at least one
hidden layer.
14
Backpropagation
Derivative-based method for optimizing
ANN weights.
1969: First described by Arthur Bryson
and Yu-Chi Ho.
1970s-80s: Popularized by David
Rumelhart, Geoffrey Hinton, Ronald
Williams, and Paul Werbos; led to a
renaissance in ANN research.
15
The Credit Assignment
Problem

[Figure: an ANN whose output is 1 when the wanted output is 0]

In a multi-layer ANN, how can we tell which
weight should be varied to correct an
output error? Answer: backpropagation.

16
Backpropagation
input neurons → hidden neurons → output neurons

[Figure: a 2-2-2 network; inputs x1, x2 connect through weights vij to hidden neurons (net inputs a1, a2; outputs y1, y2), which connect through weights wij to output neurons (net inputs z1, z2; outputs o1, o2)]

a1 = v11 x1 + v21 x2    z1 = w11 y1 + w21 y2
y1 = f(a1)              o1 = f(z1)
Similar for a2 and y2; similar for z2 and o2
17
tk = desired (target) value of k-th output neuron
no = number of output neurons

E = (1/2) Σk=1..no (tk − ok)²
  = (1/2) Σk=1..no (tk − f(zk))²

Sigmoid transfer function:
f(x) = (1 + e^(−x))^(−1)
df/dx = e^(−x) (1 + e^(−x))^(−2) = [1 − f(x)] f(x)

[Figure: plot of the sigmoid f(x) for x from −5 to 5, passing through 0.5 at x = 0]
18
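The derivative identity df/dx = [1 − f(x)] f(x) is what makes the sigmoid convenient for backpropagation; it can be checked numerically. A minimal sketch; the test points and step size h are arbitrary choices, not from the slides.

```python
import math

# Numerical check of the sigmoid derivative identity df/dx = [1 - f(x)] f(x),
# using a central difference. The points and h = 1e-6 are arbitrary choices.
def f(x):
    return 1.0 / (1.0 + math.exp(-x))

h = 1e-6
for x in [-5.0, -1.0, 0.0, 1.0, 5.0]:
    numeric = (f(x + h) - f(x - h)) / (2 * h)   # central difference
    analytic = (1 - f(x)) * f(x)                 # the closed-form identity
    assert abs(numeric - analytic) < 1e-6
```

Note the practical payoff: during training the derivative is computed from the already-available activation o = f(z) as (1 − o) o, with no extra exponentials.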
Output Neurons

δj ≡ dE/dzj

E = (1/2) Σk=1..no (tk − f(zk))²

dE/dwij = (dE/dzj)(dzj/dwij) = (dE/dzj) yi = δj yi

δj = d/dzj [ (1/2)(tj − oj)² ]
   = −(tj − oj) df(zj)/dzj
   = −(tj − oj) [1 − f(zj)] f(zj)
   = −(tj − oj) (1 − oj) oj
19
Hidden Neurons
D(j) = {output neurons whose inputs come from the j-th middle-layer neuron}

[Figure: weight vij feeds net input aj → output yj, which fans out to zk for all k ∈ D(j)]

dE/dvij = Σk∈D(j) (dE/dzk)(dzk/dyj)(dyj/daj)(daj/dvij)
        = xi Σk∈D(j) (dE/dzk)(dzk/dyj)(dyj/daj)
        = xi δj

δj ≡ Σk∈D(j) (dE/dzk)(dzk/dyj)(dyj/daj)
   = Σk∈D(j) δk wjk yj (1 − yj)
   = yj (1 − yj) Σk∈D(j) δk wjk

20
The Backpropagation Training
Algorithm
1. Randomly initialize weights {w} and {v}.
2. Input sample x to get output o. Compute error E.
3. Compute derivatives of E with respect to output
weights {w} (two pages previous).
4. Compute derivatives of E with respect to hidden
weights {v} (previous page). Note that the results of
step 3 are used for this computation; hence the term
"backpropagation".
5. Repeat step 4 for additional hidden layers as needed.
6. Use gradient descent to update weights {w} and
{v}. Go to step 2 for the next sample/iteration.

21
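The six steps above can be sketched for a small 2-2-1 network (2 inputs, 2 hidden sigmoid neurons, 1 sigmoid output, no bias nodes). The network size, learning rate, initialization, and single-sample update below are illustrative assumptions, not values from the slides.

```python
import math
import random

# Minimal sketch of backpropagation steps 1-6 for a 2-2-1 sigmoid network.
# All numeric choices (eta = 0.5, init range, one sample) are assumptions.
def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def backprop_step(v, w, x, t, eta=0.5):
    # Step 2: forward pass (a = vx, y = f(a), z = wy, o = f(z)) and error E
    a = [v[0][j] * x[0] + v[1][j] * x[1] for j in range(2)]
    y = [sigmoid(aj) for aj in a]
    z = w[0] * y[0] + w[1] * y[1]
    o = sigmoid(z)
    E = 0.5 * (t - o) ** 2
    # Step 3: output delta dE/dz = -(t - o)(1 - o)o; then dE/dw_j = delta y_j
    delta = -(t - o) * (1 - o) * o
    # Step 4: hidden deltas reuse delta (the "back" in backpropagation)
    delta_h = [y[j] * (1 - y[j]) * delta * w[j] for j in range(2)]
    # Step 6: gradient descent on both layers
    w = [w[j] - eta * delta * y[j] for j in range(2)]
    v = [[v[i][j] - eta * delta_h[j] * x[i] for j in range(2)] for i in range(2)]
    return v, w, E

# Step 1: random initialization (seeded so the run is repeatable)
random.seed(1)
v = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(2)]
w = [random.uniform(-1, 1) for _ in range(2)]
x, t = [1.0, 0.5], 1.0
v, w, E0 = backprop_step(v, w, x, t)  # one gradient step on this sample...
_, _, E1 = backprop_step(v, w, x, t)  # ...lowers its error (E1 < E0)
```

In a real run, step 2 onward would loop over all training samples until a termination criterion is met, as the algorithm states.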
XOR Example
[Figure: the (x1, x2) plane with quadrants labeled 1 (upper left), 0 (upper right), 0 (lower left), 1 (lower right)]

y = sign(x1x2)

Not linearly separable. This is a very simple
problem, but early ANNs were unable to solve it.
22
XOR Example
[Figure: XOR network with inputs x1, x2 and a bias input (fixed at 1), two hidden neurons y1, y2 plus a hidden-layer bias (fixed at 1), and one output o1; input weights v11, v21, v31, v12, v22, v32 and output weights w11, w21, w31]

Backprop.m

Bias nodes at both the input and hidden layer
23
XOR Example
[Figure: the (x1, x2) plane with quadrants labeled 1 (upper left), 0 (upper right), 0 (lower left), 1 (lower right)]

Homework: Record the weights for the trained
ANN, then input various (x1, x2) combinations to
the ANN to see how well it can generalize.

24
Backpropagation Issues
Momentum: wij ← wij − η δj yi + α Δwij,previous
What value of α should we use?
Backpropagation is a local optimizer
Combine it with a global optimizer (e.g., BBO)
Run backprop with multiple initial conditions
Add random noise to input data and/or
weights to improve generalization

25
Backpropagation Issues
Batch backpropagation:
Randomly initialize weights {w} and {v}
While not (termination criteria)
  For i = 1 to (number of training samples)
    Input sample xi to get output oi. Compute error Ei
    Compute dEi/dw and dEi/dv
  Next sample
  dE/dw = Σi dEi/dw and dE/dv = Σi dEi/dv
  Use gradient descent to update weights {w} and {v}
Don't forget to adjust the learning rate!
26
Backpropagation Issues
Weight decay:
wij ← wij − η δj yi − d wij
This tends to decrease weight magnitudes
unless they are reinforced with backprop
d ≈ 0.001
This corresponds to adding a term to the
error function that penalizes the weight
magnitudes

27
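A single weight update combining the momentum and weight-decay terms above can be sketched as follows. Only d = 0.001 comes from the slides; η = 0.1 and α = 0.9 are illustrative assumptions.

```python
# One weight update with momentum and weight decay, per the slides:
# w <- w - eta * delta_j * y_i + alpha * dw_previous - d * w.
# eta and alpha values are assumptions; d = 0.001 is from the slides.
def update_weight(w, delta_j, y_i, prev_dw, eta=0.1, alpha=0.9, d=0.001):
    dw = -eta * delta_j * y_i + alpha * prev_dw  # gradient step plus momentum
    w = w + dw - d * w                           # decay shrinks w toward 0
    return w, dw

w, dw = update_weight(1.0, 0.5, 1.0, 0.0)
```

The returned dw is fed back as prev_dw on the next update, so the momentum term accumulates across iterations.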
Backpropagation Issues
Quickprop (Scott Fahlman, 1988)
Backpropagation is notoriously slow.
Quickprop has the same philosophy as
Newton-Raphson: assume the error
surface is quadratic and jump in one
step to the minimum of the quadratic.

28
Backpropagation Issues
Other activation functions
Sigmoid: f(x) = (1 + e^(−x))^(−1)
Hyperbolic tangent: f(x) = tanh(x)
Step: f(x) = U(x)
Tan Sigmoid: f(x) = (e^(cx) − e^(−cx)) / (e^(cx) + e^(−cx))
for some positive constant c
How many hidden layers should we
use?

29
Universal Approximation
Theorem
A feed-forward ANN with one hidden layer
and a finite number of neurons can
approximate any continuous function to any
desired accuracy.
The ANN activation functions can be any
continuous, nonconstant, bounded,
monotonically increasing functions.
The desired weights may not be obtainable
via backpropagation.
George Cybenko, 1989; Kurt Hornik, 1991

30
Termination Criterion

[Figure: error vs. training time; training-set error keeps decreasing while validation/test-set error reaches a minimum and then increases]

If we train too long we begin to
memorize the training data and lose
the ability to generalize.
Train with a validation/test set.
31
Termination Criterion
Cross Validation
N data partitions
N training runs, each using (N−1) partitions
for training and 1 partition for
validation/test
For each training run, store the number of
epochs ci that gives the best test set
performance (i = 1, ..., N)
cave = mean{ci}
Train on all data for cave epochs

32
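The cross-validation procedure above can be sketched as a skeleton. The train_one_epoch and test_error callables are hypothetical placeholders for the ANN's epoch of training and its test-set error; they are not from the slides.

```python
# N-fold cross-validation to pick a training length: each run holds out one
# partition, records the epoch count c_i with the best test-set error, and
# the mean c_ave is returned. train_one_epoch / test_error are placeholders.
def cross_validate_epochs(partitions, train_one_epoch, test_error,
                          max_epochs=100):
    best_epochs = []
    for i in range(len(partitions)):
        train = [s for j, p in enumerate(partitions) if j != i for s in p]
        model = {}                        # placeholder for the ANN's weights
        best_c, best_err = 0, float("inf")
        for c in range(1, max_epochs + 1):
            train_one_epoch(model, train)
            err = test_error(model, partitions[i])
            if err < best_err:            # remember the best-performing epoch
                best_c, best_err = c, err
        best_epochs.append(best_c)
    return sum(best_epochs) / len(best_epochs)   # c_ave
```

A final training run on all the data would then stop after round(c_ave) epochs, as the slide describes.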
Adaptive Backpropagation
Recall the standard weight update: wij ← wij − η δj yi
With adaptive learning rates, each weight
wij has its own rate ηij
If the sign of Δwij is the same over several
backprop updates, then increase ηij
If the sign of Δwij is not the same over
several backprop updates, then decrease
ηij
33
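The adaptive-rate rule above can be sketched per weight; the growth and shrink factors (1.1 and 0.5) are illustrative assumptions, not values from the slides.

```python
# Adaptive learning rate for one weight: grow the rate when successive
# updates agree in sign, shrink it when they disagree. The factors
# up = 1.1 and down = 0.5 are assumptions, not from the slides.
def adapt_rate(rate, dw, prev_dw, up=1.1, down=0.5):
    if dw * prev_dw > 0:       # same sign as the previous update: speed up
        return rate * up
    elif dw * prev_dw < 0:     # sign flipped: we overshot, so slow down
        return rate * down
    return rate                # one update was zero: leave the rate alone
```

In practice each weight wij carries its own (rate, prev_dw) pair, and adapt_rate is called after every update of that weight.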
Double Backpropagation
In addition to minimizing the training error:
E1 = (1/2) Σk=1..no (tk − ok)²
Also minimize the sensitivity of the training
error to the input data:
E2 = (1/2) Σk (∂E1/∂xk)²
P = number of input training patterns. We want
an ANN that can generalize, so input changes
should not result in large error changes.

34
Other ANN Training Methods
Gradient-free approaches (GAs, BBO, etc.)
Global optimization
BBO.m
Combination with gradient descent
We can train the structure as well as the
weights
We can use non-differentiable activation
functions
We can use non-differentiable cost
functions
35
Classification Benchmarks
The Iris classification problem
150 data samples
Four input feature values (sepal length
and width, and petal length and
width)
Three types of irises: Setosa,
Versicolour, and Virginica

36
Classification Benchmarks
The two-spirals
classification problem
UC Irvine Machine Learning Repository
http://archive.ics.uci.edu/ml

194 benchmarks!

37
Radial Basis Functions
N middle-layer neurons
Inputs x
Activation functions f(x, ci)
Output weights wik
yk = Σi wik f(x, ci) = Σi wik φ(||x − ci||)
φ(.) is a basis function
φ(||x − ci||) → 0 as ||x − ci|| → ∞
{ci} are the N RBF centers
Universal approximators
J. Moody and C. Darken, 1989
38
Radial Basis Functions
Common basis functions:
Gaussian: φ(||x − ci||) = exp(−||x − ci||² / σ²)
σ is the width of the basis function
Many other proposed basis functions

39
Radial Basis Functions
Suppose we have the data set (xi, yi), i = 1, ..., N
Each xi is multidimensional, each yi is scalar
Set ci = xi, i = 1, ..., N
Define gik = φ(||xi − xk||)
Input each xi to the RBF to obtain Gw = y:

[ g11 ... g1N ] [ w1 ]   [ y1 ]
[  :   ⋱   :  ] [  :  ] = [  :  ]
[ gN1 ... gNN ] [ wN ]   [ yN ]

G is nonsingular if the {xi} are distinct
Solve for w: this gives the global minimum
(assuming fixed c and σ)
40
Radial Basis Functions
We again have the data set (xi, yi), i = 1, ..., N
Each xi is multidimensional, each yi is scalar
The ck are given for k = 1, ..., m, with m < N
Define gik = φ(||xi − ck||)
Input each xi to the RBF to obtain Gw = y:

[ g11 ... g1m ] [ w1 ]   [ y1 ]
[  :   ⋱   :  ] [  :  ] = [  :  ]
[ gN1 ... gNm ] [ wm ]   [ yN ]

w = (GᵀG)⁻¹Gᵀy = G⁺y
41
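The least-squares fit above can be sketched in pure Python. This is a sketch under assumptions not in the slides: 1-D data, m = 2 Gaussian centers, width σ = 1, and the 2x2 normal equations solved in closed form.

```python
import math

# RBF least-squares fit, w = (G'G)^(-1) G'y, for the special case of 1-D
# inputs and two Gaussian centers. Data, centers, and sigma are assumptions.
def gaussian(r, sigma=1.0):
    return math.exp(-r * r / (sigma * sigma))

def fit_rbf_2centers(xs, ys, c):
    # Design matrix G (N x 2): g_ik = phi(|x_i - c_k|)
    G = [[gaussian(abs(x - c[0])), gaussian(abs(x - c[1]))] for x in xs]
    # Normal equations G'G w = G'y, solved in closed form for the 2x2 case
    a = sum(g[0] * g[0] for g in G)
    b = sum(g[0] * g[1] for g in G)
    d = sum(g[1] * g[1] for g in G)
    p = sum(g[0] * y for g, y in zip(G, ys))
    q = sum(g[1] * y for g, y in zip(G, ys))
    det = a * d - b * b
    return [(d * p - b * q) / det, (a * q - b * p) / det]

def rbf_eval(x, w, c):
    return sum(wk * gaussian(abs(x - ck)) for wk, ck in zip(w, c))

# Data generated from a known RBF: least squares recovers its weights
c = [0.0, 3.0]
xs = [-1.0, 0.0, 1.0, 2.0, 3.0, 4.0]                  # N = 6 > m = 2
ys = [2.0 * gaussian(abs(x - c[0])) - 1.0 * gaussian(abs(x - c[1]))
      for x in xs]
w = fit_rbf_2centers(xs, ys, c)
```

For general m one would solve the normal equations with a linear-algebra routine rather than by hand; the closed-form 2x2 solve is only to keep the sketch dependency-free.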
Radial Basis Functions
How can we choose the RBF centers?
Randomly select them from the inputs
Use a clustering algorithm
Other options (BBO?)
How can we choose the RBF widths?

42
Other Types of ANNs
Many other types of ANNs
Cerebellar Model Articulation Controller (CMAC)
Spiking neural networks
Self-organizing map (SOM)
Recurrent neural network (RNN)
Hopfield network
Boltzmann machine
Cascade-Correlation
and many others

43
Sources
Neural Networks, by C. Stergiou and D. Siganos,
www.doc.ic.ac.uk/~nd/surprise_96/journal/vol4/cs11/report.html
The Backpropagation Algorithm, by A. Venkataraman,
www.speech.sri.com/people/anand/771/html/node37.html
CS 478 Course Notes, by Tony Martinez,
http://axon.cs.byu.edu/~martinez

44