y = ∑k wkxk for k = 1 to p
cost function = mean squared error: J = ½ e²
[Figure: linear combiner with inputs x1 … xn weighted by w1 … wp, a fixed input of −1 carrying the bias weight w0 = θ, a summing node ∑ producing the output y, and the cost J]
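As a minimal sketch (assuming the standard LMS rule w ← w + η e x, i.e. stochastic gradient descent on J = ½e²), the combiner above can be trained as follows; the learning rate, toy data and target weights are made up for illustration:

```python
import numpy as np

def lms_step(w, x, d, eta=0.1):
    """One LMS update for the linear combiner y = w.x; gradient descent on J = 0.5*e**2."""
    y = w @ x                      # combiner output
    e = d - y                      # error against desired response d
    return w + eta * e * x, 0.5 * e ** 2

# toy run: learn the weights of a 2-input combiner with a bias input of -1
rng = np.random.default_rng(0)
w = np.zeros(3)                    # [w0, w1, w2]
w_true = np.array([0.5, 1.0, -2.0])
for _ in range(2000):
    x = np.concatenate(([-1.0], rng.uniform(-1.0, 1.0, 2)))
    d = w_true @ x
    w, J = lms_step(w, x, d)
print(np.round(w, 3))              # converges towards w_true
```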
Properties of LMS:
4. Multilayer Feedforward Perceptron Training
[Figure: signal-flow graph of neuron j: inputs yi from the previous layer, weights wji, induced local field uj passed through the activation ϕ(•) with bias input −1; the output is compared with the desired response dj to give the error ej]
Instantaneous sum of squared errors: E(n) = ½ ∑j ej²(n), summed over all j in the o/p layer
For N patterns, average squared error: Eav = (1/N) ∑n E(n) for n = 1 to N
Weights are adjusted by gradient descent, i.e. in proportion to −∂Eav/∂wji(n).
Case 2: a hidden node is more complex; we need to consider neuron j feeding neuron k, where the inputs to neuron j are yi.
∴ δj(n) = −ϕj′(υj(n)) ∑k ek(n) [∂ek(n)/∂υk(n)] [∂υk(n)/∂yj(n)] = ϕj′(υj(n)) ∑k δk(n) wkj(n)
• Thus δj(n) is computed in terms of δk(n) which is closer to the output. After
calculating the network output in a forward pass, the error is computed and
recursively back-propagated through the network in a backward pass.
weight correction = (learning rate) × (local gradient) × (i/p signal of the neuron)
∆wji(n) = η δj(n) yi(n)
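A minimal vectorised sketch of this backward pass for a single hidden layer; the array shapes, the generic ϕ′ values and the output-layer rule δ = e·ϕ′(υ) are assumptions for illustration, not the notes' own code:

```python
import numpy as np

def backprop_deltas(e_out, phi_prime_out, phi_prime_hid, W_out):
    """Local gradients: output deltas first, then hidden deltas via
    delta_j = phi'(v_j) * sum_k delta_k * w_kj."""
    delta_out = e_out * phi_prime_out                   # output-layer local gradients
    delta_hid = phi_prime_hid * (W_out.T @ delta_out)   # back-propagated to the hidden layer
    return delta_out, delta_hid

def weight_updates(delta, y_prev, eta=0.1):
    """Delta rule: dW_ji = eta * delta_j * y_i (outer product over the layer)."""
    return eta * np.outer(delta, y_prev)
```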
4.2 Back-propagation training
Activation function:
yj(n) = ϕj(uj(n)) = 1 / (1 + exp(−uj(n)))
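For this logistic activation the derivative needed by back-propagation has the well-known closed form ϕj′(uj) = yj(1 − yj), computable from the output alone; a two-line sketch (not from the notes):

```python
import numpy as np

def logistic(u):
    return 1.0 / (1.0 + np.exp(-u))

def logistic_prime(y):
    # for the logistic activation, phi'(u) = y * (1 - y)
    return y * (1.0 - y)
```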
4.3 Other perspectives for improving generalisation
• randomly updating weights after each pattern requires very little storage and
leads to a stochastic search which is less likely to get stuck in local minima
• updating after presentation of all training samples (an epoch) provides a more accurate estimate of the gradient vector since it is based on the average squared error Eav
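A minimal sketch contrasting the two update schedules; the gradient callable grad(w, x, d) and the learning rate are assumed placeholders, not part of the notes:

```python
import numpy as np

def sequential_epoch(w, X, D, grad, eta=0.1):
    """Pattern-by-pattern updating: adjust the weights after every sample."""
    for i in np.random.permutation(len(X)):        # random presentation order
        w = w - eta * grad(w, X[i], D[i])
    return w

def batch_epoch(w, X, D, grad, eta=0.1):
    """Epoch (batch) updating: one adjustment based on the average gradient."""
    g = np.mean([grad(w, x, d) for x, d in zip(X, D)], axis=0)
    return w - eta * g
```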
4.3.3 Initialisation
4.3.5 Cross-Validation
• measures generalisation on test set
• various parameters including no. of hidden nodes, learning rate and
training set size can be set based on cross-validation performance
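A minimal sketch of using a held-out split to choose the number of hidden nodes; the helpers train_mlp(X, D, n_hidden) and mse(model, X, D) are hypothetical placeholders:

```python
import numpy as np

def choose_hidden_size(X, D, candidate_sizes, train_mlp, mse, val_frac=0.2, seed=0):
    """Pick the hidden-layer size with the lowest held-out error."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_val = int(val_frac * len(X))
    val, tr = idx[:n_val], idx[n_val:]
    scores = {h: mse(train_mlp(X[tr], D[tr], h), X[val], D[val]) for h in candidate_sizes}
    return min(scores, key=scores.get), scores
```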
4.3.6 Network Pruning by complexity regularisation
4.5 Example of learning XOR Problem
XOR truth table:

x1  x2 | target
 0   0 |   0
 0   1 |   1
 1   0 |   1
 1   1 |   0

[Figure: decision boundaries and network for the XOR problem: hidden neuron a (weights 1, 1 from x1, x2; threshold 1.5) and hidden neuron b (weights 1, 1; threshold 0.5) feed output neuron c (weight −2 from a, 1 from b; threshold 0.5); the out = 0 / out = 1 boundaries of neurons a, b and c are drawn in the (x1, x2) plane]

Outputs of the hidden neurons a and b:

x1  x2 | a  b | target
 0   0 | 0  0 |   0
 0   1 | 0  1 |   1
 1   0 | 0  1 |   1
 1   1 | 1  1 |   0
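A small sketch checking this construction with hard-threshold units; the weights and thresholds are those reconstructed from the figure above, so treat them as assumed values:

```python
def step(v):
    # hard-limiting activation: 1 if the induced local field is non-negative
    return 1 if v >= 0 else 0

def xor_net(x1, x2):
    a = step(x1 + x2 - 1.5)        # neuron a fires only for (1,1): AND
    b = step(x1 + x2 - 0.5)        # neuron b fires for any active input: OR
    c = step(-2 * a + b - 0.5)     # neuron c: OR but not AND -> XOR
    return a, b, c

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, *xor_net(x1, x2))   # reproduces both tables above
```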
4.6 Example: vehicle navigation
[Figure: fully connected feedforward network for vehicle navigation; the output units encode the steering command, ranging from sharp left to sharp right]
5. Associative Memories
stimulus ak → response bk

ak = [ak1, ak2, …, akp]T,   bk = [bk1, bk2, …, bkp]T

[Figure: single-layer linear associator: each input component ak1 … akp feeds every output node 1 … p through the weights wij(k), producing bk1 … bkp]

W(k) = [ w11(k)  w12(k)  …  w1p(k)
         w21(k)  w22(k)  …  w2p(k)
           ⋮        ⋮            ⋮
         wp1(k)  wp2(k)  …  wpp(k) ]

response bk = W(k) ak
Pattern recall:
For recall of a stimulus pattern aj, with overall memory matrix W = ∑k W(k) = ∑k bk akT:
b = W aj = ∑k (akT aj) bk
assuming that the key patterns have been normalised, ajT aj = 1
⇒ b = bj + vj, where vj = ∑k (akT aj) bk for k = 1 to q, k ≠ j
i.e. vj results from interference from all the other stimulus patterns
∴ if akT aj = 0 for k ≠ j (orthonormal key patterns), vj = 0 → perfect recall
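A minimal numpy sketch of this recall property, storing the pairs as a sum of outer products bk akT (consistent with the expansion above); the random orthonormal keys and the dimensions are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
p, q = 8, 3                                   # pattern dimension, number of stored pairs

# orthonormal key (stimulus) patterns and arbitrary stored responses
A = np.linalg.qr(rng.standard_normal((p, q)))[0]   # columns a_k, mutually orthonormal
B = rng.standard_normal((p, q))                    # columns b_k

# memory matrix: sum of outer products b_k a_k^T
M = sum(np.outer(B[:, k], A[:, k]) for k in range(q))

# recall with key a_j: b = M a_j = b_j when the keys are orthonormal
j = 1
print(np.allclose(M @ A[:, j], B[:, j]))      # True -> perfect recall
```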
Main features:
distributed memory
The max. no. of patterns reliably stored is p, the dimension of input space
which is also the rank (no. of independent columns or rows) of W
noisy stimulus e.g. [0.8 -0.15 0.15 -0.2]T gives [4 1.25 -0.45]T
6. Radial Basis Functions
[Figure: RBF network: inputs x1 … xp feed radial basis units ϕ1 … ϕp; a constant input of 1 carries the bias weight w0, and the unit outputs are combined linearly through weights w1 … wp]

ϕ(x) = exp(−||x − t||²)
t = centre of Gaussian
6.2 Example: XOR
ϕ1(x) = exp(−||x − t1||²),  t1 = [1, 1]T
ϕ2(x) = exp(−||x − t2||²),  t2 = [0, 0]T
[Figure: the four input patterns mapped into the (ϕ1(x), ϕ2(x)) plane: (1,1) and (0,0) lie at opposite ends while (0,1) and (1,0) coincide in the middle, so a linear decision boundary now separates the two classes]
x1  x2 | ϕ1(x)   ϕ2(x)
 0   0 | e^−√2    1
 0   1 | e^−1     e^−1
 1   0 | e^−1     e^−1
 1   1 | 1        e^−√2
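A minimal sketch evaluating the two basis functions for the four input patterns, using the squared-distance Gaussian as defined in this section:

```python
import numpy as np

t1, t2 = np.array([1.0, 1.0]), np.array([0.0, 0.0])   # Gaussian centres from above

def phi(x, t):
    # Gaussian RBF with squared distance, phi(x) = exp(-||x - t||^2)
    return np.exp(-np.sum((np.asarray(x, float) - t) ** 2))

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, round(phi(x, t1), 3), round(phi(x, t2), 3))
# (0,1) and (1,0) map to the same point in the (phi1, phi2) plane,
# so a straight line can now separate the two XOR classes
```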
6.3 Ill-posed Hypersurface Reconstruction
[Figure: ill-posed reconstruction of a mapping x → F(x) from data]
Cost to be minimised: ℰ(F) = s(F) + λ c(F), where s(F) is the standard error term, c(F) is the regularising term and λ is the regularisation parameter
Comparison of MLP and RBF networks:
• all layers usually nonlinear vs. nonlinear hidden layer but linear output layer
• computation of the inner product of the i/p vector & weight vector vs. the Euclidean norm between the i/p vector and the centre of the appropriate unit
6.6 Example: curve fitting
output shown for 200 inputs uniformly sampled in the range [−8, 12]
[Figure: fitted curves through the data points for Gaussian widths σ = 0.5, σ = 1.0 and σ = 1.5]