
3. Least Mean Square (LMS) Algorithm

3.1 Spatial Filtering

Uses a single linear neuron and can be understood as adaptive filtering.

y = ∑k wkxk for k = 1 to p

error e = d − y where d = desired value

cost function = mean squared error:  J = ½ e²

[Figure: single linear neuron — inputs x1 … xp weighted by w1 … wp, plus a fixed input of −1 weighted by w0 = θ (threshold), summed by Σ to give the output y]

3.2 Steepest descent


∂J/∂wk = 0 to determine the optimum weight


adjust the weights iteratively, moving along the error surface (gradient = ∂J/∂w) towards the optimum value:

wk(n+1) = wk(n) − η ∂J(n)/∂wk

i.e. the updated value is proportional to the negative of the gradient of the error surface

[Figure: error surface J plotted against a single weight w0, with minimum Jmin]
∴ (since ∂J(n)/∂wk = −e(n) xk(n) for the linear neuron)  wk(n+1) = wk(n) + η e(n) xk(n)
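
A minimal Python sketch of this LMS update; the data and the value of η here are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stationary inputs and a linear "desired" response for illustration.
p = 4                                            # number of inputs
w_true = np.array([0.5, -1.0, 2.0, 0.3])
X = rng.normal(size=(500, p))                    # stimulus vectors x(n)
d = X @ w_true + 0.01 * rng.normal(size=500)     # desired responses d(n)

w = np.zeros(p)                                  # initial weights
eta = 0.05                                       # learning-rate parameter (hand-picked)

for x_n, d_n in zip(X, d):
    y_n = w @ x_n                                # y = sum_k w_k x_k
    e_n = d_n - y_n                              # e = d - y
    w += eta * e_n * x_n                         # w_k(n+1) = w_k(n) + eta * e(n) * x_k(n)

print(w)                                         # approaches w_true
```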

Properties of LMS:

• a stochastic gradient algorithm, in that the gradient vector is 'random', in contrast to steepest descent
• on average improves in accuracy for increasing values of n
• reduces the storage requirement to the information present in its current set of weights, and can operate in a nonstationary environment

3.2.1 Convergence (proof not given)

• in the mean: the weight vector → the optimum value as n → ∞; requires
  0 < η < 2/λmax,  where λmax is the max eigenvalue of the autocorrelation matrix Rx = E[x xT]

• in the mean square: the mean-square of the error signal → a constant as n → ∞; requires
  0 < η < 2/tr[Rx],  where tr[Rx] = ∑k λk ≥ λmax
  (both bounds are checked numerically in the sketch below)

• faster convergence is usually obtained by making η a function of n, for example η(n) = c/n for some constant c
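
These bounds can be checked numerically for a given set of stimulus vectors; a short sketch with synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))          # hypothetical stimulus vectors x(n), one per row

Rx = X.T @ X / len(X)                  # sample estimate of Rx = E[x x^T]
lam = np.linalg.eigvalsh(Rx)           # eigenvalues of the autocorrelation matrix

print(2.0 / lam.max())                 # convergence in the mean:        0 < eta < 2/lambda_max
print(2.0 / np.trace(Rx))              # convergence in the mean square: 0 < eta < 2/tr[Rx]
# tr[Rx] is the sum of the eigenvalues, so it is >= lambda_max and this bound is the tighter one
```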

4. Multilayer Feedforward Perceptron Training

4.1 Back-propagation Algorithm

[Figure: signal-flow graph of neuron j — outputs yi from the previous layer are weighted by wji and summed to give υj, which passes through ϕ(·) to give the output yj; subtracting yj from the desired response dj gives the error ej]

Let wji be the weight connecting neuron i to neuron j

error signal: ej(n) = dj(n) − yj(n)

net internal sum: υj(n) = ∑iwji(n)yi(n) for i = 0 to p

output: yj(n) = ϕj(υj(n))

Instantaneous sum of squared errors:  ℰ(n) = ½ ∑j ej²(n), summed over all j in the output layer


For N patterns, average squared error:  ℰav = (1/N) ∑n ℰ(n) for n = 1 to N

• The learning goal is to minimise ℰav by adjusting the weights, but instead of ℰav the instantaneous estimate ℰ(n) is used, on a pattern-by-pattern basis




From the chain rule:  ∂ℰ(n)/∂wji(n) = [∂ℰ(n)/∂ej(n)] [∂ej(n)/∂yj(n)] [∂yj(n)/∂υj(n)] [∂υj(n)/∂wji(n)]

∴ weight correction:  Δwji(n) = −η ∂ℰ(n)/∂wji(n)  (i.e. steepest descent)
                               = η δj(n) yi(n),  where the local gradient δj(n) = −∂ℰ(n)/∂υj(n)




Case 1: output node — the local gradient is easily calculated: δj(n) = ej(n) ϕj′(υj(n))

Case 2: hidden node — more complex; need to consider neuron j feeding neuron k, where the inputs to neuron j are the yi

δj(n) = −[∂ℰ(n)/∂yj(n)] ϕj′(υj(n)) = −[∑k ek(n) ∂ek(n)/∂yj(n)] ϕj′(υj(n))

∴ δj(n) = −ϕj′(υj(n)) ∑k ek(n) [∂ek(n)/∂υk(n)] [∂υk(n)/∂yj(n)] = ϕj′(υj(n)) ∑k δk(n) wkj(n)

• Thus δj(n) is computed in terms of δk(n) which is closer to the output. After
calculating the network output in a forward pass, the error is computed and
recursively back-propagated through the network in a backward pass.

weight correction = (learning rate) × (local gradient) × (input signal to neuron j):
∆wji(n) = η δj(n) yi(n)
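
A minimal sketch of back-propagation for one hidden layer, applied pattern by pattern, assuming the logistic activation of section 4.2 and a fixed bias input of −1; the network sizes, initial weight range and example data are placeholders:

```python
import numpy as np

def phi(v):
    """Logistic activation (see section 4.2)."""
    return 1.0 / (1.0 + np.exp(-v))

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 2, 2, 1                       # placeholder network sizes
W1 = rng.uniform(-0.5, 0.5, (n_hid, n_in + 1))     # hidden-layer weights, column 0 is the bias
W2 = rng.uniform(-0.5, 0.5, (n_out, n_hid + 1))    # output-layer weights, column 0 is the bias
eta = 0.5                                          # learning-rate parameter

def train_pattern(x, d):
    """One forward + backward pass for a single training pattern (x, d)."""
    global W1, W2
    # forward pass
    x_aug = np.concatenate(([-1.0], x))            # fixed bias input of -1
    y_hid = phi(W1 @ x_aug)                        # hidden outputs y_j = phi(v_j)
    y_aug = np.concatenate(([-1.0], y_hid))
    y_out = phi(W2 @ y_aug)                        # network outputs
    # backward pass
    e = d - y_out                                  # e_j = d_j - y_j
    delta_out = e * y_out * (1.0 - y_out)          # output node: delta_j = e_j * phi'(v_j)
    delta_hid = y_hid * (1.0 - y_hid) * (W2[:, 1:].T @ delta_out)  # hidden: phi'(v_j) * sum_k delta_k w_kj
    # weight corrections: Delta w_ji = eta * delta_j * y_i
    W2 += eta * np.outer(delta_out, y_aug)
    W1 += eta * np.outer(delta_hid, x_aug)
    return e

# example call with a hypothetical pattern
print(train_pattern(np.array([0.0, 1.0]), np.array([1.0])))
```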

4.2 Back-propagation training

Activation function (logistic):

yj(n) = ϕj(υj(n)) = 1 / (1 + exp(−υj(n)))

∂yj(n)/∂υj(n) = ϕj′(υj(n)) = exp(−υj(n)) / [1 + exp(−υj(n))]² = yj(n) [1 − yj(n)]

Note that the max value of ϕj′(υj(n)) occurs at yj(n) = 0.5, and the min value of 0 occurs at yj(n) = 0 or 1.

Momentum term: add α ∆wji(n − 1) to the weight correction, where 0 ≤ |α| < 1

helps locate a more desirable local minimum in a complex error surface:

• no change in the sign of the gradient ⇒ ∆wji(n) increases and descent is accelerated
• a change in the sign of the gradient ⇒ ∆wji(n) decreases, which stabilises oscillations
• a large enough α can stop the process terminating in shallow local minima
• with momentum, η can be larger (see the sketch below)

[Figure: example error surface against a single weight, showing shallow and deep local minima]
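
A minimal sketch of the momentum update (the generalised delta rule), continuing the notation of the back-propagation sketch in section 4.1; the names and default values here are illustrative:

```python
import numpy as np

def momentum_update(W, delta, y_in, dW_prev, eta=0.5, alpha=0.9):
    """Generalised delta rule with momentum: Dw(n) = eta*delta_j*y_i + alpha*Dw(n-1)."""
    dW = eta * np.outer(delta, y_in) + alpha * dW_prev
    return W + dW, dW                  # keep Dw(n) to use as Dw(n-1) at the next step

# usage: start with dW_prev = 0.0, then repeatedly
#   W2, dW2 = momentum_update(W2, delta_out, y_aug, dW2)
```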

4.3 Other perspectives for improving generalisation

4.3.1 Pattern vs Batch Mode


Choice depends on the particular problem:

• randomly updating weights after each pattern requires very little storage and
leads to a stochastic search which is less likely to get stuck in local minima

• updating after presentation of all training samples (an epoch) provides a more accurate estimate of the gradient vector, since it is based on the average squared error ℰav

4.3.2 Stopping criteria


e.g. gradient vector threshold and/or change in average squared error per epoch

4.3.3 Initialisation

• default is uniform distribution inside a small range of values


• too large values can lead to premature saturation (neuron outputs close to
limits) which gives small weight adjustments even though error is large

4.3.4 Training Set Size


worst-case formula N > W/ε where:
N = no. of examples, W = no. of synaptic weights,
ε = fraction of errors permitted on test
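
A tiny worked illustration of this formula, with hypothetical numbers:

```python
def min_examples(num_weights, error_fraction):
    """Worst-case training-set size: N > W / epsilon."""
    return num_weights / error_fraction

# hypothetical example: 100 synaptic weights and 10% permitted test error
print(min_examples(100, 0.10))    # 1000.0, i.e. need more than 1000 examples
```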

4.3.5 Cross-Validation
• measures generalisation on test set
• various parameters including no. of hidden nodes, learning rate and
training set size can be set based on cross-validation performance

4.3.6 Network Pruning by complexity regularisation

(two possibilities: network growing and network pruning)


goal is to find the weight vector that minimises R(w) = s(w) + λ c(w)

where s(w) is a standard error measure, e.g. mean square error

λ is the regularisation parameter

c(w) is the complexity penalty that depends on the network, e.g. ||w||²

• regularisation term allows identification of weights having insignificant effect
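
A minimal sketch of this regularised cost with the ||w||² complexity penalty; the λ value and the inputs are placeholders:

```python
import numpy as np

def regularised_risk(errors, weights, lam=1e-3):
    """R(w) = s(w) + lambda * c(w), with c(w) = ||w||^2."""
    s = 0.5 * np.mean(np.square(errors))            # standard error measure, e.g. mean square error
    c = sum(float(np.sum(W * W)) for W in weights)  # complexity penalty ||w||^2 over all weight matrices
    return s + lam * c

# e.g. with hypothetical errors e and weight matrices W1, W2:
#   R = regularised_risk(e, [W1, W2], lam=0.001)
# In gradient descent the penalty shrinks weights whose effect on s(w) is insignificant.
```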

4.3.7 Other ways of minimising cost function

• Back-propagation uses a relatively simple, quick approach to minimising the cost function by obtaining an instantaneous estimate of the gradient

• methods and techniques from nonlinear optimum filtering and nonlinear function optimisation have been used to provide more sophisticated approaches to minimising the cost function, e.g. Kalman filtering, the conjugate-gradient method

4.4 Universal Approximation Theorem


a single hidden layer with a suitable ϕ can get arbitrarily close to any continuous function
• the logistic function satisfies the ϕ(⋅) definition
• a single hidden layer is sufficient, but the theorem gives no clue on synthesis
• a single hidden layer is restrictive in that hierarchical features are not supported

4.5 Example of learning XOR Problem

Decision boundaries and truth table:

x1  x2 | target
 0   0 |   0
 0   1 |   1
 1   0 |   1
 1   1 |   0

[Figure: decision boundary of hidden neuron a in the (x1, x2) plane — out = 1 on one side, out = 0 on the other]

[Figure: the 2-2-1 network — inputs x1 and x2 each connect with weight 1 to hidden neuron a (threshold 1.5) and to hidden neuron b (threshold 0.5); neuron a feeds output neuron c with weight −2 and neuron b feeds c with weight 1 (threshold 0.5); each threshold is implemented as a bias weight on a fixed −1 input]

[Figure: decision boundary of hidden neuron b in the (x1, x2) plane — out = 1 on one side, out = 0 on the other]

Hidden-unit outputs and target:

x1  x2 | a  b | target
 0   0 | 0  0 |   0
 0   1 | 0  1 |   1
 1   0 | 0  1 |   1
 1   1 | 1  1 |   0

[Figure: decision regions of output neuron c — out = 1 in the region between the two hidden-unit boundaries, out = 0 outside]
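
A quick check of the network described above, assuming hard-limiting threshold units and the weights and thresholds as read from the figure:

```python
def xor_net(x1, x2):
    a = int(1 * x1 + 1 * x2 > 1.5)     # hidden neuron a: weights 1, 1, threshold 1.5
    b = int(1 * x1 + 1 * x2 > 0.5)     # hidden neuron b: weights 1, 1, threshold 0.5
    c = int(-2 * a + 1 * b > 0.5)      # output neuron c: weights -2, 1, threshold 0.5
    return a, b, c

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, *xor_net(x1, x2))    # the last column reproduces the XOR target
```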

4.6 Example: vehicle navigation

[Figure: network architecture — a video input retina is fully connected to 9 hidden units, which are fully connected to 45 output units representing steering directions from sharp left to sharp right]

• network computes steering angle
• training examples from human driver
• obstacles detected by laser range finder

5. Associative Memories

5.1 linear associative memory

stimulus ak = [ak1, ak2, ..., akp]T        response bk = [bk1, bk2, ..., bkp]T

[Figure: single-layer linear network — each input component aki feeds every output node through the weights, output node j giving response component bkj]

         | w11(k)  w12(k)  ...  w1p(k) |
W(k) =   | w21(k)  w22(k)  ...  w2p(k) |
         |   ⋮       ⋮            ⋮    |
         | wp1(k)  wp2(k)  ...  wpp(k) |

response bk = W(k) ak

Design of the weight matrix for storing q pattern associations ak → bk

estimate of weight matrix = ∑k bk akT for k = 1 to q


(Hebbian learning principle)

where bk akT is the outer product of the column vector bk = [bk1, bk2, ..., bkp]T and the row vector akT = [ak1, ak2, ..., akp]

Pattern recall:
For recall of a stimulus pattern aj: b = W aj = ∑k (akTaj) bk
assuming that the key patterns have been normalised, ajTaj = 1
b = bj + vj where vj = ∑k (akTaj)bk for k = 1 to q, k ≠ j
i.e. vj results from interference from all other stimulus patterns
∴ (akTaj) = 0 for j ≠k → perfect recall (orthonormal patterns)

Main features:

• distributed memory
• auto- and hetero-associative
• content addressable and resistant to noise and damage
• interaction between stored patterns may lead to error on recall
• the max. no. of patterns reliably stored is p, the dimension of the input space, which is also the rank (no. of independent columns or rows) of W
• for an auto-associative memory, ideally W ak = ak, showing that the stimulus patterns are eigenvectors of W, all with unity eigenvalues

Example:  a1 = [1 0 0 0]T,  a2 = [0 1 0 0]T,  a3 = [0 0 1 0]T
          b1 = [5 1 0]T,    b2 = [−2 1 6]T,   b3 = [−2 4 3]T

                             | 5  −2  −2  0 |
    memory weight matrix W = | 1   1   4  0 |     giving perfect recall, since the
                             | 0   6   3  0 |     stimulus patterns are orthonormal

a noisy stimulus, e.g. [0.8 −0.15 0.15 −0.2]T, gives [4 1.25 −0.45]T, which is closer to b1 than to b2 or b3
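
The worked example above can be reproduced directly with the outer-product (Hebbian) rule; a short sketch:

```python
import numpy as np

# Stimulus / response pairs from the example (the stimuli are orthonormal).
A = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0],
              [0, 0, 1, 0]], dtype=float)          # rows a1, a2, a3
B = np.array([[5, 1, 0],
              [-2, 1, 6],
              [-2, 4, 3]], dtype=float)            # rows b1, b2, b3

# Hebbian estimate of the memory: W = sum_k b_k a_k^T
W = sum(np.outer(b, a) for a, b in zip(A, B))
print(W)                      # [[ 5 -2 -2  0] [ 1  1  4  0] [ 0  6  3  0]]

print(W @ A[0])               # perfect recall of b1 = [5 1 0]

noisy = np.array([0.8, -0.15, 0.15, -0.2])
print(W @ noisy)              # [ 4.    1.25 -0.45], closest to b1
```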

6. Radial Basis Functions

6.1 Separability of patterns


Cover's separability theorem states that if the mapping ϕ(x) is nonlinear and the hidden-unit space is of high dimension relative to the input space, then a classification problem is more likely to be linearly separable

[Figure: RBF network — inputs x1 … xp feed nonlinear hidden units ϕ1 … ϕp, whose outputs are combined by a linear output neuron with weights w1 … wp plus a bias weight w0 on a fixed input of 1]

Example of an RBF is a Gaussian:

ϕ(x) = exp(−||x − t||²),  t = centre of the Gaussian


• the output neuron is a linear weighted sum

• ϕ(x) is nonlinear and the hidden-unit space [ϕ1(x), ϕ2(x), ..., ϕp(x)] is usually of high dimension relative to the input space, so the patterns are more likely to be separable

• a difficult nonlinear optimisation problem has been converted to a linear optimisation problem that can be solved by the LMS algorithm

• if a different RBF is centred on each training pattern, then the training set can be learned perfectly

6.2 Example: XOR

use two hidden Gaussian functions

ϕ1(x) = exp(−||x − t1||²),   t1 = [1, 1]T
ϕ2(x) = exp(−||x − t2||²),   t2 = [0, 0]T
[Figure: the four patterns mapped into the (ϕ1(x), ϕ2(x)) plane — patterns (0,1) and (1,0) coincide, and a straight decision boundary separates them from patterns (0,0) and (1,1)]

x1  x2 | ϕ1(x)  ϕ2(x)
 0   0 |  e⁻²     1
 0   1 |  e⁻¹    e⁻¹
 1   0 |  e⁻¹    e⁻¹
 1   1 |   1     e⁻²
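
The table entries follow directly from the two Gaussians as defined above; a quick numerical check:

```python
import numpy as np

t1, t2 = np.array([1.0, 1.0]), np.array([0.0, 0.0])
phi = lambda x, t: np.exp(-np.sum((x - t) ** 2))   # exp(-||x - t||^2)

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    x = np.array(x, dtype=float)
    print(x, phi(x, t1), phi(x, t2))
# (0,0) -> (e^-2, 1); (0,1) and (1,0) -> (e^-1, e^-1); (1,1) -> (1, e^-2)
```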

6.3 Ill-posed Hypersurface Reconstruction

The inverse problem of finding an unknown mapping F from domain X to range Y is well-posed if:

1. for every x ∈ X there exists y ∈ Y (existence)
2. for every pair of inputs x, t ∈ X, F(x) = F(t) iff x = t (uniqueness)
3. the mapping is continuous (continuity)

[Figure: mapping of x ∈ X to F(x) ∈ Y]

Learning is ill-posed because of sparsity of information & noise in training set

Regularisation theory for solving ill-posed problems (Tikhonov) uses a modified cost functional that includes a complexity term:

ℰ(F) = ℰs(F) + λ ℰc(F)

where ℰs(F) is the standard error term and ℰc(F) is the regularising term

• one regularised solution is given by a linear superposition of multivariate Gaussian basis functions, with centres xi and widths σi:

F(x) = ∑i wi exp(−||x − xi||² / σi²)   for i = 1 to N

practical ways of regularising:

• reduce the number of RBFs
• change the σ of the RBFs
• choose the position of the centres

6.4 RBF Networks vs. MLP

• RBF: single hidden layer; MLP: possibly multiple hidden layers

• RBF: hidden and output layers fundamentally different; MLP: common computation nodes in all layers

• RBF: nonlinear hidden layer but linear output layer; MLP: all layers usually nonlinear

• RBF: each hidden unit computes the Euclidean norm between the input vector and the unit's centre; MLP: each unit computes the inner product of the input vector and its weight vector

• RBF: local approximation with fast learning but poor extrapolation; MLP: global approximation and therefore good at extrapolation

6.5 Learning Strategies

variety of possibilities, since a nonlinear optimisation strategy for the hidden layer is combined with a linear optimisation strategy in the output layer. For the hidden layer the main choice involves how the centres are learned:

• Fixed centres selected at random, e.g. choose the Gaussian exp(−(M/d²) ||x − ti||²), where M = no. of centres and d = the maximum distance between them

• Self-organised selection of centres, e.g. k-n-n or a self-organising NN

• Supervised selection of centres, e.g. error-correction learning with a suitable cost function, using modified gradient descent

6.6 Example: curve fitting

RBF network for approximating (x − 2)(2x + 1)(1 + x²)⁻¹ from 15 noise-free examples:

• 15 Gaussian hidden units with the same σ
• three designs are generated, for σ = 0.5, σ = 1.0, σ = 1.5
• output shown for 200 inputs uniformly sampled in the range [−8, 12]

[Figure: network outputs for σ = 0.5, σ = 1.0 and σ = 1.5 plotted with the 15 training points (×)]

best compromise is σ = 1.0
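
A sketch of the design just described — one Gaussian unit centred on each of the 15 examples, a common σ, and output weights found by linear least squares; the exact placement of the training points is an assumption:

```python
import numpy as np

f = lambda x: (x - 2) * (2 * x + 1) / (1 + x ** 2)    # target function

x_train = np.linspace(-8, 12, 15)                     # 15 noise-free examples (placement assumed)
y_train = f(x_train)
x_test = np.linspace(-8, 12, 200)                     # 200 uniformly sampled inputs in [-8, 12]

def design_matrix(x, centres, sigma):
    """Gaussian responses exp(-(x - t)^2 / sigma^2) for every input/centre pair."""
    return np.exp(-((x[:, None] - centres[None, :]) ** 2) / sigma ** 2)

for sigma in (0.5, 1.0, 1.5):
    Phi = design_matrix(x_train, x_train, sigma)      # one hidden unit centred on each example
    w, *_ = np.linalg.lstsq(Phi, y_train, rcond=None) # linear optimisation of the output weights
    y_hat = design_matrix(x_test, x_train, sigma) @ w
    print(sigma, float(np.max(np.abs(y_hat - f(x_test)))))   # compare the three designs
```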

