
3. Least Mean Square (LMS) Algorithm

3.1 Spatial Filtering

Uses a single linear neuron and can be understood as adaptive filtering.

y = ∑k wkxk for k = 1 to p

error e = d − y where d = desired value

cost function = mean squared error:  J = ½ e²

[Figure: single linear neuron — inputs x1 … xp weighted by w1 … wp, plus a fixed input of −1 weighted by w0 = θ (threshold), summed by Σ to give the output y]

3.2 Steepest descent


∂J/∂wk = 0 to determine the optimum weight


adjust the weights iteratively, moving along the error surface (gradient = ∂J/∂w) towards the optimum value:

wk(n+1) = wk(n) − η ∂J(n)/∂wk

i.e. the updated value is proportional to the negative of the gradient of the error surface

[Figure: error surface J plotted against a single weight w0, with minimum Jmin]
∴ (since ∂J(n)/∂wk = −e(n) xk(n) for the linear neuron)  wk(n+1) = wk(n) + η e(n) xk(n)
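
A minimal Python sketch of this LMS update; the data and the value of η here are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stationary inputs and a linear "desired" response for illustration.
p = 4                                            # number of inputs
w_true = np.array([0.5, -1.0, 2.0, 0.3])
X = rng.normal(size=(500, p))                    # stimulus vectors x(n)
d = X @ w_true + 0.01 * rng.normal(size=500)     # desired responses d(n)

w = np.zeros(p)                                  # initial weights
eta = 0.05                                       # learning-rate parameter (hand-picked)

for x_n, d_n in zip(X, d):
    y_n = w @ x_n                                # y = sum_k w_k x_k
    e_n = d_n - y_n                              # e = d - y
    w += eta * e_n * x_n                         # w_k(n+1) = w_k(n) + eta * e(n) * x_k(n)

print(w)                                         # approaches w_true
```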

Properties of LMS:

• a stochastic gradient algorithm, in that the gradient vector is 'random', in contrast to steepest descent
• on average improves in accuracy for increasing values of n
• reduces the storage requirement to the information present in its current set of weights, and can operate in a nonstationary environment

3.2.1 Convergence (proof not given)

• in the mean: the weight vector → the optimum value as n → ∞; requires
  0 < η < 2/λmax,  where λmax is the max eigenvalue of the autocorrelation matrix Rx = E[x xT]

• in the mean square: the mean-square of the error signal → a constant as n → ∞; requires
  0 < η < 2/tr[Rx],  where tr[Rx] = ∑k λk ≥ λmax
  (both bounds are checked numerically in the sketch below)

• faster convergence is usually obtained by making η a function of n, for example η(n) = c/n for some constant c
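
These bounds can be checked numerically for a given set of stimulus vectors; a short sketch with synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))          # hypothetical stimulus vectors x(n), one per row

Rx = X.T @ X / len(X)                  # sample estimate of Rx = E[x x^T]
lam = np.linalg.eigvalsh(Rx)           # eigenvalues of the autocorrelation matrix

print(2.0 / lam.max())                 # convergence in the mean:        0 < eta < 2/lambda_max
print(2.0 / np.trace(Rx))              # convergence in the mean square: 0 < eta < 2/tr[Rx]
# tr[Rx] is the sum of the eigenvalues, so it is >= lambda_max and this bound is the tighter one
```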

4. Multilayer Feedforward Perceptron Training

4.1 Back-propagation Algorithm

[Figure: signal-flow graph of neuron j — outputs yi from the previous layer are weighted by wji and summed to give υj, which passes through ϕ(·) to give the output yj; subtracting yj from the desired response dj gives the error ej]

Let wji be the weight connecting neuron i to neuron j

error signal: ej(n) = dj(n) − yj(n)

net internal sum: υj(n) = ∑iwji(n)yi(n) for i = 0 to p

output: yj(n) = ϕj(υj(n))

Instantaneous sum of squared errors:  ℰ(n) = ½ ∑j ej²(n), summed over all j in the output layer


For N patterns, average squared error:  ℰav = (1/N) ∑n ℰ(n) for n = 1 to N

• The learning goal is to minimise ℰav by adjusting the weights, but instead of ℰav the instantaneous estimate ℰ(n) is used, on a pattern-by-pattern basis




From the chain rule:  ∂ℰ(n)/∂wji(n) = [∂ℰ(n)/∂ej(n)] [∂ej(n)/∂yj(n)] [∂yj(n)/∂υj(n)] [∂υj(n)/∂wji(n)]

∴ weight correction:  Δwji(n) = −η ∂ℰ(n)/∂wji(n)  (i.e. steepest descent)
                               = η δj(n) yi(n),  where the local gradient δj(n) = −∂ℰ(n)/∂υj(n)




Case 1: output node — the local gradient is easily calculated: δj(n) = ej(n) ϕj′(υj(n))

Case 2: hidden node — more complex; need to consider neuron j feeding neuron k, where the inputs to neuron j are the yi

δj(n) = −[∂ℰ(n)/∂yj(n)] ϕj′(υj(n)) = −[∑k ek(n) ∂ek(n)/∂yj(n)] ϕj′(υj(n))

∴ δj(n) = −ϕj′(υj(n)) ∑k ek(n) [∂ek(n)/∂υk(n)] [∂υk(n)/∂yj(n)] = ϕj′(υj(n)) ∑k δk(n) wkj(n)

• Thus δj(n) is computed in terms of δk(n) which is closer to the output. After
calculating the network output in a forward pass, the error is computed and
recursively back-propagated through the network in a backward pass.

weight correction = (learning rate) × (local gradient) × (input signal to neuron j):
∆wji(n) = η δj(n) yi(n)
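
A minimal sketch of back-propagation for one hidden layer, applied pattern by pattern, assuming the logistic activation of section 4.2 and a fixed bias input of −1; the network sizes, initial weight range and example data are placeholders:

```python
import numpy as np

def phi(v):
    """Logistic activation (see section 4.2)."""
    return 1.0 / (1.0 + np.exp(-v))

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 2, 2, 1                       # placeholder network sizes
W1 = rng.uniform(-0.5, 0.5, (n_hid, n_in + 1))     # hidden-layer weights, column 0 is the bias
W2 = rng.uniform(-0.5, 0.5, (n_out, n_hid + 1))    # output-layer weights, column 0 is the bias
eta = 0.5                                          # learning-rate parameter

def train_pattern(x, d):
    """One forward + backward pass for a single training pattern (x, d)."""
    global W1, W2
    # forward pass
    x_aug = np.concatenate(([-1.0], x))            # fixed bias input of -1
    y_hid = phi(W1 @ x_aug)                        # hidden outputs y_j = phi(v_j)
    y_aug = np.concatenate(([-1.0], y_hid))
    y_out = phi(W2 @ y_aug)                        # network outputs
    # backward pass
    e = d - y_out                                  # e_j = d_j - y_j
    delta_out = e * y_out * (1.0 - y_out)          # output node: delta_j = e_j * phi'(v_j)
    delta_hid = y_hid * (1.0 - y_hid) * (W2[:, 1:].T @ delta_out)  # hidden: phi'(v_j) * sum_k delta_k w_kj
    # weight corrections: Delta w_ji = eta * delta_j * y_i
    W2 += eta * np.outer(delta_out, y_aug)
    W1 += eta * np.outer(delta_hid, x_aug)
    return e

# example call with a hypothetical pattern
print(train_pattern(np.array([0.0, 1.0]), np.array([1.0])))
```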

4.2 Back-propagation training

Activation function (logistic):

yj(n) = ϕj(υj(n)) = 1 / (1 + exp(−υj(n)))

∂yj(n)/∂υj(n) = ϕj′(υj(n)) = exp(−υj(n)) / [1 + exp(−υj(n))]² = yj(n) [1 − yj(n)]

Note that the max value of ϕj′(υj(n)) occurs at yj(n) = 0.5, and the min value of 0 occurs at yj(n) = 0 or 1.

Momentum term: add α ∆wji(n − 1) to the weight correction, where 0 ≤ |α| < 1

helps locate a more desirable local minimum in a complex error surface:

• no change in the sign of the gradient ⇒ ∆wji(n) increases and descent is accelerated
• a change in the sign of the gradient ⇒ ∆wji(n) decreases, which stabilises oscillations
• a large enough α can stop the process terminating in shallow local minima
• with momentum, η can be larger (see the sketch below)

[Figure: example error surface against a single weight, showing shallow and deep local minima]
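
A minimal sketch of the momentum update (the generalised delta rule), continuing the notation of the back-propagation sketch in section 4.1; the names and default values here are illustrative:

```python
import numpy as np

def momentum_update(W, delta, y_in, dW_prev, eta=0.5, alpha=0.9):
    """Generalised delta rule with momentum: Dw(n) = eta*delta_j*y_i + alpha*Dw(n-1)."""
    dW = eta * np.outer(delta, y_in) + alpha * dW_prev
    return W + dW, dW                  # keep Dw(n) to use as Dw(n-1) at the next step

# usage: start with dW_prev = 0.0, then repeatedly
#   W2, dW2 = momentum_update(W2, delta_out, y_aug, dW2)
```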

4.3 Other perspectives for improving generalisation

4.3.1 Pattern vs Batch Mode


Choice depends on the particular problem:

• randomly updating weights after each pattern requires very little storage and
leads to a stochastic search which is less likely to get stuck in local minima

• updating after presentation of all training samples (an epoch) provides a more accurate estimate of the gradient vector, since it is based on the average squared error ℰav

4.3.2 Stopping criteria


e.g. gradient vector threshold and/or change in average squared error per epoch

4.3.3 Initialisation

• default is uniform distribution inside a small range of values


• too large values can lead to premature saturation (neuron outputs close to
limits) which gives small weight adjustments even though error is large

4.3.4 Training Set Size


worst-case formula N > W/ε where:
N = no. of examples, W = no. of synaptic weights,
ε = fraction of errors permitted on test
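
A tiny worked illustration of this formula, with hypothetical numbers:

```python
def min_examples(num_weights, error_fraction):
    """Worst-case training-set size: N > W / epsilon."""
    return num_weights / error_fraction

# hypothetical example: 100 synaptic weights and 10% permitted test error
print(min_examples(100, 0.10))    # 1000.0, i.e. need more than 1000 examples
```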

4.3.5 Cross-Validation
• measures generalisation on test set
• various parameters including no. of hidden nodes, learning rate and
training set size can be set based on cross-validation performance

4.3.6 Network Pruning by complexity regularisation

(two possibilities: network growing and network pruning)


goal is to find the weight vector that minimises R(w) = s(w) + λ c(w)

where s(w) is a standard error measure, e.g. mean square error

λ is the regularisation parameter

c(w) is the complexity penalty that depends on the network, e.g. ||w||²

• regularisation term allows identification of weights having insignificant effect
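
A minimal sketch of this regularised cost with the ||w||² complexity penalty; the λ value and the inputs are placeholders:

```python
import numpy as np

def regularised_risk(errors, weights, lam=1e-3):
    """R(w) = s(w) + lambda * c(w), with c(w) = ||w||^2."""
    s = 0.5 * np.mean(np.square(errors))            # standard error measure, e.g. mean square error
    c = sum(float(np.sum(W * W)) for W in weights)  # complexity penalty ||w||^2 over all weight matrices
    return s + lam * c

# e.g. with hypothetical errors e and weight matrices W1, W2:
#   R = regularised_risk(e, [W1, W2], lam=0.001)
# In gradient descent the penalty shrinks weights whose effect on s(w) is insignificant.
```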

4.3.7 Other ways of minimising cost function

• Back-propagation uses a relatively simple, quick approach to minimising the cost function by obtaining an instantaneous estimate of the gradient

• methods and techniques from nonlinear optimum filtering and nonlinear function optimisation have been used to provide more sophisticated approaches to minimising the cost function, e.g. Kalman filtering, the conjugate-gradient method

4.4 Universal Approximation Theorem


a single hidden layer with a suitable ϕ can get arbitrarily close to any continuous function
• the logistic function satisfies the ϕ(⋅) definition
• a single hidden layer is sufficient, but the theorem gives no clue on synthesis
• a single hidden layer is restrictive in that hierarchical features are not supported

4.5 Example of learning XOR Problem

Decision boundaries and truth table:

x1  x2 | target
 0   0 |   0
 0   1 |   1
 1   0 |   1
 1   1 |   0

[Figure: decision boundary of hidden neuron a in the (x1, x2) plane — out = 1 on one side, out = 0 on the other]

[Figure: the 2-2-1 network — inputs x1 and x2 each connect with weight 1 to hidden neuron a (threshold 1.5) and to hidden neuron b (threshold 0.5); neuron a feeds output neuron c with weight −2 and neuron b feeds c with weight 1 (threshold 0.5); each threshold is implemented as a bias weight on a fixed −1 input]

[Figure: decision boundary of hidden neuron b in the (x1, x2) plane — out = 1 on one side, out = 0 on the other]

Hidden-unit outputs and target:

x1  x2 | a  b | target
 0   0 | 0  0 |   0
 0   1 | 0  1 |   1
 1   0 | 0  1 |   1
 1   1 | 1  1 |   0

[Figure: decision regions of output neuron c — out = 1 in the region between the two hidden-unit boundaries, out = 0 outside]
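
A quick check of the network described above, assuming hard-limiting threshold units and the weights and thresholds as read from the figure:

```python
def xor_net(x1, x2):
    a = int(1 * x1 + 1 * x2 > 1.5)     # hidden neuron a: weights 1, 1, threshold 1.5
    b = int(1 * x1 + 1 * x2 > 0.5)     # hidden neuron b: weights 1, 1, threshold 0.5
    c = int(-2 * a + 1 * b > 0.5)      # output neuron c: weights -2, 1, threshold 0.5
    return a, b, c

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, *xor_net(x1, x2))    # the last column reproduces the XOR target
```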

4.6 Example: vehicle navigation

[Figure: network architecture — a video input retina is fully connected to 9 hidden units, which are fully connected to 45 output units representing steering directions from sharp left to sharp right]

• network computes steering angle
• training examples from human driver
• obstacles detected by laser range finder

5. Associative Memories

5.1 linear associative memory

stimulus ak = [ak1, ak2, ..., akp]T        response bk = [bk1, bk2, ..., bkp]T

[Figure: single-layer linear network — each input component aki feeds every output node through the weights, output node j giving response component bkj]

         | w11(k)  w12(k)  ...  w1p(k) |
W(k) =   | w21(k)  w22(k)  ...  w2p(k) |
         |   ⋮       ⋮            ⋮    |
         | wp1(k)  wp2(k)  ...  wpp(k) |

response bk = W(k) ak

Design of the weight matrix for storing q pattern associations ak → bk

estimate of weight matrix = ∑k bk akT for k = 1 to q


(Hebbian learning principle)

where bk akT is the outer product of the column vector bk = [bk1, bk2, ..., bkp]T and the row vector akT = [ak1, ak2, ..., akp]

Pattern recall:
For recall of a stimulus pattern aj: b = W aj = ∑k (akTaj) bk
assuming that the key patterns have been normalised, ajTaj = 1
b = bj + vj where vj = ∑k (akTaj)bk for k = 1 to q, k ≠ j
i.e. vj results from interference from all other stimulus patterns
∴ (akTaj) = 0 for j ≠k → perfect recall (orthonormal patterns)

Main features:

• distributed memory
• auto- and hetero-associative
• content addressable and resistant to noise and damage
• interaction between stored patterns may lead to error on recall
• the max. no. of patterns reliably stored is p, the dimension of the input space, which is also the rank (no. of independent columns or rows) of W
• for an auto-associative memory, ideally W ak = ak, showing that the stimulus patterns are eigenvectors of W, all with unity eigenvalues

Example:  a1 = [1 0 0 0]T,  a2 = [0 1 0 0]T,  a3 = [0 0 1 0]T
          b1 = [5 1 0]T,    b2 = [−2 1 6]T,   b3 = [−2 4 3]T

                             | 5  −2  −2  0 |
    memory weight matrix W = | 1   1   4  0 |     giving perfect recall, since the
                             | 0   6   3  0 |     stimulus patterns are orthonormal

a noisy stimulus, e.g. [0.8 −0.15 0.15 −0.2]T, gives [4 1.25 −0.45]T, which is closer to b1 than to b2 or b3
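
The worked example above can be reproduced directly with the outer-product (Hebbian) rule; a short sketch:

```python
import numpy as np

# Stimulus / response pairs from the example (the stimuli are orthonormal).
A = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0],
              [0, 0, 1, 0]], dtype=float)          # rows a1, a2, a3
B = np.array([[5, 1, 0],
              [-2, 1, 6],
              [-2, 4, 3]], dtype=float)            # rows b1, b2, b3

# Hebbian estimate of the memory: W = sum_k b_k a_k^T
W = sum(np.outer(b, a) for a, b in zip(A, B))
print(W)                      # [[ 5 -2 -2  0] [ 1  1  4  0] [ 0  6  3  0]]

print(W @ A[0])               # perfect recall of b1 = [5 1 0]

noisy = np.array([0.8, -0.15, 0.15, -0.2])
print(W @ noisy)              # [ 4.    1.25 -0.45], closest to b1
```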

6. Radial Basis Functions

6.1 Separability of patterns


Cover's separability theorem states that if the mapping ϕ(x) is nonlinear and the hidden-unit space is of high dimension relative to the input space, then a classification problem is more likely to be linearly separable

[Figure: RBF network — inputs x1 … xp feed nonlinear hidden units ϕ1 … ϕp, whose outputs are combined by a linear output neuron with weights w1 … wp plus a bias weight w0 on a fixed input of 1]

Example of an RBF is a Gaussian:

ϕ(x) = exp(−||x − t||²),  t = centre of the Gaussian


• the output neuron is a linear weighted sum

• ϕ(x) is nonlinear and the hidden-unit space [ϕ1(x), ϕ2(x), ..., ϕp(x)] is usually of high dimension relative to the input space, so the patterns are more likely to be separable

• a difficult nonlinear optimisation problem has been converted to a linear optimisation problem that can be solved by the LMS algorithm

• if a different RBF is centred on each training pattern, then the training set can be learned perfectly

6.2 Example: XOR

use two hidden Gaussian functions

ϕ1(x) = exp(−||x − t1||²),   t1 = [1, 1]T
ϕ2(x) = exp(−||x − t2||²),   t2 = [0, 0]T
[Figure: the four patterns mapped into the (ϕ1(x), ϕ2(x)) plane — patterns (0,1) and (1,0) coincide, and a straight decision boundary separates them from patterns (0,0) and (1,1)]

x1  x2 | ϕ1(x)  ϕ2(x)
 0   0 |  e⁻²     1
 0   1 |  e⁻¹    e⁻¹
 1   0 |  e⁻¹    e⁻¹
 1   1 |   1     e⁻²
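
The table entries follow directly from the two Gaussians as defined above; a quick numerical check:

```python
import numpy as np

t1, t2 = np.array([1.0, 1.0]), np.array([0.0, 0.0])
phi = lambda x, t: np.exp(-np.sum((x - t) ** 2))   # exp(-||x - t||^2)

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    x = np.array(x, dtype=float)
    print(x, phi(x, t1), phi(x, t2))
# (0,0) -> (e^-2, 1); (0,1) and (1,0) -> (e^-1, e^-1); (1,1) -> (1, e^-2)
```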

6.3 Ill-posed Hypersurface Reconstruction

The inverse problem of finding an unknown mapping F from domain X to range Y is well-posed if:

1. for every x ∈ X there exists y ∈ Y (existence)
2. for every pair of inputs x, t ∈ X, F(x) = F(t) iff x = t (uniqueness)
3. the mapping is continuous (continuity)

[Figure: mapping of x ∈ X to F(x) ∈ Y]

Learning is ill-posed because of sparsity of information & noise in training set

Regularisation theory for solving ill-posed problems (Tikhonov) uses a modified cost functional that includes a complexity term:

ℰ(F) = ℰs(F) + λ ℰc(F)

where ℰs(F) is the standard error term and ℰc(F) is the regularising term

• one regularised solution is given by a linear superposition of multivariate Gaussian basis functions, with centres xi and widths σi:

F(x) = ∑i wi exp(−||x − xi||² / σi²)   for i = 1 to N

practical ways of regularising:

• reduce the number of RBFs
• change the σ of the RBFs
• choose the position of the centres

6.4 RBF Networks vs. MLP

• RBF: single hidden layer; MLP: possibly multiple hidden layers

• RBF: hidden and output layers fundamentally different; MLP: common computation nodes in all layers

• RBF: nonlinear hidden layer but linear output layer; MLP: all layers usually nonlinear

• RBF: each hidden unit computes the Euclidean norm between the input vector and the unit's centre; MLP: each unit computes the inner product of the input vector and its weight vector

• RBF: local approximation with fast learning but poor extrapolation; MLP: global approximation and therefore good at extrapolation

6.5 Learning Strategies

variety of possibilities, since a nonlinear optimisation strategy for the hidden layer is combined with a linear optimisation strategy in the output layer. For the hidden layer the main choice involves how the centres are learned:

• Fixed centres selected at random, e.g. choose the Gaussian exp(−(M/d²) ||x − ti||²), where M = no. of centres and d = the maximum distance between them

• Self-organised selection of centres, e.g. k-n-n or a self-organising NN

• Supervised selection of centres, e.g. error-correction learning with a suitable cost function, using modified gradient descent

6.6 Example: curve fitting

RBF network for approximating (x − 2)(2x + 1)(1 + x²)⁻¹ from 15 noise-free examples:

• 15 Gaussian hidden units with the same σ
• three designs are generated, for σ = 0.5, σ = 1.0, σ = 1.5
• output shown for 200 inputs uniformly sampled in the range [−8, 12]

[Figure: network outputs for σ = 0.5, σ = 1.0 and σ = 1.5 plotted with the 15 training points (×)]

best compromise is σ = 1.0
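
A sketch of the design just described — one Gaussian unit centred on each of the 15 examples, a common σ, and output weights found by linear least squares; the exact placement of the training points is an assumption:

```python
import numpy as np

f = lambda x: (x - 2) * (2 * x + 1) / (1 + x ** 2)    # target function

x_train = np.linspace(-8, 12, 15)                     # 15 noise-free examples (placement assumed)
y_train = f(x_train)
x_test = np.linspace(-8, 12, 200)                     # 200 uniformly sampled inputs in [-8, 12]

def design_matrix(x, centres, sigma):
    """Gaussian responses exp(-(x - t)^2 / sigma^2) for every input/centre pair."""
    return np.exp(-((x[:, None] - centres[None, :]) ** 2) / sigma ** 2)

for sigma in (0.5, 1.0, 1.5):
    Phi = design_matrix(x_train, x_train, sigma)      # one hidden unit centred on each example
    w, *_ = np.linalg.lstsq(Phi, y_train, rcond=None) # linear optimisation of the output weights
    y_hat = design_matrix(x_test, x_train, sigma) @ w
    print(sigma, float(np.max(np.abs(y_hat - f(x_test)))))   # compare the three designs
```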

