$u_j^l(x_p)$ = activation output (for pattern $p$) of node $j$ in layer $l$, prior to the nonlinearity; $P$ = number of training patterns; and $x_p$ = the $p$th training pattern.
To illustrate the back-propagation learning procedure, assume that node $i$ of the $(l+1)$th layer receives signals from node $j$ in the $l$th layer via the weights $w_{ij}^l$. With $N_l$ nodes in the $l$th layer, the output signal from node $i$ of the $(l+1)$th layer, for the $k$th input pattern to the network, is expressed by
$$O_i^{l+1}(k) = f\bigl(u_i^{l+1}(k)\bigr) = f\Bigl(\sum_{j=1}^{N_l} w_{ij}^l\,O_j^l(k)\Bigr),$$
where the threshold term has been included in the summation.
[Figure: Learning procedure with back propagation at a node, showing the outputs $O_j^l(k)$ of layer $l$ feeding node $i$ of layer $l+1$ through the weights $w_{ij}^l$ to produce $O_i^{l+1}(k)$. Left: layer $l$; right: layer $l+1$.]
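The forward computation above can be sketched in MATLAB. This is only an illustrative sketch: the layer sizes, the weight matrix W, and the trick of appending a constant 1 to carry the threshold term are assumptions, not part of the original text.

% Forward propagation through one layer: O(l+1) = f( W(l) * [O(l); 1] )
% The appended constant 1 plays the role of the threshold term
% "included in the summation".
f = @(x) 1 ./ (1 + exp(-x));      % sigmoid nonlinearity
O_prev = [0.2; 0.7; 0.1];         % outputs O_j^l of the N_l = 3 nodes in layer l
W      = randn(2, 4);             % weights w_ij^l (2 nodes in layer l+1, 3 inputs + threshold)
u = W * [O_prev; 1];              % activations u_i^(l+1) prior to the nonlinearity
O = f(u);                         % outputs O_i^(l+1)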
If the sigmoid function f(x) = 1 / (1 + exp(-x)) is used, its derivative is
f'(x) = f(x) (1 - f(x)).
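As a quick sanity check (not part of the original text), this identity can be verified numerically in MATLAB:

% Sigmoid and its derivative: f'(x) = f(x)*(1 - f(x))
f  = @(x) 1 ./ (1 + exp(-x));
df = @(x) f(x) .* (1 - f(x));
x  = -5:0.01:5;
numerical = gradient(f(x), x);    % finite-difference estimate of f'
max(abs(numerical - df(x)))       % should be close to zero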
The total error, E, for the network and for all K patterns is defined as the sum of squared differences between the actual network output and the target (or desired) output at the output layer L:
$$E = \sum_{k=1}^{K} E_k = \frac{1}{2}\sum_{k=1}^{K}\sum_{i=1}^{N_L}\bigl[T_i(k) - O_i^L(k)\bigr]^2 .$$
The goal is to find a set of weights in all layers of the network that minimizes E. The learning rule is specified by setting the change in the weights proportional to the negative derivative of the error with respect to the weights:
$$\Delta w_{nm}^l \propto -\frac{\partial E_k}{\partial w_{nm}^l}.$$
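For illustration only, the total error can be computed in MATLAB as follows; the matrices T and O below are made-up placeholders, not data from the text.

% Sum-of-squared-errors over K patterns and N_L output nodes
% T(i,k) = target of output node i for pattern k, O(i,k) = actual output
T = [1 0 1; 0 1 0];               % example targets (N_L = 2, K = 3)
O = [0.9 0.2 0.8; 0.1 0.7 0.3];   % example network outputs
E = 0.5 * sum(sum((T - O).^2));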
To calculate the dependence of the error $E_k$ on the $nm$th weight of a neuron in the $l$th layer, we use the chain rule:
$$\frac{\partial E_k}{\partial w_{nm}^l} = \sum_{i=1}^{N_L}\frac{\partial E_k}{\partial O_i^L(k)}\;\frac{\partial O_i^L(k)}{\partial w_{nm}^l}.$$
Then
$$\frac{\partial E_k}{\partial w_{nm}^l} = -\sum_{i=1}^{N_L}\bigl(T_i(k) - O_i^L(k)\bigr)\,\frac{\partial O_i^L(k)}{\partial w_{nm}^l}.$$
If we introduce the sigmoid function and its derivative into the latter relationship, then for $l = L-1$ (i.e., the weights of the output layer),
$$\frac{\partial E_k}{\partial w_{nm}^{L-1}} = -\bigl(T_n - O_n^L\bigr)\,O_n^L\bigl(1 - O_n^L\bigr)\,O_m^{L-1}.$$
Thus, the procedure for adjusting the weights of the output layer is
$$\Delta w_{nm}^{L-1} = \eta\,\bigl(T_n - O_n^L\bigr)\,O_n^L\bigl(1 - O_n^L\bigr)\,O_m^{L-1},$$
where $\eta$ is a proportionality factor known as the learning rate.
However, if $l \neq L-1$, then $O_i^L(k)$ still depends on $w_{nm}^l$ (indirectly, through the outputs of the hidden layers), and the error dependency on the weights, again by applying the chain rule, is
$$\frac{\partial E_k}{\partial w_{nm}^l} = -\sum_{i=1}^{N_L}\bigl(T_i(k) - O_i^L(k)\bigr)\,f'\bigl(u_i^L\bigr)\sum_{j=1}^{N_{L-1}} w_{ij}^{L-1}\,\frac{\partial O_j^{L-1}(k)}{\partial w_{nm}^l}.$$
Now, if $l = L-2$ (i.e., weights of neurons in the last hidden layer), then the latter is expressed by
$$\frac{\partial E_k}{\partial w_{nm}^{L-2}} = -\Bigl[\sum_{i=1}^{N_L}\bigl(T_i(k) - O_i^L(k)\bigr)\,f'\bigl(u_i^L\bigr)\,w_{in}^{L-1}\Bigr]\,f'\bigl(u_n^{L-1}\bigr)\,O_m^{L-2}.$$
Consequently, the procedure for adjusting the weights of the last hidden layer is
$$\Delta w_{nm}^{L-2} = \eta\,\Bigl[\sum_{i=1}^{N_L}\bigl(T_i - O_i^L\bigr)\,f'\bigl(u_i^L\bigr)\,w_{in}^{L-1}\Bigr]\,f'\bigl(u_n^{L-1}\bigr)\,O_m^{L-2}.$$
The latter is summarized as
$$\Delta w_{ij}^l = \eta\,\delta_i^{l+1}\,O_j^l,$$
where, for the weights at the output layer,
$$\delta_i^L = \bigl(T_i - O_i^L\bigr)\,O_i^L\bigl(1 - O_i^L\bigr),$$
and, for the weights of the hidden layers,
$$\delta_i^l = O_i^l\bigl(1 - O_i^l\bigr)\sum_{r=1}^{N_{l+1}}\delta_r^{l+1}\,w_{ri}^l.$$
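A minimal MATLAB sketch of these delta computations for one pattern follows. The numerical values and the weight matrix W2 are made-up placeholders, and the threshold terms are omitted for brevity; this is a sketch of the summarized rule, not code from the original text.

% Deltas of the summarized rule for one training pattern
T  = [1; 0];                          % targets at the output layer (N_L = 2)
Oo = [0.8; 0.3];                      % output-layer outputs O_i^L
Oh = [0.6; 0.4; 0.9];                 % last-hidden-layer outputs O_i^l
W2 = randn(2, 3);                     % weights w_ri from hidden node i to output node r
eta = 0.5;                            % learning rate

delta_L = (T - Oo) .* Oo .* (1 - Oo);         % delta_i^L (output layer)
delta_h = Oh .* (1 - Oh) .* (W2' * delta_L);  % delta_i^l (hidden layer)

dW2 = eta * delta_L * Oh';            % weight changes dw_ij = eta * delta_i * O_j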
The process of computing the gradient and adjusting the weights is repeated until a minimum error is found. In practice, one develops an algorithm termination criterion so that the algorithm does not continue this iterative process forever.
It is apparent that for nodes in layer l the computation of $\delta_i^l$ depends on the errors $\delta_r^{l+1}$ computed at layer l+1; that is, the computation of the differences proceeds backwards, from the output layer toward the input layer.
Applications:
Before applying the algorithm, one needs to
1. Decide on the function the network is to perform (i.e., recognition, prediction, or generalization).
2. Have a complete set of input and output training patterns.
3. Determine the number of layers in the network and the number of nodes per layer.
4. Select the nonlinearity function (typically a sigmoid) and a value for the learning rate.
5. Determine the algorithm termination criteria.
The learning algorithm can now be applied as follows:
1. Initialize all weights to small random values.
2. Choose a training pair (x(k), T(k)).
3. Calculate the actual output from each neuron in a layer, starting with the input layer and proceeding layer by layer toward the output layer L:
$$O_j^l(k) = f\Bigl(\sum_{m=0}^{N_{l-1}} w_{jm}^{l-1}\,O_m^{l-1}(k)\Bigr).$$
4. Compute the gradient and the difference $\delta$ for each neuron in a layer, starting with the output layer and backtracking layer by layer toward the input.
5. Update the weights.
6. Repeat steps 2-5.
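Steps 1-6 can be illustrated with a minimal MATLAB sketch. The XOR training set, the network size, the learning rate, and the stopping threshold below are assumptions chosen for the example (they are not from the text), and convergence is not guaranteed for every random initialization.

% Minimal back-propagation training loop (steps 1-6) on the XOR problem
f = @(x) 1 ./ (1 + exp(-x));          % sigmoid nonlinearity
X = [0 0 1 1; 0 1 0 1];               % training inputs x(k), one column per pattern
T = [0 1 1 0];                        % targets T(k)
eta = 0.5;                            % learning rate
W1 = 0.1 * randn(3, 3);               % step 1: small random weights (hidden layer, bias folded in)
W2 = 0.1 * randn(1, 4);               % output-layer weights (1 output, 3 hidden + bias)

for epoch = 1:20000
    E = 0;
    for k = 1:size(X, 2)              % step 2: choose a training pair (x(k), T(k))
        % step 3: forward pass, from the input layer toward the output layer
        Oh = f(W1 * [X(:, k); 1]);
        Oo = f(W2 * [Oh; 1]);
        % step 4: deltas, from the output layer backtracking toward the input
        delta_o = (T(k) - Oo) .* Oo .* (1 - Oo);
        delta_h = Oh .* (1 - Oh) .* (W2(:, 1:end-1)' * delta_o);
        % step 5: update the weights
        W2 = W2 + eta * delta_o * [Oh; 1]';
        W1 = W1 + eta * delta_h * [X(:, k); 1]';
        E = E + 0.5 * (T(k) - Oo)^2;
    end
    if E < 1e-3, break, end           % termination criterion
end                                    % step 6: repeat steps 2-5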
Application: hands-free telephone
We consider the case of a mobile telephone mounted on the vehicle control panel. The engine creates a noise p that superimposes itself, with a multiplying factor, onto the driver's voice, which constitutes the useful signal x.
The noise p is measured by a sensor located close to the engine. The microphone of the telephone receives a part of this parasitic signal, i.e., k*p (k < 1), which is added to the useful signal x.
We send the engine noise p to the input of the neuron and we wish to predict the noisy signal z. The neuron output y will reproduce the part of z that is correlated with the noise p, i.e., a signal close to k*p. The prediction error, the difference between the target (the noisy signal) and the output of the neuron, constitutes the estimate of the useful signal x.
%interference cancellation
clear all, close all
t = 0:0.0005:1;
x = sin(sin(20*t).*t*200);
% plotting of useful signal x
figure(1), plot(t, x)
axis([0 1 -1.2 1.2])
xlabel('time'), title('useful voice signal x'), grid
% parasitic signal
p = (sin(20*pi*t)/2)+(0.2*sin(100*pi*t-100));
figure(2)
plot(t, p)
axis([0 1 -1.2 1.2])
xlabel('time'), title('the engine noise p'), grid
% noisy signal
figure(3)
z = x + 0.833*p;
plot(t, z)
xlabel('time'), title('noisy signal z = x + 0.833 p'), grid
% random initialization of the weights w and bias b
w = randn(1, 1); b = randn(1, 1);
weights_w = w; bias_b = b;
eta = 0.1; % adaptation gain (learning rate)
% preallocate the output and error vectors
y = zeros(size(t)); e = zeros(size(t));
for i=1:length(t)
% the neuron output
y(i) = w * p(i) + b;
% error output
e(i) = z(i) - y(i);
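% learnwh (Neural Network Toolbox) returns the Widrow-Hoff (LMS) increments,
% dw = eta*e*p and db = eta*e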
[dw,db] = learnwh(p(i),e(i),eta);
%Updating the weights w and bias b matrices
w = w + dw; b = b + db;
% saving the weights and bias matrices
weights_w = [weights_w, w];
bias_b = [bias_b, b];
end
% extracted signal
figure(4), plot(t, e), axis([0, 1 -1.2 1.2])
xlabel('time'), title('extracted signal'), grid
% extraction error
figure(5), plot(t, x-e), xlabel('time'),
title('extraction error'), grid
axis([0 1 -1 1])