
Improving the rate of convergence of the Backpropagation Algorithm for Neural Networks using Boosting with Momentum

Nikhil Ratna Shakya


School of Engineering and Sciences,
Jacobs University,
Bremen, Germany
November 2014

Abstract
This paper discusses the problem of slow convergence of the gradient descent backpropagation algorithm and proposes using a combination of Nesterov's Accelerated Gradient Descent algorithm (Nesterov 1983) and parallel coordinate descent in order to alleviate this problem. This accelerated gradient method, called BOOM for boosting with momentum, was developed at Google Research and has been applied to large-scale data sets at Google.
Background on Neural Networks and the Backpropagation Algorithm
Divide et impera, Latin for divide and conquer, is a strategy that has proven very powerful in politics and economics. Neural networks apply the same principle to computation, decomposing a complex problem into simpler parts so that it can be solved efficiently. One such model is the Artificial Neural Network, inspired by the functioning of neurons in the animal nervous system. A neuron consists of dendrites that carry impulses from various sources towards the cell body; if a certain threshold is met, the neuron fires and the impulse is carried on to other neurons and eventually to the spinal cord or the brain.
Neural networks are organized in layers made up of interconnected nodes. The input layer receives a pattern and communicates it to one or more hidden layers, where the actual processing is done. Every connection has a weight associated with it. After processing, the result is passed on to the output layer.

Figure 1: A simple three-layer neural network
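Concretely, each hidden or output node in a network like the one in Figure 1 forms a weighted sum of its incoming values plus a bias and passes it through an activation function. In generic notation (introduced here for illustration, not taken from the paper), and with the sigmoid activation used later in the Matlab listing:

a_j = \sigma\left( \sum_i w_{ij}\, x_i + b_j \right), \qquad \sigma(x) = \frac{1}{1 + e^{-x}}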



The backpropagation algorithm is a classic learning method for feedforward multilayer neural networks. The input is first fed forward through the network, the generated output is compared with the desired output, and the weights are then adjusted backwards layer by layer, performing an iterative gradient descent towards a minimum of the error surface, i.e. the weights are adjusted so that the distance between the generated output and the desired output becomes as small as possible. The algorithm is largely successful; however, its slow convergence still limits its efficiency.
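Written out in generic notation (not quoted from the paper), with the error E taken as half the squared distance between the desired outputs t_k and the generated outputs o_k, each weight is moved a small step against the gradient of E at every iteration:

E = \frac{1}{2} \sum_k (t_k - o_k)^2, \qquad w_{ij} \leftarrow w_{ij} - \eta \, \frac{\partial E}{\partial w_{ij}}

Here \eta is the learning rate (the variable coeff in the listing below), and backpropagation evaluates the partial derivatives layer by layer with the chain rule, starting from the output layer.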

Matlab implementation for an XOR gate

clc
% XOR input for x1 and x2
input = [0 0; 0 1; 1 0; 1 1];
% Desired output of XOR
output = [0;1;1;0];
% Bias initialization
bias = [1 1 1];
% Learning coefficient
coeff = 0.8;
% Number of learning iterations
iterations = 5000;
% Initialize weights randomly (no semicolon, so the initial weights are printed)
weights = 2.*rand(3,3) - 1
err = zeros(iterations,1);

for i = 1:iterations
    out = zeros(4,1);
    numIn = length(input(:,1));
    for j = 1:numIn
        % Hidden layer
        H1 = bias(1,1)*weights(1,1) + input(j,1)*weights(1,2) + input(j,2)*weights(1,3);

        % Send data through the sigmoid function 1/(1+exp(-x))
        % sigma(x) = 1/(1+exp(-x)) is defined in a separate file
        x2(1) = sigma(H1);
        H2 = bias(1,2)*weights(2,1) + input(j,1)*weights(2,2) + input(j,2)*weights(2,3);
        x2(2) = sigma(H2);

        % Output layer
        x3_1 = bias(1,3)*weights(3,1) + x2(1)*weights(3,2) + x2(2)*weights(3,3);
        out(j) = sigma(x3_1);

        % Adjust delta values of weights
        % For the output layer:
        % delta(wi) = xi*delta,
        % delta = actual output*(1 - actual output)*(desired output - actual output)
        delta3_1 = out(j)*(1-out(j))*(output(j)-out(j));

        % Propagate the delta backwards into the hidden layer
        delta2_1 = x2(1)*(1-x2(1))*weights(3,2)*delta3_1;
        delta2_2 = x2(2)*(1-x2(2))*weights(3,3)*delta3_1;

        % Make adjustments to the weights
        % and use the new weights to repeat the process.
        for k = 1:3
            if k == 1 % bias weights
                weights(1,k) = weights(1,k) + coeff*bias(1,1)*delta2_1;
                weights(2,k) = weights(2,k) + coeff*bias(1,2)*delta2_2;
                weights(3,k) = weights(3,k) + coeff*bias(1,3)*delta3_1;
            else      % input weights
                weights(1,k) = weights(1,k) + coeff*input(j,1)*delta2_1;
                weights(2,k) = weights(2,k) + coeff*input(j,2)*delta2_2;
                weights(3,k) = weights(3,k) + coeff*x2(k-1)*delta3_1;
            end
        end
    end
    % Root of the summed squared error for this pass over the training set
    error = out - output;
    err(i) = (sum(error.^2))^0.5;
end
% Display the final weights, the final outputs, and the error curve
weights
out
figure(1);
plot(err)
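The listing calls a helper sigma that, as the comment notes, is defined in a separate file. The original file is not reproduced in the paper, but a minimal version consistent with the formula given in that comment would be:

% sigma.m -- logistic sigmoid 1/(1+exp(-x)), saved as its own file on the Matlab path
function y = sigma(x)
    y = 1./(1 + exp(-x));
end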

Matlab results
Initial weights =
0.8251 0.2134 -0.4771
0.2768 -0.7028 0.5780
-0.1327 -0.3299 0.4744
Final weights =
2.8090 -6.8838 -7.5744
9.1216 -6.7884 -5.5076
-4.0778 -9.1860 8.7378
Final Output =
0.0179
0.9873
0.9767
0.0235

Figure 2: Demonstration of Backpropagation

Figure 3: A simple three-layer neural network


Recognizing this problem, research has been done to develop algorithms that accelerate the learning speed:
Adaptive step size: In standard backpropagation, a learning rate is chosen at the start and kept fixed throughout the process. There are, however, algorithms that compute the best step size for each iteration, such as the one proposed by Barzilai and Borwein (1988), which has been shown to perform better at the extra cost of storing one additional iterate and gradient.
Adding a downward estimation: In narrow valleys, convergence tends to be slow due to oscillations. To overcome this, Ihm and Park (1999) proposed adding an estimated downward direction at the valleys.
Using second order information: Several methods that exploit the second derivative, such as Quasi-Newton (like the Newton method but based on a quadratic approximation) and Levenberg-Marquardt (which interpolates between Gauss-Newton and gradient descent, but is more robust than Gauss-Newton because it tends to approach the minimum even when starting farther away), have been observed to be very fast. However, these methods are quite expensive when applied to large-scale networks, and there has been ongoing research to reduce their cost (LeCun 1998, Schraudolph 2002).
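The momentum-based route taken in this paper instead builds on Nesterov's Accelerated Gradient (Nesterov 1983), in which the gradient is evaluated at a look-ahead point rather than at the current iterate. Below is a minimal, self-contained sketch on a toy quadratic error; the curvature matrix, gradient function and step sizes are illustrative stand-ins, not the paper's BOOM implementation.

% Nesterov-style accelerated gradient step on a toy quadratic error E(w) = 0.5*w'*A*w
A = [3 0; 0 1];                  % illustrative positive-definite curvature
gradE = @(w) A*w;                % gradient of the toy error
w = [2; -2];                     % starting point
v = zeros(size(w));              % velocity (momentum) term
eta = 0.1;  mu = 0.9;            % step size and momentum coefficient (illustrative)
for t = 1:200
    lookahead = w + mu*v;        % gradient is taken at the look-ahead point
    v = mu*v - eta*gradE(lookahead);
    w = w + v;                   % plain gradient descent would be w = w - eta*gradE(w)
end
w                                % approaches the minimizer [0; 0]

Per the abstract, BOOM combines this kind of accelerated step with parallel coordinate descent.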
In this paper
