
Adaptive Learning Rate and Limited Error Signal for Multilayer Perceptrons with n-th Order Cross-Entropy Error
Sang-Hoon Oh, Soo-Young Lee*, Sungmoon Shin, and Hun Lee
Mobile Protocol and Signaling Section,
Electronics and Telecommunications Research Institute, Yusong P.O. Box 106, Taejon, Korea
Department of Electrical Engineering, KAIST, 373-1 Kusong-dong, Yusong-gu, Taejon, Korea*

Abstract: Although an n-th order cross-entropy (nCE) error function resolves the incorrect saturation problem of the conventional error backpropagation (EBP) algorithm, the performance of multilayer perceptrons (MLPs) using the nCE function depends heavily on the order of nCE. In this paper, we propose an adaptive learning rate to make the MLP performance insensitive to the order of nCE. Additionally, we propose to limit the error signal values at the output nodes for stable learning with the adaptive learning rate. The effectiveness of the proposed method is demonstrated in a handwritten digit recognition task.

I. INTRODUCTION
Multilayer perceptrons (MLPs) use the error backpropagation (EBP) algorithm for training [1]. Training is usually done by iterative updating of the weights according to the error signal, which is the negative gradient of the mean-squared error (MSE) function. In the output layer, the error signal is the difference between the desired and actual output values of the MLP multiplied by the slope of the sigmoid activation function. Then, the error signal is back-propagated to the hidden layers.

In pattern recognition applications, the desired output value of the MLP is one of the two extreme values of the sigmoid function. If the weighted sum to any output node is near the wrong extreme value, we say the node is "incorrectly saturated" [2].

When an output node is incorrectly saturated, the amount of weight change is small due to the small gradient of the sigmoid activation function, and the error remains nearly unchanged [2], [3], [4], [5]. This incorrect saturation problem is a major reason for the slow learning speed of the EBP algorithm.

In order to resolve this problem, an n-th order cross-entropy (nCE) error function was proposed so that the error signal is strong for an incorrectly saturated output node and weak for a correctly saturated output node [6]. In addition to resolving the incorrect saturation problem, the nCE function prevents overspecialization of MLPs for training patterns. The generalization performance of the trained MLPs, however, depends heavily on the order of the nCE function, and we should find an optimum order of the nCE function to attain good training results of MLPs with fast learning speed.

0-7803-4859-1/98 $10.00 © 1998 IEEE
This paper proposes an adaptive learning rate to make the performance of MLPs insensitive to the order of the nCE error. Additionally, we propose to limit the error signal values of the output nodes to prevent the unstable learning behavior caused by the adaptive learning rate. In Section 2, we briefly review the nCE error function for the EBP algorithm. In Section 3, we describe the adaptive learning rate and the limited error signal. In Section 4, we demonstrate the effectiveness of the proposed method in a handwritten digit recognition task using the CEDAR database [7], and Section 5 concludes this paper.

II. n-TH ORDER CROSS-ENTROPY ERROR
Consider an MLP consisting of L layers, of which the l-th layer has N_l nodes. Let the state vector of nodes in layer l be \mathbf{x}^{(l)} = [x_1^{(l)}, x_2^{(l)}, \ldots, x_{N_l}^{(l)}], and let \mathbf{x}^{(0)} and \mathbf{x}^{(L)} be the input and output vectors, respectively. Here, x_j^{(l)} (l \neq 0) has a value between -1 and 1. Also, let the desired output vector corresponding to a training pattern \mathbf{x} be \mathbf{t} = [t_1, t_2, \ldots, t_{N_L}]. When \mathbf{x} is presented to the network, the state x_j^{(l)} in the l-th layer is

x_j^{(l)} = f\big(\hat{h}_j^{(l)}\big), \quad \hat{h}_j^{(l)} = \sum_{i=1}^{N_{l-1}} w_{ji}^{(l)} x_i^{(l-1)} + w_{j0}^{(l)}, \quad f(h) = \frac{2}{1+e^{-h}} - 1. \qquad (1)

Here, w_{ji}^{(l)} denotes the weight connecting x_i^{(l-1)} to x_j^{(l)}, and w_{j0}^{(l)} denotes the bias to x_j^{(l)}.

The conventional MSE function [1] is

E_m(\mathbf{x}) = \frac{1}{2} \sum_{j=1}^{N_L} \big(t_j - x_j^{(L)}\big)^2. \qquad (2)

To minimize E_m(\mathbf{x}), each weight is updated by

\Delta w_{ji}^{(l)} = -\eta\, \frac{\partial E_m}{\partial w_{ji}^{(l)}} = \eta\, \delta_j^{(l)}\, x_i^{(l-1)}. \qquad (3)
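As an illustrative sketch (not the authors' code; the bipolar-sigmoid form and all function names are our assumptions), the forward pass of Eq. (1) and the update rule of Eq. (3) for a single layer can be written as:

```python
import math

def f(h):
    # Bipolar sigmoid of Eq. (1): output range (-1, 1); equals tanh(h/2)
    return 2.0 / (1.0 + math.exp(-h)) - 1.0

def forward_layer(x_prev, W, b):
    # Eq. (1): x_j = f(sum_i w_ji * x_i + w_j0)
    return [f(sum(w_ji * x_i for w_ji, x_i in zip(row, x_prev)) + b_j)
            for row, b_j in zip(W, b)]

def update_layer(W, b, delta, x_prev, eta):
    # Eq. (3): w_ji <- w_ji + eta * delta_j * x_i (bias uses a constant input of 1)
    for j, d_j in enumerate(delta):
        b[j] += eta * d_j
        for i, x_i in enumerate(x_prev):
            W[j][i] += eta * d_j * x_i

# Tiny example: 2 inputs -> 1 output node
W, b = [[0.1, -0.2]], [0.0]
x = [1.0, -1.0]
y = forward_layer(x, W, b)          # single output in (-1, 1)
update_layer(W, b, [0.5], x, 0.1)   # one update step with error signal 0.5
```

The layer loop over `forward_layer` for l = 1, ..., L gives the full forward pass; the error signals `delta` come from the backpropagation equations below.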


Here, \delta_j^{(l)} is the error signal and \eta is the learning rate. For the output layer,

\delta_j^{(L)} = \big(t_j - x_j^{(L)}\big)\, f'\big(\hat{h}_j^{(L)}\big),

and for the hidden layers,

\delta_j^{(l)} = f'\big(\hat{h}_j^{(l)}\big) \sum_{k=1}^{N_{l+1}} w_{kj}^{(l+1)}\, \delta_k^{(l+1)}, \quad 1 \le l \le L-1. \qquad (4)

In the above EBP algorithm, an output node which has the extreme value opposite to t_j cannot make a strong error signal for adjusting the weights significantly [2], [5], as shown in Fig. 1. This incorrect saturation retards the search for a minimum in the error surface.

To resolve the incorrect saturation problem, a strong error signal is necessary for the incorrectly saturated output nodes, as in the cross-entropy (CE) method [5]. For correctly saturated output nodes, a weak error signal needs to be generated so that the weight update associated with a training pattern scarcely perturbs the weights trained for all training patterns. The weak error signal is also necessary to prevent overspecialization of learning for training patterns, as in the classification figure of merit method [8].

In this sense, Oh proposed an n-th order cross-entropy (nCE) error function [6]

E_n(\mathbf{x}) = -\frac{1}{2^{n-1}} \sum_{j=1}^{N_L} t_j^{n+1} \int_0^{x_j^{(L)}} \frac{2\,(t_j - u)^n}{1 - u^2}\, du, \qquad (5)

where t_j = \pm 1 and n = 1, 2, \ldots. Using the above error function, the error signal of the output layer is

\delta_j^{(L)} = \frac{t_j^{n+1}\big(t_j - x_j^{(L)}\big)^n}{2^{n-1}}. \qquad (6)

Fig. 1. The error signal of an output node with t_j = 1. x_j^{(L)} is the j-th output value and \delta_j^{(L)} is the error signal of x_j^{(L)}.

The nCE error function with n = 1 corresponds to the CE method. As shown in Fig. 1, the nCE error signal with n \ge 2 can satisfy the above criterion, which requires a strong error signal for an incorrectly saturated output node and a weak error signal for a correctly saturated output node.

If we increase n from 2 to a higher value, the error signal more effectively reduces the incorrect saturation of output nodes and prevents overspecialization for training patterns [6]. However, with increasing n, the very weak error signal for output nodes near the desired value delays learning. Consequently, the classification ratio and the learning speed depend seriously on the order of the nCE error. Therefore, we should determine an optimum order of the nCE function for good training results of MLPs with fast learning speed.

III. ADAPTIVE LEARNING RATE WITH LIMITED ERROR SIGNAL

To resolve the aforementioned problems, we propose an adaptive learning rate at each epoch s as

\eta(s) = \eta \sqrt{ \frac{E\{(t_j(s) - x_j^{(L)}(s))^2\}}{E\{\delta_j^{(L)2}(s)\}} }. \qquad (7)

Here, E\{(t_j(s) - x_j^{(L)}(s))^2\} and E\{\delta_j^{(L)2}(s)\} are the expected values considering all output nodes in the s-th epoch. Then, the expected intensity of \eta(s)\delta_j^{(L)}(s) is

E\{(\eta(s)\,\delta_j^{(L)}(s))^2\} = \eta^2\, E\{(t_j(s) - x_j^{(L)}(s))^2\}. \qquad (8)

Thus, we can use the functional characteristic of \delta_j^{(L)} for updating the weights while the expected intensity of \eta(s)\delta_j^{(L)}(s) has the same value as that in the CE method. In simulations, we calculate \eta(s) using E\{(t_j(s) - x_j^{(L)}(s))^2\} and E\{\delta_j^{(L)2}(s)\} estimated at the (s-1)-th epoch, since we cannot derive them at the beginning of the s-th epoch.

The adaptive learning rate alleviates the dependency of the training results on the order of the nCE function. When x_k^{(L)} approaches t_k as learning progresses, however, \eta(s) takes a very large value, which results in drastic changes of the weights. As a result, the adaptive learning rate may make learning unstable.

In order to suppress drastic weight changes due to the large learning rate, we propose to limit the error signal

values of the output node as

\bar{\delta}_j^{(L)}(s) =
\begin{cases}
\delta_j^{(L)}(s), & \text{if } |\delta_j^{(L)}(s)| \le 3\sqrt{E\{\delta_j^{(L)2}(s)\}}, \\
3\,\mathrm{sgn}\big(\delta_j^{(L)}(s)\big)\sqrt{E\{\delta_j^{(L)2}(s)\}}, & \text{otherwise.}
\end{cases} \qquad (9)

Here,

\mathrm{sgn}(x) = \begin{cases} +1, & \text{if } x \ge 0, \\ -1, & \text{otherwise.} \end{cases}

Using the limited error signal, we can prevent unstable learning, since the maximum value of the weight change in the output layer is

|\Delta w_{ji}^{(L)}(s)| = 3\,\eta(s)\,|x_i^{(L-1)}|\sqrt{E\{\delta_j^{(L)2}(s)\}} = 3\,\eta\,|x_i^{(L-1)}|\sqrt{E\{(t_j(s) - x_j^{(L)}(s))^2\}}.

In the hidden layers, the back-propagated error signal also prevents drastic changes of weights.

If \delta_j^{(L)}(s) is Gaussian with zero mean, \sigma \approx \sqrt{E\{\delta_j^{(L)2}(s)\}} and 99.7% of \delta_j^{(L)}(s) will lie within \pm 3\sigma. Thus, the limited error signal (9) will change only a small portion of \delta_j^{(L)}. Although \delta_j^{(L)}(s) is not Gaussian in real problems, we will show in the simulation section that E\{\delta_j^{(L)}(s)\} \approx 0 and the major portion of \delta_j^{(L)} lies within \pm 3\sqrt{E\{\delta_j^{(L)2}(s)\}}.

Fig. 2. The simulation results for training patterns.

Fig. 3. The simulation results for test patterns.
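The pieces above can be sketched compactly. All names are hypothetical, and the closed form of the nCE error signal, delta = t^(n+1) (t - x)^n / 2^(n-1), is our assumption, chosen so that n = 1 reduces to the CE signal t - x:

```python
import math

def nce_delta(t, x, n):
    # Assumed nCE output-layer error signal: strong near the wrong extreme,
    # weak near the desired value; n = 1 gives the CE signal t - x.
    return (t ** (n + 1)) * ((t - x) ** n) / (2 ** (n - 1))

def adaptive_eta(eta0, targets, outputs, deltas):
    # Rescale the learning rate each epoch so that E{(eta_s * delta)^2}
    # matches the CE-method intensity eta0^2 * E{(t - x)^2}; in the paper
    # the expectations are estimated from the previous epoch's statistics.
    e_err = sum((t - x) ** 2 for t, x in zip(targets, outputs)) / len(targets)
    e_del = sum(d * d for d in deltas) / len(deltas)
    return eta0 * math.sqrt(e_err / e_del)

def limit_delta(delta, e_del):
    # Clip the error signal at +/- 3 * sqrt(E{delta^2}) so that the large
    # adaptive rate cannot cause drastic weight changes.
    bound = 3.0 * math.sqrt(e_del)
    return max(-bound, min(bound, delta))
```

Note that for every n the signal at the wrong extreme (x = -t) has magnitude 2, the same maximum as the CE signal, while for n >= 2 it vanishes faster near the desired value, matching the behavior sketched in Fig. 1.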

IV. SIMULATION

A handwritten digit recognition problem is used to verify the effectiveness of the adaptive learning rate and the limited error signal. A total of 18,468 handwritten digit images from the CEDAR database [7] are used for training after size normalization. A digit image consists of 12×12 pixels, and each pixel takes on integer values from 0 to 15. The MLP consists of 144 inputs, 30 hidden nodes, and 10 output nodes. The initial weights are drawn at random from a uniform distribution on [-1×10^{-4}, +1×10^{-4}]. Nine simulations are conducted for each order of nCE, and the results are averaged to draw the figures.

Firstly, we train MLPs using the nCE error and draw the misclassification ratio for the training patterns in Fig. 2. Since no fair comparison is possible if the learning rate is kept the same for all orders of nCE [5], we derive the learning rates \eta = 0.001 \times (n + 1) so that E\{\eta\,\delta_j^{(L)}\} has the same value in each method. Here, we assume that x_j^{(L)} has a uniform distribution on [-1, +1].
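The factor (n + 1) can be sanity-checked numerically. Under the stated uniform assumption, and taking the nCE error signal for t_j = 1 as delta = (1 - x)^n / 2^(n-1) (our assumed closed form), E{|delta|} works out to 2/(n + 1), so eta × E{|delta|} stays near 0.002 for every order:

```python
# Midpoint-rule check that eta_n * E{|delta|} is independent of n when
# eta_n = 0.001 * (n + 1). Here delta = (1 - x)^n / 2^(n-1) (t_j = 1) and
# x is uniform on [-1, 1]; analytically E{|delta|} = 2 / (n + 1).

def mean_abs_delta(n, steps=100000):
    total = 0.0
    width = 2.0 / steps
    for k in range(steps):
        x = -1.0 + (k + 0.5) * width        # midpoint of the k-th subinterval
        total += abs((1.0 - x) ** n / 2 ** (n - 1))
    return total / steps                     # approximates E{|delta|}

for n in (1, 2, 3, 5):
    eta_n = 0.001 * (n + 1)
    print(n, eta_n * mean_abs_delta(n))      # close to 0.002 for every n
```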
As shown in Fig. 2, the misclassification ratio curves with n = 2 and 3 decrease more rapidly than the one with n = 1. However, those with n \ge 4 decrease more slowly as n increases. Fig. 3 shows the misclassification ratio for the 2,213 untrained test patterns. The misclassification ratio for the test patterns with n = 1 shows poor generalization, since the CE method makes the MLP too specialized for the training patterns [6]. With increasing n up to 5, we obtain improved results for the test patterns, since the weak error signal near the desired value prevents overspecialization for training patterns. With n \ge 6, however, the very weak error signal near the desired value retards learning and the curve decreases very slowly. From these results, we can take n = 3 or 4 as an optimum order of nCE from the viewpoints of training speed and generalization performance.
To remove the performance variation on the order of nCE, we adopt the proposed method and draw the simulation results in Figs. 4 and 5. Comparing Fig. 4 with Fig. 2, which corresponds to the misclassification ratio for the training patterns, we find that the proposed method successfully decreases the learning-speed dependency on the order of nCE. Fig. 5 shows the simulation results for the test patterns. The curve with n = 1 shows poor generalization performance, since this curve corresponds to the CE method with a fixed learning rate. With n \ge 2, the curves show a better classification ratio for the

test patterns than that with the CE error. Comparing Fig. 5 with Fig. 3, we can say that the proposed method reduces the variation of generalization performance on the order of the nCE error. Thus, we can argue that the proposed method alleviates the performance variation on the order of nCE, while it maintains the effect of nCE on preventing overspecialization of the MLP for training patterns. Naturally, the proposed method also maintains the effect of nCE on reducing incorrect saturation of output nodes.

Fig. 4. Misclassification ratio for training patterns with the adaptive learning rate and the limited error signal.

Fig. 5. Misclassification ratio for test patterns with the adaptive learning rate and the limited error signal.

Finally, we estimate the probability density function and the mean of \delta_j^{(L)}(s), and the thresholding value in Eq. (9). Fig. 6 shows the estimated results at the 450th epoch. As shown in the figure, \delta_j^{(L)}(s) lies mainly within [-3\sqrt{E\{\delta_j^{(L)2}(s)\}}, +3\sqrt{E\{\delta_j^{(L)2}(s)\}}] with nearly zero mean. Thus, the limited error signal in Eq. (9) has little effect on the learning of MLPs with the nCE error function.

Fig. 6. The probability density function and limitation of \delta_j^{(L)} at the 450th epoch.

V. CONCLUSION

This paper proposed an adaptive learning rate to make the classification ratio and the learning speed insensitive to the order of the nCE error function. Additionally, we proposed a limited error signal at the output nodes to prevent unstable learning due to the adaptive learning rate.

We have demonstrated the effectiveness of the proposed method through the simulation of classifying handwritten digits in the CEDAR database [7]. In the simulation, we showed that the proposed method reduced the performance variation on the order of the nCE error, while it maintained the effect of nCE on preventing overspecialization of the MLP for training patterns.

REFERENCES

[1] D. E. Rumelhart and J. L. McClelland, Parallel Distributed Processing. MIT Press, Cambridge, MA, 1986.
[2] Y. Lee, S.-H. Oh, and M. W. Kim, An analysis of premature saturation in back-propagation learning, Neural Networks, vol. 6, pp. 719-728, 1993.
[3] J. R. Chen and P. Mars, Stepsize variation methods for accelerating the backpropagation algorithm, Proc. IJCNN, Washington, DC, USA, Jan. 15-19, 1990, vol. I, pp. 601-604.
[4] A. Rezgui and N. Tepedelenlioglu, The effect of the slope of the activation function on the back propagation algorithm, Proc. IJCNN, Washington, DC, USA, Jan. 15-19, 1990, vol. I, pp. 707-710.
[5] A. van Ooyen and B. Nienhuis, Improving the convergence of the back-propagation algorithm, Neural Networks, vol. 5, pp. 465-471, 1992.
[6] S.-H. Oh, Improving the error backpropagation algorithm with a modified error function, IEEE Trans. Neural Networks, vol. 8, no. 3, pp. 799-803, 1997.
[7] J. J. Hull, A database for handwritten text recognition research, IEEE Trans. Pattern Anal. Mach. Intell., vol. 16, no. 5, pp. 550-554, May 1994.
[8] J. B. Hampshire II and A. H. Waibel, A novel objective function for improved phoneme recognition using time-delay neural networks, IEEE Trans. Neural Networks, vol. 1, pp. 216-228, June 1990.

