
Notes on Backpropagation

Peter Sadowski
Department of Computer Science
University of California Irvine
Irvine, CA 92697
peter.j.sadowski@uci.edu

Abstract
This document derives backpropagation for some common error functions and
describes some other tricks.

1 Cross Entropy Error with Logistic Activation


For classification problems with two classes, the standard neural network architecture has a single output unit that provides the predicted probability of being in one class rather than the other. The logistic activation function combined with the cross-entropy loss function gives us a nice probabilistic interpretation (as opposed to a sum-of-squares loss). We can generalize this to the case of multiple, independent, two-class outputs; we simply sum the log-likelihood terms of the independent targets.

(Figure: a three-layer feed-forward network with input units $x_k$, hidden units $x_j$, and output units $x_i$ with corresponding targets $t_i$.)

The cross entropy error for a single example with $n_{out}$ independent targets is
\begin{align}
E = -\sum_{i=1}^{n_{out}} \big( t_i \log(x_i) + (1 - t_i) \log(1 - x_i) \big) \tag{1}
\end{align}
where $t_i$ is the target and $x_i$ is the output, indexed by $i$. The activation function is the logistic function applied to the weighted sum of the neuron's inputs,
\begin{align}
x_i &= \frac{1}{1 + e^{-s_i}} \tag{2}\\
s_i &= \sum_{j} x_j w_{ji} . \tag{3}
\end{align}
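As a concrete reference for this notation, the following is a minimal NumPy sketch (not part of the original notes) of the forward pass and the cross-entropy error for a network with one hidden layer of logistic units; the array names W_hid, W_out, and h are illustrative choices.

\begin{verbatim}
import numpy as np

def logistic(s):
    """Logistic activation, Eq. (2)."""
    return 1.0 / (1.0 + np.exp(-s))

def forward(x_in, W_hid, W_out):
    """Forward pass through one hidden layer of logistic units.

    W_hid has shape (n_hid, n_in) and W_out has shape (n_out, n_hid),
    so W_out[i, j] plays the role of w_ji in Eq. (3).
    """
    h = logistic(W_hid @ x_in)   # hidden activations x_j
    x = logistic(W_out @ h)      # output activations x_i
    return h, x

def cross_entropy(x, t):
    """Summed cross-entropy over independent two-class outputs, Eq. (1)."""
    return -np.sum(t * np.log(x) + (1.0 - t) * np.log(1.0 - x))
\end{verbatim}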

The backprop algorithm is simply the chain rule applied to the neurons in each layer. The first step of the algorithm is to calculate the gradient of the training error with respect to the output layer weights, $\frac{\partial E}{\partial w_{ji}}$:
\begin{align}
\frac{\partial E}{\partial w_{ji}} = \frac{\partial E}{\partial x_i} \frac{\partial x_i}{\partial s_i} \frac{\partial s_i}{\partial w_{ji}} \tag{4}
\end{align}

We can compute each factor as
\begin{align}
\frac{\partial E}{\partial x_i} &= \frac{-t_i}{x_i} + \frac{1 - t_i}{1 - x_i} \tag{5}\\
&= \frac{x_i - t_i}{x_i (1 - x_i)} , \tag{6}\\
\frac{\partial x_i}{\partial s_i} &= x_i (1 - x_i) \tag{7}\\
\frac{\partial s_i}{\partial w_{ji}} &= x_j \tag{8}
\end{align}

where $x_j$ is the activation of the $j$th node in the hidden layer. Combining things back together,
\begin{align}
\frac{\partial E}{\partial s_i} = x_i - t_i \tag{9}
\end{align}
and
\begin{align}
\frac{\partial E}{\partial w_{ji}} = (x_i - t_i) \, x_j . \tag{10}
\end{align}
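With the same illustrative naming as above, Eq. (10) is an outer product of the output error with the hidden activations; this is a sketch, not code from the original notes.

\begin{verbatim}
import numpy as np

def output_layer_gradient(x, t, h):
    """Gradient of E with respect to the output weights, Eq. (10).

    x: output activations, shape (n_out,)
    t: targets, shape (n_out,)
    h: hidden activations x_j, shape (n_hid,)
    Returns an (n_out, n_hid) array whose [i, j] entry is (x_i - t_i) * x_j.
    """
    delta_out = x - t              # dE/ds_i, Eq. (9)
    return np.outer(delta_out, h)  # dE/dw_ji, Eq. (10)
\end{verbatim}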
This gives us the gradient for the weights in the last layer of the network. We now need to calculate the error gradient for the weights of the lower layers. Here it is useful to calculate the quantity $\frac{\partial E}{\partial s_j}$, where $j$ indexes the units in the second layer down.

\begin{align}
\frac{\partial E}{\partial s_j} &= \sum_{i=1}^{n_{out}} \frac{\partial E}{\partial s_i} \frac{\partial s_i}{\partial x_j} \frac{\partial x_j}{\partial s_j} \tag{11}\\
&= \sum_{i=1}^{n_{out}} (x_i - t_i)(w_{ji})(x_j(1 - x_j)) \tag{12}
\end{align}
The related gradient with respect to the hidden activation itself is
\begin{align}
\frac{\partial E}{\partial x_j} &= \sum_{i=1}^{n_{out}} \frac{\partial E}{\partial x_i} \frac{\partial x_i}{\partial s_i} \frac{\partial s_i}{\partial x_j} \tag{13}\\
&= \sum_{i} \frac{\partial E}{\partial x_i} \, x_i(1 - x_i) \, w_{ji} \tag{14}
\end{align}
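In vector form, Eq. (12) is a matrix-vector product followed by an elementwise product. A one-line sketch, assuming a weight matrix W_out of shape (n_out, n_hid) storing $w_{ji}$ at index [i, j], hidden activations h, and the output error delta_out $= x - t$ from Eq. (9):

\begin{verbatim}
# dE/ds_j, Eq. (12): back-project the output error through the weights
# and scale by the derivative of the logistic hidden activations.
delta_hid = (W_out.T @ delta_out) * h * (1.0 - h)
\end{verbatim}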

Then a weight $w_{kj}$ connecting units in the second and third layers down has gradient
\begin{align}
\frac{\partial E}{\partial w_{kj}} &= \frac{\partial E}{\partial s_j} \frac{\partial s_j}{\partial w_{kj}} \tag{15}\\
&= \sum_{i=1}^{n_{out}} (x_i - t_i)(w_{ji})(x_j(1 - x_j))(x_k) \tag{16}
\end{align}

In conclusion, to compute $\frac{\partial E}{\partial w_{kj}}$ for a general multilayer network, we simply need to compute $\frac{\partial E}{\partial s_j}$ recursively, then multiply by $\frac{\partial s_j}{\partial w_{kj}} = x_k$.
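Putting the pieces together, the following sketch backpropagates through the two weight layers of the network drawn earlier and checks one gradient entry against a finite-difference estimate. The shapes, names, and random test values are illustrative assumptions, not part of the original notes.

\begin{verbatim}
import numpy as np

def logistic(s):
    return 1.0 / (1.0 + np.exp(-s))

def backprop(x_in, t, W_hid, W_out):
    """Gradients of the summed cross-entropy error, Eq. (1), for one example."""
    h = logistic(W_hid @ x_in)                          # hidden activations x_j
    x = logistic(W_out @ h)                             # output activations x_i
    delta_out = x - t                                   # dE/ds_i, Eq. (9)
    grad_W_out = np.outer(delta_out, h)                 # dE/dw_ji, Eq. (10)
    delta_hid = (W_out.T @ delta_out) * h * (1.0 - h)   # dE/ds_j, Eq. (12)
    grad_W_hid = np.outer(delta_hid, x_in)              # dE/dw_kj, Eq. (16)
    return grad_W_out, grad_W_hid

def error(x_in, t, W_hid, W_out):
    x = logistic(W_out @ logistic(W_hid @ x_in))
    return -np.sum(t * np.log(x) + (1 - t) * np.log(1 - x))

# Finite-difference check of a single hidden-layer weight.
rng = np.random.default_rng(0)
x_in = rng.normal(size=4)
t = np.array([1.0, 0.0])
W_hid = 0.1 * rng.normal(size=(3, 4))
W_out = 0.1 * rng.normal(size=(2, 3))

_, grad_W_hid = backprop(x_in, t, W_hid, W_out)
eps = 1e-6
W_pert = W_hid.copy()
W_pert[0, 0] += eps
numeric = (error(x_in, t, W_pert, W_out) - error(x_in, t, W_hid, W_out)) / eps
print(grad_W_hid[0, 0], numeric)   # the two values should agree closely
\end{verbatim}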

2 Classification with Softmax Transfer and Cross Entropy Error


For classification problems with more than 2 classes, the softmax output layer provides a way of
assigning probabilities to each class. The cross-entropy error function is modified, but it turns out
to have the same gradient as for the case of summed cross-entropy on logistic outputs. The softmax
activation of the ith output unit is

\begin{align}
x_i = \frac{e^{s_i}}{\sum_{c=1}^{n_{class}} e^{s_c}} \tag{17}
\end{align}

and the cross entropy error function for multi-class output is
\begin{align}
E = -\sum_{i=1}^{n_{class}} t_i \log(x_i) . \tag{18}
\end{align}
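A direct transcription of Eqs. (17) and (18) into NumPy might look like the following sketch; the function names are illustrative, and a numerically safer form of the same computation is given in Section 3.

\begin{verbatim}
import numpy as np

def softmax(s):
    """Softmax activation, Eq. (17)."""
    e = np.exp(s)
    return e / np.sum(e)

def softmax_cross_entropy(s, t):
    """Multi-class cross-entropy error, Eq. (18), for one example."""
    return -np.sum(t * np.log(softmax(s)))
\end{verbatim}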

Computing the gradient yields
\begin{align}
\frac{\partial E}{\partial x_i} &= -\frac{t_i}{x_i} \tag{19}\\
\frac{\partial x_i}{\partial s_k} &=
\begin{cases}
\dfrac{e^{s_i}}{\sum_{c}^{n_{class}} e^{s_c}} - \left( \dfrac{e^{s_i}}{\sum_{c}^{n_{class}} e^{s_c}} \right)^{2} & i = k \\[2ex]
-\dfrac{e^{s_i} e^{s_k}}{\left( \sum_{c}^{n_{class}} e^{s_c} \right)^{2}} & i \neq k
\end{cases} \tag{20}\\[1ex]
&=
\begin{cases}
x_i (1 - x_i) & i = k \\
-x_i x_k & i \neq k
\end{cases} \tag{21}
\end{align}
\begin{align}
\frac{\partial E}{\partial s_i} &= \sum_{k}^{n_{class}} \frac{\partial E}{\partial x_k} \frac{\partial x_k}{\partial s_i} \tag{22}\\
&= \frac{\partial E}{\partial x_i} \frac{\partial x_i}{\partial s_i} + \sum_{k \neq i} \frac{\partial E}{\partial x_k} \frac{\partial x_k}{\partial s_i} \tag{23}\\
&= -t_i (1 - x_i) + \sum_{k \neq i} t_k x_i \tag{24}\\
&= -t_i + x_i \sum_{k} t_k \tag{25}\\
&= x_i - t_i , \tag{26}
\end{align}
where the last step uses the fact that the targets sum to one, $\sum_k t_k = 1$.
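The result in Eq. (26) is easy to sanity-check numerically: perturbing each input $s_i$ and differencing the error should reproduce $x - t$. A sketch under the same illustrative naming as above:

\begin{verbatim}
import numpy as np

def softmax(s):
    e = np.exp(s)
    return e / np.sum(e)

rng = np.random.default_rng(1)
s = rng.normal(size=5)         # softmax inputs s_i
t = np.zeros(5)
t[2] = 1.0                     # one-hot target

def E(s):
    return -np.sum(t * np.log(softmax(s)))

eps = 1e-6
numeric = np.array([(E(s + eps * np.eye(5)[k]) - E(s)) / eps for k in range(5)])
print(np.allclose(numeric, softmax(s) - t, atol=1e-4))   # expect True
\end{verbatim}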

The gradient for the weights in the top layer is thus
\begin{align}
\frac{\partial E}{\partial w_{ji}} &= \frac{\partial E}{\partial s_i} \frac{\partial s_i}{\partial w_{ji}} \tag{28}\\
&= (x_i - t_i) \, x_j \tag{29}
\end{align}

and for units in the second lowest layer, indexed by $j$,
\begin{align}
\frac{\partial E}{\partial s_j} &= \sum_{i=1}^{n_{class}} \frac{\partial E}{\partial s_i} \frac{\partial s_i}{\partial x_j} \frac{\partial x_j}{\partial s_j} \tag{30}\\
&= \sum_{i=1}^{n_{class}} (x_i - t_i)(w_{ji})(x_j(1 - x_j)) \tag{31}
\end{align}

Notice that this gradient has the same formula as in the summed cross-entropy case of Section 1, but its value differs because the activations $x_i$ are now computed with the softmax rather than the logistic function.

3 Algebraic Trick for Cross-Entropy Calculations


We can save some computation when doing cross-entropy error calculations, often an expensive part
of training a neural network.

For a single output neuron with logistic activation, the cross-entropy error is given by
\begin{align}
E &= -\big( t \log o + (1 - t) \log (1 - o) \big) \tag{32}\\
&= -\left( t \log\!\left( \frac{o}{1 - o} \right) + \log(1 - o) \right) \tag{33}\\
&= -\left( t \log\!\left( \frac{\frac{1}{1 + e^{-x}}}{1 - \frac{1}{1 + e^{-x}}} \right) + \log\!\left( 1 - \frac{1}{1 + e^{-x}} \right) \right) \tag{34}\\
&= -\left( t x + \log\!\left( \frac{1}{1 + e^{x}} \right) \right) \tag{35}\\
&= -t x + \log\left( 1 + e^{x} \right) \tag{36}
\end{align}
where $o = \frac{1}{1 + e^{-x}}$ is the output of the logistic unit and $x$ here denotes its total input.
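Equation (36) is what makes the trick useful in code: the error can be computed directly from the total input $x$ without forming the logistic output, and the $\log(1 + e^{x})$ term can be evaluated without overflow (in NumPy, via np.logaddexp). A sketch, assuming a single output unit:

\begin{verbatim}
import numpy as np

def logistic_xent_naive(x, t):
    """Eq. (32): form the output o first, then the cross-entropy."""
    o = 1.0 / (1.0 + np.exp(-x))
    return -(t * np.log(o) + (1 - t) * np.log(1 - o))

def logistic_xent_from_input(x, t):
    """Eq. (36): E = -t*x + log(1 + exp(x)), computed stably."""
    return -t * x + np.logaddexp(0.0, x)

print(logistic_xent_naive(3.0, 1.0), logistic_xent_from_input(3.0, 1.0))
# For large |x| the naive form hits log(0) or overflow; Eq. (36) does not.
print(logistic_xent_from_input(1000.0, 0.0))
\end{verbatim}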

For a softmax output, the cross-entropy error is given by
\begin{align}
E &= -\sum_{i} t_i \log\!\left( \frac{e^{x_i}}{\sum_{j} e^{x_j}} \right) \tag{37}\\
&= -\sum_{i} t_i \left( x_i - \log \sum_{j} e^{x_j} \right) \tag{38}
\end{align}
Also note that in this softmax calculation, a constant can be added to each of the inputs $x_j$ (to each row of the outputs, when processing a batch) with no effect on the error function, since the constant cancels between the $x_i$ term and the $\log \sum_j e^{x_j}$ term.
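This invariance is the basis of the usual numerically stable implementation: subtract the maximum input before exponentiating. A sketch with illustrative naming:

\begin{verbatim}
import numpy as np

def softmax_xent_stable(x, t):
    """Eq. (38) with the maximum subtracted from the inputs.

    x: inputs to the softmax, shape (n_class,)
    t: one-hot targets, shape (n_class,)
    Subtracting max(x) leaves the error unchanged but keeps the
    exponentials from overflowing.
    """
    z = x - np.max(x)
    return -np.sum(t * (z - np.log(np.sum(np.exp(z)))))

x = np.array([1000.0, 0.0, -1000.0])   # a naive exp(x) would overflow
t = np.array([1.0, 0.0, 0.0])
print(softmax_xent_stable(x, t))       # finite and close to 0
\end{verbatim}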
