Peter Sadowski
Department of Computer Science
University of California Irvine
Irvine, CA 92697
peter.j.sadowski@uci.edu
Abstract
This document derives backpropagation for some common error functions and
describes some other tricks.
[Figure: a three-layer network. Input units x_k feed hidden units x_j, which feed output units x_i; each output x_i is compared with a target t_i.]
The cross-entropy error for a single example with n_{out} independent targets is

E = -\sum_{i=1}^{n_{out}} \big( t_i \log(x_i) + (1 - t_i) \log(1 - x_i) \big),    (1)

where t_i is the target and x_i is the output of unit i. The activation function is the logistic function applied to the weighted sum of the neuron's inputs,

x_i = \frac{1}{1 + e^{-s_i}},    (2)

s_i = \sum_j x_j w_{ji}.    (3)
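As a concrete illustration of equations (1)-(3), the following minimal NumPy sketch computes the forward pass for one layer of logistic units and the resulting cross-entropy error. The names (forward_logistic, cross_entropy, x_prev, w, t) are chosen here for illustration and are not from the text.

import numpy as np

def forward_logistic(x_prev, w):
    """Weighted sum s_i = sum_j x_j w_ji, then logistic activation (eqs. 2-3)."""
    s = x_prev @ w                       # one entry per output unit i
    x = 1.0 / (1.0 + np.exp(-s))
    return x, s

def cross_entropy(x, t):
    """Summed cross-entropy over n_out independent targets (eq. 1)."""
    return -np.sum(t * np.log(x) + (1 - t) * np.log(1 - x))

# tiny example: 3 hidden activations feeding 2 output units
x_prev = np.array([0.5, 0.1, 0.9])
w = np.random.randn(3, 2) * 0.1          # w[j, i] corresponds to w_ji
t = np.array([1.0, 0.0])
x, s = forward_logistic(x_prev, w)
E = cross_entropy(x, t)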
The backprop algorithm is simply the chain rule applied to the neurons in each layer. The first step of the algorithm is to calculate the gradient of the training error with respect to the output layer weights, \partial E / \partial w_{ji}:

\frac{\partial E}{\partial w_{ji}} = \frac{\partial E}{\partial x_i} \frac{\partial x_i}{\partial s_i} \frac{\partial s_i}{\partial w_{ji}}    (4)
We can compute each factor as

\frac{\partial E}{\partial x_i} = \frac{-t_i}{x_i} + \frac{1 - t_i}{1 - x_i}    (5)

= \frac{x_i - t_i}{x_i (1 - x_i)},    (6)

\frac{\partial x_i}{\partial s_i} = x_i (1 - x_i),    (7)

\frac{\partial s_i}{\partial w_{ji}} = x_j,    (8)
where x_j is the activation of the jth node in the hidden layer. Combining these factors,
\frac{\partial E}{\partial s_i} = x_i - t_i    (9)

and

\frac{\partial E}{\partial w_{ji}} = (x_i - t_i)\, x_j.    (10)
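Equation (10) is easy to check numerically. The sketch below continues the hypothetical NumPy setup above and compares the analytic gradient (x_i - t_i) x_j with a finite-difference estimate; the helper names are illustrative, not from the text.

def analytic_grad(x_prev, w, t):
    """dE/dw_ji = (x_i - t_i) * x_j for logistic outputs with cross-entropy (eq. 10)."""
    x, _ = forward_logistic(x_prev, w)
    return np.outer(x_prev, x - t)       # shape (j, i), matching w

def numeric_grad(x_prev, w, t, eps=1e-6):
    """Central finite differences of E with respect to each weight."""
    g = np.zeros_like(w)
    for j in range(w.shape[0]):
        for i in range(w.shape[1]):
            w_plus, w_minus = w.copy(), w.copy()
            w_plus[j, i] += eps
            w_minus[j, i] -= eps
            E_plus = cross_entropy(forward_logistic(x_prev, w_plus)[0], t)
            E_minus = cross_entropy(forward_logistic(x_prev, w_minus)[0], t)
            g[j, i] = (E_plus - E_minus) / (2 * eps)
    return g

# the two gradients should agree to roughly 1e-8
print(np.max(np.abs(analytic_grad(x_prev, w, t) - numeric_grad(x_prev, w, t))))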
This gives us the gradient for the weights in the last layer of the network. We now need to calculate the error gradient for the weights of the lower layers. Here it is useful to calculate the quantity \partial E / \partial s_j, where j indexes the units in the second layer down.
\frac{\partial E}{\partial s_j} = \sum_{i=1}^{n_{out}} \frac{\partial E}{\partial s_i} \frac{\partial s_i}{\partial x_j} \frac{\partial x_j}{\partial s_j}    (11)

= \sum_{i=1}^{n_{out}} (x_i - t_i)(w_{ji})\big(x_j (1 - x_j)\big)    (12)

Similarly, the gradient with respect to a hidden activation is

\frac{\partial E}{\partial x_j} = \sum_{i=1}^{n_{out}} \frac{\partial E}{\partial x_i} \frac{\partial x_i}{\partial s_i} \frac{\partial s_i}{\partial x_j}    (13)

= \sum_i \frac{\partial E}{\partial x_i}\, x_i (1 - x_i)\, w_{ji}.    (14)
Then a weight w_{kj} connecting units in the second and third layers down has gradient

\frac{\partial E}{\partial w_{kj}} = \frac{\partial E}{\partial s_j} \frac{\partial s_j}{\partial w_{kj}}    (15)

= \sum_{i=1}^{n_{out}} (x_i - t_i)(w_{ji})\big(x_j (1 - x_j)\big)(x_k).    (16)
In conclusion, to compute \partial E / \partial w_{kj} for a general multilayer network, we simply need to compute \partial E / \partial s_j recursively for each layer, then multiply by \partial s_j / \partial w_{kj} = x_k.
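This recursion is usually written in terms of the quantities \delta_i = \partial E / \partial s_i. The sketch below, again hypothetical NumPy code continuing the earlier setup (it reuses forward_logistic), propagates the deltas backwards through a single hidden layer and forms each weight gradient as an outer product with that layer's inputs.

def backprop_two_layer(x_in, w1, w2, t):
    """Gradients for a net: inputs x_k -> hidden x_j (logistic) -> outputs x_i (logistic),
    trained with the summed cross-entropy of eq. (1)."""
    x_hidden, _ = forward_logistic(x_in, w1)
    x_out, _ = forward_logistic(x_hidden, w2)

    delta_out = x_out - t                                   # dE/ds_i, eq. (9)
    grad_w2 = np.outer(x_hidden, delta_out)                 # dE/dw_ji, eq. (10)

    delta_hidden = (delta_out @ w2.T) * x_hidden * (1 - x_hidden)   # dE/ds_j, eq. (12)
    grad_w1 = np.outer(x_in, delta_hidden)                  # dE/dw_kj, eq. (16)
    return grad_w1, grad_w2

# example call with randomly chosen shapes: 2 inputs, 3 hidden units, 2 outputs
g1, g2 = backprop_two_layer(np.array([0.3, 0.8]),
                            np.random.randn(2, 3) * 0.1,
                            np.random.randn(3, 2) * 0.1,
                            np.array([1.0, 0.0]))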
For a classification problem in which exactly one of n_{class} classes is correct (so the targets t_i sum to one), the output layer instead uses the softmax activation, and the error is

x_i = \frac{e^{s_i}}{\sum_{c=1}^{n_{class}} e^{s_c}}    (17)

E = -\sum_{i=1}^{n_{class}} t_i \log(x_i).    (18)
Using \partial E / \partial x_k = -t_k / x_k together with the softmax derivatives \partial x_i / \partial s_i = x_i (1 - x_i) and \partial x_k / \partial s_i = -x_k x_i for k \neq i, the gradient with respect to the linear outputs is

\frac{\partial E}{\partial s_i} = \sum_{k=1}^{n_{class}} \frac{\partial E}{\partial x_k} \frac{\partial x_k}{\partial s_i}    (22)

= \frac{\partial E}{\partial x_i} \frac{\partial x_i}{\partial s_i} + \sum_{k \neq i} \frac{\partial E}{\partial x_k} \frac{\partial x_k}{\partial s_i}    (23)

= -t_i (1 - x_i) + \sum_{k \neq i} t_k x_i    (24)

= -t_i + x_i \sum_k t_k    (25)

= x_i - t_i,    (26)

where the last step uses \sum_k t_k = 1.
Notice that this gradient has the same form as in the summed cross-entropy case of equation (9), but its value differs because the softmax activations x_i take on different values than the logistic ones.
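The identity in equation (26) is again easy to verify numerically. The following sketch (hypothetical NumPy code, not from the text) compares the analytic gradient x - t of the softmax cross-entropy with a finite-difference estimate with respect to the linear outputs s.

import numpy as np

def softmax(s):
    """Softmax activation of eq. (17), shifted by max(s) for numerical stability."""
    e = np.exp(s - np.max(s))
    return e / np.sum(e)

def softmax_cross_entropy(s, t):
    """Cross-entropy error of eq. (18) for one example."""
    return -np.sum(t * np.log(softmax(s)))

s = np.array([0.2, -1.0, 0.7])
t = np.array([0.0, 1.0, 0.0])            # one-hot target, sum(t) == 1

analytic = softmax(s) - t                 # eq. (26)
numeric = np.zeros_like(s)
eps = 1e-6
for i in range(len(s)):
    d = np.zeros_like(s)
    d[i] = eps
    numeric[i] = (softmax_cross_entropy(s + d, t) - softmax_cross_entropy(s - d, t)) / (2 * eps)
print(np.max(np.abs(analytic - numeric)))  # should be ~1e-10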
For a single output neuron with logistic activation o = 1/(1 + e^{-x}), where x is the weighted sum of its inputs and t is the target, the cross-entropy error is given by

E = -\big( t \log o + (1 - t) \log(1 - o) \big)    (32)

= -\left( t \log\frac{o}{1 - o} + \log(1 - o) \right)    (33)

= -\left( t \log\frac{\frac{1}{1+e^{-x}}}{1 - \frac{1}{1+e^{-x}}} + \log\Big(1 - \frac{1}{1+e^{-x}}\Big) \right)    (34)

= -\left( t x + \log\frac{1}{1 + e^{x}} \right)    (35)

= -t x + \log(1 + e^{x}).    (36)
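Equation (36) expresses the error directly in terms of the pre-activation x, which is convenient numerically because 1 - o can underflow when the error is computed from the logistic output. The small sketch below (hypothetical NumPy code, not from the text) compares the two ways of computing the same error value.

import numpy as np

def cross_entropy_from_output(o, t):
    """E = -(t log o + (1-t) log(1-o)), eq. (32); breaks down when o is very close to 0 or 1."""
    return -(t * np.log(o) + (1 - t) * np.log(1 - o))

def cross_entropy_from_logit(x, t):
    """E = -t*x + log(1 + e^x), eq. (36); log1p(exp(x)) is well behaved for x << 0."""
    return -t * x + np.log1p(np.exp(x))

x = 3.0
t = 1.0
o = 1.0 / (1.0 + np.exp(-x))
print(cross_entropy_from_output(o, t), cross_entropy_from_logit(x, t))  # same value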
Also note that in the softmax calculation, the same constant can be added to every linear output s_c of an example with no effect on the error function, since the softmax is unchanged by a common shift of its inputs.
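This invariance is what permits the max-subtraction trick used inside the softmax function of the sketch above; reusing that function, for example:

s = np.array([2.0, -0.5, 1.0])
c = 100.0
print(np.allclose(softmax(s), softmax(s + c)))   # True: softmax(s + const) == softmax(s)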