Peter Sadowski
Department of Computer Science
University of California Irvine
Irvine, CA 92697
peter.j.sadowski@uci.edu
Abstract
This document derives backpropagation for some common error functions and
describes some other tricks.
[Figure: a three-layer network. Input units x_k feed hidden units x_j, which feed output units x_i; each output x_i is compared with a target t_i.]
The cross-entropy error for a single example with n_{out} independent targets is

E = -\sum_{i=1}^{n_{out}} \big( t_i \log(x_i) + (1 - t_i) \log(1 - x_i) \big),    (1)

where t_i is the target and x_i is the output of unit i. The activation function is the logistic function applied to the weighted sum of the neuron's inputs,

x_i = \frac{1}{1 + e^{-s_i}},    (2)

s_i = \sum_j x_j w_{ji}.    (3)
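As a concrete illustration of equations (1)-(3), the following minimal NumPy sketch computes the forward pass for one layer of logistic units and the resulting cross-entropy error. The names (forward_logistic, cross_entropy, x_prev, w, t) are chosen here for illustration and are not from the text.

import numpy as np

def forward_logistic(x_prev, w):
    """Weighted sum s_i = sum_j x_j w_ji, then logistic activation (eqs. 2-3)."""
    s = x_prev @ w                       # one entry per output unit i
    x = 1.0 / (1.0 + np.exp(-s))
    return x, s

def cross_entropy(x, t):
    """Summed cross-entropy over n_out independent targets (eq. 1)."""
    return -np.sum(t * np.log(x) + (1 - t) * np.log(1 - x))

# tiny example: 3 hidden activations feeding 2 output units
x_prev = np.array([0.5, 0.1, 0.9])
w = np.random.randn(3, 2) * 0.1          # w[j, i] corresponds to w_ji
t = np.array([1.0, 0.0])
x, s = forward_logistic(x_prev, w)
E = cross_entropy(x, t)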
The backprop algorithm is simply the chain rule applied to the neurons in each layer. The first step of the algorithm is to calculate the gradient of the training error with respect to the output layer weights, \partial E / \partial w_{ji}:

\frac{\partial E}{\partial w_{ji}} = \frac{\partial E}{\partial x_i} \frac{\partial x_i}{\partial s_i} \frac{\partial s_i}{\partial w_{ji}}    (4)
We can compute each factor as

\frac{\partial E}{\partial x_i} = \frac{-t_i}{x_i} + \frac{1 - t_i}{1 - x_i}    (5)

= \frac{x_i - t_i}{x_i (1 - x_i)},    (6)

\frac{\partial x_i}{\partial s_i} = x_i (1 - x_i),    (7)

\frac{\partial s_i}{\partial w_{ji}} = x_j,    (8)
where x_j is the activation of the jth node in the hidden layer. Combining these factors,
\frac{\partial E}{\partial s_i} = x_i - t_i    (9)

and

\frac{\partial E}{\partial w_{ji}} = (x_i - t_i)\, x_j.    (10)
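Equation (10) is easy to check numerically. The sketch below continues the hypothetical NumPy setup above and compares the analytic gradient (x_i - t_i) x_j with a finite-difference estimate; the helper names are illustrative, not from the text.

def analytic_grad(x_prev, w, t):
    """dE/dw_ji = (x_i - t_i) * x_j for logistic outputs with cross-entropy (eq. 10)."""
    x, _ = forward_logistic(x_prev, w)
    return np.outer(x_prev, x - t)       # shape (j, i), matching w

def numeric_grad(x_prev, w, t, eps=1e-6):
    """Central finite differences of E with respect to each weight."""
    g = np.zeros_like(w)
    for j in range(w.shape[0]):
        for i in range(w.shape[1]):
            w_plus, w_minus = w.copy(), w.copy()
            w_plus[j, i] += eps
            w_minus[j, i] -= eps
            E_plus = cross_entropy(forward_logistic(x_prev, w_plus)[0], t)
            E_minus = cross_entropy(forward_logistic(x_prev, w_minus)[0], t)
            g[j, i] = (E_plus - E_minus) / (2 * eps)
    return g

# the two gradients should agree to roughly 1e-8
print(np.max(np.abs(analytic_grad(x_prev, w, t) - numeric_grad(x_prev, w, t))))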
This gives us the gradient for the weights in the last layer of the network. We now need to calculate the error gradient for the weights of the lower layers. Here it is useful to calculate the quantity \partial E / \partial s_j, where j indexes the units in the second layer down.
\frac{\partial E}{\partial s_j} = \sum_{i=1}^{n_{out}} \frac{\partial E}{\partial s_i} \frac{\partial s_i}{\partial x_j} \frac{\partial x_j}{\partial s_j}    (11)

= \sum_{i=1}^{n_{out}} (x_i - t_i)(w_{ji})\big(x_j (1 - x_j)\big)    (12)

Similarly, the gradient with respect to a hidden activation is

\frac{\partial E}{\partial x_j} = \sum_{i=1}^{n_{out}} \frac{\partial E}{\partial x_i} \frac{\partial x_i}{\partial s_i} \frac{\partial s_i}{\partial x_j}    (13)

= \sum_i \frac{\partial E}{\partial x_i}\, x_i (1 - x_i)\, w_{ji}.    (14)
Then a weight w_{kj} connecting units in the second and third layers down has gradient

\frac{\partial E}{\partial w_{kj}} = \frac{\partial E}{\partial s_j} \frac{\partial s_j}{\partial w_{kj}}    (15)

= \sum_{i=1}^{n_{out}} (x_i - t_i)(w_{ji})\big(x_j (1 - x_j)\big)(x_k).    (16)
In conclusion, to compute \partial E / \partial w_{kj} for a general multilayer network, we simply need to compute \partial E / \partial s_j recursively for each layer, then multiply by \partial s_j / \partial w_{kj} = x_k.
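This recursion is usually written in terms of the quantities \delta_i = \partial E / \partial s_i. The sketch below, again hypothetical NumPy code continuing the earlier setup (it reuses forward_logistic), propagates the deltas backwards through a single hidden layer and forms each weight gradient as an outer product with that layer's inputs.

def backprop_two_layer(x_in, w1, w2, t):
    """Gradients for a net: inputs x_k -> hidden x_j (logistic) -> outputs x_i (logistic),
    trained with the summed cross-entropy of eq. (1)."""
    x_hidden, _ = forward_logistic(x_in, w1)
    x_out, _ = forward_logistic(x_hidden, w2)

    delta_out = x_out - t                                   # dE/ds_i, eq. (9)
    grad_w2 = np.outer(x_hidden, delta_out)                 # dE/dw_ji, eq. (10)

    delta_hidden = (delta_out @ w2.T) * x_hidden * (1 - x_hidden)   # dE/ds_j, eq. (12)
    grad_w1 = np.outer(x_in, delta_hidden)                  # dE/dw_kj, eq. (16)
    return grad_w1, grad_w2

# example call with randomly chosen shapes: 2 inputs, 3 hidden units, 2 outputs
g1, g2 = backprop_two_layer(np.array([0.3, 0.8]),
                            np.random.randn(2, 3) * 0.1,
                            np.random.randn(3, 2) * 0.1,
                            np.array([1.0, 0.0]))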
For a classification problem in which exactly one of n_{class} classes is correct (so the targets t_i sum to one), the output layer instead uses the softmax activation, and the error is

x_i = \frac{e^{s_i}}{\sum_{c=1}^{n_{class}} e^{s_c}}    (17)

E = -\sum_{i=1}^{n_{class}} t_i \log(x_i).    (18)
Using \partial E / \partial x_k = -t_k / x_k together with the softmax derivatives \partial x_i / \partial s_i = x_i (1 - x_i) and \partial x_k / \partial s_i = -x_k x_i for k \neq i, the gradient with respect to the linear outputs is

\frac{\partial E}{\partial s_i} = \sum_{k=1}^{n_{class}} \frac{\partial E}{\partial x_k} \frac{\partial x_k}{\partial s_i}    (22)

= \frac{\partial E}{\partial x_i} \frac{\partial x_i}{\partial s_i} + \sum_{k \neq i} \frac{\partial E}{\partial x_k} \frac{\partial x_k}{\partial s_i}    (23)

= -t_i (1 - x_i) + \sum_{k \neq i} t_k x_i    (24)

= -t_i + x_i \sum_k t_k    (25)

= x_i - t_i,    (26)

where the last step uses \sum_k t_k = 1.
Notice that this gradient has the same form as in the summed cross-entropy case of equation (9), but its value differs because the softmax activations x_i take on different values than the logistic ones.
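The identity in equation (26) is again easy to verify numerically. The following sketch (hypothetical NumPy code, not from the text) compares the analytic gradient x - t of the softmax cross-entropy with a finite-difference estimate with respect to the linear outputs s.

import numpy as np

def softmax(s):
    """Softmax activation of eq. (17), shifted by max(s) for numerical stability."""
    e = np.exp(s - np.max(s))
    return e / np.sum(e)

def softmax_cross_entropy(s, t):
    """Cross-entropy error of eq. (18) for one example."""
    return -np.sum(t * np.log(softmax(s)))

s = np.array([0.2, -1.0, 0.7])
t = np.array([0.0, 1.0, 0.0])            # one-hot target, sum(t) == 1

analytic = softmax(s) - t                 # eq. (26)
numeric = np.zeros_like(s)
eps = 1e-6
for i in range(len(s)):
    d = np.zeros_like(s)
    d[i] = eps
    numeric[i] = (softmax_cross_entropy(s + d, t) - softmax_cross_entropy(s - d, t)) / (2 * eps)
print(np.max(np.abs(analytic - numeric)))  # should be ~1e-10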
For a single output neuron with logistic activation o = 1/(1 + e^{-x}), where x is the weighted sum of its inputs and t is the target, the cross-entropy error is given by

E = -\big( t \log o + (1 - t) \log(1 - o) \big)    (32)

= -\left( t \log\frac{o}{1 - o} + \log(1 - o) \right)    (33)

= -\left( t \log\frac{\frac{1}{1+e^{-x}}}{1 - \frac{1}{1+e^{-x}}} + \log\Big(1 - \frac{1}{1+e^{-x}}\Big) \right)    (34)

= -\left( t x + \log\frac{1}{1 + e^{x}} \right)    (35)

= -t x + \log(1 + e^{x}).    (36)
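Equation (36) expresses the error directly in terms of the pre-activation x, which is convenient numerically because 1 - o can underflow when the error is computed from the logistic output. The small sketch below (hypothetical NumPy code, not from the text) compares the two ways of computing the same error value.

import numpy as np

def cross_entropy_from_output(o, t):
    """E = -(t log o + (1-t) log(1-o)), eq. (32); breaks down when o is very close to 0 or 1."""
    return -(t * np.log(o) + (1 - t) * np.log(1 - o))

def cross_entropy_from_logit(x, t):
    """E = -t*x + log(1 + e^x), eq. (36); log1p(exp(x)) is well behaved for x << 0."""
    return -t * x + np.log1p(np.exp(x))

x = 3.0
t = 1.0
o = 1.0 / (1.0 + np.exp(-x))
print(cross_entropy_from_output(o, t), cross_entropy_from_logit(x, t))  # same value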
Also note that in the softmax calculation, the same constant can be added to every linear output s_c of an example with no effect on the error function, since the softmax is unchanged by a common shift of its inputs.
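This invariance is what permits the max-subtraction trick used inside the softmax function of the sketch above; reusing that function, for example:

s = np.array([2.0, -0.5, 1.0])
c = 100.0
print(np.allclose(softmax(s), softmax(s + c)))   # True: softmax(s + const) == softmax(s)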