
The error for a single neuron with weight vector w_i, input x, target d_i, and output o_i = f(w_iᵀx) is

    E = (1/2)(d_i − o_i)² = (1/2)(d_i − f(w_iᵀx))²

Its gradient with respect to the weights is

    ∇E = −(d_i − o_i) f′(w_iᵀx) x

The components of the gradient vector are

    ∂E/∂w_ij = −(d_i − o_i) f′(w_iᵀx) x_j,    for j = 1, 2, …, n

The delta-rule update moves the weights against the gradient:

    Δw_i = −η ∇E = η (d_i − o_i) f′(net_i) x

or, writing r = [d_i − f(w_iᵀx)] f′(w_iᵀx) for the learning signal,

    Δw = η r x

so the updated weight vector is

    w² = w¹ + η (d_i − o_i) f′(net_i) x        (here d = t, the target)

Example 1

Inputs (column vectors) and targets:

    x¹ = [1 −2 0 −1]ᵀ,  x² = [0 1.5 −0.5 −1]ᵀ,  x³ = [−1 1 0.5 −1]ᵀ
    d¹ = −1,  d² = −1,  d³ = 1

Initial weights w¹ = [1 −1 0 0.5]ᵀ, learning rate η = 0.1 (the value that reproduces the update below), and the bipolar continuous activation

    f(net) = 2/(1 + e^(−net)) − 1,    f′(net) = 2e^(−net)/(1 + e^(−net))² = (1/2)(1 − o²)

Step 1:

    net¹ = w¹ᵀx¹ = [1 −1 0 0.5][1 −2 0 −1]ᵀ = 2.5
    o¹ = f(net¹) = 0.848
    f′(net¹) = (1/2)(1 − 0.848²) = 0.140
    w² = w¹ + η(d¹ − o¹) f′(net¹) x¹ = [0.974 −0.948 0 0.526]ᵀ

Step 2:

    net² = w²ᵀx² = −1.948

Complete this problem for one epoch.
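The first training step can be reproduced numerically. A minimal NumPy sketch, assuming the learning rate η = 0.1 stated above:

```python
import numpy as np

# Training pattern and target from Example 1 (bipolar targets)
x1 = np.array([1.0, -2.0, 0.0, -1.0])
d1 = -1.0
w = np.array([1.0, -1.0, 0.0, 0.5])
eta = 0.1  # assumed learning rate; it reproduces the hand calculation

def f(net):
    """Bipolar continuous activation: f(net) = 2/(1 + e^(-net)) - 1."""
    return 2.0 / (1.0 + np.exp(-net)) - 1.0

def f_prime(o):
    """Derivative expressed through the output: f'(net) = (1 - o^2)/2."""
    return 0.5 * (1.0 - o**2)

net1 = w @ x1                                  # 2.5
o1 = f(net1)                                   # ~0.848
w = w + eta * (d1 - o1) * f_prime(o1) * x1     # ~[0.974, -0.948, 0, 0.526]
print(net1, o1, w)
```

Small differences in the last digit relative to the hand calculation come from rounding intermediate values.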
Example 2

A single neuron with inputs x = [0.982 0.5 1]ᵀ (the last component is the bias input), weights w = [2 4 −3.93]ᵀ, target d = t = 1, and learning rate η = 0.1. The transfer function is unipolar continuous (logsig):

    o = 1/(1 + e^(−net)),    f′(net) = (1 − o)o

so the weight update is Δw = η(d − o)(1 − o)o x = η δ x.

First pass:

    net = 2(0.982) + 4(0.5) − 3.93(1) = 0.034
    o = 1/(1 + e^(−0.034)) ≈ 0.51
    Error = 1 − 0.51 = 0.49
    δ = (d − o)(1 − o)o = (1 − 0.51)(1 − 0.51)(0.51) = 0.1225

    w1 ← 2 + 0.1(0.1225)(0.982) = 2.012
    w2 ← 4 + 0.1(0.1225)(0.5) = 4.0061
    w3 ← −3.93 + 0.1(0.1225)(1) = −3.9178

Second pass:

    net = 2.012(0.982) + 4.0061(0.5) − 3.9178(1) = 0.061
    o = 1/(1 + e^(−0.061)) = 0.5152
    Error = 1 − 0.5152 = 0.4848
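Both passes can be checked with a short script; the tiny discrepancies in the last digit come from rounding o to 0.51 in the hand calculation:

```python
import numpy as np

# Example 2: one logsig neuron; the third input component is the bias input
x = np.array([0.982, 0.5, 1.0])
w = np.array([2.0, 4.0, -3.93])
d, eta = 1.0, 0.1

def logsig(net):
    return 1.0 / (1.0 + np.exp(-net))

net = w @ x                          # 0.034
o = logsig(net)                      # ~0.5085, rounded to 0.51 above
delta = (d - o) * (1 - o) * o        # ~0.1225
w = w + eta * delta * x              # ~[2.012, 4.0061, -3.9178]
net2 = w @ x                         # ~0.061
print(net, o, w, net2)
```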
By the chain rule:

    ∂E/∂w_ij = (∂E/∂o_i)(∂o_i/∂x_i)(∂x_i/∂w_ij)

With E = (1/2) Σ_i (d_i − o_i)²,

    ∂E/∂o_i = (1/2)(2)(d_i − o_i)(−1) = (o_i − t_i)

For the logistic output o_i = 1/(1 + e^(−x_i)),

    ∂o_i/∂x_i = ∂/∂x_i [1/(1 + e^(−x_i))]
              = −[1/(1 + e^(−x_i))²](−e^(−x_i))
              = e^(−x_i)/(1 + e^(−x_i))²
              = [(1 + e^(−x_i)) − 1]/(1 + e^(−x_i)) · 1/(1 + e^(−x_i))
              = [1 − 1/(1 + e^(−x_i))] · [1/(1 + e^(−x_i))]
              = (1 − o_i) o_i

With the net input x_i = Σ_j w_ij a_j,

    ∂x_i/∂w_ij = a_j

Combining the three factors:

    ∂E/∂w_ij = (o_i − t_i) · (1 − o_i)o_i · a_j
               raw error     due to          due to incoming
               term          sigmoid         (pre-synaptic) activation

The gradient-descent update is

    Δw_ij = −η ∂E/∂w_ij     (where η is an arbitrary learning rate)

    w_ij(t+1) = w_ij(t) + η (t_i − o_i)(1 − o_i) o_i a_j

A two layer network

Input x = [1 0]ᵀ plus a bias input of 1, target d = 1, learning rate η = 0.1. The transfer function is unipolar continuous (logsig):

    o = 1/(1 + e^(−net)),    f′(net) = (1 − o)o

Hidden neurons 3 and 4 have weights [3 4 1] and [6 5 −6]; output neuron 5 has weights [2 4 −3.93] acting on [o3 o4 1].

Forward pass:

    net3 = u3 = 3(1) + 4(0) + 1(1) = 4          o3 = 1/(1 + e^(−4)) = 0.982
    net4 = u4 = 6(1) + 5(0) − 6(1) = 0          o4 = 1/(1 + e^(0)) = 0.5
    net5 = u5 = 2(0.982) + 4(0.5) − 3.93(1) = 0.034    o5 = 1/(1 + e^(−0.034)) ≈ 0.51

Output-layer update, Δw = η(d − o)(1 − o)o x = η δ x:

    δ5 = (1 − 0.51)(1 − 0.51)(0.51) = 0.1225
    Δw53 = η δ5 o3 = 0.1(0.1225)(0.982) = 0.012
    w53 ← w53 + 0.012 = 2 + 0.012 = 2.012
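The forward pass and the single output-layer update worked above can be sketched as follows (the hidden-layer updates are not covered in this section):

```python
import numpy as np

def logsig(net):
    return 1.0 / (1.0 + np.exp(-net))

# Input pattern [x1, x2] with a constant bias input of 1
inp = np.array([1.0, 0.0, 1.0])
d, eta = 1.0, 0.1

# Hidden-layer weights (neurons 3 and 4) and output-layer weights (neuron 5)
w3 = np.array([3.0, 4.0, 1.0])
w4 = np.array([6.0, 5.0, -6.0])
w5 = np.array([2.0, 4.0, -3.93])     # acts on [o3, o4, bias]

# Forward pass
o3 = logsig(w3 @ inp)                # logsig(4) ~ 0.982
o4 = logsig(w4 @ inp)                # logsig(0) = 0.5
o5 = logsig(w5 @ np.array([o3, o4, 1.0]))   # ~0.51

# Output-layer delta and the update of w53 from the text
delta5 = (d - o5) * (1 - o5) * o5    # ~0.1225
w5[0] += eta * delta5 * o3           # 2 -> ~2.012
print(o3, o4, o5, w5[0])
```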
Performance Optimization

Taylor Series Expansion

    F(x) = F(x*) + F′(x*)(x − x*) + (1/2) F″(x*)(x − x*)² + ⋯
           + (1/n!) F⁽ⁿ⁾(x*)(x − x*)ⁿ + ⋯

(demo: nnd8ts)
Example

    F(x) = e^(−x)

Taylor series of F(x) about x* = 0:

    F(x) = e^(−x) = e⁰ − e⁰(x − 0) + (1/2)e⁰(x − 0)² − (1/6)e⁰(x − 0)³ + ⋯

    F(x) = 1 − x + (1/2)x² − (1/6)x³ + ⋯

Taylor series approximations:

    F(x) ≈ F₀(x) = 1
    F(x) ≈ F₁(x) = 1 − x
    F(x) ≈ F₂(x) = 1 − x + (1/2)x²
Plot of Approximations

[Plot: F(x) = e^(−x) together with F₀(x), F₁(x), and F₂(x) on −2 ≤ x ≤ 2.]
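A quick numerical check of the three approximations: near x* = 0 the quadratic approximation F₂ tracks F closely, while farther out all three diverge from it.

```python
import math

def F(x):
    """F(x) = e^(-x)."""
    return math.exp(-x)

# Taylor approximations about x* = 0
def F0(x): return 1.0
def F1(x): return 1.0 - x
def F2(x): return 1.0 - x + 0.5 * x**2

for x in (0.1, 0.5, 1.0):
    print(x, F(x), F0(x), F1(x), F2(x))
```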
Vector Case

    F(x) = F(x1, x2, …, xn)

    F(x) = F(x*) + ∂F/∂x1|_{x=x*} (x1 − x1*) + ∂F/∂x2|_{x=x*} (x2 − x2*) + ⋯
           + ∂F/∂xn|_{x=x*} (xn − xn*)
           + (1/2) ∂²F/∂x1²|_{x=x*} (x1 − x1*)²
           + (1/2) ∂²F/∂x1∂x2|_{x=x*} (x1 − x1*)(x2 − x2*) + ⋯
Matrix Form

    F(x) = F(x*) + ∇F(x)ᵀ|_{x=x*} (x − x*)
           + (1/2)(x − x*)ᵀ ∇²F(x)|_{x=x*} (x − x*) + ⋯

Gradient and Hessian:

    ∇F(x) = [∂F/∂x1, ∂F/∂x2, …, ∂F/∂xn]ᵀ

    ∇²F(x) = | ∂²F/∂x1²      ∂²F/∂x1∂x2   ⋯  ∂²F/∂x1∂xn |
             | ∂²F/∂x2∂x1    ∂²F/∂x2²     ⋯  ∂²F/∂x2∂xn |
             | ⋮                              ⋮          |
             | ∂²F/∂xn∂x1    ∂²F/∂xn∂x2   ⋯  ∂²F/∂xn²   |
Directional Derivatives

First derivative (slope) of F(x) along the xi axis: ∂F(x)/∂xi
(the ith element of the gradient)

Second derivative (curvature) of F(x) along the xi axis: ∂²F(x)/∂xi²
(the (i, i) element of the Hessian)

First derivative (slope) of F(x) along a vector p:

    pᵀ∇F(x) / ‖p‖

Second derivative (curvature) of F(x) along a vector p:

    pᵀ∇²F(x)p / ‖p‖²
Example

    F(x) = x1² + 2x1x2 + 2x2²,    x* = [0.5, 0]ᵀ,    p = [1, −1]ᵀ

    ∇F(x) = [∂F/∂x1, ∂F/∂x2]ᵀ = [2x1 + 2x2, 2x1 + 4x2]ᵀ,  so  ∇F(x*) = [1, 1]ᵀ

    pᵀ∇F(x*) / ‖p‖ = [1 −1][1 1]ᵀ / √2 = 0 / √2 = 0
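The slope along p, and for illustration also the curvature along p, can be computed directly; a minimal sketch (the Hessian of this quadratic is constant):

```python
import numpy as np

# Hessian of F(x) = x1^2 + 2*x1*x2 + 2*x2^2
A = np.array([[2.0, 2.0], [2.0, 4.0]])

def grad(x):
    return np.array([2*x[0] + 2*x[1], 2*x[0] + 4*x[1]])

x_star = np.array([0.5, 0.0])
p = np.array([1.0, -1.0])

slope = p @ grad(x_star) / np.linalg.norm(p)   # 0: no slope along p
curvature = p @ A @ p / (p @ p)                # second derivative along p
print(slope, curvature)
```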
Plots

[Surface and contour plots of F(x) over −2 ≤ x1, x2 ≤ 2, with directional-derivative values (0.0 to 1.4) marked along several directions through x*.]

(demo: nnd8dd)
Minima

Strong Minimum

The point x* is a strong minimum of F(x) if a scalar δ > 0 exists, such that
F(x*) < F(x* + Δx) for all Δx such that δ > ‖Δx‖ > 0.

Global Minimum

The point x* is a unique global minimum of F(x) if

    F(x*) < F(x* + Δx) for all Δx ≠ 0.

Weak Minimum

The point x* is a weak minimum of F(x) if it is not a strong minimum, and a scalar δ > 0
exists, such that F(x*) ≤ F(x* + Δx) for all Δx such that δ > ‖Δx‖ > 0.
Scalar Example

    F(x) = 3x⁴ − 7x² − (1/2)x + 6

[Plot on −2 ≤ x ≤ 2: a strong maximum near x = 0, a strong minimum near x = −1, and the global minimum near x = 1.]
Quadratic Functions

    F(x) = (1/2) xᵀAx + dᵀx + c     (symmetric A)

Gradient of quadratic function:

    ∇F(x) = Ax + d

Hessian of quadratic function:

    ∇²F(x) = A
• If the eigenvalues of the Hessian matrix are all positive,
the function will have a single strong minimum.
• If the eigenvalues are all negative, the function will
have a single strong maximum.
• If some eigenvalues are positive and other eigenvalues
are negative, the function will have a single saddle
point.
• If the eigenvalues are all nonnegative, but some
eigenvalues are zero, then the function will either have
a weak minimum or will have no stationary point.
• If the eigenvalues are all nonpositive, but some
eigenvalues are zero, then the function will either have
a weak maximum or will have no stationary point.
Stationary point nature summary

    xᵀAx        Definiteness of H         Nature of x*
    > 0         positive definite         minimum
    ≥ 0         positive semidefinite     valley
    ≷ 0         indefinite                saddle point
    ≤ 0         negative semidefinite     ridge
    < 0         negative definite         maximum
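The eigenvalue tests above can be wrapped in a small helper; a sketch (the function name and the tolerance are illustrative choices, and the degenerate cases only narrow the answer down, as the bullets note):

```python
import numpy as np

def classify_stationary_point(A, tol=1e-9):
    """Classify the stationary point of F(x) = 1/2 x^T A x + d^T x + c
    from the eigenvalues of the symmetric Hessian A."""
    lam = np.linalg.eigvalsh(A)
    if np.all(lam > tol):
        return "strong minimum"
    if np.all(lam < -tol):
        return "strong maximum"
    if np.any(lam > tol) and np.any(lam < -tol):
        return "saddle point"
    if np.all(lam > -tol):
        return "weak minimum (valley) or no stationary point"
    return "weak maximum (ridge) or no stationary point"

print(classify_stationary_point(np.array([[2.0, 2.0], [2.0, 4.0]])))
```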
Steepest Descent

    F(x) = x1² + 2x1x2 + 2x2² + x1

    x₀ = [0.5, 0.5]ᵀ,    α = 0.1

    ∇F(x) = [∂F/∂x1, ∂F/∂x2]ᵀ = [2x1 + 2x2 + 1, 2x1 + 4x2]ᵀ

    g₀ = ∇F(x)|_{x=x₀} = [3, 3]ᵀ

    x₁ = x₀ − αg₀ = [0.5, 0.5]ᵀ − 0.1[3, 3]ᵀ = [0.2, 0.2]ᵀ

    x₂ = x₁ − αg₁ = [0.2, 0.2]ᵀ − 0.1[1.8, 1.2]ᵀ = [0.02, 0.08]ᵀ

[Contour plot of the descent trajectory over −2 ≤ x1, x2 ≤ 2.]
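The two iterations above can be reproduced with a short loop:

```python
import numpy as np

# Gradient of F(x) = x1^2 + 2*x1*x2 + 2*x2^2 + x1
def grad(x):
    return np.array([2*x[0] + 2*x[1] + 1, 2*x[0] + 4*x[1]])

x = np.array([0.5, 0.5])
alpha = 0.1
for k in range(2):
    g = grad(x)
    x = x - alpha * g
    print(k + 1, g, x)
# g0 = [3, 3] gives x1 = [0.2, 0.2]; g1 = [1.8, 1.2] gives x2 = [0.02, 0.08]
```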

If A is a symmetric matrix with eigenvalues λᵢ, then the eigenvalues of I − αA are 1 − αλᵢ.
Stable Learning Rates (Quadratic)

    F(x) = (1/2) xᵀAx + dᵀx + c

    ∇F(x) = Ax + d

    x_{k+1} = x_k − αg_k = x_k − α(Ax_k + d)   ⟹   x_{k+1} = [I − αA] x_k − αd

Stability is determined by the eigenvalues of the matrix [I − αA]:

    [I − αA] z_i = z_i − αAz_i = z_i − αλ_i z_i = (1 − αλ_i) z_i

(λ_i: eigenvalue of A, z_i: eigenvector; so the (1 − αλ_i) are the eigenvalues of [I − αA].)

Stability requirement:

    |1 − αλ_i| < 1   ⟹   α < 2/λ_i   ⟹   α < 2/λ_max
Example

    A = | 2  2 |
        | 2  4 |

    λ₁ = 0.764,  z₁ = [0.851, −0.526]ᵀ
    λ₂ = 5.24,   z₂ = [0.526, 0.851]ᵀ

    α < 2/λ_max = 2/5.24 = 0.38

[Contour plots of the descent trajectory over −2 ≤ x1, x2 ≤ 2 for α = 0.37 (convergent) and α = 0.39 (divergent).]
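The bound α < 2/λ_max can be verified by simulating steepest descent on the quadratic F(x) = (1/2)xᵀAx, whose minimum is at the origin; the starting point and iteration count below are arbitrary choices:

```python
import numpy as np

A = np.array([[2.0, 2.0], [2.0, 4.0]])
lam = np.linalg.eigvalsh(A)          # ~[0.764, 5.236]
alpha_max = 2.0 / lam.max()          # ~0.38

def descend(alpha, steps=100, x0=(1.0, 1.0)):
    """Run steepest descent on F(x) = 1/2 x^T A x and return the final norm."""
    x = np.array(x0)
    for _ in range(steps):
        x = x - alpha * (A @ x)      # gradient of F is Ax for symmetric A
    return np.linalg.norm(x)

print(descend(0.37))   # below the bound: iterates shrink toward 0
print(descend(0.39))   # above the bound: iterates blow up
```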