Neural Networks and Learning Machines
Third Edition
Simon Haykin and Yanbo Xue
McMaster University, Canada
CHAPTER 1
Rosenblatt's Perceptron
Problem 1.1
Problem 1.2
The output of the neuron is

    y = tanh(v/2),    v = b + Σ_i w_i x_i

Inverting the nonlinearity, this may be rewritten as

    b + Σ_i w_i x_i = ỹ    (1)

where

    ỹ = 2 tanh⁻¹(y)
Problem 1.3
(a) AND gate (Figure 1: Problem 1.3): a hard-limiter neuron with weights w1 = w2 = 1 and bias b = -1.5, so that

    v = w1 x1 + w2 x2 + b = x1 + x2 - 1.5

(b) OR gate (Figure 2: Problem 1.3): a hard-limiter neuron with weights w1 = w2 = 1 and bias b = -0.5, so that

    v = x1 + x2 - 0.5

(c) COMPLEMENT (NOT) gate: a hard-limiter neuron with a single weight w1 = -1 and bias b = 0.5, so that

    v = w1 x + b = -x + 0.5
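These three hard-limiter units can be checked with a short sketch (Python here, with the weights and biases read off the figures):

```python
def hard_limiter(v):
    """McCulloch-Pitts hard limiter: output 1 if v >= 0, else 0."""
    return 1 if v >= 0 else 0

def and_gate(x1, x2):
    return hard_limiter(x1 + x2 - 1.5)   # w1 = w2 = 1, b = -1.5 (Figure 1)

def or_gate(x1, x2):
    return hard_limiter(x1 + x2 - 0.5)   # w1 = w2 = 1, b = -0.5 (Figure 2)

def not_gate(x):
    return hard_limiter(-x + 0.5)        # w = -1, b = 0.5

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, and_gate(x1, x2), or_gate(x1, x2))
print(not_gate(0), not_gate(1))
```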
Problem 1.4
The Gaussian classifier consists of a single unit with a single weight and zero bias, determined in accordance with Eqs. (1.37) and (1.38) of the textbook, respectively, as follows:

    w = (1/σ²)(μ1 - μ2) = 20

    b = (1/(2σ²))(μ2² - μ1²) = 0
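A quick numerical sketch of these two formulas; the means and variance below are illustrative assumptions chosen to be consistent with w = 20 and b = 0, not necessarily the values given in the problem statement:

```python
def gaussian_classifier(mu1, mu2, sigma2):
    """Scalar case of Eqs. (1.37)-(1.38):
    w = (mu1 - mu2)/sigma^2,  b = (mu2^2 - mu1^2)/(2*sigma^2)."""
    w = (mu1 - mu2) / sigma2
    b = (mu2 ** 2 - mu1 ** 2) / (2.0 * sigma2)
    return w, b

# Illustrative (assumed) means and unit variance:
w, b = gaussian_classifier(mu1=10.0, mu2=-10.0, sigma2=1.0)
print(w, b)  # 20.0 0.0
```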
Problem 1.5
Putting the covariance matrix

    C = σ² I

in Eqs. (1.37) and (1.38) of the textbook, we get the following formulas for the weight vector and bias of the Bayes classifier:

    w = (1/σ²)(μ1 - μ2)

    b = (1/(2σ²))(‖μ2‖² - ‖μ1‖²)
CHAPTER 4
Multilayer Perceptrons
Problem 4.1
Figure 4 (Problem 4.1) shows a network of two neurons: hidden neuron 1 receives x1 and x2 with weights +1, +1 and bias -1.5; output neuron 2 receives x1, x2 and y1 with weights +1, +1, -2 and bias -0.5.
Assume that each neuron is represented by a McCulloch-Pitts model. Also assume that

    xi = 1 if the input bit is 1
    xi = 0 if the input bit is 0

The induced local fields are

    v1 = x1 + x2 - 1.5
    v2 = x1 + x2 - 2y1 - 0.5

which give the following truth table:

    x1  x2 | v1    y1 | v2    y2
    0   0  | -1.5  0  | -0.5  0
    0   1  | -0.5  0  |  0.5  1
    1   0  | -0.5  0  |  0.5  1
    1   1  |  0.5  1  | -0.5  0
From this table we observe that the overall output y2 is 0 if x1 and x2 are both 0 or both 1, and it is
1 if x1 is 0 and x2 is 1 or vice versa. In other words, the network of Fig. P4.1 operates as an
EXCLUSIVE OR gate.
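The network's operation can be confirmed in a few lines (a sketch of the two McCulloch-Pitts neurons above):

```python
def mcp(v):
    """McCulloch-Pitts hard limiter."""
    return 1 if v >= 0 else 0

def network(x1, x2):
    y1 = mcp(x1 + x2 - 1.5)               # hidden neuron: v1 = x1 + x2 - 1.5
    return mcp(x1 + x2 - 2 * y1 - 0.5)    # output neuron: v2 = x1 + x2 - 2*y1 - 0.5

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, '->', network(x1, x2))   # reproduces XOR
```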
Problem 4.2
Figure 1 shows the evolutions of the free parameters (synaptic weights and biases) of the neural
network as the back-propagation learning process progresses. Each epoch corresponds to 100 iter-
ations. From the figure, we see that the network reaches a steady state after about 25 epochs. Each
neuron uses a logistic function for its sigmoid nonlinearity. Also the desired response is defined as
Figure 2 shows the final form of the neural network. Note that we have used biases (the negative
of thresholds) for the individual neurons.
Figure 2 (Problem 4.2) shows the final network parameters:

    Hidden neuron 1: w11 = -4.72, w12 = -4.24, b1 = 1.6
    Hidden neuron 2: w21 = -3.51, w22 = -3.52, b2 = 5.0
    Output neuron 3: w31 = -6.80, w32 = 6.44, b3 = -2.85

Figure 2: Problem 4.2
Problem 4.3
    Δw_ji(n) = -η Σ_{t=0}^{n} α^{n-t} ∂E(t)/∂w_ji(t)

             = -η Σ_{t=0}^{n} (-1)^{n-t} |α|^{n-t} ∂E(t)/∂w_ji(t)
Now we find that if the derivative ∂E/∂w_ji has the same algebraic sign on consecutive iterations of the algorithm, the magnitude of the exponentially weighted sum is reduced. The opposite is true when ∂E/∂w_ji alternates its algebraic sign on consecutive iterations. Thus, the effect of the momentum constant is the same as before, except that the effects are reversed, compared to the case when α is positive.
Problem 4.4
    Δw(n) = -η Σ_{t=1}^{n} α^{n-t} ∂E(t)/∂w(t)    (1)

For the cost function

    E = k1 (w - w0)² + k2

we have ∂E(t)/∂w(t) = 2k1 (w(t) - w0). Hence, the application of (1) to this case yields

    Δw(n) = -2ηk1 Σ_{t=1}^{n} α^{n-t} (w(t) - w0)
In this case, the partial derivative ∂E(t)/∂w(t) has the same algebraic sign on consecutive iterations. Hence, with 0 < α < 1 the exponentially weighted adjustment Δw(n) to the weight w at time n grows in magnitude; that is, the weight w is adjusted by a large amount. The inclusion of the momentum constant α in the algorithm for computing the optimum weight w* = w0 tends to accelerate the downhill descent toward this optimum point.
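The accelerating effect of momentum on this quadratic cost can be illustrated with a small sketch; the constants k1, w0, η, and α below are illustrative assumptions (k2 is omitted since it does not affect the gradient):

```python
# Gradient descent on E(w) = k1*(w - w0)**2 + k2, with and without momentum.
k1, w0, eta = 1.0, 5.0, 0.01

def descend(alpha, steps=200):
    w, dw = 0.0, 0.0
    for _ in range(steps):
        grad = 2.0 * k1 * (w - w0)        # dE/dw
        dw = alpha * dw - eta * grad      # momentum term alpha*dw accumulates past adjustments
        w += dw
    return w

plain = descend(alpha=0.0)
momentum = descend(alpha=0.7)
print(abs(plain - w0), abs(momentum - w0))  # momentum lands much closer to w* = w0
```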
Problem 4.5
Consider Fig. 4.14 of the text, which has an input layer, two hidden layers, and a single output
neuron. We note the following:
    y1^(3) = φ(v1^(3)) = F(w, x)

Hence, the derivative of F(w, x) with respect to the synaptic weight w_1k^(3) connecting neuron k in the second hidden layer to the single output neuron is, by the chain rule,

    ∂F/∂w_1k^(3) = (∂F/∂y1^(3)) (∂y1^(3)/∂v1^(3)) (∂v1^(3)/∂w_1k^(3))    (1)

where v1^(3) is the activation potential of the output neuron. Next, we note that

    ∂F/∂y1^(3) = 1

    y1^(3) = φ(v1^(3))

    v1^(3) = Σ_k w_1k^(3) yk^(2)    (2)

where yk^(2) is the output of neuron k in layer 2. We may thus proceed further and write

    ∂y1^(3)/∂v1^(3) = φ′(v1^(3))    (3)

    ∂v1^(3)/∂w_1k^(3) = yk^(2) = φ(vk^(2))    (4)

Combining these results:

    ∂F(w, x)/∂w_1k^(3) = φ′(v1^(3)) φ(vk^(2))
Consider next the derivative of F(w, x) with respect to w_kj^(2), the synaptic weight connecting neuron j in layer 1 (i.e., first hidden layer) to neuron k in layer 2 (i.e., second hidden layer). By the chain rule,

    ∂F/∂w_kj^(2) = (∂F/∂y1^(3)) (∂y1^(3)/∂v1^(3)) (∂v1^(3)/∂yk^(2)) (∂yk^(2)/∂vk^(2)) (∂vk^(2)/∂w_kj^(2))    (5)

where yk^(2) is the output of neuron k in layer 2, and vk^(2) is the activation potential of that neuron. Next we note that

    ∂F(w, x)/∂y1^(3) = 1    (6)

    ∂y1^(3)/∂v1^(3) = φ′(v1^(3))    (7)

    v1^(3) = Σ_k w_1k^(3) yk^(2)

    ∂v1^(3)/∂yk^(2) = w_1k^(3)    (8)

    yk^(2) = φ(vk^(2))

    ∂yk^(2)/∂vk^(2) = φ′(vk^(2))    (9)

    vk^(2) = Σ_j w_kj^(2) yj^(1)

    ∂vk^(2)/∂w_kj^(2) = yj^(1) = φ(vj^(1))    (10)

Combining (6)-(10):

    ∂F(w, x)/∂w_kj^(2) = φ′(v1^(3)) w_1k^(3) φ′(vk^(2)) φ(vj^(1))
Finally, we consider the derivative of F(w, x) with respect to w_ji^(1), the synaptic weight connecting source node i in the input layer to neuron j in layer 1. We may thus write

    ∂F/∂w_ji^(1) = (∂F/∂y1^(3)) (∂y1^(3)/∂v1^(3)) (∂v1^(3)/∂yj^(1)) (∂yj^(1)/∂vj^(1)) (∂vj^(1)/∂w_ji^(1))    (11)

where yj^(1) is the output of neuron j in layer 1, and vj^(1) is the activation potential of that neuron. Next we note that

    ∂F(w, x)/∂y1^(3) = 1    (12)

    ∂y1^(3)/∂v1^(3) = φ′(v1^(3))    (13)

    v1^(3) = Σ_k w_1k^(3) yk^(2)

    ∂v1^(3)/∂yj^(1) = Σ_k w_1k^(3) (∂yk^(2)/∂yj^(1))
                    = Σ_k w_1k^(3) (∂yk^(2)/∂vk^(2)) (∂vk^(2)/∂yj^(1))
                    = Σ_k w_1k^(3) φ′(vk^(2)) (∂vk^(2)/∂yj^(1))    (14)

    ∂vk^(2)/∂yj^(1) = w_kj^(2)    (15)

    yj^(1) = φ(vj^(1))

    ∂yj^(1)/∂vj^(1) = φ′(vj^(1))    (16)

    vj^(1) = Σ_i w_ji^(1) xi

    ∂vj^(1)/∂w_ji^(1) = xi    (17)

Combining (12)-(17), we finally obtain

    ∂F(w, x)/∂w_ji^(1) = φ′(v1^(3)) [Σ_k w_1k^(3) φ′(vk^(2)) w_kj^(2)] φ′(vj^(1)) xi
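The final chain-rule expression can be checked numerically. The sketch below builds a small random network of the same structure (tanh is assumed for φ, and the layer sizes are arbitrary illustrative choices) and compares the analytic derivative ∂F/∂w_ji^(1) with central differences:

```python
import numpy as np

rng = np.random.default_rng(0)

def phi(v):   # sigmoid nonlinearity (tanh is an assumption for illustration)
    return np.tanh(v)

def dphi(v):  # derivative of the nonlinearity
    return 1.0 - np.tanh(v) ** 2

# Small random network: input -> layer 1 -> layer 2 -> single output neuron.
W1 = rng.standard_normal((3, 4))   # weights w_ji^(1)
W2 = rng.standard_normal((2, 3))   # weights w_kj^(2)
w3 = rng.standard_normal(2)        # weights w_1k^(3)
x = rng.standard_normal(4)

def F(W1_):
    y1 = phi(W1_ @ x)
    y2 = phi(W2 @ y1)
    return phi(w3 @ y2)

# Analytic derivative dF/dw_ji^(1) from the chain-rule result above.
v1 = W1 @ x
v2 = W2 @ phi(v1)
v3 = w3 @ phi(v2)
grad = np.zeros_like(W1)
for j in range(W1.shape[0]):
    for i in range(W1.shape[1]):
        inner = np.sum(w3 * dphi(v2) * W2[:, j])   # sum_k w_1k phi'(v_k) w_kj
        grad[j, i] = dphi(v3) * inner * dphi(v1[j]) * x[i]

# Numerical check by central differences.
num = np.zeros_like(W1)
eps = 1e-6
for j in range(W1.shape[0]):
    for i in range(W1.shape[1]):
        Wp, Wm = W1.copy(), W1.copy()
        Wp[j, i] += eps
        Wm[j, i] -= eps
        num[j, i] = (F(Wp) - F(Wm)) / (2.0 * eps)

print(np.max(np.abs(grad - num)))   # tiny (finite-difference error only)
```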
Problem 4.12
    Δw(n) = η(n) p(n)
          = η(n)[-g(n) + β(n-1) p(n-1)]
          ≈ -η(n) g(n) + η(n-1) β(n-1) p(n-1)    (1)

where, in the second term of the last line in (1), we have used η(n-1) in place of η(n). Define

    Δw(n-1) = η(n-1) p(n-1)

Then

    Δw(n) ≈ -η(n) g(n) + β(n-1) Δw(n-1)    (2)

On the other hand, according to the generalized delta rule, we have for neuron j:

    Δw_j(n) = α Δw_j(n-1) + η δ_j(n) y(n)    (3)
Comparing (2) and (3), we observe that they have a similar mathematical form:
The vector -g(n) in the conjugate gradient method plays the role of δ_j(n) y(n), where δ_j(n) is the local gradient of neuron j and y(n) is the vector of inputs for neuron j.
The time-varying parameter β(n-1) in the conjugate-gradient method plays the role of the momentum constant α in the generalized delta rule.
Problem 4.13
    β(n) = -s^T(n-1) A r(n) / s^T(n-1) A s(n-1)    (1)

From the update for the residual,

    r(n) = r(n-1) - η(n-1) A s(n-1)

we have

    η(n-1) A s(n-1) = -(r(n) - r(n-1))    (2)

Premultiplying (2) by s^T(n-1):

    η(n-1) s^T(n-1) A s(n-1) = -s^T(n-1) (r(n) - r(n-1))
                             = s^T(n-1) r(n-1)    (3)

where we have used the orthogonality property s^T(n-1) r(n) = 0. Similarly, premultiplying (2) by r^T(n):

    η(n-1) r^T(n) A s(n-1) = η(n-1) s^T(n-1) A r(n)
                           = -r^T(n) (r(n) - r(n-1))    (4)

where it is noted that A^T = A. Dividing (4) by (3) and invoking the use of (1):

    β(n) = r^T(n) (r(n) - r(n-1)) / s^T(n-1) r(n-1)    (5)
In the linear form of the conjugate gradient method, we have

    s^T(n-1) r(n-1) = r^T(n-1) r(n-1)

in which case (5) reduces to the Polak-Ribière formula:

    β(n) = r^T(n) (r(n) - r(n-1)) / r^T(n-1) r(n-1)    (6)

Moreover, invoking the orthogonality property r^T(n) r(n-1) = 0, (6) reduces to the Fletcher-Reeves formula:

    β(n) = r^T(n) r(n) / r^T(n-1) r(n-1)    (7)
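The Fletcher-Reeves formula can be exercised on a small quadratic cost; the positive-definite matrix A below is randomly generated (an illustrative assumption):

```python
import numpy as np

# Conjugate-gradient minimization of f(w) = (1/2) w^T A w - b^T w using the
# Fletcher-Reeves formula (7); for an n-dimensional quadratic it converges
# in at most n steps.
rng = np.random.default_rng(1)
M = rng.standard_normal((5, 5))
A = M @ M.T + 5.0 * np.eye(5)          # symmetric positive definite
b = rng.standard_normal(5)

w = np.zeros(5)
r = b - A @ w                          # residual r(n) = -gradient
s = r.copy()                           # initial direction s(0) = r(0)
for n in range(5):
    if r @ r < 1e-28:                  # already converged
        break
    eta = (r @ r) / (s @ (A @ s))      # exact line search for a quadratic
    w = w + eta * s
    r_new = r - eta * (A @ s)
    beta = (r_new @ r_new) / (r @ r)   # Fletcher-Reeves beta(n)
    s = r_new + beta * s
    r = r_new

print(np.linalg.norm(A @ w - b))       # ~0 after at most 5 steps
```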
Problem 4.15
In this problem, we explore the operation of a fully connected multilayer perceptron trained with
the back-propagation algorithm. The network has a single hidden layer. It is trained to realize the
following one-to-one mappings:
(a) Inversion:

    f(x) = 1/x,    1 < x < 100

(c) Exponentiation:

    f(x) = e^(-x),    1 < x < 10
learning-rate parameter η = 0.3, and
momentum constant α = 0.7.
Ten different network configurations were trained to learn this mapping. Each network was
trained identically, that is, with the same η and α, with bias terms, and with 10,000 passes of the
training vectors (with one exception noted below). Once each network was trained, the test dataset
was applied to compare the performance and accuracy of each configuration. Table 1 summarizes
the results obtained:
Table 1
    Number of hidden neurons            Average percentage error at the network output
    3                                   4.73%
    4                                   4.43%
    5                                   3.59%
    7                                   1.49%
    10                                  1.12%
    15                                  0.93%
    20                                  0.85%
    30                                  0.94%
    100                                 0.90%
    30 (trained with 100,000 passes)    0.19%
The results of Table 1 indicate that even with a small number of hidden neurons, and with a rela-
tively small number of training passes, the network is able to learn the mapping described in (a)
quite well.
and (b)), are summarized in Table 3:
Table 3
    Number of hidden neurons            Average percentage error at the network output
    2                                   244.00%
    3                                   185.17%
    4                                   134.85%
    5                                   133.67%
    7                                   141.65%
    10                                  158.77%
    15                                  151.91%
    20                                  144.79%
    30                                  137.35%
    100                                 98.09%
    30 (trained with 100,000 passes)    103.99%
These results are unacceptable since the network is unable to generalize when each neuron is
driven to its limits.
The experiment with 30 hidden neurons and 100,000 training passes was repeated, but this
time the hyperbolic tangent function was used as the nonlinearity. The result obtained this time
was an average percentage error of 3.87% at the network output. This last result shows that the
hyperbolic tangent function is a better choice than the logistic function as the sigmoid function for
realizing the mapping f(x) = e^(-x).
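A minimal back-propagation sketch for mapping (a) is given below. The layer size, learning rate, and iteration count are assumptions for illustration, not the exact configuration used in the experiment above:

```python
import numpy as np

# Single-hidden-layer MLP trained by plain gradient descent to approximate
# f(x) = 1/x on 1 < x < 100.
rng = np.random.default_rng(0)
x = rng.uniform(1.0, 100.0, size=(256, 1))
d = 1.0 / x
xs = (x - x.mean()) / x.std()            # standardize inputs for stable training

n_hidden = 10
W1 = rng.standard_normal((1, n_hidden)) * 0.5
b1 = np.zeros(n_hidden)
W2 = rng.standard_normal((n_hidden, 1)) * 0.5
b2 = np.zeros(1)
eta = 0.05

def forward(xs):
    h = np.tanh(xs @ W1 + b1)            # tanh hidden layer
    return h, h @ W2 + b2                # linear output unit

h, y = forward(xs)
loss0 = np.mean((y - d) ** 2)
for _ in range(2000):
    h, y = forward(xs)
    e = y - d                            # output error
    gW2 = h.T @ e / len(xs)
    gb2 = e.mean(axis=0)
    eh = (e @ W2.T) * (1 - h ** 2)       # back-propagated error through tanh
    gW1 = xs.T @ eh / len(xs)
    gb1 = eh.mean(axis=0)
    W2 -= eta * gW2; b2 -= eta * gb2
    W1 -= eta * gW1; b1 -= eta * gb1

h, y = forward(xs)
print(loss0, np.mean((y - d) ** 2))      # final loss well below the initial loss
```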
CHAPTER 5
Kernel Methods and Radial-Basis Function Networks
Problem 5.9
We begin with the cost functional

    J(F) = (1/2) Σ_{i=1}^{N} ∫_{R^{m0}} (f(xi) - F(xi, ξ))² f_ξ(ξ) dξ

where f_ξ(ξ) is the probability density function of a noise distribution ξ in the input space R^{m0}. It is reasonable to assume that the noise vector ξ is additive to the input data vector x. Hence, we may define the cost function J(F) as

    J(F) = (1/2) Σ_{i=1}^{N} ∫_{R^{m0}} (f(xi) - F(xi + ξ))² f_ξ(ξ) dξ    (1)
where (for convenience of presentation) we have interchanged the order of summation and
integration, which is permissible because both operations are linear. Let
    z = xi + ξ,  or equivalently  ξ = z - xi

Then (1) takes the form

    J(F) = (1/2) Σ_{i=1}^{N} ∫_{R^{m0}} (f(xi) - F(z))² f_ξ(z - xi) dz    (2)
Note that the subscript ξ in f_ξ(·) merely refers to the name of the noise distribution and is therefore untouched by the change of variables. Differentiating (2) with respect to F, setting the result equal to zero, and finally solving for F(z), we get the optimal estimator
    F(z) = [Σ_{i=1}^{N} f(xi) f_ξ(z - xi)] / [Σ_{i=1}^{N} f_ξ(z - xi)]
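The optimal estimator has the form of a normalized kernel (Nadaraya-Watson-style) estimate. A short sketch, assuming a Gaussian shape for the noise pdf f_ξ (an illustrative choice):

```python
import numpy as np

def f_nu(u, sigma=0.5):
    """Assumed Gaussian noise density (unnormalized; the constant cancels)."""
    return np.exp(-0.5 * (u / sigma) ** 2)

def F_hat(z, x, fx):
    """F(z) = sum_i f(x_i) f_nu(z - x_i) / sum_i f_nu(z - x_i)."""
    w = f_nu(z - x)
    return np.sum(fx * w) / np.sum(w)

x = np.linspace(-3, 3, 61)     # sample points x_i
fx = np.sin(x)                 # observed values f(x_i)

print(F_hat(0.0, x, fx))       # ~ sin(0) = 0 by symmetry
```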
CHAPTER 6
Support Vector Machines
Problem 6.1
From Eqs. (6.2) in the text we recall that the optimum weight vector wo and optimum bias bo
satisfy the following pair of conditions:
    w_o^T xi + b_o ≥ +1  for di = +1
    w_o^T xi + b_o ≤ -1  for di = -1

It follows that

    min_{i=1,2,...,N} |w_o^T xi + b_o| = 1
Problem 6.2
Problem 6.3
We start with the primal problem formulated as follows (see Eq. (6.15) of the text):

    J(w, b, α) = (1/2) w^T w - Σ_{i=1}^{N} αi di (w^T xi + b) + Σ_{i=1}^{N} αi    (1)

Minimizing with respect to w yields

    w = Σ_{i=1}^{N} αi di xi

Premultiplying w by w^T:

    w^T w = Σ_{i=1}^{N} αi di w^T xi    (2)

Substituting w = Σ_{i=1}^{N} αi di xi into (2), we may redefine the inner product w^T w as the double summation:

    w^T w = Σ_{i=1}^{N} Σ_{j=1}^{N} αi αj di dj xj^T xi    (3)

Accordingly, the objective function (1) reduces to

    Q(α) = -(1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} αi αj di dj xj^T xi + Σ_{i=1}^{N} αi    (4)

subject to the constraint

    Σ_{i=1}^{N} αi di = 0

Recognizing that αi ≥ 0 for all i, we see that (4) is the formulation of the dual problem.
Problem 6.4
Consider a support vector machine designed for nonseparable patterns. Assuming the use of the
leave-one-out-method for training the machine, the following situations may arise when the
example left out is used as a test example:
Problem 6.5
By definition, a support vector machine is designed to maximize the margin of separation between
the examples drawn from different classes. This definition applies to all sources of data, be they
noisy or otherwise. It therefore follows that, by its very nature, the support vector machine is
robust to the presence of additive noise in the data used for training and testing, provided that all
the data are drawn from the same population.
Problem 6.6
Since the Gram matrix K = {k(xi, xj)} is a symmetric square matrix, it can be diagonalized using the similarity transformation

    K = QΛQ^T

Hence the kernel may be expanded as

    k(xi, xj) = (QΛQ^T)_ij
              = Σ_{l=1}^{m1} (Q)_il (Λ)_ll (Q^T)_lj
              = Σ_{l=1}^{m1} (Q)_il (Λ)_ll (Q)_jl    (1)

Let ui denote the ith row of matrix Q. (Note that ui is not an eigenvector.) We may then rewrite (1) as the inner product

    k(xi, xj) = ui^T Λ uj
              = (Λ^(1/2) ui)^T (Λ^(1/2) uj)    (2)

where Λ^(1/2) is the square root of Λ.
By definition, we have

    k(xi, xj) = φ^T(xi) φ(xj)    (3)

Comparing (2) and (3), we deduce that the mapping from the input space to the hidden (feature) space of a support vector machine is described by

    φ: xi → Λ^(1/2) ui
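The mapping can be computed directly from the eigendecomposition of a Gram matrix; the sketch below uses a polynomial kernel with p = 2 as an illustrative choice and verifies that the recovered feature vectors reproduce K:

```python
import numpy as np

# Eigendecompose a Gram matrix K = Q Lambda Q^T and recover the feature map
# phi(x_i) = Lambda^{1/2} u_i, where u_i is the i-th row of Q.
rng = np.random.default_rng(0)
X = rng.standard_normal((6, 3))
K = (X @ X.T + 1.0) ** 2               # polynomial-kernel Gram matrix (p = 2)

lam, Q = np.linalg.eigh(K)             # K = Q diag(lam) Q^T
lam = np.clip(lam, 0.0, None)          # Mercer kernel => eigenvalues >= 0 (clip rounding)
Phi = Q * np.sqrt(lam)                 # row i is Lambda^{1/2} u_i

print(np.max(np.abs(Phi @ Phi.T - K)))  # ~0: inner products reproduce K
```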
Problem 6.7
Recall from Problem 6.6 the mapping

    φ: xi → Λ^(1/2) ui

(a) Suppose the input vector xi is multiplied by the orthogonal (unitary) matrix Q. We then have a new mapping described by

    φ: Qxi → QΛ^(1/2) ui

Hence

    k(Qxi, Qxj) = (QΛ^(1/2) ui)^T (QΛ^(1/2) uj)
                = (Λ^(1/2) ui)^T Q^T Q (Λ^(1/2) uj)    (1)

where ui is the ith row of Q. From the definition of an orthogonal (unitary) matrix:

    Q^(-1) = Q^T

or equivalently

    Q^T Q = I

Hence (1) reduces to

    k(Qxi, Qxj) = (Λ^(1/2) ui)^T (Λ^(1/2) uj)
                = k(xi, xj)

(b) Consider first the polynomial machine described by

    k(Qxi, Qxj) = ((Qxi)^T (Qxj) + 1)^p
                = (xi^T Q^T Q xj + 1)^p
                = (xi^T xj + 1)^p
                = k(xi, xj)

Next, for the RBF network,

    k(Qxi, Qxj) = exp(-(1/(2σ²)) ‖Qxi - Qxj‖²)
                = exp(-(1/(2σ²)) (Qxi - Qxj)^T (Qxi - Qxj))
                = exp(-(1/(2σ²)) (xi - xj)^T Q^T Q (xi - xj))
                = exp(-(1/(2σ²)) (xi - xj)^T (xi - xj)),    Q^T Q = I
                = k(xi, xj)

Finally, for the MLP,

    k(Qxi, Qxj) = tanh(β0 (Qxi)^T (Qxj) + β1)
                = tanh(β0 xi^T Q^T Q xj + β1)
                = tanh(β0 xi^T xj + β1)
                = k(xi, xj)
Thus all three types of the support vector machine, namely, the polynomial machine, RBF
network, and MLP, satisfy the unitary invariance property in their own individual ways.
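The unitary-invariance property can also be checked numerically for all three kernels; the kernel parameters below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
xi, xj = rng.standard_normal(4), rng.standard_normal(4)
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))   # random orthogonal matrix

poly = lambda a, b: (a @ b + 1.0) ** 3                       # polynomial machine
rbf  = lambda a, b: np.exp(-np.sum((a - b) ** 2) / 2.0)      # RBF network
mlp  = lambda a, b: np.tanh(0.5 * (a @ b) + 1.0)             # MLP kernel

for k in (poly, rbf, mlp):
    print(k(Q @ xi, Q @ xj) - k(xi, xj))                     # each ~0
```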
Problem 6.17
The truth table for the XOR function, operating on a three-dimensional pattern x, is as follows:
Table 1
    x1   x2   x3  |  Desired response y
    +1   +1   +1  |  +1
    +1   -1   +1  |  -1
    -1   +1   +1  |  -1
    +1   +1   -1  |  -1
    +1   -1   -1  |  +1
    -1   +1   -1  |  +1
    -1   -1   -1  |  -1
    -1   -1   +1  |  +1
To proceed with the support vector machine for solving this multidimensional XOR problem, let
the Mercer kernel
    k(x, xi) = (1 + x^T xi)^p

The minimum value of the power p (a positive integer) needed for this problem is p = 3. For p = 2, we end up with a zero weight vector, which is clearly unacceptable. Setting p = 3:

    k(x, xi) = (1 + x^T xi)^3
             = 1 + 3 x^T xi + 3 (x^T xi)² + (x^T xi)³

where

    x = [x1, x2, x3]^T

and likewise for xi. Then, proceeding in a manner similar to, but much more cumbersome than, that described for the two-dimensional XOR problem in Section 6.6, we end up with a polynomial machine defined by

    y = x1 x2 x3
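The resulting polynomial machine y = x1 x2 x3 reproduces the truth table above, since the product of ±1 inputs is exactly the three-input XOR (parity) function:

```python
from itertools import product

for x1, x2, x3 in product((+1, -1), repeat=3):
    print(x1, x2, x3, '->', x1 * x2 * x3)
```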
CHAPTER 8
Principal-Components Analysis
Problem 8.5
    λ0 = 1 + σ²    (1)

    q0 = s    (2)

    R = s s^T + σ² I    (3)

where s is the signal vector and σ² is the variance of an element of the additive noise vector. Hence, using (2) and (3):

    λ0 = q0^T R q0 / q0^T q0
       = s^T (s s^T + σ² I) s / s^T s
       = ((s^T s)(s^T s) + σ² (s^T s)) / s^T s
       = s^T s + σ²
       = ‖s‖² + σ²    (4)

With ‖s‖ = 1, it follows that

    λ0 = 1 + σ²
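A quick numerical check of this result; the signal vector and noise variance below are arbitrary illustrative choices:

```python
import numpy as np

# For R = s s^T + sigma^2 I with ||s|| = 1, the largest eigenvalue is
# 1 + sigma^2, with eigenvector s.
rng = np.random.default_rng(0)
s = rng.standard_normal(5)
s /= np.linalg.norm(s)                 # unit-norm signal vector
sigma2 = 0.3
R = np.outer(s, s) + sigma2 * np.eye(5)

lam = np.linalg.eigvalsh(R)            # eigenvalues in ascending order
print(lam.max())                       # 1 + sigma^2 = 1.3
```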
Problem 8.6
    w(n+1) = w(n) + η y(n)[x(n) - y(n) w(n)]    (1)

As the number of iterations n approaches infinity,

    x(n) → y(n) q1    (2)

where q1 is the eigenvector associated with the largest eigenvalue λ1 of the correlation matrix R = E[x(n) x^T(n)], where E is the expectation operator. Multiplying (2) by its own transpose and then taking expectations, we get

    E[x(n) x^T(n)] = E[y²(n)] q1 q1^T

that is,

    R = σ_Y² q1 q1^T    (3)

where σ_Y² is the variance of the output y(n). Post-multiplying (3) by q1:

    R q1 = σ_Y² q1 (q1^T q1) = σ_Y² q1    (4)

where it is noted that ‖q1‖ = 1 by definition. From (4) we readily see that σ_Y² = λ1, which is the desired result.
Problem 8.7
Writing the learning algorithm for minor components analysis in matrix form:
    w(n+1) = w(n) - η y(n)[x(n) - y(n) w(n)]

Proceeding in a manner similar to that described in Section 8.5 of the textbook, we have the nonlinear differential equation:

    d w(t)/dt = [w^T(t) R w(t)] w(t) - R w(t)

Define

    w(t) = Σ_{k=1}^{M} θk(t) qk    (1)

where qk is the kth eigenvector of the correlation matrix R = E[x(n) x^T(n)] and the coefficient θk(t) is the projection of w(t) onto qk. We may then identify two cases as summarized here:

Case I: k ≠ m. Define

    αk(t) = θk(t) / θm(t)  for some fixed m    (2)

Then

    d αk(t)/dt = (λm - λk) αk(t)    (3)

With m chosen as the index of the smallest eigenvalue, λm < λk for all k ≠ m, and it follows that αk(t) → 0 as t → ∞.

Case II: k = m. Here

    d θm(t)/dt = -λm θm(t) (θm²(t) - 1)  for t → ∞    (4)

Hence, θm(t) → 1 as t → ∞.

Thus, in light of the results derived for Cases I and II, we deduce from (1) that w(t) → qm as t → ∞, where qm is the eigenvector associated with the smallest eigenvalue λm.
Problem 8.8
The generalized Hebbian algorithm (GHA) for neuron j may be written as

    Δwj = η (yj x′ - yj² wj)    (1)

    x′ = x - Σ_{k=0}^{j-1} wk yk    (2)

where, for convenience of presentation, we have omitted the dependence on time n. Equations (1) and (2) may be represented by a vector-valued signal-flow graph.

[Signal-flow graph: the input x passes through a chain of stages; at stage k (k = 0, 1, ..., j-1) the branch -yk wk is subtracted from x, and the final stage forms yj and the weight update Δwj.]

Note: The dashed lines in the graph indicate inner (dot) products formed by the input vector x and the pertinent synaptic weight vectors w0, w1, ..., wj to produce y0, y1, ..., yj, respectively.
Problem 8.9
Consider a network consisting of a single layer of neurons with feedforward connections. The
algorithm for adjusting the matrix of synaptic weights W(n) of the network is described by the
recursive equation (see Eq. (8.91) of the text):
    W(n+1) = W(n) + η(n){y(n) x^T(n) - LT[y(n) y^T(n)] W(n)}    (1)
where x(n) is the input vector, y(n) is the output vector; and LT[.] is a matrix operator that sets all
the elements above the diagonal of the matrix argument to zero, thereby making it lower
triangular.
First, we note that the asymptotic stability theorem discussed in the text does not apply
directly to the convergence analysis of stochastic approximation algorithms involving matrices; it
is formulated to apply to vectors. However, we may write the elements of the parameter (synaptic
weight) matrix W(n) in (1) as a vector, that is, one column vector stacked up on top of another. We
may then interpret the resulting nonlinear update equation in a corresponding way and so proceed
to apply the asymptotic stability theorem directly.
To prove the convergence of the learning algorithm described in (1), we may use the
method of induction to show that if the first j columns of matrix W(n) converge to the first j
eigenvectors of the correlation matrix R = E[x(n)xT(n)], then the (j + 1)th column will converge to
the (j + 1)th eigenvector of R. Here we use the fact that in light of the convergence of the
maximum eigenfilter involving a single neuron, the first column of the matrix W(n) converges
with probability 1 to the first eigenvector of R, and so on.
Problem 8.10
The results of a computer experiment on the training of a single-layer feedforward network using
the generalized Hebbian algorithm are described by Sanger (1990). The network has 16 output
neurons, and 4096 inputs arranged as a 64 x 64 grid of pixels. The training involved presentation
of 2000 samples, which are produced by low-pass filtering a white Gaussian noise image and then
multiplying with a Gaussian window function. The low-pass filter was a Gaussian function with
standard deviation of 2 pixels, and the window had a standard deviation of 8 pixels.
Figure 1, presented on the next page, shows the first 16 receptive field masks learned by
the network (Sanger, 1990). In this figure, positive weights are indicated by white and negative
weights are indicated by black; the ordering is left-to-right and top-to-bottom.
The first mask is a low-pass filter since the input has most of its energy near dc (zero
frequency).
The second mask cannot be a low-pass filter, so it must be a band-pass filter with a mid-band
frequency as small as possible since the input power decreases with increasing frequency.
Continuing the analysis in the manner described above, the frequency response of successive
masks approaches dc as closely as possible, subject (of course) to being orthogonal to
previous masks.
The end result is a sequence of orthogonal masks that respond to progressively higher
frequencies.
Figure 1: Problem 8.10 (Reproduced with permission of Biological Cybernetics)
CHAPTER 9
Self-Organizing Maps
Problem 9.1
Expanding g(yj) in a Taylor series about yj = 0:

    g(yj) = g(0) + g^(1)(0) yj + (1/2!) g^(2)(0) yj² + ···    (1)

where

    g^(k)(0) = ∂^k g(yj)/∂yj^k  evaluated at yj = 0,  for k = 1, 2, ....

Let

    yj = 1 if neuron j is on
    yj = 0 if neuron j is off

Then

    g(yj) = g(0) + g^(1)(0) + (1/2!) g^(2)(0) + ···   if neuron j is on
    g(yj) = g(0)                                      if neuron j is off

The weight update is

    dwj/dt = η yj x - g(yj) wj

           = ηx - [g(0) + g^(1)(0) + (1/2!) g^(2)(0) + ···] wj   if neuron j is on
           = -g(0) wj                                            if neuron j is off
Consequently, a nonzero g(0) has the effect of making dwj/dt assume a nonzero value when
neuron j is off, which is undesirable. To alleviate this problem, we make g(0) = 0.
Problem 9.2
Assume that y(c) is a minimum L2 (least-squares) distortion vector quantizer for the code vector c. We may then form the distortion function

    D2 = (1/2) ∫ f(c) ‖c - c′(y(c))‖² dc

This distortion function is similar to that of Eq. (10.20) in the text, except for the use of c and c′ in place of x and x′, respectively. We wish to minimize D2 with respect to y(c) and c′(y).
Assuming that π(ν) is a smooth function of the noise vector ν, we may expand the decoder output x′ using a Taylor series. In particular, using a second-order approximation, we get (Luttrell, 1989b)

    D2 ≈ (1/2) ∫ f(x) [1 + (σ_π²/2) ∇²] ‖x′(c(x)) - x‖² dx    (1)

where the noise distribution π(ν) satisfies the moment conditions

    ∫ π(ν) dν = 1
    ∫ νi π(ν) dν = 0
    ∫ νi νj π(ν) dν = σ_π² δij

The first term on the right-hand side of (1) is the conventional distortion term.
The second term (i.e., the curvature term) arises due to the output noise model π(ν).
Problem 9.3
Consider the Peano curve shown in part (d) of Fig. 9.9 of the text. This particular self-organizing
feature map pertains to a one-dimensional lattice fed with a two-dimensional input. We see that
(counting from left to right) neuron 14, say, is quite close to neuron 97. It is therefore possible for
a large enough input perturbation to make neuron 14 jump into the neighborhood of neuron 97, or
vice versa. If this change were to happen, the topological preserving property of the SOM
algorithm would no longer hold.
The network is trained with an input consisting of 8 Gaussian clouds with unit variance but different
centers. The centers are located at the points (0,0,0,...,0), (4,0,0,...,0), (4,4,0,...,0), (0,4,0,...,0),
(0,0,4,...,0), (4,0,4, ...,0), (4,4,4, ..., 0), and (0,4,4, ...,0). The clouds occupy the 8 corners of a cube
as shown in Fig. 1a. The resulting labeled feature map computed by the SOM algorithm is shown
in Fig. 1b. Although each of the classes is grouped together in the map, the planar feature map
fails to capture the complete topology of the input space. In particular, we observe that class 6 is
adjacent to class 2 in the input space, but is not adjacent to it in the feature map.
The conclusion to be drawn here is that although the SOM algorithm does perform
clustering on the input space, it may not always completely preserve the topology of the input
space.
Problem 9.4
Consider for example a two-dimensional lattice using the SOM algorithm to learn a two-
dimensional input distribution as illustrated in Fig. 9.8 in the textbook. Suppose that the neuron at
the center of the lattice breaks down; this failure may have a dramatic effect on the evolution of
the feature map. On the other hand, a small perturbation applied to the input space leaves the map
learned by the lattice essentially unchanged.
Problem 9.5
    wj = Σ_i π_{j,i} xi / Σ_i π_{j,i}  for some prescribed neuron j    (1)

where π_{j,i} is the discretized version of the pdf π(ν) of the noise vector ν. From Table 9.1 of the text we recall that π_{j,i} plays a role analogous to that of the neighborhood function. Indeed, we can
substitute h_{j,i(x)} for π_{j,i} in (1). We are interested in rewriting (1) in a form that highlights the role of Voronoi cells. To this end we note that the dependence of the neighborhood function h_{j,i(x)}, and therefore π_{j,i}, on the input pattern x is indirect, with the dependence being through the Voronoi cell in which x lies. Hence, for all input patterns that lie in a particular Voronoi cell the same neighborhood function applies. Let each Voronoi cell be identified by an indicator function I_{i,k} interpreted as follows: I_{i,k} = 1 if the input pattern xi lies in the Voronoi cell corresponding to winning neuron k, and I_{i,k} = 0 otherwise. Then in light of these considerations we may rewrite (1) in the new form
    wj = Σ_k Σ_i π_{j,k} I_{i,k} xi / Σ_k Σ_i π_{j,k} I_{i,k}    (2)

Now let mk denote the centroid of the Voronoi cell of neuron k and Nk denote the number of input patterns that lie in that cell. We may then simplify (2) as

    wj = Σ_k π_{j,k} Nk mk / Σ_k π_{j,k} Nk
       = Σ_k W_{j,k} mk    (3)

where

    W_{j,k} = π_{j,k} Nk / Σ_k π_{j,k} Nk    (4)

with

    Σ_k W_{j,k} = 1  for all j
The width of the neighborhood function plays the role of the span of the kernel.
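Equations (3) and (4) can be sketched directly; the neighborhood values, cell counts, and centroids below are illustrative assumptions:

```python
import numpy as np

# The converged SOM weight vector w_j is a convex combination of Voronoi-cell
# centroids m_k, weighted by the discretized neighborhood pi_{j,k} and the
# cell counts N_k.
pi_jk = np.array([0.1, 0.6, 0.3])    # neighborhood weights pi_{j,k} for neuron j
N_k = np.array([10, 25, 5])          # number of patterns per Voronoi cell
m_k = np.array([[0.0, 0.0],
                [1.0, 0.0],
                [1.0, 1.0]])         # cell centroids

W_jk = pi_jk * N_k / np.sum(pi_jk * N_k)   # Eq. (4): normalized weights
w_j = W_jk @ m_k                            # Eq. (3): convex combination

print(W_jk.sum(), w_j)               # weights sum to 1; w_j lies in the convex hull
```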
Problem 9.6
In its basic form, Hebb's postulate of learning states that the adjustment Δw_kj applied to the synaptic weight w_kj is defined by

    Δw_kj = η yk xj

where yk is the output signal produced in response to the input signal xj.
The weight update for the maximum eigenfilter includes the Hebbian term η yk xj and, additionally, a stabilizing term defined by -η yk² w_kj. The term η yk xj provides for synaptic amplification.
In contrast, in the SOM algorithm two modifications are made to Hebb's postulate of learning:
The net result of these two modifications is to make the weight update for the SOM algorithm
assume a form similar to that in competitive learning rather than Hebbian learning.
Problem 9.7
In Fig. 1 (shown on the next page), we summarize the density matching results of computer
simulation on a one-dimensional lattice consisting of 20 neurons. The network is trained with a
triangular input density. Two sets of results are displayed in this figure:
In Fig. 1, we have also included the exact result. Although it appears that both algorithms fail to
match the input density exactly, we see that the conscience algorithm comes closer to the exact
result than the standard SOM algorithm.
Figure 1: Problem 9.7
Problem 9.11
Two distinct phases in the learning process can be recognized from this figure:
The neurons become ordered (i.e., the one-dimensional lattice becomes untangled), which
happens at about 20 iterations.
The neurons spread out to match the density of the input distribution, culminating in the
steady-state condition attained after 25,000 iterations.
Figure 1: Problem 9.11
CHAPTER 10
Information-Theoretic Learning Models
Problem 10.1
The maximum entropy distribution of the random variable X is a uniform distribution over the range [a, b], as shown by

    f_X(x) = 1/(b - a),  a ≤ x ≤ b
           = 0,          otherwise

Hence,

    h(X) = -∫ f_X(x) log f_X(x) dx
         = ∫_a^b (1/(b - a)) log(b - a) dx
         = log(b - a)
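A numerical check of h(X) = log(b - a) by midpoint integration; the interval [2, 7] is an arbitrary illustrative choice:

```python
import numpy as np

a, b = 2.0, 7.0
N = 100000
x = np.linspace(a, b, N, endpoint=False) + (b - a) / (2 * N)  # midpoints
fx = np.full_like(x, 1.0 / (b - a))
h = -np.sum(fx * np.log(fx)) * (b - a) / N    # midpoint-rule integral of -f log f

print(h, np.log(b - a))                        # both ~ log(5)
```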
Problem 10.3
Let

    Yi = ai^T X1
    Zi = bi^T X2

where the vectors X1 and X2 have multivariate Gaussian distributions. The correlation coefficient between Yi and Zi is defined by

    ρi = E[Yi Zi] / (E[Yi²] E[Zi²])^(1/2)

       = ai^T E[X1 X2^T] bi / {(ai^T E[X1 X1^T] ai)(bi^T E[X2 X2^T] bi)}^(1/2)

       = ai^T Σ12 bi / {(ai^T Σ11 ai)(bi^T Σ22 bi)}^(1/2)    (1)
where

    Σ11 = E[X1 X1^T]
    Σ12 = E[X1 X2^T] = Σ21^T
    Σ22 = E[X2 X2^T]

The mutual information between Yi and Zi is

    I(Yi; Zi) = -(1/2) log(1 - ρi²)
Let r denote the rank of the cross-covariance matrix Σ12. Given the vectors X1 and X2, we may invoke the idea of canonical correlations as summarized here:
Find the pair of random variables Y1 = a1^T X1 and Z1 = b1^T X2 that are most highly correlated.
Extract the pair of random variables Y2 = a2^T X1 and Z2 = b2^T X2 in such a way that Y1 and Y2 are uncorrelated and so are Z1 and Z2.
Continue these two steps until at most r pairs of variables {(Y1, Z1), (Y2, Z2), ..., (Yr, Zr)} have been extracted.
The essence of the canonical correlations described above is to encapsulate the dependence between the random vectors X1 and X2 in the sequence {(Y1, Z1), (Y2, Z2), ..., (Yr, Zr)}. The uncorrelatedness of the pairs in this sequence means that the mutual information between the vectors X1 and X2 is the sum of the mutual
information measures between the individual elements of the pairs {(Yi, Zi)}, i = 1, ..., r. That is, we may write

    I(X1; X2) = Σ_{i=1}^{r} I(Yi; Zi) + constant

              = -(1/2) Σ_{i=1}^{r} log(1 - ρi²) + constant
Problem 10.4
Consider a multilayer perceptron with a single hidden layer. Let wji denote the synaptic weight of hidden neuron j connected to source node i in the input layer. Let xi|α denote the ith component of the input vector x, given example α. Then the induced local field of hidden neuron j is

    vj = Σ_i wji xi|α    (1)

and its output is

    yj = φ(vj)    (2)

where φ(·) is the logistic function

    φ(v) = 1/(1 + e^(-v))

Consider next the output layer of the network. Let wkj denote the synaptic weight of output neuron k connected to hidden neuron j. The induced local field of output neuron k is

    vk = Σ_j wkj yj    (3)

and the corresponding output is

    yk = φ(vk)    (4)

Let

    pk|α = yk|α    (5)
Accordingly, we may view yk|α as an estimate of the conditional probability that the proposition k is true, given the example α at the input. On this basis, we may interpret

    1 - yk|α = 1 - pk|α

as the estimate of the conditional probability that the proposition k is false, given the input example α. Correspondingly, let qk|α denote the actual (true) value of the conditional probability that the proposition k is true, given the input example α. This means that 1 - qk|α is the actual
value of the conditional probability that the proposition k is false, given the input example α. Thus, we may define the Kullback-Leibler divergence for the multilayer perceptron as
    D_{p‖q} = Σ_α p_α Σ_k [qk|α log(qk|α / pk|α) + (1 - qk|α) log((1 - qk|α) / (1 - pk|α))]

By the chain rule,

    ∂D_{p‖q}/∂wkj = Σ_α (∂D_{p‖q}/∂pk|α)(∂pk|α/∂yk|α)(∂yk|α/∂vk)(∂vk/∂wkj)
                  = -Σ_α p_α (qk|α - pk|α) yj|α    (6)

Next, we express the partial derivative of D_{p‖q} with respect to the synaptic weight wji of hidden neuron j by writing

    ∂D_{p‖q}/∂wji = -Σ_α p_α Σ_k [qk|α/pk|α - (1 - qk|α)/(1 - pk|α)] ∂pk|α/∂wji    (7)

with

    ∂pk|α/∂wji = (∂pk|α/∂yk|α)(∂yk|α/∂vk)(∂vk/∂yj)(∂yj/∂vj)(∂vj/∂wji)
               = φ′(vk) wkj φ′(vj) xi|α    (8)

But

    φ′(vk) = yk(1 - yk) = pk|α(1 - pk|α)    (9)

Combining (7)-(9):

    ∂D_{p‖q}/∂wji = Σ_α p_α φ′(vj) xi|α Σ_k (pk|α - qk|α) wkj    (10)
where φ′(·) is the derivative of the logistic function φ(·) with respect to its argument.
Assuming the use of the learning-rate parameter η for all weight changes applied to the network, we may use the method of steepest descent to write the following two-step probabilistic algorithm:

    Δwkj = η Σ_α p_α (qk|α - pk|α) yj|α

    Δwji = η Σ_α p_α φ′(vj) xi|α Σ_k (qk|α - pk|α) wkj
Problem 10.9
We first note that the mutual information between the random variables X and Y is defined by
I ( X ;Y ) = h ( X ) + h ( Y ) h ( X , Y )
To maximize the mutual information I(X;Y) we need to maximize the sum of the differential
entropy h(X) and the differential entropy h ( Y ) and also minimize the joint differential entropy
h(X,Y). From the definition of differential entropy, both h(X) and h(Y) attain their maximum value
of 0.5 when X and Y occur with probability 1/2. Moreover h(X,Y) is minimized when the joint
probability of X and Y occupies the smallest possible region in the probability space.
Problem 10.10
The outputs Y1 and Y2 of the two neurons in Fig. P10.6 in the text are respectively defined by

    Y1 = Σ_{i=1}^{m} w1i Xi + N1

    Y2 = Σ_{i=1}^{m} w2i Xi + N2
where the w1i are the synaptic weights of output neuron 1, and the w2i are the synaptic weights of output
neuron 2. The mutual information between the output vector Y = [Y1, Y2]T and the input vector X
= [X1, X2, .., Xm]T is
I ( X ;Y ) = h ( Y ) h ( Y X )
= h(Y) h(N) (1)
where h(Y) is the differential entropy of the output vector Y and h(N) is the differential entropy of
the noise vector N = [N1, N2]T.
Since the noise terms N1 and N2 are Gaussian and uncorrelated, it follows that they are
statistically independent. Hence,
h ( N ) = h ( N 1, N 2 )
= h ( N1 ) + h ( N2 )
2
= 1 + log ( 2 N ) (2)
h ( Y ) = h ( Y 1 ,Y 2 )
= f Y 1 ,Y 2 ( y 1, y 2 ) log f Y 1 ,Y 2 ( y 1, y 2 ) d y 1 d y 2
where $f_{Y_1,Y_2}(y_1,y_2)$ is the joint pdf of Y1 and Y2. Both Y1 and Y2 depend on the same set of input signals, and so they are correlated with each other. Let

$$\mathbf{R} = E[\mathbf{Y}\mathbf{Y}^T] = \begin{bmatrix} r_{11} & r_{12} \\ r_{21} & r_{22} \end{bmatrix}$$

where

$$r_{ij} = E[Y_i Y_j], \qquad i, j = 1, 2$$

$$r_{11} = \sigma_1^2 + \sigma_N^2, \qquad r_{12} = r_{21} = \sigma_1\sigma_2\rho_{12}, \qquad r_{22} = \sigma_2^2 + \sigma_N^2$$

where $\sigma_1^2$ and $\sigma_2^2$ are the respective variances of Y1 and Y2 in the absence of noise, and $\rho_{12}$ is their correlation coefficient, also in the absence of noise. For the general case of an N-dimensional Gaussian distribution, we have

$$f_{\mathbf{Y}}(\mathbf{y}) = \frac{1}{(2\pi)^{N/2}(\det\mathbf{R})^{1/2}} \exp\left(-\frac{1}{2}\,\mathbf{y}^T\mathbf{R}^{-1}\mathbf{y}\right)$$

$$h(\mathbf{Y}) = \frac{1}{2}\log\left((2\pi e)^N \det(\mathbf{R})\right)$$

where e is the base of the natural logarithm. For the problem at hand, we have N = 2, and so

$$h(\mathbf{Y}) = \log\left(2\pi e\sqrt{\det(\mathbf{R})}\right) = 1 + \log\left(2\pi\sqrt{\det(\mathbf{R})}\right)$$

Hence, using this result together with (2) in (1),

$$I(\mathbf{X};\mathbf{Y}) = \log\frac{\sqrt{\det(\mathbf{R})}}{\sigma_N^2} \tag{4}$$
For a fixed noise variance $\sigma_N^2$, the mutual information I(X;Y) is maximized by maximizing the determinant det(R). By definition,

$$\det(\mathbf{R}) = r_{11}r_{22} - r_{12}r_{21}$$

That is,

$$\det(\mathbf{R}) = \sigma_N^4 + \sigma_N^2(\sigma_1^2 + \sigma_2^2) + \sigma_1^2\sigma_2^2(1 - \rho_{12}^2) \tag{5}$$

Depending on the value of the noise variance $\sigma_N^2$, we may identify two distinct situations:

1. Large noise variance. When $\sigma_N^2$ is large, the third term in (5) may be neglected, obtaining

$$\det(\mathbf{R}) \approx \sigma_N^4 + \sigma_N^2(\sigma_1^2 + \sigma_2^2)$$

In this case, maximizing det(R) requires that we maximize $(\sigma_1^2 + \sigma_2^2)$. This requirement may be satisfied simply by maximizing the variance $\sigma_1^2$ of output Y1 or the variance $\sigma_2^2$ of output Y2, separately. Since the variance of output $Y_i$, i = 1, 2, is equal to $\sigma_i^2$ in the absence of noise and $\sigma_i^2 + \sigma_N^2$ in the presence of noise, it follows from the Infomax principle that the optimum solution for a fixed noise variance is to maximize the variance of either output, Y1 or Y2.

2. Low noise variance. When the noise variance $\sigma_N^2$ is small, the third term $\sigma_1^2\sigma_2^2(1 - \rho_{12}^2)$ in (5) becomes important relative to the other two terms. The mutual information I(X;Y) is then maximized by making an optimal tradeoff between two options: keeping the output variances $\sigma_1^2$ and $\sigma_2^2$ large, and making the outputs Y1 and Y2 of the two neurons uncorrelated.

Based on these observations, we may now make the following two statements:

A high-noise level favors redundancy of response, in which case the two output neurons compute the same linear combination of inputs. Only one such combination yields a response with maximum variance.

A low-noise level favors diversity of response, in which case the two output neurons compute different linear combinations of inputs, even though such a choice may result in a reduced output variance.
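The two noise regimes described above can be illustrated numerically from Eq. (5); the sketch below uses arbitrary illustrative variances (numpy is assumed available):

```python
import numpy as np

# Illustration of det(R) = sN^4 + sN^2 (s1^2 + s2^2) + s1^2 s2^2 (1 - rho^2)
# for the two-neuron Infomax model; all variance values are illustrative.
def det_R(s1_sq, s2_sq, sN_sq, rho):
    return sN_sq**2 + sN_sq * (s1_sq + s2_sq) + s1_sq * s2_sq * (1.0 - rho**2)

s1_sq, s2_sq = 4.0, 1.0

# Large noise: the first two terms dominate, so the correlation rho between
# the outputs barely changes det(R) -- redundancy of response costs little.
big = [det_R(s1_sq, s2_sq, 100.0, r) for r in (0.0, 0.99)]

# Small noise: the third term dominates, so decorrelated outputs (rho = 0)
# give a much larger det(R) than strongly correlated ones -- diversity pays.
small = [det_R(s1_sq, s2_sq, 0.01, r) for r in (0.0, 0.99)]

print(big, small)
```

With large noise the two determinants are nearly equal; with small noise the decorrelated configuration wins by a wide margin, matching the two statements above.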
Problem 10.11
$$Y_a = S + N_a, \qquad Y_b = S + N_b$$

Hence,

$$\frac{Y_a + Y_b}{2} = S + \frac{1}{2}(N_a + N_b)$$

The mutual information between $\frac{1}{2}(Y_a + Y_b)$ and the signal component S is

$$I\left(\frac{Y_a + Y_b}{2};\,S\right) = h\left(\frac{Y_a + Y_b}{2}\right) - h\left(\frac{Y_a + Y_b}{2}\,\Big|\,S\right) \tag{1}$$

The differential entropy of $(Y_a + Y_b)/2$ is

$$h\left(\frac{Y_a + Y_b}{2}\right) = \frac{1}{2}\left[1 + \log\left(\frac{\pi}{2}\,\mathrm{var}[Y_a + Y_b]\right)\right] \tag{2}$$

The conditional differential entropy of $(Y_a + Y_b)/2$ given S is

$$h\left(\frac{Y_a + Y_b}{2}\,\Big|\,S\right) = h\left(\frac{N_a + N_b}{2}\right) = \frac{1}{2}\left[1 + \log\left(\frac{\pi}{2}\,\mathrm{var}[N_a + N_b]\right)\right] \tag{3}$$

Hence, the use of (2) and (3) in (1) yields (after the simplification of terms)

$$I\left(\frac{Y_a + Y_b}{2};\,S\right) = \frac{1}{2}\log\frac{\mathrm{var}[Y_a + Y_b]}{\mathrm{var}[N_a + N_b]}$$

(b) The signal component S is ordinarily independent of the noise components Na and Nb. Hence, with

$$Y_a + Y_b = 2S + N_a + N_b$$

it follows that

$$\mathrm{var}[Y_a + Y_b] = 4\,\mathrm{var}[S] + \mathrm{var}[N_a + N_b]$$

The ratio $\mathrm{var}[Y_a + Y_b]/\mathrm{var}[N_a + N_b]$ in the expression for the mutual information $I\left(\frac{Y_a + Y_b}{2}; S\right)$ may therefore be interpreted as a signal-plus-noise to noise ratio.
Problem 10.12
2. The output signal vector resulting from PCA has a diagonal covariance matrix. The first
principal component defines a direction in the original signal space that captures the
maximum possible variance; the second principal component defines another direction in the
remaining orthogonal subspace that captures the next maximum possible variance, and so on.
On the other hand, ICA does not find the directions of maximum variance but rather "interesting" directions, where the term "interesting" refers to deviation from Gaussianity.
Problem 10.13
Independent components analysis may be used as a preprocessing tool before signal detection and
pattern classification. In particular, through a change of coordinates resulting from the use of ICA,
the probability density function of multichannel data may be expressed as a product of marginal
densities. This change, in turn, permits density estimation with shorter observations.
Problem 10.14
$$X_i = \sum_{j=1}^N a_{ij}\,U_j, \qquad i = 1, 2, \ldots, N$$

where the $U_j$ are independent random variables. The Darmois theorem states that if the $X_i$ are independent, then the variables $U_j$ for which $a_{ij} \neq 0$ are all Gaussian.
Problem 10.15
The use of independent-components analysis results in a set of components that are as statistically
independent of each other as possible. In contrast, the use of decorrelation only addresses second-
order statistics and there is therefore no guarantee of statistical independence.
Problem 10.16
The Kullback-Leibler divergence between the joint pdf $f_\mathbf{Y}(\mathbf{y}, \mathbf{w})$ and the factorial pdf $\tilde{f}_\mathbf{Y}(\mathbf{y}, \mathbf{w})$ is the multifold integral

$$D_{f\|\tilde{f}} = \int f_\mathbf{Y}(\mathbf{y}, \mathbf{w}) \log\frac{f_\mathbf{Y}(\mathbf{y}, \mathbf{w})}{\tilde{f}_\mathbf{Y}(\mathbf{y}, \mathbf{w})}\,d\mathbf{y} \tag{1}$$

Let $\bar{\mathbf{y}}$ denote the components of $\mathbf{y}$ other than $y_i$ and $y_j$, so that

$$d\mathbf{y} = dy_i\,dy_j\,d\bar{\mathbf{y}}$$

Integrating out the components $\bar{\mathbf{y}}$ in (1), we may write

$$D_{f\|\tilde{f}} = \iint f_{Y_i,Y_j}(y_i, y_j, \mathbf{w}) \log\frac{f_{Y_i,Y_j}(y_i, y_j, \mathbf{w})}{f_{Y_i}(y_i, \mathbf{w})\,f_{Y_j}(y_j, \mathbf{w})}\,dy_i\,dy_j = I(Y_i;\,Y_j)$$

That is, the Kullback-Leibler divergence between the joint pdf $f_\mathbf{Y}(\mathbf{y}, \mathbf{w})$ and the factorial pdf $\tilde{f}_\mathbf{Y}(\mathbf{y}, \mathbf{w})$ is equal to the mutual information between the components $Y_i$ and $Y_j$ of the output vector Y for any pair (i, j).
Problem 10.18
$$\mathbf{Y} = \begin{bmatrix} y_1(0) & y_1(1) & \cdots & y_1(N-1) \\ y_2(0) & y_2(1) & \cdots & y_2(N-1) \\ \vdots & \vdots & & \vdots \\ y_m(0) & y_m(1) & \cdots & y_m(N-1) \end{bmatrix} \tag{1}$$

where m is the dimension of the output vector y(n) and N is the number of samples used in computing the matrix Y. Correspondingly, define the m-by-N matrix of activation functions

$$\boldsymbol{\Phi}(\mathbf{Y}) = \begin{bmatrix} \varphi(y_1(0)) & \varphi(y_1(1)) & \cdots & \varphi(y_1(N-1)) \\ \varphi(y_2(0)) & \varphi(y_2(1)) & \cdots & \varphi(y_2(N-1)) \\ \vdots & \vdots & & \vdots \\ \varphi(y_m(0)) & \varphi(y_m(1)) & \cdots & \varphi(y_m(N-1)) \end{bmatrix}$$

In the batch mode, we define the average weight adjustment (see Eq. (10.100) of the text)

$$\Delta\mathbf{W} = \frac{1}{N}\sum_{n=0}^{N-1}\Delta\mathbf{W}(n) = \eta\left[\mathbf{I} - \frac{1}{N}\sum_{n=0}^{N-1}\boldsymbol{\varphi}(\mathbf{y}(n))\,\mathbf{y}^T(n)\right]\mathbf{W} = \eta\left[\mathbf{I} - \frac{1}{N}\,\boldsymbol{\Phi}(\mathbf{Y})\,\mathbf{Y}^T\right]\mathbf{W}$$
Problem 10.19
(a) Let q(y) denote a pdf equal to the determinant det(J), with the elements of the Jacobian J being as defined in Eq. (10.115). Then, using Eq. (10.116), we may express the entropy of the random vector Z at the output of the nonlinearity in Fig. 10.16 of the text as

$$h(\mathbf{Z}) = -D_{f\|q}$$

Using the decomposition

$$D_{f\|q} = D_{f\|\tilde{f}} + D_{\tilde{f}\|q}$$

we may therefore write

$$h(\mathbf{Z}) = -D_{f\|\tilde{f}} - D_{\tilde{f}\|q} \tag{1}$$

(b) If $q(y_i)$ happens to equal the source pdf $f_U(y_i)$ for all i, we then find that $D_{\tilde{f}\|q} = 0$. In such a case, (1) reduces to

$$h(\mathbf{Z}) = -D_{f\|\tilde{f}}$$

That is, the entropy h(Z) is equal to the negative of the Kullback-Leibler divergence between the pdf $f_\mathbf{Y}(\mathbf{y})$ and the corresponding factorial distribution $\tilde{f}_\mathbf{Y}(\mathbf{y})$.
Problem 10.20
The objective function to be maximized may be expressed as

$$\Phi = \log\lvert\det(\mathbf{A})\rvert + \log\lvert\det(\mathbf{W})\rvert + \sum_i \log\left\lvert\frac{\partial z_i}{\partial y_i}\right\rvert$$

The matrix A of the linear mixer is fixed. Hence, differentiating with respect to W:

$$\frac{\partial\Phi}{\partial\mathbf{W}} = \mathbf{W}^{-T} + \frac{\partial}{\partial\mathbf{W}}\sum_i \log\left\lvert\frac{\partial z_i}{\partial y_i}\right\rvert \tag{1}$$

With the logistic nonlinearity

$$z_i = \frac{1}{1 + e^{-y_i}}$$

we have

$$\frac{\partial z_i}{\partial y_i} = \frac{e^{-y_i}}{(1 + e^{-y_i})^2} = z_i - z_i^2 \tag{2}$$

Hence, differentiating $\log(\partial z_i/\partial y_i)$ with respect to the demixing matrix W, we get

$$\frac{\partial}{\partial\mathbf{W}}\log\frac{\partial z_i}{\partial y_i} = \frac{\partial}{\partial\mathbf{W}}\log(z_i - z_i^2) = \frac{\partial z_i}{\partial\mathbf{W}}\,\frac{1}{z_i - z_i^2}\,(1 - 2z_i) = \frac{\partial z_i}{\partial y_i}\,\frac{\partial y_i}{\partial\mathbf{W}}\,\frac{1}{z_i - z_i^2}\,(1 - 2z_i) \tag{3}$$

Since, by (2),

$$\frac{\partial z_i}{\partial y_i}\cdot\frac{1}{z_i - z_i^2} = 1$$

it follows that

$$\frac{\partial}{\partial\mathbf{W}}\log\frac{\partial z_i}{\partial y_i} = \frac{\partial y_i}{\partial\mathbf{W}}\,(1 - 2z_i)$$

and therefore

$$\frac{\partial\Phi}{\partial\mathbf{W}} = \mathbf{W}^{-T} + \sum_i \frac{\partial y_i}{\partial\mathbf{W}}\,(1 - 2z_i)$$

Putting this relation in matrix form and recognizing that the demixer output y is equal to Wx, where x is the observation vector, we find that the adjustment applied to W is defined by

$$\Delta\mathbf{W} = \eta\,\frac{\partial\Phi}{\partial\mathbf{W}} = \eta\left(\mathbf{W}^{-T} + (\mathbf{1} - 2\mathbf{z})\,\mathbf{x}^T\right)$$
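This matrix-form adjustment can be sketched as a stochastic-gradient (Infomax/Bell-Sejnowski style) update; the implementation below is a minimal illustration in which the mixing matrix, learning rate, and Laplacian sources are all arbitrary choices, and no convergence claim is made:

```python
import numpy as np

# One-sample Infomax update: Delta_W = eta * (W^{-T} + (1 - 2z) x^T),
# with the logistic nonlinearity z_i = 1/(1 + exp(-y_i)) and y = W x.
# The mixer A, learning rate eta, and source model are illustrative.
rng = np.random.default_rng(0)

def infomax_step(W, x, eta=0.01):
    y = W @ x
    z = 1.0 / (1.0 + np.exp(-np.clip(y, -500.0, 500.0)))  # clip for safety
    dW = eta * (np.linalg.inv(W).T + np.outer(1.0 - 2.0 * z, x))
    return W + dW

W = np.eye(2)                                  # demixing matrix, initialized to I
A = np.array([[1.0, 0.5], [0.3, 1.0]])         # unknown mixer (illustrative)
s = rng.laplace(size=(2, 2000))                # super-Gaussian sources
for x in (A @ s).T:                            # stream the mixed observations
    W = infomax_step(W, x)

print(W)
```

Each step combines the anti-decay term $\eta\mathbf{W}^{-T}$, which keeps W away from singularity, with the data-driven term $\eta(\mathbf{1}-2\mathbf{z})\mathbf{x}^T$ derived above.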
CHAPTER 11
Stochastic Methods Rooted in Statistical Mechanics
Problem 11.1
By definition, we have

$$p_{ij}^{(n)} = P(X_t = j \mid X_{t-n} = i)$$

where t denotes time and n denotes the number of discrete steps. For n = 1, we have the one-step transition probability

$$p_{ij}^{(1)} = p_{ij} = P(X_t = j \mid X_{t-1} = i)$$

For n = 2, we have the two-step transition probability

$$p_{ij}^{(2)} = \sum_k p_{ik}\,p_{kj}$$

where the sum is taken over all intermediate states k visited by the system. By induction, it thus follows that

$$p_{ij}^{(n)} = \sum_k p_{ik}\,p_{kj}^{(n-1)}$$
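This induction can be checked numerically: the n-step transition matrix is simply the nth matrix power of the one-step matrix P. A small sketch with an arbitrary illustrative 3-state chain:

```python
import numpy as np

# Chapman-Kolmogorov check: p^(n) = P p^(n-1), i.e. the n-step matrix is P^n.
# The 3-state stochastic matrix below is an arbitrary illustrative example.
P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.3, 0.5],
              [0.0, 0.4, 0.6]])

P2 = P @ P          # two-step transition probabilities
P3 = P @ P2         # the induction step: p^(3) = P p^(2)

assert np.allclose(P3, np.linalg.matrix_power(P, 3))
assert np.allclose(P2.sum(axis=1), 1.0)   # each row still sums to one
print(P2)
```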
Problem 11.2
For p > 0, the state transition diagram for the random walk process shown in Fig. P11.2 of the text is irreducible. The reason for saying so is that the system has only one class, namely {0, ±1, ±2, ...}.
Problem 11.3
The state transition diagram of Fig. P11.3 in the text pertains to a Markov chain with two classes:
{x1} and {x1, x2}.
Problem 11.4
The stochastic matrix of the Markov chain in Fig. P11.4 of the text is given by

$$\mathbf{P} = \begin{bmatrix} \tfrac{3}{4} & \tfrac{1}{4} & 0 \\ 0 & \tfrac{2}{3} & \tfrac{1}{3} \\ \tfrac{1}{4} & \tfrac{3}{4} & 0 \end{bmatrix}$$

Let $\pi_1$, $\pi_2$, and $\pi_3$ denote the steady-state probabilities of this chain. We may then write (see Eq. (11.27) of the text)

$$\pi_1 = \pi_1\left(\tfrac{3}{4}\right) + \pi_2(0) + \pi_3\left(\tfrac{1}{4}\right)$$

$$\pi_3 = \pi_1(0) + \pi_2\left(\tfrac{1}{3}\right) + \pi_3(0)$$

That is,

$$\pi_1 = \pi_3, \qquad \pi_2 = 3\pi_3$$

Invoking the normalization condition

$$\pi_1 + \pi_2 + \pi_3 = 1$$

we get

$$\pi_3 + 3\pi_3 + \pi_3 = 1$$

or equivalently

$$\pi_3 = \frac{1}{5}$$

and so

$$\pi_1 = \frac{1}{5}, \qquad \pi_2 = \frac{3}{5}$$
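The steady-state probabilities obtained above can be confirmed by iterating the distribution update π ← πP to convergence; a short numpy check:

```python
import numpy as np

# Check of the steady-state probabilities pi = (1/5, 3/5, 1/5) for the
# stochastic matrix of Fig. P11.4 by repeated application of pi <- pi P.
P = np.array([[3/4, 1/4, 0.0],
              [0.0, 2/3, 1/3],
              [1/4, 3/4, 0.0]])

pi = np.array([1/3, 1/3, 1/3])   # any initial probability distribution
for _ in range(500):
    pi = pi @ P

assert np.allclose(pi, [0.2, 0.6, 0.2])
print(pi)
```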
Problem 11.6
The Metropolis algorithm and the Gibbs sampler are similar in that they both generate a Markov
chain with the Gibbs distribution as the equilibrium distribution.
They differ from each other in the following respect: In the Metropolis algorithm, the
transition probabilities of the Markov chain are stationary. In contrast, in the Gibbs sampler, they
are nonstationary.
Problem 11.7
Problem 11.8
(a) We start with the notion that a neuron j flips from state $x_j$ to $-x_j$ at temperature T with probability

$$P(x_j \to -x_j) = \frac{1}{1 + \exp(\Delta E_j / T)} \tag{1}$$

where $\Delta E_j$ is the energy change resulting from such a flip. The energy function of the Boltzmann machine is defined by

$$E = -\frac{1}{2}\sum_i\sum_{\substack{j \\ i \neq j}} w_{ji}\,x_i\,x_j$$

Hence, the energy change produced by neuron j flipping from state $x_j$ to $-x_j$ is

$$\Delta E_j = x_j\sum_i w_{ji}\,x_i - \left(-x_j\sum_i w_{ji}\,x_i\right) = 2x_j\sum_i w_{ji}\,x_i = 2x_j v_j \tag{2}$$

where $v_j$ is the induced local field of neuron j. Hence,

$$P(x_j \to -x_j) = \frac{1}{1 + \exp(2x_j v_j / T)}$$

(b) This means that for an initial state $x_j = -1$, the probability that neuron j is flipped into state +1 is

$$\frac{1}{1 + \exp(-2v_j/T)} \tag{3}$$

(c) For an initial state of $x_j = +1$, the probability that neuron j is flipped into state -1 is

$$\frac{1}{1 + \exp(+2v_j/T)} = 1 - \frac{1}{1 + \exp(-2v_j/T)} \tag{4}$$

The flipping probability in (4) and the one in (3) are in perfect agreement with the following probabilistic rule:

$$x_j = \begin{cases} +1 & \text{with probability } P(v_j) \\ -1 & \text{with probability } 1 - P(v_j) \end{cases}$$

where $P(v_j)$ is itself defined by

$$P(v_j) = \frac{1}{1 + \exp(-2v_j/T)}$$
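The complementarity of the flipping probabilities (3) and (4) can be checked numerically; the local field and temperature below are illustrative values:

```python
import numpy as np

# Check that the two flipping probabilities are complementary,
# P(-1 -> +1) + P(+1 -> -1) = 1, and that both follow from the single
# formula P(x_j -> -x_j) = 1/(1 + exp(2 x_j v_j / T)).
def flip_prob(x_j, v_j, T):
    return 1.0 / (1.0 + np.exp(2.0 * x_j * v_j / T))

v_j, T = 0.8, 1.5                          # illustrative values
p_up = flip_prob(-1.0, v_j, T)             # Eq. (3): flip -1 -> +1
p_down = flip_prob(+1.0, v_j, T)           # Eq. (4): flip +1 -> -1
P = 1.0 / (1.0 + np.exp(-2.0 * v_j / T))   # P(v_j) in the probabilistic rule

assert abs(p_up - P) < 1e-12
assert abs(p_up + p_down - 1.0) < 1e-12
print(p_up, p_down)
```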
Problem 11.9
The log-likelihood of the clamped visible states $\mathbf{x}_\alpha$ is

$$L(\mathbf{w}) = \log\sum_{\mathbf{x}_\beta}\exp\left(-\frac{E(\mathbf{x})}{T}\right) - \log\sum_{\mathbf{x}}\exp\left(-\frac{E(\mathbf{x})}{T}\right)$$

where the first sum runs over the states $\mathbf{x}_\beta$ of the hidden neurons (with the visible neurons clamped to $\mathbf{x}_\alpha$) and the second sum runs over all states x. Differentiating with respect to $w_{ji}$:

$$\frac{\partial L(\mathbf{w})}{\partial w_{ji}} = -\frac{1}{T}\left[\sum_{\mathbf{x}_\beta}\frac{\exp(-E(\mathbf{x})/T)}{\sum_{\mathbf{x}_\beta}\exp(-E(\mathbf{x})/T)}\,\frac{\partial E(\mathbf{x})}{\partial w_{ji}} - \sum_{\mathbf{x}}\frac{\exp(-E(\mathbf{x})/T)}{\sum_{\mathbf{x}}\exp(-E(\mathbf{x})/T)}\,\frac{\partial E(\mathbf{x})}{\partial w_{ji}}\right]$$

The energy function is

$$E(\mathbf{x}) = -\frac{1}{2}\sum_i\sum_{\substack{j \\ i \neq j}} w_{ji}\,x_i\,x_j$$

Hence,

$$\frac{\partial E(\mathbf{x})}{\partial w_{ji}} = -x_i\,x_j, \qquad i \neq j \tag{1}$$

Moreover, the conditional and unconditional Gibbs distributions are

$$P(\mathbf{X} = \mathbf{x} \mid \mathbf{X}_\alpha = \mathbf{x}_\alpha) = \frac{\exp(-E(\mathbf{x})/T)}{\sum_{\mathbf{x}_\beta}\exp(-E(\mathbf{x})/T)} \tag{2}$$

$$P(\mathbf{X} = \mathbf{x}) = \frac{\exp(-E(\mathbf{x})/T)}{\sum_{\mathbf{x}}\exp(-E(\mathbf{x})/T)} \tag{3}$$

Accordingly, using the formulas of (1) to (3), we may redefine the derivative $\partial L(\mathbf{w})/\partial w_{ji}$ as follows:

$$\frac{\partial L(\mathbf{w})}{\partial w_{ji}} = \frac{1}{T}\left[\sum_{\mathbf{x}_\beta} P(\mathbf{X} = \mathbf{x} \mid \mathbf{X}_\alpha = \mathbf{x}_\alpha)\,x_j\,x_i - \sum_{\mathbf{x}} P(\mathbf{X} = \mathbf{x})\,x_j\,x_i\right]$$
Problem 11.10
(a) Factoring the transition process from state i to state j into a two-step process, we may express the transition probability $p_{ji}$ as

$$p_{ji} = \tau_{ji}\,q_{ji} \qquad \text{for } j \neq i \tag{1}$$

where $\tau_{ji}$ is the probability that a transition from state j to state i is attempted, and $q_{ji}$ is the conditional probability that the attempt is successful, given that it was attempted. When j = i, the property that each row of the stochastic matrix must add to unity implies that

$$p_{ii} = 1 - \sum_{j \neq i} p_{ij} = 1 - \sum_{j \neq i}\tau_{ij}\,q_{ij}$$

The attempt probabilities are assumed to be symmetric and normalized:

$$\tau_{ji} = \tau_{ij}, \qquad \sum_{j \neq i}\tau_{ji} = 1 \tag{2}$$

and the acceptance probabilities are assumed to satisfy

$$q_{ji} = 1 - q_{ij} \tag{3}$$

(b) The steady-state probabilities $\pi_i$ satisfy the balance condition

$$\pi_i = \sum_j \pi_j\,p_{ji} \tag{4}$$

Hence, using (1) to (3) in (4), together with the expression for $p_{ii}$:

$$\pi_i = \pi_i\left(1 - \sum_{j \neq i}\tau_{ij}\,q_{ij}\right) + \sum_{j \neq i}\pi_j\,\tau_{ji}(1 - q_{ij}) \tag{5}$$

Invoking the normalization of the $\tau_{ij}$, we may also write

$$\pi_i = \pi_i\sum_{j \neq i}\tau_{ij} \tag{6}$$

Hence, combining (5) and (6), using the symmetry property of (2), and then rearranging terms, a sufficient condition for stationarity is

$$\pi_i\,q_{ij} + \pi_j\,q_{ij} - \pi_j = 0 \tag{7}$$

that is,

$$q_{ij} = \frac{1}{1 + (\pi_i/\pi_j)} \tag{8}$$

(c) Suppose the steady-state probabilities are expressed in exponential form by writing

$$E_i = -T\log\pi_i + T^*$$

that is,

$$\pi_i = \frac{1}{Z}\exp\left(-\frac{E_i}{T}\right)$$

where

$$Z = \exp\left(-\frac{T^*}{T}\right)$$

Then, substituting into (8):

$$q_{ij} = \frac{1}{1 + \exp\left(-\frac{1}{T}(E_i - E_j)\right)} = \frac{1}{1 + \exp(-\Delta E / T)}, \qquad \Delta E = E_i - E_j \tag{9}$$

(d) Imposing the normalization condition

$$\sum_i \pi_i = 1$$

we therefore have

$$Z = \sum_i \exp(-E_i/T)$$
Problem 11.11
The Kullback-Leibler divergence between the clamped and free-running distributions of the visible neurons is (with α ranging over the states of the visible neurons)

$$D_{p^+\|p^-} = \sum_\alpha p_\alpha^+ \log\frac{p_\alpha^+}{p_\alpha^-} \tag{1}$$

The probability distribution $p_\alpha^+$ in the clamped condition is naturally independent of the synaptic weights $w_{ji}$ in the Boltzmann machine, whereas the probability distribution $p_\alpha^-$ is dependent on $w_{ji}$. Hence, differentiating (1) with respect to $w_{ji}$:

$$\frac{\partial D_{p^+\|p^-}}{\partial w_{ji}} = -\sum_\alpha\frac{p_\alpha^+}{p_\alpha^-}\,\frac{\partial p_\alpha^-}{\partial w_{ji}} \tag{2}$$

The weight adjustment is taken proportional to the negative of this derivative:

$$\Delta w_{ji} = -\epsilon\,\frac{\partial D_{p^+\|p^-}}{\partial w_{ji}} = \epsilon\sum_\alpha\frac{p_\alpha^+}{p_\alpha^-}\,\frac{\partial p_\alpha^-}{\partial w_{ji}} \tag{3}$$

Let $p_{\alpha\beta}^-$ denote the joint probability that the visible neurons are in state α and the hidden neurons are in state β, given that the network is in its free-running condition. We may then write

$$p_\alpha^- = \sum_\beta p_{\alpha\beta}^-$$

Assuming that the network is in thermal equilibrium, we may use the Gibbs distribution

$$p_{\alpha\beta}^- = \frac{1}{Z}\exp\left(-\frac{E_{\alpha\beta}}{T}\right)$$

to write

$$p_\alpha^- = \frac{1}{Z}\sum_\beta\exp\left(-\frac{E_{\alpha\beta}}{T}\right) \tag{4}$$

where $E_{\alpha\beta}$ is the energy of the network when the visible neurons are in state α and the hidden neurons are in state β. The partition function Z is itself defined by

$$Z = \sum_\alpha\sum_\beta\exp\left(-\frac{E_{\alpha\beta}}{T}\right)$$

The energy is

$$E_{\alpha\beta} = -\frac{1}{2}\sum_i\sum_{\substack{j \\ i \neq j}} w_{ji}\,x_j^{\alpha\beta}\,x_i^{\alpha\beta} \tag{5}$$

where $x_i^{\alpha\beta}$ is the state of neuron i when the visible neurons are in state α and the hidden neurons are in state β. Therefore, using (4):

$$\frac{\partial p_\alpha^-}{\partial w_{ji}} = -\frac{1}{ZT}\sum_\beta\exp\left(-\frac{E_{\alpha\beta}}{T}\right)\frac{\partial E_{\alpha\beta}}{\partial w_{ji}} - \frac{1}{Z^2}\,\frac{\partial Z}{\partial w_{ji}}\sum_\beta\exp\left(-\frac{E_{\alpha\beta}}{T}\right) \tag{6}$$

From (5),

$$\frac{\partial E_{\alpha\beta}}{\partial w_{ji}} = -x_j^{\alpha\beta}\,x_i^{\alpha\beta} \tag{7}$$

Hence, the first term on the right-hand side of (6) may be written as

$$-\frac{1}{ZT}\sum_\beta\exp\left(-\frac{E_{\alpha\beta}}{T}\right)\frac{\partial E_{\alpha\beta}}{\partial w_{ji}} = +\frac{1}{ZT}\sum_\beta\exp\left(-\frac{E_{\alpha\beta}}{T}\right)x_j^{\alpha\beta}x_i^{\alpha\beta} = \frac{1}{T}\sum_\beta p_{\alpha\beta}^-\,x_j^{\alpha\beta}x_i^{\alpha\beta}$$

Consider next the second term on the right-hand side of (6). Except for the minus sign, we may express this term as the product of two factors:

$$\frac{1}{Z^2}\,\frac{\partial Z}{\partial w_{ji}}\sum_\beta\exp\left(-\frac{E_{\alpha\beta}}{T}\right) = \left[\frac{1}{Z}\sum_\beta\exp\left(-\frac{E_{\alpha\beta}}{T}\right)\right]\left[\frac{1}{Z}\,\frac{\partial Z}{\partial w_{ji}}\right] \tag{8}$$

The first factor in (8) is recognized as the distribution

$$p_\alpha^- = \frac{1}{Z}\sum_\beta\exp\left(-\frac{E_{\alpha\beta}}{T}\right) \tag{9}$$

To evaluate the second factor in (8), we write

$$\frac{1}{Z}\,\frac{\partial Z}{\partial w_{ji}} = -\frac{1}{TZ}\sum_\alpha\sum_\beta\exp\left(-\frac{E_{\alpha\beta}}{T}\right)\frac{\partial E_{\alpha\beta}}{\partial w_{ji}} = \frac{1}{T}\sum_\alpha\sum_\beta p_{\alpha\beta}^-\,x_j^{\alpha\beta}x_i^{\alpha\beta} \tag{10}$$

Hence,

$$\frac{1}{Z^2}\,\frac{\partial Z}{\partial w_{ji}}\sum_\beta\exp\left(-\frac{E_{\alpha\beta}}{T}\right) = \frac{p_\alpha^-}{T}\sum_\alpha\sum_\beta p_{\alpha\beta}^-\,x_j^{\alpha\beta}x_i^{\alpha\beta} \tag{11}$$

Combining these results, (6) takes the form

$$\frac{\partial p_\alpha^-}{\partial w_{ji}} = \frac{1}{T}\sum_\beta p_{\alpha\beta}^-\,x_j^{\alpha\beta}x_i^{\alpha\beta} - \frac{p_\alpha^-}{T}\sum_\alpha\sum_\beta p_{\alpha\beta}^-\,x_j^{\alpha\beta}x_i^{\alpha\beta}$$

We now invoke the following three facts:

1. The sum of the probability $p_\alpha^+$ over the states α is unity, that is,

$$\sum_\alpha p_\alpha^+ = 1 \tag{12}$$

2. The joint probabilities factor into conditional and marginal probabilities:

$$p_{\alpha\beta}^- = p_{\beta\mid\alpha}^-\,p_\alpha^- \tag{13}$$

$$p_{\alpha\beta}^+ = p_{\beta\mid\alpha}^+\,p_\alpha^+ \tag{14}$$

3. The probability of a hidden state, given some visible state, is naturally the same whether the visible neurons of the network in thermal equilibrium are clamped in that state by the external environment or arrive at that state by free running of the network, as shown by

$$p_{\beta\mid\alpha}^- = p_{\beta\mid\alpha}^+ \tag{15}$$

In light of this relation, we may rewrite Eq. (13) as

$$p_{\alpha\beta}^- = p_{\beta\mid\alpha}^+\,p_\alpha^- \tag{16}$$

Moreover, we may write

$$\frac{p_\alpha^+}{p_\alpha^-}\,p_{\alpha\beta}^- = p_{\beta\mid\alpha}^+\,p_\alpha^+ = p_{\alpha\beta}^+ \tag{17}$$

Accordingly, substituting the expression for $\partial p_\alpha^-/\partial w_{ji}$ into (3) and using (12) and (17):

$$\Delta w_{ji} = \frac{\epsilon}{T}\left[\sum_\alpha\sum_\beta\frac{p_\alpha^+}{p_\alpha^-}\,p_{\alpha\beta}^-\,x_j^{\alpha\beta}x_i^{\alpha\beta} - \sum_\alpha\sum_\beta p_{\alpha\beta}^-\,x_j^{\alpha\beta}x_i^{\alpha\beta}\right] = \frac{\epsilon}{T}\left[\sum_\alpha\sum_\beta p_{\alpha\beta}^+\,x_j^{\alpha\beta}x_i^{\alpha\beta} - \sum_\alpha\sum_\beta p_{\alpha\beta}^-\,x_j^{\alpha\beta}x_i^{\alpha\beta}\right]$$

Defining the mean correlations in the clamped and free-running conditions,

$$\rho_{ji}^+ = \langle x_j x_i\rangle^+ = \sum_\alpha\sum_\beta p_{\alpha\beta}^+\,x_j^{\alpha\beta}x_i^{\alpha\beta}$$

$$\rho_{ji}^- = \langle x_j x_i\rangle^- = \sum_\alpha\sum_\beta p_{\alpha\beta}^-\,x_j^{\alpha\beta}x_i^{\alpha\beta}$$

we finally obtain the Boltzmann learning rule

$$\Delta w_{ji} = \eta\,(\rho_{ji}^+ - \rho_{ji}^-), \qquad \eta = \frac{\epsilon}{T}$$
Problem 11.12
(a) Here α ranges over the states of the input (clamped) neurons and γ over the states of the output neurons. The Kullback-Leibler divergence is

$$D_{p^+\|p^-} = \sum_\alpha\sum_\gamma p_{\alpha\gamma}^+ \log\frac{p_{\alpha\gamma}^+}{p_{\alpha\gamma}^-} \tag{1}$$

The joint probabilities factor as

$$p_{\alpha\gamma}^+ = p_{\gamma\mid\alpha}^+\,p_\alpha^+ \tag{2}$$

$$p_{\alpha\gamma}^- = p_{\gamma\mid\alpha}^-\,p_\alpha^- = p_{\gamma\mid\alpha}^-\,p_\alpha^+ \tag{3}$$

where, in the last line, we have made use of the fact that the input neurons are always clamped to the environment, which means that

$$p_\alpha^- = p_\alpha^+$$

Hence, (1) may be rewritten as

$$D_{p^+\|p^-} = \sum_\alpha p_\alpha^+\sum_\gamma p_{\gamma\mid\alpha}^+ \log\frac{p_{\gamma\mid\alpha}^+}{p_{\gamma\mid\alpha}^-} \tag{4}$$

where the state α refers to the input neurons and γ refers to the output neurons.

(b) With $p_{\gamma\beta\mid\alpha}^-$ denoting the conditional probability of finding the output neurons in state γ and the hidden neurons in state β, given that the input neurons are in state α, we may express the probability distribution of the output states as

$$p_{\gamma\mid\alpha}^- = \sum_\beta p_{\gamma\beta\mid\alpha}^-$$

The conditional probability $p_{\gamma\beta\mid\alpha}^-$ is determined by the synaptic weights of the network in accordance with the formula

$$p_{\gamma\beta\mid\alpha}^- = \frac{1}{Z_\alpha}\exp\left(-\frac{E_{\alpha\gamma\beta}}{T}\right) \tag{5}$$

where

$$E_{\alpha\gamma\beta} = -\frac{1}{2}\sum_j\sum_{\substack{i \\ i \neq j}} w_{ji}\,[s_j s_i]_{\alpha\gamma\beta} \tag{6}$$

The parameter $Z_\alpha$ is the partition function:

$$Z_\alpha = \sum_\gamma\sum_\beta\exp\left(-\frac{E_{\alpha\gamma\beta}}{T}\right) \tag{7}$$

The function of the Boltzmann machine is to find the synaptic weights for which the conditional probability $p_{\gamma\mid\alpha}^-$ approaches the desired value $p_{\gamma\mid\alpha}^+$. The weight adjustment is

$$\Delta w_{ji} = -\epsilon\,\frac{\partial D_{p^+\|p^-}}{\partial w_{ji}} \tag{8}$$

Using (4) in (8) and recognizing that $p_\alpha^+$ is determined by the environment (i.e., it is independent of the network), we get

$$\Delta w_{ji} = \epsilon\sum_\alpha p_\alpha^+\sum_\gamma\frac{p_{\gamma\mid\alpha}^+}{p_{\gamma\mid\alpha}^-}\,\frac{\partial p_{\gamma\mid\alpha}^-}{\partial w_{ji}} \tag{9}$$

To evaluate the partial derivative $\partial p_{\gamma\mid\alpha}^-/\partial w_{ji}$, we use (5) to (7):

$$\frac{\partial p_{\gamma\mid\alpha}^-}{\partial w_{ji}} = \sum_\beta\left[\frac{1}{Z_\alpha T}\,[s_j s_i]_{\alpha\gamma\beta}\exp\left(-\frac{E_{\alpha\gamma\beta}}{T}\right) - \frac{1}{Z_\alpha^2}\,\frac{\partial Z_\alpha}{\partial w_{ji}}\exp\left(-\frac{E_{\alpha\gamma\beta}}{T}\right)\right] \tag{10}$$

We note that

$$\frac{1}{Z_\alpha}\exp\left(-\frac{E_{\alpha\gamma\beta}}{T}\right) = p_{\gamma\beta\mid\alpha}^- \tag{11}$$

and, differentiating (7),

$$\frac{1}{Z_\alpha}\,\frac{\partial Z_\alpha}{\partial w_{ji}} = \frac{1}{T}\sum_\gamma\sum_\beta [s_j s_i]_{\alpha\gamma\beta}\,p_{\gamma\beta\mid\alpha}^- = \frac{1}{T}\,\langle s_j s_i\rangle_\alpha^- \tag{12}$$

where $\langle s_j s_i\rangle_\alpha^-$ is the averaged correlation of the states $s_j$ and $s_i$ with the input neurons clamped to state α and the network in a free-running condition. Substituting (11) and (12) into (10):

$$\frac{\partial p_{\gamma\mid\alpha}^-}{\partial w_{ji}} = \frac{1}{T}\left[\sum_\beta [s_j s_i]_{\alpha\gamma\beta}\,p_{\gamma\beta\mid\alpha}^- - p_{\gamma\mid\alpha}^-\,\langle s_j s_i\rangle_\alpha^-\right] \tag{13}$$

Hence, using (13) in (9):

$$\Delta w_{ji} = \frac{\epsilon}{T}\sum_\alpha p_\alpha^+\sum_\gamma\left[\frac{p_{\gamma\mid\alpha}^+}{p_{\gamma\mid\alpha}^-}\sum_\beta [s_j s_i]_{\alpha\gamma\beta}\,p_{\gamma\beta\mid\alpha}^- - p_{\gamma\mid\alpha}^+\,\langle s_j s_i\rangle_\alpha^-\right] \tag{14}$$

We now note that

$$\sum_\gamma p_{\gamma\mid\alpha}^+ = 1 \qquad \text{for all } \alpha \tag{15}$$

and, invoking the fact that the conditional distribution of the hidden neurons, given the clamped neurons, is the same in the clamped and free-running conditions,

$$\sum_\gamma\frac{p_{\gamma\mid\alpha}^+}{p_{\gamma\mid\alpha}^-}\sum_\beta [s_j s_i]_{\alpha\gamma\beta}\,p_{\gamma\beta\mid\alpha}^- = \sum_\gamma\sum_\beta [s_j s_i]_{\alpha\gamma\beta}\,p_{\gamma\beta\mid\alpha}^+ = \langle s_j s_i\rangle_\alpha^+ \tag{16}$$

Accordingly, (14) reduces to

$$\Delta w_{ji} = \frac{\epsilon}{T}\sum_\alpha p_\alpha^+\left(\langle s_j s_i\rangle_\alpha^+ - \langle s_j s_i\rangle_\alpha^-\right) = \eta\sum_\alpha p_\alpha^+\left(\rho_{ji,\alpha}^+ - \rho_{ji,\alpha}^-\right)$$

where $\eta = \epsilon/T$; and $\rho_{ji,\alpha}^+$ and $\rho_{ji,\alpha}^-$ are the averaged correlations in the clamped and free-running conditions, given that the input neurons are in state α.
Problem 11.15
The expected distortion is

$$E = \sum_\mathbf{x}\sum_j P(\mathbf{x} \in C_j)\,d(\mathbf{x}, \mathbf{y}_j) \tag{1}$$

where $d(\mathbf{x}, \mathbf{y}_j)$ is the distortion measure for representing the data point x by the vector $\mathbf{y}_j$, and $P(\mathbf{x} \in C_j)$ is the probability that x belongs to the cluster of points represented by $\mathbf{y}_j$. To determine the association probabilities at a given expected distortion, we maximize the entropy subject to the constraint of (1). For a fixed Y = {y_j}, we assume that the association probabilities of different data points are independent. We may thus express the entropy as

$$H = -\sum_\mathbf{x}\sum_j P(\mathbf{x} \in C_j)\log P(\mathbf{x} \in C_j) \tag{2}$$

The probability distribution that maximizes the entropy under the expectation constraint is the Gibbs distribution:

$$P(\mathbf{x} \in C_j) = \frac{1}{Z_\mathbf{x}}\exp\left(-\frac{1}{T}\,d(\mathbf{x}, \mathbf{y}_j)\right)$$

where

$$Z_\mathbf{x} = \sum_j\exp\left(-\frac{1}{T}\,d(\mathbf{x}, \mathbf{y}_j)\right)$$

is the partition function. The inverse temperature β = 1/T is the Lagrange multiplier determined by the value of E in (1).
Problem 11.16

The Lagrangian (free energy) is

$$F = D - TH \tag{1}$$

where D is the expected distortion, T is the temperature, and H is the conditional entropy. The expected distortion is defined by

$$D = \sum_\mathbf{x} P(X = \mathbf{x})\sum_\mathbf{y} P(Y = \mathbf{y} \mid X = \mathbf{x})\,d(\mathbf{x}, \mathbf{y}) \tag{2}$$

and the conditional entropy by

$$H = -\sum_\mathbf{x} P(X = \mathbf{x})\sum_\mathbf{y} P(Y = \mathbf{y} \mid X = \mathbf{x})\log P(Y = \mathbf{y} \mid X = \mathbf{x}) \tag{3}$$

(a) Minimizing F with respect to the association probabilities yields the Gibbs distribution

$$P(Y = \mathbf{y} \mid X = \mathbf{x}) = \frac{1}{Z_\mathbf{x}}\exp\left(-\frac{d(\mathbf{x}, \mathbf{y})}{T}\right) \tag{4}$$

where

$$Z_\mathbf{x} = \sum_\mathbf{y}\exp\left(-\frac{d(\mathbf{x}, \mathbf{y})}{T}\right) \tag{5}$$

Substituting (4) into (1), the minimum free energy is

$$F^* = \sum_\mathbf{x} P(X = \mathbf{x})\sum_\mathbf{y}\frac{1}{Z_\mathbf{x}}\exp\left(-\frac{d(\mathbf{x}, \mathbf{y})}{T}\right)d(\mathbf{x}, \mathbf{y}) + T\sum_\mathbf{x} P(X = \mathbf{x})\sum_\mathbf{y}\frac{1}{Z_\mathbf{x}}\exp\left(-\frac{d(\mathbf{x}, \mathbf{y})}{T}\right)\left[-\log Z_\mathbf{x} - \frac{d(\mathbf{x}, \mathbf{y})}{T}\right]$$

$$= -T\sum_\mathbf{x} P(X = \mathbf{x})\sum_\mathbf{y}\frac{1}{Z_\mathbf{x}}\exp\left(-\frac{d(\mathbf{x}, \mathbf{y})}{T}\right)\log Z_\mathbf{x}$$

This result simplifies as follows by virtue of the definition given in (5) for the partition function:

$$F^* = -T\sum_\mathbf{x} P(X = \mathbf{x})\log Z_\mathbf{x} \tag{6}$$

(b) Differentiating the minimum free energy F* of (6) with respect to y:

$$\frac{\partial F^*}{\partial\mathbf{y}} = -T\sum_\mathbf{x} P(X = \mathbf{x})\,\frac{1}{Z_\mathbf{x}}\,\frac{\partial Z_\mathbf{x}}{\partial\mathbf{y}} \tag{7}$$

From (5),

$$\frac{\partial Z_\mathbf{x}}{\partial\mathbf{y}} = -\frac{1}{T}\exp\left(-\frac{d(\mathbf{x}, \mathbf{y})}{T}\right)\frac{\partial d(\mathbf{x}, \mathbf{y})}{\partial\mathbf{y}} \tag{8}$$

Hence, we may rewrite (7) as

$$\frac{\partial F^*}{\partial\mathbf{y}} = \sum_\mathbf{x} P(X = \mathbf{x})\,\frac{1}{Z_\mathbf{x}}\exp\left(-\frac{d(\mathbf{x}, \mathbf{y})}{T}\right)\frac{\partial d(\mathbf{x}, \mathbf{y})}{\partial\mathbf{y}} = \sum_\mathbf{x} P(X = \mathbf{x})\,P(Y = \mathbf{y} \mid X = \mathbf{x})\,\frac{\partial d(\mathbf{x}, \mathbf{y})}{\partial\mathbf{y}} \tag{9}$$

Setting this derivative to zero, we may then state that the condition for minimizing the Lagrangian with respect to y is

$$\sum_\mathbf{x} P(X = \mathbf{x}, Y = \mathbf{y})\,\frac{\partial d(\mathbf{x}, \mathbf{y})}{\partial\mathbf{y}} = \mathbf{0} \qquad \text{for all } \mathbf{y} \tag{10}$$

Normalizing this result with respect to P(X = x), we get the minimizing condition

$$\sum_\mathbf{x} P(Y = \mathbf{y} \mid X = \mathbf{x})\,\frac{\partial d(\mathbf{x}, \mathbf{y})}{\partial\mathbf{y}} = \mathbf{0} \qquad \text{for all } \mathbf{y} \tag{11}$$

For the squared Euclidean distortion

$$d(\mathbf{x}, \mathbf{y}) = \|\mathbf{x} - \mathbf{y}\|^2 = (\mathbf{x} - \mathbf{y})^T(\mathbf{x} - \mathbf{y})$$

we have

$$\frac{\partial d(\mathbf{x}, \mathbf{y})}{\partial\mathbf{y}} = \frac{\partial}{\partial\mathbf{y}}(\mathbf{x} - \mathbf{y})^T(\mathbf{x} - \mathbf{y}) = -2(\mathbf{x} - \mathbf{y}) \tag{12}$$

For this particular measure we find it more convenient to normalize (10) with respect to the probability P(Y = y), obtaining

$$\sum_\mathbf{x} P(X = \mathbf{x} \mid Y = \mathbf{y})\,\frac{\partial d(\mathbf{x}, \mathbf{y})}{\partial\mathbf{y}} = \mathbf{0} \tag{13}$$

Using (12) in (13) and solving for y, we get the desired minimizing solution

$$\mathbf{y} = \sum_\mathbf{x} P(X = \mathbf{x} \mid Y = \mathbf{y})\,\mathbf{x}$$
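The pair of conditions just derived — Gibbs association probabilities followed by the centroid condition for squared distortion — amounts to one deterministic-annealing iteration. A minimal one-dimensional sketch (the data, temperature, and initial code vectors are all illustrative choices):

```python
import numpy as np

# One deterministic-annealing step: Gibbs association probabilities P(y|x)
# at temperature T, then the centroid update y = sum_x P(x|y) x.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 0.3, 50), rng.normal(2, 0.3, 50)])
y = np.array([-1.0, 1.0])          # two code vectors (illustrative start)
T = 0.5

d = (x[:, None] - y[None, :])**2            # squared distortion d(x, y)
G = np.exp(-d / T)
P_y_given_x = G / G.sum(axis=1, keepdims=True)   # Gibbs distribution / Z_x

# Centroid condition for squared distortion (P(x) uniform over the sample):
w = P_y_given_x.sum(axis=0)
y_new = (P_y_given_x * x[:, None]).sum(axis=0) / w

print(y_new)   # each code vector moves toward the mean of its cluster
```

At this low temperature the associations are nearly hard, so each updated code vector lands near the mean of one of the two data clusters.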
Problem 11.17
The advantage of deterministic annealing over maximum likelihood is that it does not make any
assumption on the underlying probability distribution of the data.
Problem 11.18
(a) Let

$$\varphi_k(\mathbf{x}) = \exp\left(-\frac{1}{2\sigma^2}\|\mathbf{x} - \mathbf{t}_k\|^2\right), \qquad k = 1, 2, \ldots, K$$

where $\mathbf{t}_k$ is the center or prototype vector of the kth radial basis function and K is the number of such functions (i.e., hidden units). Define the normalized radial basis function

$$P_k(\mathbf{x}) = \frac{\varphi_k(\mathbf{x})}{\sum_{k'}\varphi_{k'}(\mathbf{x})}$$

(b) Define the average squared cost

$$\langle d\rangle = \frac{1}{N}\sum_{i=1}^N\|\mathbf{y}_i - \mathbf{F}(\mathbf{x}_i)\|^2 \tag{1}$$

where $\mathbf{F}(\mathbf{x}_i)$ is the output vector of the RBF network in response to the input $\mathbf{x}_i$. The Gibbs distribution over the responses takes the form $\exp(-\langle d\rangle/T)/Z_\mathbf{x}$, with the partition function

$$Z_\mathbf{x} = \sum_{\mathbf{y}_i}\exp\left(-\frac{\langle d\rangle}{T}\right) \tag{3}$$

The free energy to be minimized is

$$F = \langle d\rangle - TH$$

where the average squared cost $\langle d\rangle$ is defined in (1), and the entropy H is defined by

$$H = -\sum_\mathbf{x}\sum_j p(j \mid \mathbf{x})\log p(j \mid \mathbf{x})$$

where $p(j \mid \mathbf{x})$ is the probability of associating class j at the output of the RBF network with the input x.
CHAPTER 12
Dynamic Programming
Problem 12.1
As the discount factor γ approaches 1, the computation of the cost-to-go function $J^\pi(i)$ becomes longer because of the corresponding increase in the number of time steps involved in the computation.
Problem 12.2
(a) Let π be an arbitrary policy, and suppose that this policy chooses action $a \in A_i$ at time step 0 with probability $p_a$. We may then write

$$J^\pi(i) = \sum_{a \in A_i} p_a\left[c(i, a) + \gamma\sum_{j=1}^N p_{ij}(a)\,W^\pi(j)\right]$$

where c(i, a) is the expected cost, $p_{ij}(a)$ is the probability of transition from state i to state j under action a, $W^\pi(j)$ is the expected cost-to-go from time step n = 1 onward, and j is the state at that time step. We now note that

$$W^\pi(j) \geq J^*(j)$$

which follows from the observation that if the state at time step n = 1 is j, then the situation at that time step is the same as if the process had started in state j, with the exception that all the returns are multiplied by the discount factor γ. Hence, we have

$$J^\pi(i) \geq \sum_{a \in A_i} p_a\left[c(i, a) + \gamma\sum_j p_{ij}(a)\,J^*(j)\right] \geq \sum_{a \in A_i} p_a\,\min_{a \in A_i}\left[c(i, a) + \gamma\sum_j p_{ij}(a)\,J^*(j)\right] = \min_{a \in A_i}\left[c(i, a) + \gamma\sum_j p_{ij}(a)\,J^*(j)\right]$$

Since π is arbitrary, it follows that

$$J^*(i) \geq \min_{a \in A_i}\left[c(i, a) + \gamma\sum_j p_{ij}(a)\,J^*(j)\right] \tag{1}$$

(b) Suppose we next go the other way by choosing an action $a_0$ with

$$c(i, a_0) + \gamma\sum_j p_{ij}(a_0)\,J^*(j) = \min_{a \in A_i}\left[c(i, a) + \gamma\sum_j p_{ij}(a)\,J^*(j)\right] \tag{2}$$

Let π be the policy that chooses $a_0$ at time step 0 and, if the next state is j, views the process as originating in state j, following a policy $\pi_j$ such that

$$J^{\pi_j}(j) \leq J^*(j) + \epsilon$$

Then

$$J^\pi(i) = c(i, a_0) + \gamma\sum_{j=1}^N p_{ij}(a_0)\,J^{\pi_j}(j) \leq c(i, a_0) + \gamma\sum_{j=1}^N p_{ij}(a_0)\,J^*(j) + \gamma\epsilon \tag{3}$$

Hence,

$$J^*(i) \leq c(i, a_0) + \gamma\sum_{j=1}^N p_{ij}(a_0)\,J^*(j) + \gamma\epsilon = \min_{a \in A_i}\left[c(i, a) + \gamma\sum_j p_{ij}(a)\,J^*(j)\right] + \gamma\epsilon \tag{4}$$

(c) Finally, since ε is arbitrary, we immediately deduce from (1) and (4) that the optimum cost-to-go function satisfies

$$J^*(i) = \min_{a \in A_i}\left[c(i, a) + \gamma\sum_j p_{ij}(a)\,J^*(j)\right] \tag{5}$$
Problem 12.3
Writing the system of N simultaneous equations (12.22) of the text in matrix form:

$$\mathbf{J}^\mu = \mathbf{c}(\mu) + \gamma\,\mathbf{P}(\mu)\,\mathbf{J}^\mu \tag{1}$$

where

$$\mathbf{J}^\mu = \left[J^\mu(1),\,J^\mu(2),\,\ldots,\,J^\mu(N)\right]^T$$

$$\mathbf{c}(\mu) = \left[c(1, \mu),\,c(2, \mu),\,\ldots,\,c(N, \mu)\right]^T$$

$$\mathbf{P}(\mu) = \begin{bmatrix} p_{11}(\mu) & p_{12}(\mu) & \cdots & p_{1N}(\mu) \\ p_{21}(\mu) & p_{22}(\mu) & \cdots & p_{2N}(\mu) \\ \vdots & \vdots & & \vdots \\ p_{N1}(\mu) & p_{N2}(\mu) & \cdots & p_{NN}(\mu) \end{bmatrix}$$

Rearranging (1):

$$\left(\mathbf{I} - \gamma\,\mathbf{P}(\mu)\right)\mathbf{J}^\mu = \mathbf{c}(\mu)$$

where I is the N-by-N identity matrix. For the solution $\mathbf{J}^\mu$ to be unique, we require that the N-by-N matrix $(\mathbf{I} - \gamma\,\mathbf{P}(\mu))$ have an inverse matrix for all possible values of the discount factor γ.
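The linear system for a fixed policy can be solved directly; a small sketch with an illustrative two-state policy:

```python
import numpy as np

# Policy evaluation: solve (I - gamma P) J = c for a fixed policy.
# The costs and transition matrix below are illustrative.
gamma = 0.9
P = np.array([[0.8, 0.2],
              [0.3, 0.7]])
c = np.array([1.0, 2.0])

J = np.linalg.solve(np.eye(2) - gamma * P, c)

# (I - gamma P) is invertible for 0 <= gamma < 1: each row of P sums to
# one, so the spectral radius of gamma P is at most gamma < 1.
assert np.allclose((np.eye(2) - gamma * P) @ J, c)
print(J)
```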
Problem 12.4
Consider an admissible policy {μ0, μ1, ...}, a positive integer K, and a cost-to-go function J. Let the costs of the first K stages be accumulated, and add the terminal cost $\gamma^K J(X_K)$, thereby obtaining the total expected cost

$$E\left[\gamma^K J(X_K) + \sum_{n=0}^{K-1}\gamma^n g(X_n, \mu_n(X_n), X_{n+1})\right]$$

where E is the expectation operator. To minimize the total expected cost, we start with $\gamma^K J(X_K)$ and perform K iterations of the dynamic programming algorithm, as shown by

$$J_n(X_n) = \min_{\mu_n} E\left[\gamma^n g(X_n, \mu_n(X_n), X_{n+1}) + J_{n+1}(X_{n+1})\right], \qquad J_K(X) = \gamma^K J(X) \tag{1}$$

Now consider the function $V_n$ defined by

$$V_n(X) = \frac{J_{K-n}(X)}{\gamma^{K-n}} \qquad \text{for all } n \text{ and } X \tag{2}$$

The function $V_K(X)$ is the optimal K-stage cost $J_0(X)$. Hence, the dynamic programming algorithm of (1) can be rewritten in terms of the function $V_n(X)$ as follows:

$$V_{n+1}(X_0) = \min_\mu E\left[g(X_0, \mu(X_0), X_1) + \gamma V_n(X_1)\right], \qquad V_0(X) = J(X)$$

which has the same mathematical form as that specified in the problem.
Problem 12.5
The iterates $J^{n+1}$ and $J^n$ generated by the dynamic programming algorithm preserve ordering. This property follows from the fact that if the terminal cost $g_K$ for K stages is changed to a uniformly larger cost $\bar{g}_K$, that is,

$$g_K(X_K) \leq \bar{g}_K(X_K) \qquad \text{for all } X_K$$

then the last-stage cost-to-go function $J_{K-1}(X_{K-1})$ will be uniformly increased. In more general terms, we may state the following:

$$E\left[g_K(X_K, \mu_K, X_{K+1}) + J_{K+1}(X_{K+1})\right] \leq E\left[g_K(X_K, \mu_K, X_{K+1}) + \bar{J}_{K+1}(X_{K+1})\right] \qquad \text{whenever } J_{K+1} \leq \bar{J}_{K+1}$$

This relation merely restates the monotonicity property of the dynamic programming algorithm.
Problem 12.6
According to (12.24) of the text, the Q-factor for the state-action pair (i, a) and stationary policy π satisfies the condition

$$Q^\pi(i, \pi(i)) = \min_a Q^\pi(i, a) \qquad \text{for all } i$$

This equation emphasizes the fact that the policy π is greedy with respect to the cost-to-go function $J^\pi(i)$.
Problem 12.7
Figure 1, shown below, presents an interesting interpretation of the policy iteration algorithm. In this interpretation, the policy evaluation step is viewed as the work of a critic that evaluates the performance of the current policy; that is, it calculates an estimate of the cost-to-go function $J^{\mu_n}$. The policy improvement step is viewed as the work of a controller or actor that accounts for the latest evaluation made by the critic and acts out the improved policy $\mu_{n+1}$. In short, the critic looks after policy evaluation, the controller (actor) looks after policy improvement, and the iteration between them goes on.

Figure 1: Actor-critic interpretation of policy iteration. The controller (actor) applies action $\mu_{n+1}(i)$ to the environment in state i, and the critic feeds the cost-to-go estimate $J^{\mu_n}$ back to the controller.
Problem 12.8
From (12.29) in the text, we find that for each possible state, the value iteration algorithm requires NM iterations, where N is the number of states and M is the number of admissible actions. Hence, the total number of iterations for all N states is $N^2 M$.
Problem 12.9
To reformulate the value-iteration algorithm in terms of Q-factors, the only change we need to make is in step 2 of Table 12.2 in the text. Specifically, we rewrite this step as follows:

$$J_{n+1}(i) = \min_a Q_n(i, a)$$
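Written this way, value iteration alternates between computing Q-factors and minimizing over actions; a minimal sketch with illustrative two-state, two-action costs and transitions:

```python
import numpy as np

# Value iteration via Q-factors:
#   Q_n(i, a)  = c(i, a) + gamma * sum_j p_ij(a) J_n(j)
#   J_{n+1}(i) = min_a Q_n(i, a)
# The costs c[i, a] and transitions P[i, a, j] are illustrative.
gamma = 0.9
c = np.array([[1.0, 2.0],
              [0.5, 3.0]])
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.6, 0.4], [0.5, 0.5]]])

J = np.zeros(2)
for _ in range(200):
    Q = c + gamma * np.einsum('iaj,j->ia', P, J)
    J = Q.min(axis=1)

# At convergence, J satisfies Bellman's optimality equation J(i) = min_a Q(i, a)
assert np.allclose(J, (c + gamma * np.einsum('iaj,j->ia', P, J)).min(axis=1))
print(J)
```

Because the update is a contraction with modulus γ < 1, 200 iterations are ample for this small example.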
Problem 12.10
The policy-iteration algorithm alternates between two steps: policy evaluation, and policy
improvement. In other words, an optimal policy is computed directly in the policy iteration
algorithm. In contrast, no such thing happens in the value iteration algorithm.
Another point of difference is that in policy iteration the cost-to-go function is recomputed
on each iteration of the algorithm. This burdensome computational difficulty is avoided in the
value-iteration algorithm.
Problem 12.14
From the definition of the Q-factor given in (12.24) of the text and Bellman's optimality equation (12.11), we immediately see that

$$J^*(i) = \min_a Q^*(i, a)$$
Problem 12.15
The value-iteration algorithm requires knowledge of the state transition probabilities. In contrast,
Q-learning operates without this knowledge. But through an interactive process, Q-learning learns
estimates of the transition probabilities in an implicit manner. Recognizing the intimate
relationship between value iteration and Q-learning, we may therefore view Q-learning as an
adaptive version of the value-iteration algorithm.
Problem 12.16
Using Table 12.4 in the text, we may construct the signal-flow graph shown in Figure 1 for the Q-learning algorithm.

Figure 1: Signal-flow graph of the Q-learning algorithm; the feedback path contains a unit-delay element.
Problem 12.17
The whole point of the Q-learning algorithm is that it eliminates the need for knowing the state
transition probabilities. If knowledge of the state transition probabilities is available, then the Q-
learning algorithm assumes the same form as the value-iteration algorithm.
CHAPTER 13
Neurodynamics
Problem 13.1
The equilibrium state $\bar{\mathbf{x}}$ is (asymptotically) stable if, in a small neighborhood around $\bar{\mathbf{x}}$, there exists a positive definite function V(x) such that its derivative with respect to time is negative definite in that region.
Problem 13.3
The dynamics of the system are described by

$$\frac{dx_j}{dt} = \varphi_j(\mathbf{W}, \mathbf{i}, \mathbf{x}), \qquad j = 1, 2, \ldots, N$$

where W is the weight matrix, i is the bias vector, and x is the state vector with its jth element denoted by $x_j$.

(a) With the bias vector i treated as input and with fixed initial condition x(0), let $\mathbf{x}(\infty)$ denote the final state vector of the system. Then,

$$0 = \varphi_j(\mathbf{W}, \mathbf{i}, \mathbf{x}(\infty)), \qquad j = 1, 2, \ldots, N$$

For a given matrix W and input vector i, the set of initial points x(0) evolves to a fixed point. The fixed points are functions of W and i. Thus, the system acts as a mapper with i as input and $\mathbf{x}(\infty)$ as output, as shown in Fig. 1(a).

Figure 1: (a) The system as a mapper: input i, output $\mathbf{x}(\infty)$, with x(0) fixed. (b) The system as a pattern associator: input x(0), output $\mathbf{x}(\infty)$, with i fixed.

(b) With the initial state vector x(0) treated as input and the bias vector i fixed, let $\mathbf{x}(\infty)$ denote the final state vector of the system. We may then write

$$0 = \varphi_j(\mathbf{W}, \mathbf{i}\ \text{fixed}, \mathbf{x}(\infty)), \qquad j = 1, 2, \ldots, N$$

Thus, with x(0) acting as input and $\mathbf{x}(\infty)$ acting as output, the dynamic system behaves like a pattern associator, as shown in Fig. 1(b).
Problem 13.4
The three fundamental memories are

$$\boldsymbol{\xi}_1 = [+1, +1, +1, +1, +1]^T$$
$$\boldsymbol{\xi}_2 = [+1, -1, -1, +1, -1]^T$$
$$\boldsymbol{\xi}_3 = [-1, +1, -1, +1, +1]^T$$

The weight matrix of the Hopfield network is

$$\mathbf{W} = \frac{1}{N}\sum_{i=1}^p \boldsymbol{\xi}_i\boldsymbol{\xi}_i^T - \frac{p}{N}\mathbf{I} = \frac{1}{5}\begin{bmatrix} 0 & -1 & +1 & +1 & -1 \\ -1 & 0 & +1 & +1 & +3 \\ +1 & +1 & 0 & -1 & +1 \\ +1 & +1 & -1 & 0 & +1 \\ -1 & +3 & +1 & +1 & 0 \end{bmatrix}$$

To demonstrate that all three fundamental memories satisfy the alignment condition

$$\boldsymbol{\xi}_i = \operatorname{sgn}(\mathbf{W}\boldsymbol{\xi}_i), \qquad i = 1, 2, 3$$

we compute

$$\operatorname{sgn}(\mathbf{W}\boldsymbol{\xi}_1) = \operatorname{sgn}\left(\frac{1}{5}[0, +4, +2, +2, +4]^T\right) = [+1, +1, +1, +1, +1]^T = \boldsymbol{\xi}_1$$

$$\operatorname{sgn}(\mathbf{W}\boldsymbol{\xi}_2) = \operatorname{sgn}\left(\frac{1}{5}[+2, -4, -2, 0, -4]^T\right) = [+1, -1, -1, +1, -1]^T = \boldsymbol{\xi}_2$$

$$\operatorname{sgn}(\mathbf{W}\boldsymbol{\xi}_3) = \operatorname{sgn}\left(\frac{1}{5}[-2, +4, 0, +2, +4]^T\right) = [-1, +1, -1, +1, +1]^T = \boldsymbol{\xi}_3$$

Note: Wherever a particular element of the product $\mathbf{W}\boldsymbol{\xi}_i$ is zero, the neuron in question is left in its previous state.
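The storage rule and the alignment checks above can be reproduced with a short numpy sketch (the 1/5 scaling of W is dropped, since it does not affect the sign of the local fields):

```python
import numpy as np

# Fundamental memories of this problem (rows).
xi = np.array([[ 1,  1,  1,  1,  1],
               [ 1, -1, -1,  1, -1],
               [-1,  1, -1,  1,  1]], dtype=int)

# Outer-product (Hebbian) storage with zero diagonal; the 1/N factor is
# omitted because it does not change the sign of the induced local fields.
W = xi.T @ xi
np.fill_diagonal(W, 0)

def recall(x):
    """One pass of sgn(Wx); a zero local field keeps the previous state."""
    v = W @ x
    y = np.sign(v)
    y[v == 0] = x[v == 0]
    return y

for m in xi:   # each fundamental memory satisfies the alignment condition
    assert np.array_equal(recall(m), m)

probe = np.array([1, -1, 1, 1, 1])   # xi_1 with its second element reversed
print(recall(probe))
```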
Next, consider the probe
x = [+1, -1, +1, +1, +1]^T
which is the fundamental memory ξ1 with its second element reversed in polarity. We write
Wx = (1/5) [+2, +4, 0, 0, -2]^T   (1)
Therefore,
sgn(Wx) = [+1, +1, +1, +1, -1]^T
Thus, neurons 2 and 5 want to change their states. We therefore have two options:
1. If neuron 2 changes its state first, the probe is restored to the fundamental memory ξ1, which
satisfies the alignment condition.
2. If neuron 5 changes its state first, the new probe is x = [+1, -1, +1, +1, -1]^T, for which
Wx = (1/5) [+4, -2, -2, -2, -2]^T
and therefore
sgn(Wx) = [+1, -1, -1, -1, -1]^T
so neurons 3 and 4 change their states in turn, leaving the network in the state [+1, -1, -1, -1, -1]^T.
In both cases, the new state satisfies the alignment condition and the computation is then
terminated.
Thus, when the noisy version of ξ1 is applied to the network, with its second element
changed in polarity, one of two things can happen with equal likelihood: the network converges to
the fundamental memory ξ1, or it converges to the spurious stable state [+1, -1, -1, -1, -1]^T.
Problem 13.5
Consider again the probe of Problem 13.4,
x = [+1, -1, +1, +1, +1]^T
for which
Wx = (1/5) [+2, +4, 0, 0, -2]^T
and
sgn(Wx) = [+1, +1, +1, +1, -1]^T
According to this result, neurons 2 and 5 have changed their states. In synchronous updating, this
is permitted. Thus, with the new state vector
x = [+1, +1, +1, +1, -1]^T
we next compute
Wx = (1/5) [+2, -2, 0, 0, +4]^T
Hence,
sgn(Wx) = [+1, -1, +1, +1, +1]^T
and the new state vector is
x = [+1, -1, +1, +1, +1]^T
which is recognized as the original probe. In this problem, we thus find that the network
experiences a limit cycle of duration 2.
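The length-2 limit cycle under synchronous updating can be reproduced with the same weight matrix; a small sketch:

```python
import numpy as np

# Reproducing the length-2 limit cycle above under synchronous updating.
xi = np.array([[+1, +1, +1, +1, +1],
               [+1, -1, -1, +1, -1],
               [-1, +1, -1, +1, +1]], dtype=float)
W = (xi.T @ xi) / 5 - (3 / 5) * np.eye(5)   # weight matrix of Problem 13.4

def sync_update(W, x):
    v = W @ x
    return np.where(v == 0, x, np.sign(v))  # zero activation keeps the old state

x0 = np.array([+1., -1., +1., +1., +1.])    # the probe
x1 = sync_update(W, x0)                     # all neurons update at once
x2 = sync_update(W, x1)                     # returns to the original probe
```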
Problem 13.6
(a) The three vectors
−ξ1 = [-1, -1, -1, -1, -1]^T
−ξ2 = [-1, +1, +1, -1, +1]^T
−ξ3 = [+1, -1, +1, -1, -1]^T
are simply the negatives of the three fundamental memories considered in Problem 13.4,
respectively. These three vectors are therefore also fundamental memories of the Hopfield
network.
(b) Consider the vector
x = [0, +1, +1, +1, +1]^T
which is the result of masking the first element of the fundamental memory ξ1 of Problem 13.4.
According to our notation, a neuron of the Hopfield network is in either state +1 or -1. We
therefore have the choice of setting the zero element of x to +1 or -1. The first option restores the
vector x to its original form, the fundamental memory ξ1, which satisfies the alignment condition.
Alternatively, we may set the zero element equal to -1, obtaining
x = [-1, +1, +1, +1, +1]^T
In this latter case, the alignment condition is not satisfied. The obvious choice is therefore the
former one.
Problem 13.7
We are given the weight matrix
W = |  0  -1 |
    | -1   0 |
For the state s2 = [-1, +1]^T,
Ws2 = [-1, +1]^T
which yields
sgn(Ws2) = [-1, +1]^T = s2
For the state s4 = [+1, -1]^T,
Ws4 = [+1, -1]^T
which yields
sgn(Ws4) = [+1, -1]^T = s4
For the state s1 = [+1, +1]^T,
Ws1 = [-1, -1]^T
which yields
sgn(Ws1) = [-1, -1]^T ≠ s1
Thus, both neurons want to change; suppose we pick neuron 1 to change its state, yielding the
new state vector [-1, +1]T. This is a stable vector as it satisfies the alignment condition. If,
however, we permit neuron 2 to change its state, we get a state vector equal to s4. Similarly, we
may show that the state vector s3 = [-1, -1]T is also unstable. The resulting state-transition diagram
of the network is thus as depicted in Fig. 1.
[Figure 1: state-transition diagram of the network, showing the four states (+1, +1), (-1, +1), (-1, -1), and (+1, -1) in the (x1, x2) plane.]
The results depicted in Fig. 1 assume the use of asynchronous updating. If, however, we
use synchronous updating, we find that in the case of s1:
sgn(Ws1) = [-1, -1]^T
Permitting both neurons to change state, we get the new state vector [-1, -1]^T, which is the state
s3. Now, we find that
W [-1, -1]^T = [+1, +1]^T
so that synchronous updating takes the network back to s1.
Thus, in the synchronous-updating case, the states s1 and s3 represent a limit cycle of
length 2.
Returning to the normal operation of the Hopfield network, we note that the energy
function of the network is
E = -(1/2) Σ_i Σ_j (j ≠ i) w_ji s_i s_j
  = -(1/2) w_12 s_1 s_2 - (1/2) w_21 s_2 s_1
  = -w_12 s_1 s_2     (since w_12 = w_21)
  = s_1 s_2           (since w_12 = -1)   (1)
Evaluating (1) for all possible states of the network, we get the following table:
State      Energy
[+1, +1]     +1
[-1, +1]     -1
[-1, -1]     +1
[+1, -1]     -1
Thus, states s2 = [-1, +1]^T and s4 = [+1, -1]^T represent the global minima of the energy
function and are therefore stable.
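The energy table can be reproduced by direct enumeration of equation (1); a small sketch:

```python
import numpy as np
from itertools import product

# Enumerating the energy E = s1*s2 of equation (1) over the four states.
W = np.array([[0., -1.], [-1., 0.]])
energies = {}
for s in product([+1, -1], repeat=2):
    s = np.array(s, dtype=float)
    E = -0.5 * s @ W @ s                    # E = -(1/2) sum_{i,j} w_ji s_i s_j
    energies[tuple(int(v) for v in s)] = E
# the two states with E = -1 are the stable ones
```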
Problem 13.8
The energy function of the Hopfield network is
E = -(1/2) Σ_i Σ_j w_ji s_j s_i   (1)
The overlap of the state s with the fundamental memory ξ_ν is defined by
m_ν = (1/N) Σ_j s_j ξ_{ν,j}   (2)
and the weights are given by the outer-product rule
w_ji = (1/N) Σ_ν ξ_{ν,j} ξ_{ν,i}   (3)
Substituting (3) into (1):
E = -(1/(2N)) Σ_ν Σ_i Σ_j ξ_{ν,j} ξ_{ν,i} s_j s_i
  = -(1/(2N)) Σ_ν ( Σ_i s_i ξ_{ν,i} ) ( Σ_j s_j ξ_{ν,j} )
  = -(1/(2N)) Σ_ν (m_ν N)(m_ν N)
  = -(N/2) Σ_ν m_ν²
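The identity E = -(N/2) Σ_ν m_ν² can be checked numerically for randomly chosen memories and an arbitrary state; a sketch (the sizes are illustrative):

```python
import numpy as np

# Numerical check of E = -(N/2) sum_nu m_nu^2 as derived above; the weights
# follow eq. (3) exactly (diagonal terms included, as in the derivation).
rng = np.random.default_rng(0)
N, p = 20, 3
xi = rng.choice([-1.0, 1.0], size=(p, N))    # fundamental memories
W = (xi.T @ xi) / N                          # eq. (3)
s = rng.choice([-1.0, 1.0], size=N)          # arbitrary network state

E_direct = -0.5 * s @ W @ s                  # eq. (1)
m = xi @ s / N                               # overlaps, eq. (2)
E_overlap = -(N / 2) * np.sum(m ** 2)
assert np.isclose(E_direct, E_overlap)
```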
Problem 13.11
The energy function of the system is defined by
E = (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} c_ji φ_i(u_i) φ_j(u_j) − Σ_{j=1}^{N} ∫_0^{u_j} b_j(λ) φ'_j(λ) dλ   (1)
where φ'_j(·) is the derivative of the function φ_j(·) with respect to its argument. We now
differentiate the function E with respect to time t and note the following relations:
1. c_ji = c_ij
2. ∂φ_j(u_j)/∂t = (∂u_j/∂t) ∂φ_j(u_j)/∂u_j = (∂u_j/∂t) φ'_j(u_j)
3. ∂/∂t ∫_0^{u_j} b_j(λ) φ'_j(λ) dλ = (∂u_j/∂t) ∂/∂u_j ∫_0^{u_j} b_j(λ) φ'_j(λ) dλ = (∂u_j/∂t) b_j(u_j) φ'_j(u_j)
We may thus write
dE/dt = Σ_{j=1}^{N} (∂u_j/∂t) φ'_j(u_j) [ Σ_{i=1}^{N} c_ji φ_i(u_i) − b_j(u_j) ]   (2)
The dynamics of the system are described by
∂u_j/∂t = a_j(u_j) [ b_j(u_j) − Σ_{i=1}^{N} c_ji φ_i(u_i) ],   j = 1, 2, ..., N   (3)
Hence, using (3) in (2) and collecting terms, we get the final result
dE/dt = −Σ_{j=1}^{N} a_j(u_j) φ'_j(u_j) [ b_j(u_j) − Σ_{i=1}^{N} c_ji φ_i(u_i) ]²   (4)
Provided that
a_j(u_j) > 0   for all u_j
and the function φ_j(u_j) satisfies the monotonicity condition
φ'_j(u_j) ≥ 0   for all u_j
we then immediately see from (4) that
dE/dt ≤ 0   for all t
In words, the function E defined in (1) is the Lyapunov function for the coupled system of
nonlinear differential equations (3).
Problem 13.12
The BSB (brain-state-in-a-box) model is described by the system of equations
dv_j(t)/dt = −v_j(t) + Σ_{i=1}^{N} c_ji φ(v_i),   j = 1, 2, ..., N   (1)
where
c_ji = δ_ji + β w_ji
and δ_ji is a Kronecker delta. According to the Cohen-Grossberg theorem of (13.47) in the text,
we have
du_j(t)/dt = a_j(u_j) [ b_j(u_j) − Σ_{i=1}^{N} c_ji φ_i(u_i) ]   (2)
Comparison of (1) and (2) yields the following correspondences between the Cohen-Grossberg
theorem and the brain-state-in-a-box (BSB) model:
Cohen-Grossberg theorem     BSB model
u_j                         v_j
a_j(u_j)                    1
b_j(u_j)                    −v_j
c_ji                        −c_ji
φ_i(u_i)                    φ(v_i)
Applying these correspondences to the energy function of the Cohen-Grossberg theorem,
E = (1/2) Σ_i Σ_j c_ji φ_i(u_i) φ_j(u_j) − Σ_j ∫_0^{u_j} b_j(λ) φ'_j(λ) dλ
we obtain the energy function of the BSB model:
E = −(1/2) Σ_i Σ_j c_ji φ(v_i) φ(v_j) + Σ_j ∫_0^{v_j} λ φ'(λ) dλ   (3)
The piecewise-linear activation function of the BSB model is
φ(y_j) = +1   if y_j > 1
         y_j  if −1 ≤ y_j ≤ 1
         −1   if y_j < −1
We therefore have
φ'(y_j) = 0   if |y_j| > 1
          1   if |y_j| ≤ 1
Correspondingly, inside the linear region,
∫_0^{v_j} λ φ'(λ) dλ = ∫_0^{v_j} λ dλ = (1/2) v_j² = (1/2) x_j²   (4)
where x_j = φ(v_j). For the first term of (3),
−(1/2) Σ_j Σ_i c_ji φ(v_i) φ(v_j) = −(1/2) Σ_j Σ_i (δ_ji + β w_ji) φ(v_i) φ(v_j)
  = −(β/2) Σ_j Σ_i w_ji x_j x_i − (1/2) Σ_j φ(v_j)²
  = −(β/2) Σ_j Σ_i w_ji x_j x_i − (1/2) Σ_j x_j²   (5)
Adding (4) (summed over j) and (5), the terms (1/2) Σ_j x_j² cancel, and the energy function of
the BSB model reduces to
E = −(β/2) Σ_j Σ_i w_ji x_j x_i = −(β/2) x^T W x
Problem 13.13
The activation function φ(v) of Fig. P13.13 is a nonmonotonic function of its argument v; that is,
its derivative φ'(v) assumes both positive and negative values. It therefore violates the
monotonicity condition required by the Cohen-Grossberg theorem; see Eq. (4) of Problem 13.11.
This means that the Cohen-Grossberg theorem is not applicable to an associative memory, such as
a Hopfield network, that uses the activation function of Fig. P13.13.
CHAPTER 15
Dynamically Driven Recurrent Networks
Problem 15.1
Referring to the simple recurrent neural network of Fig. 15.3, let the vector u(n) denote the input
signal, the vector x(n) the signal produced at the output of the hidden layer, and the vector y(n)
the output signal of the whole network. Then, treating x(n) as the state of the network, we may
describe the state-space model of the network as follows:
x(n + 1) = f(x(n), u(n))
y(n) = g(x(n))
Problem 15.2
The recurrent MLP is described by the system of coupled equations
x_I(n + 1) = f_1(x_I(n), u(n))   (1)
x_II(n + 1) = f_2(x_II(n), x_I(n + 1))   (2)
x_0(n + 1) = f_3(x_0(n), x_II(n + 1))   (3)
Substituting (1) into (2),
x_II(n + 1) = f_2(x_II(n), f_1(x_I(n), u(n)))   (4)
Defining the overall state of the network as the stacked vector
x(n + 1) = [x_I(n + 1), x_II(n + 1), x_0(n + 1)]^T   (5)
we may combine (1), (4), and (3) into the single state equation
x(n + 1) = f(x(n), u(n))   (6)
The output of the network is
y(n) = x_0(n)   (7)
With x0(n) included in the definition of the state x(n + 1) and with x(n) dependent on the input
u(n), we thus have
y ( n ) = g ( x ( n ), u ( n ) ) (8)
where g ( .,. ) is another vector valued function. Equations (6) and (8) define the state-space model
of the recurrent MLP.
Problem 15.3
It is indeed possible for a dynamic system to be controllable but unobservable, and vice versa.
This statement is justified by virtue of the fact that the conditions for controllability and
observability are entirely different, which means that there are situations where the conditions are
satisfied for one and not for the other.
Problem 15.4
The network dynamics are described by
x(n + 1) = φ(W_a x(n) + w_b u(n))
Hence,
x(n + 2) = φ(W_a x(n + 1) + w_b u(n + 1))
         = φ(W_a φ(W_a x(n) + w_b u(n)) + w_b u(n + 1))
x(n + 3) = φ(W_a x(n + 2) + w_b u(n + 2))
         = φ(W_a φ(W_a φ(W_a x(n) + w_b u(n)) + w_b u(n + 1)) + w_b u(n + 2))
and so on. By induction, we may state that the state x(n + q) is a nested nonlinear function of x(n)
and u_q(n), where
u_q(n) = [u(n), u(n + 1), ..., u(n + q − 1)]^T
The controllability Jacobian, evaluated at the origin, is
J_q(n) = ∂x(n + q)/∂u_q(n) | x(n) = 0, u(n) = 0
As an illustrative example, consider the case of q = 3. The Jacobian of x(n + 3) with respect to
u_3(n) is
J_3(n) = [∂x(n + 3)/∂u(n), ∂x(n + 3)/∂u(n + 1), ∂x(n + 3)/∂u(n + 2)] | x(n) = 0, u(n) = 0
Using the chain rule, and writing A = φ'(0) W_a and b = φ'(0) w_b, we have
∂x(n + 3)/∂u(n) = φ'(0) W_a φ'(0) W_a φ'(0) w_b = AAb = A²b
∂x(n + 3)/∂u(n + 1) = φ'(0) W_a φ'(0) w_b = Ab
∂x(n + 3)/∂u(n + 2) = φ'(0) w_b = b
All these partial derivatives have been evaluated at x(n) = 0 and u(n) = 0. The Jacobian J_3(n) is
therefore
J_3(n) = [A²b, Ab, b]
By induction, the Jacobian for arbitrary q is
J_q(n) = [A^{q−1}b, A^{q−2}b, ..., Ab, b]
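The controllability matrix J_q = [A^{q−1}b, ..., Ab, b] and its rank test can be sketched in code; the weight values and the unit slope φ'(0) = 1 (as for tanh) are illustrative assumptions:

```python
import numpy as np

# Sketch of the linearized controllability matrix J_q = [A^{q-1}b, ..., Ab, b]
# with A = phi'(0) Wa and b = phi'(0) wb; the values below are illustrative.
def controllability_matrix(Wa, wb, q, slope=1.0):
    A, b = slope * Wa, slope * wb            # phi'(0) = slope (1 for tanh)
    cols = [np.linalg.matrix_power(A, k) @ b for k in range(q - 1, -1, -1)]
    return np.column_stack(cols)

Wa = np.array([[0.5, 0.2], [0.1, 0.3]])
wb = np.array([1.0, 0.0])
J3 = controllability_matrix(Wa, wb, 3)       # columns: A^2 b, A b, b
full_rank = np.linalg.matrix_rank(J3) == 2   # rank test for local controllability
```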
Problem 15.5
The network is described by
x(n + 1) = φ(W_a x(n) + w_b u(n))
y(n) = c^T x(n)   (1)
Hence,
y(n + 1) = c^T x(n + 1)
         = c^T φ(W_a x(n) + w_b u(n))   (2)
y(n + 2) = c^T x(n + 2)
         = c^T φ(W_a φ(W_a x(n) + w_b u(n)) + w_b u(n + 1))   (3)
and so on. By induction, we may therefore state that y(n + q) is a nested nonlinear function of x(n)
and u_q(n), where
u_q(n) = [u(n), u(n + 1), ..., u(n + q − 1)]^T
y_q(n) = [y(n), y(n + 1), ..., y(n + q − 1)]^T
The Jacobian of y_q(n) with respect to x(n), evaluated at the origin, is defined by
J_q(n) = ∂y_q(n)^T/∂x(n) | x(n) = 0, u(n) = 0
For q = 3,
J_3(n) = [∂y(n)/∂x(n), ∂y(n + 1)/∂x(n), ∂y(n + 2)/∂x(n)] | x(n) = 0, u(n) = 0
With A = φ'(0) W_a, the individual partial derivatives are
∂y(n)/∂x(n) = c
∂y(n + 1)/∂x(n) = (φ'(0) W_a)^T c = A^T c
∂y(n + 2)/∂x(n) = (φ'(0) W_a)^T (φ'(0) W_a)^T c = (A^T)² c
All these partial derivatives have been evaluated at the origin. We thus write
J_3(n) = [c, A^T c, (A^T)² c]
By induction, we may now state that the Jacobian J_q(n) for observability is, in general,
J_q(n) = [c, A^T c, (A^T)² c, ..., (A^T)^{q−1} c]
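A companion sketch for the observability matrix J_q = [c, A^T c, ..., (A^T)^{q−1} c], with the same illustrative assumptions as before:

```python
import numpy as np

# Sketch of the linearized observability matrix J_q = [c, A^T c, ...,
# (A^T)^{q-1} c]; the weights, c, and the slope phi'(0) = 1 are illustrative.
def observability_matrix(Wa, c, q, slope=1.0):
    A = slope * Wa                           # A = phi'(0) Wa
    return np.column_stack([np.linalg.matrix_power(A.T, k) @ c for k in range(q)])

Wa = np.array([[0.5, 0.2], [0.1, 0.3]])
c = np.array([1.0, 0.0])
J2 = observability_matrix(Wa, c, 2)          # columns: c, A^T c
observable = np.linalg.matrix_rank(J2) == 2  # full rank => locally observable
```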
Problem 15.6
x(n + 1) = f(x(n), u(n))   (1)
Suppose x(n) is N-dimensional and u(n) is m-dimensional. Define a new nonlinear dynamic
system in which the input is of additive form, as shown by
x′(n + 1) = f′(x′(n)) + u′(n)   (2)
where
x′(n) = [x(n), u(n − 1)]^T   (3)
u′(n) = [0, u(n)]^T   (4)
and
f′(x′(n)) = [f(x(n), u(n − 1)), 0]^T   (5)
Both x′(n) and u′(n) are (N + m)-dimensional, and the first N elements of u′(n) are zero. From
these definitions, we readily see that
x′(n + 1) = [x(n + 1), u(n)]^T
          = [f(x(n), u(n − 1)), 0]^T + [0, u(n)]^T
which is in perfect agreement with the description of the original nonlinear dynamic system
defined in (1).
Problem 15.7
(a) The state-space model of the local activation feedback system of Fig. P15.7a depends on how
the linear dynamic component is described. For example, we may define the input vector as
u(n) = [u(n), u(n − 1), ..., u(n − p + 2)]^T
and describe the linear dynamic component by
z(n) = [x(n − 1), (Bu(n))^T]^T   (1)
Let w denote the synaptic weight vector of the single neuron in Fig. P15.7a, with w_1 being its first
element and w_0 denoting the rest. We may then write
x(n) = w^T z(n) + b
     = w_1 x(n − 1) + w_0^T Bu(n) + b
     = w_1 x(n − 1) + B′u′(n)   (2)
where
u′(n) = [u(n)^T, 1]^T
and
B′ = [w_0^T B, b]
The output of the system is
y(n) = φ(x(n))   (3)
Equations (2) and (3) define the state-space model of Fig. P15.7a, assuming that its linear
dynamic component is described by (1).
(b) Consider next the local output feedback system of Fig. P15.7b. Let the linear dynamic
component of this system be described by (1). The output of the whole system in Fig. P15.7b is
then defined by
x(n) = φ(w^T z(n) + b)
     = φ(w_1 x(n − 1) + B′u′(n))   (4)
where w_1, w_0, B′, and u′(n) are all as defined previously. The output y(n) of Fig. P15.7b is
y(n) = x(n)   (5)
Equations (4) and (5) define the state-space model of the local output feedback system of Fig.
P15.7b, assuming that its linear dynamic component is described by (1).
The process (state) equation of the local feedback system of Fig. P15.7a is linear but its
measurement equation is nonlinear, and conversely for the local feedback system of Fig. P15.7b.
These two local feedback systems are controllable and observable, because they both satisfy the
conditions for controllability and observability.
Problem 15.8
The network is governed by
x(n + 1) = φ(W_a x(n) + w_b u(n))
Hence, we write
x(n + 2) = φ(W_a x(n + 1) + w_b u(n + 1))
         = φ(W_a φ(W_a x(n) + w_b u(n)) + w_b u(n + 1))
x(n + 3) = φ(W_a x(n + 2) + w_b u(n + 2))
         = φ(W_a φ(W_a φ(W_a x(n) + w_b u(n)) + w_b u(n + 1)) + w_b u(n + 2))
and so on.
By induction, we may now state that x(n + q) is a nested nonlinear function of x(n) and u_q(n), and
thus write
x(n + q) = g(x(n), u_q(n))
where
u_q(n) = [u(n), u(n + 1), ..., u(n + q − 1)]^T
The corresponding output is
y(n + q) = c^T x(n + q)
         = c^T g(x(n), u_q(n))
         = Φ(x(n), u_q(n))
Problem 15.11
The network is described by
x(n + 1) = f(x(n), u(n))   (1)
Equivalently, we may write
x(n) = f(x(n − 1), u(n − 1))
x(n − 1) = f(x(n − 2), u(n − 2))
x(n − 2) = f(x(n − 3), u(n − 3))
and so on. Accordingly, the simple recurrent network of Fig. 15.3 may be unfolded in time as
follows:
[Figure: the network unfolded in time; copies of f(·) map (x(n−3), u(n−3)) → x(n−2), (x(n−2), u(n−2)) → x(n−1), and (x(n−1), u(n−1)) → x(n), and g(·) maps x(n) → y(n).]
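The unfolded network is just a composition of copies of f(·) followed by the readout g(·); a sketch, in which the state map f(x, u) = tanh(W_a x + w_b u) and the linear readout g(x) = Cx are hypothetical choices used only to make the composition concrete:

```python
import numpy as np

# The unfolding is a composition of copies of f(.) followed by g(.). The
# state map f(x, u) = tanh(Wa x + wb u) and readout g(x) = C x are
# hypothetical choices, not from the text.
def f(x, u, Wa, wb):
    return np.tanh(Wa @ x + wb * u)

Wa = np.array([[0.4, 0.1], [0.0, 0.3]])
wb = np.array([1.0, 0.5])
C = np.array([[1.0, -1.0]])

x = np.zeros(2)                     # x(n-3)
for u in [0.2, -0.1, 0.4]:          # u(n-3), u(n-2), u(n-1)
    x = f(x, u, Wa, wb)             # one unfolded copy of f(.) per step
y = C @ x                           # y(n) = g(x(n))
```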
Problem 15.12
The local gradient for the hybrid form of the BPTT algorithm is given by
δ_j(l) = φ'(v_j(l)) e_j(l)                               for l = n
δ_j(l) = φ'(v_j(l)) [ e_j(l) + Σ_k w_kj(l) δ_k(l + 1) ]  for n − h′ < l < n
δ_j(l) = φ'(v_j(l)) Σ_k w_kj(l) δ_k(l + 1)               for n − h < l ≤ n − h′
where h′ is the number of additional steps taken before performing the next BPTT computation,
with h′ < h.
Problem 15.13
(a) The nonlinear state dynamics of the real-time recurrent learning (RTRL) algorithm described in
(15.48) and (15.52) of the text may be reformulated in the equivalent form:
∂y_j(n + 1)/∂w_kl(n) = φ'(v_j(n)) [ Σ_{i ∈ A∪B} w_ji(n) ∂ζ_i(n)/∂w_kl(n) + δ_kj ζ_l(n) ]   (1)
where δ_kj is the Kronecker delta and y_j(n + 1) is the output of neuron j at time n + 1. For a
teacher-forced recurrent network, we have
ζ_i(n) = u_i(n) if i ∈ A
         d_i(n) if i ∈ C   (2)
         y_i(n) if i ∈ B − C
Since the external inputs u_i(n) and the desired responses d_i(n) do not depend on the synaptic
weights, their partial derivatives in (1) vanish, and we are left with
∂y_j(n + 1)/∂w_kl(n) = φ'(v_j(n)) [ Σ_{i ∈ B−C} w_ji(n) ∂y_i(n)/∂w_kl(n) + δ_kj ζ_l(n) ]   (3)
(b) Let
π^j_kl(n) = ∂y_j(n)/∂w_kl(n)
π^j_kl(n + 1) = ∂y_j(n + 1)/∂w_kl(n + 1) ≈ ∂y_j(n + 1)/∂w_kl(n)
Then (3) becomes
π^j_kl(n + 1) = φ'(v_j(n)) [ Σ_{i ∈ B−C} w_ji(n) π^i_kl(n) + δ_kj ζ_l(n) ]   (4)
This nonlinear state equation is the centerpiece of the RTRL algorithm using teacher forcing.
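The recursion (4) can be sketched in code; the network layout (M external inputs playing the role of set A, N fully recurrent neurons as set B, a teacher-forced subset C) and all numerical values are illustrative assumptions, not from the text:

```python
import numpy as np

# A minimal sketch of the teacher-forced RTRL recursion (4). The layout
# (M external inputs = set A, N recurrent neurons = set B, forced subset C)
# and all numerical values are illustrative assumptions.
N, M = 3, 1
rng = np.random.default_rng(1)
W = rng.normal(scale=0.5, size=(N, N + M))    # w_ji, i ranging over A u B
pi = np.zeros((N, N, N + M))                  # pi[j, k, l] = d y_j / d w_kl

def rtrl_step(W, pi, zeta, forced):
    """One application of (4); zeta concatenates [inputs, outputs]."""
    v = W @ zeta
    dphi = 1.0 - np.tanh(v) ** 2              # phi'(v_j(n)) for phi = tanh
    new_pi = np.zeros_like(pi)
    for j in range(N):
        for k in range(N):
            for l in range(N + M):
                s = sum(W[j, M + i] * pi[i, k, l]
                        for i in range(N) if i not in forced)   # i in B - C
                new_pi[j, k, l] = dphi[j] * (s + (zeta[l] if k == j else 0.0))
    return new_pi

zeta = np.concatenate([np.array([0.5]), np.zeros(N)])   # [u(n), y(n)]
pi = rtrl_step(W, pi, zeta, forced={0})       # neuron 0 plays the role of C
```

Starting from zero sensitivities, one step leaves only the Kronecker-delta term δ_kj ζ_l(n), scaled by φ'(v_j(n)).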