
Bifurcations in Nonlinear Networks Initialized with Linear Network Weight Solutions


F. M. Coetzee, Student Member, IEEE, and V. L. Stonick, Member, IEEE

Abstract— It has been reported [1] that weight convergence of multilayer perceptron (MLP) networks can be improved by smoothly changing the node nonlinearity from a linear to a sigmoidal function during training. While this approach might provide a useful training heuristic, formally, this method depends on an underlying homotopy which transforms a linear to a nonlinear network of the same architecture. As a parameter τ (the homotopy parameter) is varied from zero to one, the linear network weights are mapped onto nonlinear network weight solutions. In this paper, a geometric interpretation of the optimization equations is used to construct and describe an example network that illustrates practical and theoretical difficulties resulting from bifurcation of solution paths. Since the linear system is a generic high-order bifurcation point of the homotopy equations, solution paths are discontinuous at initialization. Bifurcations and infinite solutions for intermediate values τ ∈ (0, 1) also can occur for data sets which are not of measure zero. These results weaken the guarantees on global convergence and exhaustive behavior normally associated with homotopy methods. The geometric perspective further provides insight into the relationship between linear and nonlinear perceptron networks, and how weight solutions arise in each.

I INTRODUCTION

The natural homotopy approach corresponds to changing the node nonlinearity from a linear to a nonlinear sigmoidal function as a free parameter (the homotopy parameter) is varied. For networks without hidden layer nodes, we have shown that this approach defines a globally convergent constructive weight optimization method, and provides an intuitive geometrical perspective on the weight optimization process for nonlinear networks [2-4]. In this paper, the natural homotopy is extended to multi-layer perceptron (MLP) networks having one hidden layer of sigmoidal neurons and linear output nodes. While Yang and Yu [1] used this approach as a practical heuristic to improve convergence during training of multi-layer networks, theoretical and practical requirements ensuring validity of this homotopy approach have not been addressed. Furthermore, the natural homotopy was not used to provide insight into the optimization process and mapping abilities of multi-layer neural networks. In this paper the theoretical geometric perspectives we developed for the SLP are extended to the MLP. Although the natural homotopy might yield a useful training heuristic, we illustrate that it suffers from a number of theoretical and practical difficulties. Specifically, the linear system forms a bifurcation point of the equations; solutions at the initial system are, in general, discontinuous. Bifurcations and infinite solutions also can occur for intermediate networks on data sets which are not of measure zero. These results weaken the guarantees on global convergence and exhaustive behavior normally associated with homotopy methods. This same geometric perspective clarifies understanding of the relationship between linear and nonlinear neural networks, by comparing and contrasting how weight solutions arise.

Basic homotopy methods are briefly reviewed in Section II. The resulting geometric perspective of the weight optimization process is presented in Section IV. In Section V carefully constructed examples are used to illustrate homotopy path behavior. Implications of these results on the robustness of the natural homotopy approach are discussed in Section VI.

The authors are with the Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213-3890, USA. This research was funded in part by NSF Grant number MIP-9157221.

II HOMOTOPY METHODS

In this review of basic homotopy methods, we emphasize properties and constraints critical to our development. For a more complete introduction see e.g. [5-7].

Homotopy methods provide a constructive way to find solutions to a set of equations by mapping the known solutions of a simple initial system to the desired solutions of the unsolved system of equations. Homotopy methods are appropriate for optimization if the optimization problem can be reduced to solving systems of equations. Mathematically, the basic homotopy method is as follows: given a final set of equations f(x) = 0, f : D ⊂ ℝⁿ → ℝⁿ, with an unknown solution, an initial system of equations g(x) = 0, g : D ⊂ ℝⁿ → ℝⁿ, with a known solution(s) is constructed.
A homotopy function h : D × 𝒯 → ℝⁿ is defined in terms of an embedded parameter τ ∈ 𝒯 ⊂ ℝ, such that

h(x, τ) = g(x) when τ = 0,    h(x, τ) = f(x) when τ = 1    (1)

The objective is to solve the final equations by solving h(x, τ) = 0 numerically for x for increasing values of τ, starting at τ = 0 (where the solution is known by construction) and continuing to τ = 1. Intuitively, using small increments of τ the solution can be computed efficiently, since the solution at the previous value of τ is used to compute the solution at the current value of τ. For differentiable systems, the problem is reduced to that of solving the Davidenko implicit differential equation,

H_x (∂x/∂τ) + ∂h/∂τ = 0    (2)

where H_x ∈ ℝ^{n×n} is the Jacobian of h with respect to x.
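As an illustration of this incremental scheme (not an algorithm from the paper), the sketch below tracks a solution of h(x, τ) = 0 by warm-starting a Newton-type corrector at each τ step; the function names and the convex-combination homotopy in the toy example are our own assumptions, and serious path following would instead use a predictor-corrector code such as HOMPACK [10].

```python
import numpy as np
from scipy.optimize import fsolve

def track_homotopy(h, x0, n_steps=100):
    """Naive continuation: solve h(x, tau) = 0 for increasing tau,
    warm-starting each solve at the solution for the previous tau."""
    x = np.atleast_1d(np.asarray(x0, dtype=float))
    path = [(0.0, x.copy())]
    for tau in np.linspace(0.0, 1.0, n_steps + 1)[1:]:
        x, _, ok, msg = fsolve(lambda z: h(z, tau), x, full_output=True)
        if ok != 1:                      # corrector failed: path lost or bifurcated
            raise RuntimeError(f"tracking failed near tau = {tau:.3f}: {msg}")
        path.append((tau, x.copy()))
    return path

# Toy homotopy: g(x) = x - 1 (known root), f(x) = x**3 - 2 (desired root).
h = lambda x, tau: (1.0 - tau) * (x - 1.0) + tau * (x**3 - 2.0)
print(track_homotopy(h, x0=1.0)[-1])     # ends near the root of f, x = 2**(1/3)
```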
Figure 1: Common problematic homotopy paths (labeled (A), (A'), (B), (C) and (D) in the text).

Homotopy methods are advantageous as they are possibly globally convergent and can be constructed to be exhaustive. However, computing a final solution using this approach is successful only if solutions for h(x, τ) exist for all τ and connect initial solutions to final solutions. Proving this connection is non-trivial and requires significant theory from differential topology [8,9]. Although general homotopy theory allows higher-dimensional manifold solutions to be tracked as τ varies, practical numerical algorithms require that the solutions form bounded, well-behaved paths as τ is varied [5,10]. Bifurcations and path crossings result in difficulties; depending on the choice of exit branch, the solution being tracked can change from a local minimum to a local maximum at such a point. Full rank and continuity of the Jacobian [H_x  ∂h/∂τ] is necessary and sufficient for well-behaved paths [5]. This condition implies that there cannot be functional redundancy within the parameter set x. Typical undesirable paths are illustrated in Figure 1. A homotopy with solution paths (A) and (A') does not provide a connection between the initial and final solutions, while along (B) there exists a functional relationship between parameters of the solutions of the homotopy equations. A path crossing occurs at (C), and a bifurcation at (D).

III MULTILAYER PERCEPTRON NETWORK HOMOTOPY FORMULATION

In the following, all MLP networks have one hidden layer of sigmoidal node transfer functions and linear output nodes. The network has n inputs, m hidden nodes and k linear outputs. The weight w_ij connects input node j to hidden node i, and c_ij maps from hidden node j to output node i. The input data x[i] and the desired data output y[i], i = 1, 2, …, L, are collected in the data matrices

X = [x_1 x_2 … x_L] ∈ ℝ^{n×L}    (3)
Y = [y_1 y_2 … y_L]^T ∈ ℝ^{k×L}    (4)

The inputs are mapped by the weights W to hidden node activations Γ, which are then mapped via the output layer weights C to produce the output Z, as described by the following feedforward equations:

Γ = W X ∈ ℝ^{m×L}    (5)
Φ(Γ) = σ(Γ^T) ∈ ℝ^{L×m}    (6)
Z = C Φ^T ∈ ℝ^{k×L}    (7)

The error matrix E ∈ ℝ^{k×L} and least squares error criterion ε² are defined, respectively, by

E = Y − C Φ^T    (8)
ε² = tr{E^T E}    (9)

The natural homotopy mapping between linear and nonlinear networks is defined by parametrizing the neural network hidden node nonlinearity in terms of τ:

σ(x) ≡ σ(x, τ) = x when τ = 0,    σ(x, τ) = f(x) when τ = 1    (10)

where f is the node nonlinearity of the final network, assumed to be monotonically increasing and saturating at ±1 for large positive/negative values. We use the mapping σ(x, τ) = (1 − τ)x + τ tanh(x) in our examples.
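For concreteness, a minimal numerical sketch of the parametrized forward pass (5)-(9) with this tanh homotopy follows; the sizes, names and random data are illustrative assumptions, not values from the paper.

```python
import numpy as np

def sigma(a, tau):
    # Natural homotopy nonlinearity (10): linear at tau = 0, tanh at tau = 1.
    return (1.0 - tau) * a + tau * np.tanh(a)

def forward_error(W, C, X, Y, tau):
    """Feedforward equations (5)-(7) and the squared-error criterion (9)."""
    Gamma = W @ X                     # hidden node activations, (m x L)
    Phi = sigma(Gamma.T, tau)         # (L x m)
    Z = C @ Phi.T                     # network outputs, (k x L)
    E = Y - Z                         # error matrix (8)
    return float(np.trace(E.T @ E)), Z

# Illustrative sizes: n = 3 inputs, m = 2 hidden nodes, k = 1 output, L = 5 patterns.
rng = np.random.default_rng(0)
X, Y = rng.standard_normal((3, 5)), rng.standard_normal((1, 5))
W, C = rng.standard_normal((2, 3)), rng.standard_normal((1, 2))
for tau in (0.0, 0.5, 1.0):
    print(tau, forward_error(W, C, X, Y, tau)[0])
```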
ples.
As described in Section II (and more completely in [3]), it is essential that functional dependency amongst parameters be eliminated to ensure that solutions form paths. For a data matrix X which is not full rank, linear dependency is removed by performing a reduced QR-decomposition of X^T such that X^T = QR, where Q ∈ ℝ^{L×s}, R ∈ ℝ^{s×n} and rank Q = s, to generate linearly independent coordinates Ω = R W^T ∈ ℝ^{s×m}.
The weights Ω ∈ ℝ^{s×m} are linearly independent weight combinations, where each column ω_i corresponds to a set of nonredundant weights leading to a single hidden layer node, each of which is excited by the row basis of the input data matrix X. In the following sections, it is assumed that the linearly nonredundant weights are used.
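A small numerical sketch of this reduction is given below. The paper specifies a reduced QR factorization of X^T; here an SVD is substituted as the rank-revealing step, and the function name and example data are assumptions made purely for illustration.

```python
import numpy as np

def reduced_weights(X, W, rtol=1e-12):
    """Remove linear redundancy in the data: factor X^T = Q R with rank(Q) = s,
    then form the nonredundant coordinates Omega = R W^T (Section III)."""
    U, svals, Vt = np.linalg.svd(X.T, full_matrices=False)
    s = int(np.sum(svals > rtol * svals[0]))     # numerical rank of X
    Q = U[:, :s]                                 # (L x s), orthonormal columns
    R = np.diag(svals[:s]) @ Vt[:s, :]           # (s x n), so that X^T ~= Q R
    Omega = R @ W.T                              # (s x m) reduced weights
    return Q, R, Omega

X = np.array([[1.0, 2.0],
              [1.5, 3.0]])          # rank-deficient input data (n = L = 2)
W = np.array([[0.3, -0.7]])         # one hidden node: m = 1, n = 2
Q, R, Omega = reduced_weights(X, W)
print(Q.shape, R.shape, Omega)      # (2, 1) (1, 2) [[...]]
```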
The necessary homotopy equations for optimization are found by setting the differential equal to zero:

E Φ = 0    (11)
(I_m ⊗ Q)^T R_Φ vec(E^T C) = 0    (12)

where

R_Φ = diag{vec σ'(Γ^T)}    (13)

Note that since the output node is linear, it is possible to solve explicitly for the optimal weights C in terms of the hidden node weights, reducing the number of equations to be solved.
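A sketch of this reduction, under the assumption that Φ has full column rank so that Φ^T Φ is invertible: setting the derivative of (9) with respect to C to zero is exactly (11), E Φ = (Y − C Φ^T)Φ = 0, which gives

C = Y Φ (Φ^T Φ)^{-1}

Substituting this C back into (8)-(9) leaves ε² as a function of the hidden layer weights Ω (and τ) alone, which is the form used in the example of Section V.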
IV GEOMETRIC INTERPRETATION

Due to the nonlinearities in the homotopy equations, it is generally not possible to solve explicitly for the parameters {Ω, C}, verify functional independence, or establish the existence of solutions. Thus it is not clear that the linear system solutions initialize paths which can be followed to solve the final system. In this section a geometric interpretation of these equations is described which provides insight into the topological nature of the weight solutions and their bifurcation behavior.

The manifolds below, each of which is associated with a particular network and data set, are important in the following discussion:

• Y_{Ω,c}(τ) ≡ {σ(QΩ)c | c ∈ ℝ^m, Ω ∈ ℝ^{s×m}} is the manifold generated by varying both the input and one node's output layer weights over the allowable weight space.

• Y_c(Ω, τ) ≡ {σ(QΩ)c | c ∈ ℝ^m} = Φ ℝ^m is the manifold generated by varying the output layer weights for a fixed input layer weight set Ω.

• U ≡ {σ(QΩ) Φ^† y | Ω ∈ ℝ^{s×m}} is the manifold generated by varying the input layer weights, but solving explicitly for an ideal set of output node weights in terms of the input layer weights.

• W ≡ {σ(Qω) | ω ∈ ℝ^s} is the manifold generated by varying the input layer weights for a single hidden node or single layer perceptron [2-4].

For simplicity, consider the case k = 1 (one output node) first, with Y ≡ y^T, E ≡ e^T and C ≡ c^T. Geometrically, minimization of the least squares error norm defines a projection of the desired output data vector onto a data set generated by varying the parameters over their allowable range. Solutions to (11)-(12) define projections of y onto Y_{Ω,c}(τ), or equivalently, an orthogonal projection of y onto the surface U. This perspective was successfully used to analyze the SLP [3,4]. However, due to the high dimensionality of MLP manifolds, their self-intersection (due to non-uniqueness of solutions), and the resulting lack of globally unique coordinates, this perspective is not as useful in the MLP case. Therefore, we present a different geometric perspective for the MLP problem which uses results arising from the geometric analysis of the SLP, and which clearly delineates the influence of the input and output layer weights. This perspective is illustrated in Figure 2 and is described below.

Since each hidden layer neuron receives the same inputs, W = σ(Qℝ^s) is the same for each hidden neuron. Associated with each of the m hidden neurons is a weight column ω_i, defining a vector in ℝ^L from the origin to the point σ(Qω_i) on W. The network output is formed by linear combinations of these vectors using the weights c. It follows that Y_c(Ω, τ) is an m-dimensional subspace. The optimal output weights result from the projection of the desired data vector y onto this subspace. Simultaneous optimization of both input and output weights corresponds to selecting a subspace of dimension p ≤ m, generated by p vectors on W = σ(Qℝ^s), such that their span passes closest to the desired vector y. In Fig 2 two vectors corresponding to two hidden units are found on the surface σ(Qℝ^s), and Y_c(Ω, τ) is the hyperplane spanned by these units. When k > 1 (multiple outputs), there are multiple projections y_p onto a common hyperspace, one for each of the different desired output vectors y of the output nodes.
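To make this projection picture concrete, the short sketch below computes y_p for a fixed Ω by ordinary least squares, using the tanh homotopy of Section III; the sizes and random data are illustrative assumptions only, not the configuration of Figure 2.

```python
import numpy as np

def project_onto_Yc(Omega, Q, y, tau):
    """Projection y_p of y onto the subspace Y_c(Omega, tau) spanned by the
    hidden node vectors sigma(Q omega_i, tau), as in Figure 2."""
    Phi = (1.0 - tau) * (Q @ Omega) + tau * np.tanh(Q @ Omega)   # (L x m)
    c, *_ = np.linalg.lstsq(Phi, y, rcond=None)                  # optimal output weights
    return Phi @ c

# Illustrative setting: L = 4 patterns, s = 2 reduced inputs, m = 2 hidden nodes.
rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.standard_normal((4, 2)))   # orthonormal data basis (L x s)
Omega = rng.standard_normal((2, 2))                # reduced input weights (s x m)
y = rng.standard_normal(4)
y_p = project_onto_Yc(Omega, Q, y, tau=0.7)
print(np.linalg.norm(y - y_p))                     # residual error for this Omega
```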
Using this interpretation it is possible to obtain a topological characterization of the weight solutions. When τ = 0, W is a subspace; linear combinations of vectors in this plane simply select specific subspaces of the plane (which explains why hidden nodes do not modify the mapping of the linear network). For k = 1, all the solutions to the homotopy equations at τ = 0 form an equivalence class satisfying Ωc = k. Here k is the optimum linear combination corresponding to the unique, minimum error projection onto the hyperspace W. In the more general multiple output (k > 1) case, the problem is more complex. Multiple equivalence classes of different rank, corresponding to local minima and saddle points, result; cf. [11,12].

When τ > 0, W corresponds to a smooth distortion of the linear subspace into a manifold (with boundary) for which the weights form an allowable set of curvilinear coordinates ([2], Thm 1). In this case, the weight solutions are determined by the topology of the intersection of W and the plane Y_c(Ω*, τ) generated by the optimal weights Ω*. In Figure 2 it is clear that the intersection of Y_c(Ω*, τ) and W contains an infinite number of hidden node vectors (points of intersection) that will result in the same Y_c(Ω*, τ), and hence, the same performance. These vectors correspond to a manifold of weight solutions in ℝ^{s×m} whose dimension is at least as large as the dimension of the intersection of the optimal plane Y_c(Ω*, τ) and the manifold W. In addition to this independent variation of the columns of the weight matrix Ω, constrained variation amongst columns can further increase the dimension of this manifold. Intuitively, however, the constraints imposed by the nonlinearities would imply that the weight solution manifolds have lower dimension for the nonlinear than for the linear case. A formal theoretical analysis [12], too lengthy for inclusion in this paper, motivates this geometrical perspective. These results are illustrated graphically in Figure 3, and demonstrated in the next section using an example.

Figure 2: Intersection of the hyperspace σ(QΩ)ℝ^m = Y_c(Ω, τ) generated by two hidden layer nodes (indicated by vectors) and the single layer perceptron mapping W of the input data surface. The point y_p is the projection of y onto the hyperspace Y_c(Ω, τ). The embedding space is ℝ^L.

Figure 3: Illustration of the solution problem for the MLP network. When τ = 0, the solution set in weight space consists of equivalence classes of solutions forming manifolds of various dimensions. When τ is increased, each equivalence class can give rise to extensions of the original manifolds, lower dimensional manifolds of solutions and isolated solutions, and multiple solutions could result from every equivalence class.
V HOMOTOPY PATH BEHAVIOR

We now use the geometric interpretation of Section IV to construct a simple example illustrating the bifurcation behavior of MLPs. The network and data variable definitions are shown in Fig 4, as is the equivalent network using weights c and β. For this example Q^T ∝ [1 α] (cf. Fig 4), L = 2, s = 1 and m = 1.

Figure 4: Original network and equivalent network after removal of linear redundancy.
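This construction can be reproduced numerically: the output weight c is eliminated by least squares and the error is swept over β for several values of τ. The specific α, y and τ values below are illustrative assumptions, not the values behind Figures 5-8.

```python
import numpy as np

def sigma(a, tau):
    return (1.0 - tau) * a + tau * np.tanh(a)   # natural homotopy nonlinearity (10)

def example_error(beta, tau, Q, y):
    """Error of the scalar example: hidden vector sigma(Q*beta, tau) in R^2,
    output weight c eliminated by least squares, error = ||y - c*phi||^2."""
    phi = sigma(Q * beta, tau)
    c = float(phi @ y) / float(phi @ phi)       # optimal c for this beta (cf. (11))
    r = y - c * phi
    return float(r @ r)

# Illustrative data: Q along [1, alpha] with alpha > 1; y off the linear span.
alpha = 2.0
Q = np.array([1.0, alpha]) / np.sqrt(1.0 + alpha**2)
y = np.array([1.0, 1.5])
betas = np.linspace(0.05, 3.5, 400)
for tau in (0.0, 0.3, 0.6, 0.9):
    errs = [example_error(b, tau, Q, y) for b in betas]
    print(f"tau = {tau:.1f}: min error over beta grid = {min(errs):.4f}")
```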
The different associated data manifolds described in Section IV are illustrated for a fixed arbitrary value of τ in Figure 5. From symmetry considerations only values of β > 0 are considered. We present the projection operator, error surface, and homotopy path descriptions concurrently and discuss their relationship. Using all of these perspectives facilitates complete understanding of the weight optimization process.

Since c can be solved explicitly in terms of β, we use β as the only optimization variable. For this example α > 1, and y is chosen such that at τ = 0 there is no zero-error solution for β. However, y ∈ Y_{β,c}(τ) when 0 < τ_c < τ, as depicted in Figure 6. Based on the geometric interpretation of the optimization process, for τ < τ_c an optimal weight results from the projection y_p of the desired vector y onto ∂Y_{β,c}. For τ < τ_c, there is a unique, isolated solution β. As illustrated, when τ > τ_c, the problem can be solved with zero error by two choices of β, whose corresponding hidden node vectors span the same space containing y.

Fig 7 contains the error surfaces for different values of β and τ. The error is discontinuous in β at β = 0 (the network input and output are disconnected). When τ = 0, the hidden node weight solution set is β ∈ ℝ\{0}, since for each β ≠ 0 a c exists such that βc = k₀, where k₀ is the optimum linear scaling. Below the critical value of τ = τ_c one local minimum occurs, generated by the projection of y onto the boundary ∂Y_{β,c}. When τ = τ_c zero error results, and for τ_c < τ < 1 two local minima, separated by a local maximum, appear. As β → ∞, the hidden node vector approaches span{Q}; a constant error equal to that of the linear case appears.

Figure 5: Notation for the simple example. The hidden node (SLP) surface W (heavy line) for a given τ is generated by deforming span{Q}; the shaded region Y_{β,c} is the set of all possible outputs that can be generated by the network by varying c and β over their allowable values. For a fixed β, a specific hidden node vector in W is selected (indicated by the arrow), and varying c for this fixed β generates the set Y_c(β) (dashed line). The desired vector y cannot be generated with zero error in this example; the minimum error results from the projection y_p of y onto ∂Y_{β,c}.

Figure 6: Physical bifurcation interpretation of the neural system. The data surfaces σ(Qℝ) for different values of τ are shown as solid lines. When 0 < τ < τ_c, a unique optimum hidden node vector v_1 generates a span with smallest distance to y, and has projection y_p with non-zero error. At τ_c the error is zero, and bifurcation occurs. For τ > τ_c, the error is zero, but two hidden node vectors (labeled v_3) can generate the same span containing y, and correspond to two branches of solution.
Figure 7: Error surface for the hidden layer node weight β as τ is varied. The error is discontinuous at β = 0. When τ = 0, the hidden node surface is a plane, and any β ∈ ℝ\{0} is a solution; for 0 < τ < τ_c one local minimum occurs. When τ = τ_c zero error results, and for τ_c < τ < 1 two local minima, separated by a local maximum, appear.

Homotopy paths for this example are shown in Fig 8. When τ = 0, the solution set is β ∈ ℝ\{0}; for τ positive but less than τ_c there is only one solution, corresponding to the minimum of the error surface. Therefore τ = 0 is a bifurcation point of the equations. A bifurcation also occurs at τ = τ_c. Three paths result, two of which are minima of zero error (paths (a) and (c)). Path (b) corresponds to the separating local maximum for τ_c < τ < 1. Paths (a) and (b) diverge to infinity as τ ↑ 1, due to the saturation effect of the node nonlinearity. Path (c) leads to a finite solution.

Note that the set of y ∈ ℝ² exhibiting this behavior does not have measure zero; this set is given by y ∈ ∪_{0<τ<1} Y_{β,c}(τ) with α > 1, and forms a cone in weight space.

Figure 8: Homotopy paths for the hidden layer node, showing the different solution states: at τ = 0 the solution set is β ∈ ℝ\{0}, while for 0 < τ < τ_c there is only one solution. A bifurcation occurs when τ = τ_c = 0.67, resulting in three bifurcation paths: two, (a) and (c), corresponding to minima of zero error, and one, (b), to a local maximum for τ_c < τ < 1. Paths (a) and (b) diverge to infinity at τ = 1, while path (c) leads to a finite solution.
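As a computational counterpart to Figure 8 (again with illustrative, assumed data rather than the paper's), the following sketch follows a single path β(τ) by warm-started local minimization; such naive tracking picks only one branch at each bifurcation and gives no guarantee that the preferred branch is followed, which is precisely the difficulty discussed above.

```python
import numpy as np
from scipy.optimize import minimize

def path_error(beta, tau, Q, y):
    # Scalar-example error with the output weight c eliminated (cf. Section V).
    phi = (1.0 - tau) * Q * beta + tau * np.tanh(Q * beta)
    c = float(phi @ y) / float(phi @ phi)
    r = y - c * phi
    return float(r @ r)

def track_beta_path(Q, y, beta0=1.0, n_steps=50):
    """Follow one homotopy path beta(tau) by warm-started local minimization."""
    path, beta = [], beta0
    for tau in np.linspace(0.0, 1.0, n_steps + 1):
        res = minimize(lambda b: path_error(b[0], tau, Q, y),
                       x0=[beta], method="Nelder-Mead")
        beta = float(res.x[0])
        path.append((tau, beta, float(res.fun)))
    return path

alpha = 2.0                          # assumed data, as in the previous sketch
Q = np.array([1.0, alpha]) / np.sqrt(1.0 + alpha**2)
y = np.array([1.0, 1.5])
for tau, beta, err in track_beta_path(Q, y)[::10]:
    print(f"tau = {tau:.2f}  beta = {beta:.3f}  error = {err:.4f}")
```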
VI CONCLUSIONS

In this paper a geometric interpretation of the weight solution process was presented, and related to error surface and homotopy path descriptions. Using this formulation and a carefully constructed example we illustrated that, for linear networks, solutions generally consist of equivalence classes of unbounded, higher dimensional manifolds, while for nonlinear networks lower dimensional manifolds result. Thus, in general, the linear system is a bifurcation point of the homotopy equations. Finding a path (or multiple paths and lower dimensional manifolds) emanating from one of the linear equivalence classes is difficult. Such an exit point is not known a priori; there is no guarantee that all exiting solution paths are found, nor that preferred, "good" solution paths are tracked (at best it can be claimed that solutions with large basins of attraction at τ = 0 will probably be found). Solutions can undergo arbitrarily large changes of magnitude as τ is varied (solution discontinuity), and the underlying motivation for using the homotopy approach is lost.

Further, bifurcations in the homotopy path can occur for τ ∈ (0, 1) for large sets of desired values and input vectors, and thus cannot be ignored. Multiple exit paths from each bifurcation point can appear; even if only the minima exiting the bifurcation point are tracked, some of these paths may diverge to infinity. Numerically, it is very difficult to deal with these problems.

Yang and Yu [1] found the natural homotopy useful in obtaining faster convergence during training. Based on the analysis presented in this paper, it is clear that this method suffers from some fundamental difficulties. However, this does not mean that the method does not provide a viable practical heuristic to aid in obtaining convergence, simply that no strong claims as to the global convergence and exhaustive nature often guaranteed by homotopy methods can be made. Naturally, other homotopy methods may be formulated that do not suffer from the same difficulties.
References

[1] L. Yang and W. Yu, "Backpropagation with homotopy," Neural Computation, vol. 5, pp. 363–366, 1993.
[2] F. M. Coetzee and V. L. Stonick, "A geometric view of neural networks using homotopy," in Proc. 1993 IEEE Workshop on Neural Networks for Signal Processing III (C. A. Kamm, G. M. Kuhn, B. Yoon, R. Chellappa, and S. Y. Kung, eds.), September 1993, pp. 118–127.
[3] F. M. Coetzee and V. L. Stonick, "On a natural homotopy between linear and nonlinear single layer networks," submitted to IEEE Trans. Neural Networks, 1993.
[4] F. M. Coetzee and V. L. Stonick, "On the uniqueness of weights in single-layer perceptrons," submitted to IEEE Trans. Neural Networks, 1993.
[5] C. B. Garcia and W. I. Zangwill, Pathways to Solutions, Fixed Points and Equilibria. Prentice-Hall, 1981.
[6] A. Morgan, Solving Polynomial Systems Using Continuation for Engineering and Scientific Problems. Prentice Hall, 1987. ISBN 0-13-822313-0.
[7] S. L. Richter and R. A. DeCarlo, "Continuation methods: theory and applications," IEEE Trans. Circuits and Systems, vol. CAS-30, pp. 347–352, June 1983.
[8] J. C. Alexander, "The topological theory of an embedding method," in Continuation Methods (H. Wacker, ed.), ch. 2, pp. 37–68. Academic Press, 1978.
[9] J. C. Alexander and J. A. Yorke, "The homotopy continuation method: numerically implementable topological procedures," Trans. AMS, vol. 242, pp. 271–284, August 1978.
[10] L. T. Watson, S. C. Billups, and A. P. Morgan, "Algorithm 652: HOMPACK: A suite of codes for globally convergent homotopy algorithms," ACM Trans. Math. Software, vol. 13, pp. 281–310, 1987.
[11] P. Baldi and K. Hornik, "Neural networks and principal component analysis: Learning from examples without local minima," Neural Networks, vol. 2, pp. 53–58, 1989.
[12] F. M. Coetzee and V. L. Stonick, "Topology and geometry of weight solutions in multi-layer networks," in preparation, 1993.
