Abstract— It has been reported [1] that weight convergence of multilayer perceptron (MLP) networks can be improved by smoothly changing the node nonlinearity from a linear to a sigmoidal function during training. While this approach might provide a useful training heuristic, formally this method depends on an underlying homotopy which transforms a linear network into a nonlinear network of the same architecture. As a parameter (the homotopy parameter) is varied from zero to one, the linear network weights are mapped onto nonlinear network weight solutions. In this paper, a geometric interpretation of the optimization equations is used to construct and describe an example network that illustrates practical and theoretical difficulties resulting from bifurcation of solution paths. Since the linear system is a generic high-order bifurcation point of the homotopy equations, solution paths are discontinuous at initialization. Bifurcations and infinite solution sets for intermediate values $\tau \in (0,1)$ also can occur for data sets which are not of measure zero. These results weaken the guarantees on global convergence and exhaustive behavior normally associated with homotopy methods. The geometric perspective further provides insight into the relationship between linear and nonlinear perceptron networks, and how weight solutions arise in each.

I INTRODUCTION

…proach as a practical heuristic to improve convergence during training of multi-layer networks, the theoretical and practical requirements ensuring validity of this homotopy approach have not been addressed. Furthermore, the natural homotopy was not used to provide insight into the optimization process and mapping abilities of multi-layer neural networks. In this paper the theoretical geometric perspectives we developed for the SLP are extended to the MLP. Although the natural homotopy might yield a useful training heuristic, we illustrate that it suffers from a number of theoretical and practical difficulties. Specifically, the linear system forms a bifurcation point of the equations; solutions at the initial system are, in general, discontinuous. Bifurcations and infinite solutions also can occur for intermediate networks on data sets which are not of measure zero. These results weaken the guarantees on global convergence and exhaustive behavior normally associated with homotopy methods. This same geometric perspective clarifies understanding of the relationship between linear and nonlinear neural networks by comparing and contrasting how weight solutions arise.

Basic homotopy methods are briefly reviewed in Section II. The resulting geometric perspective of the weight optimization process is presented in Section IV. In Section V carefully constructed examples are used to illustrate homotopy path behavior. Implications of these results on the robustness of the natural homotopy approach are discussed in Section VI.
Differentiating $h(x,\tau) = 0$ along a solution path gives

$$H_x \frac{\partial x}{\partial \tau} + \frac{\partial h}{\partial \tau} = 0 \qquad (2)$$

where $H_x \in \Re^{n \times n}$ is the Jacobian with respect to $x$ of $h$. The data matrices are

$$X = [\,x_1\ \ x_2\ \ \dots\ \ x_L\,] \in \Re^{n \times L} \qquad (3)$$

$$Y = [\,y_1\ \ y_2\ \ \dots\ \ y_k\,]^T \in \Re^{k \times L} \qquad (4)$$
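A minimal numerical sketch may help fix ideas about the path-following equation (2). The code below is our own illustration, not the paper's algorithm, and uses a deliberately benign scalar homotopy $h(x,\tau) = x + \tau x^3 - 1$ (an assumption chosen so that $H_x = 1 + 3\tau x^2$ never vanishes and the path is well behaved):

```python
def track_path(h, h_x, h_tau, x0, steps=200):
    """Follow a scalar solution path of h(x, tau) = 0 from tau = 0 to tau = 1.

    Euler predictor from equation (2): dx/dtau = -h_tau / h_x,
    followed by a few Newton corrector iterations at the new fixed tau.
    """
    x, dtau = float(x0), 1.0 / steps
    for i in range(steps):
        tau = i * dtau
        x += -(h_tau(x, tau) / h_x(x, tau)) * dtau   # Euler predictor step
        tau += dtau
        for _ in range(5):                           # Newton corrector at fixed tau
            x -= h(x, tau) / h_x(x, tau)
    return x

# Illustrative homotopy (not from the paper): deform x - 1 = 0 into x + x**3 - 1 = 0.
h     = lambda x, tau: x + tau * x**3 - 1.0
h_tau = lambda x, tau: x**3
h_x   = lambda x, tau: 1.0 + 3.0 * tau * x**2

x_final = track_path(h, h_x, h_tau, x0=1.0)  # start at the tau = 0 root x = 1
```

Because $H_x$ stays full rank here, the tracked path is a single smooth curve; the difficulties analyzed in this paper arise precisely when that condition fails.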
[Figure 1: typical undesirable homotopy solution paths, with branches labeled A, A′, B, C, D.]

The inputs are mapped by the weights $W$ to hidden node activations $\Gamma$, which are then mapped via the output layer weights $C$ to produce the output $Z$, as described by the following feedforward equations:

$$\Gamma = W X \in \Re^{m \times L} \qquad (5)$$

$$\Phi_\Gamma = \Phi(\Gamma^T) \in \Re^{L \times m} \qquad (6)$$

$$Z = C\,\Phi_\Gamma^T \in \Re^{k \times L} \qquad (7)$$
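The feedforward equations (5)-(7) can be sketched directly in code. The dimensions, random weight values, and function names below are illustrative placeholders, not taken from the paper; at $\tau = 0$ the hidden nonlinearity is the identity and the network collapses to the linear map $C W X$:

```python
import numpy as np

def phi(x, tau):
    # Natural homotopy nonlinearity: linear at tau = 0, tanh at tau = 1.
    return (1.0 - tau) * x + tau * np.tanh(x)

def forward(W, C, X, tau):
    """Feedforward pass: Gamma = W X, Phi_Gamma = phi(Gamma^T), Z = C Phi_Gamma^T."""
    Gamma = W @ X                   # (m, L) hidden node activations, eq. (5)
    Phi_Gamma = phi(Gamma.T, tau)   # (L, m) nonlinearity applied elementwise, eq. (6)
    return C @ Phi_Gamma.T          # (k, L) network output, eq. (7)

rng = np.random.default_rng(0)
n, m, k, L = 3, 4, 2, 5             # arbitrary example sizes
W = rng.normal(size=(m, n))
C = rng.normal(size=(k, m))
X = rng.normal(size=(n, L))

Z0 = forward(W, C, X, tau=0.0)      # linear network
assert np.allclose(Z0, C @ W @ X)   # at tau = 0 the MLP reduces to C W X
```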
Homotopy methods are advantageous as they are possibly globally convergent and can be constructed to be exhaustive. However, computing a final solution using this approach is successful only if a solution of $h(x,\tau) = 0$ exists for all $\tau$ and connects initial solutions to final solutions. Proving this connection is non-trivial and requires significant theory from differential topology [8,9]. Although general homotopy theory allows higher-dimensional manifold solutions to be tracked as $\tau$ varies, practical numerical algorithms require that the solutions form bounded, well-behaved paths as $\tau$ is varied [5,10]. Bifurcations and path crossings result in difficulties; depending on the choice of exit branch, the solution being tracked can change from a local minimum to a local maximum at such a point. Full rank and continuity of the Jacobian $H_x \equiv \left[\frac{\partial h}{\partial x}\right]$ is necessary and sufficient for well-behaved paths [5]. This condition implies that there cannot be functional redundancy within the parameter set $x$. Typical undesirable paths are illustrated in Figure 1. A homotopy with solution …

The natural homotopy mapping between linear and nonlinear networks is defined by parametrizing the neural network hidden node nonlinearity in terms of $\tau$:

$$\phi(x) \rightarrow \phi(x;\tau) = \begin{cases} x & \text{when } \tau = 0 \\ f(x) & \text{when } \tau = 1 \end{cases} \qquad (10)$$

where $f$ is the node nonlinearity of the final network, assumed to be monotonically increasing and saturating at $\pm 1$ for large positive/negative arguments. We use the mapping $\phi(x;\tau) = (1-\tau)x + \tau\tanh(x)$ in our examples.

As described in Section II (and more completely in [3]), it is essential that functional dependency amongst parameters be eliminated to ensure that solutions form paths. For a data matrix $X$ which is not full rank, linear dependency is removed by performing a reduced QR-decomposition of $X^T$ such that $X^T = QR$, where $Q \in \Re^{L \times s}$, $R \in \Re^{s \times n}$, and $\operatorname{rank} Q = s$, to generate linearly independent coordinates $\Omega = R W^T \in \Re^{s \times m}$. The weights $\Omega \in \Re^{s \times m}$ are linearly independent weight
combinations, where each column $\Omega_i$ corresponds to a set of nonredundant weights leading to a single hidden layer node, each of which is excited by the row basis of the input data matrix $X$. In the following sections, it is assumed that the linearly nonredundant weights are used.

The necessary homotopy equations for optimization are found by setting the differential equal to zero:

$$E\,\Phi_\Gamma = 0 \qquad (11)$$

$$(I_m \otimes Q)^T\, R_\Phi\, \operatorname{vec}(E^T C) = 0 \qquad (12)$$

where

$$R_\Phi = \operatorname{diag}\{\operatorname{vec}\,\Phi'(\Gamma^T)\} \qquad (13)$$

and $E$ is the network output error. Note that since the output node is linear, it is possible to solve explicitly for the optimal weights $C$ in terms of the hidden node weights, reducing the number of equations to be solved.

IV GEOMETRIC INTERPRETATION

Due to the nonlinearities in the homotopy equations, it is generally not possible to solve explicitly for the parameters $\{\Omega, C\}$, verify functional independence, nor establish existence of solutions. Thus it is not clear that the linear system solutions initialize paths which can be followed to solve the final system. In this section a geometric interpretation of these equations is described which allows for insight into the topological nature of the weight solutions and their bifurcation behavior.

The manifolds below, each of which is associated with a particular network and data set, are important in the following discussion:

$Y_{\Omega,c}(\tau) \equiv \{\Phi(Q\Omega)c \mid \forall c \in \Re^m,\ \forall \Omega \in \Re^{s \times m}\}$ is the manifold generated by varying both the input and one node's output layer weights over the allowable weight space.

$Y_c(\Omega,\tau) \equiv \{\Phi(Q\Omega)c \mid \forall c \in \Re^m\} = \Phi_\Gamma \Re^m$ is the manifold generated by varying the output layer weights for a fixed input layer weight set $\Omega$.

$U \equiv \{\Phi(Q\Omega)\,\Phi(Q\Omega)^\dagger\, y \mid \forall \Omega \in \Re^{s \times m}\}$ is the manifold generated by varying the input layer weights, but solving explicitly for an ideal set of output node weights in terms of the input layer weights.

$W \equiv \{\Phi(Q\Omega) \mid \forall \Omega \in \Re^s\}$ is the manifold generated by varying the input layer weights for a single hidden node or single layer perceptron [2-4].

For simplicity, consider the case $k = 1$ (one output node) first, with $Y \equiv y^T$, $E \equiv e^T$ and $C \equiv c^T$. Geometrically, minimization of the least squares error norm defines a projection of the desired output data vector onto a data set generated by varying the parameters over their allowable range. Solutions to (11)-(12) define projections of $y$ onto $Y_{\Omega,c}(\tau)$, or equivalently, an orthogonal projection of $y$ onto the surface $U$. This perspective was successfully used to analyze the SLP [3,4]. However, due to the high dimensionality of MLP manifolds, their self-intersection (due to non-uniqueness of solutions), and the resulting lack of globally unique coordinates, this perspective is not as useful in the MLP case. Therefore, we present a different geometric perspective for the MLP problem which uses results arising from the geometric analysis for the SLP, and which clearly delineates the influence of the input and output layer weights. This perspective is illustrated in Figure 2 and is described below.

Since each hidden layer neuron receives the same inputs, $W = \Phi(Q\Re^s)$ is the same for each hidden node. Associated with each of the $m$ hidden neurons is a weight column $\Omega_i$, defining a vector in $\Re^L$ from the origin to the point $\Phi(Q\Omega_i)$ on $W$. The network output is formed by linear combinations of these vectors using weights $c$. It follows that $Y_c(\Omega,\tau)$ is an $m$-dimensional subspace. The optimal output weights result from projection of the desired data vector $y$ onto this subspace. Simultaneous optimization of both input and output weights corresponds to selecting a subspace of dimension $p \leq m$ generated by $p$ vectors on $W = \Phi(Q\Re^s)$ such that their span passes closest to the desired vector $y$. In Fig 2 two vectors corresponding to two hidden units are found on the surface $\Phi(Q\Re^s)$, and $Y_c(\Omega,\tau)$ is the hyperplane spanned by these units. When $k > 1$ (multiple outputs), there are multiple projections $y_p$ onto a common hyperspace, one for each of the different desired output vectors $y$ of the output nodes.

[Figure 2: hidden node vectors on the surface $W$, the desired vector $y$ and its projection $y_p$, for $\tau = 0$ and $\tau > 0$.]

Using this interpretation it is possible to obtain a topological characterization of the weight solutions. When $\tau = 0$, $W$ is a subspace; linear combinations of vectors in this plane simply select specific subspaces of the plane (which explains why hidden nodes do not modify the mapping of the linear network). For $k = 1$, all the solutions to the homotopy equations at $\tau = 0$ form an equivalence class satisfying $\Omega c = k$. Here $k$ is the optimum linear combination corresponding to the unique, minimum error projection onto the hyperspace $W$. In the more general multiple output ($k > 1$) case, the problem is more complex. Multiple equivalence classes of different rank, corresponding to local minima and saddle points, result – cf. [11,12].

When $\tau > 0$, $W$ corresponds to a smooth distortion of the linear subspace into a manifold (with boundary) for which the weights form an allowable set of curvilinear coordinates ([2], Thm 1). In this case, the weight solutions are determined by the topology of the intersection of $W$ and the plane $Y_c(\Omega^*,\tau)$ generated by the optimal weights $\Omega^*$. In Figure 2 it is clear that the intersection of $Y_c(\Omega^*,\tau)$ and $W$ contains an infinite number of hidden node vectors (points of intersection) that will result in the same $Y_c(\Omega^*,\tau)$, and hence, the same performance. These vectors correspond to a manifold of weight solutions in $\Re^{s \times m}$ whose dimension …
columns can further increase the dimension of this manifold. Intuitively, however, the constraints imposed by the nonlinearities would imply that the weight solution manifolds have lower dimension for the nonlinear than for the linear case. A formal theoretical analysis [12], too lengthy for inclusion in this paper, motivates this geometrical perspective. These results are illustrated graphically in Figure 3, and demonstrated in the next section using an example.

V HOMOTOPY PATH BEHAVIOR

We now use the geometric interpretation of Section IV to construct a simple example illustrating bifurcation behavior of MLPs. The network and data variable definitions are shown in Fig 4, as is the equivalent network using weights $c$ and $\Omega$. For this example $Q^T = [\,1\,]$, $L = 2$, $s = 1$ and $m = 1$.

The different associated data manifolds described in Section IV are illustrated for a fixed arbitrary value of $\tau$ in Figure 5. From symmetry considerations only values of $\Omega > 0$ are considered. We present the projection operator, error surface, and homotopy path descriptions concurrently and discuss their relationship. Using all of these perspectives facilitates complete understanding of the weight optimization process.

Since $c$ can be explicitly solved in terms of $\Omega$, we use $\Omega$ as the only optimization variable. For this example $\alpha > 1$ (cf. the data definitions in Fig 4) and $y$ is chosen such that at $\tau = 0$ there is no zero-error solution for $\Omega$. However, $y \in Y_{\Omega,c}$ when $\tau > \tau_c$ for some $0 < \tau_c < 1$, as depicted in Figure 6. Based on the geometric interpretation of the optimization process, for $\tau < \tau_c$ an optimal weight results from the projection $y_p$ of the desired vector $y$ onto $\partial Y_{\Omega,c}$. For $\tau < \tau_c$, there is a unique, isolated solution $\Omega$. As illustrated, when $\tau > \tau_c$, the problem can be solved with zero error by two choices of $\Omega$, whose corresponding hidden node vectors span the same space containing $y$.

Fig 7 contains the error surfaces for different values of $\tau$ and $\Omega$. The error is discontinuous in $\tau$ at $\tau = 0$ (the network input and output are disconnected). When $\tau = 0$, the hidden node weight solution set is $\Omega \in \Re \backslash \{0\}$, since for each $\Omega \neq 0$ a $c$ exists such that $\Omega c = k_0$. At $\tau = \tau_c$, zero error results, and for $\tau_c < \tau < 1$ two local minima, separated by a local maximum, appear. As $\tau \rightarrow 1$, the hidden node vector approaches $\operatorname{span}\{Q\}$; a constant error equal to that of the linear case appears.

Homotopy paths for this example are shown in Fig 8. When $\tau = 0$, $\Omega = \Re \backslash \{0\}$; for $\Omega$ positive but less than

Figure 4: Original network and equivalent network after removal of linear redundancy.

Figure 5: Notation for simple example. The hidden node (SLP) surface $W$ (heavy line) for a given $\tau$ is generated by deforming $\operatorname{span}\{Q\}$; the shaded region $Y_{\Omega,c}$ is the set of all possible outputs that can be generated by the network by varying $c$ and $\Omega$ over their allowable values. For a fixed $\Omega$, a specific hidden node vector in $W$ is selected (indicated by the arrow), and varying $c$ for this fixed $\Omega$ generates the set $Y_c(\Omega)$ (dashed line). The desired vector $y$ cannot be generated with zero error in this example; the minimum error results from a projection of $y_p$ onto $\partial Y_{\Omega,c}$.

Figure 6: Physical bifurcation interpretation of the neural system. The data surfaces $\Phi(Q\Re)$ for different values of $\tau$ are the solid lines. When $0 < \tau < \tau_c$, a unique optimum hidden node vector $v_1$ generates a span with smallest distance to $y$, and has projection $y_p$ generated with non-zero error. At $\tau_c$ the error is zero, and bifurcation occurs. For $\tau > \tau_c$, the error is zero, but two hidden node vectors (labeled $v_3$) can generate the same span containing $y$, and correspond to two branches of solution.
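The qualitative bifurcation behavior described in this section can be reproduced numerically. The sketch below is our own construction, not the paper's exact example: the direction $Q$, the desired vector $y$, the grid, and the two values of $\tau$ are arbitrary illustrative choices. It solves for the optimal output weight $c$ in closed form at each $\Omega$ and counts strict local minima of the resulting error surface; below the critical value one minimum appears, above it two minima separated by a local maximum:

```python
import numpy as np

def phi(x, tau):
    # Natural homotopy nonlinearity used in the paper's examples.
    return (1.0 - tau) * x + tau * np.tanh(x)

def error_surface(omegas, tau, Q, y):
    """Least-squares error E(Omega) with the output weight solved in closed
    form: for v = phi(Q*Omega), the optimal c is (v.y)/(v.v), giving
    E = |y|^2 - (v.y)^2 / (v.v)."""
    E = np.empty_like(omegas)
    for i, w in enumerate(omegas):
        v = phi(Q * w, tau)
        E[i] = y @ y - (v @ y) ** 2 / (v @ v)
    return E

def count_local_minima(E):
    # Strict interior local minima of the sampled error surface.
    return int(np.sum((E[1:-1] < E[:-2]) & (E[1:-1] < E[2:])))

# Arbitrary illustrative data (not the paper's exact example):
Q = np.array([1.0, 2.0])       # span{Q} direction; L = 2, s = m = k = 1
y = np.array([1.0, 1.8])       # desired vector, slightly off span{Q}
omegas = np.linspace(0.05, 6.0, 2000)

n_small = count_local_minima(error_surface(omegas, 0.1, Q, y))  # tau below tau_c
n_large = count_local_minima(error_surface(omegas, 0.5, Q, y))  # tau above tau_c
```

For small $\tau$ the attainable hidden node directions stay close to $\operatorname{span}\{Q\}$ and the error has a single minimum; for larger $\tau$ the direction of $v(\Omega)$ sweeps out and back as $\Omega$ grows, so the direction of $y$ is attained twice and two zero-error minima appear, mirroring the two solution branches of Figure 6.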