12.1 Introduction
The hierarchical levels of organization in artificial neural networks may be classified as
follows. At the most fundamental level are synapses, followed by neurons, then layers of
neurons in the case of a layered network, and finally the network itself. The design of
neural networks that we have pursued up to this point has been of a modular nature at
the level of neurons or layers only. It may be argued that the architecture of a neural
network should go one step higher in the hierarchical level of organization. Specifically,
it should consist of a multiplicity of networks, and learning algorithms should be
designed to take full advantage of the resulting modular structure. The present chapter is
devoted to a particular class of modular networks that relies on the combined use of
supervised and unsupervised learning paradigms.
We may justify the rationale for the use of modular networks by considering the
approximation problem. The approximation of a prescribed input-output mapping may
be realized using a local method that captures the underlying local structure of the mapping.
Such a model is exemplified by radial-basis function (RBF) networks, which were studied
in Chapter 7. The use of a local method offers the advantage of fast learning and therefore
the ability to operate in real time, since it usually requires relatively few training examples
to learn a single task. However, a limitation of local methods is that they tend to be
memory intensive. Alternatively, the approximation may be realized using a global method
that captures the underlying global structure of the mapping. This second model is exemplified by back-propagation learning applied to multilayer perceptrons, which were studied
in Chapter 6. The use of global methods offers the advantages of a smaller storage
requirement and better generalization performance. However, they suffer from a slow
learning process that limits their range of applications. In light of this dichotomy between
local and global methods of approximation, it is natural to ask: How can we combine the
advantages of these two methods? The answer appears to lie in the use of a modular
architecture that captures the underlying structure of an input-output mapping at an
intermediate level of granularity. The idea of using a modular network for realizing a
complex mapping function was discussed by Hinton and Jacobs as far back as the mid-
1980s (Jacobs et al., 1991a). Mention should also be made of a committee machine
consisting of a layer of elementary perceptrons followed by a vote-taking perceptron in
the second layer, which was described in Nilsson (1965). However, it appears that the
class of modular networks discussed in this chapter was first described in Jacobs and
Jordan (1991), and the architecture for it was presented by Jacobs et al. (1991a).
A useful feature of a modular approach is that it also provides a better fit to a discontinuous input-output mapping. Consider, for example, Fig. 12.1, which depicts a
discontinuous function.
¹ This definition is adapted from Osherson et al. (1990), Jacobs and Jordan (1991), and Jacobs et al. (1991a).
response, and the other modules are the losers. Furthermore, each module receives an
amount of training information that is proportional to its relative ability to learn. This
means that the winning module receives more training information than the losing modules.
From Chapter 10 we recall that unsupervised learning involves the use of positive
feedback. So it is with a modular network in that the competition among the modules for
the right to learn the training patterns involves a positive feedback effect. More precisely,
if a particular module learns a great deal about some training patterns, then it will likely
perform well when presented with related training patterns and thus learn a great deal
about them too. By the same token, the module will perform poorly when presented with
unrelated training patterns, in which case it will learn little or nothing about them. In both
cases, the positive feedback effect manifests itself by some form of self-amplification.
where k is the number of free parameters in the model, θ̂ is the estimate of the parameter
vector θ characterizing the model, f(x|θ̂) is the conditional probability density function
of the input vector x given the estimate θ̂, the probability distribution π(θ̂) expresses our
prior knowledge about the estimate θ̂, and N is the length of data available for processing.
The first term in Eq. (12.2) decreases with k, whereas the second term increases with k.
Accordingly, the optimum model order k is the value of k for which MDL(k) is minimum.
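Since Eq. (12.2) is not reproduced here, the sketch below uses a common two-term MDL form (a data-misfit term plus a (k/2) ln N complexity penalty) to illustrate how the two opposing terms select a model order; the function and variable names are illustrative, not from the text.

```python
import numpy as np

def mdl(neg_log_likelihood, k, n):
    """A common two-term MDL score: data misfit plus a complexity
    penalty that grows with the number of free parameters k."""
    return neg_log_likelihood + 0.5 * k * np.log(n)

# Toy illustration: select a polynomial model order by minimum MDL.
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 200)
d = 1.0 + 2.0 * x - 3.0 * x**2 + 0.1 * rng.standard_normal(x.size)

scores = {}
for k in range(1, 7):                        # k free coefficients
    coeffs = np.polyfit(x, d, k - 1)
    sigma2 = np.mean((d - np.polyval(coeffs, x)) ** 2)
    nll = 0.5 * x.size * np.log(sigma2)      # Gaussian misfit, up to a constant
    scores[k] = mdl(nll, k, x.size)

best_k = min(scores, key=scores.get)
```

With the data generated by a noisy quadratic, the misfit term drops sharply up to k = 3 and then barely improves, while the penalty keeps growing, so the minimum lands at the true model order.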
Returning to the issue at hand, it would be highly desirable to have the overall learning
capacity of the modular network match the complexity of the input data. For this to
happen, the total number of free parameters in the modular network should ideally satisfy
the MDL criterion. With the modular network containing a multitude of expert networks
and an integrating unit, it is obvious that each expert network is too simple to deal with
the complexity of the input data by itself. Accordingly, when an expert network is faced
with complex data, it will unavoidably yield large residuals (i.e., estimation errors). This
will, in turn, fuel the competitive process in the modular network and thereby permit the
other expert networks to try to describe the residuals. This is accomplished under the
direction of the integrating unit, thereby providing a viable solution to the credit-assignment
problem.
tasks, the same set of hidden neurons is forced to represent information about both tasks.
On the other hand, in the case of the split model, different sets of hidden neurons are
used to represent information about the two tasks. Provided that enough computational
resources were available in both cases, the split model was found to develop more efficient
internal representations.
3. Hardware Constraints. In a brain, there is a physical limit on the number of neurons
that can be accommodated in the available space. In a related discussion on representations
employed by the brain, it is suggested by Ballard (1986) that such a limitation compels
the brain to adopt a modular structure, and that the brain uses a coarse code to represent
multidimensional spaces. To represent a space of dimension k, it is hypothesized that the
number of neurons required to do the representation is N^k/D^(k−1), where N is the number
of just-noticeable differences in each dimension of the space and D is the diameter of the
receptive field of each neuron. With a limit imposed on the number of neurons in a cortical
area of the brain, the representation of high-dimensional spaces is distributed in different
areas that compute different functions. In an analogous manner, it may be argued that,
in order to reduce the number of neurons in an artificial neural network, the representation
of multidimensional spaces may be distributed among multiple networks (Jacobs et al.,
1991b).
With the background on modularity described in this section and the previous one, we
are ready to undertake a detailed analysis of a special class of modular networks, which
we do in the remaining sections of the chapter.*
The goal of the learning algorithm used to train the modular network of Fig. 12.2 is
to model the probability distribution of the set of training patterns {x,d}. We assume that
the patterns {x,d} used to do the training are generated by a number of different regressive
processes.
To do the learning, the expert networks and the gating network in Fig. 12.2 are all
trained simultaneously. For this purpose, we may use a learning algorithm that proceeds
as follows (Jacobs and Jordan, 1991).
1. An input vector x is selected at random from some prior distribution.
2. A rule or expert network is chosen from the distribution P(i|x); this is the probability
of the ith rule given the input vector x.
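The two steps above (plus the implied third step of generating d from the chosen regressive process) can be sketched concretely; everything below (the two processes, the logistic gate, the noise level) is invented purely for illustration:

```python
import numpy as np

# Sketch of the assumed generative process: an input x is drawn, an
# expert i is drawn from P(i|x), and the desired response is produced
# by that expert's regressive process d = F_i(x) + eps, with eps a
# zero-mean Gaussian disturbance. All names and numbers are illustrative.
rng = np.random.default_rng(1)

experts = [lambda x: 2.0 * x + 1.0,        # regressive process 1
           lambda x: -x + 0.5]             # regressive process 2

def gate_probs(x):
    """P(i|x): here a simple logistic split on the sign of x."""
    p1 = 1.0 / (1.0 + np.exp(-4.0 * x))
    return np.array([p1, 1.0 - p1])

def sample_pattern():
    x = rng.uniform(-1.0, 1.0)             # step 1: draw x from a prior
    i = rng.choice(2, p=gate_probs(x))     # step 2: draw expert from P(i|x)
    d = experts[i](x) + 0.05 * rng.standard_normal()   # d = F_i(x) + eps
    return x, i, d

x, i, d = sample_pattern()
```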
*Much of the material presented in the remainder of this chapter is based on Jacobs and Jordan (1991), and
Jordan and Jacobs (1992). The approach taken in these two papers is statistical in nature, being based on
maximum-likelihood estimation. Some similar results are reported by Szymanski and Lemmon (1993), using
an information-theoretic approach.
12.3 Associative Gaussian Mixture Model
FIGURE 12.2 Block diagram of a modular network; the outputs of the expert networks
(modules) are mediated by a gating network.
d = F_i(x) + ε_i,   i = 1, 2, ..., K   (12.6)

Note also that, in general, the elements of the output vector of each expert are not
uncorrelated. Rather, the covariance matrix of the ith expert network's output vector y_i
is the covariance matrix of ε_i. For the sake of simplicity, however, it is assumed that for
all K expert networks we have

ε_1 = ε_2 = ··· = ε_K

and that the covariance matrix of ε_i is the identity matrix, as shown by

Λ_i = I,   i = 1, 2, ..., K   (12.7)
The multivariate Gaussian distribution of the desired response vector d, given the input
vector x and that the ith expert network is chosen, may therefore be expressed as (Wilks,
1962)
f(d|x, i) = (1/((2π)^{q/2}(det Λ_i)^{1/2})) exp(−(1/2)(d − y_i)^T Λ_i^{−1}(d − y_i))
          = (1/(2π)^{q/2}) exp(−(1/2)(d − y_i)^T(d − y_i))
          = (1/(2π)^{q/2}) exp(−(1/2)‖d − y_i‖²),   i = 1, 2, ..., K   (12.8)

where ‖·‖ denotes the Euclidean norm of the enclosed vector. The multivariate distribution
in Eq. (12.8) is written as a conditional probability density function to emphasize the fact
that, for a given input vector x, we are assuming that the ith expert network produces the
closest match to the desired response vector d.
On this basis, we may treat the probability distribution of the desired response vector
d as a mixture model (i.e., as a linear combination of K different multivariate Gaussian
distributions), as shown by
f(d|x) = Σ_{i=1}^{K} g_i f(d|x, i)
       = (1/(2π)^{q/2}) Σ_{i=1}^{K} g_i exp(−(1/2)‖d − y_i‖²)   (12.9)
The probability distribution of Eq. (12.9) is called an associative Gaussian mixture model;³
the term “associative” refers to the fact that the model is associated with a set of training
patterns represented by the input vector x and desired response vector d.
The goal of the learning algorithm is to model the distribution of a given set of training
patterns. To do so, we first recognize the fact that the output vector y_i of the ith expert
network is a function of the synaptic weight vector w_i of that network. Let the vector w
of appropriate dimension denote the synaptic weights of all the expert networks arranged
as follows:

w = [w_1^T, w_2^T, ..., w_K^T]^T   (12.10)

Similarly, let the vector g denote the activations of all the output neurons in the gating
network:

g = [g_1, g_2, ..., g_K]^T   (12.11)
³ For a discussion of nonassociative Gaussian mixture models, see McLachlan and Basford (1988).
We may thus view the conditional probability density function f(d|x) as a likelihood
function, with the whole synaptic weight vector w and the activation vector g playing the
roles of unknown parameters. In situations of the kind described by Eq. (12.9), it is
preferable to work with the natural logarithm of f(d|x) rather than f(d|x) itself; we may
do so since the logarithm is a monotone increasing function of its argument. Accordingly,
we may define a log-likelihood function as follows:

l(w, g) = ln f(d|x)   (12.12)
Substituting Eq. (12.9) in (12.12) and ignoring the constant term −ln(2π)^{q/2}, we may
formally express the log-likelihood function l(w, g) as follows (Jacobs and Jordan, 1991;
Jacobs et al., 1991a):

l(w, g) = ln Σ_{i=1}^{K} g_i exp(−(1/2)‖d − y_i‖²)   (12.13)

where it is understood that y_i depends on w_i (i.e., the ith portion of w). We may thus
view l(w, g) as an objective function, the maximization of which yields maximum-likelihood
estimates of all the free parameters of the modular network in Fig. 12.2, represented by
the synaptic weights of the different expert networks and those of the gating network.
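A direct numerical transcription of Eq. (12.13) is straightforward; the helper below (illustrative names, NumPy assumed) evaluates l for given gating activations g, expert outputs Y, and desired response d, using the standard log-sum-exp shift for numerical stability:

```python
import numpy as np

def log_likelihood(g, Y, d):
    """l(w, g) = ln sum_i g_i exp(-0.5 ||d - y_i||^2), with the constant
    -(q/2) ln(2*pi) dropped as in Eq. (12.13).
    g: (K,) gating activations; Y: (K, q) expert outputs; d: (q,)."""
    sq = 0.5 * np.sum((d - Y) ** 2, axis=1)   # 0.5 ||d - y_i||^2 per expert
    a = np.log(g) - sq
    m = a.max()                               # log-sum-exp shift for stability
    return m + np.log(np.exp(a - m).sum())

g = np.array([0.7, 0.3])
Y = np.array([[1.0, 0.0], [0.0, 1.0]])
d = np.array([1.0, 0.0])
ll = log_likelihood(g, Y, d)
```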
We may now offer the following interpretations for some of the modular network's
unknown quantities (Jacobs and Jordan, 1991):
1. The optimized output vectors y_1, y_2, ..., y_K of the expert networks are
unknown conditional mean vectors.
2. The optimized gating network's outputs g_1, g_2, ..., g_K are the conditional a priori
probabilities that the respective modules generated the current training pattern.
The probabilistic parameters referred to under points 1 and 2 are all conditional on the
input vector x.
Whereas the different expert networks of the modular structure in Fig. 12.2 are permitted
to have an arbitrary connectivity, the activations of the output neurons of the gating
network are constrained to satisfy two requirements (Jacobs and Jordan, 1991):
0 ≤ g_i ≤ 1  for all i   (12.14)

Σ_{i=1}^{K} g_i = 1   (12.15)
These two constraints are necessary if the activations gi are to be interpreted as a priori
probabilities.
Given a set of unconstrained variables, {u_j | j = 1, 2, ..., K}, we may satisfy the two
constraints of Eqs. (12.14) and (12.15) by defining the activation g_i of the ith output
neuron of the gating network as follows (Bridle, 1990a):

g_i = exp(u_i) / Σ_{j=1}^{K} exp(u_j)   (12.16)

where u_i is the weighted sum of the inputs applied to the ith output neuron of the gating
network. This normalized exponential transformation may be viewed as a multi-input
generalization of the logistic function. It preserves the rank order of its input values,
and is a differentiable generalization of the "winner-takes-all" operation of picking the
maximum value. For this reason, the transformation of Eq. (12.16) is referred to as softmax
(Bridle, 1990a, b).
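The properties just listed are easy to verify numerically; a minimal softmax sketch (illustrative code, not from the text):

```python
import numpy as np

def softmax(u):
    """Normalized exponential of Eq. (12.16); shift by max for stability."""
    e = np.exp(u - np.max(u))
    return e / e.sum()

u = np.array([2.0, 1.0, -0.5])
g = softmax(u)

# Properties noted in the text: the outputs lie in [0, 1], sum to one,
# and the transformation preserves the rank order of its inputs.
assert np.argmax(g) == np.argmax(u)

# Sharpening the inputs (multiplying by a large gain) approaches the
# hard winner-takes-all operation of picking the maximum.
g_hard = softmax(50.0 * u)
```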
The a posteriori probability h_i that the ith expert network generated the desired response
vector d is defined in terms of the activations g_i as

h_i = g_i exp(−(1/2)‖d − y_i‖²) / Σ_{j=1}^{K} g_j exp(−(1/2)‖d − y_j‖²)   (12.17)

This probability is conditional on both the input vector x and the desired response vector
d. From this definition, we also note that, as with the a priori probabilities represented
by the activations g_i, the a posteriori probabilities h_i satisfy the two necessary conditions:
0 ≤ h_i ≤ 1  for all i   (12.18)

and

Σ_{i=1}^{K} h_i = 1   (12.19)
There are two different parameter adjustments to be performed in the modular network
of Fig. 12.2: those of the expert networks and those of the gating network. Differentiating
the log-likelihood function with respect to the output vector y_i gives

∂l/∂y_i = h_i(d − y_i),   i = 1, 2, ..., K   (12.20)
Equation (12.20) states that, during the training process, the synaptic weights of the ith
expert network in Fig. 12.2 are adjusted to correct the error between the output vector yi
and the desired response vector d, but in proportion to the a posteriori probability h_i that
the ith expert network generated the training pattern in current use (Jacobs and Jordan,
1991).
Suppose now that each expert network consists of a single layer of neurons, as depicted
in the architectural graph of Fig. 12.3a. The specification of the neurons in Fig. 12.3a
depends on whether we are solving a regression or classification problem, as explained
here:
Regression. In a nonlinear regression problem the residuals are generally assumed
to have a multivariate Gaussian distribution. In a corresponding way, the output
neurons of the expert networks are modeled as linear.
Classification. In a pattern-classification problem, the output neurons are usually
assumed to have a sigmoidal nonlinearity. In this case, a mixture of Bernoulli
distributions is the appropriate probabilistic model.⁴
12.4 Stochastic-Gradient Learning Algorithm
FIGURE 12.3 (a) Single layer of linear neurons constituting the expert network.
(b) Signal-flow graph of a linear neuron.
In all cases, of course, the gating networks use the softmax nonlinearity.
In the discussion that follows we consider a nonlinear regression problem, assuming
a multivariate Gaussian model. The nonlinear nature of the problem is taken care of by
the softmax nonlinearity of the gating network in Fig. 12.2. The Gaussian assumption is
taken care of by using linear output neurons for the implementation of the expert networks.
We may thus define the mth element of the output vector y_i of the ith expert network
as the inner product of the corresponding synaptic weight vector w_i^{(m)} and the input vector
x, as depicted in the signal-flow graph of Fig. 12.3b; that is,

y_i^{(m)} = x^T w_i^{(m)},   m = 1, 2, ..., q   (12.21)

where the superscript T denotes transposition, and the weight vector w_i^{(m)} is made up of
the elements w_{i1}^{(m)}, ..., w_{ip}^{(m)} of the mth neuron in the ith expert network. Hence,
⁴ In binary classification, the classifier output y is a discrete random variable with one of two possible
outcomes: 1 or 0. It is generally assumed that the probabilistic component of the model has a Bernoulli
distribution. Let p_i denote the conditional probability that the ith expert network reports outcome 1 given the
input vector x. The resulting probability distribution of the modular network may then be described by a Bernoulli
mixture model (Jordan and Jacobs, 1993).
differentiating y_i^{(m)} with respect to the synaptic weight vector w_i^{(m)}, we get

∂y_i^{(m)}/∂w_i^{(m)} = x   (12.22)
The sensitivity vector of the log-likelihood function l with respect to the synaptic
weight vector w_i^{(m)} is defined by the functional derivative ∂l/∂w_i^{(m)}. Using the chain rule,
we may express this sensitivity vector as

∂l/∂w_i^{(m)} = (∂l/∂y_i^{(m)})(∂y_i^{(m)}/∂w_i^{(m)})   (12.23)

The partial derivative ∂l/∂y_i^{(m)} is the mth element of the functional derivative ∂l/∂y_i defined
in Eq. (12.20); that is,
∂l/∂y_i^{(m)} = h_i e_i^{(m)}   (12.24)

where e_i^{(m)} is the error signal produced at the output of the mth neuron in the ith expert
network, as shown by

e_i^{(m)} = d^{(m)} − y_i^{(m)}   (12.25)
We are now ready to formulate the expression for the sensitivity vector ∂l/∂w_i^{(m)}. Specifically,
substituting Eqs. (12.22) and (12.24) in (12.23), we get

∂l/∂w_i^{(m)} = h_i e_i^{(m)} x,   i = 1, 2, ..., K;  m = 1, 2, ..., q   (12.26)
To maximize the log-likelihood function l with respect to the synaptic weights of the
different expert networks, we may use gradient ascent in weight space. In particular, we
modify the synaptic weight vector w_i^{(m)} by applying a small adjustment Δw_i^{(m)}, defined by

Δw_i^{(m)} = η ∂l/∂w_i^{(m)}   (12.27)

where η is a small learning-rate parameter. Note that the scaling factor on the right-hand
side of Eq. (12.27) is +η, since we are using gradient ascent to maximize l. Thus, using
w_i^{(m)}(n) to denote the value of the synaptic weight vector w_i^{(m)} at iteration n of the learning
algorithm, the updated value of this synaptic weight vector at iteration n + 1 is computed
in accordance with the recursion

w_i^{(m)}(n + 1) = w_i^{(m)}(n) + Δw_i^{(m)}(n)
               = w_i^{(m)}(n) + η h_i e_i^{(m)}(n) x,   i = 1, 2, ..., K;  m = 1, 2, ..., q   (12.28)
This is the desired recursive formula for adapting the expert networks of the modular
architecture shown in Fig. 12.2.
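Putting Eqs. (12.16), (12.17), (12.21), (12.25), and (12.28) together, a single training step for the expert networks might be sketched as follows; this assumes NumPy, and the names (`expert_update`, `W`, `a`) are illustrative, with W stacking the q-by-p weight matrices of the K linear experts:

```python
import numpy as np

def expert_update(W, a, x, d, eta=0.1):
    """One stochastic-gradient-ascent step of Eq. (12.28) for linear
    experts, under the identity-covariance Gaussian model.
    W: (K, q, p) expert weights; a: (K, p) gating weights (used here
    only to form g and the posteriors h); x: (p,); d: (q,)."""
    u = a @ x                                   # weighted sums, Eq. (12.31)
    g = np.exp(u - u.max()); g /= g.sum()       # softmax, Eq. (12.16)
    Y = W @ x                                   # y_i = W_i x, Eq. (12.21)
    log_h = np.log(g) - 0.5 * np.sum((d - Y) ** 2, axis=1)
    h = np.exp(log_h - log_h.max()); h /= h.sum()   # posteriors, Eq. (12.17)
    E = d - Y                                   # errors e_i^(m), Eq. (12.25)
    W_new = W + eta * h[:, None, None] * E[:, :, None] * x[None, None, :]
    return W_new, g, h

W = np.zeros((2, 1, 2))           # K=2 experts, q=1 output, p=2 inputs
a = np.zeros((2, 2))
x = np.array([1.0, 0.0])
d = np.array([1.0])
W_new, g, h = expert_update(W, a, x, d)
```

With all weights zero, both experts are equally credible (h_i = 1/2), so each receives half of the full error correction, illustrating how the posterior apportions training information among the experts.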
FIGURE 12.4 (a) Single layer of softmax neurons for the gating network. (b) Signal-flow
graph of a softmax neuron.
The gating network differs from the expert networks in two respects. First, the gating network has K output neurons,
whereas each expert network has q output neurons. Second, the gating network uses a
softmax for the activation function of its output neurons as depicted in the signal-flow
graph of Fig. 12.4b, whereas the expert networks use linear output neurons as shown in
the signal-flow graph of Fig. 12.3b. In any event, we note that at the ith output neuron
of the gating network, say, the unconstrained variables are represented by the weighted
sum u_i of the inputs applied to that neuron. The activation g_i of the ith output neuron is
related to the weighted sum u_i by the softmax of Eq. (12.16). Hence, substituting Eq. (12.16)
in the definition of the log-likelihood function l given in Eq. (12.13), and recognizing that
the summation term Σ_{j=1}^{K} exp(u_j) is the same for all i, we may rewrite the expression for
the log-likelihood function l as

l = ln Σ_{i=1}^{K} exp(u_i) exp(−(1/2)‖d − y_i‖²) − ln Σ_{j=1}^{K} exp(u_j)   (12.29)
The partial derivative of the log-likelihood function l with respect to the ith weighted
sum u_i of the inputs applied to the ith output neuron of the gating network is therefore
found to be (after simplification)

∂l/∂u_i = h_i − g_i   (12.30)

where we have made use of the definitions of g_i and h_i given in Eqs. (12.16) and (12.17),
respectively. Equation (12.30) states that the synaptic weights of the ith output neuron of
the gating network are adjusted such that the activations of the network (i.e., the a priori
probabilities g_i) move toward the corresponding a posteriori probabilities h_i (Jacobs and
Jordan, 1991). Note that the a priori probabilities g_i are conditional on the input vector
x , whereas the a posteriori probabilities hi are conditional on both the input vector x and
the desired response vector d.
From the signal-flow graph of Fig. 12.4b, we see that the weighted sum u_i of the ith
output neuron of the gating network is equal to the inner product of the pertinent synaptic
weight vector a_i and the input vector x, as shown by

u_i = x^T a_i   (12.31)

where the vector a_i is made up of the synaptic weights a_{i1}, a_{i2}, ..., a_{ip} of neuron i in the
gating network. Hence the partial derivative of the weighted sum u_i with respect to the
weight vector a_i is given by the p-by-1 vector

∂u_i/∂a_i = x   (12.32)
The sensitivity vector of the log-likelihood function l with respect to the synaptic
weight vector a_i is defined by the partial derivative ∂l/∂a_i. Using the chain rule, we may
express this sensitivity vector as

∂l/∂a_i = (∂l/∂u_i)(∂u_i/∂a_i) = (h_i − g_i)x,   i = 1, 2, ..., K   (12.33)
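The sensitivity vector of Eq. (12.33) translates directly into a gating-network update: the adjustment to a_i is proportional to (h_i − g_i)x, nudging the a priori probabilities toward the posteriors. A minimal sketch (illustrative names, NumPy assumed):

```python
import numpy as np

def gating_update(a, g, h, x, eta=0.1):
    """Gradient-ascent step implied by Eq. (12.33): with u_i = a_i^T x
    and dl/du_i = h_i - g_i, the update is Delta a_i = eta*(h_i - g_i)*x,
    so the a priori probabilities g move toward the posteriors h."""
    return a + eta * (h - g)[:, None] * x[None, :]

a = np.zeros((2, 2))
g = np.array([0.5, 0.5])          # a priori (gating) probabilities
h = np.array([0.9, 0.1])          # posterior favors expert 1
x = np.array([1.0, 2.0])
a_new = gating_update(a, g, h, x)
```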
2. Adapting the Expert and Gating Networks. Present the network with a task example
represented by the input vector x and desired response vector d. Hence, compute
for iteration n = 0, 1, 2, ..., output i = 1, 2, ..., K, and neuron m = 1, 2, ..., q:

u_i(n) = x^T a_i(n)
g_i(n) = exp(u_i(n)) / Σ_{j=1}^{K} exp(u_j(n))
y_i^{(m)}(n) = x^T w_i^{(m)}(n)
y_i(n) = [y_i^{(1)}(n), y_i^{(2)}(n), ..., y_i^{(q)}(n)]^T
e_i^{(m)}(n) = d^{(m)} − y_i^{(m)}(n)
FIGURE 12.5 Hierarchical modular network: the expert networks are grouped into
clusters 1 through K, each cluster with its own gating network, and a top-level gating
network mediating the cluster outputs.
where u_i is the weighted sum of the inputs applied to that neuron; basically, this equation
is a reproduction of Eq. (12.16) for the case of the single level of hierarchy shown in
Fig. 12.2. Similarly, the activation of the jth output neuron in the ith cluster of the
hierarchical network shown in Fig. 12.5 is defined by

g_{j|i} = exp(u_{j|i}) / Σ_{k=1}^{L} exp(u_{k|i})   (12.38)

where u_{j|i} is the weighted sum of the inputs applied to this particular neuron in the ith
cluster.
12.5 Hierarchical Structure of Adaptive Expert Networks
The expert networks in each cluster of Fig. 12.5 are also assumed to consist of a single
layer of linear neurons. Let y_{ji} denote the output vector of the jth expert network in the
ith cluster. We may then express the output vector y_i of the ith cluster of expert networks
as

y_i = Σ_{j=1}^{L} g_{j|i} y_{ji}   (12.39)

where g_{j|i} is as defined in Eq. (12.38). Correspondingly, the output vector y of the whole
network in Fig. 12.5 is defined by

y = Σ_{i=1}^{K} g_i y_i   (12.40)
In a manner analogous to Eq. (12.6), the desired response vector d is assumed to be
generated by one of the K × L regressive processes

d = F_{ji}(x) + ε_{ji},   i = 1, 2, ..., K;  j = 1, 2, ..., L   (12.41)

where F_{ji}(·) is a vector-valued nonlinear function of its argument vector, and ε_{ji} is
a zero-mean, Gaussian-distributed random vector.
The goal is to model the probability distribution of the set of training examples {x, d}.
For the objective function appropriate to the learning problem at hand, we use a log-likelihood
function defined as the expanded form of an associative Gaussian mixture
model. Specifically, we write (except for a constant)

l = ln Σ_{i=1}^{K} g_i Σ_{j=1}^{L} g_{j|i} exp(−(1/2)‖d − y_{ji}‖²)   (12.42)
where d is the desired response vector associated with the input vector x applied simultaneously
to all the expert networks and gating networks in Fig. 12.5; the outputs y_{ji} and the
activations g_i and g_{j|i} are as defined before. Given the training data {x, d}, we wish to
maximize the log-likelihood function l with respect to the unknown quantities y_{ji}, g_i, and
g_{j|i}. These quantities, pertaining to the structure of Fig. 12.5, may be given the following
probabilistic interpretations (Jordan and Jacobs, 1992):

■ The output vectors y_{ji} of the individual expert networks are the unknown conditional
mean vectors of multivariate Gaussian distributions.
■ The activation g_i of the top-level gating network and the activations g_{j|i} of the gating
networks inside the isolated clusters are the unknown conditional a priori probabilities
that the ith cluster and the jth expert network within it, respectively, generated the
current training pattern {x, d}.
All of these unknown probabilistic quantities are conditional on the input vector x.
To assist in the formulation of the learning algorithm, we introduce two probabilistic
definitions. First, we define the conditional a posteriori probability that the ith cluster of
expert networks generates a particular desired response vector d as

h_i = g_i Σ_{j=1}^{L} g_{j|i} exp(−(1/2)‖d − y_{ji}‖²) / Σ_{k=1}^{K} g_k Σ_{j=1}^{L} g_{j|k} exp(−(1/2)‖d − y_{jk}‖²)   (12.43)

Second, we define the conditional a posteriori probability that the jth expert network in
the ith cluster generates a particular desired response vector d as

h_{j|i} = g_{j|i} exp(−(1/2)‖d − y_{ji}‖²) / Σ_{k=1}^{L} g_{k|i} exp(−(1/2)‖d − y_{ki}‖²)   (12.44)

The a posteriori probabilities h_i and h_{j|i} are both conditional on the input vector x and
the desired response vector d.
The output vectors y_{ji} and the activations g_i and g_{j|i} depend on the synaptic weights of
the neural networks that constitute the respective expert networks and gating networks.
The log-likelihood function l defined in Eq. (12.42) may therefore be viewed as a function
of these synaptic weights, which play the role of the unknown parameters. Hence, maximizing
the log-likelihood function l yields the maximum-likelihood estimates of these
parameters. The maximization of l may be performed in an iterative fashion, using gradient
ascent for the computation of small adjustments applied simultaneously to all the synaptic
weights in the network.
We may compute these synaptic modifications in Fig. 12.5 by proceeding in three
stages, as follows (Jordan and Jacobs, 1992):
1. Adapting the Top-Level Gating Network. Here we differentiate the log-likelihood
function l of Eq. (12.42) with respect to the weighted sum u_i of the inputs applied
to the ith output neuron of the top-level gating network, obtaining the scalar partial
derivative

∂l/∂u_i = h_i − g_i,   i = 1, 2, ..., K   (12.45)

Hence, during the training process the a priori probability g_i tries to move toward
the corresponding a posteriori probability h_i.
2. Adapting the Gating Networks in the Clusters. In this case, we differentiate the log-likelihood
function l of Eq. (12.42) with respect to the weighted sum u_{j|i} of the
inputs applied to the jth output neuron of the gating network in the ith cluster,
obtaining the scalar partial derivative

∂l/∂u_{j|i} = h_i(h_{j|i} − g_{j|i}),   i = 1, 2, ..., K;  j = 1, 2, ..., L   (12.46)

Consequently, during training the a priori probability g_{j|i} tries to move toward the
corresponding a posteriori probability h_{j|i}.
3. Adapting the Expert Networks. Next, we differentiate the log-likelihood function l
of Eq. (12.42) with respect to the output vector y_{ji} of the jth expert network in the
ith cluster, obtaining

∂l/∂y_{ji} = h_i h_{j|i}(d − y_{ji}),   i = 1, 2, ..., K;  j = 1, 2, ..., L   (12.47)

Thus, during training the synaptic weights of the jth expert network in the ith cluster
are updated by an amount proportional to the a posteriori probability that this
particular expert network generated the training pattern in current use.
The partial derivative ∂l/∂u_{j|i} of Eq. (12.46) and the partial derivative ∂l/∂y_{ji} of Eq.
(12.47) share a common factor, namely, the a posteriori probability h_i. This means that
the expert networks within a cluster are tied to each other. Consequently, the expert
networks within a cluster tend to learn similar mappings early in the training process.
However, when the probabilities associated with the cluster to which the expert networks
belong assume larger values later in the training process, the experts start to specialize in
what they learn. Thus the hierarchical network of Fig. 12.5 tends to evolve in a coarse-to-fine
structural fashion. This property is important, because it implies that a deep hierarchical
network is naturally robust with respect to the overfitting problem (Jordan and Jacobs,
1992).
The final step in the development of the stochastic-gradient learning algorithm for the
network of Fig. 12.5 involves the determination of the sensitivity factors ∂l/∂a_i, ∂l/∂c_{j|i},
and ∂l/∂w_{ji}^{(m)}. The vector a_i denotes the synaptic weight vector of the ith output neuron
of the top-level gating network, c_{j|i} denotes the synaptic weight vector of the jth output
neuron of the gating network in the ith cluster, and w_{ji}^{(m)} denotes the synaptic weight vector
of the mth output neuron of the jth expert network in the ith cluster; the index m = 1,
2, ..., q, where q is the total number of output neurons in each expert network. To find
the formulas for these sensitivity factors, we use chain rules to express them as the
products of certain functional derivatives, as shown here:

∂l/∂a_i = (∂l/∂u_i)(∂u_i/∂a_i),   i = 1, 2, ..., K   (12.48)

∂l/∂c_{j|i} = (∂l/∂u_{j|i})(∂u_{j|i}/∂c_{j|i}),   i = 1, 2, ..., K;  j = 1, 2, ..., L   (12.49)

∂l/∂w_{ji}^{(m)} = (∂l/∂y_{ji}^{(m)})(∂y_{ji}^{(m)}/∂w_{ji}^{(m)}),   i = 1, 2, ..., K;  j = 1, 2, ..., L;  m = 1, 2, ..., q   (12.50)
The derivations of the functional derivatives in Eqs. (12.48) through (12.50) and their
use in the determination of these three sensitivity factors, and therefore the development
of a learning algorithm for the network of Fig. 12.5, follow a procedure similar to that
described in Section 12.4; this derivation is left as an exercise for the reader in
Problem 12.7.
12.6 Piecewise Control Using Modular Networks
FIGURE 12.6 Two-joint planar arm. (From R.A. Jacobs and M.I. Jordan, 1993, with
permission of IEEE.)
control strategy is applicable to situations where it is known how the dynamics of a plant
change with its operating points.
An important advantage of a modular network over a multilayer perceptron as feedforward
controller for the kind of plant described here is that it is relatively robust with
respect to temporal crosstalk (Jacobs and Jordan, 1993; Jacobs et al., 1991b). To explain
what we mean by this phenomenon, suppose that a multilayer perceptron is trained on a
particular task using the back-propagation algorithm, and then it is switched to another
task that is incompatible with the first one. Ideally, we would like to have the network
learn the second task without its performance being unnecessarily impaired with respect
to the first task. However, according to Sutton (1986), back-propagation learning has the
opposite effect in the sense that it tends to preferentially modify the synaptic weights of
hidden neurons that have already developed useful properties. Consequently, after learning
the second task, the multilayer perceptron is no longer able to perform the first task. We
may of course alternate the training process between tasks, and carry on in this way until
the multilayer perceptron eventually learns both tasks. However, the price that we have
to pay for it is a prolonged learning process. The generalization performance of the
multilayer perceptron may also be affected by the blocked presentation of incompatible
training data.
In contrast, when a modular network is trained on a number of incompatible tasks, it
has the ability to partition the parameter space of the plant into a number of regions and
assign different expert networks to learn a control law for each region (Jacobs and Jordan,
1993). The end result is that the network is relatively immune to temporal crosstalk
between tasks.
Jacobs and Jordan (1993) have compared the performance of a single network and a
modular network as feedforward controllers for a robot arm in the form of the two-joint
planar manipulator shown in Fig. 12.6. These two networks were specified as follows:⁵
A fully connected multilayer perceptron with 19 input nodes (corresponding to the
joint variables and payload identity), 10 hidden neurons, and 2 output neurons
(corresponding to the feedforward torques).
⁵ In the study reported by Jacobs and Jordan (1993), two other modifications of the modular network were
also considered, which are referred to as modular architectures with a share network. During training the share
network learns a strategy that is useful to all tasks. One of these two networks was constrained to discover a
particular type of decomposition. The performance of these two configurations was almost the same, and superior
to that of the conventional modular network.
12.7 Summary and Discussion
[Axes of Fig. 12.7: joint root-mean-square error (radians), 0.0 to 0.3, versus number of
epochs, 0 to 10; SN = single network, MA = modular architecture.]
FIGURE 12.7 Learning curves for a single network and a modular network, both of
which are trained on the two-joint planar arm. (From R.A. Jacobs and M.I. Jordan, 1993,
with permission of IEEE.)
On the other hand, activation functions in the form of softmax are used for the neurons in the gating networks.
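As a minimal sketch of the softmax activation used in the gating networks (the input values below are hypothetical, chosen only for illustration):

```python
import numpy as np

def softmax(u):
    """Softmax activation: g_i = exp(u_i) / sum_j exp(u_j).

    Subtracting max(u) before exponentiating keeps the computation
    numerically stable without changing the result.
    """
    z = np.exp(u - np.max(u))
    return z / z.sum()

# Hypothetical gating-network outputs for three expert networks.
u = np.array([2.0, 1.0, 0.1])
g = softmax(u)
# The activations are nonnegative and sum to one, so they can be
# interpreted as prior probabilities of selecting each expert.
```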
The use of a modular hierarchical approach as described here offers many attractive
features (Houk, 1992; Jordan, 1992; Jacobs et al., 1991b):
■ Biological plausibility
■ A treelike structure that is a universal approximator
■ A level of granularity that is intermediate between those of the local and global methods of approximation
■ No back-propagation of error terms in the recursive computations of the synaptic weights characterizing the expert networks and gating networks
■ An insightful probabilistic interpretation for the various unknown quantities of the network
■ A viable solution to the credit-assignment problem through the computation of a set of a posteriori probabilities
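The last of these features can be sketched concretely, assuming Gaussian expert models with unit variance; the gating priors, desired response, and expert outputs below are hypothetical values chosen for illustration:

```python
import numpy as np

def posterior(g, d, y):
    """A posteriori probability h_i that expert i generated target d.

    g : prior probabilities from the gating network (sum to one)
    d : desired response vector
    y : outputs of the K experts, one row per expert
    Assumes unit-variance Gaussian expert models, so the likelihood of
    expert i is proportional to exp(-||d - y_i||^2 / 2).
    """
    lik = np.exp(-0.5 * np.sum((d - y) ** 2, axis=1))
    h = g * lik
    return h / h.sum()

g = np.array([0.5, 0.3, 0.2])   # hypothetical gating priors
d = np.array([1.0, 0.0])        # hypothetical desired response
y = np.array([[0.9, 0.1],       # expert whose output is closest to d
              [0.0, 1.0],
              [-1.0, 0.0]])
h = posterior(g, d, y)
# The expert whose output best matches d receives the largest share
# of the credit, i.e., the largest a posteriori probability.
```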
The probabilistic interpretation referred to here is a direct consequence of the way in
which the objective function for deriving the learning algorithm is formulated. Specifically,
the objective function is defined as a log-likelihood function of the synaptic weights
of the expert networks and those of the gating networks. Accordingly, maximization of
the objective function results in maximum-likelihood estimates of these unknown free
parameters of the network. Maximum-likelihood parameter estimation is a well-developed
branch of statistics, which means that we have a wealth of knowledge at our disposal for
the computation and interpretation of these estimates. In particular, if the estimates are known to be unbiased, then the maximum-likelihood estimates of the synaptic weights of the various network components satisfy an inequality referred to as the Cramér-Rao lower bound (Van Trees, 1968). If this bound is satisfied with an equality, the estimates are said to be efficient. At this point a logical question to ask is: Do parameter estimation
procedures better than the maximum-likelihood procedure exist? Certainly, if an efficient
procedure does not exist, then there may be unbiased estimates with lower variances than
the maximum-likelihood estimates. The difficulty, however, is that there is no general
rule for finding them. It is for this reason that the maximum-likelihood procedure enjoys
a great deal of popularity as a tool for parameter estimation.
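In scalar form, the bound just mentioned may be stated as follows; this is the standard statement for an unbiased estimate θ̂ of a parameter θ, not a result specific to the network at hand:

```latex
\mathrm{var}(\hat{\theta}) \;\ge\; \frac{1}{J(\theta)},
\qquad
J(\theta) = E\!\left[\left(\frac{\partial}{\partial\theta}
\ln f_{\mathbf{X}}(\mathbf{x};\theta)\right)^{2}\right]
```

where J(θ) is the Fisher information of the observed data; an unbiased estimate attaining the bound with equality is efficient in the sense defined above.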
A practical limitation of the maximum-likelihood procedure is that it can be computa-
tionally intensive. We may overcome this limitation by following a gradient ascent proce-
dure, as described in Sections 12.4 and 12.5. However, by doing so there is no guarantee
that we reach the globally maximum value of the objective function (i.e., the log-likelihood function), as there is the real possibility of being trapped in a local maximum. To guard against such a possibility, we may add small amounts of noise to the sensitivity factors so as to help the stochastic-gradient algorithm escape local maxima (Jelinek and Reilly, 1990).
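The noise-injection remedy can be sketched as follows; this is a minimal illustration on a hypothetical one-dimensional objective, and the objective, learning rate, and linearly annealed noise schedule are illustrative assumptions rather than the network's actual log-likelihood or the scheme of Jelinek and Reilly:

```python
import numpy as np

def noisy_gradient_ascent(grad, theta0, eta=0.05, sigma=0.3,
                          steps=2000, rng=None):
    """Gradient ascent with additive noise to help escape local maxima.

    grad   : gradient of the objective being maximized
    theta0 : starting point
    eta    : learning rate
    sigma  : standard deviation of the injected noise, annealed
             linearly to zero over the run
    """
    rng = np.random.default_rng(0) if rng is None else rng
    theta = theta0
    for n in range(steps):
        noise = sigma * (1.0 - n / steps) * rng.standard_normal()
        theta += eta * grad(theta) + noise
    return theta

# Hypothetical multimodal objective and its gradient.
f = lambda t: -0.1 * (t + 1) ** 2 * (t - 2) ** 2 + 0.5 * t
df = lambda t: (-0.2 * (t + 1) * (t - 2) ** 2
                - 0.2 * (t + 1) ** 2 * (t - 2) + 0.5)

theta = noisy_gradient_ascent(df, theta0=-1.0)
# The injected noise lets the iterate wander out of a shallow local
# basin that plain gradient ascent from the same start would stay in.
```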
Alternatively, we may use the expectation-maximization (EM) algorithm (Dempster et
al., 1977), which is well suited for mixture models. The EM algorithm is a general
approach to iterative computation of maximum-likelihood estimates when the observed
data can be viewed as incomplete data. The term “incomplete data” has two implications:
1. The existence of two sample spaces X and Y, represented by the observed data vector x and the complete data vector y, respectively
2. The many-to-one mapping y → x(y) from space Y to space X
The complete data vector y is not observed directly, but only through the vector x. Each iteration of the EM algorithm is composed of two steps: an expectation (E) step and a maximization (M) step; hence the name of the algorithm. Suppose that θ̂(n) denotes the current value of the parameter vector after n cycles of the algorithm. Let l_c denote the log-likelihood of the complete data vector y. The next cycle of the EM algorithm may then be described as follows:
E STEP. Compute the expected value of the complete-data log-likelihood l_c, given the observed data vector x and the current model represented by the parameter vector θ̂(n), by evaluating the deterministic quantity
Q[θ, θ̂(n)] = E[l_c(θ) | x, θ̂(n)]   (12.51)
M STEP. Determine the updated value θ̂(n + 1) of the parameter vector as the solution of the equation
θ̂(n + 1) = arg max_θ Q[θ, θ̂(n)]   (12.52)
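As an illustration of these two steps, consider the classical example of fitting the means of a two-component Gaussian mixture, where the unobserved component labels constitute the missing part of the complete data; the data set, unit-variance assumption, and initial guesses below are hypothetical:

```python
import numpy as np

def em_two_gaussians(x, mu, n_cycles=50):
    """EM for a two-component Gaussian mixture with unit variances and
    equal mixing proportions; only the two means are estimated.

    E step: compute the expected component labels (responsibilities)
            given x and the current means.
    M step: re-estimate each mean as the responsibility-weighted
            average, which maximizes the expected complete-data
            log-likelihood Q.
    """
    mu = np.array(mu, dtype=float)
    for _ in range(n_cycles):
        # E step: posterior probability that each sample came from
        # component 0 (unit-variance Gaussians, equal priors).
        p0 = np.exp(-0.5 * (x - mu[0]) ** 2)
        p1 = np.exp(-0.5 * (x - mu[1]) ** 2)
        r0 = p0 / (p0 + p1)
        # M step: responsibility-weighted means.
        mu[0] = np.sum(r0 * x) / np.sum(r0)
        mu[1] = np.sum((1 - r0) * x) / np.sum(1 - r0)
    return mu

rng = np.random.default_rng(1)
# Hypothetical incomplete data: samples drawn from two Gaussians whose
# labels (the missing data) are discarded before fitting.
x = np.concatenate([rng.normal(-2.0, 1.0, 200),
                    rng.normal(3.0, 1.0, 200)])
mu = em_two_gaussians(x, mu=[-1.0, 1.0])
```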
Jordan and Jacobs (1993) describe the application of the EM algorithm to the design of hierarchical mixtures of experts for solving nonlinear regression problems. They
compare the performance of a modular network so trained with that of a multilayer
perceptron trained with the back-propagation algorithm for a four-joint robot arm moving
in three-dimensional space. Computer simulations were carried out for both batch and
on-line modes of operation. In both cases, learning based on the EM algorithm outperformed back-propagation learning.
The material presented in this chapter represents one particular viewpoint of how
to design modular networks for solving nonlinear regression problems. The notions of
modularity described here represent an important step in the development of a powerful
approach to the design of neural networks. There is much that remains to be explored on
this highly important subject.
PROBLEMS
12.1 Figure P12.1 shows a modular network with a treelike structure. Reduce this
network to its most basic form involving the minimum number of components.
12.2 Consider the modular network of Fig. 12.2. Derive the expressions for the partial derivatives ∂l/∂y_i and ∂l/∂u_i given in Eqs. (12.20) and (12.30), respectively.
12.3 A modular network is trained with a particular data set, and the same module is
found to win the competition for all the training data. What is the implication of
such a phenomenon?
12.4 Consider the modular network of Fig. 12.2, in which each expert network consists
of a feedforward network with two layers of computation nodes. The modular
network is designed to perform nonlinear regression.
Another useful application of the EM algorithm is described in Xu and Jordan (1993a). In this paper, the
EM algorithm is used to combine multiple classifiers for a particular pattern recognition task. In a related paper,
Xu and Jordan (1993b) propose the use of the EM algorithm and the model of a finite mixture of Gaussian
distributions for unsupervised learning.
FIGURE P12.1
(a) Describe the types of neurons that are used in the two layers of each expert
network.
(b) How would you design such a network in light of the theory presented in
Sections 12.3 and 12.4?
12.5 Consider a piecewise-linear task described by
12.8 The learning algorithm derived in Section 12.4 for the modular network of Fig.
12.2 was formulated for the on-line mode of operation. Describe the changes that
would have to be made to this algorithm for the batch mode of operation.
12.9 For unbiased estimates and therefore improved performance of the modular network in Fig. 12.2, it may be necessary to adapt the variances of the multivariate Gaussian distribution. To do so, we redefine the covariance matrix Λ_i of the ith expert network's output vector y_i in the modular network of Fig. 12.2 as follows:
Λ_i = σ_i²I
where σ_i² is the variance to be adapted, and I is the identity matrix. Accordingly, we reformulate the objective function of Eq. (12.13) as
K