12.1 Introduction
The hierarchical levels of organization in artificial neural networks may be classified as
follows. At the most fundamental level are synapses, followed by neurons, then layers of
neurons in the case of a layered network, and finally the network itself. The design of
neural networks that we have pursued up to this point has been of a modular nature at
the level of neurons or layers only. It may be argued that the architecture of a neural
network should go one step higher in the hierarchical level of organization. Specifically,
it should consist of a multiplicity of networks, and learning algorithms should be
designed to take full advantage of the resulting modular structure. The present chapter is
devoted to a particular class of modular networks that relies on the combined use of
supervised and unsupervised learning paradigms.
We may justify the rationale for the use of modular networks by considering the
approximation problem. The approximation of a prescribed input-output mapping may
be realized using a local method that captures the underlying local structure of the mapping.
Such a model is exemplified by radial-basis function (RBF) networks, which were studied
in Chapter 7. The use of a local method offers the advantage of fast learning and therefore
the ability to operate in real time, since it usually requires relatively few training examples
to learn a single task. However, a limitation of local methods is that they tend to be
memory intensive. Alternatively, the approximation may be realized using a global method
that captures the underlying global structure of the mapping. This second model is exemplified by back-propagation learning applied to multilayer perceptrons, which were studied
in Chapter 6. The use of global methods offers the advantages of a smaller storage
requirement and better generalization performance. However, they suffer from a slow
learning process that limits their range of applications. In light of this dichotomy between
local and global methods of approximation, it is natural to ask: How can we combine the
advantages of these two methods? The answer appears to lie in the use of a modular
architecture that captures the underlying structure of an input-output mapping at an
intermediate level of granularity. The idea of using a modular network for realizing a
complex mapping function was discussed by Hinton and Jacobs as far back as the mid-
1980s (Jacobs et al., 1991a). Mention should also be made of a committee machine
consisting of a layer of elementary perceptrons followed by a vote-taking perceptron in
the second layer, which was described in Nilsson (1965). However, it appears that the
class of modular networks discussed in this chapter was first described in Jacobs and
Jordan (1991), and the architecture for it was presented by Jacobs et al. (1991a).
A useful feature of a modular approach is that it also provides a better fit to a discontinuous input-output mapping. Consider, for example, Fig. 12.1, which depicts a
discontinuous function.
¹ This definition is adapted from Osherson et al. (1990), Jacobs and Jordan (1991), and Jacobs et al. (1991a).
response, and the other modules are the losers. Furthermore, each module receives an
amount of training information that is proportional to its relative ability to learn. This
means that the winning module receives more training information than the losing modules.
From Chapter 10 we recall that unsupervised learning involves the use of positive
feedback. So it is with a modular network in that the competition among the modules for
the right to learn the training patterns involves a positive feedback effect. More precisely,
if a particular module learns a great deal about some training patterns, then it will likely
perform well when presented with related training patterns and thus learn a great deal
about them too. By the same token, the module will perform poorly when presented with
unrelated training patterns, in which case it will learn little or nothing about them. In both
cases, the positive feedback effect manifests itself by some form of self-amplification.
where k is the number of free parameters in the model, θ̂ is the estimate of the parameter
vector θ characterizing the model, f(x|θ̂) is the conditional probability density function
of the input vector x given the estimate θ̂, the probability distribution π(θ̂) expresses our
prior knowledge about the estimate θ̂, and N is the length of data available for processing.
The first term in Eq. (12.2) decreases with k, whereas the second term increases with k.
Accordingly, the optimum model order k is the value of k for which MDL(k) is minimum.
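Since Eq. (12.2) is not reproduced here, the sketch below uses a common two-term MDL form (a data-misfit term plus a (k/2) ln N complexity penalty) to illustrate how the two opposing terms select a model order; the function and variable names are illustrative, not from the text.

```python
import numpy as np

def mdl(neg_log_likelihood, k, n):
    """A common two-term MDL score: data misfit plus a complexity
    penalty that grows with the number of free parameters k."""
    return neg_log_likelihood + 0.5 * k * np.log(n)

# Toy illustration: select a polynomial model order by minimum MDL.
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 200)
d = 1.0 + 2.0 * x - 3.0 * x**2 + 0.1 * rng.standard_normal(x.size)

scores = {}
for k in range(1, 7):                        # k free coefficients
    coeffs = np.polyfit(x, d, k - 1)
    sigma2 = np.mean((d - np.polyval(coeffs, x)) ** 2)
    nll = 0.5 * x.size * np.log(sigma2)      # Gaussian misfit, up to a constant
    scores[k] = mdl(nll, k, x.size)

best_k = min(scores, key=scores.get)
```

With the data generated by a noisy quadratic, the misfit term drops sharply up to k = 3 and then barely improves, while the penalty keeps growing, so the minimum lands at the true model order.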
Returning to the issue at hand, it would be highly desirable to have the overall learning
capacity of the modular network match the complexity of the input data. For this to
happen, the total number of free parameters in the modular network should ideally satisfy
the MDL criterion. With the modular network containing a multitude of expert networks
and an integrating unit, it is obvious that each expert network is too simple to deal with
the complexity of the input data by itself. Accordingly, when an expert network is faced
with complex data, it will unavoidably yield large residuals (i.e., estimation errors). This
will, in turn, fuel the competitive process in the modular network and thereby permit the
other expert networks to try to describe the residuals. This is accomplished under the
direction of the integrating unit, thereby providing a viable solution to the credit-assignment
problem.
tasks, the same set of hidden neurons is forced to represent information about both tasks.
On the other hand, in the case of the split model, different sets of hidden neurons are
used to represent information about the two tasks. Provided that enough computational
resources were available in both cases, the split model was found to develop more efficient
internal representations.
3. Hardware Constraints. In a brain, there is a physical limit on the number of neurons
that can be accommodated in the available space. In a related discussion on representations
employed by the brain, it is suggested by Ballard (1986) that such a limitation compels
the brain to adopt a modular structure, and that the brain uses a coarse code to represent
multidimensional spaces. To represent a space of dimension k, it is hypothesized that the
number of neurons required to do the representation is N^k/D^(k−1), where N is the number
of just-noticeable differences in each dimension of the space and D is the diameter of the
receptive field of each neuron. With a limit imposed on the number of neurons in a cortical
area of the brain, the representation of high-dimensional spaces is distributed in different
areas that compute different functions. In an analogous manner, it may be argued that,
in order to reduce the number of neurons in an artificial neural network, the representation
of multidimensional spaces may be distributed among multiple networks (Jacobs et al.,
1991b).
With the background on modularity described in this section and the previous one, we
are ready to undertake a detailed analysis of a special class of modular networks, which
we do in the remaining sections of the chapter.*
The goal of the learning algorithm used to train the modular network of Fig. 12.2 is
to model the probability distribution of the set of training patterns {x,d}. We assume that
the patterns {x,d} used to do the training are generated by a number of different regressive
processes.
To do the learning, the expert networks and the gating network in Fig. 12.2 are all
trained simultaneously. For this purpose, we may use a learning algorithm that proceeds
as follows (Jacobs and Jordan, 1991).
1. An input vector x is selected at random from some prior distribution.
2. A rule or expert network is chosen from the distribution P(i|x); this is the probability
of the ith rule given the input vector x.
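The two steps above (plus the implied third step of generating d from the chosen regressive process) can be sketched concretely; everything below (the two processes, the logistic gate, the noise level) is invented purely for illustration:

```python
import numpy as np

# Sketch of the assumed generative process: an input x is drawn, an
# expert i is drawn from P(i|x), and the desired response is produced
# by that expert's regressive process d = F_i(x) + eps, with eps a
# zero-mean Gaussian disturbance. All names and numbers are illustrative.
rng = np.random.default_rng(1)

experts = [lambda x: 2.0 * x + 1.0,        # regressive process 1
           lambda x: -x + 0.5]             # regressive process 2

def gate_probs(x):
    """P(i|x): here a simple logistic split on the sign of x."""
    p1 = 1.0 / (1.0 + np.exp(-4.0 * x))
    return np.array([p1, 1.0 - p1])

def sample_pattern():
    x = rng.uniform(-1.0, 1.0)             # step 1: draw x from a prior
    i = rng.choice(2, p=gate_probs(x))     # step 2: draw expert from P(i|x)
    d = experts[i](x) + 0.05 * rng.standard_normal()   # d = F_i(x) + eps
    return x, i, d

x, i, d = sample_pattern()
```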
*Much of the material presented in the remainder of this chapter is based on Jacobs and Jordan (1991), and
Jordan and Jacobs (1992). The approach taken in these two papers is statistical in nature, being based on
maximum-likelihood estimation. Some similar results are reported by Szymanski and Lemmon (1993), using
an information-theoretic approach.
12.3 Associative Gaussian Mixture Model
FIGURE 12.2 Block diagram of a modular network; the outputs of the expert networks
(modules) are mediated by a gating network.
d = F_i(x) + ε_i,   i = 1, 2, ..., K   (12.6)

Note also that, in general, the elements of the output vector of each expert are not
uncorrelated. Rather, the covariance matrix of the ith expert network's output vector y_i
is the covariance matrix of ε_i. For the sake of simplicity, however, it is assumed that for
all K expert networks we have

ε_1 = ε_2 = ··· = ε_K

and that the covariance matrix of ε_i is the identity matrix, as shown by

Λ_i = I,   i = 1, 2, ..., K   (12.7)
The multivariate Gaussian distribution of the desired response vector d, given the input
vector x and that the ith expert network is chosen, may therefore be expressed as (Wilks,
1962)
f(d|x, i) = (1/((2π)^{q/2}(det Λ_i)^{1/2})) exp(−(1/2)(d − y_i)^T Λ_i^{−1}(d − y_i))
          = (1/(2π)^{q/2}) exp(−(1/2)(d − y_i)^T(d − y_i))
          = (1/(2π)^{q/2}) exp(−(1/2)‖d − y_i‖²),   i = 1, 2, ..., K   (12.8)

where ‖·‖ denotes the Euclidean norm of the enclosed vector. The multivariate distribution
in Eq. (12.8) is written as a conditional probability density function to emphasize the fact
that, for a given input vector x, we are assuming that the ith expert network produces the
closest match to the desired response vector d.
On this basis, we may treat the probability distribution of the desired response vector
d as a mixture model (i.e., as a linear combination of K different multivariate Gaussian
distributions), as shown by
f(d|x) = Σ_{i=1}^{K} g_i f(d|x, i)
       = (1/(2π)^{q/2}) Σ_{i=1}^{K} g_i exp(−(1/2)‖d − y_i‖²)   (12.9)
The probability distribution of Eq. (12.9) is called an associative Gaussian mixture model;³
the term “associative” refers to the fact that the model is associated with a set of training
patterns represented by the input vector x and desired response vector d.
The goal of the learning algorithm is to model the distribution of a given set of training
patterns. To do so, we first recognize the fact that the output vector y_i of the ith expert
network is a function of the synaptic weight vector w_i of that network. Let the vector w
of appropriate dimension denote the synaptic weights of all the expert networks arranged
as follows:

w = [w_1^T, w_2^T, ..., w_K^T]^T   (12.10)

Similarly, let the vector g denote the activations of all the output neurons in the gating
network:

g = [g_1, g_2, ..., g_K]^T   (12.11)
³ For a discussion of nonassociative Gaussian mixture models, see McLachlan and Basford (1988).
We may thus view the conditional probability density function f(d|x) as a likelihood
function, with the whole synaptic weight vector w and the activation vector g playing the
roles of unknown parameters. In situations of the kind described by Eq. (12.9), it is
preferable to work with the natural logarithm of f(d|x) rather than f(d|x) itself; we may
do so since the logarithm is a monotone increasing function of its argument. Accordingly,
we may define a log-likelihood function as follows:

l(w, g) = ln f(d|x)   (12.12)
Substituting Eq. (12.9) in (12.12) and ignoring the constant term −ln(2π)^{q/2}, we may
formally express the log-likelihood function l(w, g) as follows (Jacobs and Jordan, 1991;
Jacobs et al., 1991a):

l(w, g) = ln Σ_{i=1}^{K} g_i exp(−(1/2)‖d − y_i‖²)   (12.13)

where it is understood that y_i depends on w_i (i.e., the ith portion of w). We may thus
view l(w, g) as an objective function, the maximization of which yields maximum-likelihood
estimates of all the free parameters of the modular network in Fig. 12.2, represented by
the synaptic weights of the different expert networks and those of the gating network.
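A direct numerical transcription of Eq. (12.13) is straightforward; the helper below (illustrative names, NumPy assumed) evaluates l for given gating activations g, expert outputs Y, and desired response d, using the standard log-sum-exp shift for numerical stability:

```python
import numpy as np

def log_likelihood(g, Y, d):
    """l(w, g) = ln sum_i g_i exp(-0.5 ||d - y_i||^2), with the constant
    -(q/2) ln(2*pi) dropped as in Eq. (12.13).
    g: (K,) gating activations; Y: (K, q) expert outputs; d: (q,)."""
    sq = 0.5 * np.sum((d - Y) ** 2, axis=1)   # 0.5 ||d - y_i||^2 per expert
    a = np.log(g) - sq
    m = a.max()                               # log-sum-exp shift for stability
    return m + np.log(np.exp(a - m).sum())

g = np.array([0.7, 0.3])
Y = np.array([[1.0, 0.0], [0.0, 1.0]])
d = np.array([1.0, 0.0])
ll = log_likelihood(g, Y, d)
```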
We may now offer the following interpretations for some of the modular network's
unknown quantities (Jacobs and Jordan, 1991):
1. The optimized output vectors y_1, y_2, ..., y_K of the expert networks are
unknown conditional mean vectors.
2. The optimized gating network's outputs g_1, g_2, ..., g_K are the conditional a priori
probabilities that the respective modules generated the current training pattern.
The probabilistic parameters referred to under points 1 and 2 are all conditional on the
input vector x.
Whereas the different expert networks of the modular structure in Fig. 12.2 are permitted
to have an arbitrary connectivity, the activations of the output neurons of the gating
network are constrained to satisfy two requirements (Jacobs and Jordan, 1991):
0 ≤ g_i ≤ 1  for all i   (12.14)

Σ_{i=1}^{K} g_i = 1   (12.15)
These two constraints are necessary if the activations gi are to be interpreted as a priori
probabilities.
Given a set of unconstrained variables, {u_j | j = 1, 2, ..., K}, we may satisfy the two
constraints of Eqs. (12.14) and (12.15) by defining the activation g_i of the ith output
neuron of the gating network as follows (Bridle, 1990a):

g_i = exp(u_i) / Σ_{j=1}^{K} exp(u_j)   (12.16)

where u_i is the weighted sum of the inputs applied to the ith output neuron of the gating
network. This normalized exponential transformation may be viewed as a multi-input
generalization of the logistic function. It preserves the rank order of its input values,
and is a differentiable generalization of the "winner-takes-all" operation of picking the
maximum value. For this reason, the transformation of Eq. (12.16) is referred to as softmax
(Bridle, 1990a, b).
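The properties just listed are easy to verify numerically; a minimal softmax sketch (illustrative code, not from the text):

```python
import numpy as np

def softmax(u):
    """Normalized exponential of Eq. (12.16); shift by max for stability."""
    e = np.exp(u - np.max(u))
    return e / e.sum()

u = np.array([2.0, 1.0, -0.5])
g = softmax(u)

# Properties noted in the text: the outputs lie in [0, 1], sum to one,
# and the transformation preserves the rank order of its inputs.
assert np.argmax(g) == np.argmax(u)

# Sharpening the inputs (multiplying by a large gain) approaches the
# hard winner-takes-all operation of picking the maximum.
g_hard = softmax(50.0 * u)
```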
The a posteriori probability h_i that the ith expert network generated the desired response
vector d is defined in terms of the activations g_i as

h_i = g_i exp(−(1/2)‖d − y_i‖²) / Σ_{j=1}^{K} g_j exp(−(1/2)‖d − y_j‖²)   (12.17)

This probability is conditional on both the input vector x and the desired response vector
d. From this definition, we also note that, as with the a priori probabilities represented
by the activations g_i, the a posteriori probabilities h_i satisfy the two necessary conditions:
0 ≤ h_i ≤ 1  for all i   (12.18)

and

Σ_{i=1}^{K} h_i = 1   (12.19)
There are two different parameter adjustments to be performed in the modular network
of Fig. 12.2: those of the expert networks and those of the gating network. Differentiating
the log-likelihood function with respect to the output vector y_i gives

∂l/∂y_i = h_i(d − y_i),   i = 1, 2, ..., K   (12.20)
Equation (12.20) states that, during the training process, the synaptic weights of the ith
expert network in Fig. 12.2 are adjusted to correct the error between the output vector yi
and the desired response vector d, but in proportion to the a posteriori probability h_i that
the ith expert network generated the training pattern in current use (Jacobs and Jordan,
1991).
Suppose now that each expert network consists of a single layer of neurons, as depicted
in the architectural graph of Fig. 12.3a. The specification of the neurons in Fig. 12.3a
depends on whether we are solving a regression or classification problem, as explained
here:
Regression. In a nonlinear regression problem the residuals are generally assumed
to have a multivariate Gaussian distribution. In a corresponding way, the output
neurons of the expert networks are modeled as linear.
Classification. In a pattern-classification problem, the output neurons are usually
assumed to have a sigmoidal nonlinearity. In this case, a mixture of Bernoulli
distributions is the appropriate probabilistic model.⁴
12.4 Stochastic-Gradient Learning Algorithm
FIGURE 12.3 (a) Single layer of linear neurons constituting the expert network.
(b) Signal-flow graph of a linear neuron.
In all cases, of course, the gating networks use the softmax nonlinearity.
In the discussion that follows we consider a nonlinear regression problem, assuming
a multivariate Gaussian model. The nonlinear nature of the problem is taken care of by
the softmax nonlinearity of the gating network in Fig. 12.2. The Gaussian assumption is
taken care of by using linear output neurons for the implementation of the expert networks.
We may thus define the mth element of the output vector y_i of the ith expert network
as the inner product of the corresponding synaptic weight vector w_i^{(m)} and the input vector
x, as depicted in the signal-flow graph of Fig. 12.3b; that is,

y_i^{(m)} = x^T w_i^{(m)},   m = 1, 2, ..., q   (12.21)

where the superscript T denotes transposition, and the weight vector w_i^{(m)} is made up of
the elements w_{i1}^{(m)}, ..., w_{ip}^{(m)} of the mth neuron in the ith expert network. Hence,
⁴ In binary classification, the classifier output y is a discrete random variable with one of two possible
outcomes: 1 or 0. It is generally assumed that the probabilistic component of the model has a Bernoulli
distribution. Let p_i denote the conditional probability that the ith expert network reports outcome 1 given the
input vector x. The resulting probability distribution of the modular network may then be described by a Bernoulli
mixture model (Jordan and Jacobs, 1993).
differentiating y_i^{(m)} with respect to the synaptic weight vector w_i^{(m)}, we get

∂y_i^{(m)}/∂w_i^{(m)} = x   (12.22)
The sensitivity vector of the log-likelihood function l with respect to the synaptic
weight vector w_i^{(m)} is defined by the functional derivative ∂l/∂w_i^{(m)}. Using the chain rule,
we may express this sensitivity vector as

∂l/∂w_i^{(m)} = (∂l/∂y_i^{(m)})(∂y_i^{(m)}/∂w_i^{(m)})   (12.23)

The partial derivative ∂l/∂y_i^{(m)} is the mth element of the functional derivative ∂l/∂y_i defined
in Eq. (12.20); that is,
∂l/∂y_i^{(m)} = h_i e_i^{(m)}   (12.24)

where e_i^{(m)} is the error signal produced at the output of the mth neuron in the ith expert
network, as shown by

e_i^{(m)} = d^{(m)} − y_i^{(m)}   (12.25)
We are now ready to formulate the expression for the sensitivity vector ∂l/∂w_i^{(m)}. Specifically,
substituting Eqs. (12.22) and (12.24) in (12.23), we get

∂l/∂w_i^{(m)} = h_i e_i^{(m)} x,   i = 1, 2, ..., K;  m = 1, 2, ..., q   (12.26)
To maximize the log-likelihood function l with respect to the synaptic weights of the
different expert networks, we may use gradient ascent in weight space. In particular, we
modify the synaptic weight vector w_i^{(m)} by applying a small adjustment Δw_i^{(m)}, defined by

Δw_i^{(m)} = η ∂l/∂w_i^{(m)}   (12.27)

where η is a small learning-rate parameter. Note that the scaling factor on the right-hand
side of Eq. (12.27) is +η, since we are using gradient ascent to maximize l. Thus, using
w_i^{(m)}(n) to denote the value of the synaptic weight vector w_i^{(m)} at iteration n of the learning
algorithm, the updated value of this synaptic weight vector at iteration n + 1 is computed
in accordance with the recursion

w_i^{(m)}(n + 1) = w_i^{(m)}(n) + Δw_i^{(m)}(n)
               = w_i^{(m)}(n) + η h_i e_i^{(m)}(n) x,   i = 1, 2, ..., K;  m = 1, 2, ..., q   (12.28)
This is the desired recursive formula for adapting the expert networks of the modular
architecture shown in Fig. 12.2.
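Putting Eqs. (12.16), (12.17), (12.21), (12.25), and (12.28) together, a single training step for the expert networks might be sketched as follows; this assumes NumPy, and the names (`expert_update`, `W`, `a`) are illustrative, with W stacking the q-by-p weight matrices of the K linear experts:

```python
import numpy as np

def expert_update(W, a, x, d, eta=0.1):
    """One stochastic-gradient-ascent step of Eq. (12.28) for linear
    experts, under the identity-covariance Gaussian model.
    W: (K, q, p) expert weights; a: (K, p) gating weights (used here
    only to form g and the posteriors h); x: (p,); d: (q,)."""
    u = a @ x                                   # weighted sums, Eq. (12.31)
    g = np.exp(u - u.max()); g /= g.sum()       # softmax, Eq. (12.16)
    Y = W @ x                                   # y_i = W_i x, Eq. (12.21)
    log_h = np.log(g) - 0.5 * np.sum((d - Y) ** 2, axis=1)
    h = np.exp(log_h - log_h.max()); h /= h.sum()   # posteriors, Eq. (12.17)
    E = d - Y                                   # errors e_i^(m), Eq. (12.25)
    W_new = W + eta * h[:, None, None] * E[:, :, None] * x[None, None, :]
    return W_new, g, h

W = np.zeros((2, 1, 2))           # K=2 experts, q=1 output, p=2 inputs
a = np.zeros((2, 2))
x = np.array([1.0, 0.0])
d = np.array([1.0])
W_new, g, h = expert_update(W, a, x, d)
```

With all weights zero, both experts are equally credible (h_i = 1/2), so each receives half of the full error correction, illustrating how the posterior apportions training information among the experts.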
FIGURE 12.4 (a) Single layer of softmax neurons for the gating network. (b) Signal-flow
graph of a softmax neuron.
The gating network differs from the expert networks in two respects. First, the gating network has K output neurons,
whereas each expert network has q output neurons. Second, the gating network uses a
softmax for the activation function of its output neurons as depicted in the signal-flow
graph of Fig. 12.4b, whereas the expert networks use linear output neurons as shown in
the signal-flow graph of Fig. 12.3b. In any event, we note that at the ith output neuron
of the gating network, say, the unconstrained variables are represented by the weighted
sum u_i of the inputs applied to that neuron. The activation g_i of the ith output neuron is
related to the weighted sum u_i by the softmax of Eq. (12.16). Hence, substituting Eq. (12.16)
in the definition of the log-likelihood function l given in Eq. (12.13), and recognizing that
the summation term Σ_{j=1}^{K} exp(u_j) is the same for all i, we may rewrite the expression for
the log-likelihood function l as

l = ln Σ_{i=1}^{K} exp(u_i) exp(−(1/2)‖d − y_i‖²) − ln Σ_{j=1}^{K} exp(u_j)   (12.29)
The partial derivative of the log-likelihood function l with respect to the ith weighted
sum u_i of the inputs applied to the ith output neuron of the gating network is therefore
found to be (after simplification)

∂l/∂u_i = h_i − g_i   (12.30)

where we have made use of the definitions of g_i and h_i given in Eqs. (12.16) and (12.17),
respectively. Equation (12.30) states that the synaptic weights of the ith output neuron of
the gating network are adjusted such that the activations of the network (i.e., the a priori
probabilities g_i) move toward the corresponding a posteriori probabilities h_i (Jacobs and
Jordan, 1991). Note that the a priori probabilities g_i are conditional on the input vector
x , whereas the a posteriori probabilities hi are conditional on both the input vector x and
the desired response vector d.
From the signal-flow graph of Fig. 12.4b, we see that the weighted sum u_i of the ith
output neuron of the gating network is equal to the inner product of the pertinent synaptic
weight vector a_i and the input vector x, as shown by

u_i = x^T a_i   (12.31)

where the vector a_i is made up of the synaptic weights a_{i1}, a_{i2}, ..., a_{ip} of neuron i in the
gating network. Hence the partial derivative of the weighted sum u_i with respect to the
weight vector a_i is given by the p-by-1 vector

∂u_i/∂a_i = x   (12.32)
The sensitivity vector of the log-likelihood function l with respect to the synaptic
weight vector a_i is defined by the partial derivative ∂l/∂a_i. Using the chain rule, we may
express this sensitivity vector as

∂l/∂a_i = (∂l/∂u_i)(∂u_i/∂a_i) = (h_i − g_i)x,   i = 1, 2, ..., K   (12.33)
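The sensitivity vector of Eq. (12.33) translates directly into a gating-network update: the adjustment to a_i is proportional to (h_i − g_i)x, nudging the a priori probabilities toward the posteriors. A minimal sketch (illustrative names, NumPy assumed):

```python
import numpy as np

def gating_update(a, g, h, x, eta=0.1):
    """Gradient-ascent step implied by Eq. (12.33): with u_i = a_i^T x
    and dl/du_i = h_i - g_i, the update is Delta a_i = eta*(h_i - g_i)*x,
    so the a priori probabilities g move toward the posteriors h."""
    return a + eta * (h - g)[:, None] * x[None, :]

a = np.zeros((2, 2))
g = np.array([0.5, 0.5])          # a priori (gating) probabilities
h = np.array([0.9, 0.1])          # posterior favors expert 1
x = np.array([1.0, 2.0])
a_new = gating_update(a, g, h, x)
```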
2. Adapting the Expert and Gating Networks. Present the network with a task example
represented by the input vector x and desired response vector d. Hence, compute
for iteration n = 0, 1, 2, ..., output i = 1, 2, ..., K, and neuron m = 1, 2, ..., q:

u_i(n) = x^T a_i(n)
g_i(n) = exp(u_i(n)) / Σ_{j=1}^{K} exp(u_j(n))
y_i^{(m)}(n) = x^T w_i^{(m)}(n)
y_i(n) = [y_i^{(1)}(n), y_i^{(2)}(n), ..., y_i^{(q)}(n)]^T
e_i^{(m)}(n) = d^{(m)} − y_i^{(m)}(n)
FIGURE 12.5 Hierarchical modular network: the expert networks are grouped into
clusters 1 through K, each cluster with its own gating network, and a top-level gating
network mediating the cluster outputs.
where u_i is the weighted sum of the inputs applied to that neuron; basically, this equation
is a reproduction of Eq. (12.16) for the case of the single level of hierarchy shown in
Fig. 12.2. Similarly, the activation of the jth output neuron in the ith cluster of the
hierarchical network shown in Fig. 12.5 is defined by

g_{j|i} = exp(u_{j|i}) / Σ_{k=1}^{L} exp(u_{k|i})   (12.38)

where u_{j|i} is the weighted sum of the inputs applied to this particular neuron in the ith
cluster.
12.5 Hierarchical Structure of Adaptive Expert Networks
The expert networks in each cluster of Fig. 12.5 are also assumed to consist of a single
layer of linear neurons. Let y_{ji} denote the output vector of the jth expert network in the
ith cluster. We may then express the output vector y_i of the ith cluster of expert networks
as

y_i = Σ_{j=1}^{L} g_{j|i} y_{ji}   (12.39)

where g_{j|i} is as defined in Eq. (12.38). Correspondingly, the output vector y of the whole
network in Fig. 12.5 is defined by

y = Σ_{i=1}^{K} g_i y_i   (12.40)
In a manner analogous to Eq. (12.6), the desired response vector d is assumed to be
generated by one of the K × L regressive processes

d = F_{ji}(x) + ε_{ji},   i = 1, 2, ..., K;  j = 1, 2, ..., L   (12.41)

where F_{ji}(·) is a vector-valued nonlinear function of its argument vector, and ε_{ji} is
a zero-mean, Gaussian-distributed random vector.
The goal is to model the probability distribution of the set of training examples {x, d}.
For the objective function appropriate to the learning problem at hand, we use a log-likelihood
function defined as the expanded form of an associative Gaussian mixture
model. Specifically, we write (except for a constant)

l = ln Σ_{i=1}^{K} g_i Σ_{j=1}^{L} g_{j|i} exp(−(1/2)‖d − y_{ji}‖²)   (12.42)
where d is the desired response vector associated with the input vector x applied simultaneously
to all the expert networks and gating networks in Fig. 12.5; the outputs y_{ji} and the
activations g_i and g_{j|i} are as defined before. Given the training data {x, d}, we wish to
maximize the log-likelihood function l with respect to the unknown quantities y_{ji}, g_i, and
g_{j|i}. These quantities, pertaining to the structure of Fig. 12.5, may be given the following
probabilistic interpretations (Jordan and Jacobs, 1992):

■ The output vectors y_{ji} of the individual expert networks are the unknown conditional
mean vectors of multivariate Gaussian distributions.
■ The activation g_i of the top-level gating network and the activations g_{j|i} of the gating
networks inside the isolated clusters are the unknown conditional a priori probabilities
that the ith cluster and the jth expert network within it, respectively, generated the
current training pattern {x, d}.
All of these unknown probabilistic quantities are conditional on the input vector x.
To assist in the formulation of the learning algorithm, we introduce two probabilistic
definitions. First, we define the conditional a posteriori probability that the ith cluster of
expert networks generates a particular desired response vector d as

h_i = g_i Σ_{j=1}^{L} g_{j|i} exp(−(1/2)‖d − y_{ji}‖²) / Σ_{k=1}^{K} g_k Σ_{j=1}^{L} g_{j|k} exp(−(1/2)‖d − y_{jk}‖²)   (12.43)

Second, we define the conditional a posteriori probability that the jth expert network in
the ith cluster generates a particular desired response vector d as

h_{j|i} = g_{j|i} exp(−(1/2)‖d − y_{ji}‖²) / Σ_{k=1}^{L} g_{k|i} exp(−(1/2)‖d − y_{ki}‖²)   (12.44)

The a posteriori probabilities h_i and h_{j|i} are both conditional on the input vector x and
the desired response vector d.
The output vectors y_{ji} and the activations g_i and g_{j|i} depend on the synaptic weights of
the neural networks that constitute the respective expert networks and gating networks.
The log-likelihood function l defined in Eq. (12.42) may therefore be viewed as a function
of these synaptic weights, which play the role of the unknown parameters. Hence, maximizing
the log-likelihood function l yields the maximum-likelihood estimates of these
parameters. The maximization of l may be performed in an iterative fashion, using gradient
ascent for the computation of small adjustments applied simultaneously to all the synaptic
weights in the network.
We may compute these synaptic modifications in Fig. 12.5 by proceeding in three
stages, as follows (Jordan and Jacobs, 1992):
1. Adapting the Top-Level Gating Network. Here we differentiate the log-likelihood
function l of Eq. (12.42) with respect to the weighted sum u_i of the inputs applied
to the ith output neuron of the top-level gating network, obtaining the scalar partial
derivative

∂l/∂u_i = h_i − g_i,   i = 1, 2, ..., K   (12.45)

Hence, during the training process the a priori probability g_i tries to move toward
the corresponding a posteriori probability h_i.
2. Adapting the Gating Networks in the Clusters. In this case, we differentiate the log-likelihood
function l of Eq. (12.42) with respect to the weighted sum u_{j|i} of the
inputs applied to the jth output neuron of the gating network in the ith cluster,
obtaining the scalar partial derivative

∂l/∂u_{j|i} = h_i(h_{j|i} − g_{j|i}),   i = 1, 2, ..., K;  j = 1, 2, ..., L   (12.46)

Consequently, during training the a priori probability g_{j|i} tries to move toward the
corresponding a posteriori probability h_{j|i}.
3. Adapting the Expert Networks. Next, we differentiate the log-likelihood function l
of Eq. (12.42) with respect to the output vector y_{ji} of the jth expert network in the
ith cluster, obtaining

∂l/∂y_{ji} = h_i h_{j|i}(d − y_{ji}),   i = 1, 2, ..., K;  j = 1, 2, ..., L   (12.47)

Thus, during training the synaptic weights of the jth expert network in the ith cluster
are updated by an amount proportional to the a posteriori probability that this
particular expert network generated the training pattern in current use.
The partial derivative ∂l/∂u_{j|i} of Eq. (12.46) and the partial derivative ∂l/∂y_{ji} of Eq.
(12.47) share a common factor, namely, the a posteriori probability h_i. This means that
the expert networks within a cluster are tied to each other. Consequently, the expert
networks within a cluster tend to learn similar mappings early in the training process.
However, when the probabilities associated with the cluster to which the expert networks
belong assume larger values later in the training process, the experts start to specialize in
what they learn. Thus the hierarchical network of Fig. 12.5 tends to evolve in a coarse-to-fine
structural fashion. This property is important, because it implies that a deep hierarchical
network is naturally robust with respect to the overfitting problem (Jordan and Jacobs,
1992).
The final step in the development of the stochastic-gradient learning algorithm for the
network of Fig. 12.5 involves the determination of the sensitivity factors ∂l/∂a_i, ∂l/∂c_{j|i},
and ∂l/∂w_{ji}^{(m)}. The vector a_i denotes the synaptic weight vector of the ith output neuron
of the top-level gating network, c_{j|i} denotes the synaptic weight vector of the jth output
neuron of the gating network in the ith cluster, and w_{ji}^{(m)} denotes the synaptic weight vector
of the mth output neuron of the jth expert network in the ith cluster; the index m = 1,
2, ..., q, where q is the total number of output neurons in each expert network. To find
the formulas for these sensitivity factors, we use chain rules to express them as the
products of certain functional derivatives, as shown here:

∂l/∂a_i = (∂l/∂u_i)(∂u_i/∂a_i),   i = 1, 2, ..., K   (12.48)

∂l/∂c_{j|i} = (∂l/∂u_{j|i})(∂u_{j|i}/∂c_{j|i}),   i = 1, 2, ..., K;  j = 1, 2, ..., L   (12.49)

∂l/∂w_{ji}^{(m)} = (∂l/∂y_{ji}^{(m)})(∂y_{ji}^{(m)}/∂w_{ji}^{(m)}),   i = 1, 2, ..., K;  j = 1, 2, ..., L;  m = 1, 2, ..., q   (12.50)
The derivations of the functional derivatives in Eqs. (12.48) through (12.50) and their
use in the determination of these three sensitivity factors, and therefore the development
of a learning algorithm for the network of Fig. 12.5, follow a procedure similar to that
described in Section 12.4; this derivation is left as an exercise for the reader in
Problem 12.7.
12.6 Piecewise Control Using Modular Networks
FIGURE 12.6 Two-joint planar arm. (From R.A. Jacobs and M.I. Jordan, 1993, with
permission of IEEE.)
control strategy is applicable to situations where it is known how the dynamics of a plant
change with its operating points.
An important advantage of a modular network over a multilayer perceptron as feedforward
controller for the kind of plant described here is that it is relatively robust with
respect to temporal crosstalk (Jacobs and Jordan, 1993; Jacobs et al., 1991b). To explain
what we mean by this phenomenon, suppose that a multilayer perceptron is trained on a
particular task using the back-propagation algorithm, and then it is switched to another
task that is incompatible with the first one. Ideally, we would like to have the network
learn the second task without its performance being unnecessarily impaired with respect
to the first task. However, according to Sutton (1986), back-propagation learning has the
opposite effect in the sense that it tends to preferentially modify the synaptic weights of
hidden neurons that have already developed useful properties. Consequently, after learning
the second task, the multilayer perceptron is no longer able to perform the first task. We
may of course alternate the training process between tasks, and carry on in this way until
the multilayer perceptron eventually learns both tasks. However, the price that we have
to pay for it is a prolonged learning process. The generalization performance of the
multilayer perceptron may also be affected by the blocked presentation of incompatible
training data.
In contrast, when a modular network is trained on a number of incompatible tasks, it
has the ability to partition the parameter space of the plant into a number of regions and
assign different expert networks to learn a control law for each region (Jacobs and Jordan,
1993). The end result is that the network is relatively immune to temporal crosstalk
between tasks.
Jacobs and Jordan (1993) have compared the performance of a single network and a
modular network as feedforward controllers for a robot arm in the form of the two-joint
planar manipulator shown in Fig. 12.6. These two networks were specified as follows:⁵
A fully connected multilayer perceptron with 19 input nodes (corresponding to the
joint variables and payload identity), 10 hidden neurons, and 2 output neurons
(corresponding to the feedforward torques).
⁵ In the study reported by Jacobs and Jordan (1993), two other modifications of the modular network were
also considered, which are referred to as modular architectures with a share network. During training the share
network learns a strategy that is useful to all tasks. One of these two networks was constrained to discover a
particular type of decomposition. The performance of these two configurations was almost the same, and superior
to that of the conventional modular network.
12.7 Summary and Discussion
[Axes of Fig. 12.7: joint root-mean-square error (radians), 0.0 to 0.3, versus number of
epochs, 0 to 10; SN = single network, MA = modular architecture.]
FIGURE 12.7 Learning curves for a single network and a modular network, both of
which are trained on the two-joint planar arm. (From R.A. Jacobs and M.I. Jordan, 1993,
with permission of IEEE.)
On the other hand, activation functions in the form of softmax are used for the neurons in the gating networks.
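As a minimal sketch of the softmax activation used in the gating networks (the input values below are hypothetical, chosen only for illustration):

```python
import numpy as np

def softmax(u):
    """Softmax activation: g_i = exp(u_i) / sum_j exp(u_j).

    Subtracting max(u) before exponentiating keeps the computation
    numerically stable without changing the result.
    """
    z = np.exp(u - np.max(u))
    return z / z.sum()

# Hypothetical gating-network outputs for three expert networks.
u = np.array([2.0, 1.0, 0.1])
g = softmax(u)
# The activations are nonnegative and sum to one, so they can be
# interpreted as prior probabilities of selecting each expert.
```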
The use of a modular hierarchical approach as described here offers many attractive
features (Houk, 1992; Jordan, 1992; Jacobs et al., 1991b):
■ Biological plausibility
■ A treelike structure that is a universal approximator
■ A level of granularity that is intermediate between those of the local and global methods of approximation
■ No back-propagation of error terms in the recursive computations of the synaptic weights characterizing the expert networks and gating networks
■ An insightful probabilistic interpretation for the various unknown quantities of the network
■ A viable solution to the credit-assignment problem through the computation of a set of a posteriori probabilities
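The last of these features can be sketched concretely, assuming Gaussian expert models with unit variance; the gating priors, desired response, and expert outputs below are hypothetical values chosen for illustration:

```python
import numpy as np

def posterior(g, d, y):
    """A posteriori probability h_i that expert i generated target d.

    g : prior probabilities from the gating network (sum to one)
    d : desired response vector
    y : outputs of the K experts, one row per expert
    Assumes unit-variance Gaussian expert models, so the likelihood of
    expert i is proportional to exp(-||d - y_i||^2 / 2).
    """
    lik = np.exp(-0.5 * np.sum((d - y) ** 2, axis=1))
    h = g * lik
    return h / h.sum()

g = np.array([0.5, 0.3, 0.2])   # hypothetical gating priors
d = np.array([1.0, 0.0])        # hypothetical desired response
y = np.array([[0.9, 0.1],       # expert whose output is closest to d
              [0.0, 1.0],
              [-1.0, 0.0]])
h = posterior(g, d, y)
# The expert whose output best matches d receives the largest share
# of the credit, i.e., the largest a posteriori probability.
```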
The probabilistic interpretation referred to here is a direct consequence of the way in
which the objective function for deriving the learning algorithm is formulated. Specifically,
the objective function is defined as a log-likelihood function of the synaptic weights
of the expert networks and those of the gating networks. Accordingly, maximization of
the objective function results in maximum-likelihood estimates of these unknown free
parameters of the network. Maximum-likelihood parameter estimation is a well-developed
branch of statistics, which means that we have a wealth of knowledge at our disposal for
the computation and interpretation of these estimates. In particular, if the estimates are known to be unbiased, then the maximum-likelihood estimates of the synaptic weights of the various network components satisfy an inequality referred to as the Cramér-Rao lower bound (Van Trees, 1968). If this bound is satisfied with an equality, the estimates are said to be efficient. At this point a logical question to ask is: Do parameter estimation
procedures better than the maximum-likelihood procedure exist? Certainly, if an efficient
procedure does not exist, then there may be unbiased estimates with lower variances than
the maximum-likelihood estimates. The difficulty, however, is that there is no general
rule for finding them. It is for this reason that the maximum-likelihood procedure enjoys
a great deal of popularity as a tool for parameter estimation.
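In scalar form, the bound just mentioned may be stated as follows; this is the standard statement for an unbiased estimate θ̂ of a parameter θ, not a result specific to the network at hand:

```latex
\mathrm{var}(\hat{\theta}) \;\ge\; \frac{1}{J(\theta)},
\qquad
J(\theta) = E\!\left[\left(\frac{\partial}{\partial\theta}
\ln f_{\mathbf{X}}(\mathbf{x};\theta)\right)^{2}\right]
```

where J(θ) is the Fisher information of the observed data; an unbiased estimate attaining the bound with equality is efficient in the sense defined above.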
A practical limitation of the maximum-likelihood procedure is that it can be computa-
tionally intensive. We may overcome this limitation by following a gradient ascent proce-
dure, as described in Sections 12.4 and 12.5. However, by doing so there is no guarantee
that we reach the globally maximum value of the objective function (i.e., the log-likelihood function), as there is the real possibility of being trapped in a local maximum. To guard against such a possibility, we may add small amounts of noise to the sensitivity factors so as to help the stochastic-gradient algorithm escape local maxima (Jelinek and Reilly, 1990).
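The noise-injection remedy can be sketched as follows; this is a minimal illustration on a hypothetical one-dimensional objective, and the objective, learning rate, and linearly annealed noise schedule are illustrative assumptions rather than the network's actual log-likelihood or the scheme of Jelinek and Reilly:

```python
import numpy as np

def noisy_gradient_ascent(grad, theta0, eta=0.05, sigma=0.3,
                          steps=2000, rng=None):
    """Gradient ascent with additive noise to help escape local maxima.

    grad   : gradient of the objective being maximized
    theta0 : starting point
    eta    : learning rate
    sigma  : standard deviation of the injected noise, annealed
             linearly to zero over the run
    """
    rng = np.random.default_rng(0) if rng is None else rng
    theta = theta0
    for n in range(steps):
        noise = sigma * (1.0 - n / steps) * rng.standard_normal()
        theta += eta * grad(theta) + noise
    return theta

# Hypothetical multimodal objective and its gradient.
f = lambda t: -0.1 * (t + 1) ** 2 * (t - 2) ** 2 + 0.5 * t
df = lambda t: (-0.2 * (t + 1) * (t - 2) ** 2
                - 0.2 * (t + 1) ** 2 * (t - 2) + 0.5)

theta = noisy_gradient_ascent(df, theta0=-1.0)
# The injected noise lets the iterate wander out of a shallow local
# basin that plain gradient ascent from the same start would stay in.
```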
Alternatively, we may use the expectation-maximization (EM) algorithm (Dempster et
al., 1977), which is well suited for mixture models. The EM algorithm is a general
approach to iterative computation of maximum-likelihood estimates when the observed
data can be viewed as incomplete data. The term “incomplete data” has two implications:
1. The existence of two sample spaces X and Y, represented by the observed data vector x and the complete data vector y, respectively
2. The many-to-one mapping y → x(y) from space Y to space X
The complete data vector y is not observed directly, but only through the vector x. Each iteration of the EM algorithm is composed of two steps: an expectation (E) step and a maximization (M) step; hence the name of the algorithm. Suppose that θ̂(n) denotes the current value of the parameter vector after n cycles of the algorithm. Let l_c denote the log-likelihood of the complete data vector y. The next cycle of the EM algorithm may then be described as follows:
E STEP. Compute the expected value of the complete-data log-likelihood l_c, given the observed data vector x and the current model represented by the parameter vector θ̂(n), by evaluating the deterministic quantity
Q[θ, θ̂(n)] = E[l_c(θ) | x, θ̂(n)]   (12.51)
M STEP. Determine the updated value θ̂(n + 1) of the parameter vector as the solution of the equation
θ̂(n + 1) = arg max_θ Q[θ, θ̂(n)]   (12.52)
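As an illustration of these two steps, consider the classical example of fitting the means of a two-component Gaussian mixture, where the unobserved component labels constitute the missing part of the complete data; the data set, unit-variance assumption, and initial guesses below are hypothetical:

```python
import numpy as np

def em_two_gaussians(x, mu, n_cycles=50):
    """EM for a two-component Gaussian mixture with unit variances and
    equal mixing proportions; only the two means are estimated.

    E step: compute the expected component labels (responsibilities)
            given x and the current means.
    M step: re-estimate each mean as the responsibility-weighted
            average, which maximizes the expected complete-data
            log-likelihood Q.
    """
    mu = np.array(mu, dtype=float)
    for _ in range(n_cycles):
        # E step: posterior probability that each sample came from
        # component 0 (unit-variance Gaussians, equal priors).
        p0 = np.exp(-0.5 * (x - mu[0]) ** 2)
        p1 = np.exp(-0.5 * (x - mu[1]) ** 2)
        r0 = p0 / (p0 + p1)
        # M step: responsibility-weighted means.
        mu[0] = np.sum(r0 * x) / np.sum(r0)
        mu[1] = np.sum((1 - r0) * x) / np.sum(1 - r0)
    return mu

rng = np.random.default_rng(1)
# Hypothetical incomplete data: samples drawn from two Gaussians whose
# labels (the missing data) are discarded before fitting.
x = np.concatenate([rng.normal(-2.0, 1.0, 200),
                    rng.normal(3.0, 1.0, 200)])
mu = em_two_gaussians(x, mu=[-1.0, 1.0])
```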
Jordan and Jacobs (1993) describe the application of the EM algorithm to the design of hierarchical mixtures of experts for solving nonlinear regression problems. They
compare the performance of a modular network so trained with that of a multilayer
perceptron trained with the back-propagation algorithm for a four-joint robot arm moving
in three-dimensional space. Computer simulations were carried out for both batch and
on-line modes of operation. In both cases, learning based on the EM algorithm outperformed back-propagation learning.
The material presented in this chapter represents one particular viewpoint of how
to design modular networks for solving nonlinear regression problems. The notions of
modularity described here represent an important step in the development of a powerful
approach to the design of neural networks. There is much that remains to be explored on
this highly important subject.
PROBLEMS
12.1 Figure P12.1 shows a modular network with a treelike structure. Reduce this
network to its most basic form involving the minimum number of components.
12.2 Consider the modular network of Fig. 12.2. Derive the expressions for the partial derivatives ∂l/∂y_i and ∂l/∂u_i given in Eqs. (12.20) and (12.30), respectively.
12.3 A modular network is trained with a particular data set, and the same module is
found to win the competition for all the training data. What is the implication of
such a phenomenon?
12.4 Consider the modular network of Fig. 12.2, in which each expert network consists
of a feedforward network with two layers of computation nodes. The modular
network is designed to perform nonlinear regression.
Another useful application of the EM algorithm is described in Xu and Jordan (1993a). In this paper, the
EM algorithm is used to combine multiple classifiers for a particular pattern recognition task. In a related paper,
Xu and Jordan (1993b) propose the use of the EM algorithm and the model of a finite mixture of Gaussian
distributions for unsupervised learning.
FIGURE P12.1
(a) Describe the types of neurons that are used in the two layers of each expert
network.
(b) How would you design such a network in light of the theory presented in
Sections 12.3 and 12.4?
12.5 Consider a piecewise-linear task described by
12.8 The learning algorithm derived in Section 12.4 for the modular network of Fig.
12.2 was formulated for the on-line mode of operation. Describe the changes that
would have to be made to this algorithm for the batch mode of operation.
12.9 For unbiased estimates and therefore improved performance of the modular network in Fig. 12.2, it may be necessary to adapt the variances of the multivariate Gaussian distribution. To do so, we redefine the covariance matrix Λ_i of the ith expert network's output vector y_i in the modular network of Fig. 12.2 as follows:
Λ_i = σ_i²I
where σ_i² is the variance to be adapted, and I is the identity matrix. Accordingly, we reformulate the objective function of Eq. (12.13) as
K