
4

The Perceptron

4.1 Introduction
The perceptron is the simplest form of a neural network used for the classification of a
special type of patterns said to be linearly separable (i.e., patterns that lie on opposite
sides of a hyperplane). Basically, it consists of a single neuron with adjustable synaptic
weights and threshold, as shown in Fig. 4.1. The algorithm used to adjust the free
parameters of this neural network first appeared in a learning procedure developed by
Rosenblatt (1958, 1962) for his perceptron brain model.¹ Indeed, Rosenblatt proved that
if the patterns (vectors) used to train the perceptron are drawn from two linearly separable
classes, then the perceptron algorithm converges and positions the decision surface in the
form of a hyperplane between the two classes. The proof of convergence of the algorithm
is known as the perceptron convergence theorem.
The single-layer perceptron depicted in Fig. 4.1 has a single neuron. Such a perceptron
is limited to performing pattern classification with only two classes. By expanding the
output (computation) layer of the perceptron to include more than one neuron, we may
correspondingly form classification with more than two classes. However, the classes
would have to be linearly separable for the perceptron to work properly. Accordingly,
insofar as the basic theory of a single-layer perceptron as a pattern classifier is concerned,
we need consider only the single-neuron configuration shown in Fig. 4.1. The extension
of the theory so developed to the case of more than one neuron is a trivial matter.

Organization of the Chapter


This chapter on the single-layer perceptron is relatively short, but it is important for
historical and theoretical reasons. In Section 4.2 we present some basic considerations
involved in the operation of the perceptron. Then, in Section 4.3 we describe Rosenblatt’s
original algorithm for adjusting the synaptic weight vector of the perceptron for pattern
classification of linearly separable classes, and demonstrate convergence of the algorithm.
This is followed by the description of a performance measure for the single-layer perceptron
in Section 4.4. In Section 4.5 we consider the relationship between the single-layer
perceptron and the maximum-likelihood Gaussian classifier. The chapter concludes with
general discussion in Section 4.6.

¹ The network organization of the original version of the perceptron, as envisioned by Rosenblatt (1962),
has three types of units: sensory units, association units, and response units. The connections from the sensory
units to the association units have fixed weights, and the connections from the association units to the response
units have variable weights. The association units act as preprocessors designed to extract a pattern from the
environmental input. Insofar as the variable weights are concerned, the operation of Rosenblatt’s original
perceptron is essentially the same as that depicted in Fig. 4.1 for the case of a single response unit.


FIGURE 4.1 Single-layer perceptron.

4.2 Basic Considerations


Basic to the operation of Rosenblatt's perceptron is the McCulloch-Pitts model of a
neuron. From Chapter 1, we recall that such a neuron model consists of a linear combiner
followed by a hard limiter, as depicted in Fig. 4.2. The summing node of the neuron
model computes a linear combination of the inputs applied to its synapses, and also
accounts for an externally applied threshold. The resulting sum is applied to a hard limiter.
Accordingly, the neuron produces an output equal to +1 if the hard limiter input is
positive, and -1 if it is negative.
In the signal-flow graph model of Fig. 4.2, the synaptic weights of the single-layer
perceptron are denoted by w_1, w_2, . . . , w_p. (We have simplified the notation here by not
including an additional subscript to identify the neuron, since we only have to deal with
a single neuron.) Correspondingly, the inputs applied to the perceptron are denoted by
x_1, x_2, . . . , x_p. The externally applied threshold is denoted by θ. From the model of Fig.
4.2 we thus find that the linear combiner output (i.e., hard limiter input) is

$$v = \sum_{i=1}^{p} w_i x_i - \theta \tag{4.1}$$

The purpose of the perceptron is to classify the set of externally applied stimuli x_1, x_2,
. . . , x_p into one of two classes, 𝒞_1 or 𝒞_2, say. The decision rule for the classification is
to assign the point represented by the inputs x_1, x_2, . . . , x_p to class 𝒞_1 if the perceptron
output y is +1 and to class 𝒞_2 if it is −1.
To develop insight into the behavior of a pattern classifier, it is customary to plot a
map of the decision regions in the p-dimensional signal space spanned by the p input
variables x_1, x_2, . . . , x_p. In the case of an elementary perceptron, there are two decision
regions separated by a hyperplane defined by

$$\sum_{i=1}^{p} w_i x_i - \theta = 0 \tag{4.2}$$

FIGURE 4.2 Signal-flow graph of the perceptron.



This is illustrated in Fig. 4.3 for the case of two input variables x_1 and x_2, for which the
decision boundary takes the form of a straight line. A point (x_1, x_2) that lies above the
boundary line is assigned to class 𝒞_1, and a point (x_1, x_2) that lies below the boundary
line is assigned to class 𝒞_2. Note also that the effect of the threshold θ is merely to shift
the decision boundary away from the origin.
The synaptic weights w_1, w_2, . . . , w_p of the perceptron can be fixed or adapted on an
iteration-by-iteration basis. For the adaptation, we may use an error-correction rule known
as the perceptron convergence algorithm, developed in the next section.
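
To make the decision rule concrete, here is a minimal Python/NumPy sketch of the hard-limiter classification built on Eq. (4.1); the function name, the tie-breaking convention at v = 0, and the numerical values are illustrative assumptions, not part of the text.

```python
import numpy as np

def perceptron_output(x, w, theta):
    """Hard-limiter output of the elementary perceptron.

    x     : (p,) input vector x_1, ..., x_p
    w     : (p,) synaptic weights w_1, ..., w_p
    theta : externally applied threshold
    """
    v = np.dot(w, x) - theta        # Eq. (4.1): linear combiner output
    return 1 if v >= 0 else -1      # hard limiting (v = 0 assigned to class C1 here)

# Example: points on either side of the line w1*x1 + w2*x2 - theta = 0
w = np.array([1.0, 1.0])
theta = 1.0
print(perceptron_output(np.array([2.0, 2.0]), w, theta))    # +1 -> class C1
print(perceptron_output(np.array([-1.0, -1.0]), w, theta))  # -1 -> class C2
```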

4.3 The Perceptron Convergence Theorem


For the development of the error-correction learning algorithm for a single-layer per-
ceptron, we find it more convenient to work with the modified signal-flow graph model
of Fig. 4.4. In this second model, which is equivalent to that of Fig. 4.2, the threshold
θ(n) is treated as a synaptic weight connected to a fixed input equal to −1. We may thus
define the (p + 1)-by-1 input vector

$$x(n) = [-1, x_1(n), x_2(n), \ldots, x_p(n)]^T \tag{4.3}$$

Correspondingly, we define the (p + 1)-by-1 weight vector

$$w(n) = [\theta(n), w_1(n), w_2(n), \ldots, w_p(n)]^T \tag{4.4}$$

Accordingly, the linear combiner output is written in the compact form

$$v(n) = w^T(n)x(n) \tag{4.5}$$

For fixed n, the equation w^T x = 0, plotted in a p-dimensional space with coordinates x_1,
x_2, . . . , x_p, defines a hyperplane as the decision surface between two different classes of
inputs.
Suppose then the input variables of the single-layer perceptron originate from two
linearly separable classes that fall on the opposite sides of some hyperplane. Let X_1 be
the subset of training vectors x_1(1), x_1(2), . . . that belong to class 𝒞_1, and let X_2 be the
FIGURE 4.3 Illustrating linear separability for a two-dimensional, two-class pattern-classification problem (decision boundary: w_1x_1 + w_2x_2 − θ = 0).

FIGURE 4.4 Equivalent signal-flow graph of the perceptron.

subset of training vectors x_2(1), x_2(2), . . . that belong to class 𝒞_2. The union of X_1 and
X_2 is the complete training set X. Given the sets of vectors X_1 and X_2 to train the classifier,
the training process involves the adjustment of the weight vector w in such a way that
the two classes 𝒞_1 and 𝒞_2 are separable. These two classes are said to be linearly separable
if a realizable setting of the weight vector w exists. Conversely, if the two classes 𝒞_1 and
𝒞_2 are known to be linearly separable, then there exists a weight vector w such that we
may state

$$\begin{aligned} w^T x &\ge 0 \quad \text{for every input vector } x \text{ belonging to class } \mathcal{C}_1 \\ w^T x &< 0 \quad \text{for every input vector } x \text{ belonging to class } \mathcal{C}_2 \end{aligned} \tag{4.6}$$

Given the subsets of training vectors X_1 and X_2, the training problem for the elementary
perceptron is then to find a weight vector w such that the two inequalities of Eq. (4.6)
are satisfied.
The algorithm for adapting the weight vector of the elementary perceptron may now
be formulated as follows:
1. If the nth member of the training set, x(n), is correctly classified by the weight
vector w(n) computed at the nth iteration of the algorithm, no correction is made
to the weight vector of the perceptron, as shown by

$$\begin{aligned} w(n+1) &= w(n) \quad \text{if } w^T(n)x(n) \ge 0 \text{ and } x(n) \text{ belongs to class } \mathcal{C}_1 \\ w(n+1) &= w(n) \quad \text{if } w^T(n)x(n) < 0 \text{ and } x(n) \text{ belongs to class } \mathcal{C}_2 \end{aligned} \tag{4.7}$$

2. Otherwise, the weight vector of the perceptron is updated in accordance with the
rule

$$\begin{aligned} w(n+1) &= w(n) - \eta(n)x(n) \quad \text{if } w^T(n)x(n) \ge 0 \text{ and } x(n) \text{ belongs to class } \mathcal{C}_2 \\ w(n+1) &= w(n) + \eta(n)x(n) \quad \text{if } w^T(n)x(n) < 0 \text{ and } x(n) \text{ belongs to class } \mathcal{C}_1 \end{aligned} \tag{4.8}$$

where the learning-rate parameter η(n) controls the adjustment applied to the weight
vector at iteration n.
If η(n) = η > 0, where η is a constant independent of the iteration number n, we have
a fixed increment adaptation rule for the perceptron.
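
As an illustration, the following Python/NumPy fragment applies the error-correction rule of Eqs. (4.7) and (4.8) to a single augmented pattern. It is a sketch only; the function name and the convention of returning the updated weight vector are my own.

```python
import numpy as np

def fixed_increment_update(w, x, target_class, eta=1.0):
    """One step of the fixed-increment rule, Eqs. (4.7)-(4.8).

    w            : (p+1,) weight vector [theta, w_1, ..., w_p]
    x            : (p+1,) augmented input [-1, x_1, ..., x_p]
    target_class : 1 if x belongs to class C1, 2 if it belongs to class C2
    eta          : learning-rate parameter eta(n) > 0
    """
    v = np.dot(w, x)                   # w^T(n) x(n)
    if target_class == 1 and v < 0:    # misclassified member of C1
        return w + eta * x             # second line of Eq. (4.8)
    if target_class == 2 and v >= 0:   # misclassified member of C2
        return w - eta * x             # first line of Eq. (4.8)
    return w                           # correctly classified: Eq. (4.7)
```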
In the sequel, we first prove the convergence of a fixed increment adaptation rule for
which η = 1. Clearly, the value of η is unimportant, so long as it is positive. A value

of η ≠ 1 merely scales the pattern vectors without affecting their separability. The case
of a variable η(n) is considered later in the section.
The proof is presented for the initial condition w(0) = 0. Suppose that w^T(n)x(n) < 0
for n = 1, 2, . . . and an input vector x(n) that belongs to the subset X_1. That is, the
perceptron incorrectly classifies the vectors x(1), x(2), . . . , since the first condition of
Eq. (4.6) is violated. Then, with the constant η(n) = 1, we may use the second line of
Eq. (4.8) to write

$$w(n+1) = w(n) + x(n) \quad \text{for } x(n) \text{ belonging to class } \mathcal{C}_1 \tag{4.9}$$

Given the initial condition w(0) = 0, we may iteratively solve this equation for w(n + 1),
obtaining the result

$$w(n+1) = x(1) + x(2) + \cdots + x(n) \tag{4.10}$$

Since the classes 𝒞_1 and 𝒞_2 are assumed to be linearly separable, there exists a solution
w_o for which w_o^T x(n) > 0 for the vectors x(1), . . . , x(n) belonging to the subset X_1. For
a fixed solution w_o, we may then define a positive number α by the relation

$$\alpha = \min_{x(n) \in X_1} w_o^T x(n) \tag{4.11}$$

where x(n) ∈ X_1 stands for "x(n) belongs to subset X_1." Hence, multiplying both sides
of Eq. (4.10) by the row vector w_o^T, we get

$$w_o^T w(n+1) = w_o^T x(1) + w_o^T x(2) + \cdots + w_o^T x(n) \tag{4.12}$$

Accordingly, in light of the definition given in Eq. (4.11), we have

$$w_o^T w(n+1) \ge n\alpha \tag{4.13}$$
Next, we make use of an inequality known as the Cauchy-Schwarz inequality. Given
two vectors w_o and w(n + 1), the Cauchy-Schwarz inequality states that

$$\|w_o\|^2 \, \|w(n+1)\|^2 \ge \left[w_o^T w(n+1)\right]^2 \tag{4.14}$$

where ‖·‖ denotes the Euclidean norm of the enclosed argument vector, and the inner
product w_o^T w(n + 1) is a scalar quantity. We now note from Eq. (4.13) that [w_o^T w(n + 1)]² is
equal to or greater than n²α². Moreover, from Eq. (4.14) we note that ‖w_o‖² ‖w(n + 1)‖²
is equal to or greater than [w_o^T w(n + 1)]². It follows therefore that

$$\|w_o\|^2 \, \|w(n+1)\|^2 \ge n^2\alpha^2$$

or, equivalently,

$$\|w(n+1)\|^2 \ge \frac{n^2\alpha^2}{\|w_o\|^2} \tag{4.15}$$

Next, we follow another development route. In particular, we rewrite Eq. (4.9) in the
form

$$w(k+1) = w(k) + x(k) \quad \text{for } k = 1, \ldots, n \text{ and } x(k) \in X_1 \tag{4.16}$$

Hence, taking the squared Euclidean norm of both sides of Eq. (4.16), we get

$$\|w(k+1)\|^2 = \|w(k)\|^2 + \|x(k)\|^2 + 2w^T(k)x(k) \tag{4.17}$$

But, under the assumption that the perceptron incorrectly classifies an input vector x(k)
belonging to the subset X_1, we have w^T(k)x(k) < 0. We therefore deduce from Eq. (4.17)
that

$$\|w(k+1)\|^2 \le \|w(k)\|^2 + \|x(k)\|^2$$

or, equivalently,

$$\|w(k+1)\|^2 - \|w(k)\|^2 \le \|x(k)\|^2, \qquad k = 1, \ldots, n \tag{4.18}$$

Adding these inequalities for k = 1, . . . , n, and assuming that the initial condition
w(0) = 0, we get the following condition:

$$\|w(n+1)\|^2 \le \sum_{k=1}^{n} \|x(k)\|^2 \le n\beta \tag{4.19}$$

where β is a positive number defined by

$$\beta = \max_{x(k) \in X_1} \|x(k)\|^2 \tag{4.20}$$

Equation (4.19) states that the squared Euclidean norm of the weight vector w(n + 1)
grows at most linearly with the number of iterations n.
Clearly, the second result of Eq. (4.19) is in conflict with the earlier result of Eq. (4.15)
for sufficiently large values of n. Indeed, we can state that n cannot be larger than some
value n_max for which Eqs. (4.15) and (4.19) are both satisfied with the equality sign. That
is, n_max is the solution of the equation

$$\frac{n_{\max}^2 \alpha^2}{\|w_o\|^2} = n_{\max}\beta$$

Solving for n_max for a solution vector w_o, we find that

$$n_{\max} = \frac{\beta \|w_o\|^2}{\alpha^2} \tag{4.21}$$

We have thus proved that for η(n) = 1 for all n, and w(0) = 0, and given that a solution
vector w_o exists, the rule for adapting the synaptic weights connecting the association units
to the response unit of the perceptron must terminate after at most n_max iterations. Note
also from Eqs. (4.11), (4.20), and (4.21) that there is no unique solution for w_o or n_max.
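
As a hedged numerical illustration, the bound of Eq. (4.21) can be evaluated directly from the definitions of α in Eq. (4.11) and β in Eq. (4.20). The toy training vectors and the particular solution vector w_o below are my own choices, not from the text, and follow the setting of the proof in which only patterns from the subset X_1 are considered.

```python
import numpy as np

# Toy augmented training vectors from the subset X1 (first element is the fixed input -1).
X1 = np.array([[-1.0, 2.0, 1.0],
               [-1.0, 1.5, 2.0],
               [-1.0, 3.0, 0.5]])

w_o = np.array([0.5, 1.0, 1.0])          # an assumed separating solution vector
assert np.all(X1 @ w_o > 0)              # w_o^T x(n) > 0 for all x(n) in X1

alpha = np.min(X1 @ w_o)                 # Eq. (4.11): alpha = min w_o^T x(n)
beta  = np.max(np.sum(X1**2, axis=1))    # Eq. (4.20): beta  = max ||x(k)||^2
n_max = beta * np.dot(w_o, w_o) / alpha**2   # Eq. (4.21)

print(alpha, beta, n_max)                # upper bound on the number of corrections
```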
We may now state the fixed-increment convergence theorem for a single-layer per-
ceptron as follows (Rosenblatt, 1962):

Let the subsets of training vectors X_1 and X_2 be linearly separable. Let the inputs
presented to the single-layer perceptron originate from these two subsets. The
perceptron converges after some n_0 iterations, in the sense that

w(n_0) = w(n_0 + 1) = w(n_0 + 2) = · · ·

is a solution vector for n_0 ≤ n_max.
Consider next the absolute error-correction procedure for the adaptation of a single-
layer perceptron, for which η(n) is variable. In particular, let η(n) be the smallest integer
for which

$$\eta(n)\, x^T(n)x(n) > |w^T(n)x(n)|$$

With this procedure we find that if the inner product w^T(n)x(n) at iteration n has an
incorrect sign, then w^T(n + 1)x(n) at iteration n + 1 would have the correct sign. This
suggests that if w^T(n)x(n) has an incorrect sign, we may modify the training sequence at
iteration n + 1 by setting x(n + 1) = x(n). In other words, each pattern is presented
repeatedly to the perceptron until that pattern is classified correctly.
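
Under the condition stated above, the smallest admissible integer η(n) can be computed directly; the helper below is my own sketch, not part of the text.

```python
import numpy as np

def absolute_correction_eta(w, x):
    """Smallest integer eta such that eta * x^T x > |w^T x|.

    With this eta, a single correction w +/- eta*x flips the sign of w^T x,
    so the offending pattern is classified correctly at the next presentation.
    """
    return int(np.floor(abs(np.dot(w, x)) / np.dot(x, x))) + 1
```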
Note also that the use of an initial value w(0) different from the null condition merely
results in a decrease or increase in the number of iterations required to converge, depending

on how w(0) relates to the solution w_o. Hence, regardless of the value assigned to w(0),
the single-layer perceptron of Fig. 4.1 is assured of convergence.

Summary
In Table 4.1 we present a summary of the perceptron convergence algorithm (Lippmann,
1987). The symbol sgn(·), used in step 3 of the table for computing the actual response
of the perceptron, stands for the signum function:

$$\operatorname{sgn}(v) = \begin{cases} +1 & \text{if } v > 0 \\ -1 & \text{if } v < 0 \end{cases} \tag{4.22}$$

TABLE 4.1 Summary of the Perceptron Convergence Algorithm

Variables and Parameters
x(n) = (p + 1)-by-1 input vector
     = [−1, x_1(n), x_2(n), . . . , x_p(n)]^T
w(n) = (p + 1)-by-1 weight vector
     = [θ(n), w_1(n), w_2(n), . . . , w_p(n)]^T
θ(n) = threshold
y(n) = actual response (quantized)
d(n) = desired response
η    = learning-rate parameter, a positive constant less than unity

Step 1: Initialization
Set w(0) = 0. Then perform the following computations for time n = 1, 2, . . . .

Step 2: Activation
At time n, activate the perceptron by applying the continuous-valued input vector x(n) and desired
response d(n).

Step 3: Computation of Actual Response
Compute the actual response of the perceptron:
    y(n) = sgn[w^T(n)x(n)]
where sgn(·) is the signum function.

Step 4: Adaptation of Weight Vector
Update the weight vector of the perceptron:
    w(n + 1) = w(n) + η[d(n) − y(n)]x(n)
where
    d(n) = +1 if x(n) belongs to class 𝒞_1
           −1 if x(n) belongs to class 𝒞_2

Step 5
Increment time n by one unit, and go back to step 2.
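
The steps of Table 4.1 translate almost line for line into code. The following Python/NumPy sketch is my own illustrative implementation; the function name, the epoch-based stopping rule, and the synthetic data in the usage example are assumptions, not part of the text.

```python
import numpy as np

def train_perceptron(X, d, eta=0.5, max_epochs=100):
    """Perceptron convergence algorithm, following Table 4.1.

    X   : (N, p) array of continuous-valued input vectors
    d   : (N,) array of desired responses, +1 for class C1 and -1 for class C2
    eta : learning-rate parameter, 0 < eta <= 1
    Returns the (p+1,)-dimensional weight vector [theta, w_1, ..., w_p].
    """
    N, p = X.shape
    Xa = np.hstack([-np.ones((N, 1)), X])    # augmented inputs, first element -1
    w = np.zeros(p + 1)                      # Step 1: initialization, w(0) = 0
    for _ in range(max_epochs):
        errors = 0
        for x_n, d_n in zip(Xa, d):          # Step 2: activation
            y_n = 1.0 if np.dot(w, x_n) > 0 else -1.0  # Step 3: y(n) = sgn(w^T x)
            w = w + eta * (d_n - y_n) * x_n            # Step 4: error correction
            errors += int(y_n != d_n)
        if errors == 0:                      # every pattern correctly classified
            break
    return w

# Usage example with two linearly separable clusters (synthetic data)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+2.0, 0.5, (20, 2)),   # class C1
               rng.normal(-2.0, 0.5, (20, 2))])  # class C2
d = np.hstack([np.ones(20), -np.ones(20)])
print(train_perceptron(X, d))                    # [theta, w_1, w_2]
```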

In effect, the signum function is a shorthand notation for the asymmetric form of hard
limiting. Accordingly, the quantized response y(n) of the perceptron is written in the
compact form

$$y(n) = \operatorname{sgn}[w^T(n)x(n)] \tag{4.23}$$

Note that the input vector x(n) is a (p + 1)-by-1 vector whose first element is fixed at
−1 throughout the computation in accordance with the definition of Eq. (4.3). Correspond-
ingly, the weight vector w(n) is a (p + 1)-by-1 vector whose first element equals the
threshold θ(n), as defined in Eq. (4.4). In Table 4.1 we have also introduced a quantized
desired response d(n), defined by

$$d(n) = \begin{cases} +1 & \text{if } x(n) \text{ belongs to class } \mathcal{C}_1 \\ -1 & \text{if } x(n) \text{ belongs to class } \mathcal{C}_2 \end{cases} \tag{4.24}$$

Thus, the adaptation of the weight vector w(n) is summed up nicely in the form of an
error-correction learning rule, as shown by

$$w(n+1) = w(n) + \eta[d(n) - y(n)]x(n) \tag{4.25}$$

where η is the learning-rate parameter, and the difference d(n) − y(n) plays the role of
an error signal. The learning-rate parameter is a positive constant limited to the range
0 < η ≤ 1. In assigning a value to it inside this range, we have to keep in mind two
conflicting requirements (Lippmann, 1987):
■ Averaging of past inputs to provide stable weight estimates, which requires a
small η
■ Fast adaptation with respect to real changes in the underlying distributions of the
process responsible for the generation of the input vector x, which requires a
large η

4.4 Performance Measure


In the previous section we derived the perceptron convergence algorithm without any
reference to a performance measure, yet in Chapter 2 we emphasized that improvement
in the performance of a learning machine takes place over time in accordance with a
performance measure. What then is a suitable performance measure for the single-layer
perceptron? An obvious choice would be the average probability of classification error,
defined as the average probability of the perceptron making a decision in favor of a
particular class when the input vector is drawn from the other class. Unfortunately, such
a performance measure does not lend itself readily to the analytic derivation of a learning
algorithm. Shynk (1990) has proposed the use of a performance function that befits the
operation of a perceptron, as shown by the statistical expectation

$$J = -E[e(n)v(n)] \tag{4.26}$$

where e(n) is the error signal defined as the difference between the desired response d(n)
and the actual response y(n) of the perceptron, and v(n) is the linear combiner output
(i.e., hard-limiter input). Note that the error signal e(n) is not a quantized signal, but
rather the difference between two quantized signals. The instantaneous estimate of the
performance function is the time-varying function

$$\hat{J}(n) = -e(n)v(n) = -[d(n) - y(n)]v(n) \tag{4.27}$$

The instantaneous gradient vector is defined as the derivative of the estimate Ĵ(n) with
respect to the weight vector w(n), as shown by

$$\nabla_w \hat{J}(n) = \frac{\partial \hat{J}(n)}{\partial w(n)} \tag{4.28}$$

Hence, recognizing that the desired response d(n) is independent of w(n) and the fact that
the actual response y(n) has a constant value equal to −1 or +1, we find that the use of
Eq. (4.27) in (4.28) yields²

$$\nabla_w \hat{J}(n) = -[d(n) - y(n)]\frac{\partial v(n)}{\partial w(n)} \tag{4.29}$$

From the defining equation (4.5), we readily find that³

$$\frac{\partial v(n)}{\partial w(n)} = x(n) \tag{4.30}$$

Thus the use of Eq. (4.30) in (4.29) yields the result

$$\nabla_w \hat{J}(n) = -[d(n) - y(n)]x(n) \tag{4.31}$$

Hence, according to the error-correction rule described in Chapter 2, we may now express
the change applied to the weight vector as

$$\Delta w(n) = -\eta \nabla_w \hat{J}(n) = \eta[d(n) - y(n)]x(n) \tag{4.32}$$

² To be precise, we should also include a Dirac delta function (i.e., impulse function) whenever the perceptron
output y(n) goes through a transition at v(n) = 0. Averaged over a long sequence of iterations, however, such
a transition has a negligible effect.

³ Consider a scalar quantity v defined as the inner product of two real-valued p-by-1 vectors w and x, as
shown by

$$v = w^T x = \sum_{i=1}^{p} w_i x_i$$

where w_i and x_i are the ith elements of the vectors w and x, respectively. The partial derivative of the quantity
v with respect to the vector w is defined by (Haykin, 1991)

$$\frac{\partial v}{\partial w} = \left[\frac{\partial v}{\partial w_1}, \frac{\partial v}{\partial w_2}, \ldots, \frac{\partial v}{\partial w_p}\right]^T$$

From the given definition of v, we note that

$$\frac{\partial v}{\partial w_i} = x_i, \qquad i = 1, 2, \ldots, p$$

We therefore have

$$\frac{\partial v}{\partial w} = x$$

where η is the learning-rate parameter. This is precisely the correction made to the weight
vector as we progress from iteration n to n + 1, as described in Eq. (4.25). We have thus
demonstrated that the instantaneous performance function Ĵ(n) defined in Eq. (4.27) is
indeed the correct performance measure for a single-layer perceptron (Shynk, 1990).
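
The equivalence between the gradient step of Eq. (4.32) and the error-correction rule of Eq. (4.25) can also be checked numerically; the short sketch below is my own illustration, and the variable values are assumed.

```python
import numpy as np

def sgn(v):
    return 1.0 if v > 0 else -1.0

def weight_change(w, x, d, eta):
    """Delta w(n) = -eta * grad J(n) = eta * [d(n) - y(n)] * x(n), Eq. (4.32)."""
    y = sgn(np.dot(w, x))          # actual response, Eq. (4.23)
    grad = -(d - y) * x            # instantaneous gradient, Eq. (4.31)
    return -eta * grad             # identical to eta*(d - y)*x, i.e., Eq. (4.25)

w = np.array([0.2, -0.5, 1.0])     # [theta, w_1, w_2], illustrative values
x = np.array([-1.0, 0.8, -0.3])    # augmented input
delta = weight_change(w, x, d=+1.0, eta=0.5)
assert np.allclose(delta, 0.5 * (1.0 - sgn(np.dot(w, x))) * x)
print(delta)
```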

4.5 Maximum-Likelihood Gaussian Classifier


A single-layer perceptron bears a close relationship to a classical pattern classifier known
as the maximum-likelihood Gaussian classifier, in that they are both examples of linear
classifiers (Duda and Hart, 1973).
The maximum-likelihood method is a classical parameter-estimation procedure that
views the parameters as quantities whose values are fixed but unknown. The best estimate
is defined to be the one that maximizes the probability of obtaining the samples actually
observed. To be specific, suppose that we separate a set of samples according to class,
so that we have M subsets of samples denoted by X_1, X_2, . . . , X_M. Let the samples in
subset X_j be drawn independently according to a probability law of known parametric
form. Let this probability law be described by the conditional probability density function
f(x|w_j), where x is an observation vector and w_j represents an unknown parameter vector
associated with class 𝒞_j, j = 1, 2, . . . , M. We further assume that the samples in X_i give
no information about w_j if i ≠ j; that is, the parameters for the different classes are
functionally independent. Now, viewed as a function of w, f(x|w) is called the likelihood
of w with respect to the observation vector x. The maximum-likelihood estimate of w is,
by definition, that value ŵ which maximizes the likelihood f(x|w).
To proceed with the issue at hand, consider a p-by-1 observation vector x defined by

$$x = [x_1, x_2, \ldots, x_p]^T \tag{4.33}$$

Note that the vector x as defined here is confined exclusively to the externally applied
inputs. The observation vector x is described in terms of a mean vector μ and covariance
matrix C, which are defined by, respectively,

$$\mu = E[x] \tag{4.34}$$

and

$$C = E[(x - \mu)(x - \mu)^T] \tag{4.35}$$

where E is the expectation operator. Assuming that the vector x is Gaussian-distributed,
we may express the joint probability density function of the elements of x as follows
(Van Trees, 1968):

$$f(x) = \frac{1}{(2\pi)^{p/2}(\det C)^{1/2}} \exp\left[-\frac{1}{2}(x - \mu)^T C^{-1}(x - \mu)\right] \tag{4.36}$$

where det C is the determinant of the covariance matrix C, the matrix C⁻¹ is the inverse
of C, and p is the dimension of the vector x.
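
For reference, Eq. (4.36) can be evaluated directly; the helper below is a minimal NumPy sketch (the function name is mine, and no numerical safeguards are included).

```python
import numpy as np

def gaussian_density(x, mu, C):
    """Multivariate Gaussian density of Eq. (4.36)."""
    p = len(x)
    diff = x - mu
    quad = diff @ np.linalg.solve(C, diff)      # (x - mu)^T C^{-1} (x - mu)
    norm = (2 * np.pi) ** (p / 2) * np.sqrt(np.linalg.det(C))
    return np.exp(-0.5 * quad) / norm
```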
To be more specific, suppose that we have a two-class problem, with the vector x
characterized as follows, depending on whether it belongs to class 𝒞_1 or class 𝒞_2:

x ∈ 𝒞_1:  mean vector = μ_1,  covariance matrix = C
x ∈ 𝒞_2:  mean vector = μ_2,  covariance matrix = C

In other words, the vector x has a mean vector whose value depends on whether
it belongs to class 𝒞_1 or class 𝒞_2, and a covariance matrix that is the same for both classes.
It is further assumed that:
■ The classes 𝒞_1 and 𝒞_2 are equiprobable.
■ The samples of class 𝒞_1 or 𝒞_2 are correlated, so that the covariance matrix C is
nondiagonal.
■ The covariance matrix C is nonsingular, so that the inverse matrix C⁻¹ exists.
We may then use Eq. (4.36) to express the joint probability density function of the
input vector x as follows:

$$f(x \mid \mathcal{C}_i) = \frac{1}{(2\pi)^{p/2}(\det C)^{1/2}} \exp\left[-\frac{1}{2}(x - \mu_i)^T C^{-1}(x - \mu_i)\right] \tag{4.37}$$

where the index i = 1, 2. Taking the natural logarithm of both sides of Eq. (4.37) and
then expanding terms, we get

$$\ln f(x \mid \mathcal{C}_i) = -\frac{p}{2}\ln(2\pi) - \frac{1}{2}\ln(\det C) - \frac{1}{2}x^T C^{-1}x + \mu_i^T C^{-1}x - \frac{1}{2}\mu_i^T C^{-1}\mu_i \tag{4.38}$$

The first three terms on the right-hand side of Eq. (4.38) are independent of the index i.
Hence, for the purpose of pattern classification considered here, we may define a log-
likelihood as follows:

$$l_i(x) = \mu_i^T C^{-1}x - \frac{1}{2}\mu_i^T C^{-1}\mu_i \tag{4.39}$$

where i = 1, 2. For class 𝒞_1, the log-likelihood is

$$l_1(x) = \mu_1^T C^{-1}x - \frac{1}{2}\mu_1^T C^{-1}\mu_1 \tag{4.40}$$

For class 𝒞_2, the log-likelihood is

$$l_2(x) = \mu_2^T C^{-1}x - \frac{1}{2}\mu_2^T C^{-1}\mu_2 \tag{4.41}$$

Hence, subtracting Eq. (4.41) from (4.40), we get

$$l = l_1(x) - l_2(x) = (\mu_1 - \mu_2)^T C^{-1}x - \frac{1}{2}\left(\mu_1^T C^{-1}\mu_1 - \mu_2^T C^{-1}\mu_2\right) \tag{4.42}$$

which is linearly related to the input vector x. To emphasize this property of the composite
log-likelihood l, we rewrite Eq. (4.42) in the equivalent form

$$l = \hat{w}^T x - \theta = \sum_{i=1}^{p} \hat{w}_i x_i - \theta \tag{4.43}$$

where ŵ is the maximum-likelihood estimate of the unknown parameter vector w, defined
by

$$\hat{w} = C^{-1}(\mu_1 - \mu_2) \tag{4.44}$$

and θ is a constant threshold, defined by

$$\theta = \frac{1}{2}\left(\mu_1^T C^{-1}\mu_1 - \mu_2^T C^{-1}\mu_2\right) \tag{4.45}$$

The maximum-likelihood Gaussian classifier, or Gaussian classifier for short, may thus
be implemented by a linear combiner, as shown in the signal-flow graph of Fig. 4.5.
Equations (4.44) and (4.45) simplify considerably for the following special case. All
the samples of the vector x are uncorrelated with each other, and also have a common
variance σ², regardless of the associated class; that is, the covariance matrix C is a diagonal
matrix defined by

$$C = \sigma^2 I \tag{4.46}$$

where I is the p-by-p identity matrix. We now have

$$C^{-1} = \frac{1}{\sigma^2}I \tag{4.47}$$

Accordingly, Eqs. (4.44) and (4.45) simplify as, respectively,

$$\hat{w} = \frac{1}{\sigma^2}(\mu_1 - \mu_2) \tag{4.48}$$

and

$$\theta = \frac{1}{2\sigma^2}\left(\|\mu_1\|^2 - \|\mu_2\|^2\right) \tag{4.49}$$

where ‖·‖ denotes the Euclidean norm of the enclosed vector.


In any case, treating l as the overall classifier output, the Gaussian classifier, character-
ized by the parameter vector ŵ and threshold θ, solves the pattern-classification problem
in accordance with the following rule:

$$\begin{aligned} &\text{If } l \ge 0, \text{ then } l_1 \ge l_2, \text{ and therefore assign } x \text{ to class } \mathcal{C}_1 \\ &\text{If } l < 0, \text{ then } l_2 > l_1, \text{ and therefore assign } x \text{ to class } \mathcal{C}_2 \end{aligned} \tag{4.50}$$
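
Putting Eqs. (4.44), (4.45), and (4.50) together, a compact NumPy sketch of the Gaussian classifier follows. The function names are illustrative, and in practice μ_1, μ_2, and C would themselves have to be estimated from training samples, an aspect not elaborated in the text.

```python
import numpy as np

def gaussian_classifier_params(mu1, mu2, C):
    """Weight vector and threshold of the ML Gaussian classifier, Eqs. (4.44)-(4.45)."""
    C_inv = np.linalg.inv(C)
    w_hat = C_inv @ (mu1 - mu2)                              # Eq. (4.44)
    theta = 0.5 * (mu1 @ C_inv @ mu1 - mu2 @ C_inv @ mu2)    # Eq. (4.45)
    return w_hat, theta

def classify(x, w_hat, theta):
    """Decision rule of Eq. (4.50): return 1 for class C1, 2 for class C2."""
    l = w_hat @ x - theta                                    # Eq. (4.43)
    return 1 if l >= 0 else 2

# Usage example with assumed class statistics
mu1 = np.array([1.0, 1.0])
mu2 = np.array([-1.0, -1.0])
C = np.array([[1.0, 0.3],
              [0.3, 1.0]])
w_hat, theta = gaussian_classifier_params(mu1, mu2, C)
print(classify(np.array([0.8, 0.5]), w_hat, theta))          # -> 1 (class C1)
```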
The operation of the Gaussian classifier is analogous to that of the single-layer per-
ceptron in that they are both linear classifiers; see Eqs. (4.2) and (4.43). There are, however,
some subtle and important differences between them, which should be carefully noted as
discussed here:
■ The single-layer perceptron operates on the premise that the patterns to be classified
are linearly separable. The Gaussian distributions of the two patterns assumed in the
derivation of the maximum-likelihood Gaussian classifier certainly do overlap each
other and are therefore not exactly separable; the extent of the overlap is determined
by the mean vectors μ_1 and μ_2, and the covariance matrices C_1 and C_2. The nature
of this overlap is illustrated in Fig. 4.6 for the special case of a scalar input x (i.e.,
input dimension p = 1). When the inputs are nonseparable and their distributions
overlap as described here, the perceptron convergence algorithm develops a problem

FIGURE 4.5 Signal-flow graph of the Gaussian classifier.



FIGURE 4.6 Two overlapping, one-dimensional Gaussian distributions.

in that the decision boundaries between the different classes may oscillate continu-
ously (Lippmann, 1987).
■ The maximum-likelihood Gaussian classifier minimizes the average probability of
classification error. This minimization is independent of the overlap between the
underlying Gaussian distributions of the two classes. For example, in the special
case illustrated in Fig. 4.6, the maximum-likelihood Gaussian classifier always posi-
tions the decision boundary at the point where the two Gaussian distributions cross
each other. This positioning of the decision boundary is perfectly in accord with the
optimum Bayes's hypothesis-testing procedure on the assumption that the two classes
are equiprobable (i.e., there is no prior information to be incorporated into the
decision making).
■ The perceptron convergence algorithm is nonparametric in the sense that it makes
no assumptions concerning the form of the underlying distributions; it operates by
concentrating on errors that occur where the distributions overlap. It may thus be
more robust than classical techniques, and work well when the inputs are generated
by nonlinear physical mechanisms whose distributions are heavily skewed
and non-Gaussian (Lippmann, 1987). In contrast, the maximum-likelihood Gaussian
classifier is parametric; its derivation is contingent on the assumption that the
underlying distributions are Gaussian, which may limit its area of application.
■ The perceptron convergence algorithm is both adaptive and simple to implement;
its storage requirement is confined to the set of synaptic weights and threshold. On
the other hand, the design of the maximum-likelihood Gaussian classifier is fixed;
it can be made adaptive, but at the expense of increased storage requirements and
more complex computations (Lippmann, 1987).

4.6 Discussion
The study of the single-layer perceptron presented in this chapter has been formulated in
the context of the McCulloch-Pitts formal model of a neuron. The nonlinear element of
this model is represented by a hard limiter at its output end. It would be tempting to think
that we can do better if we were to use a sigmoidal nonlinear element in place of the
hard limiter. Well, it turns out that the steady-state, decision-making characteristics of a
single-layer perceptron are basically the same, regardless of whether we use hard-limiting
or soft-limiting as the source of nonlinearity in the neural model (Shynk and Bershad,
4.6 I Discussion 119

1991, 1992; Shynk, 1990). We may therefore state formally that so long as we limit
ourselves to the model of a neuron that consists of a linear combiner followed by a nonlinear
element, then regardless of the form of nonlinearity used, a single-layer perceptron can
perform pattern classification only on linearly separable patterns.
Linear separability requires that the patterns to be classified must be sufficiently sepa-
rated from each other to ensure that the decision surfaces consist of hyperplanes. This
requirement is illustrated in Fig. 4.7a for the case of a two-dimensional, single-layer
perceptron. If now the two patterns 𝒞_1 and 𝒞_2 are allowed to move too close to each
other, as in Fig. 4.7b, they become nonlinearly separable and the elementary perceptron
fails to classify them.
The first real critique of Rosenblatt’s perceptron was presented by Minsky and Selfridge
(1961). Minsky and Selfridge pointed out that the perceptron as defined by Rosenblatt
could not even generalize toward the notion of binary parity, let alone make general
abstractions. They also suggested the roles that connectionist networks might play in the
implementation of larger systems.
The computational limitations of Rosenblatt's perceptron were subsequently put on a
solid mathematical foundation in the famous book by Minsky and Papert (1969, 1988).
This is a thorough and well-written book. After the presentation of some brilliant and
highly detailed mathematical analysis of the perceptron, Minsky and Papert proved that
the perceptron as defined by Rosenblatt is inherently incapable of making some global
generalizations on the basis of locally learned examples. In the last chapter of their book,
Minsky and Papert go on to make the conjecture that the limitations of the kind they
had discovered for Rosenblatt’s perceptron would also hold true for its variants, more
specifically, multilayer neural networks. Quoting from Section 13.2 of the book by Minsky
and Papert (1969):

FIGURE 4.7 (a) A pair of linearly separable patterns. (b) A pair of nonlinearly separable
patterns.

The perceptron has shown itself worthy of study despite (and even because of!) its severe
limitations. It has many features to attract attention: its linearity; its intriguing learning
theorem; its clear paradigmatic simplicity as a kind of parallel computation. There is no reason
to suppose that any of these virtues carry over to the many-layered version. Nevertheless, we
consider it to be an important research problem to elucidate (or reject) our intuitive judgement
that the extension to multilayer systems is sterile.

This conjecture cast serious doubt on the computational capabilities of not only the
perceptron but neural networks in general up to the mid-1980s.
History has shown, however, that the conjecture made by Minsky and Papert seems
to be unjustified in that we now have several advanced forms of neural networks that are
computationally more powerful than Rosenblatt’s perceptron. For example, multilayer
perceptrons trained with the back-propagation algorithm discussed in Chapter 6 and
the radial basis-function networks discussed in Chapter 7 overcome the computational
limitations of the single-layer perceptron.

PROBLEMS

4.1 The single-layer perceptron is a linear pattern classifier. Justify this statement.
4.2 Consider two one-dimensional, Gaussian-distributed classes 𝒞_1 and 𝒞_2 that have a
common variance equal to 1. Their mean values are as follows:
μ_1 = −10
μ_2 = +10

These two classes are essentially linearly separable. Design a classifier that separates
these two classes.
4.3 Verify that Eqs. (4.22)-(4.25), summarizing the perceptron convergence algorithm,
are consistent with Eqs. (4.7) and (4.8).
4.4 Equations (4.44) and (4.45) define the weight vector and threshold of a maximum-
likelihood Gaussian classifier, assuming that the samples of classes 𝒞_1 or 𝒞_2 are
correlated. Suppose now that these samples are uncorrelated but have different vari-
ances, so that for both classes we may write
C = diag[σ_1², σ_2², . . . , σ_p²]
Find the weight vector and threshold of the classifier so defined.
4.5 In Eq. (4.27) we defined one form of an instantaneous performance function for a
single-layer perceptron. Another way of expressing this performance measure is as
follows (Shynk, 1990):
Ĵ(n) = |v(n)| − d(n)v(n)
where v(n) is the hard-limiter input and d(n) is the desired response. Using this
performance measure, find the instantaneous gradient vector ∇_w Ĵ(n), and show that
the expression so derived is identical with that in Eq. (4.31).
4.6 The perceptron may be used to perform numerous logic functions. Demonstrate
the implementation of the binary logic functions AND, OR, and COMPLEMENT.
However, a basic limitation of the perceptron is that it cannot implement the
EXCLUSIVE OR function. Explain the reason for this limitation.
