Capacity of a Subsampled Quadratic Classifier
Panayiota Poirazi
Bartlett W. Mel
Department of Biomedical Engineering, University of Southern California, Los Angeles,
CA 90089, U.S.A.
1 Introduction
[Figure 1: schematic of synaptic input configurations A+B and A+C and their integration at the soma.]
Segev, 1987; Shepherd & Brayton, 1987; Borg-Graham & Grzywacz, 1992;
Mel, 1992a, 1992b, 1993, 1994, 1999; Zador, Claiborne, & Brown, 1992; Yuste
& Tank, 1996; Segev & Rall, 1998; Mel, Ruderman, & Archie, 1998a, 1998b),
results of existing experimental and modeling studies strongly suggest that
electrical compartmentalization provided by the dendritic branching struc-
ture, coupled with local thresholding provided by active dendritic channels,
together are likely to have a profound impact on the way in which synap-
tic inputs to a dendritic tree are integrated to produce a cell’s output (see
Figure 1).
A key challenge in deciphering the functions of individual neurons in-
volves properly characterizing the form of nonlinear interactions between
synapses within a dendritic compartment, and to this end, a variety of bio-
physical modeling studies have identified possible functional roles for such
interactions (Koch et al., 1982; Rall & Segev, 1987; Shepherd & Brayton, 1987;
Borg-Graham & Grzywacz, 1992; Koch & Poggio, 1992; Zador et al., 1992;
Mel, 1992a, 1992b, 1993). In this same vein, two recent biophysical model-
ing studies (Mel et al., 1998a, 1998b) have shown that the time-averaged
response of a cortical pyramidal cell with weakly excitable dendrites can
mimic the output of a quadratic “energy” model, a formalism that has been
used to model a variety of receptive field properties in visual cortex (Pollen
& Ronner, 1983; Adelson & Bergen, 1985; Heeger, 1992; Ohzawa, DeAngelis,
& Freeman, 1997). This close connection between dendritic and receptive
field nonlinearities is highly suggestive and supports the notion that ac-
tive dendrites could contribute to a diverse set of neuronal functions in the
mammalian central nervous system that involve a common, quasi-static
multiplicative nonlinearity of relatively low order (for further discussion,
see Mel, 1999).
The potential contributions of active dendritic processing to the brain's
capacity to learn and remember constitute a fundamental problem in neuroscience
that has thus far received relatively little attention. In one study address-
ing the issue of neuronal capacity, Zador and Pearlmutter (1996) computed
the Vapnik-Chervonenkis (1971) dimension of an integrate-and-fire neu-
ron with a passive dendritic tree. However, location-dependent nonlinear
synaptic interactions, which are likely to occur within the dendrites of many
neuron types, were not considered in this study.
To begin to address this issue, we previously developed a spatially ex-
tended single-neuron model called the “clusteron,” which explicitly in-
cluded location-dependent multiplicative synaptic interactions in the neu-
ron’s input-output function (Mel, 1992a) and illustrated how a neuron with
active dendrites could act as a high-dimensional quadratic learning ma-
chine. The clusteron abstraction was conceived primarily to help quantify
the impact of local dendritic nonlinearities on the usable memory capacity of
a cell; an earlier biophysical modeling study had demonstrated empirically
that this impact could be significant (Mel, 1992b). However, an additional
insight gained from this work was that a cell with s input sites is likely
capable of representing only a small fraction of the O(d²) second-order in-
teraction terms available among d input lines, since in the brain, it is often
reasonable to assume that d is very large. In short, if a neuron does indeed
pull out higher-order features in its dendrites, it is likely to have to choose
very carefully among them.
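To make the clusteron's location-dependent interaction concrete, the following is a minimal sketch, not the original formulation: the 1-D arrangement of synaptic sites, the neighborhood radius, and the omission of learned weights and thresholds are our simplifying assumptions (see Mel, 1992a, for the full abstraction and its learning rule).

```python
import numpy as np

def clusteron_response(x, radius=2):
    """Clusteron-style response: each synapse's contribution is gated by
    the summed activity of its spatial neighbors on the dendrite, giving
    a quadratic, location-dependent input-output function."""
    d = len(x)
    y = 0.0
    for i in range(d):
        lo, hi = max(0, i - radius), min(d, i + radius + 1)
        y += x[i] * np.sum(x[lo:hi])  # multiplicative neighbor interaction
    return y

# The same eight active inputs, delivered clustered versus dispersed:
clustered, dispersed = np.zeros(64), np.zeros(64)
clustered[28:36] = 1.0
dispersed[::8] = 1.0
print(clusteron_response(clustered))  # 34.0: co-active neighbors multiply
print(clusteron_response(dispersed))  # 8.0: neighbors mostly silent
```

Clustered placement yields the larger response, which is the property a clusteron-style learning rule can exploit.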
Inspired in part by this observation, our goal here has been to quantify the
degree to which inclusion of a subset of the best available higher-order terms
augments the capacity of a subsampled quadratic (SQ) classifier relative to
its linear counterpart. Although it has been long appreciated that inclusion
of higher-order terms can increase the power of a learning machine for both
regression and classification (Poggio, 1975; Barron, Mucciardi, Cook, Craig,
& Barron, 1984; Giles & Maxwell, 1987; Ghosh & Shin, 1992; Guyon, Boser,
& Vapnik, 1993; Karpinsky & Werther, 1993; Schurmann, 1996), and algo-
rithms for learning polynomials have often included heuristic strategies for
subsampling the intractable number of higher-order product terms, existing
theory bearing on the learning capabilities of quadratic classifiers is limited.
Cover (1965) showed that an rth-order polynomial classifier consisting of
all r-wise products of the input variables in d dimensions has a natural
separating capacity of
$$2\binom{d+r}{r}.$$
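For a sense of scale, a small illustrative script evaluating this quantity for second-order classifiers (r = 2) in the dimensionalities used in our simulations:

```python
from math import comb

# Cover's (1965) natural separating capacity of a full second-order
# (r = 2) polynomial classifier: 2 * C(d + r, r) patterns.
for d in (10, 20, 30):
    print(d, 2 * comb(d + 2, 2))  # 132, 462, 992
```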
To estimate the capacity of an SQ classifier that retains only the k best of
the K = d(d + 1)/2 available second-order product terms, we adopt a simple
bit-counting heuristic:
$$B(k, d) = (1 + d + k)\,w + \log_2 \binom{K}{k}, \tag{2.1}$$
k
where the first term measures the entropy contained in the stored param-
eter values, with w an unknown representing the average number of bits
of value entropy per parameter, and the second term measures the entropy
contained in the choice of higher-order terms, assuming that all combinations of k product terms are a priori equally likely to be chosen.¹
[Figure 2: capacity B in bits (equation 2.1) versus product term ratio p = k/K (%), plotted for d = 10, 20, and 30.]
¹ The expression for B assumes that the value bits and the choice bits are independent
of each other, which holds only when the k best product terms chosen for inclusion in
the classifier are unlikely to include any zero values. This assumption breaks down as
k → K, in which limit some of the k largest learned coefficients will have values near zero.
This leads in turn to an overprediction of classifier capacity according to equation 2.1 and
the resulting slight nonmonotonicity seen in Figure 2 for large p values. A more intricate
counting method would be needed to separate these two sources of classifier capacity for
large p values.
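Equation 2.1 is straightforward to evaluate numerically. A minimal sketch, assuming K = d(d + 1)/2 available second-order terms and, purely for illustration, the value-entropy estimate w = 4 arrived at later in the text:

```python
from math import comb, log2

def capacity_bits(d, k, w=4.0):
    # Equation 2.1: value entropy of the 1 + d + k coefficients plus the
    # entropy of choosing the k best of K = d(d + 1)/2 product terms.
    K = d * (d + 1) // 2
    return (1 + d + k) * w + log2(comb(K, k))

print(capacity_bits(10, 0))    # linear: 44.0 value bits, no choice bits
print(capacity_bits(10, 10))   # adds log2(C(55, 10)) ~ 34.8 choice bits
print(capacity_bits(10, 55))   # full quadratic: choice term vanishes again
```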
For two reasons, however, equation 2.1 cannot be directly used to predict error rates for SQ classifiers. First,
the value of w is generally unknown a priori. Second, actual error rates do
not depend on parameter entropy alone, but depend also on geometric fac-
tors that determine the suitability of the classifier’s parameters to specific
training set distributions.
Following the maxim that capacity is closely linked to trainable degrees
of freedom, however, and the principle that capacity and error rates should
be related to each other in consistent (if unknown) ways, we conjectured
that there could exist subfamilies of isomorphic SQ classifiers wherein the
relative capacity of any two classifiers i and j, at any error rate and for
any training set distribution, is given by the ratio of their respective B(k, d)
values. Specifically, a family of isomorphic SQ classifiers C is defined such
that for any two classifiers {i, j} ∈ C , the relative number of patterns N
learnable by each classifier at error rate ε obeys the relation
$$R_{i/j} = \frac{N_i}{N_j} = \frac{B_i}{B_j}, \quad \text{for all } \epsilon. \tag{2.2}$$
Put another way, two isomorphic classifiers should yield equal error rates
when asked to learn an equivalent number of patterns per parameter bit, for
any input distribution. We describe several tests of this scaling conjecture
in the following sections.
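For instance, the conjectured capacity ratio of two p-matched SQ classifiers can be read off directly from equation 2.1; the (d, k) pairs below are illustrative choices with k/K = 0.2, not values taken from our experiments:

```python
from math import comb, log2

def B(d, k, w=4.0):
    # Equation 2.1 with K = d(d + 1)/2 available second-order terms.
    K = d * (d + 1) // 2
    return (1 + d + k) * w + log2(comb(K, k))

# Predicted capacity ratio for two p-matched classifiers (k/K = 0.2):
print(B(20, 42) / B(10, 11))  # lies between the explicit parameter-count
                              # ratio (63/22) and the large-d limit (20/10)^2
```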
2.1 Special Cases: Linear and Full Quadratic Classifiers. The scaling
relation expressed in equation 2.2 may be easily verified in the limiting cases
of k = 0 and k = K, which correspond to linear and full quadratic classifiers,
respectively. In both cases, the combinatorial term representing the choice
capacity in equation 2.1 vanishes. Assuming that the value-entropy scaling
factor w is always equal for the two classifiers at a given error rate, the
capacity ratio for two isomorphic “choiceless” classifiers indexed by i and j
simplifies to the ratio of their explicit parameter counts:
$$R_{i/j} = \frac{1 + d_i}{1 + d_j} \tag{2.3}$$
for the linear case, and
$$R_{i/j} = \frac{1 + d_i + (d_i^2 + d_i)/2}{1 + d_j + (d_j^2 + d_j)/2} \tag{2.4}$$
for the full quadratic case. Both outcomes are consistent with the fact that
the natural separating capacity for both linear and full polynomial discrim-
inant functions is directly proportional to the number of free parameters
(Cover, 1965).
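As a worked example, comparing d_i = 20 against d_j = 10 gives R_{i/j} = 21/11 ≈ 1.9 in the linear case and R_{i/j} = 231/66 = 3.5 in the full quadratic case.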
Figure 3: We trained linear and full quadratic classifiers using randomly labeled
patterns drawn from a d-dimensional zero-mean spherical gaussian distribution
and measured average classification error versus training set size. (A) A simple
analytical model (crosses) fit numerical simulation results (squares) in the
asymptote of large N (see the appendix). Dashed lines indicate VC dimension
1 + d and Cover capacity 2(1 + d) for the perceptron, which correspond to 1%
and 7% error rates, respectively. Error rates slightly above the theoretical mini-
mum (0% at VC dimension) were due to imperfect control of the learning-rate
parameter. (B) A Bayes-optimal classifier based on estimated covariance matri-
ces (crosses) provided a control for the performance of a numerically optimized
full quadratic classifier (squares); the performance curves again merge in the
limit of large N. (C) Scaling of linear and quadratic error rates across dimension,
when plotted against training patterns per model parameter (1 + d for linear,
1 + d + (d² + d)/2 for quadratic).
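The protocol behind Figure 3A can be roughed out in a few lines. In this sketch the pseudoinverse (least-squares) discriminant is our stand-in for the learning-rate-controlled training described in the caption, so values near the interpolation threshold will differ in detail:

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_error(d, N, trials=25):
    # Train a linear discriminant (bias included, 1 + d parameters) on N
    # randomly labeled gaussian patterns; return mean training error.
    errs = []
    for _ in range(trials):
        X = np.hstack([rng.standard_normal((N, d)), np.ones((N, 1))])
        y = rng.choice([-1.0, 1.0], size=N)   # random labels
        w = np.linalg.pinv(X) @ y             # least-squares weights
        errs.append(np.mean(np.sign(X @ w) != y))
    return float(np.mean(errs))

for d in (10, 20, 30):
    P = 1 + d                                 # model parameters
    print(d, [round(mean_error(d, r * P), 2) for r in (1, 2, 4, 8)])
```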
[Figure: distributions of learned weight values (left) and weight magnitudes (right) for linear versus quadratic coefficients; d = 10, N = 100.]
Capacity comparisons were made by fixing a criterion error rate and recording the associated capacities for the linear versus various SQ curves (see Figure 5B). The contribution of the choice flexibility is evident here, leading to capacity boosts in excess of those predicted by ex-
plicit parameter counts. For example, at a fixed 1% error rate in 10 di-
mensions, addition of the 10 best quadratic terms to the 11 linear and
constant terms boosted the trainable capacity by more than a factor of
3, whereas the number of explicit parameters grew by less than a factor
of 2.
By analogy with the special cases of linear and full quadratic classifiers,
we sought to identify subfamilies of isomorphic SQ classifiers within which
the relative capacity of any two classifiers was given by R_{i/j}. In doing so,
however, two complications arise. First, it is a priori unclear what condi-
tion(s) on d and k bind together families of isomorphic SQ classifiers in
general, if such families exist. Second, even if such families do exist, w no
longer cancels in the expression for R_{i/j}, as it does for both linear and full
quadratic classifiers. In cases where w is unknown, therefore, R_{i/j} cannot
be evaluated.
[Figure: classification error curves for SQ classifiers with k = 0 (linear), 10, 20, 30, and 40, and for the full quadratic classifier, plotted against number of training patterns; companion panel shows curves for k = 20, 30, and 40 against classification error (%).]

Regarding the criterion for classifier isomorphism, we found empirically
that error rates for SQ classifiers i and j in dimensions d_i and d_j could be
brought into register across dimension with a simple capacity scaling, as
long as the condition
$$\frac{k_i}{K_i} = \frac{k_j}{K_j} \tag{2.5}$$
was obeyed. This suggests that a key geometric invariance in the SQ family
involves the proportion of higher-order terms included in the classifier (see
Figure 6B). This rule also operates correctly for the special cases of linear
(k = 0) and full quadratic (k = K) classifiers. By contrast, another plausible
condition on d and k failed to bind together sets of isomorphic SQ classifiers:
classifiers that use the same fraction of their complete parameter set (see
Figure 6A).
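Concretely, equation 2.5 prescribes how to carry an SQ classifier across dimensions: preserve the fraction of included product terms. The numbers below are an illustrative case, not values from our experiments:

```python
# Equation 2.5: carry an SQ classifier from d_i = 10 to d_j = 30 by
# preserving the product term ratio p = k/K, with K = d(d + 1)/2.
ki, Ki, Kj = 11, 10 * 11 // 2, 30 * 31 // 2   # Ki = 55, Kj = 465
kj = round(ki / Ki * Kj)                      # p = 0.2 -> kj = 93
print(kj)
```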
Having established that two SQ classifiers with equal product term ratios
p = k/K produce performance curves that are isomorphic up to a scaling of
their capacities, the key question remaining was whether the scaling fac-
tor would be predicted by R_{i/j}. Assuming w_i = w_j as for the linear and full
quadratic cases, we note that the relative capacity of two isomorphic SQ classifiers
is again independent of w in the limit of large d_i and d_j, approaching (d_i/d_j)².
This occurs because as d_i and d_j grow large, the ratio of the two value capacities
considered separately, and the ratio of the two choice capacities considered separately,
both individually approach (d_i/d_j)². However, for the lower-dimensional cases
included in our empirical studies, the capacity ratio R_{i/j} does not entirely
shed its dependence on w and p. We found that a value of w = 4 produced
good fits to the empirically determined scaling factors for p-matched SQ
[Figure 6: two candidate conditions for classifier isomorphism, comparing d = 20 and d = 30 against the 10-dimensional case: (A) parameters used in the target dimension versus parameters used in 10 dimensions (% of total parameters); (B) parameters used in the target dimension versus product term ratio p in 10 dimensions (%).]
classifiers in 10, 20, and 30 dimensions, for a wide range of p values tested,
and for training sets drawn from both gaussian and uniform distributions²
(see Figure 7).
To eliminate uncertainty in the value of w, we trained binary-valued SQ
classifiers so that w could be assigned a value of 1 in the expression for R_{i/j}.
In this case, SQ classifiers were trained as before, but prior to testing, the
real-valued coefficients were mapped to ±1 according to their sign. This
radical quantization of weight values led to large increases in error rates
and gross alterations in the form of the error curves; compare Figure 8A to
Figure 7A. However, R_{i/j} could now be evaluated with no free parameters
and continued to produce nearly perfect superpositions of error rate curves
within isomorphic families (see Figure 8).
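The quantization step itself is trivial; in the sketch below, sending exactly zero coefficients to +1 is our arbitrary tie-breaking choice:

```python
import numpy as np

def binarize(coeffs):
    # Map trained real-valued coefficients to +/-1 by sign, so each
    # parameter carries w = 1 bit of value entropy.
    return np.where(coeffs >= 0, 1.0, -1.0)

print(binarize(np.array([0.7, -2.3, 0.0, 1.1])))  # [ 1. -1.  1.  1.]
```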
The nonmonotonicities seen in these binary-valued cases are of unknown
origin. However, given that they scale across dimension, we infer they arise
from consistent, family-specific idiosyncrasies in the number, and inde-
pendence, of the dichotomies realizable for particular training set distri-
butions.
² Fits using w = 3 were nearly as good; the scaling factor’s weak dependence on w is
explained by the fact that w appears in both the numerator and denominator of R_{i/j}.
[Figure 7: (A) SQ performance curves with analog weights scale across dimension; legend (d, k) = (10, 12), (20, 46). (B) SQ curves with analog weights have different shapes for uniform versus gaussian distributions.]
3 Discussion
[Figure 8: (A) SQ performance curves with binary weights scale across dimension; legend (d, k) = (10, 25), (20, 95), (30, 210). (B) SQ curves with binary weights have different shapes for uniform versus gaussian distributions; legend (d, k) = (10, 10), (20, 40). Axes: classification error (%) versus training patterns per parameter bit.]
Figure 8: Error curves for p-matched SQ classifiers trained on gaussian-distributed
patterns and constrained to use binary-valued ±1 weights. Superposition of
curves was achieved using the capacity scaling factor R_{i/j} with no free parameters,
since w could be set to 1 in this case.
The bit-counting expression of equation 2.1 has not yet been formally related to the natural separating capacity of rth-order polynomials (Cover, 1965), nor has it been derived from a fun-
damental statistical mechanical perspective as has been done, for example,
for the simple perceptron (Riegler & Seung, 1997)—and we cannot predict
how difficult it will be to see through either of these approaches. In its favor,
however, the simple form of equation 2.1, coupled with the condition for
classifier isomorphism given in equation 2.5, together account for much of
the empirical variance in model capacity expressed by this rather complex
class of learning machines. It is therefore likely that these two quantitative
relations will be echoed in future theorems and proofs bearing on the class
of subsampled polynomial classifiers.
While choice capacity is in some sense invisible, it is nonetheless very
real. Thus, when a single optimally chosen nonlinear basis function is added
to a linear model, the system behaves as if it gained not only an additional
w bits of value, or coefficient, capacity, but a bonus capacity of log₂(K) bits
owing to the choice that was made. When K is large and w is small, this
bonus can be highly significant. It seems plausible that the bit-counting
approach of equation 2.1 could be useful in quantifying the capacity of
other types of classifiers that adaptively subsample from large families of
nonlinear basis functions—for example, a classifier that draws an optimal
set of k radial basis functions from a complete set of K basis functions tiling
an input space.
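For a sense of scale, the bonus conferred by choosing the single best of the K candidate second-order terms (k = 1 in equation 2.1) can be tabulated directly:

```python
from math import comb, log2

# Choice-capacity bonus for adopting the single best of K candidate
# second-order terms (equation 2.1 with k = 1): log2(K) bits.
for d in (10, 30, 100):
    K = d * (d + 1) // 2
    print(d, K, round(log2(comb(K, 1)), 1))  # e.g., d = 100 -> ~12.3 bits
```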
The practical significance of choice capacity is most acute for hardware-based
learning systems, possibly including the brain, where in the limit of
low-resolution weight values (Petersen, Malenka, Nicoll, & Hopfield, 1998), the
bonus capacity conferred by the choice among many candidate nonlinear terms can
account for a large share of the total capacity available to the system.

Appendix
For the randomly labeled training sets considered here, the two class centroids
X_pos and X_neg are each averages of roughly N/2 patterns drawn from the same
zero-mean gaussian, so each lies at an expected distance L = σ√(2d/N) from the
origin. Since in high dimension X_pos and X_neg are also orthogonal, the expected
distance between them is given by ΔX = √2 L. The optimal linear discriminant
function is thus a hyperplane that bisects, and is perpendicular to, the line
connecting X_pos and X_neg. Given that T_pos and T_neg have equal prior probability,
the expected classification error is given by the integrated probability contained
in G_pos (or G_neg) lying beyond the discriminating hyperplane. Since G_pos is
factorial, this calculation reduces to a complementary error function: the total
probability lying beyond the distance ΔX/2 = σ√(d/N) from the origin of a
one-dimensional gaussian. Thus, in the limit of high dimension with N ≫ d, the
expected classification error is
$$CE = \frac{1}{2}\,\operatorname{erfc}\!\left(\sqrt{\frac{d}{2N}}\right). \tag{A.3}$$
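Equation A.3 can be evaluated directly, for example, to check the large-N asymptote of the simulations in Figure 3A:

```python
from math import erfc, sqrt

def expected_error(d, N):
    # Equation A.3: asymptotic (N >> d) classification error of the optimal
    # linear discriminant on N randomly labeled d-dimensional gaussian patterns.
    return 0.5 * erfc(sqrt(d / (2.0 * N)))

print(expected_error(10, 200))  # error climbs toward 0.5 as N grows
```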
Acknowledgments
References
Adelson, E., & Bergen, J. (1985). Spatiotemporal energy models for the percep-
tion of motion. J. Opt. Soc. Amer., A2, 284–299.
Barron, R., Mucciardi, A., Cook, F., Craig, J., & Barron, A. (1984). Adaptive
learning networks: Development and applications in the United States of
algorithms related to GMDH. In S. Farlow (Ed.), Self-organizing methods in
modeling. New York: Marcel Dekker.
Baum, E., & Haussler, D. (1989). What size net gives valid generalization? Neural
Computation, 1, 151–160.
Bishop, C. (1995). Neural networks for pattern recognition. Oxford: Oxford Univer-
sity Press.
Borg-Graham, L., & Grzywacz, N. (1992). A model of the direction selectiv-
ity circuit in retina: Transformations by neurons singly and in concert.
In T. McKenna, J. Davis, & S. Zornetzer (Eds.), Single neuron computation
(pp. 347–375). Orlando, FL: Academic Press.
Cover, T. (1965). Geometrical and statistical properties of systems of linear in-
equalities with applications in pattern recognition. IEEE Trans. Electronic Com-
puters, 14, 326–334.
Ghosh, J., & Shin, Y. (1992). Efficient higher order neural networks for classifi-
cation and function approximation. International Journal of Neural Systems, 3,
323–350.
Giles, C., & Maxwell, T. (1987). Learning, invariance, and generalization in high-
order neural networks. Applied Optics, 26, 4972–4978.
Guyon, I., Boser, B., & Vapnik, V. (1993). Automatic capacity tuning of very large
VC-dimension classifiers. In S. Hanson, J. Cowan, & C. Giles (Eds.), Advances in
neural information processing systems, 5 (pp. 147–155). San Mateo, CA: Morgan
Kaufmann.
Heeger, D. (1992). Half-squaring in responses of cat striate cells. Visual Neurosci.,
9, 427–443.
Johnston, D., Magee, J., Colbert, C., & Christie, B. (1996). Active properties of
neuronal dendrites. Ann. Rev. Neurosci., 19, 165–186.
Karpinsky, M., & Werther, T. (1993). VC dimension and uniform learnability of
sparse polynomials and rational functions. SIAM J. Comput., 22, 1276–1285.
Koch, C., & Poggio, T. (1992). Multiplying with synapses and neurons. In
T. McKenna, J. Davis, & S. Zornetzer (Eds.), Single neuron computation
(pp. 315–345). Orlando, FL: Academic Press.
Koch, C., Poggio, T., & Torre, V. (1982). Retinal ganglion cells: A functional
interpretation of dendritic morphology. Phil. Trans. R. Soc. Lond. B, 298, 227–
264.
Mel, B. W. (1992a). The clusteron: Toward a simple abstraction for a complex
neuron. In J. Moody, S. Hanson, & R. Lippmann (Eds.), Advances in neural
information processing systems, 4 (pp. 35–42). San Mateo, CA: Morgan Kauf-
mann.
Mel, B. W. (1992b). NMDA-based pattern discrimination in a modeled cortical
neuron. Neural Comp., 4, 502–516.
Mel, B. W. (1993). Synaptic integration in an excitable dendritic tree. J. Neuro-
physiol., 70(3), 1086–1101.
Mel, B. W. (1994). Information processing in dendritic trees. Neural Computation,
6, 1031–1085.
Mel, B. W. (1999). Why have dendrites? A computational perspective. In G. Stu-
art, N. Spruston, & M. Häusser (Eds.), Dendrites (pp. 271–289). Oxford: Oxford
University Press.
Mel, B. W. , Ruderman, D. L., & Archie, K. A. (1998a). Toward a single cell
account of binocular disparity tuning: An energy model may be hiding in
your dendrites. In M. I. Jordan, M. J. Kearns, & S. A. Solla (Eds.), Advances in
neural information processing systems, 10 (pp. 208–214). Cambridge, MA: MIT
Press.
Mel, B. W. , Ruderman, D. L., & Archie, K. A. (1998b). Translation-invariant
orientation tuning in visual “complex” cells could derive from intradendritic
computations. J. Neurosci., 17, 4325–4334.
Ohzawa, I., DeAngelis, G., & Freeman, R. (1997). Encoding of binocular disparity
by complex cells in the cat’s visual cortex. J. Neurophysiol., 77.
Pearlmutter, B. A. (1992). How selective can a linear threshold unit be? In Inter-
national Joint Conference on Neural Networks. Beijing, PRC: IEEE.
Petersen, C. C. H., Malenka, R. C., Nicoll, R. A., & Hopfield, J. J. (1998). All-or-
none potentiation at CA3–CA1 synapses. Proc. Natl. Acad. Sci. USA, 95, 4732–
4737.
Poggio, T. (1975). Optimal nonlinear associative recall. Biol. Cybern., 9, 201.
Pollen, D., & Ronner, S. (1983). Visual cortical neurons as localized spatial fre-
quency filters. IEEE Trans. Sys. Man Cybern., 13, 907–916.
Rall, W., & Segev, I. (1987). Functional possibilities for synapses on dendrites
and on dendritic spines. In G. Edelman, W. Gall, & W. Cowan (Eds.), Synaptic
function. New York: Wiley.