Learning Theory
Felipe Cucker and Ding-Xuan Zhou
CAMBRIDGE UNIVERSITY PRESS
Cambridge, New York, Melbourne, Madrid, Cape Town, Singapore, São Paulo
Cambridge University Press
The Edinburgh Building, Cambridge CB2 8RU, UK
Published in the United States of America by Cambridge University Press, New York
www.cambridge.org
Information on this title: www.cambridge.org/9780521865593
© Cambridge University Press 2007
This publication is in copyright. Subject to statutory exception and to the provision of relevant collective licensing agreements, no reproduction of any part may take place without the written permission of Cambridge University Press.
First published in print format 2007
ISBN-13 978-0-511-27407-7 eBook (EBL)
ISBN-10 0-511-27407-6 eBook (EBL)
ISBN-13 978-0-521-86559-3 hardback
ISBN-10 0-521-86559-X hardback
Cambridge University Press has no responsibility for the persistence or accuracy of URLs for external or third-party internet websites referred to in this publication, and does not guarantee that any content on such websites is, or will remain, accurate or appropriate.
Contents

Foreword
Preface

1 The framework of learning
1.1 Introduction
1.2 A formal setting
1.3 Hypothesis spaces and target functions
1.4 Sample, approximation, and generalization errors
1.5 The bias-variance problem
1.6 The remainder of this book
1.7 References and additional remarks

2 Basic hypothesis spaces
2.1 First examples of hypothesis space
2.2 Reminders I
2.3 Hypothesis spaces associated with Sobolev spaces
2.4 Reproducing Kernel Hilbert Spaces
2.5 Some Mercer kernels
2.6 Hypothesis spaces associated with an RKHS
2.7 Reminders II
2.8 On the computation of empirical target functions
2.9 References and additional remarks

3 Estimating the sample error
3.1 Exponential inequalities in probability
3.2 Uniform estimates on the defect
3.3 Estimating the sample error
3.4 Convex hypothesis spaces
3.5 References and additional remarks

4 Polynomial decay of the approximation error
4.1 Reminders III
4.2 Operators defined by a kernel
4.3 Mercer's theorem
4.4 RKHSs revisited
4.5 Characterizing the approximation error in RKHSs
4.6 An example
4.7 References and additional remarks

5 Estimating covering numbers
5.1 Reminders IV
5.2 Covering numbers for Sobolev smooth kernels
5.3 Covering numbers for analytic kernels
5.4 Lower bounds for covering numbers
5.5 On the smoothness of box spline kernels
5.6 References and additional remarks

6 Logarithmic decay of the approximation error
6.1 Polynomial decay of the approximation error for C^∞ kernels
6.2 Measuring the regularity of the kernel
6.3 Estimating the approximation error in RKHSs
6.4 Proof of Theorem 6.1
6.5 References and additional remarks

7 On the bias-variance problem
7.1 A useful lemma
7.2 Proof of Theorem 7.1
7.3 A concrete example of bias-variance
7.4 References and additional remarks

8 Least squares regularization
8.1 Bounds for the regularized error
8.2 On the existence of target functions
8.3 A first estimate for the excess generalization error
8.4 Proof of Theorem 8.1
8.5 Reminders V
8.6 Compactness and regularization
8.7 References and additional remarks

9 Support vector machines for classification
9.1 Binary classifiers
9.2 Regularized classifiers
9.3 Optimal hyperplanes: the separable case
9.4 Support vector machines
9.5 Optimal hyperplanes: the nonseparable case
9.6 Error analysis for separable measures
9.7 Weakly separable measures
9.8 References and additional remarks

10 General regularized classifiers
10.1 Bounding the misclassification error in terms of the generalization error
10.2 Projection and error decomposition
10.3 Bounds for the regularized error of f_γ^φ
10.4 Bounds for the sample error term involving f_γ^φ
10.5 Bounds for the sample error term involving f_z^φ
10.6 Stronger error bounds
10.7 Improving learning rates by imposing noise conditions
10.8 References and additional remarks

References
Index

Foreword
This book by Felipe Cucker and Ding-Xuan Zhou provides solid mathematical foundations and new insights into the subject called learning theory.
Some years ago, Felipe and I were trying to find something about brain science and
artificial intelligence starting from literature on neural nets. It was in this setting that we
encountered the beautiful ideas and fast algorithms of learning theory. Eventually we were
motivated to write on the mathematical foundations of this new area of science.
I have found this arena, with its new challenges and growing number of applications, to be exciting. For example, the unification of dynamical systems and learning theory is a major problem. Another problem is to develop a comparative study of the useful algorithms currently
available and to give unity to these algorithms. How can one talk about the best algorithm
or find the most appropriate algorithm for a particular task when there are so many desirable
features, with their associated tradeoffs? How can one see the working of aspects of the
human brain and machine vision in the same framework?
I know both authors well. I visited Felipe in Barcelona more than 13 years ago for several
months, and when I took a position in Hong Kong in 1995, I asked him to join me. There
Lenore Blum, Mike Shub, Felipe, and I finished a book on real computation and complexity. I
returned to the USA in 2001, but Felipe continues his job at the City University of Hong Kong.
Despite the distance we have continued to write papers together. I came to know Ding-Xuan
as a colleague in the math department at City University. We have written a number of papers
together on various aspects of learning theory. It gives me great pleasure to continue to work
with both mathematicians. I am proud of our joint accomplishments.
I leave to the authors the task of describing the contents of their book. I will give some
personal perspective on and motivation for what they are doing. Computational science demands an understanding of fast, robust algorithms. The same applies to modern theories of artificial and human intelligence. Part of this understanding is a complexity-theoretic analysis.
Here I am not speaking of a literal count of arithmetic operations (although that is a byproduct), but rather of the question: What sample size yields a given accuracy? Better yet,
describe the error of a computed hypothesis as a function of the number of examples, the
desired confidence, the complexity of the task to be learned, and variants of the algorithm. If
the answer is given in terms of a mathematical theorem, the practitioner may not find the
result useful. On the other hand, it is important for workers in the field or leaders in
laboratories to have some background in theory, just as economists depend on knowledge of
economic equilibrium theory. Most important, however, is the role of mathematical
foundations and analysis of algorithms as a precursor to research into new algorithms, and
into old algorithms in new and different settings.
I have great confidence that many learning-theory scientists will profit from this book.
Moreover, scientists with some mathematical background will find in this account a fine
introduction to the subject of learning theory.
Preface
1 The framework of learning

1.1 Introduction

[Figure 1.1]
Case 1.2 Case 1.1 readily extends to a classical situation in science, namely, that of learning a physical law by curve fitting to data. Assume that the law at hand, an unknown function f : R → R, has a specific form and that the space of all functions with this form can be parameterized by N real numbers. For instance, if f is assumed to be a polynomial of degree d, then N = d + 1 and the parameters are the unknown coefficients w_0, ..., w_d of f. In this case, finding the best fit by the least squares method estimates the unknown f from a set of pairs {(x_1, y_1), ..., (x_m, y_m)}. If the measurements generating this set were exact, then y_i would be equal to f(x_i). However, in general one expects the values y_i to be affected by noise. That is, y_i = f(x_i) + ε, where ε is a random variable (which may depend on x_i) with mean zero. One then computes the vector of coefficients w such that the value

\sum_{i=1}^{m} (f_w(x_i) - y_i)^2, \qquad \text{with } f_w(x) = \sum_{j=0}^{d} w_j x^j,

is minimized, where, typically, m ≥ N. In general, the minimum value above is not 0. To solve this minimization problem, one uses the least squares technique, a method going back to Gauss and Legendre that is computationally efficient and relies on numerical linear algebra.
Since the values y_i are affected by noise, one might take as a starting point, instead of the unknown f, a family of probability measures ε_x on R varying with x ∈ R. The only requirement on these measures is that for all x ∈ R, the mean of ε_x is f(x). Then y_i is randomly drawn from ε_{x_i}. In some contexts the x_i, rather than being chosen, are also generated by a probability measure ρ_X on R. Thus, the starting point could even be a single measure ρ on R × R, capturing both the measure ρ_X and the measures ε_x for x ∈ R, from which the pairs (x_i, y_i) are randomly drawn.

A more general form of the functions in our approximating class could be given by

f_w(x) = \sum_{i=1}^{N} w_i \phi_i(x),

where the φ_i are the elements of a basis of a specific function space, not necessarily of polynomials.
Case 1.3 The training of neural networks is an extension of Case 1.2. Roughly speaking, a neural network is a directed graph containing some input nodes, some output nodes, and some intermediate nodes where certain functions are computed. If X denotes the input space (whose elements are fed to the input nodes) and Y the output space (of possible elements returned by the output nodes), a neural network computes a function from X to Y. The literature on neural networks shows a variety of choices for X and Y, which can be continuous or discrete, as well as for the functions computed at the intermediate nodes. A common feature of all neural nets, though, is the dependence of these functions on a set of parameters, usually called weights, w = {w_j}_{j∈J}. This set determines the function f_w : X → Y computed by the network.

Neural networks are trained to learn functions. As in Case 1.2, there is a target function f : X → Y, and the network is given a set of randomly chosen pairs (x_1, y_1), ..., (x_m, y_m) in X × Y. Then, training algorithms select a set of weights w attempting to minimize some distance from f_w to the target function f : X → Y.
Case 1.4 A standard example of pattern recognition involves handwritten characters. Consider the problem of classifying handwritten letters of the English alphabet. Here, elements in our space X could be matrices with entries in the interval [0,1], each entry representing a pixel in a certain gray scale of a digitized photograph of the handwritten letter, or some features extracted from the letter. We may take Y to be

Y = \Big\{ y = \sum_{i=1}^{26} \lambda_i e_i : \sum_{i=1}^{26} \lambda_i = 1 \Big\} \subset \mathbb{R}^{26}.

Here e_i is the ith coordinate vector in R^26, each coordinate corresponding to a letter. If A ⊂ Y is the set of points y as above such that 0 ≤ λ_i ≤ 1 for i = 1, ..., 26, one can interpret a point in A as a probability measure on the set {A, B, C, ..., X, Y, Z}. The problem is to learn the ideal function f : X → Y that associates, to a given handwritten letter x, the linear combination of the e_i with coefficients {Prob{x = A}, Prob{x = B}, ..., Prob{x = Z}}. Unambiguous letters are mapped into a coordinate vector, and in the (pure) classification problem f takes values on these e_i. Learning f means finding a sufficiently good approximation of f within a given prescribed class.

The approximation of f is constructed from a set of samples of handwritten letters, each of them with a label in Y. The set {(x_1, y_1), ..., (x_m, y_m)} of these m samples is randomly drawn from X × Y according to a measure ρ on X × Y. This measure satisfies ρ(X × A) = 1. In addition, in practice, it is concentrated around the set of pairs (x, y) with y = e_i for some 1 ≤ i ≤ 26. That is, the occurring elements x ∈ X are handwritten letters and not, say, a digitized image of the Mona Lisa. The function f to be learned is the regression function f_ρ of ρ. That is, f_ρ(x) is the average of the y values of {x} × Y (we are more precise about ρ and the regression function in Section 1.2).
Case 1.5 A standard approach for approximating characteristic (or indicator) functions of sets is known as PAC learning (from "probably approximately correct"). Let T (the target concept) be a subset of R^n and ρ_X be a probability measure on R^n that we assume is not known in advance. Intuitively, a set S ⊂ R^n approximates T when the symmetric difference S Δ T = (S \ T) ∪ (T \ S) is small, that is, has a small measure. Note that if f_S and f_T denote the characteristic functions of S and T, respectively, this measure, called the error of S, is ∫_{R^n} |f_S − f_T| dρ_X. Note also that since the functions take values in {0,1}, this integral coincides with ∫_{R^n} (f_S − f_T)^2 dρ_X.

Let C be a class of subsets of R^n and assume that T ∈ C. One strategy for constructing an approximation of T in C is the following. First, draw points x_1, ..., x_m ∈ R^n according to ρ_X and label each of them with 1 or 0 according to whether they belong to T. Second, compute any function f_S : R^n → {0,1}, with S ∈ C, that coincides with this labeling over {x_1, ..., x_m}. Such a function will provide a good approximation S of T (small error with respect to ρ_X) as long as m is large enough and C is not too wild. Thus the measure ρ_X is used in both capacities, governing the sample drawing and measuring the error set S Δ T.

A major goal in PAC learning is to estimate how large m needs to be to obtain an ε-approximation of T with probability at least 1 − δ, as a function of ε and δ.
The situation described above is noise free, since each randomly drawn point x_i ∈ R^n is correctly labeled. Extensions of PAC learning allowing for labeling mistakes with small probability exist.
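The following toy simulation (not from the text; the target concept T, the class of intervals, and the uniform measure ρ_X are arbitrary choices) illustrates how the error of a consistent hypothesis shrinks as m grows:

    import numpy as np

    rng = np.random.default_rng(1)
    T = (0.3, 0.7)                        # target concept: an interval in R (hypothetical choice)

    def error(S, num=200_000):
        # measure of the symmetric difference S delta T under rho_X = uniform on [0, 1]
        x = rng.uniform(0.0, 1.0, num)
        in_S = (x >= S[0]) & (x <= S[1])
        in_T = (x >= T[0]) & (x <= T[1])
        return np.mean(in_S != in_T)

    for m in [10, 100, 1000, 10000]:
        x = rng.uniform(0.0, 1.0, m)
        labels = (x >= T[0]) & (x <= T[1])
        pos = x[labels]
        # a consistent hypothesis: the smallest interval containing the positive examples
        S = (pos.min(), pos.max()) if pos.size else (0.0, 0.0)
        print(m, error(S))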
Case 1.6 (Monte Carlo integration) An early instance of randomization in algorithmics appeared in numerical integration. Let f : [0,1]^n → R. One way of approximating the integral ∫_{x∈[0,1]^n} f(x) dx consists of randomly drawing points x_1, ..., x_m ∈ [0,1]^n and computing

I_m(f) = \frac{1}{m} \sum_{i=1}^{m} f(x_i).

Under mild conditions on the regularity of f, I_m(f) → ∫ f with probability 1; that is, for all ε > 0,

\lim_{m \to \infty} \operatorname{Prob}_{x_1, \ldots, x_m} \left\{ \left| I_m(f) - \int_{x \in [0,1]^n} f(x)\, dx \right| \ge \varepsilon \right\} = 0.
Again we find the theme of learning an object (here a single real number, although defined in a nontrivial way through f) from a sample. In this case the measure governing the sample is known (the measure on [0,1]^n inherited from the standard Lebesgue measure on R^n), but the same idea can be used for an unknown measure. If ρ_X is a probability measure on X ⊂ R^n, a domain or manifold, I_m(f) will approximate ∫_{x∈X} f(x) dρ_X for large m with high probability, as long as the points x_1, ..., x_m are drawn from X according to the measure ρ_X. Note that no noise is involved here. An extension of this idea to include noise is, however, possible.
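A minimal sketch of Monte Carlo integration as described above (the integrand, the dimension, and the sample sizes are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(2)
    n = 3                                   # dimension of the cube [0,1]^n
    exact = 0.5 ** n                        # integral of f(x) = x_1 * ... * x_n over [0,1]^n

    def f(x):
        return np.prod(x, axis=1)

    for m in [100, 10_000, 1_000_000]:
        x = rng.uniform(0.0, 1.0, size=(m, n))
        I_m = np.mean(f(x))                 # I_m(f) = (1/m) * sum_i f(x_i)
        print(m, I_m, abs(I_m - exact))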
A common characteristic of Cases 1.2-1.5 is the existence of both an unknown function f : X → Y and a probability measure allowing one to randomly draw points in X × Y. That measure can be on X (Case 1.5), on Y varying with x ∈ X (Cases 1.2 and 1.3), or on the product X × Y (Case 1.4). The only requirement it satisfies is that, if for x ∈ X a point y ∈ Y can be randomly drawn, then the expected value of y is f(x). That is, the noise is centered at zero. Case 1.6 does not follow this pattern. However, we have included it since it is a well-known algorithm and shares the flavor of learning an unknown object from random data.

The development in this book, for reasons of unity and generality, is based on a single measure on X × Y. However, one should keep in mind the distinction between inputs x ∈ X and outputs y ∈ Y.
1.2 A formal setting
Since we want to study learning from random sampling, the primary object in our development is a probability measure ρ governing the sampling, which is not known in advance.

Let X be a compact metric space (e.g., a domain or a manifold in Euclidean space) and Y = R^k. For convenience we will take k = 1 for the time being. Let ρ be a Borel probability measure on Z = X × Y whose regularity properties will be assumed as required. In the following we try to utilize concepts formed naturally and solely from X, Y, and ρ.

Throughout this book, if ξ is a random variable (i.e., a real-valued function on a probability space Z), we will use E(ξ) to denote the expected value (or average, or mean) of ξ and σ^2(ξ) to denote its variance. Thus

E(\xi) = \int_{z \in Z} \xi(z)\, d\rho \qquad \text{and} \qquad \sigma^2(\xi) = E\big((\xi - E(\xi))^2\big) = E(\xi^2) - (E(\xi))^2.
A central concept in the next few chapters is the generalization error (or least squares error or, if there is no risk of ambiguity, simply error) of f, for f : X → Y, defined by

E(f) = E_\rho(f) = \int_Z (f(x) - y)^2\, d\rho.

For each input x ∈ X and output y ∈ Y, (f(x) − y)^2 is the error incurred through the use of f as a model for the process producing y from x. This is a local error. By integrating over X × Y (w.r.t. ρ, of course) we average out this local error over all pairs (x, y). Hence the word "error" for E(f).
The problem posed is: What is the f that minimizes the error E(f)? To answer this question we note that the error E(f) naturally decomposes as a sum. For every x ∈ X, let ρ(y|x) be the conditional (w.r.t. x) probability measure on Y. Let also ρ_X be the marginal probability measure of ρ on X, that is, the measure on X defined by ρ_X(S) = ρ(π^{-1}(S)), where π : X × Y → X is the projection. For every integrable function φ : X × Y → R, a version of Fubini's theorem relates ρ, ρ(y|x), and ρ_X as follows:

\int_{X \times Y} \varphi(x, y)\, d\rho = \int_X \left( \int_Y \varphi(x, y)\, d\rho(y|x) \right) d\rho_X.

This breaking of ρ into the measures ρ(y|x) and ρ_X corresponds to looking at Z as a product of an input domain X and an output set Y. In what follows, unless otherwise specified, integrals are to be understood as being over ρ, ρ(y|x), or ρ_X.
Define f_ρ : X → Y by

f_\rho(x) = \int_Y y\, d\rho(y|x).

The function f_ρ is called the regression function of ρ. For each x ∈ X, f_ρ(x) is the average of the y coordinate of {x} × Y (in topological terms, the average of y on the fiber of x). Regularity hypotheses on ρ will induce regularity properties on f_ρ.

We will assume throughout this book that f_ρ is bounded.
Fix x ∈ X and consider the function from Y to R mapping y to (y − f_ρ(x)). Since the expected value of this function is 0, its variance is

\sigma_\rho^2(x) = \int_Y (y - f_\rho(x))^2\, d\rho(y|x).

Now average over X, to obtain

\sigma_\rho^2 = \int_X \sigma_\rho^2(x)\, d\rho_X.

The number σ_ρ^2 is a measure of how well conditioned ρ is, analogous to the notion of condition number in numerical linear algebra.
Remark 1.7
(i) It is important to note that whereas ρ and f_ρ are generally unknown, ρ_X is known in some situations and can even be the Lebesgue measure on X inherited from Euclidean space (as in Cases 1.2 and 1.6).
(ii) In the remainder of this book, if formulas do not make sense or ∞ appears, then the assertions where these formulas occur should be considered vacuous.
Proposition 1.8 For every f : X → Y,

E(f) = \int_X (f(x) - f_\rho(x))^2\, d\rho_X + \sigma_\rho^2.

Proof. From the definition of f_ρ(x), for each x ∈ X, ∫_Y (f_ρ(x) − y) dρ(y|x) = 0. Therefore,

E(f) = \int_Z \big( f(x) - f_\rho(x) + f_\rho(x) - y \big)^2
     = \int_X (f(x) - f_\rho(x))^2 + \int_{X \times Y} (f_\rho(x) - y)^2 + 2 \int_{X \times Y} (f(x) - f_\rho(x))(f_\rho(x) - y)
     = \int_X (f(x) - f_\rho(x))^2 + \sigma_\rho^2 + 2 \int_X (f(x) - f_\rho(x)) \int_Y (f_\rho(x) - y)
     = \int_X (f(x) - f_\rho(x))^2\, d\rho_X + \sigma_\rho^2.  □¹
The first term on the right-hand side of Proposition 1.8 provides an average (over X) of the error suffered from the use of f as a model for f_ρ. In addition, since σ_ρ^2 is independent of f, Proposition 1.8 implies that f_ρ has the smallest possible error among all functions f : X → Y. Thus σ_ρ^2 represents a lower bound on the error E, and it is due solely to our primary object, the measure ρ. Thus, Proposition 1.8 supports the following statement:

The goal is to learn (i.e., to find a good approximation of) f_ρ from random samples on Z.
¹ Throughout this book, the square denotes the end of a proof or the fact that no proof is given.
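Proposition 1.8 is easy to probe numerically. The sketch below (a hypothetical setup: X = [0,1], ρ_X uniform, Gaussian noise of constant level σ, so that σ_ρ^2 = σ^2) compares a Monte Carlo estimate of E(f) with ∫_X (f − f_ρ)^2 dρ_X + σ_ρ^2:

    import numpy as np

    rng = np.random.default_rng(3)

    f_rho = lambda x: np.sin(2 * np.pi * x)    # regression function
    sigma = 0.2                                # constant conditional noise level, so sigma_rho^2 = sigma**2
    f = lambda x: 2 * x - 0.5                  # an arbitrary candidate model

    m = 2_000_000
    x = rng.uniform(0.0, 1.0, m)
    y = f_rho(x) + rng.normal(0.0, sigma, m)

    lhs = np.mean((f(x) - y) ** 2)                       # Monte Carlo estimate of E(f)
    rhs = np.mean((f(x) - f_rho(x)) ** 2) + sigma ** 2   # excess error plus sigma_rho^2
    print(lhs, rhs)                                      # the two numbers agree up to Monte Carlo error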
We now consider sampling. Let

z ∈ Z^m,  z = ((x_1, y_1), ..., (x_m, y_m))

be a sample in Z^m, that is, m examples independently drawn according to ρ. Here Z^m denotes the m-fold Cartesian product of Z. We define the empirical error of f (w.r.t. z) to be

E_z(f) = \frac{1}{m} \sum_{i=1}^{m} (f(x_i) - y_i)^2.
If ξ is a random variable on Z, we denote the empirical mean of ξ (w.r.t. z) by E_z(ξ). Thus,

E_z(\xi) = \frac{1}{m} \sum_{i=1}^{m} \xi(z_i).

For any function f : X → Y we denote by f_Y the function

f_Y : X × Y → Y,  (x, y) ↦ f(x) − y.

With these notations we may write E(f) = E(f_Y^2) and E_z(f) = E_z(f_Y^2). We have already remarked that the expected value of (f_ρ)_Y is 0; we now remark that its variance is σ_ρ^2.
Remark 1.9 Consider the PAC learning setting discussed in Case 1.5, where X = R^n and T is a subset of R^n.² The measure ρ_X described there can be extended to a measure ρ on Z by defining, for A ⊂ Z,

ρ(A) = ρ_X({x ∈ X : (x, f_T(x)) ∈ A}),

² Note, in this case, that X is not compact. In fact, most of the results in this book do not require compactness of X but only completeness and separability.

1.3 Hypothesis spaces and target functions
Learning processes do not take place in a vacuum. Some structure needs to be present at the beginning of the process. In our formal development, we assume that this structure takes the form of a class of functions (e.g., a space of polynomials, of splines, etc.). The goal of the learning process is thus to find the best approximation of f_ρ within this class.

Let C(X) be the Banach space of continuous functions on X with the norm

‖f‖_∞ = \sup_{x \in X} |f(x)|.

We consider a subset H of C(X), in what follows called the hypothesis space, where algorithms will work to find, as well as possible, the best approximation for f_ρ. A main choice in this book is a compact, infinite-dimensional subset of C(X), but we will also consider closed balls in finite-dimensional subspaces of C(X) and whole linear spaces.

If f_ρ ∈ H, simplifications will occur, but in general we will not even assume that f_ρ ∈ C(X), and we will have to consider a target function f_H in H. Define f_H to be any function minimizing the error E(f) over f ∈ H, namely, any optimizer of

\min_{f \in H} \int_Z (f(x) - y)^2\, d\rho.
Notice that since E(f) = ∫_X (f − f_ρ)^2 + σ_ρ^2, f_H is also an optimizer of

\min_{f \in H} \int_X (f(x) - f_\rho(x))^2\, d\rho_X.
Let z ∈ Z^m be a sample. We define the empirical target function f_{H,z} = f_z to be a function minimizing the empirical error E_z(f) over f ∈ H, that is, an optimizer of

\min_{f \in H} \frac{1}{m} \sum_{i=1}^{m} (f(x_i) - y_i)^2.    (1.1)

Note that although f_z is not produced by an algorithm, it is close to algorithmic. The statement of the minimization problem (1.1) depends on ρ only through its dependence on z, but once z is given, so is (1.1), and its solution f_z can be looked for without further involvement of ρ. In contrast to f_H, f_z is empirical in its dependence on the sample z. Note finally that E(f_z) and E_z(f_z) are different objects.
We next prove that f_H and f_z exist under a mild condition on H.

Definition 1.10 Let f : X → Y and z ∈ Z^m. The defect of f (w.r.t. z) is

L_z(f) = L_{ρ,z}(f) = E(f) − E_z(f).

Notice that the theoretical error E(f) cannot be measured directly, whereas E_z(f) can. A bound on L_z(f) becomes useful since it allows one to bound the actual error from an observed quantity. Such bounds are the object of Theorems 3.8 and 3.10.

Let f_1, f_2 ∈ C(X). Toward the proof of the existence of f_H and f_z, we first estimate the quantity

|L_z(f_1) − L_z(f_2)|

linearly by ‖f_1 − f_2‖_∞ for almost all z ∈ Z^m (a Lipschitz estimate). We recall that a set U ⊂ Z is said to be of full measure when Z \ U has measure zero.

Proposition 1.11 If, for j = 1, 2, |f_j(x) − y| ≤ M on a full measure set U ⊂ Z, then, for all z ∈ U^m,

|L_z(f_1) − L_z(f_2)| ≤ 4M ‖f_1 − f_2‖_∞.

Proof. Since

(f_1(x) - y)^2 - (f_2(x) - y)^2 = (f_1(x) + f_2(x) - 2y)(f_1(x) - f_2(x)),

we have

|E(f_1) - E(f_2)| = \Big| \int_Z (f_1(x) + f_2(x) - 2y)(f_1(x) - f_2(x))\, d\rho \Big|
                 \le \int_Z \big( |f_1(x) - y| + |f_2(x) - y| \big)\, \|f_1 - f_2\|_\infty\, d\rho
                 \le 2M \|f_1 - f_2\|_\infty.

Also, for all z ∈ U^m, we have

|E_z(f_1) - E_z(f_2)| \le 2M \|f_1 - f_2\|_\infty,

and the two bounds together yield |L_z(f_1) − L_z(f_2)| ≤ |E(f_1) − E(f_2)| + |E_z(f_1) − E_z(f_2)| ≤ 4M ‖f_1 − f_2‖_∞. □
1.4 Sample, approximation, and generalization errors
For a given hypothesis space H, the error in H of a function f ∈ H is the normalized error

E_H(f) = E(f) − E(f_H).

Note that E_H(f) ≥ 0 for all f ∈ H and that E_H(f_H) = 0. Also note that E(f_H) and E_H(f) are different objects.

Continuing the discussion after Proposition 1.8, it follows from our definitions and that proposition that

\int_X (f_z - f_\rho)^2\, d\rho_X + \sigma_\rho^2 = E(f_z) = E_H(f_z) + E(f_H).    (1.2)
The quantities in (1.2) are the main characters in this book. We have already noted that σ_ρ^2 is a lower bound on the error E that is solely due to the measure ρ. The generalization error E(f_z) of f_z depends on ρ, H, the sample z, and the scheme (1.1) defining f_z. The squared distance ∫_X (f_z − f_ρ)^2 dρ_X is the excess generalization error of f_z. A goal of this book is to show that under some hypotheses on ρ and H, this excess generalization error becomes arbitrarily small with high probability as the sample size m tends to infinity.
Now consider the sum E_H(f_z) + E(f_H). The second term in this sum depends on the choice of H but is independent of sampling. We will call it the approximation error. Note that this approximation error is the sum

A(H) + σ_ρ^2,

where A(H) = ∫_X (f_H − f_ρ)^2 dρ_X. Therefore, σ_ρ^2 is a lower bound for the approximation error. The first term, E_H(f_z), is called the sample error or estimation error.
Equation (1.2) thus reduces our goal above, to estimate ∫_X (f_z − f_ρ)^2 dρ_X or, equivalently, E(f_z), to two different problems corresponding to finding estimates for the sample and approximation errors. The way these problems depend on the measure ρ calls for different methods and assumptions in their analysis.

The second problem (to estimate A(H)) is independent of the sample z. But it depends heavily on the regression function f_ρ. The worse behaved f_ρ is (e.g., the more it oscillates), the more difficult it will be to approximate f_ρ well with functions in H. Consequently, all bounds for A(H) will depend on some parameter measuring the behavior of f_ρ.

The first problem (to estimate the sample error E_H(f_z)) is posed on the space H, and its dependence on ρ is through the sample z. In contrast with the approximation error, it is essentially independent of f_ρ. Consequently, bounds for E_H(f_z) will not depend on properties of f_ρ. However, due to their dependence on the random sample z, they will hold only with a certain confidence. That is, the bound will depend on a parameter δ and will hold with a confidence of at least 1 − δ.

This discussion extends to some algorithmic issues. Although dependence on the behavior of f_ρ seems unavoidable in the estimates of the approximation error (and hence in those of the generalization error E(f_z) of f_z), such a dependence is undesirable in the design of the algorithmic procedures leading to f_z (e.g., the selection of H). Ultimately, the goal is to be able, given a sample z, to select a hypothesis space H and compute the resulting f_z without assumptions on f_ρ, and then to exhibit bounds on ∫_X (f_z − f_ρ)^2 dρ_X that are, with high probability, reasonably good to the extent that f_ρ is well behaved. Yet, in many situations, the choice of some parameter related to the selection of H is performed with methods that, although satisfactory in practice, lack a proper theoretical justification. For these methods, our best theoretical results rely on information about f_ρ.
1.5 The bias-variance problem
For fixed H the sample error decreases when the number m of examples increases (as we see in Theorem 3.14). Fix m instead. Then, typically, the approximation error will decrease when enlarging H, but the sample error will increase. The bias-variance problem consists of choosing the size of H when m is fixed so that the error E(f_z) is minimized with high probability. Roughly speaking, the bias of a solution f coincides with the approximation error, and its variance with the sample error. This is common terminology:

A model which is too simple, or too inflexible, will have a large bias, while one which has too much flexibility in relation to the particular data set will have a large variance. Bias and variance are complementary quantities, and the best generalization [i.e., the smallest error] is obtained when we have the best compromise between the conflicting requirements of small bias and small variance.³
Thus, a space H that is too small will yield a large bias, whereas one that is too large will yield a large variance. Several parameters (radius of balls, dimension, etc.) determine the size of H, and different instances of the bias-variance problem are obtained by fixing all of them except one and minimizing the error over this nonfixed parameter.
Failing to find a good compromise between bias and variance leads to what is called underfitting (large bias) or overfitting (large variance). As an example, consider Case 1.2 and the curve C in Figure 1.2(a) with the set of sample points, and assume we want to approximate that curve with a polynomial of degree d (the parameter d determines in our case the dimension of H). If d is too small, say d = 2, we obtain a curve as in Figure 1.2(b), which necessarily underfits the data points. If d is too large, we can tightly fit the data points, but this overfitting yields a curve as in Figure 1.2(c). In terms of the error decomposition (1.2) this overfitting corresponds to a small approximation error but a large sample error.

[Figure 1.2]

³ [18], p. 332.
As another example of overfitting, consider the PAC learning situation in Case 1.5 with C consisting of all subsets of R^n. Consider also a sample {(x_1, 1), ..., (x_k, 1), (x_{k+1}, 0), ..., (x_m, 0)}. The characteristic function of the set S = {x_1, ..., x_k} has zero sample error, but its approximation error is the measure (w.r.t. ρ_X) of the set T Δ {x_1, ..., x_k}, which equals the measure of T as long as ρ_X has no points with positive probability mass.
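The underfitting/overfitting behavior of Figure 1.2 can be reproduced with a few lines of code (hypothetical data; the degrees 2, 5, and 15 stand in for the too-small, moderate, and too-large choices of d):

    import numpy as np

    rng = np.random.default_rng(8)
    f_rho = lambda x: np.sin(2 * np.pi * x)

    m = 20
    x = np.sort(rng.uniform(0, 1, m))
    y = f_rho(x) + rng.normal(0, 0.15, m)
    x_test = rng.uniform(0, 1, 10_000)
    y_test = f_rho(x_test) + rng.normal(0, 0.15, 10_000)

    for d in [2, 5, 15]:                    # too small, moderate, too large
        w = np.polyfit(x, y, d)
        train = np.mean((np.polyval(w, x) - y) ** 2)           # empirical error on the sample
        test = np.mean((np.polyval(w, x_test) - y_test) ** 2)  # estimate of the generalization error
        print(d, train, test)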
1.6 The remainder of this book
In Chapter 2 we describe some common choices for the hypothesis space H. One of them, derived from the use of reproducing kernel Hilbert spaces (RKHSs), will be systematically used in the remainder of the book.

The focus of Chapter 3 is on estimating the sample error. We want to estimate how close one may expect f_z and f_H to be, depending on the size of the sample and with a given confidence. Or, equivalently:

How many examples do we need to draw to assert, with a confidence greater than 1 − δ, that ∫_X (f_z − f_H)^2 dρ_X is not more than ε?

Our main result in Chapter 3, Theorem 3.3, gives an answer.
Chapter 4 characterizes the measures ρ and some families {H_R}_{R>0} of hypothesis spaces for which A(H_R) tends to zero with polynomial decay; that is, A(H_R) = O(R^{-θ}) for some θ > 0. These families of hypothesis spaces are defined using RKHSs. Consequently, the chapter opens with several results on these spaces, including a proof of Mercer's theorem.

The bounds for the sample error in Chapter 3 are in terms of a specific measure of the size of the hypothesis space H, namely, its covering numbers. This measure is not explicit for all of the common choices of H. In Chapter 5 we give bounds for these covering numbers for most of the spaces H introduced in Chapter 2. These bounds are in terms of explicit geometric parameters of H (e.g., dimension, diameter, smoothness, etc.).

In Chapter 6 we continue along the lines of Chapter 4. We first show some conditions under which the approximation error can decay as O(R^{-θ}) only if f_ρ is C^∞. Then we show a polylogarithmic decay in the approximation error of hypothesis spaces defined via RKHSs for some common instances of these spaces.
Chapter 7 gives a solution to the bias-variance problem for a particular family of hypothesis spaces (and under some assumptions on f_ρ).

Chapter 8 describes a new setting, regularization, in which the hypothesis space is no longer required to be compact, and argues some equivalence with the setting described above. In this new setting the computation of the empirical target function is algorithmically very simple. The notion of excess generalization error has a natural version, and a bound for it is exhibited.

A special case of learning is that in which Y is finite and, most particularly, when it has two elements (cf. Case 1.5). Learning problems of this kind are called classification problems, as opposed to the ones with Y = R, which are called regression problems. For classification problems it is possible to take advantage of the special structure of Y to devise learning schemes that perform better than simply specializing the schemes used for regression problems. One such scheme, known as the support vector machine, is described, and its error analyzed, in Chapter 9. Chapter 10 gives a detailed analysis of natural extensions of the support vector machine.

We have begun Chapters 3-10 with brief introductions. Our intention is that, maybe after reading Chapter 2, a reader can form an accurate idea of the contents of this book simply by reading these introductions.
1.7 References and additional remarks
The setting described in Section 1.2 was first considered in learning theory by V. Vapnik and his collaborators. An account of Vapnik's work can be found in [134].

For the bias-variance problem in the context of learning theory see [18, 54] and the references therein.

There is a vast literature in learning theory dealing with the sample error. Two representative books on this topic are [6, 44].

Probably the first studies of the two terms in the error decomposition (1.2) were [96] and [36].

In this book we will not go deeper into the details of PAC learning. A standard reference for this is [67].
Other (but not all) books dealing with diverse mathematical aspects of learning theory are
[7, 29, 37, 57, 59, 61, 92, 95, 107, 111, 124, 125, 132, 133, 136, 137]. In addition, a number
of scientific journals publish papers on learning theory. Two devoted wholly to the theory as
developed in this book are Journal of Machine Learning Research and Machine
Learning.
Finally, we want to mention that the exposition and structure of this chapter
2 Basic hypothesis spaces

In this chapter we describe several examples of hypothesis spaces. One of these examples (or, rather, a family of them), a subset H of an RKHS, will be systematically used in the remainder of this book.
2.1 First examples of hypothesis space

2.2 Reminders I
In addition, when ν = ρ_X we simply write ‖ ‖_p instead of the more cumbersome ‖ ‖_{L^p_{ρ_X}}.

Note that elements in L^p_ν(X) are classes of functions. In general, however, one abuses language and refers to them as functions on X. For instance, we say that f ∈ L^p_ν(X) is continuous when there exists a continuous function in the class of f.

The support of a measure ν on X is the smallest closed subset X_ν of X such that ν(X \ X_ν) = 0.

A function f : X → R is measurable when, for all a ∈ R, the set {x ∈ X : f(x) ≤ a} is a Borel subset of X.

The space L^∞_ν(X) is defined to be the set of all measurable functions on X such that

‖f‖_{L^∞_ν(X)} := \sup_{x \in X_\nu} |f(x)| < ∞.

Each element in L^∞_ν(X) is a class of functions that are identical on X_ν.

A measure ν is finite when ν(X) < ∞. Also, we say that ν is nondegenerate when, for each nonempty open subset U ⊂ X, ν(U) > 0. Note that ν is nondegenerate if and only if X_ν = X.

If ν is finite and nondegenerate, then we have a well-defined injection C(X) ↪ L^p_ν(X) for all 1 ≤ p ≤ ∞.

When ν is the Lebesgue measure, we sometimes denote L^p_ν(X) by L^p(X) or, if there is no risk of confusion, simply by L^p.
II. We next briefly recall some basics about the Fourier transform. The Fourier transform F : L^1(R^n) → L^∞(R^n) is defined by

F(f)(\omega) = \int_{\mathbb{R}^n} e^{-i \omega \cdot x} f(x)\, dx.

The function F(f) is well defined and continuous on R^n (note, however, that F(f) is a complex-valued function). One major property of the Fourier transform in L^1(R^n) is the convolution property

F(f * g) = F(f)\, F(g),

where f * g denotes the convolution of f and g defined by

(f * g)(x) = \int_{\mathbb{R}^n} f(x - u)\, g(u)\, du.

The extension of the Fourier transform to L^2(R^n) requires some caution. Let C_0(R^n) denote the space of continuous functions on R^n with compact support. Clearly, C_0(R^n) ⊂ L^1(R^n) ∩ L^2(R^n). In addition, C_0(R^n) is dense in L^2(R^n). Thus, for any f ∈ L^2(R^n), there exists {φ_k}_{k≥1} ⊂ C_0(R^n) such that ‖φ_k − f‖ → 0 when k → ∞. One can prove that for any such sequence (φ_k), the sequence (F(φ_k)) converges to the same element in L^2(R^n). We denote this element by F(f) and we say that it is the Fourier transform of f. The notation f̂ instead of F(f) is often used.

The following result summarizes the main properties of F : L^2(R^n) → L^2(R^n).

Theorem 2.3 (Plancherel's theorem) For f ∈ L^2(R^n):
(i) F(f)(\omega) = \lim_{k \to \infty} \int_{[-k, k]^n} e^{-i \omega \cdot x} f(x)\, dx, where the convergence is for the norm in L^2(R^n).
(ii) ‖F(f)‖ = (2π)^{n/2} ‖f‖.
(iii) f(x) = \lim_{k \to \infty} \frac{1}{(2\pi)^n} \int_{[-k, k]^n} e^{i \omega \cdot x} F(f)(\omega)\, d\omega, where the convergence is for the norm in L^2(R^n). If f ∈ L^1(R^n) ∩ L^2(R^n), then the convergence holds almost everywhere.
(iv) The map F : L^2(R^n) → L^2(R^n) is an isomorphism of Hilbert spaces.
III. Our third reminder is about compactness.

It is well known that a subset of R^n is compact if and only if it is closed and bounded. This is not true for subsets of C(X). Yet a characterization of compact subsets of C(X) in similar terms is still possible.

A subset S of C(X) is said to be equicontinuous at x ∈ X when for every ε > 0 there exists a neighborhood V of x such that for all y ∈ V and f ∈ S, |f(x) − f(y)| < ε. The set S is said to be equicontinuous when it is so at every x ∈ X.

Theorem 2.4 (Arzelà-Ascoli theorem) Let X be compact and S be a subset of C(X). Then S is a compact subset of C(X) if and only if S is closed, bounded, and equicontinuous.

The fact that every closed ball in R^n is compact is not true in Hilbert spaces. However, we will use the fact that closed balls in a Hilbert space H are weakly compact. That is, every sequence {f_n}_{n∈N} in a closed ball B in H has a weakly convergent subsequence {f_{n_k}}_{k∈N}; in other words, there is some f ∈ B such that

\lim_{k \to \infty} \langle f_{n_k}, g \rangle = \langle f, g \rangle, \qquad \forall g \in H.
IV. We close this section with a discussion of completely monotonic functions. This discussion is on a less general topic than the preceding contents of these reminders.

A function f : [0, ∞) → R is completely monotonic if it is continuous on [0, ∞), C^∞ on (0, ∞), and, for all r > 0 and k ≥ 0, (−1)^k f^{(k)}(r) ≥ 0.

We will use the following characterization of completely monotonic functions.

Proposition 2.5 A function f : [0, ∞) → R is completely monotonic if and only if, for all t ∈ (0, ∞),

f(t) = \int_0^\infty e^{-t\sigma}\, d\nu(\sigma),

where ν is a finite Borel measure on [0, ∞).
Definition 2.6 Let J : E → F be a linear map between the Banach spaces E and F. We say that J is bounded when there exists b ∈ R such that for all x ∈ E with ‖x‖ = 1, ‖J(x)‖ ≤ b. The operator norm of J is

‖J‖ = \sup_{\|x\| = 1} \|J(x)\|.

If J is not bounded, then we write ‖J‖ = ∞. We say that J is compact when the closure of J(B) is compact for any bounded set B ⊂ E.
Example 2.7 (Sobolev spaces) Let X be a domain in R^n with smooth boundary. For every s ∈ N we can define an inner product in C^∞(X) by

\langle f, g \rangle_s = \int_X \sum_{|\alpha| \le s} D^\alpha f\, D^\alpha g.

Here we are integrating with respect to the Lebesgue measure μ on X inherited from Euclidean space. We will denote by ‖ ‖_s the norm induced by ⟨ , ⟩_s. Notice that when s = 0, the inner product above coincides with that of L^2(X); that is, ‖ ‖_0 = ‖ ‖. We define the Sobolev space H^s(X) to be the completion of C^∞(X) with respect to the norm ‖ ‖_s. The Sobolev embedding theorem asserts that for all r ∈ N and all s > n/2 + r, the inclusion

J_s : H^s(X) → C^r(X)

is well defined and bounded. In particular, for all s > n/2, the inclusion

J_s : H^s(X) → C(X)

is well defined and bounded. From Rellich's theorem it follows that if X is compact, this last embedding is compact as well. Thus, if B_R denotes the closed ball of radius R in H^s(X), we may take H_{R,s} = H = J_s(B_R).
2.4 Reproducing Kernel Hilbert Spaces

Definition 2.8 Let X be a metric space. We say that K : X × X → R is symmetric when K(x, t) = K(t, x) for all x, t ∈ X, and that it is positive semidefinite when for all finite sets x = {x_1, ..., x_k} ⊂ X the k × k matrix K[x] whose (i, j) entry is K(x_i, x_j) is positive semidefinite. We say that K is a Mercer kernel if it is continuous, symmetric, and positive semidefinite. The matrix K[x] above is called the Gramian of K at x.
For the remainder of this section we fix a compact metric space X and a Mercer kernel K : X × X → R. Note that positive semidefiniteness implies that K(x, x) ≥ 0 for each x ∈ X. We define

C_K := \sup_{x \in X} \sqrt{K(x, x)}.

Then

C_K = \sup_{x, t \in X} \sqrt{|K(x, t)|},

since, by the positive semidefiniteness of the matrix K[{x, t}], for all x, t ∈ X, (K(x, t))^2 ≤ K(x, x) K(t, t).

For x ∈ X, we denote by K_x the function

K_x : X → R,  t ↦ K(x, t).
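As a quick numerical sanity check (not part of the text), one can form the Gramian K[x] at a few random points and confirm that its eigenvalues are nonnegative; the Gaussian kernel of Corollary 2.19 is used below as an example:

    import numpy as np

    def gram(K, xs):
        # Gramian K[x]: the k x k matrix with (i, j) entry K(x_i, x_j)
        return np.array([[K(xi, xj) for xj in xs] for xi in xs])

    c = 1.0
    K_gauss = lambda x, t: np.exp(-np.sum((x - t) ** 2) / c ** 2)

    rng = np.random.default_rng(4)
    xs = rng.normal(size=(20, 3))             # 20 random points in R^3
    eigvals = np.linalg.eigvalsh(gram(K_gauss, xs))
    print(eigvals.min())                      # >= 0 up to rounding: the Gramian is positive semidefinite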
The main result of this section is given in the following theorem.

Theorem 2.9 There exists a unique Hilbert space (H_K, ⟨ , ⟩_{H_K}) of functions on X satisfying the following conditions:
(i) for all x ∈ X, K_x ∈ H_K;
(ii) the span of the set {K_x : x ∈ X} is dense in H_K; and
(iii) for all f ∈ H_K and x ∈ X, f(x) = ⟨K_x, f⟩_{H_K}.
Moreover, H_K consists of continuous functions and the inclusion I_K : H_K → C(X) is bounded with ‖I_K‖ ≤ C_K.
Proof. Let H_0 be the span of the set {K_x : x ∈ X}. We define an inner product in H_0 as

\langle f, g \rangle = \sum_{i=1}^{s} \sum_{j=1}^{r} \alpha_i \beta_j K(x_i, t_j), \qquad \text{for } f = \sum_{i=1}^{s} \alpha_i K_{x_i}, \quad g = \sum_{j=1}^{r} \beta_j K_{t_j}.

The conditions for the inner product can be easily checked. For example, if ⟨f, f⟩ = 0, then for each t ∈ X the positive semidefiniteness of the Gramian of K at the set {x_i}_{i=1}^s ∪ {t} tells us that for each ε ∈ R

\sum_{i,j=1}^{s} \alpha_i K(x_i, x_j) \alpha_j + 2 \sum_{i=1}^{s} \alpha_i K(x_i, t)\, \varepsilon + \varepsilon^2 K(t, t) \ge 0.

However, \sum_{i,j=1}^{s} \alpha_i K(x_i, x_j) \alpha_j = ⟨f, f⟩ = 0. By letting ε be arbitrarily small, we see that f(t) = \sum_{i=1}^{s} \alpha_i K(x_i, t) = 0. This is true for each t ∈ X; hence f is the zero function.
Let H_K be the completion of H_0 with the associated norm. It is easy to check that H_K satisfies the three conditions in the statement. We need only prove that it is unique. So, assume H is another Hilbert space of functions on X satisfying the conditions noted. We want to show that

H = H_K \quad \text{and} \quad \langle\ ,\ \rangle_H = \langle\ ,\ \rangle_{H_K}.    (2.1)

We first observe that H_0 ⊂ H. Also, for any x, t ∈ X, ⟨K_x, K_t⟩_H = K(x, t) = ⟨K_x, K_t⟩_{H_K}. By linearity, for every f, g ∈ H_0, ⟨f, g⟩_H = ⟨f, g⟩_{H_K}. Since both H and H_K are completions of H_0, (2.1) follows from the uniqueness of the completion.

To see the remaining assertion, consider f ∈ H_K and x ∈ X. Then

|f(x)| = |\langle K_x, f \rangle_{H_K}| \le \|f\|_{H_K} \|K_x\|_{H_K} = \|f\|_{H_K} \sqrt{K(x, x)}.

This implies that ‖f‖_∞ ≤ C_K ‖f‖_{H_K} and, thus, ‖I_K‖ ≤ C_K. Therefore, convergence in ‖ ‖_{H_K} implies convergence in ‖ ‖_∞, and this shows that f is continuous since f is the limit of elements in H_0, which are continuous. □
In what follows, to reduce the amount of notation, we will write ⟨ , ⟩_K instead of ⟨ , ⟩_{H_K} and ‖ ‖_K instead of ‖ ‖_{H_K}.

Definition 2.10 The Hilbert space H_K in Theorem 2.9 is said to be a Reproducing Kernel Hilbert Space (RKHS). Property (iii) in Theorem 2.9 is referred to as the reproducing property.
2.5 Some Mercer kernels

In this section we discuss some families of Mercer kernels on subsets of R^n. In most cases, checking the symmetry and continuity of a given kernel K will be straightforward. Checking that K is positive semidefinite will be more involved.

The first family of Mercer kernels we look at is that of dot product kernels. Let X = {x ∈ R^n : ‖x‖ ≤ R} be the ball of R^n with radius R > 0. A dot product kernel is a function K : X × X → R given by

K(x, y) = \sum_{d \ge 0} a_d (x \cdot y)^d,

where a_d ≥ 0 and \sum_d a_d R^{2d} < ∞.
Proposition 2.11 Dot product kernels are Mercer kernels on X.

Proof. The kernel K is obviously symmetric and continuous on X × X. To check its positive semidefiniteness, recall that the multinomial coefficients C_α^d associated with the pairs (d, α) satisfy

(x \cdot y)^d = \sum_{|\alpha| = d} C_\alpha^d\, x^\alpha y^\alpha, \qquad \forall x, y \in \mathbb{R}^n.

Let {x_1, ..., x_k} ⊂ X. Then, for all c_1, ..., c_k ∈ R,

\sum_{i,j=1}^{k} c_i c_j K(x_i, x_j) = \sum_{d \ge 0} a_d \sum_{|\alpha| = d} C_\alpha^d \Big( \sum_{i=1}^{k} c_i\, x_i^\alpha \Big)^2 \ge 0.

Therefore, K is a Mercer kernel. □
The above extends to the multivariate case under a slightly stronger assumption on X.

Example 2.13 Let X ⊂ R^n contain 0 and the coordinate vectors e_j, j = 1, ..., n. Let K be the Mercer kernel on X given by K(x, y) = 1 + x · y. Then H_K is the space of linear functions and {1, x_1, x_2, ..., x_n} forms an orthonormal basis of H_K.
Proof. Note that K_v(x) = 1 + v · x = 1 + v_1 x_1 + ... + v_n x_n with v = (v_1, ..., v_n) ∈ X ⊂ R^n. We argue as in Example 2.12. For each 1 ≤ j ≤ n,

\|K_{e_j} - K_0\|_K^2 = K(e_j, e_j) - 2 K(e_j, 0) + K(0, 0) = 1.

But (K_{e_j} − K_0)(x) = (1 + x_j) − 1 = x_j, and therefore 1 = ‖K_{e_j} − K_0‖_K = ‖x_j‖_K. One can prove, similarly, that ⟨1, x_j⟩_K = 0 and ⟨1, 1⟩_K = 1. Now consider i ≠ j:

\langle K_{e_i}, K_{e_j} \rangle_K = K(e_i, e_j) = 1 = \langle 1 + x_i, 1 + x_j \rangle_K = \|1\|_K^2 + \langle 1, x_i \rangle_K + \langle 1, x_j \rangle_K + \langle x_i, x_j \rangle_K.

Since we have shown that ⟨1, x_i⟩_K = ⟨1, x_j⟩_K = 0 and ‖1‖_K = 1, it follows that ⟨x_i, x_j⟩_K = 0. We have thus proved that {1, x_1, x_2, ..., x_n} is an orthonormal system of H_K. But each function K_v = 1 + v · x with v ∈ X is contained in span{1, x_1, x_2, ..., x_n}, which is a closed subspace of H_K. Therefore, span{1, x_1, x_2, ..., x_n} = H_K and {1, x_1, x_2, ..., x_n} is an orthonormal basis of H_K. □
Proposition 2.14 Let k ∈ L^1(R^n) be continuous with k̂(ξ) ≥ 0 for all ξ ∈ R^n. Then K(x, y) := k(x − y) is a Mercer kernel on any subset of R^n.

Proof. The kernel K is clearly continuous and symmetric. For points x_1, ..., x_k ∈ R^n and coefficients c_1, ..., c_k ∈ R, write k(x_j − x_ℓ) in terms of k̂ (using Theorem 2.3(iii)) in

\sum_{j,\ell=1}^{k} c_j c_\ell K(x_j, x_\ell) = (2\pi)^{-n} \int_{\mathbb{R}^n} \hat{k}(\xi) \sum_{j,\ell=1}^{k} c_j c_\ell\, e^{i x_j \cdot \xi}\, \overline{e^{i x_\ell \cdot \xi}}\, d\xi

to get

(2\pi)^{-n} \int_{\mathbb{R}^n} \hat{k}(\xi) \Big| \sum_{j=1}^{k} c_j e^{i x_j \cdot \xi} \Big|^2 d\xi \ge 0,

where | | means the modulus in C and z̄ is the complex conjugate of z. Thus, K is a Mercer kernel on any subset of R^n. □
Example 2.15 (A spline kernel) Let k be the univariate function supported on [−2, 2] given by k(x) = 1 − |x|/2 for −2 ≤ x ≤ 2. Then the kernel K defined by K(x, y) = k(x − y) is a Mercer kernel on any subset X of R.

Proof. One can easily check that 2k(x) equals the convolution of the characteristic function χ_{[−1,1]} with itself. But χ̂_{[−1,1]}(ξ) = 2 sin ξ / ξ. Thus, k̂(ξ) = 2 (sin ξ / ξ)^2 ≥ 0, and the Mercer property follows from Proposition 2.14. □

Remark 2.16 Note that the kernel K defined in Example 2.15 is given by

K(x, y) = 1 - \frac{|x - y|}{2} \ \text{ if } |x - y| \le 2, \qquad K(x, y) = 0 \ \text{ otherwise,}

and therefore C_K = 1.
Multivariate splines can also be used to construct translation-invariant kernels. Take B = [b_1 b_2 ... b_q] to be an n × q matrix (called the direction set) such that q ≥ n and the n × n submatrix B_0 = [b_1 b_2 ... b_n] is invertible. Define the box spline M_B with direction set B recursively, starting from

M_{B_0}(x) = \frac{1}{|\det B_0|}\, \chi_{B_0 [-1/2, 1/2)^n}(x)

and adding one direction at a time,

M_{[b_1 \ldots b_{n+j}]}(x) = \int_{-1/2}^{1/2} M_{[b_1 \ldots b_{n+j-1}]}(x - t\, b_{n+j})\, dt,

for j = 1, ..., q − n. One can check by induction that its Fourier transform satisfies

\widehat{M_B}(\xi) = \prod_{j=1}^{q} \frac{\sin(\xi \cdot b_j / 2)}{\xi \cdot b_j / 2}.

Example 2.17 (A box spline kernel) Let B = [b_1 b_2 ... b_q] be an n × q matrix where [b_1 b_2 ... b_n] is invertible. Choose k(x) = (M_B * M_B)(x) to be the box spline with direction set [B, B]. Then, for all ξ ∈ R^n,

\hat{k}(\xi) = \prod_{j=1}^{q} \left( \frac{\sin(\xi \cdot b_j / 2)}{\xi \cdot b_j / 2} \right)^2 \ge 0,

and therefore K(x, y) = k(x − y) is a Mercer kernel on any subset of R^n.
Proposition 2.18 Let f : [0, ∞) → R be completely monotonic. Then K(x, y) := f(‖x − y‖^2) is a Mercer kernel on any subset X ⊂ R^n.

Proof. By Proposition 2.5, there is a finite Borel measure ν on [0, ∞) for which

f(t) = \int_0^\infty e^{-\sigma t}\, d\nu(\sigma)

for all t ∈ (0, ∞). It follows that

K(x, y) = f(\|x - y\|^2) = \int_0^\infty e^{-\sigma \|x - y\|^2}\, d\nu(\sigma).

Now note that for each σ ∈ (0, ∞), the Fourier transform of e^{-σ‖·‖^2} equals (\sqrt{\pi/\sigma})^n e^{-\|\xi\|^2/(4\sigma)}. Hence,

e^{-\sigma \|x\|^2} = (2\pi)^{-n} \int_{\mathbb{R}^n} \Big( \sqrt{\tfrac{\pi}{\sigma}} \Big)^n e^{-\|\xi\|^2/(4\sigma)}\, e^{i x \cdot \xi}\, d\xi.

Therefore, reasoning as in the proof of Proposition 2.14, we have, for all x_1, ..., x_k ∈ X and c_1, ..., c_k ∈ R,

\sum_{\ell, j = 1}^{k} c_\ell c_j K(x_\ell, x_j) = (2\pi)^{-n} \int_0^\infty \int_{\mathbb{R}^n} \Big( \sqrt{\tfrac{\pi}{\sigma}} \Big)^n e^{-\|\xi\|^2/(4\sigma)} \Big| \sum_{j=1}^{k} c_j e^{i x_j \cdot \xi} \Big|^2 d\xi\, d\nu(\sigma) \ge 0. □
Corollary 2.19 Let c > 0. The following functions are Mercer kernels on any subset X ⊂ R^n:

(i) (Gaussian) K(x, t) = e^{-\|x - t\|^2 / c^2}.
(ii) (Inverse multiquadrics) K(x, t) = (c^2 + \|x - t\|^2)^{-\alpha} with α > 0.

Proof. Clearly, both kernels are continuous and symmetric. In (i), K is positive semidefinite by Proposition 2.18 with f(r) = e^{-r/c^2}. The same is true for (ii), taking f(r) = (c^2 + r)^{-\alpha}. □

Remark 2.20 The kernels of (i) and (ii) in Corollary 2.19 satisfy C_K = 1 and C_K = c^{-\alpha}, respectively.

A key example of a finite-dimensional RKHS induced by a Mercer kernel follows. Unlike in the case of the Mercer kernels of Corollary 2.19, we will not use Proposition 2.18 to show positive semidefiniteness.
Example 2.1 (continued) Recall that H_d = H_d(R^{n+1}) is the linear space of homogeneous polynomials of degree d in x_0, x_1, ..., x_n. Its dimension (the number of coefficients of a polynomial f ∈ H_d) is

N = \binom{n + d}{n}.

The number N is exponential in n and d. We notice, however, that in some situations one may consider a linear space of polynomials with a given monomial structure; that is, only a prespecified set of monomials may appear.
We can make H_d an inner product space by taking

\langle f, g \rangle_W = \sum_{|\alpha| = d} w_\alpha v_\alpha \big( C_\alpha^d \big)^{-1}

for f, g ∈ H_d, f = Σ w_α x^α, g = Σ v_α x^α. This inner product, which we call the Weyl inner product, is natural and has an important invariance property. Let O(n+1) be the orthogonal group in R^{n+1}, that is, the group of (n+1) × (n+1) real matrices whose action on R^{n+1} preserves the inner product on R^{n+1}:

σ(x) · σ(y) = x · y, for all x, y ∈ R^{n+1} and all σ ∈ O(n+1).

The action of O(n+1) on R^{n+1} induces an action of O(n+1) on H_d. For f ∈ H_d and σ ∈ O(n+1) we define σ(f) ∈ H_d by σ(f)(x) = f(σ^{-1}(x)). The invariance property of ⟨ , ⟩_W, called orthogonal invariance, is that for all f, g ∈ H_d,

\langle \sigma(f), \sigma(g) \rangle_W = \langle f, g \rangle_W.

Note that if ‖f‖_W denotes the norm induced by ⟨ , ⟩_W, then

|f(x)| \le \|f\|_W\, \|x\|^d,

where ‖x‖ is the standard norm of x ∈ R^{n+1}. This follows from taking the action of σ ∈ O(n+1) such that σ(x) = (‖x‖, 0, ..., 0).
2.6 Hypothesis spaces associated with an RKHS

2.7 Reminders II
The general constrained optimization problem has the form

\min\ f(x)
\text{s.t. } g_i(x) \le 0, \quad i = 1, \ldots, m,    (2.2)
\quad\ \ h_j(x) = 0, \quad j = 1, \ldots, p,

where f, g_i, h_j : R^n → R. The function f is called the objective function, and the equalities and inequalities on the g_i and h_j are called the constraints. Points x ∈ R^n satisfying the constraints are feasible, and the subset of R^n of all feasible points is the feasible set.
Although stating this problem in all its generality leads to some conceptual clarity, it would seem that the search for an efficient algorithm to solve it is hopeless. A vast amount of research has thus focused on particular cases, and the emphasis has been on those cases for which efficient algorithms exist. We do not develop here the complexity theory giving formal substance to the notion of efficiency (we do not need such a development); instead we content ourselves with understanding the notion of efficiency according to its intuitive meaning: an efficient algorithm is one that computes its outcome in a reasonably short time for reasonably long inputs. This property can be found in practice and studied in theory (via several well-developed measures of complexity available to complexity theorists).

One example of a well-studied case is linear programming. This is the case in which both the objective function and the constraints are linear. It is also a case in which efficient algorithms exist (and have been both used in practice and studied in theory). A much more general case for which efficient algorithms exist is that of convex programming.
A subset S of a linear space H is said to be convex when, for all x, y ∈ S and all λ ∈ [0,1], λx + (1 − λ)y ∈ S.

A function f on a convex domain S is said to be convex if, for all λ ∈ [0,1] and all x, y ∈ S, f(λx + (1 − λ)y) ≤ λ f(x) + (1 − λ) f(y). If S is an interval of R, then, for x_0, x ∈ S, x_0 < x, we have

f(x_0 + \lambda (x - x_0)) = f(\lambda x + (1 - \lambda) x_0) \le \lambda f(x) + (1 - \lambda) f(x_0)

and hence

\frac{f(x_0 + \lambda (x - x_0)) - f(x_0)}{\lambda (x - x_0)} \le \frac{f(x) - f(x_0)}{x - x_0}.

This means that the function t ↦ (f(t) − f(x_0)) / (t − x_0) is increasing on the interval (x_0, x]. Hence, the right derivative

f'_+(x_0) := \lim_{t \to x_0^+} \frac{f(t) - f(x_0)}{t - x_0}

exists. In the same way we see that the left derivative

f'_-(x_0) := \lim_{t \to x_0^-} \frac{f(t) - f(x_0)}{t - x_0}

exists. These two derivatives, in addition, satisfy f'_-(x_0) ≤ f'_+(x_0) whenever x_0 is a point in the interior of S. Hence, both f'_-(x_0) and f'_+(x_0) are nondecreasing in S.
In addition to those listed above, convex functions satisfy other properties. We highlight the fact that the sum of convex functions is convex and that if a function f is convex and C^2, then its Hessian D^2 f(x) is positive semidefinite for all x in its domain.
The convex programming problem is the problem of finding x ∈ R^n to solve (2.2) with f and the g_i convex functions and the h_j linear. As we have remarked, efficient algorithms for the convex programming problem exist. In particular, when f and the g_i are quadratic functions, the corresponding programming problem, called the convex quadratic programming problem, can be solved by even more efficient algorithms. In fact, convex quadratic programs are a particular case of second-order cone programs, and second-order cone programming today provides an example of the success of interior point methods: very large amounts of input data can be dealt with efficiently, and commercial code is available. (For references see Section 2.9.)
2.8 On the computation of empirical target functions

A remarkable property of the hypothesis space H = I_K(B_R), where B_R is the ball of radius R in an RKHS H_K, is the fact that the optimization problem of computing the empirical target function f_z reduces to a convex programming problem.

Let K be a Mercer kernel and H_K its associated RKHS. Let z ∈ Z^m. Denote by H_{K,z} the finite-dimensional subspace of H_K spanned by {K_{x_1}, ..., K_{x_m}}, and let P be the orthogonal projection P : H_K → H_{K,z}.
Proposition 2.25 Let B ⊂ H_K. If f ∈ H_K is a minimizer of E_z in B, then P(f) is a minimizer of E_z in P(B), the image of B under P.

Proof. For all f ∈ B and all i = 1, ..., m, ⟨f, K_{x_i}⟩_K = ⟨P(f), K_{x_i}⟩_K. Since both f and P(f) are in H_K, the reproducing property implies that

f(x_i) = \langle f, K_{x_i} \rangle_K = \langle P(f), K_{x_i} \rangle_K = (P(f))(x_i).

It follows that E_z(f) = E_z(P(f)). Taking f to be a minimizer of E_z in B proves the statement. □
Corollary 2.26 Let B ⊂ H_K be such that P(B) ⊂ B. If E_z can be minimized in B, then such a minimizer can be chosen in P(B).
Corollary 2.26 shows that in many situations (for example, when B is convex) the empirical target function f_z may be chosen in H_{K,z}. Recall from Theorem 2.9 that the norm ‖ ‖_K restricted to H_{K,z} is given by

\Big\| \sum_{i=1}^{m} c_i K_{x_i} \Big\|_K^2 = \sum_{i,j=1}^{m} c_i c_j K(x_i, x_j).

Therefore, when B = B_R, we may take f_z = \sum_{i=1}^{m} c_i^* K_{x_i}, where c^* ∈ R^m is a solution of the following problem:

\min_{c \in \mathbb{R}^m}\ \frac{1}{m} \sum_{j=1}^{m} \Big( \sum_{i=1}^{m} c_i K(x_i, x_j) - y_j \Big)^2
\text{s.t. } \sum_{i,j=1}^{m} c_i c_j K(x_i, x_j) \le R^2.
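A numerical sketch of this constrained problem follows (hypothetical data; a Gaussian kernel is assumed, and the Gramian K[x] is assumed invertible). The KKT conditions show that, when the norm constraint is active, the minimizer has the form c(λ) = (K[x] + λmI)^{-1} y for some λ ≥ 0, and since ‖f‖_K^2 = c(λ)^T K[x] c(λ) decreases in λ, a suitable λ can be found by bisection:

    import numpy as np

    def empirical_target(x, y, R, kernel, lam_hi=1e6, tol=1e-9):
        """Minimize (1/m) * sum_j (sum_i c_i K(x_i, x_j) - y_j)^2 over ||f||_K <= R.

        Assumes the Gramian K[x] is invertible; the KKT conditions then give
        c(lam) = (K[x] + lam * m * I)^{-1} y for some lam >= 0 (lam = 0 if the
        unconstrained solution already satisfies the norm constraint)."""
        m = len(x)
        K = np.array([[kernel(xi, xj) for xj in x] for xi in x])

        def solve(lam):
            c = np.linalg.solve(K + lam * m * np.eye(m), y)
            return c, c @ K @ c                      # coefficients and ||f||_K^2

        c, norm2 = solve(0.0)
        if norm2 <= R ** 2:
            return c
        lo, hi = 0.0, lam_hi
        while hi - lo > tol:                         # bisection: the norm decreases as lam grows
            mid = 0.5 * (lo + hi)
            c, norm2 = solve(mid)
            if norm2 > R ** 2:
                lo = mid
            else:
                hi = mid
        return c

    # hypothetical data
    rng = np.random.default_rng(5)
    x = rng.uniform(-1, 1, 40)
    y = np.sign(x) + rng.normal(0, 0.1, 40)
    kernel = lambda s, t: np.exp(-(s - t) ** 2)      # Gaussian kernel with c = 1
    c_star = empirical_target(x, y, R=2.0, kernel=kernel)
    f_z = lambda t: sum(ci * kernel(xi, t) for ci, xi in zip(c_star, x))
    print(f_z(0.5))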
2.9 References and additional remarks

The special dot product kernel K(x, y) = (c + x · y)^d for some c > 0 and d ∈ N was introduced into the field of statistical learning theory by Vapnik (see, e.g., [134]). General dot product kernels are described in [118]; see also [101] and [79]. Spline kernels are discussed extensively in [137].

Chapter 14 of [19] is a reference for the unitary and orthogonal invariance of ⟨ , ⟩_W. A reference for the nondegeneracy of the Veronese variety mentioned in Proposition 2.21 is Section 4.4 of [109].

A comprehensive introduction to convex optimization is the book [25]. For second-order cone programming see the articles [3, 119].

For more families of Mercer kernels in learning theory see [107]. More examples of box splines can be found in [41]. Reducing the computation of f_z from H_K to H_{K,z} is ensured by representer theorems [137]. For a general form of these theorems see [117].
3 Estimating the sample error

The main result in this chapter provides bounds for the sample error of a compact and convex hypothesis space. We have already noted that with m fixed, the sample error increases with the size of H. The bounds we deduce in this chapter show this behavior with respect to a particular measure of the size of H: its capacity as measured by covering numbers.

Definition 3.1 Let S be a metric space and η > 0. We define the covering number N(S, η) to be the minimal ℓ ∈ N such that there exist ℓ disks in S with radius η covering S. When S is compact this number is finite.
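Covering numbers are easy to estimate empirically for finite sets. The greedy sketch below (not from the text) repeatedly picks an uncovered point as a new disk center and produces an upper bound on N(S, η) for a sample S of the unit square:

    import numpy as np

    def covering_number_upper_bound(points, eta):
        """Greedy upper bound on N(S, eta) for a finite set S under the Euclidean metric."""
        uncovered = list(range(len(points)))
        centers = 0
        while uncovered:
            c = points[uncovered[0]]                  # pick any uncovered point as a new disk center
            centers += 1
            uncovered = [i for i in uncovered
                         if np.linalg.norm(points[i] - c) > eta]
        return centers

    rng = np.random.default_rng(6)
    S = rng.uniform(0.0, 1.0, size=(500, 2))          # a sample of the unit square
    for eta in [0.5, 0.25, 0.1]:
        print(eta, covering_number_upper_bound(S, eta))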
Definition 3.2 Let M > 0 and ρ be a probability measure on Z. We say that a set H of functions from X to R is M-bounded when

\sup_{f \in H} |f(x) - y| \le M

holds almost everywhere on Z.

Theorem 3.3 Let H be a compact and convex M-bounded subset of C(X). Then, for all ε > 0,

\operatorname{Prob}_{z \in Z^m} \big\{ E_H(f_z) \le \varepsilon \big\} \ge 1 - N\!\Big( H, \frac{\varepsilon}{12M} \Big)\, \exp\Big\{ -\frac{m \varepsilon}{300 M^2} \Big\}.
3.1 Exponential inequalities in probability
One particular use of Chebyshev's inequality is for sums of independent random variables. If ξ is a random variable on a probability space Z with mean E(ξ) = μ and variance σ^2(ξ) = σ^2, then, for all ε > 0,

\operatorname{Prob}_{z \in Z^m} \Big\{ \Big| \frac{1}{m} \sum_{i=1}^{m} \xi(z_i) - \mu \Big| \ge \varepsilon \Big\} \le \frac{\sigma^2}{m \varepsilon^2}.

This inequality provides a simple form of the weak law of large numbers since it shows that when m → ∞, (1/m) Σ_{i=1}^m ξ(z_i) → μ with probability 1.

For any 0 < δ < 1, taking ε = \sqrt{\sigma^2 / (m\delta)} in the inequality above, it follows that with confidence 1 − δ,

\Big| \frac{1}{m} \sum_{i=1}^{m} \xi(z_i) - \mu \Big| \le \sqrt{\frac{\sigma^2}{m \delta}}.    (3.1)
The goal of this section is to extend inequality (3.1) to show a faster rate of decay. Typical bounds with confidence 1 − δ will be of the form c (log(2/δ)/m)^{1/2 + θ} with 0 ≤ θ ≤ 1/2 depending on the variance of ξ. The improvement is seen both in the dependence on δ, from (2/δ)^{1/2} to log(2/δ), and in the dependence on m, from m^{-1/2} to m^{-(1/2 + θ)}. Note that {ξ_i = ξ(z_i)}_{i=1}^m are independent random variables with the same mean and variance.
Proposition 3.4 (Bennett) Let {ξ_i}_{i=1}^m be independent random variables on a probability space Z with means {μ_i} and variances {σ_i^2}. Set Σ^2 := \sum_{i=1}^m σ_i^2. If for each i, |ξ_i − μ_i| ≤ M holds almost everywhere, then for every ε > 0,

\operatorname{Prob} \Big\{ \sum_{i=1}^{m} [\xi_i - \mu_i] \ge \varepsilon \Big\} \le \exp\Big\{ -\frac{\Sigma^2}{M^2} \Big[ \Big( 1 + \frac{M\varepsilon}{\Sigma^2} \Big) \log\Big( 1 + \frac{M\varepsilon}{\Sigma^2} \Big) - \frac{M\varepsilon}{\Sigma^2} \Big] \Big\}.

Proof. Without loss of generality, we assume μ_i = 0. Then the variance of ξ_i is σ_i^2 = E(ξ_i^2). Let c be an arbitrary positive constant that will be determined later. Then

I := \operatorname{Prob}\Big\{ \sum_{i=1}^{m} \xi_i \ge \varepsilon \Big\} = \operatorname{Prob}\Big\{ \exp\Big\{ c \sum_{i=1}^{m} \xi_i \Big\} \ge e^{c\varepsilon} \Big\}.

By Markov's inequality and the independence of {ξ_i}, we have

I \le e^{-c\varepsilon} \exp\Big\{ c \sum_{i=1}^{m} \xi_i \Big\}\text{'s expectation} = e^{-c\varepsilon} \prod_{i=1}^{m} E\big( e^{c \xi_i} \big).    (3.2)

Since |ξ_i| ≤ M almost everywhere and E(ξ_i) = 0, the Taylor expansion for e^x yields

E\big( e^{c \xi_i} \big) = 1 + \sum_{\ell=2}^{\infty} \frac{c^\ell E(\xi_i^\ell)}{\ell!} \le 1 + \sum_{\ell=2}^{\infty} \frac{c^\ell M^{\ell-2} \sigma_i^2}{\ell!} = 1 + \sigma_i^2\, \frac{e^{cM} - 1 - cM}{M^2}.

Using 1 + t ≤ e^t, it follows that

I \le \exp\Big\{ -c\varepsilon + \Sigma^2\, \frac{e^{cM} - 1 - cM}{M^2} \Big\}.

Now choose the constant c to be the minimizer of the bound on the right-hand side above:

c = \frac{1}{M} \log\Big( 1 + \frac{M\varepsilon}{\Sigma^2} \Big).

That is, e^{cM} − 1 = Mε/Σ^2. With this choice,

I \le \exp\Big\{ -\frac{\Sigma^2}{M^2} \Big[ \Big( 1 + \frac{M\varepsilon}{\Sigma^2} \Big) \log\Big( 1 + \frac{M\varepsilon}{\Sigma^2} \Big) - \frac{M\varepsilon}{\Sigma^2} \Big] \Big\}.

This proves the desired inequality. □
Let g : [0, +ro) ^ R be given by
= (1 + X) log(l +
Prob
Eft. J exp{MMg()
(3.3)
g(X) :
X) X
(E2 + 3M
(Bernstein)
(Hoeffding)
Prob
{ is. JJ ] >] pProof. The first inequality follows from (3.3) and the inequality
(3.4)
g(x) > 2 log(i + X), VX > o.
To verify (3.4), define a C2 function f on [0, x>) by
f (X) := 2log(1 + X)  2X + X log(1 + X).
We can see that f (0) = 0, f '(0) = 0, and f "(X) = X(1 + X)2 > 0 for X > 0. Hence f (X) > 0
and
log(1 + X)  X >   X log( 1 + X ) ,
V X > 0.
It follows that
A
g(X) = X log(1 + A) + log(1 + A) X > ^ log(1 + A), VX > 0.
This verifies (3.4) and then the generalized Bennetts inequality.
g(X) >
3X2
6 + 2X '
Since g(X) > 0, we find that the function h defined on [0, ) by h(X) = (6 + 2X)g(X)
3X2 satisfies similar conditions: h(0) = h'(0) = 0, and h"(X) = (4/(1 + X))g(X) > 0. Hence
h(X) > 0 for X > 0 and
Applying this to (3.3), we get the proof of Bernsteins inequality.
To prove Hoeffdings inequality, we follow the proof of Proposition 3.4 and use (3.2). As the
exponential function is convex and M < i < M almost surely,
c
& ( cM) ecM I cM
ec&
ecM
e^<
e+
e
2cMci
2cM
cM
cM 1 !
1
1
E (e ) < e
+ e
<2+22
t
t
2}
1 ~ (cM ) 1 (cM ) ^ (cM )
=2
+ 2 W =
1=0
2
2
)j j
t!
t=0 2 )j
((cM )2/2)
(cM)2/2
=
2 I
1
<
ex
j=0
t=1
j=0
j=0
(cM )2 2
holds almost everywhere. It follows from E(fj) = 0 and the Taylor expansion for ex that
This, together with (3.2), implies that I < exp {ce + m(cM)2/2}. Choose c = e/(mM2 ). Then I
< exp e2/(2mM2 )}.
Bounds for the distance between empirical mean and expected value follow from
Proposition 3.5.Corollary 3.6 Let % be a random variable on a probability space Z
with mean E(%) = u and variance a2(%) = a2, and satisfying %(z)  E(%) < M for
almost all z e Z. Then for all e > 0,
(Generalized Bennett)
Prob
zsZm
1
m
< exp
i=1
ms
M8
log 1
2M
+
^
m
1
(Bernstein)
Prob
%(zi) u > e
zeZm m
i=1
[1(Hoeffding)
Prob
%(zi) u > e
zeZm m
i=1
Proof. Apply Proposition 3.5 to the random variables {%i = %(zi)/m} that satisfy % E(%i)
< M/m, a2(%i) = a2 /m2, and ^ a2 = a2 /m.
Remark 3.7 Each estimate given in Corollary 3.6 is said to be a oneside probability inequality.
The same bound holds true when >e is replaced by < e. By taking the union of these
two events we obtain a twoside probability inequality stating that ProbzeZm j
%(zi) u\
> ej is
bounded by twice the bound occurring in the corresponding oneside inequality.
Recall the definition of the defect function Lz(f) = E(f) Ez(f). Our first main result,
Theorem 3.8, states a bound for Prob{Lz(f) > e} for a single function f : X ^ Y. This bound
follows from Hoeffdings bound in Corollary 3.6 by taking % = f = (f (x) y)2 satisfying
% < M2 when f is Mbounded.
Theorem 3.8 Let M > 0 andf: X ^ Y be Mbounded. Then, for all s > 0,
2
1 me ]
Pg!>{Lz(f) iel>1 exp 2 M4
< exp
< exp
m8
2 (a 2 + 3 M s)
m8
'7M2
Remark 3.9
(i) Note that the confidence (i.e., the righthand side in the inequality above) is positive and
approaches 1 exponentially quickly with m.
ii;
Mp = inf {M > 0  {(x,y) e Z : y  fp(x)\ > M} has measure zero} . Then takeM = P + Mp,
where P >
PX
iii;
xeXPX
It foilows from Theorem 3.8 that for any 0 < 8 < 1, Ez(f)  E(f) < M^/2 log(1 /8 )/m with
confidence 1  8 .
3.2
The second main result in this chapter extends Theorem 3.8 to families of functions.
Theorem 3.10 Let H be a compact Mbounded subset of C(X). Then, for all
Prob
zsZm
sup Lz(f) <
f sH
H,
8M
> 0,
Notice the resemblance to Theorem 3.8. The only essential difference is in the covering
number, which takes into account the extension from a single f to the family H. This has the
effect of requiring the sample size m to increase accordingly to achieve the confidence level
of Theorem 3.8.
Lemma 3.11 Let H = Sx U ... U Si and > 0. Then
Prob
zsZm
sup Lz(f) > < Prob supLz(f) >
f sH
=IsZm f sSj
Proof. The proof follows from the equivalence
sup Lz(f) >
3j < l s.t. sup Lz(f) >
f eH
f eSj
and the fact that the probability of a union of events is bounded by the sum of the
probabilities of those events.
Proof of Theorem 3.10 Let l = N (H, 4^) and considerfx,..., fl such that the disks Dj centered at
fj and with radius 4M cover H. Let U be a full measure set on which supf e H f (x)  y\ < M .By
Proposition 1.11, for all z e Um and all f e Dj,
\Lz(f )  Lz(fj)\ < 4M lf  fj U < 4M 4 M = . Since this holds for all z e Um and all f e Dj,
we get
sup Lz ( f ) > 2 ^ Lz ( fj ) > .
f eDj
We conclude that for j = 1,..., l,
Remark 3.12 Hoeffdings inequality can be seen as a quantitative instance of the law of
large numbers. An abstract uniform version of this law can be extracted from the proof of
Theorem 3.10.
Proposition 3.13 Let F be a family of functions from a probability space Z to R and d
a metric on F. Let U c Zbe of full measure and B, L > 0 such that
i; (z)  < B for all f e F andallz e U, and
ii; Lz(fi) Lz(f2) < Ld(fi, f2) for all fi,f2 e F andall z e Um, where
1
Hz)
Hzt )
Zm=
Lz(f) =
Then, for all e > 0,
m
1
N F,  2exp 2
L
Prob sup  z(f)<e
zeZm
feF
3.3
How good an approximation offn can we expectfz to be? In other words, how small can we
expect the sample error EH(fz) to be? The third main result in this chapter, Theorem 3.14,
gives the first answer.
Prob [n(fz) < e} > 1
16M,
e
+1
exp
me2
32M4 '
Theorem 3.14 Let H be a compact Mbounded subset of C(X). Then, for all e > 0,
Proof. Recall that H(fz) < E(fz)  z(fz) + z(fn)  (fn).
By Theorem 3.8 applied to the single function fn and , we know that
z (fn)  (fn) < 2 with probability at least 1  exp J  8M7 } On the other hand, Theorem 3.10
with e replaced by tells us that with
probability at least 1  M (H, 16M) exp  32M4 ,
e
sup Lz(f) = sup [(f)  z(f)} < fen
1
fen 2
( ,T6
) p  5JM4 j.
1  exp 2
4
me '8 M
holds, which implies in particular that (fz)  z(fz) < . Combining these two bounds, we
know that n(fz) < e with probability at least
me
32M 4
This is the desired bound.
Remark 3.15 Theorem 3.14 helps us deal with the question posed in Section 1.3. Given e, 8 >
0, to ensure that
Prob [n(fz) < e} > 1  8 ,
32M 4
N H
m>
e
ta 1
+ N(H, T6M))+ta(
(3.5)
it is sufficient that the number m of examples satisfies
To prove this, take 8 = {N (H, i6 M) + 1} exp J  3 ^4 } and solve for m. Note, furthermore, that
(3.5) gives a relation between the three basic variables e, 8, and m.
3.4
The dependency on e in Theorem 3.14 is quadratic. Our next goal is to show that when the
hypothesis space H is convex, this dependency is linear. This is Theorem 3.3. Its Corollary
3.17 estimates directly \\fz  fu\\P as well.
Toward the proof of Theorem 3.3, we show an additional property of convex hypothesis
spaces. From the discussion in Section 1.3 it follows that for a convex H, there exists a
function fa in H whose distance in L2 to fp is minimal. We next prove that if H is convex and
pX is nondegenerate, then fa is unique.
Lemma 3.16 Let H be a convex subset of C (X) such that fa exists. Then fa is unique
as an element in L2 and, for allf e H,
(fa f )2 < n(f).
X
In particular, if pX is not degenerate, then fa is unique in H.
Proof. Let s = fnf be the segment of line with extremities fa and f: Since H is convex, s c H.
And, since fa minimizes the distance in L2 to
fp over H, we have that for all g e s, fn  fp llp < g  fp p. This means that for each t e[0 ,1 ],
IlH  fp llp < Ilf + (1 t )fH  fp llp
= WH  fp llp + 2 t{ f  fa, fa  fp )p + 2 Ilf  fn llp} .
By taking t to be small enough, we see that f  fa, fa  fp) > 0. That is, the angle fpfaf is
obtuse, which implies (note that the squares are crucial)
lfn  f llp < Ilf  fp llp WH  fp llp;that is,
(fH  f )2 < E(f)  E(fH) = EH(f ).
X
This proves the desired inequality. The uniqueness offH follows by considering the line
segment joining two minimizers f^ and fJ^. Reasoning as above, one can show that both
angles fpfand fpfUfn are obtuse. This is possible only
if
fH= fH.
Corollary 3.17 With the hypotheses of Theorem 3.3, for all e > 0,
a*(fffH)) <e
>1
^(",i2 ) =xp
I ^I
BOO
a2ms 2c + 3 B
Lemma 3.18 Suppose a random variable % onZ satisfies E(%) = M > 0, and \%  M\< B
almost everywhere. If E(%2 ) < cE(%), then, for every e > 0 and 0 < a < 1,
holds.
Prob
zsZm
/
mu tz)
AJ + B
< exp
a2m( + B)B
2 (a2(t) + 3Ba/ + B/B)
Proof. Since % satisfies \%  M\ < B, the oneside Bernstein inequality in Corollary 3.6 implies
that
2 / /
a (t) + Ba + B B < c + B( + B) <
3
3
c+
B
3
( + B).
Here a2(t) < E(t2) < cE(t) = c. Then we find that
This yields the desired inequality.
We next give a ratio probability inequality involving a set of functions.
Prob
zsZm
E(g)  Ez(g) sup geQ E(g) + s
> 4a^fe
a ms 2c + 3 B
Lemma 3.19 Let G be a set of functions on Z and c > 0 such that for each g e G, E(g)
> 0, E(g2) < cE(g), and g  E(g) < B almost everywhere. Then, for every e > 0 and 0 <
a < 1, we have
Proof. Let {gj}J= c G with J = N (G, ae) be such that G is covered by balls in C(Z) centered
on gj with radius ae.
E(gj)  Ez (gj ^
,
Prob
> as < exp zszm
E(gj) + s
a2 ms
2c + 3 B
Applying Lemma 3.18 to f = gj for each j, we have
Ez (g)  Ez (gj )\ r A
E(g)  E(gj )
VE(g) + e
The latter implies that
< a*Js and
VE(g) + s
< a*/s.
For each g e G, there is some j such that g  gj\\C (Z) < ae. Then Ez (g)  Ez(gj) and E(g) E(gj) are both bounded by ae. Hence
E(g )
j + e = E(gj) E(g) + E(g) + e < aVeVE(g) + e + (E(g) + e)
E(g)
<
+e +(E(g) +e) < 2(E(g) +e).
It follows that ^E(gj) + e < 2^jE(g) + e. We have thus seen that (E(g)  Ez (g))/ E(g) + e >
4a Je implies (E(gj)  Ez (gj))/y/E(g) + e > 2a Je and hence (E(gj)  Ez (gj))//E(gjT+ e >
a^/e. Therefore,
E(g)  Ez (g) Prob sup  z^zm geg VE(g) + s
> 4aV^l < V Probi E(gj}  Ez(gj} > aVsl, U z^zm E(gj) + s
which is bounded by J exp Ja2me/(2c + 2B) J. We are in a position to prove Theorem 3.3.
Proof of Theorem 3.3 Consider the function set
G = {(f (x)  y)2  (fn(x)  y)2 : f e H} .
Each function g in G satisfies E(g) = H4) > 0. Since H is Mbounded, we have M2 < g(z) <
M2 almost everywhere. It follows that g E(g)  < B := 2M2 almost everywhere. Observe that
g(z) = (f (x)  fn(x)) [(f (x)  y) + (fn(x)  y)], z = (x,y) e Z.
a2 me
,
8 M 2 + 2M 2
1  N (G, ae) exp
It follows that g(z) < 2Mf (x)  fH(x)\ and E(g2) < 4M2 fX(f  fH)2. Taken together with Lemma
3.16, this implies E(g2) < 4M 2u(f ) = cE(g) with c = 4M2. Thus, all the conditions in Lemma
3.19 hold true and we can draw the following conclusion from the identity Ez(g) = %z(f )
for every e > 0 and 0 < a < 1 , with probability at least
< 4a^fe
H(J)  n,z(f ) sup
f eH
y/n(f ) + e
holds, and, therefore, for all f e H, %(f) < H,Z(f) + 4a*Js^/%(f) + e. Take a = V2/8 and f
= fz. Since H,Z (fz) < 0 by definition offz, we have
(f )
H z < Ve/2VH(fz) + e.
Solving the quadratic equation about y/sHf), we have H( fz) < e.
Finally, by the inequality gi g2\\c(Z) < \\(fi(x) f2(x)) [(fi(x) y)+ (f2(x) y)l
\\C(Z) < 2M \\fi f2\\c(X), it follows that
N (G, ae) < N(H,2M) .
The desired inequality now follows by taking a = 42/8.
Remark 3.20 Note that to obtain Theorem 3.3, convexity was only used in the proof of
Lemma 3.16. But the inequality proved in this lemma may hold true in other situations as
well. A case that stands out is when fp e H. In this case fH = fP and the inequality in Lemma
3.16 is trivial.
3.5
The exposition in this chapter largely follows [39]. The derivation of Theorem 3.3 deviates
from that paper  hence the slightly different constants.
The probability inequalities given in Section 3.1 are standard in the literature on the law of
As in our development, given a hypothesis class H, these errors allow one to define a target
function f' and empirical target function fZ and to derive a decomposition bounding the
excess generalization error
E '(fZ)  Ef(fH) < {E f(fzf)  Et(fzf)} + {Et(f%)  E *(f%)).
(3.6)
The second term on the righthand side of (3.6) converges to zero, with high probability
when m ^<x>, and its convergence rate can be estimated by standard probability
inequalities.
The first term on the righthand side of (3.6) is more involved. If one writes fz(z) = f(yfZ
(x)),then E'(fzf) E'Z(fZ) = jz &(z)dp  m 5=1 fz(zi). But fz is not a single random variable; it
depends on the sample z. Therefore, the usual law of large numbers does not guarantee the
convergence of this first term. One major goal of classical statistical learning theory [134] is
to estimate this error term (i.e., E' (fZ)  E' (fZ')). The collection of ideas and techniques used
to get such estimates, known as the theory of uniform convergence, plays the role of a
uniform law of large numbers. To see why, consider the quantity
sup \EZ(f)  E'(f ),
(3.7)
f eH
which bounds the first term on the righthand side of (3.6), hence providing (together with
bounds for the second term) an estimate for the sample error E ' (fZ' )E ' (f'). The theory of
uniform convergence studies the convergence of this quantity. It characterizes those function
sets H such that the quantity (3.7) tends to zero in probability as m ^<x>.
Definition 3.23 We say that a set H of realvalued functions on a metric space X is uniform
GlivenkoCanteUi (UGC) if for every t > 0,
i1mf
lim supProb sup sup
f X) f (x) dp
m
t^+<x p
=
X
m>if en
where the supremum is taken with respect to all Borel probability distributions p on X, and
Prob denotes the probability with respect to the samples xi, x2 ,... independently drawn
according to such a distribution p.
The UGC property can be characterized by the VY dimensions of H, as has been done in
[5].
Definition 3.24 Let H be a set of functions from X to [0,1] and Y > 0. We say that A c X is VY
shattered by H if there is a number a e R with the following property: for every subset E of A
there exists some function fE e H such that fE (x) < a y for every x e A \ E, and fE (x) > a
+ y for every x e E. The VY dimension of H, VY (H), is the maximal cardinality of a set A c X
that is VY shattered by H.
The concept of VY dimension is related to many other quantities involving capacity of
function sets studied in approximation theory or functional analysis: covering numbers,
entropy numbers, VC dimensions, packing numbers, metric entropy, and others.
The following characterization of the UGC property is given in [5].
Theorem 3.25 Let H be a set of functions from X to [0,1], Then H is UGC if and only if
the VY dimension of H is finite for every Y > 0,
Theorem 3.25 may be used to verify the convergence of ERM schemes when the
hypothesis space H is a noncompact UGC set such as the union of unit balls of reproducing
kernel Hilbert spaces associated with a set of Mercer kernels. In particular, for the Gaussian
kernels with flexible variances, the UGC property holds [150].
Many fundamental problems about the UGC property remain to be solved. As an example,
consider the empirical covering numbers.
Definition 3.26 For x = (xi)f=1 e Xm and H c C(X), the empirical covering numberNx,
(H,x, n) is the covering number of Hx := {(f (xi))m= 1 : f e H} as a subset of Rm with the
following metric. For f, g e C(X) we take dx(f, g) = maxj<m [f (xt) g(xi). The metric
entropy of H is defined as
Hm(H, n) = sup logN<x>(H,x, n), m e N, n> 0.
xeXm
It is known [46] that a set H of functions from X to [0,1] is UGC if and only if, for every n
> 0, limm^OT Hm(H, n)/m = 0. In this case, one has Hm(H, n) = O(log2 m) for every n > 0. It is
conjectured in [5] that Hm(H, n) = O(log m) is true for every n > 0. A weak form is, Is it true
that for some a e [1,2), every UGC set H satisfies
We continue to assume that X is a compact metric space (which may be a compact subset of
Rn). Let K be a Mercer kernel on X and HK be its induced RKHS. We observed in Section 2.6
that the space HK,R = IK(BR) may be considered as a hypothesis space. Here IK denotes the
inclusion IK : HK ^ C(X). When R increases the quantity
A(fp,R): = inf E(f)  E(fp) = inf f  fp L f SHKR
If IIK <RLPX
(which coincides with the approximation error modulo a^) decreases. The main result in this
chapter characterizes the measures p and kernels K for which this decay is polynomial, that
is, A(fp,R) = O(R9) with 0 > 0.
Theorem 4.1 Suppose p is a Borelprobability measure o"Z. LetK beaMercer kernel o"
X a"d LK : L2X ^ L2X be the operator give" by
LKf (x) =
K(x, t)f (t)dpX (t), x e X.
X
Let 0 > 0. Ifp e Range(L0f/(4+20)), that is, fp = L0J(4+20)(g) for some g e L2, the" A(fp,R) <
22+0 11g12+20 R0. Conversely, if pX is "o"dege"erate
and A(fp, R) < CR0 for some constants C and 0, thenfp lies in the range of L0f/i'4+20)ffor all e > 0.
Although Theorem 4.1 may be applied to spline kernels (see Section 4.6), we show in
Theorem 6.2 that for C kernels (e.g., the Gaussian kernel) and under some conditions on pX,
the approximation error decay cannot reach the order A(fp, R) = O(R0) unless fp is C itself.
Instead, also in Chapter 6 , we derive logarithmic orders like A(fp, R) = O((log R)0) for analytic
kernels and Sobolev smooth regression functions.
Reminders III
4.1
We defined compactness of an operator in Section 2.3. We next recall some other basic
properties of linear operators and a main result for operators satisfying them.
Definition 4.3 A linear operator L: H ^ H on a Hilbert space H is said to be selfadjoint if, for
all f, g e H, (Lf, g) = (f, Lg). It is said to be positive (respectively strictly positive) if it is
selfadjoint and, for all nontrivial f e H, (Lf ,f) > 0 (respectively (Lf ,f) > 0).
Theorem 4.4 (Spectral theorem) Let L be a compact selfadjoint linear operator on a
Hilbert space H. Then there exists in H an orthonormal basis {p]_, p 2,...} consisting
of eigenvectors of L. If Xk is the eigenvalue corresponding to pk, then either the set
{Xk} is finite or Xk ^ 0 when k ^<x>. In addition, maxk> 1 A.k  = L. If, in addition, L
is positive, then Xk > 0 for all k > 1, and ifL is strictly positive, then X k > 0 for all k >
1.
We close this section by defining the power of a selfadjoint, positive, compact, linear
operator. If L is such an operator and 0 > 0, then Le is the operator defined by
(y ,
4.2
kpk^
'ck^k<pk.
In the remainder of this chapter we consider v, a finite Borel measure on X, and L2(X), the
Hilbert space of square integrable functions on X. Note that v can be any Borel measure.
Significant particular cases are the Lebesgue measure p and the marginal measure pX of
Chapter 1.
Let K: X x X ^ R be a continuous function. Then the linear map
LK : L2 (X) ^ C(X)
given by the following integral transform
(LKf )(x) =
K(x, t)f (t) dv(t), x e X,
is well defined. Composition with the inclusion C(X) ^ Lv 2 (X) yields a linear operator LK :
L^(X) ^ L^(X), which, abusing notation, we also denote by LK .
The function K is said to be the kernel of LK, and several properties of LK follow from
properties of K. Recall the definitions of CK and Kx introduced in Section 2.4.
Proposition 4.5 IfK is continuous, then LK : L^(X) ^ C(X) is well defined and
compact. In addition, \\LK  < Vv(X)CK. Here v(X) denotes the measure ofX.
Proof. To see that LK is well defined, we need to show that LKf is continuous for every f e
L^(X). To do so, we consider f e L^(X) and xi,x2 e X. Then
Two more important properties of LK follow from properties of K. Recall that we say K is
positive semidefinite if, for all finite sets (x1 , . . . , xk} c X, the k x k matrix K [x] whose (i, j)
entry is K (xt, Xj) is positive semidefinite.
Proposition 4.6
i; IfK is symmetric, thenLK : L2(X) ^ L2(X) is selfadjoint.
ii; If, in addition, K is positive semidefinite, then LK is positive.
K(x, t)f (x)f (t) d v (x) d v (t)
Xx
lim
(v(X ))'
k2
k
J^K (xi, xj )f (xi )f (xj) ij=1
lim
k^(X>
(v(X)) k 2
2
fxT K [x]fx,
Proof. Part (i) follows easily from Fubinis theorem and the symmetry of K. For Part (ii), just
note that
where, for all k > 1, xi, . . . , xk e X is a set of points conveniently chosen and fx = (f (X1 ), . . . , f
(xk))T. Since K[x] is positive semidefinite the result follows.
In what follows we fix a Mercer kernel and let {fak e L2(X)} be an orthonormal
basis of L2(X) consisting of eigenfunctions of LK. We call the fak orthonormal
eigenfunctions. Denote by Xk, k > 1, the eigenvalue of LK corresponding to fak. If Xk
> 0, the function fak is continuous. In addition, it lies in the RKHS HK. This is so
since
1
1 j
fak
(x) =
LK(fak)(x) =
K(x, t)fak(t) dv(t),
X
A
k
k
fa
l k IIK
1
X
K
\\KK lfak(t)l dv(t)
and, thus, fak can be approximated by elements in the span of {Kx  x e X}.
<
Remark 4.9 When v is nondegenerate, one can easily see from the definition of the
integral operator that LK has no eigenvalue 0 if and only if HK is dense in L2(X).
In fact, the orthonormal system above forms an orthonormal basis of HK when pX
is nondegenerate. This will be proved in Section 4.4. Toward this end, we next
prove Mercers theorem.
4.3
Mercers theorem
If f e L2(X) and {<1, f;, } is an orthonormal basis of L;(X), f can be uniquely written as f
= Y.k>1 akfk, and, when the basis has infinitely many functions, the partial sums
1 akfk
converge to f in L;(X). If this
convergence also holds in C(X), we say that the series converges uniformly to f. Also, we
say that a series
ak converges absolutely if the series ak  is
convergent.
When LK has only finitely many positive eigenvalues {Xk m 1 , K(x, t) =
rm=i h & (x)fa (t).
Theorem 4.10 Let vbea Borel, nondegenerate measure onX, andK: X xX ^ R a Mercer
kernel. Let be the kth positive eigenvalue of LK, and fa the corresponding
continuous orthonormal eigenfunction. For all x, t e X,
K (x, t) = ^ Xk fa (x)fa (t),
k>1
where the convergence is absolute (for each x, t e X x X) and uniform (onX x X).
Proof. By Theorem 4.8, the sequence {fafa fa}k>1 is an orthonormal system of HK . Let x e X.
The Fourier coefficients of the function Kx e HK with respect to this system are
f/Xkfk, Kx )K s/Xk fk
where Theorem 2.9(iii) is used. Then, by Parsevals theorem, we know that
J2l^kkfa(x); = J2XkIfk(x); < \\KxIlK = K(x,x) < CK.
k>1
k>1
Hence the series Xk fk (x) ; converges. This is true for each point x e X.
m+i
fak (x)fak (t)
k=m
\ 1/2 /m+i
\ 1/2
h \tk (t)\2
h \4>k (x) 2
k=m
k =m
/m+i
\ 1/2
< CK
h\fak(x)\2 ,
m+i
<
Now we fix a point x e X. When the basis {fak}k>i has infinitely many functions, the
estimate above, together with the CauchySchwarz inequality, tells us that for each t e X,
\km
(Kx ,f >L?(X )
X
K(x, y)f (y) dv 0.
which tends to zero uniformly (for t e X). Hence the series J]k>i Xk$k (x)$k (t) (as a function of
t) converges absolutely and uniformly on X to a continuous function gx. On the other hand, as
a function in L2(X), Kx can be expanded by means of an orthonormal basis consisting of [$k}
and an orthonormal basis ^ of the nullspace of LK. For f e ^,
Hence the expansion of Kx is
K
x 'y ](Kx, fak)L2(X)fak y 'LK(fak)(x)fak ^ ^k<fik(x)fak.
k>1
k>1
k>1
Thus, as functions in L2(X), Kx = gx. Since v is nondegenerate, Kx and gx are equal on a
dense subset of X. But both are continuous functions and, therefore, must be equal for each t
e X .It follows that for any x, t e X, the series J]k>1 Xk$k (x)$k (t) converges to K(x, t). Since
the limit function K(x, t) is continuous, we know that the series must converge uniformly on
X x X.
fak (x)2 dv
X
4.4
RKHSs revisited
In this section we show that the RKHS HK has an orthonormal basis {VXX} derived from the
integral operator LK (and thus dependent on the measure v).
Theorem 4.12 Let v be a Borel, nondegenerate measure onX, andK: X xX ^ R a Mercer
kernel. Let Xk be the kth positive eigenvalue of LK, and /k the corresponding
continuous orthonormal eigenfunction. Then {VX/k : Xk > 0} is an orthonormal basis
of HK.
Proof. By Theorem 4.8, {^/X/k : Xk > 0} is an orthonormal system in HK. To prove the
completeness we need only show that for each x e X, Kx lies in the closed span of this
orthonormal system. Complete this system to form an orthonormal basis {VX/k : Xk > 0}U {fj:
j > 0} of HK. By Parsevals theorem,
llKx IlK =
(Kx^h$k^K + K, fj)2K.
kj
Therefore, to show that Kx lies in the closed span of {VX/k : Xk > 0}, it is enough to
require that
llKx IlK = E(K , VX&k )K,
k
that is, that
K(x, x) = ^2 Xk (Kx, /k)K = ^2 Xk/k (x)2.
kk
Theorem 4.10 with x = t yields this identity for each x e X.
Since the RKHS HK is independent of the measure v, it follows that when v is nondegenerate and dim HK = <x>, LK
has infinitely many positive eigenvalues Xk, k > 1, and
HK =
f = ^ a^Xk/k : {ak}^=i e t2
k=1
When dim HK = m < TO, LK has only m positive (repeated) eigenvalues. In this case,
In this section we prove Theorem 4.1. It actually follows from a more general characterization
of the decay of the approximation error given using interpolation spaces.
Definition 4.15 Let (B,  ) and (H,  \\H) be Banach spaces and assume H is a subspace
of B. The Kfunctional K: B x (0, TO) ^ R of the pair (B, H) is defined, for a e B and t > 0, by
K(a,t) := mf{a b\\+t\\b\\H}.
(4.2)
It can easily be seen that for fixed a e B, the function K(a, t) of t is continuous,
nondecreasing, and bounded by a (take b = 0 in (4.2)). When H is dense in B, K(a, t) tends
to zero as t ^ 0. The interpolation spaces for the pair (B, H) are defined in terms of the
convergence rate of this function.
For 0 < r < 1, the interpolation space (B, H)r consists of all the elements a e B such
that the norm
ar := sup {K(a, t)/tr\
t>0
is finite.
Theorem 4.16 Let (B,  ) be a Banach space, and (H,  H) a subspace, such that b
< C0bH for allb e H and a constant C0 > 0. Let 0 < r < 1. If a e (B, H)r, then, for all
R > 0,
^(a,R) := inf {a b2} < a2/(1'r)R2r/(1r).
mH<R
Conversely, if ^.(a, R) < CR2r/(1r) for all R > 0, then a e (B, H)r and Mr < 2C(1r)/2.
Proof. Consider the function f (t) := K(a, t)/t. It is continuous on (0, +ro). SinceK(a,t) < 
a,inft> 0 f (t)} = 0.
f (tR,e)
(1  e)R.
K(a, tRe)
tR,e
Fix R > 0. If supt> 0 f (t)} > R, then, for any 0 < e < 1, there exists some tR,e e (0, +ro)
such that
By the definition of the Kfunctional, we can find be e H such that
a  be II +tR,e llbe H < K(a, tR,f)/(1  e).
K(a, tR,e)
(1  e)tR,e
IIbe IIH <
It follows that
K(a, tR,e)
1
1
___1
1e'
a  be I <
and
But the definition of the norm  a r implies that
< aNr.
K(a, tR,e)
R,e
Therefore
a  be  " K(a,tR,e) r/(1r) K(a, tR,e)
"
<
. (1 e)tR,e _
/ i \1/(1r)
< Rr/drW
(Haiir)Wr).
Thus,
^(a,R) < inf a  be 2} < a?/(1r)R2r/(1r);
0<e<1 l
J
\\H <
K(a, t) (1  e)t
<
1
supf (u)} < R
1e
u>0
and
a  bt,e
^ K(a, t) < 1  e '
A(a, R) < inf
t>0
a  bte 2 [ < (inf {K(a, t)} /(1
t> 0
< inf {\Ia\Irf} /(1
t> 0
e)
2
= 0.
Hence
This again proves the desired error estimate. Hence the first statement of the theorem holds.
Conversely, suppose that A(a, R) < CR2r/(1r) for all R > 0. Let t > 0. Choose Rt = (VC/t)1r.
Then, for any e > 0, we can find bte e H such that
\\bt,e IIH < Rt and a  bM 2 < CR2r/(1r)(1 + e)2.
It follows that
K(a, t) < a  bt,e ll+tbt,e IIH < VCRr/(Xr)(1 + e)
+ tRt < 2(1 + e)C(1r)/2tr.
Since e can be arbitrarily small, we have
K(a, t) < 2C(1r)/2tr.
Thus, 11a r = supt>0 {K(a, t)/tr}< 2C(1 r)/2 < TO.
The proof shows that if a e H, then A(a,R) = 0 for R > aH. In addtion, A(a, R) = O(R2r/
(1r)
) if and only if a e (B, H)r. A special case of Theorem 4.16 characterizes the decay of A for
RKHSs.
Corollary 4.17 Suppose p is a Borelprobability measure onZ. Let 0 > 0.Then A(fp, R) =
O(R0) if and only if fp e (L2X, H+)0/(2+0), where H+ is the closed subspace of HK
spanned by the orthonormal system {VVfk : Ak > 0} given in Theorem 4.8 with the
measure pX.
Proof. Take B = Lpx and H = H+ with the norm inherited from HK. Then b\ < vXjlbllH for all
b e H. The statement now follows from Theorem 4.16 taking r = 9/(2 + 9).
LrK (L2(X ))
1/2
N
\K
ak xk9/(4+29ll1/\/Xkf,
=
ak2X
2, 2/(2+9)
,2/(2+9)
2
<X
N
Choose f = Y^N=1 akX9k/(4+29)fk e HK. We can see from Theorem 4.8 that
k=1
K k=1
In addition,
\Wfp  f L =
J2ak Xk
k> N
9/(4+29)
fk
= E a2X/(2+9) < xN+rwgw2.
L
k>N
L 2
PX
2
E
c
\'
22m<Xk<22(m1) Ak
r2e <
<2
(ck  b r )
(
r2e
22m<Xk <22(m1) k
+2
(bf)2
r2e
22m<Xk <22(m1)
k
Write fp = J2 k ck $k and fm = Hk b{m)$k. Then, for all 0 < e < r,
which can be bounded by
4me
21+2m(r2e) \f f \2 ^ + 2 1+2(1m)(1r+2e) f \2 < C2/(2+e)25
pmL
PX
mK
+ C 2/(2+e)25+2(1r)+4e(1m)
h <1
c
v r2e
k
<
160
2/(2+e) 16f 1
< TO.
Therefore,
6/(4+26),
This means that fp e Range(LK
An example
4.6
In this section we describe a simple example for the approximation error in RKHSs.
Example 4.19 Let X = [1,1], and let K be the spline kernel given in Example 2.15, that is, K(x,
y) = max{1 x y/2,0}. We claim that HK is the Sobolev space H 1(X) with the following
equivalent inner product:
f,)K = f',g')L2[U] + 1 (f (1) + f (1)) (g(1) + g(1)).
(4.3)
Assume now that pX is the Lebesgue measure. For 0 > 0 and a function fp e L2[1,1], we also
claim that A(fp,R) = O(R0) if and only if \fp(x + t) fp (x) IIL 2[1,11] = O(t0/(2+0)).
To prove the first claim, note that we know from Example 2.15 that K is a Mercer kernel.
Also, Kx e H1 (X) for any x e X .To show that (4.3) is the inner product in HK, it is sufficient to
prove that If, Kx)K = f (x) for any f e H 1(X) and x e X .To see this, note that K'x = 2 X[i,x)
2 X(x,i] and Kx(1) + Kx(1) = 1. Then,
f, Kx )K = 2fXf '(y) dy 1 f V '(y) dy +1 (f (1) + f (1)) = f (x).
To prove the second claim, we use Theorem 4.16 with B = L2(X) and H = HK = H 1(X). Our
conclusion follows from the following statement: for 0 < r < 1 andf e L2,K(f,t) = O(tr)
ifandonlyif f (x+1) f (X)L2[1 1t] = O(tr).
To verify the sufficiency of this statement, define the function ft: X ^ R by
1 ft
ft
(x) =
f (x + h) dh.
t
0
Taking norms of functions on the variable x, we can see that
1 f1
If  ft L =
1
f (X + h)  f (x) dh
1
0
/0
1 f1
C
<
1
o
r
L2
1
Chr dh =
If
(x + h)  f (X)IL2 dh
1
0
\H 1(X) =
 (f (X + f)  f (X))
< C1
r1
L2
and
Hence
K(f, 1 ) < If f L2 + n\f1 llfl 1 (X) < (C/(r + 1) + C)f.
Conversely, if K(f, f) < Cf, then, for any g e H1 (X), we have
If (X + f)  f (X)\L2 = \f (X + i)  g(x + f) + g(x + i)  g(x)
+ g(x)  f (X)\L2 ,
which can be bounded by
2\f  g\L2 +
g'(x + h) dh < 2\f  g\L2 +
\g'(x + h)\L2 dh
0
L2
0
< 2\f  g\L2 + 1\g\H 1 (X).
Taking the infimum over g e H1 (X), we see that
48 4
error
4.7
For a proof of the spectral theorem for compact operators see, for example, [73] and Section
4.10 of [40].
Mercers theorem was originally proved [85] for X = [0,1] and v, the Lebesgue measure.
Proofs for this simple case can also be found in [63, 73].
Theorems 4.10 and 4.12 are for general nondegenerate measures v on a compact space X.
For an extension to a noncompact space X see [123].
The map $ in Theorem 4.14 is called the feature map in the literature on learning theory
[37, 107, 134]. More general characterizations for the decay of the approximation error being
of type O(p(R)) with p decreasing on (0, +ro) can be derived from the literature on
approximation theory e.g., ([87, 94]) by means of Kfunctionals and moduli of smoothness. For
interpolation spaces see [16].
RKHSs generated by general spline kernels are described in [137]. In the proof of Example
4.19 we have used a standard technique in approximation theory (see [78]). Here the function
needs to be extended outside [1,1] for defining ft, or the norm ^2 should be taken on [1,1
t]. For simplicity, we have omitted this discussion.
The characterization of the approximation error described in Section 4.5 is taken from
[113].
Consider the approximation for the ERM scheme with a general loss function f in Section
3.5. The target function ff minimizes the generalization error Ef over H. If we minimize instead
over the set of all measurable functions we obtain a version (w.r.t. f) of the regression
function.
xeX.
Definition 4.20 Given the regression loss function f the f regression function is given by
The approximation error (w.r.t. f) associated with the hypothesis space H is defined as
E f(fH)  Ef (ff) = min E f(f )  Ef (ff).
Proposition 4.21 Let f be a regression loss function. If e > 0 is the largest zero of f
and y < M almost surely, then, for all x e X,ff (x) e [M  e, M + e].
then
f(f)  f(f) < C
X
< C If  f L < C Ilf f IILL2 .
f (X)  f(x)\sd PX
PX
PX
i;
If f is C1 on [M, M] and its derivative satisfies (4.4), then
f(f)  f(f) < C lf  ff 1++ < C lf  ff
ii; If f "(u) > c > 0 for every u e [M, M ], then
f(f)  f(f) > If  f II+2 .
The bounds for the sample error described in Chapter 3 are in terms of, among other
quantities, some covering numbers. In this chapter, we provide estimates for these covering
numbers when we take a ball in an RKHS as a hypothesis space. Our estimates are given in
terms of the regularity of the kernel. As a particular case, we obtain the following.
Theorem 5.1 Let X be a compact subset of Rn, and Diam(X) := maxx,yeX x  y its
diameter.
i; IfK e Cs(X x X) for some s > 0 andX has piecewise smooth boundary, then there
is C > 0 depending on X and s only such that
R\
2n/s
ii;
R\n/2
ln N{IK(BR), n) > Ci ln .
Here C1 is a positive constant depending only on a and f.
Part (i) of Theorem 5.1 follows from Theorem 5.5 and Lemma 5.6. It shows how the
covering number decreases as the index s of the Sobolev smooth kernel increases.Acase
where the hypothesis of Part (ii) applies is that of the box spline kernels described in Example
2.17. We show this is so in Proposition 5.25.
When the kernel is analytic, better than Sobolev smoothness for any index s > 0, one can
see from Part (i) that ln N(IK (Br), n) decays at a rate faster than (R/n)e for any e > 0. Hence
one would expect a decay rate such as (ln(R/v))s for some s. This is exactly what Part (ii) of
Theorem 5.1 shows for Gaussian kernels. The lower bound stated in Part (ii) also tells us that
the upper bound is almost sharp. The proof for Part (ii) is given in Corollaries 5.14 and 5.24
together with Proposition 5.13, where an explicit formula for the constant C1 can be found.
5.1
Reminders IV
To prove the main results of this chapter, we use some basic knowledge from function spaces
and approximation theory.
Approximation theory studies the approximation of functions by functions in some good
family  for example, polynomials, splines, wavelets, radial basis functions, ridge functions.
The quality of the approximation usually depends on, in addition to the size of the
approximating family, the regularity of the approximated function. In this section, we describe
some common measures of regularity for functions.
I; Consider functions on an arbitrary metric space (X, d). Let 0 < s < 1. We say that a
continuous function f on X is Lipschitzs when there exists a constant C > 0 such that for all
x, y e X,
f (x) f (y)\<C(d(x,y))s.
We denote by Lip (s) the space of all Lipschitzs functions with the norm
IIf 11 Lip(s) := \f I Lip(s) + \\f \\C(X),
1f
lLip(s) :
sup
x=yeX
I f (x)  f (y) (d (x, y))s
where  Lip(s) is the seminorm
This is a Banach space. The regularity of a function f e Lip(s) is measured by the index s. The
(1)rjf (x + jt).
Let X be a closed subset of Rn and f : X ^ R. For r e N, t e Rn, and x e X such that x, x +
t,..., x + rt e X, define the divided diference
In particular, when r = 1, Ajf (x) = f (x + t)  f (x). Divided differences can be used to
characterize various types of function spaces. Let
Xr,t = {x e X  x,x + t,...,x + rt e X}.
1f
Lip*(s,LP(X))
:= sup t s
teRn
1/p
For 0 < s < r and 1 < p < TO, the generalized Lipschitz space Lip*(s, Lp(X)) consists of
functions f in Lp(X) for which the seminorm
is finite. This is a Banach space with the norm
1f
Hup*(s,LP(X)) := f Lip*(s,LP(X)) + 1 HLp(X).
The space Lip*(s, C(X)) is defined in a similar way, taking

(x)
f )) := sup n sup
teRn
xeXrt
and
1f1
Lip*(s,C (X)) := \f Lip*(s,C(X)) + f 1C(X).
Clearly,  f HUp*(sLp(x)) < (KX))1/p f HUp*(sC(x)) for allp < TO when X is compact.
When X has piecewise smooth boundary, each function f e Lip*(s, Lp(X)) can be extended to
Rn. If we still denote this extension by f, thenthere exists a constant CX,s,p depending on X, s,
and p such that for all f e Lip*(s,Lp(X)),
IIf II Lip*(s,LP (Rn))  CX ,s,p>f Wup*(s,LP(X ))'
Whenp = <x>, we write  f Up*s instead of  f llUp*(sC(X^.Inthis case, under some mild
regularity condition for X (e.g., when the boundary of X is piecewise smooth), for s not an
integer, s = t + s0 with t e N, and 0 < so < 1, Lip*(s, C(X)) consists of continuous functions
on X such that Daf e Lip(s0), for any a = (a1,...,an) e Nn with a t. In particular, if X =
[0,1]n or Rn. In this case, Lip*(s, C(X)) = Cs(X) when s is not an integer, and Cs(X) c Lip*(s,
C(X)) when s is an integer. Also, C 2(X) c Lip(1) c Lip*(1, C(X)),the last being known as the
Zygmund class.
Again, the regularity of a function f e Lip*(s, Lp(X)) is measured by the index s. The bigger
the index s, the higher the regularity of f.
5.2
smooth kernels
Recall that if E is a Banach space and R > 0, we denote BR(E) = {x e E : x < R}.
If the space E is clear from the context, we simply write BR.
Lemma 5.2 Let E c C(X) be a Banach space. For all n, R > 0,
N (BR, n) = N (Bi, R).
Proof. The proof follows from the fact that {B( fi, n), , B( fk, n)} is a covering of BR if and
only if {B (fi/R, n/R), , B (fk/R, n/R)} is a covering of Bi.
It follows from Lemma 5.2 that it is enough to estimate covering numbers of the unit ball.
We start with balls in finitedimensional spaces.
Theorem 5.3 Let E be a finitedimensional Banach space, N = dim E, and R > 0. For 0
< n < R,
N(BR, n) <
+l
+i
n
N
ji
and f ( B ( f ( t ) , n ) , Vj e{2,,l}.
i=i
Let n > 0. Suppose that N(BR, n) > ((2R/n) + i)N .Then BR cannot be covered by ((2R/n) +
i)N balls with radius n. Hence we can find elements f(i), , f(e> in BR such that
Therefore, for i = j,  f f j) > n.
Set f(j ) = Em=i xim j em e BR and
x( j )
=(x(j),
,xN J'1)
e
RN. Then,
for
i=j,
x(l) x( j)  > n.
Also, \x( j)\ < R.
Denote by Br the ball of radius r > 0 centered on the origin in (RN, \ \).Then
U
{xU) + 2Bi} C ^ j=i
and the sets in this union are disjoint. Therefore, if x denotes the Lebesgue measure on RN,
x ^U p+ 2 B; } j= (x( j > + 2 B;) <,, (Br}).
It follows that
t (2 )N x (Bo < (R+2 f x &),
and thereby
t ^)=(2R+)*
This is a contradiction. Therefore we must have, for all n > 0,
2R
+1
n
If, in addition, n > R, then BR can be covered by the ball with radius n centered on the
origin and hence N(BR, n) = 1.
The study of covering numbers is a standard topic in the field of function spaces. The
asymptotic behavior of the covering numbers for Sobolev spaces is a wellknown result. For
example,
the ball BR(Lip*(s,C([0,1]))) in the generalized
Lipschitz space on [0,1] satisfies
(
)n/s
(
)n/s
R
nn
n
1/s
But ln(1 + t) < t for all t > 0, so we have
and hence
(
r
4
ln N(B1 (Lip(s)), n) < ln2 +  ln3 + 2 
\nj
(
4 \1/S
<4
\nj \nj
We now prove the lower bound. Set e = (2n)1/s and x as above. For i = 1,..., d 1, define
f to be the hat function of height 2 on the interval [x; e, Xi + e]; that is,
2 t if 0 < t < e
fi (Xi +1)
2 + 2t
if e < t < 0
0
if t [e,e].
Note that fi (xj) = 2 8j.
For every nonempty subset I of {1,..., d 1} we define
fi (x) = ^2 fi(x).
ieI
If I1 = I2 , there is some i e (I1 \ I2 ) U (I2 \ I1 ) and
n
II fI1 fI2 11^(X) > f (xi) fI2 (xi) = ^.
It follows that N(B1 (Lip(s)), n) is at least the number of nonempty subsets of {1,..., d 1}
(i.e., 2d1 1), provided that each fI lies in B1 (Lip(s)). Let us prove that this is the case.
Observe that f is piecewise linear on each [xi,xi+i] and its values on xi are either ^ or 0.
Hence  fi \\C(X) < n < 5 . To evaluate the Lipschitzs seminorm of fi we take x, x + t e X with
t > 0. If t > e, then
Ifi(x +1 )  f (x)\ ^ 2  fj\\C(X) n 1
ts
<
es < es 2 .
If t < e and xi e (x, x + t) for all i < d  1, then fI is linear on [x, x +1] with slope at most 2 e
and hence
 fi (x +1)  fi (x)\
(n/2e)t n is
n i
i
ts
~ ts
2e
2e
4
If t < e and xi e (x, x +1) for some i < d  1, then
 fi (x + t)  fi (x)  < \ fi (x + t)  fi (xi ) + \ fi (xi)  fi (x) 
n
n
n
< 2 ~(x + 1  xi) + 2 ~ (xi  x) = 2 ~ t
and hence
 fi (x +1 )  fi (x) 
n A,
n s
i
ts
2e
2
4
Thus, in all three cases,  f  Lip(s) < 2, and therefore  f  Lip(s) < 1. This shows fi e Bi(Lip(s)).
Finally, since n < 4 , we have e < i and d > 2. It follows that N(Bi(Lip(s)), n) > 2di  i > 2d 2
r
i / Oii/S
i(i
ln N(Bi (Lip(s)), n) > 2 ln M 2^ )
 ln4 > ^.2^
Now we can give some upper bounds for the covering number of balls in RKHSs. The
bounds depend on the regularity of the Mercer kernel. When the kernel K has Sobolev or
generalized Lipschitz regularity, we can show that the RKHS HK can be embedded into a
generalized Lipschitz space. Then an estimate for the covering number follows.
Theorem 5.5 Let X be a closed subset of Rn, and K: X x X ^ R be a Mercer kernel. If s
> 0 and K e L i p*(s, C(X x X)), then HK c Li p*(2, C(X)) and, for allr e N, r > s,
II
f Hup*(s/2) <7 2r+lWK Li p*(s) II f IlK, Yf 6 HK .
Proof. Let s < r 6 N andf 6 HK .Let x, t 6 Rn such that x, x+1,..., x+rt 6 X. By Theorem
2.9(iii),
rf (x) =
j=0
= ( ( j ) (1)riKx+jt f
j=0
Here (0, t) denotes the vector in R2n where the first n components are zero.
A[0X)K (x + jt, x)
< \K
Lip*(s)
s
By hypothesis, K 6 Li p*(s, C(X x X)). Hence
This yields
\Af (x) < If K
j ) lKLip*(s) llts 1 < ^KLip*(s)Ilf IKlts/2.
Therefore,
If iup*(2) <prlKluP*(s) II f IK.
Combining this inequality with the fact (cf. Theorem 2.9) that
55
II f HLip*(s/2 ) <7
W HuP*(s) II f IlK The proof of the theorem is complete.
Lemma 5.6 Let n e N, s > 0, D > 1, andx* e Rn. IfX c x* + D[0, l]n and X has piecewise
smooth boundary then
N(BR(Lip*(s, C(X))), n) < N(BCX,SDSR( Lip* (s, C([0,1]n))), n/2),
where CX,s is a constant depending only on X and s.
f *(x) = f
x  x*
D
Proof. For f e C([0,1]n) define f * e C(x* + D[0,1]n) by
Then f * e Lip*(s,C(x* + D[0,1]n)) if and only if f e Lip*(s,C([0,1]n)). Moreover,
D J f
l Hup*(s,C([0,1]n)) < If *^Lip*(s,C(x*+D[0,1]n)) < If 1 Lip*(s,C([0,1]n)).
Since X c x* + D[0,1]n and X has piecewise smooth boundary, there is a constant CX ,s
depending only on X and s such that
BR(Lip*(s, C(X))) c {f *X  f * e BCX,sR(Lip*(s,C(x* + D[0,1]n)))}
C {f *X I f e BCX,SDSR(L\P*(S, C([0 ,1 ]n)))}.
Let {f1,...,fN} be an 2net of BCx sDsR(Lip*(s, C([0,1]n))) and N its covering number. For each
j = 1,...,N take a function g*IX eBR(Lip*(s,C(X))) with gj eBCX,sDsR(Lip*(s,C([0,1]n))) and gj fjLip*(s,C([0,1])) < 2 if it exists. Then {g*X I j = 1,..., N} provides an nnet of BR(Lip*(s, C (X))):
each f e BR(Lip*(s,C(X))) can be written as the restriction g*X to X of some function g e BCx
n
sDsR(Lip*(s, C([0,1] ))), so there is some j such that
llg  fj 1 Lip*(s,C([0,1])) < 2. This implies that
1fg
j* X II Lip*(s,C(X)) < lg*  gj 1 Lip*(s,C(x*+D[0,1]n))
< llg  gj 11 Lip*(s,C ([0,1 ]n)) < 2.
This proves the statement.
Proof of Theorem 5.1(i) Recall that Cs(X) c Lip*(s, C(X)) for any s > 0. Then, by Theorem
5.5 with s < r < s + 1, the assumption K e Cs(X x X) implies that IK(BR) c By2,+2Ky t
R(Lip*(s/2, C(X))). This, together with
(52; and Lemma 5.6 with D > Diam(X), shows Theorem 5.1(i).
When s is not an integer, and the boundary of X is piecewise smooth, Cs (X) = Lip*(s, C
(X)). As a corollary of Theorem 5.5, we have the following.
Proposition 5.7 Let X be a closed subset of Rn with piecewise smooth boundary, and
K: X x X ^ R a Mercer kernel. If s > 0 is not an even integer andK e Cs(X x X), then
HK c Cs/1(X) and
II f Hup*(s/2 ) <72r+1HKHuP*(s) II f IIK, Vf e HK.
Theorem 5.5 and the upper bound in (5.2) yield upperbound estimates for the covering
numbers of RKHSs when the Mercer kernel has Sobolev regularity.
Theorem 5.8 Let X be a closed subset of Rn with piecewise smooth boundary, and K:
X x X ^ R a Mercer kernel. Let s > 0 such that K belongs to Lip*(s, C(X x X)). Then,
for all 0 <n < R,
R 2n/s
ln N (IK(BR), n) < C ,
where C is a constant independent ofR and n.
It is natural to expect covering numbers to have smaller upper bounds when the kernel is
analytic, a regularity stronger than Sobolev smoothness. Proving this is our next step.
2r+l
5.3
In this section we continue our discussion of the covering numbers of balls of RKHSs and
provide better estimates for analytic kernels. We consider a convolution kernel K given by
K(x, t) = k (x 1), where k is an even function in ^2(R) and k(%) > 0 almost everywhere on
Rn. Let X = [0,1]n. Then K is a Mercer kernel on X . Our purpose here is to bound the covering
number N (IK(BR), n) when k is analytic.
We will use the Lagrange interpolation polynomials. Denote by ns(R) the space of real
polynomials in one variable of degree at most s. Let t0,..., ts e R be different. We say that wl,s
e ns(R), l = 0,, s, are the Lagrange interpolation polynomials with interpolating points
{t0,..., ts} when E/=0 wl,s (t) = 1 and
wl,s(tm)
Sl,m, l, m e {0,1,..., s}.
wl,s(t)
56
js{0 ,t,...,s}\{l}
tt t t
J l j
It is easy to check that
satisfy these conditions.
^st(st  1) (st  j + 1)
w
l,s(t) := 2^JJ
j=l
'
j
l
(1) j1 .
(5.3)
We consider the set of interpolating points {0,1,2,... ,1} and univariate functions
{wl,s(t)}sl=0 defined by
Since
W j ) (1)Jlzl = (z  1)}, l=0 l
the following functions of the variable z are equal:
Z*Zt)z! = E SW  ') :,(s' J + 1)(z  1)J.
(5.4)
J
l=0
J=0
Wl,s
Sl,m, l, m e{0,1,..., s}.
(5.5)
In particular, J]s=0 wl,s(t) = 1. In addition, it can be easily checked that
Wl,s(t)
je{0 ,1 ,...,s}\{l}
t  j/s l/s  j/s
je{0 ,1 ,...,s}\{l}
st  j l  j
This means that the wl,s are the Lagrange interpolation polynomials, and hence
The norm of these polynomials (as elements in C([0,1])) can be estimated as follows:
Lemma 5.9 Let s e N, l e{0,1,..., s}. Then, for all t e [0,1],
Wl,s(t) < ^ s ^ '
Proof. Let m e{0,1,..., s  1} and st e (m, m + 1). Then for l e{0,1,..., m  1},
\wl,s(t)\
nj=1(st j)Y\J=l+1(st  j)Uj=m+1(st  j)
l!(s  l)!
(m + 1) ! (s  m)!
s
<
<s.
_
(st  l)l!(s  l)!
l
When l e{m + 1,..., s},
\wl,s(t)\
ujL0(st  j) ulj=L+i(st  j)n]=l+1(st  j)
l!(s  l)!
(m + 1)!(s  m)!
s
<
<s.
(l  m)l!(s  l)!
l
The case l = m can be dealt with in the same way.
We now turn to the multivariate case. Denote XN := {0,1,..., N}n. The multivariate
polynomials {wa,N (x)}aeXN are defined as
n
Wa,N (x) = Y[ Waj,N (Xj), x = (Xl, ..., Xn), a = (ai,..., an). (5.6)
j=1
We use the polynomials in (5.6) as a family of multivariate polynomials, not as interpolation
polynomials any more. For these polynomials, we have the following result.
Lemma 5.10 Let x e [0,1]n andN e N. Then
(5.7)
Wa,N (x) < (N2N )n
a eXN
e~ie Nx
WaN (x)ei0 a
57
UGXN
<n
1
+ 2N
n1
N
max 0; 
1< j<n
(5.8)
and, for 0 e[j, \]n,
holds.
Proof. The bound (5.7) follows directly from Lemma 5.9.
Nt
= (i + (z  i))Nt = J2
j=0
. Nt(Nt  1) (Nt  j + 1)
j!
(z  1)j.
It follows that for n e[2, 2] and z = e in,
, Nt(Nt  1) (Nt  j + 11
N
inNt  ^;
(ein  1)j
To derive the second bound (5.8), we first consider the univariate case. Let t e [0,1]. Then
the univariate function zNt is analytic on the region z  1 < 2. On this region,
<
j=N +1
Nt(Nt  1) (Nt  j + 1)
j
nN.
j=0
and
N
N
Nt(Nt  1) (Nt  j + 11 _,inn
.
l
WIN(t)e inl =
(e  1)j
l=0
j=0
< 1 +  n N < 1 +
j!
1
2N
i
%
7
1 ____
(5.9)
N
einNt
J^WIN (t )e~in'1
< lnl
N
(5.10)
This, together with (5.4) for z = e in, implies
l=0
Now we can derive the bound in the multivariate case. Let 0 e [2,2]n. Then ei0Nx = \\nm=i ei0mNXm
. We approximate ei0mNXm by
E*m= 0 wamN (Xm)e tdm'am for m = 1 , 2 , . . . , n. We have
id Nx ida
= n
e WaN (x)e
m=
a.XN
1
s=1
N
n
X eidm'Nxm WamN(xm)eidm'am
idsa
s=m+
as=0
1
am=0
m Wl
m=1 v
N
nm
1 +
<n1 +
1
2N
n
1 /
N
J2 Was,N (xs)e
\N
58
(.ms 1
Applying (5.9) to the last term and (5.10) to the middle term, we see that this expression can
be bounded by
Thus, bound (5.8) holds.
It.
, 1 \ 2n 2 3
Tk(
N) = n 1 +
2
3
lj l
(2n)n max
k()
l<j'< f e[N/2N/2]n
2N
2
+ 1 + (N 2N )n (2n)n
.[N/2N/2]n
k () d .
We can now state estimates on the covering number N(IK(BR), n) for a convolutiontype
kernel K(x, t) = k (x t). The following function measures the regularity of the kernel
function k:
The domain of this function is split into two parts. In the first part, . [ N/2, N/2]n, and
therefore, for j = 1,..., n, (\j/N)N < 2N; hence this first part decays exponentially quickly as
N becomes large. In the second part, . [N/2, N/2]n, and therefore is large when N is
large. The decay of k (which is equivalent to the regularity of k; see Part (III) in Section 5.1)
yields the fast decay of Tk on this second part. For more details and examples of bounding Tk
(N) by means of the decay of k (or, equivalently, the regularity of k), see Corollaries 5.12 and
5.16.
Theorem 5.11 Assume that k is an even function in ^2(R) and k() > 0 almost
everywhere on
Rn. Let K(x, t) = k (x
t) for
x, t
.
[0,1]n. Suppose
limNTk (N) = 0. Then, for 0 < n < R,
n
/ (x)  / (N)
^XN
Wa
N (x)
= (/, Kx
\
WaN (x)Kaj
aeXN K
d
2
% [N/2 ,N/2]n
k (%)
eiN Nx
59
 ^2 wa,N (x)e
t%a
N
2
d%
aiXN
< n2 1 +
2 n2
/
2N %
2N
.
%)(m d%.
N
% e[ N/2,N/2]n
1< j<n
For the second region, we apply (5.7) in Lemma 5.10 and obtain
2
%
% i[N/2,N/2]n
k (%)
e tN Nx ^2 wa,N(x)e
a XN
d%
< (1 + (N2N )n)2
'
tN a
k (%) d %.
1 \2n 2
QN (x) < n ( 1 + 2N
j
(2n)n^
k(%)(
d%
(1 + (N 2N )n)2
%
(2n)n
% i[N/2,N/2]
;[ N/2,N/2]n
k(%) d% = Tk(N).
N
Hence
sup
xs[0,1]n
k (0) 2 Wa,N (X)k ^X N)
a GXN
ap
Wa,N (X)k
Wp,N (X)
a,fiiXN
^
N
< Tk(N).
(5.14)
Since N satisfies (5.12), we have
f
f(x)
(Na) WaN(x)
aeXN
<1.
2
Also, by the reproducing property, f(f) = {f,Ka/N)K < fK VK(a/N, a/N) < R^/kifl). Hence
+ iy /!
llf (N )Ht2(ZN ) <
' .
!
2
Here (XN) is the i space of sequences {x(a)}aeXN indexed by XN.
Apply Theorem 5.3 to the ball of radius r := RVk(0)(N + 1)n/! in the finitedimensional
space i!(XN) and e = n/(!(N!N)n). Then there are {cl : l = 1 , . . . , [(!r/e + 1)X]} c i2(XN) such
that for any d e i2(XN) with d Hi2(XN) < r, we can find some l satisfying
lld  C lli2{XN) < e
This, together with Lemma 5.10, yields
c
h(XN)
Y \Wa,N (x)
aeXN
< (N2n)ne < n/2.
C (X )
Y da Wa,N (x)  Y c^aN (X) aeXN aeXN
Here i(XN) is the i space of sequences {x(a)}aeXN indexed by XN that satisfies the
relationship Hcll^^) < ci2 (XN) for all c e i(XN).
Thus, with d = {f (a)}, we see that  f (x) J2aeXN clawa,N(x)llc(X) can be bounded by
(x) f
f (N) Wa,N (x)
C (X)
+
Y daWaN (x)  Y claWa,N (x)
aeXN
aeXN
< n.
C (X)
aeXN
l
We have covered IK (BR) by balls with centers
aeXN c awa,N (x) and radius
n. Therefore,
N(IK(BR), n) <
2r
#X N
e +1
That is,
2r
ln N(IK(BR), n) < (N + 1)n ln + 1< (N + 1)n ln SvkOKN + 1)n/2(N2N)
,R
n
To see how to handle the function Tk (N) measuring the regularity of the kernel, and then
to estimate the covering number, we turn to the example of Gaussian kernels.
Corollary 5.12 Let a > O, X = [0,1]n, and K(x, y) = k (x  y) with
2
k(x) = exp  2 , x e Rn.
a
Then, for
O < n < R,
. R 54n n
R 90n2
ln N (IK(BR), n) < 3ln +
2 + 6
(6n + 1) ln +
2 + 11n + 3
n a2
n
a2
(5.15)
holds. In particular, when O < n < R exp{(90n2/a2)  11n  3}, we have
R \ n+1
ln N (IK(BR), n) < 4n(6n + 2)( ln  .
(5.16)
Proof. It is well known that
%) = (a jn)nea2^ l2/4.
(5.17)
Hence k(f) > O for any f e Rn.
Let us estimate the function Tk. For the first part, with 1 < j < n, we have
1
(2n)~1
a Vnea2f2/4 dfi < 1
fie[N/2N/2]
when l = j. Hence (2n)n
j
(a4n)nea^l2/
[N/2,N/2]n
2N
<
<
( d
\N J
a *Jn r
N/2
2n N/2
(
\/n \ a Nj
rN
2
2 \2N / 2N + 1\N+(1/2)
e
1/(6(2N +1))
2
N
K
aN J
\ 2e )
V2N + 1
j=1 Jfjt[N/2N/2]
ea2 j2/4 dfj
< n^n T'e(a2/4)(t2t/2)e(a2/4)(t/2) dt N
n
JN/2
h [N/2N/2]n
na
2
2
<
ea N(N1)/16_^ea N/16
 vne
a2 e
2
2
(a /16)N
8n
_(a 2/i^w 2
ea Jn
If we combine these two estimates, the function Tk satisfies
2
/16)N 2
*
T (N)
1+
1
< nil +
kT~XJ
2n2
2N
1 \ 2n2
< * 1 + 2n /
<e
Notice that when N > n + 3,
and
(1 + (N2n)n)2  212n+4Nn
^e(2/16)N2+4nNln2
n4
N
, +2
16en ln2
an
N
nN
<e+ 2nN
16n
an
(5.18)
4
1 1
< e +; max ,
<' + a Jn
16n 2n
N
When N > 2ln  + 5 and N >  In  + f  In(a*Jn)/(nln2), we know that each term in the
estimates for Tk is bounded by (n/(2R))2/2. Hence (5.12) holds.
Finally, we choose the smallest N satisfying
N>
80n ln 2
a2
+ 3 ln + 5.
n
Then, by checking the cases a > 1 and a < 1, we see that (5.12) is valid for any 0 < n < R.
By Theorem 5.11,
ln N (IK(BR), n) <
R
80n ln 2
3 ln2 +
2
+6
na
<
R
54n
3 ln +
+6
n
a
2
)((2ln^ nN + ln R + ln8)
R 90n2
(6n + 1) ln + 22+ 11 n + 3
n
a
Choose N > 80n ln2/a2. Then
This proves (5.15).
When 0 < n < Re(902/a2)113, we have
R \ n+1
ln n '
This yields the last inequality in the statement.
One can easily derive from the covering number estimates for X = [0,1]n such estimates
for an arbitrary X.
Proposition 5.13 Let K (x, y) = k (x  y) be a translationinvariant Mercer kernel on Rn
and X Rn. Let A > 0 and KA be the Mercer kernel on X A = [0,1]n given by
KA(x,y) = k(A(x  y)), x,y e[0,1]n.
Then
i;
IfX x* + [A/2, A/2]n for some x* e X, then, for all n, R > 0, N(IK(BR), n) <
N(IKA(BR), n).
ii;
IfX 2 x* + [A/2, A/2]n for some x* e X, then, for all n, R > 0, N(IK(BR), n) >
N(IKA(BR), n).
Proof.
=
ctcjk(xi  xj) = ^2 ctcjkl A
xi  x*
+ t0
xj  x*
 t0
i,j=1 m
YA
c
=
j
,
i
i j=
Ai
i,j=i
xi  x*
+ t0 ,
*
+ t0 ) =
I>K Ax*
i=i
+t 0
xj  x
2
KA
i; Denote t0 = (5 ,5 , . . . , 5 ) e[0,1]n. Let g = iXi ciKXi e IK(BR). Then
the last line by the definition of KA. Since g e IK(BR), we have
IXx* t e IK A(BR).
1=1
' +t0
If {f1,. . . , fN} is an nnet of IKA(BR) on XA = [0,1]n with N = N(IKA (BR), n), then there is some j
e { 1 , . . . , N} such that
< n.
i=i
C([0,i]n)
sup
te[0,1]n
J2c>K A t,
xi  x*
+ t0  fj (t)
< n.
i=1
This means that
sup
xeX
J^CiK A
i=1
x  x*
A
xi  x*
+ t0, A+ t0
fj
x  x*
A
+ t0
< n.
Take t = ((x  x*)/A) +1 0 . When x e X c x* + [A/2, A/2]n, we have t e[0,1]n. Hence
sup
xeX
m
J^Cik (x  xi)  fj
i=1
xx*
A
+ t0
= sup
xeX
m
j2ciKxt(x)  fj
i= 1
x  x*
A
+ t0
< n.
This is the same as
This shows that if we define fj*(x) := fj(((x  x*)/A) + t0), the set f*,..., fj*} is an nnet of the
function set {^j=1 CiKXi e IK (BR )} in C (X). Since this function set is dense in IK(BR), we have
N(IK(BR), n) < N = N(IKA (BR), n).
(ii) If X 2 x* + [A/2, A/2] and {^1,..., gN} is an nnet of IK(BR) with N = N(IK(BR), n),
then, foreach g e BR, wecanfind somej e {1,...,N} suchthat g gj\\C(X) < n.
Let f = J2l=1 cKtA e IKA (BR). Then, for any t e X A,
m
mm
f (t) =
^2,
Cik(A(t  ti)) =
^2
We now note that the upper bound in Part (ii) of Theorem 5.1 follows from Corollary 5.14.
We next apply Theorem 5.11 to kernels with exponentially decaying Fourier transforms.
Theorem 5.15 Let k be as in Theorem 5.11, and assume that for some constants Co > 0
and X > n(6 + 2ln 4),
k () < C0 eK ", Vf e Rn.
ln N(IK(BR), n) <
4
ln (1/A)
R
ln + 1 + C1 n
n
ln (1/A) + 1
R
ln + C2 n
(5.19)
Denote A := max{1/ek,4n/e/2}. Then for 0 < n < 2R*/C0A(2n 1)/4,
Ci := 1 +
2ln(32C0) ln (1/A)
, C2 := ln(8^C02n/2(A3/&2n)C^j.
holds, where
M)2N(
urn
( f ) l  j l df  c f
e^" N
f e[N/2,N/2]n \N J
Jf e[N/2,N/ 2 ] n
V
2N
e '"""I TT  d f
NNNn1 i
\fj\Ne\fj\ dfj
N
jfj
e[N/2,N/2]
< n2_C N Nn1N!  XN+1NN
< 2C0V2n2(1/12)+1
k
n1/2
(eX)N+1,
the last inequality by Stirlings formula. Hence the first term of Tk (N) is at most
\ 2 n2
N
(2n)n2CV2rt
2(1/12)+1
 4C0 Nn(1/2) ^
1 \N +1
Nn(1/2) (eX)N +1
Here we have bounded the constant term as follows
n3 (1 + 2 N f" (2n)2V2n2(1/12)+1 = ^il+i/^)
n 1
<n
(1 + (1/2))2\
^
2(1/12)+2
2n J
9\
=n
n1
V2n
(1/12)+2
8n 7
V2n
(f) df  Co I
em"df.
f e[N/2,N/2]n
J f >N/2
(5.20)
For the other term in Tk (N), we have
To estimate this integral, we recall spherical coordinates in Rn:
f1 = r cos 01
f2 = r sin 01 cos 02
f3 = r sin 01 sin 02 cos 03
fn1 = r sin 01 sin 02 ... sin 0n2 cos 0n1 fn = r sin 01 sin 02 ... sin 0n2 sin 0n1,
where r e (0, TO), 0 1 , . . . , 0n2 e [0,0), and 0n1 e [0,2n). For a radial
'n
r(n/2) JN/2
3n
log(1/A)
<
e
< 1.
(1 + (N2N)"f (2n)2C/12N"~le~XN/2 < 4CoN3n (eC)"
Then, for N > 4n/ln(1/A),
(5.21)
Tk (N) < 8CoAN/2.
Thus, for o < n < 2R^/COA(2" 1 )/4 , we may take N e N such that N > 4n/ln(1/A) and N > 2
to obtain
67
8CoAN/2 <
y < 8CoA(N1 )/2 . (5.22)Under this choice, (5.12) holds. Then, by Theorem 5.11,
lnN (IK(BR), n) < (N + 1)" ln(^8y/k(0j(N + 1)n/2(N2N)nRj.
4 R + ln .
ln(1/A) n
2ln(32C0)
ln(1/A)
N<1+
Now, by (5.22),
Also, since
(N + 1)n/2(N2n)" < 2n/2(A3/82")N,
lnN(IK(BR), n) <
4
,_Ri 0 , 2ln(32Co)
ln + 2 +
ln(1/A) n
ln (1/A)
n
ln 8 vk(0 )2 n / 2
we have
(A3/82")1+2ln(32Co)/ln(1/A) (R/n)(4/ ln(1/A))+1
Finally observe that
5.4
In this section we continue our discussion of the covering numbers of balls in RKHSs and
provide some lowerbound estimates. This is done by bounding the related packing numbers.
Definition 5.17 Let S be a compact set in a metric space and n > 0. The packing number
M(S, n) is the largest integer m e N such that there exist m points x i , . . . , xm e S being nseparated; that is, the distance between xt and xj is greater than n if i = j.
Covering and packing numbers are closely related.
Proposition 5.18 For any n > 0,
M(S, 2n) < N(S, n) < M(S, n).
Proof. Let k = M(S,2n) and {a 1 , . . . , ak} be a set of 2nseparated points in S. Then, by the
triangle inequality, no closed ball of radius n can contain more than one at. This shows that
N(S, n) > k.
To prove the other inequality, let k = M(S, n) and {a 1 , . . . , ak} be a set of nseparated
points in S. Then, theballs B(at, n) cover S. Otherwise, there would exist a point ak +1 whose
distance to aj, j = 1 , . . . , k, was greater than n and one would have M(S, n) > k + 1.
The lower bounds for the packing numbers are presented in terms of the Gramian matrix
K[x] = ( K(x_i, x_j) )_{i,j=1}^m,   (5.24)
where x := {x₁, ..., x_m} is a set of points in X. Denote by ‖K[x]^{−1}‖₂ the norm of K[x]^{−1} (if it exists) as an operator on Rᵐ with the 2-norm.
We use nodal functions in the RKHS H_K to provide lower bounds for covering numbers. They are used in the next chapter as well to construct interpolation schemes to estimate the approximation error.
Definition 5.19 Let x := {x₁, ..., x_m} ⊂ X. We say that {u_i}_{i=1}^m is a set of nodal functions associated with the nodes x₁, ..., x_m if u_i ∈ span{K_{x₁}, ..., K_{x_m}} and
u_i(x_j) = δ_{ij}.
The following result characterizes the existence of nodal functions.
Proposition 5.20 Let K be a Mercer kernel on X and x := {x₁, ..., x_m} ⊂ X. Then the following statements are equivalent:
(i) The nodal functions {u_i}_{i=1}^m exist.
(ii) The functions {K_{x_i}}_{i=1}^m are linearly independent.
(iii) The Gramian matrix K[x] is invertible.
(iv) There exists a set of functions {f_i}_{i=1}^m ⊂ H_K such that f_i(x_j) = δ_{ij} for i, j = 1, ..., m.
In this case, the nodal functions are uniquely given by
u_i(x) = Σ_{j=1}^m (K[x]^{−1})_{ij} K_{x_j}(x),   i = 1, ..., m.   (5.25)
Moreover, for each x ∈ X, the vector (u_i(x))_{i=1}^m is the unique minimizer in Rᵐ of the quadratic function Q given by
Q(w) = Σ_{i,j=1}^m w_i K(x_i, x_j) w_j − 2 Σ_{i=1}^m w_i K(x, x_i) + K(x, x),   w ∈ Rᵐ.
Proof.
(i) ⇒ (ii). The nodal-function property implies that the nodal functions {u_i} are linearly independent. Hence (i) implies (ii), since the m-dimensional space span{u_i}_{i=1}^m is contained in span{K_{x_i}}_{i=1}^m.
(ii) ⇒ (iii). A solution d = (d₁, ..., d_m) ∈ Rᵐ of the linear system K[x]d = 0 satisfies
‖Σ_{j=1}^m d_j K_{x_j}‖²_K = Σ_{i,j=1}^m d_i K(x_i, x_j) d_j = 0.
Then the linear independence of {K_{x_j}}_{j=1}^m implies that the linear system has only the zero solution; that is, K[x] is invertible.
(iii) ⇒ (iv). When K[x] is invertible, the functions {f_i}_{i=1}^m given by f_i = Σ_{j=1}^m (K[x]^{−1})_{ij} K_{x_j} satisfy
f_i(x_j) = Σ_{ℓ=1}^m (K[x]^{−1})_{iℓ} K(x_ℓ, x_j) = (K[x]^{−1} K[x])_{ij} = δ_{ij}.
These are the desired functions.
(iv) ⇒ (i). Let P_x be the orthogonal projection from H_K onto span{K_{x_i}}_{i=1}^m. Then, for i, j = 1, ..., m,
P_x(f_i)(x_j) = ⟨P_x(f_i), K_{x_j}⟩_K = ⟨f_i, K_{x_j}⟩_K = f_i(x_j) = δ_{ij}.
So {u_i = P_x(f_i)}_{i=1}^m are the desired nodal functions. The uniqueness of the nodal functions follows from the invertibility of the Gramian matrix K[x].
Since the quadratic form Q can be written as Q(w) = wᵀK[x]w − 2bᵀw + K(x, x) with the positive definite matrix K[x] and the vector b = (K(x, x_i))_{i=1}^m, we know that the minimizer w* of Q in Rᵐ is given by the linear system K[x]w* = b, which is exactly (u_i(x))_{i=1}^m.
When the RKHS has finite dimension ℓ, then, for any m ≤ ℓ, we can find nodal functions {u_i}_{i=1}^m associated with some subset x = {x₁, ..., x_m} ⊂ X, whereas for m > ℓ no such nodal functions exist. When dim H_K = ∞, then, for any m ∈ N, we can find a subset x = {x₁, ..., x_m} ⊂ X that possesses a set of nodal functions.
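As a brief numerical sketch of (5.25) (added here for illustration, not taken from the original), the nodal functions can be evaluated directly from the inverse Gramian. The Gaussian kernel, the node set, and the tolerance below are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(s, t, sigma=0.5):
    # illustrative Mercer kernel K(s, t) = exp(-|s - t|^2 / sigma^2) on [0, 1]
    return np.exp(-np.subtract.outer(s, t) ** 2 / sigma ** 2)

def nodal_functions(nodes, kernel):
    """Return u(x) -> array of nodal-function values u_i(x), as in eq. (5.25)."""
    Kx = kernel(nodes, nodes)          # Gramian matrix K[x]
    Kx_inv = np.linalg.inv(Kx)         # assumes K[x] is invertible (Prop. 5.20(iii))
    def u(x):
        kvec = kernel(nodes, np.atleast_1d(x))   # values K_{x_j}(x)
        return Kx_inv @ kvec
    return u

if __name__ == "__main__":
    nodes = np.linspace(0.0, 1.0, 6)            # x = {x_1, ..., x_m} in X = [0, 1]
    u = nodal_functions(nodes, gaussian_kernel)
    U = u(nodes)                                 # matrix with entries u_i(x_j)
    print(np.allclose(U, np.eye(len(nodes)), atol=1e-8))   # True: u_i(x_j) = delta_ij
```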
Theorem 5.21 Let K be a Mercer kernel on X, m ∈ N, and x = {x₁, ..., x_m} ⊂ X such that K[x] is invertible. Then, for every η' > 0 satisfying ‖K[x]^{−1}‖₂ ≤ (1/m)(R/η')²,
M(I_K(B_R), η') ≥ 2ᵐ − 1.
Proof. For each nonempty subset J ⊆ {1, ..., m}, define u_J := η' Σ_{j∈J} u_j, where {u_j}_{j=1}^m are the nodal functions associated with x. If J ≠ J', there is a node x_j at which u_J and u_{J'} differ by η'; hence the 2ᵐ − 1 functions u_J are η'-separated in C(X). By (5.25),
‖u_J‖²_K = η'² Σ_{j,j'∈J} Σ_{t,s=1}^m (K[x]^{−1})_{jt} ⟨K_{x_t}, K_{x_s}⟩_K (K[x]^{−1})_{j's} = η'² Σ_{j,j'∈J} (K[x]^{−1})_{jj'} = η'² ⟨K[x]^{−1} e, e⟩_{ℓ²(J)},
where e is the vector in ℓ²(J) with all components 1.
What is left is to show that the functions u_J lie in B_R. To see this, take ∅ ≠ J ⊆ {1, ..., m}. Then
‖u_J‖²_K = η'² ⟨K[x]^{−1} e, e⟩_{ℓ²(J)} ≤ η'² √m ‖K[x]^{−1} e‖_{ℓ²(J)} ≤ η'² m ‖K[x]^{−1}‖₂.
It follows that ‖u_J‖_K ≤ η' √m ‖K[x]^{−1}‖₂^{1/2}, and u_J ∈ B_R since ‖K[x]^{−1}‖₂ ≤ (1/m)(R/η')².
Thus lower bounds for packing numbers and covering numbers can be obtained in terms of the norm of the inverse of the Gramian matrix. The latter can be estimated for convolution-type kernels, that is, kernels K(x, y) = k(x − y) for some function k on Rⁿ, in terms of the Fourier transform k̂ of k.
Proposition 5.22 Suppose K(x, y) = k(x − y) is a Mercer kernel on X = [0,1]ⁿ and the Fourier transform k̂ of k is positive. Let N ∈ N and x = {i/N : i ∈ {0, 1, ..., N−1}ⁿ}. Then
‖K[x]^{−1}‖₂ ≤ ( Nⁿ inf_{ξ∈[−Nπ,Nπ]ⁿ} k̂(ξ) )^{−1}.
Proof. For c ∈ R^{Nⁿ} indexed by the points of x,
cᵀK[x]c = Σ_{i,j∈X_N} c_i k(x_i − x_j) c_j = (2π)^{−n} ∫_{Rⁿ} k̂(ξ) | Σ_{i∈X_N} c_i e^{i x_i·ξ} |² dξ.
Bounding from below the integral over the subset [−Nπ, Nπ]ⁿ, we see that
cᵀK[x]c ≥ (2π)^{−n} inf_{ξ∈[−Nπ,Nπ]ⁿ} k̂(ξ) ∫_{[−Nπ,Nπ]ⁿ} | Σ_{i∈X_N} c_i e^{i x_i·ξ} |² dξ = ‖c‖²_{ℓ²(X_N)} Nⁿ inf_{ξ∈[−Nπ,Nπ]ⁿ} k̂(ξ),
since the exponentials e^{i(i/N)·ξ}, i ∈ X_N, are orthogonal in L²([−Nπ, Nπ]ⁿ) with norm (2πN)^{n/2}. It follows that the smallest eigenvalue of the matrix K[x] is at least
Nⁿ inf_{ξ∈[−Nπ,Nπ]ⁿ} k̂(ξ),
from which the estimate for the norm of the inverse matrix follows.
Combining Theorem 5.21 and Proposition 5.22, we obtain the following result.
Theorem 5.23 Suppose K(x, y) = k(x − y) is a Mercer kernel on X = [0,1]ⁿ and the Fourier transform of k is positive. Then, for N ∈ N,
ln N(I_K(B_R), η) ≥ ln M(I_K(B_R), 2η) ≥ (Nⁿ − 1) ln 2,
provided N satisfies
inf_{ξ∈[−Nπ,Nπ]ⁿ} k̂(ξ) ≥ (2η/R)².
As an example, we use Theorem 5.23 to give lower bounds for covering numbers of balls in RKHSs in the case of Gaussian kernels.
Corollary 5.24 Let σ > 0, n ∈ N, and
k(x) = exp{ −‖x‖²/σ² },   x ∈ Rⁿ.
Set X = [0,1]ⁿ, and let the kernel K be given by
K(x, t) = k(x − t),   x, t ∈ [0,1]ⁿ.
Then, for 0 < η ≤ (R/2)(σ√π)^{n/2} e^{−nσ²π²/8},
ln N(I_K(B_R), η) ≥ ln 2 { ( (2√2)/(σπ√n) ( ln(R/(2η)) + (n/2) ln(σ√π) )^{1/2} − 1 )ⁿ − 1 }.
Proof. Since k̂ is positive, we may use Theorem 5.23. Indeed, k̂(ξ) = (σ√π)ⁿ e^{−σ²‖ξ‖²/4}, so inf_{ξ∈[−Nπ,Nπ]ⁿ} k̂(ξ) = (σ√π)ⁿ e^{−σ²nN²π²/4}, and the condition of Theorem 5.23 holds for every N with N² ≤ (8/(σ²nπ²)) ( ln(R/(2η)) + (n/2) ln(σ√π) ). The stated bound follows by taking the largest such integer N.
5.5 On the smoothness of box spline kernels
The only result of this section, Proposition 5.25, shows a way to construct box spline kernels
with a prespecified smoothness r e N.
Proposition 5.25 Let B₀ = [b₁, ..., b_n] be an invertible n × n matrix. Let B = [B₀ B₀ ... B₀] be an s-fold copy of B₀ and k(x) = (M_B * M_B)(x) be induced by the convolution of the box spline M_B with itself. Finally, let X ⊂ Rⁿ, and let K : X × X → R be defined by K(x, y) = k(x − y). Then K ∈ C^r(X × X) for all r < 2s − n.
Proof. By Example 2.17, the Fourier transform k̂ of k(x) = (M_B * M_B)(x) satisfies
k̂(ξ) = Π_{j=1}ⁿ ( sin((ξ·b_j)/2) / ((ξ·b_j)/2) )^{2s}.
To get the smoothness of K we estimate the decay of k̂. First we observe that for t ∈ (−1, 1), |(sin t)/t| ≤ 1, so ((sin t)/t)² ≤ 1 ≤ 4/(1 + t²) there. Also, when t ∉ (−1, 1), |(sin t)/t| ≤ 1/|t|, so ((sin t)/t)² ≤ 1/t² ≤ 4/(1 + t²). Hence, for all t ∈ R,
((sin t)/t)² ≤ 4/(1 + t²).
It follows that for all ξ ∈ Rⁿ,
Π_{j=1}ⁿ ( sin((ξ·b_j)/2) / ((ξ·b_j)/2) )² ≤ 4ⁿ / Π_{j=1}ⁿ (1 + |(ξ·b_j)/2|²).
Moreover,
Π_{j=1}ⁿ (1 + |(ξ·b_j)/2|²) ≥ 1 + Σ_{j=1}ⁿ |(ξ·b_j)/2|² = 1 + (1/4) Σ_{j=1}ⁿ |ξ·b_j|².
If we denote η = B₀ᵀξ, then ξ·b_j = b_jᵀξ = η_j, so Σ_{j=1}ⁿ |ξ·b_j|² = ‖η‖². But ‖η‖² = ‖B₀ᵀξ‖² ≥ λ₀² ‖ξ‖², where λ₀ is the smallest (in modulus) eigenvalue of B₀. It follows that for all ξ ∈ Rⁿ,
k̂(ξ) ≤ ( 4ⁿ / (1 + (λ₀²/4)‖ξ‖²) )^s ≤ ( 4ⁿ / min{1, λ₀²/4} )^s (1 + ‖ξ‖²)^{−s}.
Therefore,
∫_{Rⁿ} (1 + ‖ξ‖²)^{p/2} k̂(ξ) dξ ≤ ( 4ⁿ / min{1, λ₀²/4} )^s ∫_{Rⁿ} (1 + ‖ξ‖²)^{(p/2) − s} dξ < ∞
for any p < 2s − n, and hence K ∈ C^r(X × X) for all r < 2s − n.
5.6 References and additional remarks
Properties of function spaces on bounded domains X are discussed in [120]. In particular, one can find conditions on X (such as
having a minimally smooth boundary) for the extension of function classes on a bounded domain X to the corresponding classes on
Rn.
Estimating covering numbers for various function spaces is a standard theme in the fields of function spaces [47] and
approximation theory [78, 100]. The upper and lower bounds (5.2) for generalized Lipschitz spaces and, more generally, TriebelLizorkin spaces can be found in [47].
The upper bounds for covering numbers of balls of RKHSs associated with Sobolev smooth kernels described in Section 5.2
(Theorem 5.8) and the lower bounds given in Section 5.4 (Theorem 5.21) can be found in [156]. The bounds for analytic translation
invariant kernels discussed in Section 5.3 are taken from [155].
The bounds (5.23) for the Fourier transform of the inverse multiquadrics can be found in [82] and [105], where properties of
nodal functions and Proposition 5.20 can also be found.
For estimates of smoothness of general box splines sharper than those in Proposition 5.25, see [41].
In Chapter 4 we characterized the regression functions and kernels for which the approximation error has a decay of order O(R^{−θ}). This characterization was in terms of the integral operator L_K and interpolation spaces. In this chapter we continue this discussion.
We first show, in Theorem 6.2, that for a C^∞ kernel K (and under a mild condition on ρ_X) the approximation error can decay as O(R^{−θ}) only if f_ρ is C^∞ as well. Since the latter is too strong a requirement on f_ρ, we now focus on regression functions and kernels for which a logarithmic decay in the approximation error holds. Our main result, Theorem 6.7, is very general and allows for several applications. Theorem 6.1, which will be proved in Section 6.4, shows some such consequences for our two main examples of analytic kernels.
Theorem 6.1 Let X be a compact subset of Rⁿ with piecewise smooth boundary and f_ρ ∈
6.1 Polynomial decay of the approximation error for C^∞ kernels
In this section we use Corollary 4.17, Theorem 5.5, and the embedding relation (5.1) to prove that a C^∞ kernel cannot yield a polynomial decay in the approximation error unless f_ρ is C^∞ itself, assuming a mild condition on the measure ρ_X.
We say that a measure ν dominates the Lebesgue measure on X when dν(x) ≥ C₀ dx for some constant C₀ > 0.
Theorem 6.2 Assume X ⊂ Rⁿ has piecewise smooth boundary and K is a C^∞ Mercer kernel on X. Assume as well that ρ_X dominates the Lebesgue measure on X. If for some θ > 0
inf_{‖g‖_K ≤ R} ‖f_ρ − g‖²_{L²_{ρ_X}(X)} = O(R^{−θ}),
then f_ρ is C^∞ on X.
Proof. Since ρ_X dominates the Lebesgue measure μ, we have that ρ_X is nondegenerate. Hence H̄_K^+ = H̄_K by Remark 4.18. By Corollary 4.17, our decay assumption implies that f_ρ ∈ (L²_{ρ_X}(X), H_K)_{θ/(2+θ)}. We show that for all s > 0, f_ρ ∈ Lip*(s, L²_μ(X)). To do so, we take r ∈ N with r ≥ 2s(2+θ)/θ > s, and t ∈ Rⁿ. Let g ∈ H_K and x ∈ X_{r,t}. Then
Δ_t^r f_ρ(x) = Δ_t^r (f_ρ − g)(x) + Δ_t^r g(x) = Σ_{j=0}^r (−1)^{r−j} (r choose j) (f_ρ − g)(x + jt) + Δ_t^r g(x).
Let ℓ = 2s(2+θ)/θ. Using the triangle inequality and the definition of ‖·‖_{Lip*(ℓ/2, C(X))}, it follows that
( ∫_{X_{r,t}} |Δ_t^r f_ρ(x)|² dx )^{1/2} ≤ 2^r ‖f_ρ − g‖_{L²_μ(X)} + √(μ(X)) ‖Δ_t^r g‖_{C(X)} ≤ 2^r ‖f_ρ − g‖_{L²_μ(X)} + √(μ(X)) ‖g‖_{Lip*(ℓ/2, C(X))} ‖t‖^{ℓ/2}.
Since K is C^∞, g ∈ H_K, and r > ℓ/2, we can apply Theorem 5.5 to deduce that
‖g‖_{Lip*(ℓ/2, C(X))} ≤ √( 2^{r+1} ‖K‖_{Lip*(ℓ)} ) ‖g‖_K.
Also, dρ_X(x) ≥ C₀ dx implies ‖f_ρ − g‖_{L²_μ(X)} ≤ (1/√C₀) ‖f_ρ − g‖_{L²_{ρ_X}(X)} and μ(X) ≤ 1/C₀. By taking the infimum over g ∈ H_K, we see that
( ∫_{X_{r,t}} |Δ_t^r f_ρ(x)|² dx )^{1/2} ≤ ( 2^r/√C₀ + √( 2^{r+1} ‖K‖_{Lip*(ℓ)} / C₀ ) ) inf_{g∈H_K} { ‖f_ρ − g‖_{L²_{ρ_X}(X)} + ‖t‖^{ℓ/2} ‖g‖_K }.
Since f_ρ ∈ (L²_{ρ_X}(X), H_K)_{θ/(2+θ)}, by the definition of the interpolation space in terms of the K-functional, we have
inf_{g∈H_K} { ‖f_ρ − g‖_{L²_{ρ_X}(X)} + ‖t‖^{ℓ/2} ‖g‖_K } = K(f_ρ, ‖t‖^{ℓ/2}) ≤ C_ρ ‖t‖^{ℓθ/(2(2+θ))} = C_ρ ‖t‖^s,
where C_ρ may be taken as the norm of f_ρ in the interpolation space. It follows that
‖f_ρ‖_{Lip*(s, L²_μ(X))} = sup_{t∈Rⁿ} ‖t‖^{−s} ( ∫_{X_{r,t}} |Δ_t^r f_ρ(x)|² dx )^{1/2} ≤ ( 2^r/√C₀ + √( 2^{r+1} ‖K‖_{Lip*(ℓ)} / C₀ ) ) C_ρ < ∞.
Therefore, f_ρ ∈ Lip*(s, L²_μ(X)). By (5.1) this implies f_ρ ∈ C^d(X) for any integer d < s − n/2. But s can be arbitrarily large, from which it follows that f_ρ ∈ C^∞(X).
6.2 Measuring the regularity of the kernel
The approximation error depends not only on the regularity of the approximated function but also on the regularity of the Mercer kernel. We next measure the regularity of a Mercer kernel K on a finite set of points x = {x₁, ..., x_m} ⊂ X. To this end, we introduce the following function:
ε_K(x) := sup_{x∈X} inf_{w∈Rᵐ} { K(x, x) − 2 Σ_{i=1}^m w_i K(x, x_i) + Σ_{i,j=1}^m w_i K(x_i, x_j) w_j }^{1/2}.   (6.1)
We show that by choosing the w_i appropriately, one has ε_K(x) → 0 as x becomes dense in X. It is the order of decay of ε_K(x) with respect to the density of x in X that now measures the regularity of functions in H_K. The faster the decay, the more regular the functions.
As an example of how ε_K(x) measures the regularity of functions in H_K, suppose that for some 0 < s ≤ 1 the kernel K is Lip(s); that is,
|K(x, y) − K(x, t)| ≤ C (d(y, t))^s,   ∀x, y, t ∈ X,
where C is a constant independent of x, y, t. Define the number
d_x := max_{x∈X} min_{i≤m} d(x, x_i)
to measure the density of x in X. Let x ∈ X. Choose x_t ∈ x such that d(x, x_t) ≤ d_x. Set the coefficients {w_j}_{j=1}^m as w_t = 1 and w_j = 0 if j ≠ t. Then
K(x, x) − 2 Σ_{i=1}^m w_i K(x, x_i) + Σ_{i,j=1}^m w_i K(x_i, x_j) w_j = K(x, x) − 2K(x, x_t) + K(x_t, x_t).
The Lip(s) regularity and the symmetry of K yield
K(x, x) − 2K(x, x_t) + K(x_t, x_t) ≤ 2C (d(x, x_t))^s ≤ 2C d_x^s.
Hence
ε_K(x) ≤ (2C)^{1/2} d_x^{s/2}.
In particular, if X = [0,1] and x = {j/N}_{j=0}^N, then d_x ≤ 1/(2N), and therefore ε_K(x) ≤ (2C)^{1/2} (2N)^{−s/2}.
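As a small numerical sketch (not part of the original text): when K[x] is invertible, the inner infimum in (6.1) is the minimum of the quadratic Q of Proposition 5.20, so ε_K(x)² = sup_{x∈X} ( K(x,x) − k_xᵀ K[x]^{−1} k_x ) with k_x = (K(x, x_i))_i. The code below approximates the supremum on a grid; the Gaussian kernel, node sets, and grid are illustrative assumptions.

```python
import numpy as np

def eps_K(nodes, test_points, kernel):
    """Approximate eps_K(x) of (6.1).

    For invertible K[x] the inner minimum equals K(x,x) - k_x^T K[x]^{-1} k_x
    (Proposition 5.20); the sup over X is approximated on `test_points`.
    """
    Kx_inv = np.linalg.inv(kernel(nodes, nodes))
    vals = []
    for x in test_points:
        kx = kernel(nodes, np.atleast_1d(x))[:, 0]
        q = kernel(np.atleast_1d(x), np.atleast_1d(x))[0, 0] - kx @ Kx_inv @ kx
        vals.append(max(q, 0.0))       # guard tiny negative values from round-off
    return np.sqrt(max(vals))

if __name__ == "__main__":
    def K(s, t, sigma=0.1):
        return np.exp(-np.subtract.outer(s, t) ** 2 / sigma ** 2)
    grid = np.linspace(0.0, 1.0, 400)
    for N in (5, 10, 20):
        nodes = np.linspace(0.0, 1.0, N + 1)   # x = {j/N}
        print(N, eps_K(nodes, grid, K))        # decreases as the nodes become dense
```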
For f ∈ C^s[0,1] and t ∈ [0,1], Taylor's expansion of f about t with integral remainder gives, for y ∈ [0,1],
R_s(f)(y) = (1/(s−1)!) ∫_t^y (y − u)^{s−1} f^{(s)}(u) du,   |R_s(f)(y)| ≤ (|y − t|^s / s!) ‖f^{(s)}‖ ≤ (1/s!) ‖f^{(s)}‖.
Here R_s(f) is the linear operator representing the remainder of the Taylor expansion. It satisfies
| Σ_{l=0}^{s−1} w_{l,s−1}(t) ( f(l/(s−1)) − f(t) ) | ≤ ( (s−1) 2^{s−1} / s! ) ‖f^{(s)}‖.   (6.2)
Proof. Let x ∈ X. Then x = (β + t)/N for some β ∈ {0, 1, ..., N − s + 1}ⁿ and t ∈ [0,1]ⁿ. Choose the coefficients in the definition of ε_K(x) to be
w_α = w_{γ₁,s−1}(t₁) ··· w_{γ_n,s−1}(t_n) if α = β + γ with γ ∈ {0, ..., s−1}ⁿ, and w_α = 0 otherwise.
Using Lemma 5.9 for the indices γ_j with j ≠ i, we conclude that for a function f on [0,1]ⁿ and for i = 1, ..., n,
| Σ_{γ∈{0,...,s−1}ⁿ} w_{γ,s−1}(t) f(γ/(s−1)) − f(t) | ≤ n ((s−1) 2^{s−1})^{n−1} ( (s−1) 2^{s−1} / s! ) ‖∂^s f / ∂t_i^s‖_{C([0,1]ⁿ)},
replacing γ_i/(s−1) by u each time for one i ∈ {1, ..., n}. Applying this bound in the y variable of K, rescaled to the subcube (β + [0,1]ⁿ)/N, we obtain
K(x, x) − 2 Σ_α w_α K(x, x_α) + Σ_{α,α'} w_α w_{α'} K(x_α, x_{α'}) ≤ ( 1 + ((s−1) 2^{s−1})ⁿ ) n ((s−1) 2^{s−1})ⁿ ( s^s / (s! N^s) ) ‖∂^s K / ∂y^s‖_{C(X×X)},
which bounds ε_K(x)² and yields a decay of ε_K(x) of order N^{−s/2}.
The behavior of the quantity ε_K(x) is better if the kernel is of convolution type, that is, if K(x, y) = k(x − y) for an analytic function k.
Theorem 6.4 Let X = [0,1]ⁿ and K(x, y) = k(x − y) be a Mercer kernel on X with
k̂(ξ) ≤ C₀ e^{−λ‖ξ‖},   ∀ξ ∈ Rⁿ,
for some constants C₀ > 0 and λ > 4 + 2n ln 4. Then, for x = {α/N}_{α∈{0,1,...,N−1}ⁿ} with N ≥ 4n / ln min{e^λ, 4^{−n} e^{λ/2}}, we have
ε_K(x) ≤ 4 √C₀ ( max{ e^{−λ}, 4ⁿ e^{−λ/2} } )^{N/4}.
Proof. Let X_N := {0, ..., N−1}ⁿ. For a fixed x ∈ X, choose the coefficients w_i in (6.1) to be w_{α,N}(x). Then the expression in the definition of ε_K(x) becomes Q_N(x) given by (5.13). It follows that ε_K(x)² ≤ sup_{x∈X} Q_N(x). Hence, by (5.14), ε_K(x)² ≤ T_k(N). But the assumption on the kernel here verifies the condition in Theorem 5.15. Thus, we can apply the estimate (5.21) for T_k(N) to draw our conclusion here.
Recall the nodal functions {u_i = u_{i,x}}_{i=1}^m associated with a finite subset x of X, given by (5.25). We use them in the RKHS H_K on a compact metric space (X, d) to construct an interpolation scheme. This scheme is defined as follows:
I_x(f)(x) = Σ_{i=1}^m f(x_i) u_i(x),   x ∈ X, f ∈ C(X).   (6.3)
It satisfies I_x(f)(x_i) = f(x_i) for i = 1, ..., m.
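A minimal Python sketch of the scheme (6.3) follows (added for illustration). Since Σ_i f(x_i) u_i = Σ_j (K[x]^{−1} f|_x)_j K_{x_j}, the interpolant can be evaluated from one linear solve; the Gaussian kernel, node set, and test function are illustrative assumptions.

```python
import numpy as np

def interpolant(nodes, fvals, kernel):
    """Interpolation scheme (6.3): I_x(f)(x) = sum_i f(x_i) u_i(x),
    with the nodal functions u_i of (5.25)."""
    Kx_inv = np.linalg.inv(kernel(nodes, nodes))   # assumes K[x] invertible
    coef = Kx_inv @ fvals                          # so I_x(f) = sum_j coef_j K_{x_j}
    def Ixf(x):
        return kernel(np.atleast_1d(x), nodes) @ coef
    return Ixf

if __name__ == "__main__":
    def K(s, t, sigma=0.2):
        return np.exp(-np.subtract.outer(s, t) ** 2 / sigma ** 2)
    f = lambda x: np.sin(2 * np.pi * x)
    nodes = np.linspace(0.0, 1.0, 10)
    Ixf = interpolant(nodes, f(nodes), K)
    print(np.allclose(Ixf(nodes), f(nodes), atol=1e-8))     # interpolation property
    grid = np.linspace(0.0, 1.0, 200)
    print(np.max(np.abs(Ixf(grid) - f(grid))))               # uniform error on a grid
```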
The error of the interpolation scheme for functions in HK can be estimated as follows.
Proposition 6.5 Let K be a Mercer kernel and x = {x₁, ..., x_m} ⊂ X such that K[x] is invertible. Define the interpolation scheme I_x by (6.3). Then, for f ∈ H_K,
‖I_x(f) − f‖_{C(X)} ≤ ε_K(x) ‖f‖_K
and
‖I_x(f)‖_K ≤ ‖f‖_K.
Proof. For x ∈ X,
I_x(f)(x) − f(x) = Σ_{i=1}^m f(x_i) u_i(x) − f(x) = Σ_{i=1}^m u_i(x) ⟨K_{x_i}, f⟩_K − ⟨K_x, f⟩_K = ⟨ Σ_{i=1}^m u_i(x) K_{x_i} − K_x, f ⟩_K,
the second equality by the reproducing property of H_K applied to f. By the Cauchy–Schwarz inequality in H_K,
|I_x(f)(x) − f(x)| ≤ ‖ Σ_{i=1}^m u_i(x) K_{x_i} − K_x ‖_K ‖f‖_K.
Since ⟨K_s, K_t⟩_K = K(s, t), we have
‖ Σ_{i=1}^m u_i(x) K_{x_i} − K_x ‖²_K = K(x, x) − 2 Σ_{i=1}^m u_i(x) K(x, x_i) + Σ_{i,j=1}^m u_i(x) K(x_i, x_j) u_j(x).
By Proposition 5.20, the quadratic function
Q(w) = K(x, x) − 2 Σ_{i=1}^m w_i K(x, x_i) + Σ_{i,j=1}^m w_i K(x_i, x_j) w_j
is minimized over Rᵐ at (u_i(x))_{i=1}^m. Therefore,
‖ Σ_{i=1}^m u_i(x) K_{x_i} − K_x ‖²_K ≤ ε_K(x)².
It follows that
|I_x(f)(x) − f(x)| ≤ ε_K(x) ‖f‖_K.
This proves the estimate for ‖I_x(f) − f‖_{C(X)}.
To prove the other inequality, note that, since I_x(f) ∈ H_K and I_x(f)(x_i) = f(x_i) for i = 1, ..., m,
0 = I_x(f)(x_i) − f(x_i) = ⟨K_{x_i}, I_x(f) − f⟩_K.
This means that I_x(f) − f is orthogonal to span{K_{x_i}}_{i=1}^m. Hence I_x(f) is the orthogonal projection of f onto span{K_{x_i}}_{i=1}^m and therefore ‖I_x(f)‖_K ≤ ‖f‖_K.
Proposition 6.5 bounds the interpolation error ‖I_x(f) − f‖_{C(X)} in terms of the regularity of K and the density of x in X measured by ε_K(x), when the approximated function f lies in the RKHS (i.e., f ∈ H_K).
In the remainder of this section, we deal with the interpolation error when the approximated function is from a larger function space (e.g., a Sobolev space), not necessarily from H_K. To this end, we need to know how large H_K is compared with the space where the approximated function lies. For a convolution-type kernel, that is, a kernel K(x, y) = k(x − y) with k̂(ξ) > 0, this depends on how slowly k̂ decays. So, in addition to the function ε_K measuring the smoothness of K, we use the function Λ_k : R₊ → R₊ defined by
Λ_k(r) := ( inf_{ξ∈[−rπ,rπ]ⁿ} k̂(ξ) )^{−1/2},
measuring the speed of decay of k̂. Note that Λ_k is a nondecreasing function. We also use the band-limited approximation f_M of f appearing in Lemma 6.6, whose Fourier transform is f̂ restricted to [−Mπ, Mπ]ⁿ.
i,j eX
= E (2n)n J ()ei(j/Ny d
jeXN e[M n M n ]n
2
< E (2n)~n f J (N )ej Nnd
n
jeXN e[n n ]
fM (N)Nn\2 d <Nn\\ f 2 
< (2n)n n
e[n ,n ]
Then
I Ix ( fM )\K <K [x ]  1 \ 2 N n  f Il L
2
But, by Proposition 5.22, tf [x] 1 2 <Nn(Ak(N))2. Therefore, lIx( fM)k <llf II2^k(N).
This proves the statement in (i).
fM (x)Ix( fM )(x) = (2n)n
e[M n M n ]
f () eix J2 u(x)eaj d.
,ix ,
JSXN
By the CauchySchwarz inequality,
n
ix
2 11/2
ix
j d
ii; Let x e X. Then
The first term on the right is bounded by  f L2 Ak(M), since k(f) > A2(M) for f e [Mn,Mn]n. The second term is
/
\ 1/2
k (0)  2^2 uj (x)k (x  xj) + '22 ui (x)k (xi  xj )uj (x) ,
jeX
ijeX
which can be bounded by Tk (N) according to (5.14) with {0, 1, . . . , N}n replaced by {0, 1, . . . ,
N  1}n. Therefore,
4\M n M n ]n
Thus, all the statements hold true.
0 < M < NR
s  n/2
Proof. Take N to be NR.
Let M e (0,N]. Set the function fM as in Lemma 6.6. Then, by Lemma 6.6, lIx(fM)IIK <   f 
2Ak(N) <R
and
\ f  Ix (fM )\\L2(X ) < \\fM  Ix (fM )\C (X) + \\f  fM \\L2 (X)
< Ak(M) f 2Tk(N) +\\f Is(nM)s.
If s > , then
ll f  fM 1C (X) < (2n)n
V(m H
He[MnMn]n
s  n/2
Hence the second statement of the theorem also holds.
Corollary 6.8 Let X = [0,1]n, s > 0, and f e Hs(Rn). If for some a1, a2, Ci, C2 > 0, one has
k(%) > C1(1 + %)a1, V% e R'and
Tk (N) < C2 N2 ,
VN e N,
(a1 +2 s)
if a1
2s
2a2
Y=
2s ai ,
if a1 + 2s < 2a2.
If, in addition, s > n/2, then
C(n2s)/ai2sn/2
I / R \Y'
3\l f 2 +
. = ~
Vs  n / 2
where
2
v'= ( aaa+ fai + s  n > a
I 2an,
if ai + 2s  n < 2a2.
Proof. By the assumption on the lower bound of k, we find
1_
Ak (r) <
(1 + Vn r)
ai/2
C
Corollary 6.9 Let X = [0,1]n, s > 0, and f e Hs(Rn). If for some a1, a2, Si, S2, Ci, C2 > 0, one
has
and
Tk (N) < C2 exp {S2Na2}, VN e N, then, for R > (1 + A/C1) f 2,
inf
f gL2(X) < 2= f 2 + SsM2sns/2ll f lls lnR + ln
IlglK <R
C1
where A, B are constants depending only on a1, a2, S1, S2, and
C1
Y s
2
k(f) > C1 exp {S1H a1},
Vf e Rn
Y=
02
2a
if a
0?
1 < a2
2 if 1 > 2.
If, in addition, s > 2, then, for R > (1 + A/C1)\\ f 2,
Y( 2 s)
C2 B
C1 \ry2
inf f  g H C(X) < B' = f 2 + f s ln R + ln
(X)
IlglK<R 1
<
C1IU2 +s
+
f 2
where B' is a constant depending only on a1, a2, S1, S2, n, and s.
Proof. By the assumption on the lower bound of k, we find
Ak (r) < = exp j y (nn r)1 J .
It follows that for R/ f 2 > max{1/Ci, 1/k(0)},
1/ n i i i ^ M
AH
R/U f
k
112)
> ln
RC[VM
12
nn
R Ci
,
U f U2
R\

M = ^  ln
vnn
inf u f  guL2(x) < C2^12 exp
11^ <R
C
2
,
V u2 f 2112
)
2
2 151 /1
llgk <R
x/CT
ln RCI V2,"1 + 5f,2V/2 f jj(ln RC11
27
2
where
j
is given in our statement. For R >  f 2A/CT, M < R < A 1 (R/ f 2)
6.4 Proof of Theorem 6.1
Now we can apply Corollary 6.9 to verify Theorem 6.1. We may assume that X c [0,1]n. Since X
has piecewise smooth boundary, every function f e Hs can be extended to a function F e
Hs(Rn) such that  f Hs(Rn) < CX  f 11Hs(X), where the constant CX depends only on X, not
on f e Hs(X).
i; From (5.17) and (5.18), we know that the condition of Corollary 6.9 is satisfied with
2
C1 = (a^)11, 81 = ^, ax = 2 and
C2 > 0, a2 = 1,82 = ln min{16n, 2n}.
Then the bounds given in Corollary 6.9 hold. Since a1 > a2, we have Y = 8 and the first
statement of Theorem 6.1 follows with bounds depending on CX .
ii; By (5.23), we see that the condition of Corollary 6.9 is valid. Moreover,
Ci > 0,
81 = c + e, a1 = 1
and
1
ec/2
C
.
2 > 0, a2 = 1, 82 = lnmin ec,
2
2
2
2
4n
Then a1 = a2 and the bounds of Corollary 6.9 hold with Y = 5. This yields the second
statement of Theorem 6.1.
6.5 References and additional remarks
The main results of this chapter can be found in [157].

On the bias-variance problem
Let K be a Mercer kernel, and HK its induced RKHS. Assume that
i;
K e Cs(X x X), and
ii;
the regression function fp satisfies fp e Range(LK/(4+20)), for some 0 > 0.
Fix a sample size m and a confidence 1 8, with 0 < 8 < 1. To each R > 0 we associate a
hypothesis space H = HKR, and we can consider fa and, for z e Zm, fz. The biasvariance
problem consists of finding the value of R that minimizes a natural bound for the error E(fz)
(with confidence 1 8). This value of R determines a particular hypothesis space in the family
of such spaces parameterized by R, or, to use a terminology common in the learning
literature, it selects a model.
Theorem 7.1 Let K be a Mercer kernel on X c Rn satisfying conditions (i) and (ii)
above.
(i) We exhibit, for each m e N and 8 e [0,1), a function
Em,8 = E: R+ ^ R
7.1 A useful lemma
To prove the second statement, let x > max{ (ℓ c_i)^{1/(s−q_i)} : i = 1, ..., ℓ }, where we set q₁ = 0. Then, for i = 1, ..., ℓ, c_i x^{q_i} < (1/ℓ) x^s. It follows that Σ_{i=1}^ℓ c_i x^{q_i} < x^s; that is, y(x) > 0.
Remark 7.3 Note that given c₁, c₂, ..., c_ℓ and s, q₁, q₂, ..., q_ℓ, one can efficiently compute (a good approximation of) x* using algorithms such as Newton's method.
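A minimal sketch of the computation mentioned in Remark 7.3 follows (added for illustration): Newton's iteration applied to y(x) = x^s − Σ_i c_i x^{q_i}, started to the right of the bound used in the proof of Lemma 7.2, where y is positive. The concrete coefficients and starting point are illustrative assumptions, not taken from the text.

```python
import numpy as np

def newton_positive_root(c, q, s, x0, tol=1e-12, max_iter=100):
    """Newton iteration for the positive root x* of y(x) = x**s - sum_i c[i]*x**q[i]."""
    c, q = np.asarray(c, float), np.asarray(q, float)
    x = float(x0)
    for _ in range(max_iter):
        y = x**s - np.sum(c * x**q)
        dy = s * x**(s - 1) - np.sum(c * q * x**(q - 1))
        x_new = x - y / dy
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    return x

if __name__ == "__main__":
    c, q, s = [2.0, 0.5], [0.0, 1.0], 3.0        # y(x) = x^3 - 2 - 0.5 x (illustrative)
    x0 = max((len(c) * ci) ** (1.0 / (s - qi)) for ci, qi in zip(c, q))
    x_star = newton_positive_root(c, q, s, x0)
    print(x_star, x_star**s - 2 - 0.5 * x_star)   # root and residual (close to 0)
```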
7.2 Proof of Theorem 7.1
We first describe the natural bound we plan to minimize. Recall that E (fz) equals the sum
PX
PX
(7.3)
By Lemma 7.2, it follows that there is a unique positive solution Q* of (7.3) and, thus, a
unique positive solution R* of A'(R) + e'(R) = 0. This solution is the only minimum of E since
E(R) ^ when R ^ 0 and when R ^.
We finally prove Part (iii). Note that by Lemma 7.2, the solution of the equation induced by
(7.2)
satisfies
/
/
\ d 1/(d +1)
v*(m, 8) < max
600ln(1/8) / 600C / 12 V\
mmV
II
Therefore, v*(m, 8) ^ 0 when m ^ TO. Also, since 1/R* is a root of (7.3), Lemma 7.2 applies
again to yield
1
( 4(Mp +IIfp ,)I/K II v* (m, 8) Y/(e+1)
max
R*<
22+dngii2+*e
)
(
2
1/(
2)
WK II v* (m, 8) \ *+ 
X
\ 22+*g2+*0
from which it follows that R* ^TO when m ^TO. Note that this implies that lim A(R*) < lim
g 2+0R0 = 0.
(22+e Ng 2+*e) Q* v*(m, 8)
X
22+0
Q2
= 0,
M2v*(m, S)
m^ro
m^ro
= lim (\\IK R*+ Mp + \\fp\\ro)2v*{m, S) = 0.
m^ro
This finishes the proof of the theorem.
7.3 A concrete example of bias-variance
Let R > 1 in the proof of Theorem 7.1. Then M < (IK\\+MP + \\fp ro)R and we may take
e(R) = (\\IKII +Mp + f \\ro)2R2v*(m, S)
as an upper bound for the sample error with confidence 1  S. Hence, under conditions (i) and
(ii), we may choose
E(R) = (\\IKII +Mp + \fp ro)2R2v*(m, S) + 22+e 2+eRe.
With this choice,
/ e \ ^/(2+e)
R* = (v*(m, S)) ^^MailK  +Mp + \\fp Wro)2/(2+6) ^
tends to infinity as m does so and
e/(2+e)
e \2/(7,+9)
E(R )
* = <2 + 2
X
me/((2+e>(1+2n/s)>
^0
as m ^ro.
PX
small e.
7.4 References and additional remarks
In this chapter we have considered a form of the biasvariance problem that optimizes the
parameter R, fixing all the others. One can consider other forms of the biasvariance problem
by optimizing other parameters. For instance, in Example 2.24, one can consider the degree
of smoothness of the kernel K. The smoother K is, the smaller HK is. Therefore, the sample
error decreases and the approximation error increases with a parameter reflecting this
smoothness.
We have already discussed the bias-variance problem in Section 1.5. Further ideas on this problem can be found in Chapter 9 of [18] and in [95].
Bounds for the roots of real and complex polynomials such as those in Lemma 7.2 are a standard theme in algebra going back to Gauss. A reference for several such
Least squares regularization
We now abandon the setting of a compact hypothesis space adopted thus far and change the
perspective slightly. We will consider as a hypothesis space an RKHS HK but we will add a
penalization term in the error to avoid overfitting, as in the setting of compact hypothesis
spaces.
In what follows, we consider as a hypothesis space H = H_K (that is, H is a whole linear space) and the regularized error E_γ defined by
E_γ(f) = ∫_Z (f(x) − y)² dρ + γ ‖f‖²_K
for a fixed γ > 0. For a sample z, the regularized empirical error E_{z,γ} is defined by
E_{z,γ}(f) = (1/m) Σ_{i=1}^m (y_i − f(x_i))² + γ ‖f‖²_K.
One can consider a target function fY minimizing EY (f) over HK and an empirical target fz,Y
minimizing Ez,Y over HK. We prove in Section 8.2 the existence and uniqueness of these target
and empirical target functions. One advantage of this new approach, which becomes apparent
from the results in this section, is that the empirical target function can be given an explicit
form, readily computable, in terms of the sample z, the parameter Y , and the kernel K.
Our discussion of Sections 1.4 and 1.5 remains valid in this context and the following
questions concerning fz,Y require an answer: Given Y > 0, how large is the excess
generalization error E (fz,Y)  E (fp)? Which value of Y minimizes the excess
generalization error? The main result of this chapter provides some answer to these
questions.
Theorem 8.1 Assume that K satisfies log N(B₁, η) ≤ C₀(1/η)^{s*} for some s* > 0, and ρ satisfies f_ρ ∈ Range(L_K^{θ/2}) for some 0 < θ ≤ 1. Take γ* = m^{−ζ} with ζ < 1/(1 + s*). Then, for every 0 < δ < 1 and m ≥ m_δ, with confidence 1 − δ,
∫_X ( f_{z,γ*}(x) − f_ρ(x) )² dρ_X ≤ C'₀ log(2/δ) m^{−θζ}
holds.
Here C'₀ is a constant depending only on s*, ζ, C_K, M, C₀, and ‖L_K^{−θ/2}f_ρ‖, and m_δ depends also on δ. We may take
m_δ := max{ (108/C₀)^{1/s*} (log(2/δ))^{1+1/s*}, (1/(2c))^{2/(ζ − 1/(1+s*))} },
where c = (2C_K + 5)(108C₀)^{1/(1+s*)}.
At the end of this chapter, in Section 8.6, we show that the regularization approach just
introduced and the minimization in compact hypothesis spaces considered thus far are closely
related.
The parameter y is said to be the regularization parameter. The whole approach
outlined above is called a regularization scheme.
Note that y* can be computed from knowledge of m and s* only. No information on fp is
required. The next example shows a simple situation where Theorem 8.1 applies and yields
bounds on the generalization error from a simple assumption on fp.
Example 8.2 Let K be the spline kernel on X = [1,1] given in Example 4.19. If pX is the
Lebesgue measure and \\fp(x +1) fP(x)\\L2([i,1t]) = O(te) for some0 < e < 1, then, by
the conclusion of Example 4.19 and Theorem 4.1, fp e Range^L^e}/2^ for any e > 0.
Theorem 5.8 also tells us that log N(B1, n) <
E
f
(fz,y* ) E{fp) = \\fz,y*
p II L2
L
PX
o
2
log Sm
(e/3)+e
C0 (1/n)2. So, we may take s* = 2. Choose y* = mZ with Z =
yields
with confidence 1 S.
12e
8.1 Bounds for the regularized error
Let X, K, f_γ, and f_{z,γ} = f_z be as above. Assume, for the time being, that f_γ and f_{z,γ} exist.
Theorem 8.3 Let f_γ ∈ H_K and f_z be as above. Then E(f_z) − E(f_ρ) ≤ E(f_z) − E(f_ρ) + γ‖f_z‖²_K, which can be bounded by
{ E(f_γ) − E(f_ρ) + γ‖f_γ‖²_K } + { E(f_z) − E_z(f_z) + E_z(f_γ) − E(f_γ) }.   (8.1)
Proof. Write
E(f_z) − E(f_ρ) + γ‖f_z‖²_K = { E(f_γ) − E(f_ρ) + γ‖f_γ‖²_K } + { E(f_z) − E_z(f_z) + E_z(f_γ) − E(f_γ) } + { E_{z,γ}(f_z) − E_{z,γ}(f_γ) }.
The definition of f_z implies that the last term is at most zero. Hence E(f_z) − E(f_ρ) + γ‖f_z‖²_K is bounded by (8.1).
The first term in (8.1) is the regularized error of f_γ. We denote it by D(γ); that is,
D(γ) := E(f_γ) − E(f_ρ) + γ‖f_γ‖²_K = inf_{f∈H_K} { E(f) − E(f_ρ) + γ‖f‖²_K }.   (8.2)
Note that by Proposition 1.8,
D(γ) = ‖f_γ − f_ρ‖²_ρ + γ‖f_γ‖²_K ≥ ‖f_γ − f_ρ‖²_ρ.
We call the second term in (8.1) the sample error (this use of the expression differs slightly from the one in Section 1.4).
In this section we give bounds for the regularized error. The bounds (Proposition 8.5 below) easily follow from the next general result.
Theorem 8.4 Let H be a Hilbert space and A a selfadjoint, positive compact operator
on H. Let s > 0 and Y > 0. Then
(i) For all a e H, the minimizer b of the optimization problem minbsH (b  a2 + Y lAsb2)
exists and is given by
b = (A2s + Y Id)1A2sa.
(ii) For 0 < 0 < 2s,
a
= J2
Y
k>1
A
k+Y
a
k Pk.
where a =
 a
2
k>1
Y
2
A
k+
a
Y
20
k
a2
k>1
A
k+Y
A
k+Y A
0
k
Assume A 0a\\ = {'^ka'2/Adk}1/2 <
A
0
TO.
<
Y0
\\A0a2.

 a
2 + Y A1?2 = ^(h + y =
k+Y y
k>1
(
_k + y)2
(N
k>1
_k + Y
(iii) For 0 <9 < 1, A lb = J2 k>i {(V_k )/(_k + _)) ak fk .Hence
which is bounded by
_k + _y \_k + _y _
Y9
(_)' < Y9A9all2
k>1
A 3(b  a) = Y,
Y
k>
(
Y)
1 _k + S/_
a
k fk.
iv; When 1 <9 < 3, we find that
\\A1(b  a)4 = J2
2
k>
(
)2
1 _k + _ _k
= Y9T
k>1
U1 1 < Y 91 A9 a,2.
Ak + _
_k + _
_
It follows that
Thus all the estimates have been verified.
Bounds for the regularized error D(γ) follow from Theorem 8.4.
Proposition 8.5 Let X ⊂ Rⁿ be a compact domain and K a Mercer kernel such that for some 0 < θ ≤ 2, f_ρ ∈ Range(L_K^{θ/2}). Then
‖f_γ − f_ρ‖_K ≤ γ^{(θ−1)/2} ‖L_K^{−θ/2} f_ρ‖_ρ.
Proof. Apply Theorem 8.4 with H = L²_{ρ_X}(X), s = 1, A = L_K^{1/2}, and a = f_ρ, and use that ‖L_K^{−1/2} f‖_ρ = ‖A^{−1} f‖_ρ = ‖f‖_K. We know that f_γ is the minimizer of
min_{f∈L²_{ρ_X}(X)} ( ‖f − f_ρ‖²_ρ + γ ‖f‖²_K ) = min_{f∈H_K} ( ‖f − f_ρ‖²_ρ + γ ‖f‖²_K ),
since ‖f‖_K = ∞ for f ∉ H_K. Our conclusion follows from Theorem 8.4 and Proposition 1.8.
8.2 On the existence of target functions
Let X, K, f_γ, and f_{z,γ} = f_z be as above. Since the hypothesis space H_K is not compact, the existence of f_γ and f_{z,γ} is not obvious. The goal of this section is to prove that both f_γ and f_{z,γ} exist and are unique. In addition, we show that f_{z,γ} is easily computable from γ, the sample z, and the kernel K on the compact metric space X.
Proposition 8.6 Let ν = ρ_X in the definition of the integral operator L_K. For all γ > 0 the function f_γ = (L_K + γ Id)^{−1} L_K f_ρ is the unique minimizer of E_γ over H_K.
Proof. Apply Theorem 8.4 with H = L²_{ρ_X}(X), s = 1, A = L_K^{1/2}, and a = f_ρ. Since, for all f ∈ H_K, ‖f‖_K = ‖A^{−1} f‖_ρ, the statement follows from Theorem 8.4(i).
Proposition 8.7 For all γ > 0, the minimizer f_{z,γ} of E_{z,γ} over H_K exists, is unique, and is given by f_{z,γ} = Σ_{i=1}^m a_i K_{x_i}, where a = (a₁, ..., a_m) is the unique solution of the well-posed linear system in Rᵐ
(γm Id + K[x]) a = y.
Here, we recall that K[x] is the m × m matrix whose (i, j) entry is K(x_i, x_j), x = (x₁, ..., x_m) ∈ Xᵐ, and y = (y₁, ..., y_m) ∈ Yᵐ such that z = ((x₁, y₁), ..., (x_m, y_m)).
Proof. Let H(f) = (1/m) Σ_{i=1}^m (y_i − f(x_i))² + γ ‖f‖²_K. Take ν to be a Borel, nondegenerate measure on X and L_K to be the corresponding integral operator.
Let {φ_k}_{k≥1} be an orthonormal basis of L²_ν(X) consisting of eigenfunctions of L_K, and let {λ_k}_{k≥1} be their corresponding eigenvalues. By Theorem 4.12, we can then write, for any f ∈ H_K, f = Σ_{λ_k>0} c_k φ_k with ‖f‖²_K = Σ_{λ_k>0} c_k²/λ_k.
For every k with λ_k > 0, ∂H/∂c_k = −(2/m) Σ_{i=1}^m (y_i − f(x_i)) φ_k(x_i) + 2γ c_k/λ_k. If f is a minimum of H, then, for each k with λ_k > 0, we must have ∂H/∂c_k = 0 or, solving for c_k,
c_k = λ_k Σ_{i=1}^m a_i φ_k(x_i),
where a_i = (y_i − f(x_i))/(γm). Thus,
f(x) = Σ_{λ_k>0} c_k φ_k(x) = Σ_{λ_k>0} λ_k Σ_{i=1}^m a_i φ_k(x_i) φ_k(x) = Σ_{i=1}^m a_i Σ_{λ_k>0} λ_k φ_k(x_i) φ_k(x) = Σ_{i=1}^m a_i K(x_i, x),
where we have applied Theorem 4.10 in the last equality. Replacing f(x_i) in the definition of a_i above we obtain
a_i = ( y_i − Σ_{j=1}^m a_j K(x_j, x_i) ) / (γm).
Multiplying both sides by γm and writing the result in matrix form we obtain (γm Id + K[x]) a = y, and this system is well posed since K[x] is positive semidefinite and the result of adding a positive semidefinite matrix and the identity is positive definite.
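A minimal NumPy sketch of this computation follows (added for illustration): it solves the linear system (γm Id + K[x])a = y and returns the regularized empirical target function f_{z,γ} = Σ_i a_i K_{x_i}. The Gaussian kernel, the noisy sample, and the value of γ are illustrative assumptions, not taken from the text.

```python
import numpy as np

def regularized_target(x, y, kernel, gamma):
    """Empirical target f_{z,gamma} = sum_i a_i K_{x_i}, where
    (gamma * m * Id + K[x]) a = y, as in the linear system above."""
    m = len(x)
    Kx = kernel(x, x)                                   # Gramian K[x]
    a = np.linalg.solve(gamma * m * np.eye(m) + Kx, y)
    return lambda t: kernel(np.atleast_1d(t), x) @ a

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    def K(s, t, sigma=0.2):
        return np.exp(-np.subtract.outer(s, t) ** 2 / sigma ** 2)
    x = rng.uniform(0.0, 1.0, 60)
    y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(60)   # noisy sample z
    f = regularized_target(x, y, K, gamma=1e-3)
    grid = np.linspace(0.0, 1.0, 5)
    print(f(grid))        # values of the regularized empirical target function
```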
8.3 A first estimate for the excess generalization error
In this section we bound the confidence for the sample error to be small enough. The main
result is Theorem 8.10.
In what follows we assume that
(8.3)
then
*,
_ f 108log( 1/5) (108C0y/(1+l
v (m, 5) < max
,
.
mm
s
Proof. Observe that g(n) < h(n) := C0(1 /n)  mn. Since h is also strictly decreasing and
continuous on (0, +x), we can take A to be the unique positive solution of the equation h(t) =
log 5.Weknowthat v*(m, 5) < A.Theequation h(t) = log 5 can be expressed as
t1+s* 54log(1/5) s* 54C0 = 0. mm
Then Lemma 7.2 with d = 2 yields A < max{108log(1/5)/m, (108C0/m)1/(1+s J}. This verifies the
bound for v*(m, 5).
As an example, for Cs kernels on X c Rn, the decay of v*(m, 8) shown in Lemma 8.9 with
s* = 2 yields the following convergence rate.
(fz(x) fp(x)f dpx < Ci X
lg(2) +
1
mY m1/(1+s*)Y
+ Y9 +
Corollary 8.12 Assume that K satisfies logN (B\, n) < C0(1/n)s* for some s* > 0, and p
satisfies fp e Range(LK/2) for some 0 <9 < 1. Then, for all Y e (0,1] and all 0 < 8 < 1,
with confidence 1 8,
log
holds, where C1 is a constant depending only on s, CK, M, C0, and \\LK9/2fp . IfY = m1/
((1+9)(1+s*)), then the convergence rate is Jx(fz (x) fp(x))2 dpx < 6C1 log(2/8)m9/((1+9)
(1 s ))
+* .
Proof. The proof is an easy consequence of Theorem 8.10, Lemma 8.9, and Proposition 8.5.
For C kernels, s* can be arbitrarily small. Then the decay rate exhibited in Corollary 8.12 is
m(1/2)+e for any e > 0, achieved with 9 = 1. We improveTheorem 8.10 in the next section,
where more satisfactory bounds (with decay rate m1+s) are presented. The basic ideas of the
proof are included in this section.
To move toward the proof of Theorem 8.10, we write the sample error as
1 m
E (fz )
 Ez(fz ) +
Ez (fy) E (fy) =
E(fi)  fl(zt )
m
where
g := (fz(x)
 y)2  (fp(x)
y)2 and & :=
(fy (x)  yf
2
(fp(x)  y) .
The second term on the righthand side of (8.4) is about the random variable f2on Z. Since
its mean E(f2) = E (fY) E (fp) is nonnegative, we may apply the Bernstein inequality to
m
m
J2&(Zt)  E(f2) <
4Ck log (1/8
k
+ 3M +
+ 3M + 1 D(y)
Proposition 8.14 For every 0 < 8 < 1, with confidence at least 1  8,
)  Efe) < t
1=1
which implies that a2(&) < E(f) < cV(y). Now we apply the oneside Bernstein inequality in
Corollary 3.6 to f2. It asserts that for any t > 0,
1  exp > 1  exp 2dp(y) + t)
mt
2( a 2fe) + 1 Bt)
mt
m
m
< t*
)  Efe)
i=1 holds. But
t* = ^y log(1/5) + ^^y log(1/5^ + 2cmlog(1/5)V(y) j jm
< 4cy + ^2clog(1/5)V(y)/m.
By Lemma 8.13, c < 2CV(y)/y + 18M2. It follows that
^2clog(1/5)V(y)/m
t* <
8C log(l/g) 3my
3M log(l/5)
m
D(y) +
72M2 log(l/5)
3m
+ 2CK
log(l/5)
/my
D(y)
+ 3MD(y).
This implies the desired estimate.
The first term on the righthand side of (8.4) is more difficult to deal with because f1
involves the sample z throughfz. We use a result from Chapter 3, Lemma 3.19, to bound this
term by means of a covering number. For R > 0, define FR to be the set of functions from Z to
R
FR := {(f (x)  y)2  (fp (x)  y)2 :f e BRJ .
(8.5)
<
f GBR VE (f ) (fp) +
Bi,
e
(CK + 3)2R2
exp
me
54(CK + 3)2R2 '
Proof. Consider the set FR. Each function g e FR has the form g(z) = (f (x)  yf  (.fp(x)  y)2
with f e BR. Hence E(g) = E(f)  E(fp) = II f  fp II 2 > 0, Ez (g) = Ez (f)  Ez(fp), and
g(z) = (f (x)  fp(x)){ (f (x)  y) + (fp(x)  y)}.
Since f OT < CKf K < CKR and fp(x)\ < M almost everywhere, we find
that
( )( )
Vn > 0.
(8.6)
Since an n/(2(MR + CKR2))covering of Bi yields an n/(2(M + CKR)) covering of BR, and vice
versa, we see that for any n > 0, an n/(2(MR + CKR2))covering of B1 provides an ncovering of
FR. That is,
But R > M and 2(1 + CK) < (CK + 3)2. So our desired estimate follows. Now we can derive
the error bounds. For R > 0, denote
W(R) := {z e Zm : fzK < R}.
Proposition 8.16 For all 0 < 8 < 1 and R > M, there is a set VR c Zm with p(VR) < 8 such
that for all z e W(R) \ VR, the regularized error EY(fz) = E(fz)  E(fp) + Y\\fzl\K
isboundedby
2(CK + 3)2R2v*(m, 8/2) + 8CK lcg2/8) + 6M + 4 D(Y)
MY
(48M2 + 6M) log (2/8)Proof. Note that ^E(f)  E(fp) + e^fe < 1 (E(f)  E(fp)) + e. Using the
quantity v*(m, S), Proposition 8.15 with e = (CK + 3)2R2v*(m, ) tells us that there is a set
VR c Zm of measure at most  such that
E(f)  E(fp)  (Ez(f)  Ez(fp)) < 2(E(f)  E(fp)) + (CK + 3)2
R2v*(m, S/2),
Vf e BR, z e Zm \ VR.
In particular, when z e W(R) \ VR, fz e BR and
i=1
E(x)  m^ ^(Zi ) = E (fz)  E (fp)  (Ez(fz )  Ez (fp))
< 1(E(fz)  E(fp)) + (CK + 3)W(m, 5/2).
Now apply Proposition 8.14 with 5 replaced by . We can find another set VR c Zm of
measure at most 2 such that for all z e Zm \ VR,
1m
( 4CK log (2/5)
&(Zi)  E(fc) < K
+ 3M + 1 )V(y)
m
my
Y IIfz III < z,y (fz) < z,y (0) = m ^  0)2 < M2,
the last almost surely. Therefore, fzK < M/^/y for almost all z e Zm.
Lemma 8.17 says that W(MU/V) = Zm up to a set of measure zero (we ignore this null set
later). Take R := M/^/y > M. Theorem 8.10 follows from Proposition 8.16.
8.4 Proof of Theorem 8.1
In this section we improve the excess generalization error estimate of Theorem 8.10. The
method in the previous section was rough because we used the bound fzK < M/^/y shown in
Lemma 8.17. This is much worse than the bound for fY given in Lemma 8.13, namely, fY K <
VD(y)U/Y. Yet we expect the minimizer fz of Ez,Y to be a good approximation of the minimizer
fY of EY . In particular, we expect fzK also to be bounded by, essentially, S/D(Y)U/Y . We
prove that this is the case with high probability by applying Proposition 8.16 iteratively. As a
consequence, we obtain better bounds for the excess generalization error.
Lemma 8.18 For all 0 < 8 < 1 and R > M, there is a set VR c Zm with P(VR) < 8 such that
W(R) W(amR + bm) U VR,
where am := (2CK + 5^/v*(m, 8/2)/y and
Y
m\
V(y) + (7M + 1) 2 log (2/5 )Proof. By Proposition 8.16, there is a set VR c Zm with p(VR) < 5
such that for all z e W(R) \ VR,
Y y fz IlK < 2(CK + 3)W (m, S/2) + 8CK loe(2/S) + 6M + 4 D(y)
\
my
J
2
(48M + 6M) log(2/5) m
This implies that
Lemma 8.19 Assume that K satisfies logN (B\, n) < C0(1/n)s for some s* > 0. Take Y =
mZ with Z < 1/(1 + s*). For all 0 < 8 < 1 and m > ms, with confidence 1  38/( 1/(1 +
s*)  Z
IIfzIlK < C^/log(2/5)^D(Y)/Y + 1)
holds. Here C2 > 0 is a constant that depends only on s*, Z, CK, C0, and M, and ms e N
depends also on 8.
Proof. By Lemma 8.9, when m > (108/C0)1/s (log(2/8))1+1/s ,
v*(m, 8/2) < (108C0/m)1/(1+s*>
(8.7)
1/(2
(8.8)
2/(Z1/(1
+ bm ^ ^ am < Me
j=0
JmJ (f/21/(2+2s*))+f/2
+ bm < Me + bn
It follows that the measure of the set W(R(J)) is at least 1  J8 > 1  3 8/( 1/(1 + s*)  Z) .By the
definition of the sequence, we have
Here we have used (8.8) and am < 2 in the first inequality, and then the restriction J > 2/(1 /(1 +
s*)  Z) > Z/(1/(1 + s*)  Z) in the second
inequality. Note that cJ < ((2CK + 5)(108Co + 1))3/^1/(1+s ) ^. Since Y = mZ, bm can be bounded
as
bm <J2 log(2/5
2CK + V6M + 4)7V(y)/Y + 7M + 1 .
Thus, R(J) < C2^log(2/5)(VD(y)/y + 1) with C2 depending only on s*, f, CK, C0, and M.
When s* ^ 0 and 0 ^ 1, we see that the convergence rate 0Z can be arbitrarily close to 1.
8.5 Reminders V
Then, there exist real numbers p, X, not both zero, such that
pDF (c) + XDH (c) = 0.
(8.9)
If p = 0 above, we can take p = 1 and we call the resulting X the Lagrange multiplier of
the problem at c.
8.6 Compactness and regularization
Let z = (z1,..., zm) with zi = X, y0 e X x Y for i = 1,..., m. We also write x = (x1,...,xm) and y =
(y1,...,ym). Assume that y = 0 and K[x] is invertible. Let a* = K[x]1y, f * = YT=1 a*KXi, and
= \\f * \\2K = yK[x]1y.
Let EZ(Y) and Ez(R) be the problems
1m
min
(f (xi)  yi)2 + YII f \\K
i=1 s.t. f e HK
and
m
min m
i=1
(f (xt)  yi)2
Proposition 8.23
i;
For all Y > 0,fz,Y is the minimizer of Ez(Az(y)).
ii; Let R e (0, R0). Thenfz,R is the minimizer of Ez (Tz (R)).
Proof. Assume that
1
m
di
y'2
\l fz,y IIK = (Ym + di)3
which is positive for all Y e [0, +ro) since y = 0 by assumption. Differentiating with respect to Y
,
This expression is negative for all Y e [0, +ro). This shows that Az is strictly decreasing in its
domain. The first statement now follows since Az is continuous, Az(0) = R0, and AZ(Y) ^ 0 when
Y ^^.
To prove the second statement, consider Y > 0. Then, by Proposition 8.23(i) and (ii),
fz,Y = fz,Az (Y) = fz,Tz(Az(Y)).
To prove that Y = TZ(AZ(Y)), it is thus enough to prove that for Y, Y' e (0, +TO), iffz,Y = fz,Y/, then
Y = Y'. To do so, let i be such that y = 0 (such an i exists since y = 0). Since the coefficient
vectors forfz,Y andfz,Y/ are the same a with P1a = (y/(Ym + di^^x, we have in particular
y;
y;
,
Y m + di
Y 'm + di
whence it follows that Y = Y'.
Corollary 8.25 For all R < R0, the minimizer fz,R ofEz (R) is unique.
Proof. Let Y = rz(R). Thenfz,R = fz,Y by Proposition 8.23(ii). Now use that fz,Y is unique.
Theorem 8.21 now follows from Propositions 8.23 and 8.24 and Corollary 8.25.
Remark 8.26 Let E(y) and E(R) be the problems
min J (f (x)  y)2 dp + Y II f IlK s.t. f e HK
and
min
s.t. f e B(HK, R),
respectively,whereR, Y > 0. DenotebyfY andfR their minimizers, respectively. Also, let R1 = llfp
K if fp e HK and R1 = ro otherwise. A development similar to the one in this section shows the
existence of a decreasing global homeomorphism
A : (0 , +ro) ^ (0 ,Ri)
satisfying
8.7 References and additional remarks
The problem of approximating a function from sparse data is often ill posed. A standard
approach to dealing with illposedness is regularization theory [36,54,64,102,130].
Regularization schemes with RKHSs were introduced to learning theory in [137] using spline
kernels and in [53,134,133] using general Mercer kernels. A key feature of RKHSs is ensuring
that the minimizer of the regularization scheme can be found in the subspace spanned by
{KXi }m=1. Hence, the minimization over the possibly infinitedimensional function space HK
is reduced to minimization over a finitedimensional space [50]. This follows from the
reproducing property in RKHSs. We have devoted Section 2.8 to this feature. It is extended to
other contexts in the next two chapters.
The error analysis for the least squares regularization scheme was considered in [38] in
terms of covering numbers. The distance between fz,Y and fY was studied in [24] using stability
analysis. In [153], using leaveoneout techniques,
E (E fy)) <
2C2 2
2C
PX
These results are capacityindependent error bounds. The error analysis presented in this
chapter is capacity dependent and was mainly done in [143]. When fp e HK and s* < 2, the
learning rate given by Theorem 8.1 is better than capacityindependent ones.
A proof of Proposition 8.20 can be found, for instance, in [11].
For some applications, such as signal processing, inverse problems, and numerical
analysis, the data (xi )f=1 may be deterministic, not randomly drawn according to pX
.Then the regularization scheme inff e%K z,Y (f) involves only the random data
(yi)f=1. For active learning [33, 81], the data (xi)f=1 are drawn according to a userdefined distribution that is different from pX. Such schemes and their connections
Support vector machines for classification
In the previous chapters we have dealt with the problem of learning a function f : X ^ Y when
Y = R. We have described algorithms producing an approximation fz of f from a given sample
z e Zm and we have measured the quality of this approximation with the generalization error E
as a ruler.
Although this setting applies to a good number of situations arising in practice, there are
quite a few that can be better approached. One paramount example is that described in Case
1.5. Recall that in this case we dealt with a space Y consisting of two elements (in Case 1.5
they were 0 and 1). Problems consisting of learning a binary (or finitely) valued function are
called classification problems. They occur frequently in practice (e.g., the determination,
from a given sample of clinical data, of whether a patient suffers a certain disease), and they
will be the subject of this (and the next) chapter.
A binary classifier on a compact metric space X is a function f : X → {−1, 1}. To provide some continuity in our notation, we denote Y = {−1, 1} and keep Z = X × Y. Classification problems thus consist of learning binary classifiers. To measure the quality of our approximations, an appropriate notion of error is essential.
Definition 9.1 Let ρ be a probability distribution on Z := X × Y. The misclassification error R(f) for a classifier f : X → Y is defined to be the probability of a wrong prediction, that is, the measure of the event {f(x) ≠ y},
R(f) := Prob_{z∈Z}{ f(x) ≠ y } = ∫_X Prob( y ≠ f(x) | x ) dρ_X.   (9.1)
Our target concept (in the sense of Case 1.5) is the set T := {x ∈ X : Prob{y = 1 | x} ≥ 1/2}, since the conditional distribution at x is a binary distribution.
One goal of this chapter is to describe an approach to producing classifiers from samples
(and an RKHS HK) known as support vector machines.
Needless to say, we are interested in bounding the misclassification error of the classifiers
obtained in this way. We do so for a certain class of noisefree measures that we call weakly
separable. Roughly speaking (a formal definition follows in Section 9.7), these are measures
for which there exists a function fsp e HK such that x e T fsp (x) > 0 and satisfy a decay
condition near the boundary of T. The situation is as in Figure 9.1, where the rectangle is the
set X, the dashed regions represent the set T, and their boundaries are the zero set of fsp.
Support vector machines produce classifiers from a sample z, a real number Y > 0 (a
regularization parameter as in Chapter 8 ), and an RKHS HK. Let us denote by Fz,Y such a
classifier. One major result in this chapter is the following (a more detailed statement is given
in Theorem 9.26 below).
Theorem 9.2 Assume ρ is weakly separable by H_K. Let B₁ denote the unit ball in H_K.
(i) If log N(B₁, η) ≤ C₀(1/η)^p for some p, C₀ > 0 and all η > 0, then, taking γ = m^{−β} (for some β > 0), we have, with confidence 1 − δ,
R(F_{z,γ}) ≤ C (log(2/δ)) m^{−ϑ},
where ϑ and C are positive constants independent of m and δ.
(ii) If log N(B₁, η) ≤ C₀(log(1/η))^p for some p, C₀ > 0 and all 0 < η < 1, then, for sufficiently large m and some β > 0, taking γ = m^{−β}, we have, with confidence 1 − δ,
R(F_{z,γ}) ≤ C ( (log m)^p / m^{ϑ} ) log(2/δ).
9.1 Binary classifiers
For f : X → R define sgn(f)(x) = 1 if f(x) ≥ 0 and sgn(f)(x) = −1 otherwise. The best classifier we can hope for is f_c := sgn(f_ρ), that is,
f_c(x) = 1 if Prob_Y(y = 1 | x) ≥ Prob_Y(y = −1 | x), and f_c(x) = −1 if Prob_Y(y = 1 | x) < Prob_Y(y = −1 | x).   (9.2)
Define K_ρ := {x ∈ X : Prob(y = 1 | x) = 1/2}.
Proposition 9.3 Let f_c be given by (9.2).
(i) For every classifier f : X → Y, R(f) ≥ (1/2) ρ_X(K_ρ).
(ii) For every classifier f : X → Y, R(f) ≥ R(f_c).
Proof.
For any classifier f and any x ∈ K_ρ, we have Prob_Y(y ≠ f(x) | x) = 1/2. Hence statement (i) holds.
For the second statement, we observe that for x ∈ X \ K_ρ, Prob_Y(y = f_c(x) | x) > Prob_Y(y ≠ f_c(x) | x). Then, for any classifier f, we have either f(x) = f_c(x) or
Prob(y ≠ f(x) | x) = Prob(y = f_c(x) | x) > Prob(y ≠ f_c(x) | x).
Hence R(f) ≥ R(f_c), and equality holds if and only if f and f_c are equal almost ρ-everywhere on X \ K_ρ.
The classifier f_c is called the Bayes rule.
Remark 9.4 The role played by the quantity K_ρ is reminiscent of that played by σ_ρ in the regression setting. Note that K_ρ depends only on ρ. Therefore, its occurrence in Proposition 9.3(i), just as that of σ_ρ in Proposition 1.8, is independent of f. In this sense, it yields a lower bound for the misclassification error and is, again, a measure of how well conditioned ρ is.
As ρ is unknown, the best classifier f_c cannot be found directly. The goal of classification algorithms is to find classifiers that approximate the Bayes rule f_c from samples z ∈ Zᵐ.
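A short Monte Carlo sketch (added for illustration) makes the role of the Bayes rule concrete: when Prob(y = 1 | x) is known, R(f) can be estimated for any classifier, and the Bayes rule achieves the smallest value, as in Proposition 9.3. The conditional probability function, the marginal ρ_X, and the competing classifier below are made-up assumptions.

```python
import numpy as np

def misclassification_error(f, p_pos, xs):
    """Monte Carlo estimate of R(f) = E_x[ Prob(y != f(x) | x) ] for labels in {-1, 1},
    where p_pos(x) = Prob(y = 1 | x) and xs are samples drawn from rho_X."""
    p = p_pos(xs)
    return np.mean(np.where(f(xs) == 1, 1.0 - p, p))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    p_pos = lambda x: np.clip(0.5 + 0.4 * np.sin(2 * np.pi * x), 0.0, 1.0)  # toy Prob(y=1|x)
    bayes = lambda x: np.where(p_pos(x) > 0.5, 1, -1)          # Bayes rule f_c
    other = lambda x: np.where(x < 0.3, 1, -1)                 # some other classifier
    xs = rng.uniform(0.0, 1.0, 200_000)                        # rho_X = uniform on [0, 1]
    print(misclassification_error(bayes, p_pos, xs))           # smallest possible error
    print(misclassification_error(other, p_pos, xs))           # >= R(f_c), cf. Prop. 9.3
```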
A possible strategy to achieve this goal could consist of fixing an RKHS H_K and a γ > 0, finding the minimizer f_{z,γ} of the regularized empirical error (as described in Chapter 8), that is,
f_{z,γ} = argmin_{f∈H_K} (1/m) Σ_{i=1}^m (f(x_i) − y_i)² + γ‖f‖²_K,   (9.3)
and taking the classifier sgn(f_{z,γ}). A more direct approach is to minimize the empirical misclassification error itself, that is, to consider
f_z^0 = argmin_{f∈H_K} (1/m) Σ_{i=1}^m χ_{{y_i f(x_i) < 0}}.   (9.4)
Note that in practical terms, we are again minimizing over H_K. But we are now minimizing a different functional.
It is clear, however, that if f is any minimizer of (9.4), so is af for all a > 0. This shows that the regularized version of (9.4) (regularized by adding the term γ‖f‖²_K to the functional to be minimized) has no solution. It also shows that we can take as minimizer a function with norm 1. We conclude that we can approximate the Bayes rule by sgn(f_z^0), where f_z^0 is given by
f_z^0 = argmin_{f∈H_K, ‖f‖_K=1} (1/m) Σ_{i=1}^m χ_{{y_i f(x_i) < 0}}.   (9.5)
We show in the next section that although we can reduce the computation of f_z^0 to a nonlinear programming problem, the problem is not a convex one. Hence, we do not possess efficient algorithms to find f_z^0 (cf. Section 2.7). We also introduce a third approach that lies somewhere in between those leading to problems (9.3) and (9.5). This new approach then occupies us for the remainder of this (and the next) chapter. We focus on its geometric background, error analysis, and algorithmic features.
9.2 Regularized classifiers
A loss (function) is a function φ : R → R₊. For (x, y) ∈ Z and f : X → R, the quantity φ(yf(x)) measures the local error (w.r.t. φ). Recall from Chapter 1 that this is the error resulting from the use of f as a model for the process producing y from x. Global errors are obtained by averaging over Z and empirical errors by averaging over a sample z ∈ Zᵐ.
Definition 9.5 The generalization error associated with the loss φ is defined as
E^φ(f) = ∫_Z φ(yf(x)) dρ.
The empirical error associated with the loss φ and a sample z ∈ Zᵐ is defined as
E_z^φ(f) := (1/m) Σ_{i=1}^m φ(y_i f(x_i)).
If f ∈ H_K for some Mercer kernel K, then we can define regularized versions of these errors. For γ > 0, we define the regularized error
E_γ^φ(f) := ∫_Z φ(yf(x)) dρ + γ ‖f‖²_K
and the regularized empirical error
E_{z,γ}^φ(f) := (1/m) Σ_{i=1}^m φ(y_i f(x_i)) + γ ‖f‖²_K.
Examples of loss functions are the misclassification loss
φ₀(t) = 0 if t > 0, and φ₀(t) = 1 if t ≤ 0,
and the least squares loss φ_ls(t) = (1 − t)². Note that for functions f : X → R and points x ∈ X such that f(x) ≠ 0, φ₀(yf(x)) = χ_{{y ≠ sgn(f(x))}}; that is, the local error is 1 if y and f(x) have different signs and 0 when the signs are the same.
Proposition 9.6 Restricted to binary classifiers, the generalization error w.r.t. φ₀ is the misclassification error R; that is, for all classifiers f,
R(f) = E^{φ₀}(f).
In addition, the generalization error w.r.t. φ_ls is the generalization error E. Similar statements hold for the empirical errors.
Proof. The first statement follows from the equalities
R(f) = ∫_Z χ_{{yf(x)=−1}} dρ = ∫_Z φ₀(yf(x)) dρ = E^{φ₀}(f).
For the second statement, note that the generalization error E(f) of f satisfies
Recall that HK,z is the finitedimensional subspace of HK spanned by {Kx1,..., Kxm} and P :
HK ^ HK,z is the orthogonal projection. Corollary 2.26 showed that when H = IK (BR), the
empirical target function fz for the regression problem can be chosen in HK,z. Proposition 8.7
gave a similar statement for the regularized empirical target function fz,Y (and exhibited
explicit expressions for the coefficients of fz,Y as a linear combination of {Kx1,..., Kxm}). The
proofs of Proposition 2.25 and Corollary 2.26 readily extend to show the following result.
Proposition 9.7 Let K be a Mercer kernel on X, and φ a loss function. Let also B ⊆ H_K, γ > 0, and z ∈ Zᵐ. If f ∈ H_K is a minimizer of E_z^φ in B, then P(f) is a minimizer of E_z^φ in P(B). If, in addition, P(B) ⊆ B and E_z^φ can be minimized in B, then such a minimizer can be chosen in P(B). Similar statements hold for E_{z,γ}^φ.
We can use Proposition 9.7 to state the problem of computing f_z^0 as a nonlinear programming problem.
Corollary 9.8 We can take f_z^0 ∈ H_{K,z}; that is, f_z^0 = Σ_{j=1}^m c_j K_{x_j} for some c ∈ Rᵐ with cᵀK[x]c = 1.
Remark 9.9 Take φ = φ₀ and consider the problem of minimizing E_z^{φ₀} on {f ∈ H_K : ‖f‖_K = 1}. By Corollary 9.8, we can minimize on {f ∈ H_{K,z} : ‖f‖_K = 1} and take
f_z^0 = Σ_{j=1}^m c_{z,j} K_{x_j},
where
c_z = (c_{z,1}, ..., c_{z,m}) = argmin_{c∈Rᵐ, cᵀK[x]c=1} (1/m) Σ_{i=1}^m χ_{{ Σ_{j=1}^m c_j y_i K(x_i, x_j) < 0 }}.
Since S_K := {c ∈ Rᵐ : cᵀK[x]c = 1} is not a convex subset of Rᵐ and χ_{{Σ_j c_j y_i K(x_i, x_j) < 0}} may not be a convex function of c ∈ S_K, the optimization problem of computing c_z is not, in general, a convex programming problem.
We would like thus to replace the loss φ₀ by a loss φ that, on one hand, approximates the Bayes rule (for which we will require that φ is close to the misclassification loss φ₀) and, on the other hand, leads to a convex programming problem. Although we could do so in the setting described in Chapter 1 (we actually did it with f_z^0 above), we instead consider the regularized setting of Chapter 8.
Definition 9.10 Let K be a Mercer kernel, φ a loss function, z ∈ Zᵐ, and γ > 0. The regularized classifier associated with K, φ, z, and γ is defined as sgn(f_{z,γ}^φ), where
f_{z,γ}^φ = argmin_{f∈H_K} (1/m) Σ_{i=1}^m φ(y_i f(x_i)) + γ ‖f‖²_K.   (9.6)
Note that (9.6) is a regularization scheme like those described in Chapter 8. The constant γ > 0 is called the regularization parameter, and it is often selected as a function of m, γ = γ(m).
Proposition 9.11 If φ : R → R₊ is convex, then the optimization problem induced by (9.6) is a convex programming one.
Proof. According to Proposition 9.7, f_{z,γ}^φ = Σ_{j=1}^m c_{z,j} K_{x_j}, where
c_z = (c_{z,1}, ..., c_{z,m}) = argmin_{c∈Rᵐ} (1/m) Σ_{i=1}^m φ( y_i Σ_{j=1}^m K(x_i, x_j) c_j ) + γ Σ_{i,j=1}^m c_i K(x_i, x_j) c_j.
For each i = 1, ..., m, φ( y_i Σ_{j=1}^m K(x_i, x_j) c_j ) = φ( y_i (K[x]c)_i ) is a convex function of c ∈ Rᵐ. In addition, since K is a Mercer kernel, the Gramian matrix K[x] is positive semidefinite. Therefore, the function c ↦ cᵀK[x]c is convex. Thus, c_z is the minimizer of a convex function.
Regularized classifiers associated with general loss functions are discussed in the next chapter. In particular, we show there that the least squares loss φ_ls yields a satisfactory algorithm from the point of view of convergence rates in its error analysis. Here we restrict our exposition to a special loss, called the hinge loss,
φ_h(t) = (1 − t)₊ = max{1 − t, 0}.   (9.7)
The regularized classifier associated with the hinge loss, the support vector machine, has been used extensively and appears to have a small misclassification error in practice. One nice property of the hinge loss φ_h, not possessed by the least squares loss φ_ls, is the elimination of the local error when yf(x) ≥ 1. This property often makes the solution f_{z,γ}^{φ_h} of (9.6) sparse in the representation f_{z,γ}^{φ_h} = Σ_{i=1}^m c_{z,i} K_{x_i}. That is, most coefficients c_{z,i} in this representation vanish. Hence the computation of f_{z,γ}^{φ_h} can, in practice, be very fast. We return to this issue at the end of Section 9.4.
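A minimal sketch of the convex problem of Proposition 9.11 specialized to the hinge loss (9.6)-(9.7) follows (added for illustration): plain subgradient descent on the coefficient vector c of f = Σ_j c_j K_{x_j}. This is only an illustrative solver; practical SVM software minimizes the dual quadratic program instead. The Gaussian kernel, the toy data, the step size, and the number of iterations are all assumptions.

```python
import numpy as np

def svm_coefficients(Kx, y, gamma, lr=0.1, steps=2000):
    """Subgradient descent for the hinge-loss scheme (9.6) in the coefficients c:
        (1/m) sum_i max(0, 1 - y_i (K[x] c)_i) + gamma * c^T K[x] c."""
    m = len(y)
    c = np.zeros(m)
    for _ in range(steps):
        margins = y * (Kx @ c)
        active = (margins < 1).astype(float)            # points with positive hinge loss
        grad = -(Kx @ (active * y)) / m + 2 * gamma * (Kx @ c)
        c -= lr * grad
    return c

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(80, 2))
    y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)      # a separable toy sample
    def K(S, T, sigma=1.0):
        d2 = ((S[:, None, :] - T[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / sigma ** 2)
    Kx = K(X, X)
    c = svm_coefficients(Kx, y, gamma=0.01)
    pred = np.sign(Kx @ c)                               # sgn(f_{z,gamma}) on the sample
    print(np.mean(pred == y))                            # training accuracy
```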
9.3 Optimal hyperplanes: the separable case
Although the definition of the hinge loss may not suggest at first glance any particular reason for inducing good classifiers, it turns out that there is some geometry to explain why it may do so. We next digress on this geometry.
Suppose X ⊂ Rⁿ and z = (z₁, ..., z_m) is a sample set with z_i = (x_i, y_i), i = 1, ..., m. Then z consists of two classes with the following sets of indices: I = {i : y_i = 1} and II = {i : y_i = −1}. Let H be a hyperplane given by w·x = b with w ∈ Rⁿ, ‖w‖ = 1, and b ∈ R. We say that I and II are separable by H when, for i = 1, ..., m,
w·x_i > b if i ∈ I and w·x_i < b if i ∈ II.
That is, points x_i corresponding to I and II lie on different sides of H. We say that I and II are separable (or that z is so) when there exists a hyperplane H separating them. As shown in Figure 9.2, if w is a unit vector in Rⁿ, then the distance from a point x* ∈ Rⁿ to the plane w·x = 0 is ‖x*‖ |cos θ| = ‖w‖ ‖x*‖ |cos θ| = |w·x*|. For any b ∈ R, the hyperplane H given by w·x = b is parallel to w·x = 0, and the distance from the point x* to H is |w·x* − b|. When w·x* − b < 0, the point x* lies on the side of H opposite to the direction w.
If I and II are separable by H, points x_i with i ∈ I satisfy w·x_i − b > 0, and the point(s) in this set closest to H is (are) at a distance b_I(w) := min_{i∈I}{w·x_i − b} = min_{i∈I} w·x_i − b. Similarly, points x_i with i ∈ II satisfy w·x_i − b < 0, and the point(s) in this set closest to H is (are) at a distance b_{II}(w) := −max_{i∈II}{w·x_i − b} = b − max_{i∈II} w·x_i.
If we shift the separating hyperplane to w·x = c(w) with
c(w) = ½ ( min_{i∈I} w·x_i + max_{i∈II} w·x_i ),
then both classes are at the same distance from the shifted hyperplane; this common distance, Δ(w) = ½ ( min_{i∈I} w·x_i − max_{i∈II} w·x_i ), is called the margin associated with the direction w.
Different directions w induce different separating hyperplanes. In Figure 9.3, one can
rotate w such that a hyperplane with smaller angle still separates the data, and such a
separating hyperplane will have a larger margin (see Figure 9.4).
Any hyperplane in Rⁿ induces a classifier. If its equation is w·x − b = 0, then the function x ↦ sgn(w·x − b) is such a classifier. This reasoning suggests that the best classifier among those induced in this way may be that for which the direction w yields a separating hyperplane with the largest possible margin Δ(w). Given z, such a direction is obtained by solving the optimization problem
max_{‖w‖=1} Δ(w)
or, in other words,
max_{‖w‖=1} ½ ( min_{y_i=1} w·x_i − max_{y_i=−1} w·x_i ).   (9.8)
If w* is a maximizer of (9.8) with Δ(w*) > 0, then w*·x = c(w*) is called the optimal hyperplane and Δ(w*) is called the (maximal) margin of the sample.
Theorem 9.12 If z is separable with I and II both nonempty, then the optimization
problem (9.8) has a unique solution w*, A(w*) > 0, and the optimal separating
hyperplane is given by w* x = c(w* ).
Proof. The function Δ : Rⁿ → R defined by
Δ(w) = ½ ( min_{y_i=1} w·x_i − max_{y_i=−1} w·x_i )
is continuous. Therefore, Δ achieves a maximum value over the compact set {w ∈ Rⁿ : ‖w‖ ≤ 1}. Since z is separable, this maximum is positive. The maximum cannot be achieved in the interior of this set; for w* with 0 < ‖w*‖ < 1 and Δ(w*) > 0, homogeneity gives
Δ(w*/‖w*‖) = Δ(w*)/‖w*‖ > Δ(w*).
Furthermore, the maximum cannot be attained at two different points. Otherwise, for two maximizers w*₁ ≠ w*₂ of norm 1, we would have, for any i ∈ I and j ∈ II,
w*₁·x_i − w*₁·x_j ≥ 2Δ(w*₁),   w*₂·x_i − w*₂·x_j ≥ 2Δ(w*₂) = 2Δ(w*₁),
which implies
(½w*₁ + ½w*₂)·x_i − (½w*₁ + ½w*₂)·x_j ≥ 2Δ(w*₁).
That is, ½w*₁ + ½w*₂ would be another maximizer, lying in the interior of the ball (since ‖½w*₁ + ½w*₂‖ < 1 when w*₁ ≠ w*₂), which is not possible.
9.4
When z is separable, we can obtain a classifier by solving (9.8) and then taking, if w* is the computed solution, the classifier x ↦ sgn(w*·x − c(w*)). We can also solve an equivalent form of (9.8).
Theorem 9.13 Assume (9.8) has a solution w* with Δ(w*) > 0. Then w* = w̄/‖w̄‖, where w̄ is a solution of
min_{w∈Rⁿ, b∈R} ‖w‖²
s.t. y_i (w·x_i − b) ≥ 1,   i = 1, ..., m.   (9.9)
Moreover, Δ(w*) = 1/‖w̄‖ is the margin.
Proof. A minimizer (w̄, b̄) of the quadratic function ‖w‖² subject to the linear constraints exists. Recall that
Δ(w) = ½ ( min_{y_i=1} w·x_i − max_{y_i=−1} w·x_i ).
Then
Δ(w̄/‖w̄‖) = ½ ( min_{y_i=1} (w̄/‖w̄‖)·x_i − max_{y_j=−1} (w̄/‖w̄‖)·x_j ) ≥ 1/‖w̄‖,
since w̄·x_i − b̄ ≥ 1 when y_i = 1, and w̄·x_j − b̄ ≤ −1 when y_j = −1.
We claim that Δ(w₀) ≤ 1/‖w̄‖ for each unit vector w₀. If this is so, we can conclude from Theorem 9.12 that Δ(w̄/‖w̄‖) = 1/‖w̄‖ = Δ(w*) and w* = w̄/‖w̄‖.
Suppose, to the contrary, that for some unit vector w₀ ∈ Rⁿ, Δ(w₀) > 1/‖w̄‖ holds. Consider the vector w = w₀/Δ(w₀) together with
b = ½ ( min_{y_i=1} w₀·x_i + max_{y_j=−1} w₀·x_j ) / Δ(w₀).
They satisfy
w·x_i − b = ( w₀·x_i − ½(min_{y_i=1} w₀·x_i + max_{y_j=−1} w₀·x_j) ) / Δ(w₀) ≥ 1 if y_i = 1
and
w·x_j − b = ( w₀·x_j − ½(min_{y_i=1} w₀·x_i + max_{y_j=−1} w₀·x_j) ) / Δ(w₀) ≤ −1 if y_j = −1.
But ‖w‖² = ‖w₀‖²/Δ(w₀)² = 1/Δ(w₀)² < ‖w̄‖², which is in contradiction with w̄ being a minimizer of (9.9).
Thus, in the separable case, we can proceed by solving either the optimization problem (9.8) or that given by (9.9). The resulting classifier is called the hard margin classifier, and its margin is given by Δ(w*), with w* the solution of (9.8), or by 1/‖w̄‖, with w̄ the solution of (9.9).
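The quantities appearing in (9.8) are easy to evaluate directly, which the following Python sketch does for a few candidate directions on synthetic separable data (added for illustration; the data and the candidate directions are made-up assumptions). For a unit direction w it computes the margin Δ(w) and the offset c(w) of the shifted hyperplane w·x = c(w).

```python
import numpy as np

def margin_and_offset(w, X, y):
    """For a unit direction w, return Delta(w) = (min_{y_i=1} w.x_i - max_{y_i=-1} w.x_i)/2
    and the offset c(w) of the shifted hyperplane w.x = c(w), as in (9.8).
    Delta(w) > 0 exactly when w separates the two classes with positive margin."""
    w = np.asarray(w, float) / np.linalg.norm(w)
    lo = np.min(X[y == 1] @ w)        # class I lies on the side w.x >= lo
    hi = np.max(X[y == -1] @ w)       # class II lies on the side w.x <= hi
    return (lo - hi) / 2.0, (lo + hi) / 2.0

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    X1 = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(30, 2))     # class I  (y = +1)
    X2 = rng.normal(loc=[-2.0, -2.0], scale=0.5, size=(30, 2))   # class II (y = -1)
    X = np.vstack([X1, X2])
    y = np.hstack([np.ones(30), -np.ones(30)])
    for w in ([1.0, 1.0], [1.0, 0.0], [0.0, 1.0]):
        delta, c = margin_and_offset(w, X, y)
        print(w, round(delta, 3), round(c, 3))   # direction, margin Delta(w), offset c(w)
```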
It follows from Theorem 9.12 that there are at least n support vectors. In most
applications of the support vector machine, the number of support vectors is much smaller
than the sample size m. This makes the algorithm solving (9.9) run faster.
Support vector machines (SVMs) consist of a family of efficient classification algorithms:
the SVM hard margin classifier (9.9), which works for separable data, the SVM soft margin
classifier (9.10) for nonseparable data (see next section), and the general SVM algorithm (9.6)
associated with the hinge loss ph and a general Mercer kernel K. The first two classifiers can
be expressed in terms of the linear kernel K(x, y) = x y + 1, whereas the general SVM
involves general Mercer kernels: the polynomial kernel (x y + 1)d with d e N or Gaussians
exp{x  y2 /a2} with a > 0. These SVM algorithms share a special feature caused by the
hinge loss ph: the solution = J2m=i Cz,iKx often has a sparse vector of coefficients cz =
(cz, 1 , . . . , cz,m), which makes the algorithm computing cz run faster.
In the nonseparable situation, there are no w e R" and b e R such that the points in z can be
separated in to two classes with yi = l a n d yi = 1 by the hyperplane w x = b. In this case,
we look for the soft margi" classifier. This is defined by introducing slack variables f = (f
1,. . . , fm) and considering the problem
m
min
wsRn, bsR, sRm s.t.
INI2 + jm %i
(9.10)
i=1
yi (w xi  b) > 1  %i > 0 , i = 1 , . . . , m .
Here Y > 0 is a regularization parameter. If (w,b,f) is a solution of (9.10), then its associated
soft margin classifier is defined by x ^ sgn (w x  b).
The hard margin problem (9.9) in the separable case can be seen as a special case of the
soft margin one (9.10) corresponding to 1 =ro, in which case all solutions have f = 0 .
We claimed at the end of Section 9.2 that the regularized classifier associated with the
hinge loss was related to our previous discussion of margins and separating hyperplanes. To
see why this is so we next show that the soft margin classifier is a special example of (9.6).
Recall that the hinge loss ph is defined by
ph(t) = (1  t)+ = max{1  t,0 }.
1
min n
wsR , bsK m
m
J2^h(yt(w xt i=1
b)) + Y lw
2
If (w,b,l;) is a solution of (9.10), then we must have ^ = (1 yi(w xi b))+; that is, i =
min
f &HK,bsR m
m
(9.11)
J2$h(yitf (Xi)  b)) + Y If llK. i=i
The scheme (9.11) is the same as (9.6) with the linear kernel except for the constant term b,
called ofset. 5
One motivation to consider scheme (9.6) with an arbitrary Mercer kernel is the
expectation of separating data by surfaces instead of hyperplanes only. Let f be a function on
Rn, and f (x) = 0 the corresponding surface. The two classes I and II are separable by this
surface if, for i = 1 , . . . , m,
f (xi) > 0if i e I
f (xi) < 0
if i e II;
that is, if yf (xi) > 0 for i = 1 , . . . , m. This set of inequalities is an empirical version of the
separation condition yf (x) > 0 almost surely for the probability distribution p on Z. Such a
separation condition is more general than the separation by hyperplanes. In order to find such
a separating surface using efficient algorithms (convex optimization), we require that the
function f lies in an RKHS HK. Under such a separation condition, one may take Y = 0 and
algorithm (9.6) corresponds to a hard margin classifier. This is the context of the next two
sections, on error analysis.
9.6
In this section we present an error analysis for scheme (9.6) with the hinge loss = (1  t)+ for
separable distributions.
Definition 9.14 Let HK be an RKHS of functions on X, and p a probability measure on Z = X x
Y. We say that p is strictly separable by HK with margin A > 0 if there is somefsp e HK such
that fsp K = 1 and yfsp(x) > A almost surely.
Remark 9.15
i; Even under the weaker condition that yfsp (x) > 0 almost surely (which we consider in the
next section), we have y = sgn(fsp(x)) almost surely. Hence, the variance op vanishes
(i.e., p is noise free) and so does KP .
ii; As a consequence of (i), fc = sgnfsp).
iii; Sincefsp is continuous and fsp (x) > A almost surely, it follows that if p is strictly
separable, then
px(T n X \ T) = 0 ,
/sp = 0
Figure 9.5
where T = {x e X  fc(x) = 1}. This implies that if X is connected, pX (T) > 0, and pX (X \ T)
5 We could have considered the scheme (9.11) with offset. We did not do so for simplicity of
exposition. References to work on the general case can be found in Section 9.8.
> 0, then p is degenerate. The situation would be as in Figure 9.5, where the two dashed
regions represent the support of the set T, those with dots represent the support of X \ T, and
the remainder of the rectangle has measure zero (for the measure pX).
Theorem 9.16 If p is strictly separable by HK with margin A, then, for almost every z e Zm,
Y_
A2
Et' ft ) <
and IfY I < A.
Proof. SincefSp/A e HK, we see from the definition offZfJ) that
f
sp
A
f
Et ft ) + f III < Efh
A +Y
But y(fSp(x)/A) > 1 almost surely, that is, 1  y(fSp(x)/A) < 0, so we have th (yfsp(x)/A)) = 0
almost surely. It follows that EZfh (/Sp/A) = 0. Since
lfsp/AiK = I/A2 ,
Efh (ft) + Y\ft \\K <
holds and the statement follows.
The results in Chapter 8 lead us to expect the solution ft of (9.6) to satisfy Eth (ft) ^ Eth
(ft), where ft is
at
this is indeed the case. To this end, we first characterize ft. For x e X, let nx := Proby(y = 1 
x).
Theorem 9.17 For any measurable function f : X ^ R
E( f ) > E(fc )
holds.
That is, the Bayes rulefc is a minimizerft of Eth.
Proof. Write Eth (f) = fX $h,x(f (x)) dpX, where
h,x(t) = j th(yt) dp(y  x) = th(t)nx + th(t)(i  nx).
When t = fc(x) e { 1 , 1}, for y = fc(x) one finds that yt = 1 and th(yt) = 0, whereas for y =
fc(x) = fc(x), yt = 1 and th(yt) = 2. So y th(yt) dp(y  x) = 2 Prob(y = fc(x)  x) and
$h,x(fc(x)) = 2 Prob(y = fc(x)  x).
According to (9.2), Prob (y = fc (x)  x) < Prob (y = s  x) for s = 1. Hence, $h,x(fc(x)) <
2Prob(y = s  x) for any s e {1, 1}.
If t > 1, then
th(t) = 0 and $h,x(t)
= (1 + t)(1  nx) > 2(1
nx)
>
h,x (fc (x)).
If t < 1, then ) = 0 and $h,x(t) = (1  t)nx > 2nx > $h,x(fc(x)). If 1 < t < 1, then $hx(t) =
(1  t)nx + (1 + t)(1  nx)>(1  t)
1
$h,x (fc(x)) + (1 + t) 2 $h,x (fc (x)) = $h,x (fc(x)).
Thus, we have $h,x(t) > $h,x(fc (x)) for all t e R. In particular,
fh (f) =
$h,x(f (x)) dpx >$h,x(fc(x)) dpx = fh (fc).
X
X
Eth (ft) = E(ft)  Eth (ft) + t (ft) < Eth (ft)  E
4 >h
<Ph
(ft) + 2 (9.12)
When p is strictly separable by HK, weseethatsgn( yfsp(x)) = 1 and hence y = sgn( fsp(x))
almost surely. This means fc (x) = sgn( fsp(x)) and y = fc(x) almost surely. In this case, we
have fh (fc) = jZ (1  yfc(x))+ = 0.Therefore, we expect (ft) ^ 0. To get error bounds
showing that this is the case we write
Here we have used the first inequality in Theorem 9.16. The second inequality of that theorem
tells us that ft lies in the set {f e HK  Ilf K < 1/A). So it is sufficient to estimate E(f)  h (f)
for functions f in this set in some uniform way. We can use the same idea we used in Lemmas
3.18 and 3.19.
Prob
zsZm
t____m 1=1 tZi)
> a*/e
< exp
3a2me
8M
Lemma 9.18 Suppose a random variable f satisfies 0 < f < M. Denote p = E(f). For
every e > 0 and 0 < a < 1,
holds.
Proof. The proof follows from Lemma 3.18, since the assumption 0 < f < M implies f  p< M
and E(f2) < ME(f).
Prob
zsZm
E(f)  Eth (f) sup
f sF x/E(f) + e
> 4^/e
< N(F, ae) exp
3a2me 8(1 + B) '
Lemma9.19 Let F be a subset of C (X) suchthat If IIC(X) < B for all f e F. Then, for
every e > 0 and 0 < a < 1, we have
Proof. Let { fi , . . . , fN} be an aenet for F with N = N(F, ae). Then, for each f e F, there is
some j < N such that  f  fj C(X) < ae. Since fh is Lipschitz, lfh(t)  fp(t')l < t  t'l for all t,t'
e R. Therefore, \*h(yf(x))  *h(yfj(x)) < yf  fjIle(X) < as. It follows that \E*h(f )  E*h(fj)\ < as
and \E*h(f )  E*h(fj)\ < as. Hence,
\E*'n (f )  E*h (fj)l V *h ( f ) + s
Ez*h ( f )  E*h ( fj )\
r .
< a s and
< as.
E *h ( f ) + s
Also,
y E ( fj ) + s < y E *h ( f ) + s + E*h ( fj )  E*h ( f )
<yE*h(f ) + s + J\*h(fj)  E*h(f )!.
Since a < 1, we have \E*h (f )  E*h (fj)\< s < s + E*h (f ) and then
< 2^*hf)+l.
(9.13)
*h
*h
E*h ( f )  E*h ( f )
E*h ( f )  E*h ( f )
E*" ( f )  E*" ( f )
+
s/E *h ( f ) + s
E *h ( f ) + s
E *h ( f ) + s
+
E( f )  E*h ( f ) < 2a^_ + E( f )  E*h ( f )
V E *h ( f ) + s
V E *h ( f ) + s
It follows that if (E *h ( f )  E*h ( f ))/^/ E *h ( f ) + s ) > 4a V for some f e F, then
E*h ( f )  E*h ( f )
s/E *h ( f ) + s
This, together with (9.13), tells us that
E*h ( f )  E*h ( f )
> 2 a V.
yE ( fj )+s
> a*/s.
Thus,
Prob
zeZm
E 4>h ( f )  E*h ( f )
sup ;
f eF VE *h ( f ) + s
> 4a*/s
<
j=1
E ^ ( fj )  E*h f )
>as.
yE *h ( fj )+s
Therefore,
The statement now follows from Lemma 9.18 applied to the random variable H =
<h(yfj(x)), for j = 1 ,N, which satisfies 0 < f < 1 + \\fj\\C(X) < 1 + B.
We can now derive error bounds for strictly separable measures. Recall that Bi denotes
the unit ball of HK as a subset of C(X).
Theorem 9.20 If p is strictly separable by HK with margin A, then, for any 0 < 5 < 1 ,
E(ft) < 2s* (m, 5) + A
with confidence at least 1  5, where s*(m, 5) is the smallest positive solution of the
inequality in s
log N
3
me
< log 8 .
128(1 + CK/A)
In addition,
(i) If log N (B1, n) < Co(1/n)p for some p, Co > 0, and all n > 0, then
e*(m, 8) < 86(1 + CK/A) max
jlogO/S) Cl/d+p) m 0
/ 4 \p/(1 +p) / 1 \ 1/(1+p) 1
Am
e*(m,8) <
(log mf m
{1 + 43(1 + CK/A)(2pCo + log(1/8))}.
(ii) If log N (B1, n) < C0(log(1/n)Y for some p, C0 > 0 and all 0 < n < 1, then, for m >
max{4/A, 3},
Proof. We apply Lemma 9.19 to the set F ={f e HK  f K < A}. Each functionf e F satisfies  f
C(X ) < CK  f K < CK/A. By Lemma 9.19, for any 0 <a < 1, with confidence at least 1N (F,
ae) exp {3a2 me/(8(1 + CK / A ) ) } , we have
< 4a*fe.
E^ (f) (f)
sup;
f eF yMf)
In particular, the function fy, which belongs to F by Theorem 9.16, satisfies
E*h( fZty) (fih)
.r
yE^
)+e
This, together with (9.12), yields
E (fZY) < 4a^eJs*f)Ve + A.
If we denote t = ^E^ (fy) + e, this inequality becomes t6 4a*fet (e + A) < 0.
Solving the associated quadratic equation and taking into account that t = yE(fy) + e > 0 ,
we deduce that
0 < t < 2a
+ (2a^~s)2 + e + ^2'
Hence, using the elementary inequality (a + b)2 < 2a2 + 2b2, we obtain
E(fy) = t2 e < 2(4a2e)+2 ((2a je)2 + e + A) e = 16a2e+e+A^.
Set a = 4 and e = e*(m, 8) as in the statement. Then the confidence 1 N(F, ae) exp{
3a2me/(8(1 + CK/A))} is at least 1 8, and with this confidence we have
E( f$) < 2e*(m, 8) + A
256(1 + CK/A)
1
e(m, S) < e < max
 log ,
3m
S
(
256(1 + CK/A)C0\1/(1+p) ( 4 \p/(1+p)
m
(A)'
t/a+p)
(ii) I fl o g N (Bi, n) < C0 (log(l/n))p, then e*(m, 5) < e*, where e* satisfies
3 me
Co[
log 0
Ae
128(1 + C/A)
= log S.
Then Lemma 7.2 with d = 2 yields
A = max
(2pCo + log(1/S)) 128(1 + C/A) 3
1
The function h: R+^ R defined by h(e) = C0 ^log A^^  3me/ (128(1 + CK/A)) is decreasing.
6 If log N(Bi, n) < C0 (1/n^,then e* (m, 8) < e*, where e* satisfies
3
me
Ae128(1 + CK/A)
This equation can be written as
i+p _ 128(1 +
3m
8
e
3m
A'
Take
and e = A(log
m)/m. Then, for m > max{4/A, 3},
A (log m)p
1
4
> and log + log m < 2 log m. m
m
A
It follows that
Al
m
p
h ( m ) < C^log A + log m)  (log m) ^2 Co + log ^
< (log m) log 1 < log 5.
5
Since h isp decreasing, we have
A(log m)
e (m, 5) < e* <
.
m
In order to obtain estimates for the misclassification error from the generalization error,
we need to compare R with E. This is simple.
Theorem 9.21 For any measure p and any measurable function f : X ^ R, R(sgn(f))  R(fc) <
E(f) 
(fc).
m/
C
t (t+p) , C0
4 \ p/(t+p)
\ 1/(1+p)
R(sgn($)) <
+ 172(1 + CKM)
(ii) If log N (B\, n) < C0(log(1/n))P for some p, C0 > 0 and all 0 < n < 1, then, for m >
max{4/A, 3}, with confidence 1  8,
R(sgn($)) < A2 + (lQgmmF {2 + 86(1 + CK/A)(2pC0 + log(1 /8 ) ) } .
It follows from Corollary 9.22 that for strictly separable measures, we may take Y = 0. In
this case, the penalized term in (9.6) vanishes and the soft margin classifier becomes a hard
margin one.
Col Y 
= log 8 .
We continue our discussion on separable measures. By abandoning the positive margin A > 0
assumption, we consider weakly separable measures.
Definition 9.23 We say that p is weakly separable by HK if there is some function fsp e HK
satisfying  fsp K = 1 and yfsp(x) > 0 almost surely. It has separation triple (0, A, Co) e (0,
TO] X (0, TO) 2 if, for all t > 0,
px {x e X :  fsp (x)  < At} < C0 te.
(9.16)
The largest 0 for which there are positive constants A, C0 such that (0, A, C0) is a separation
triple is called the separation exponent of p (w.r.t. HK andfsp).
Remark 9.24 Note that when 0 =TO, condition (9.16) is the same as pX {x e X : fsp (x)  <
At} = 0 for all 0 < t < 1. That is, fsp (x)  > A almost surely. But yfsp (x) > 0 almost surely. So
a weakly separable measure with 0 = TO is exactly a strictly separable measure with margin
A.
Lemma 9.25 Assume p is weakly separable by HK with separation triple (0, A, C0 ).
Take
(9.17)
Then
E *h ( fy) + YII f
Proof. Write fY = fsp/At, with t > 0 to be determined. Since yfsp(x) > 0 almost surely, the same
holds for yfY(x) >0. Hence, fu(yfY(x)) < 1 and fh(yfY(x))> 0 only if yfY(x)<1, that is, if  fY(xf
= fsp(x)/At < 1. Therefore,
*h (fy) = 0h( yfY(x)) dp = 0h(l fY(x)l) dpx
Z
x
=
(1  Ify(x)l) dpx < px {x e X :  fy(x)l
I fy(x)l<1
< 1}< px {x e x : l fsp (x) l < At} < C0 te.
f
Theorem 9.26 If p is weakly separable by HK with a separation triple (0, A, C0 ) then, for
any 0 < 8 < 1, with confidence 1  8, we have
R(sgn($)) < 2e*(m, 8, Y) + 8C^/(2+0) (A,
where e*(m, 8, y) is the smallest positive solution of the inequality
8
< log2
(b', 4 R)
3 me
128(1 + CKR) with R = 2c',(1 +0)A0'(2 +0)y1/(2+ +J 2log,8)
In addition,
(i) If log N (Bu n) < C0 (1/n)p for some p, C0 > 0 and all n > 0, then, taking Y = mP
with 0 < ) < max{20+0+2p}, we have, with confidence 1  8,
log N
R(s 8 f))
< e(( m
(io
i))
with r = min 2++0), 2++ eyi+p) , 2+0 and C a constant independent of m and 8.
(ii) If log N (Bi, n) < C0(log(1/n)f for some p, C0 > 0 and all 0 < n < 1, then, for m >
max{4 / A , 3 }, taking Y = m(2+e>/(1+e>, we have, with confidence 1  8,
R(sgn(fZY)) < C ^} (log m)p (log 2 ) j
Proof. By Theorem 9.21, it is sufficient to bound E(fz;) as stated, since y = sgn(fsp (x)) = fc
(x) almost surely and therefore E^h (fc) = 0 and R(fc) = 0 .
Choose fY by (9.17). Decompose E(fy) + Y  fZfjh K as
E*h (fty)  Ezh (fty) + Eh (fv) + YII fty IK  (Eth (fY) + YII fY IlK) }
+ Ezh (fY) + Y 1 fY IlK.
lY IK
Since the middle term is at most zero by (9.6), we have E^ (fZY) + Y II fz+ IlK < (E^ (fz+ ) Eth (fY)} + Eh (fY) + Y II fY
F*rob {E?h (fY)  E*h (fy) < e} >
ms*
2(E ( fy) + 3 s)
ms
2
5
To bound the last term, consider the random variable f = fh(yfY(x)). Since yfY(x) > 0 almost
surely, we have 0 < f < l.Also, o2(f) < E(f) = E^ (fY). Apply the oneside Bernstein inequality
to f and deduce, for each e > 0,
2 ( E ( f y ) + 1 s) 2
s
we conclude that, with confidence 1  ,
E z ( f y)  E ( f y) <
2 log I + y j ( log I) + 2mE^ (f y ) !g
<
7 log 6 m
+ E ^ ( f y) .
m
Thus, there exists a subset U1 of Zm with p(U1) > 1   such that
I
9.8
The support vector machine was introduced by Vapnik and his collaborators. It appeared in
[20] with polynomial kernels K(x, y) = (1 + x y)d and in [35] with general Mercer kernels.
More details about the algorithm for solving optimization problem (9.11) can be found in [37,
107, 134, 152].
Proposition 9.3 and some other properties of the Bayes rule can be found in [44].
Proposition 9.7 (a representer theorem) can be found in [137]. The material in Sections 9.39.5 is taken from [134].
Theorem 9.17 was proved in [138] and Theorem 9.21 in [154]. The idea of comparing
excess errors also appeared in [80].
The error analysis for support vector machines and strictly separable
distributions was already well understood in the early works on support vector
machines (see [134, 37]). The concept of weakly separable distribution was
introduced, and the error analysis for such a distribution was performed, in [31].
When the support vector machine soft margin classifier contains an offset term
b as in (9.11), the algorithm is more flexible and more general data can be
separated. But the error analysis is more complex than for scheme
(96; , which has no offset. The bound for  ^ becomes larger than those
shown in Theorem 9.16 and (9.18). But the approach we have used for scheme
can be applied as well and a similar error analysis can be performed. For details,
see [31].
In Chapter 9 we saw that solving classification problems amounts to approximating the Bayes
rule fc (w.r.t. the misclassification error) and we described a learning algorithm, the support
vector machine, producing such aproximations from a sample z, a Mercer kernel K, and a
regularization parameter Y > 0. The main result in Chapter 9 estimated the quality of the
approximations obtained under a separability hypothesis on p. The classifier produced by the
support vector machine is the regularized classifier associated with z, K, y, and a particular
loss function, the hinge loss 0 h. Recall that for a loss function 0, this regularized classifier is
given by sgn(fZty), with
(10.1)
Note that Proposition 9.11 implies that optimization problem (10.1) for a classifying loss function is a convex programming problem.
One special feature shared by p\s, ph, and p2 = (ph)2 is that their associated convex programming problems are quadratic programming
problems. This allows for many efficient algorithms to be applied when computing a solution of (10.1). Note that p\s differs from p2 by the
addition of a symmetric part on the right of 1.
Figure 10.1 shows the shape of some of these classifying loss functions (together with that of p 0).
The following two theorems, easily derived from Theorem 10.24, become specific for C kernels and the hinge loss ph and the least
squares loss p\s, respectively.
Theorem 10.2 Assume that X c Rn, K is C in X x X, and, for some p > 0,Choose
(l+P)
. Then, for any 0 < e < 1 and 0 < 8 < 1, with confidence 1  8,
I)
=m
2/
SH K
Choose
PX
= m. Then, for any 0 < e < 1 and 0 < 8 < 1, with confidence
 8,
(log ^
\\
holds, where CC is a constant depending on e but not on m or 8.
Again, the exponents P in Theorems 10.2 and 10.3 depend on the measure p. We note,
however, that in the latter this exponent occurs only in the bounds; that is, the regularization
parameter Y can be chosen without knowing P and, actually, without any knowledge about p.
Theorem 10.5 Let f be a classifying loss such that f" (00) exists and is positive. Then there is
a constant cf > 0 such that for all measurable functions f : X ^ R,
R(sgn(f))  R(fc) < 04,yjs* (f)  Ef (ff ).
If R( fc) = 0, then the bound can be improved to
R(sgn(f))  R( fc) < 04 {E (f)  E (ff )).
4
and
f+(x) := inf{t e R  $+(t) = nx$+ (t) _ (1 _ nx)$_(_t) > 0}.
The convexity of $ implies that f _(x) < f+ (x).
Theorem 10.8 Let $ be a classifying loss function andx e X.
i;
The convex function $x is strictly decreasing on (_rn ,f (x)], strictly increasing on
[ f+(x), +ro), and constant on [ f _(x), f+ (x)].
ii; f$ (x) is a minimizer of $x and can be taken to be any value in
[
fP~(x),f+(x)l
if fp (x) <
Lemma 10.10 Let 0 be a classifying loss such that 0"(0) exists and is positive. Then
there is a constant C = C0 > 0 such that for all x e X,
$(0) $(f0(x)) > C 2) 2 .0(t)  0(0) t
0(0)
< no)
<
Proof. By the definition of 0"(0), there exists some 1 > c0 > 0 such that for all t e [co, co],
d>"(0)
0(0) + 4> (0)t ^j1 lt < 0(t) < ^(0) + 0'(0)t
0'(0)
+ ?LYLt, Vt e[c0, c0].
(10.7)
Let x e X. Consider the case nx > j first.
Denote A = min{(0f (nx  j), c0}. For 0 < t < c0,
,
,
,
,
d>"(0)
0(t) = nx0(t)  (1  nx)^(t) < (2nx  1)4>'(0 ) + 0 "(0 )t
t.
w
Thus, for 0 < t < A < $(00) (nx  1), e have
0(0)
0'(0)
0(t ) < (2 nx < 0 (0 ) 2
3
1 )0 '(0 )
+^
"(0) nx  i) <
Therefore $ is strictly decreasing on the interval [0, A]. But ff ,(x) is its minimal point, so
$(0)  $(f$ (x)) > $(0)  $(A) > ^ L  ^ A.
0 (nx 
2 ).
en
2 ). In both cases, we have
Wh
$(0)
$(f0(x)) >
0(0)
2
1
2
min
0(0)
000)
When (nx  5 ) < co, we have A : W (nx  2 ) > co, we have A = co > 2co (nx
C = min
0(0)c0
{0(0)) 2 1 20 "(0) ('
That is, the desired inequality holds with
The proof for nx < 1 is similar: one estimates the upper bound of $ (t) for t < 0 .
Proof of Theorem 10.5 Denote Xc = {x e X  sgn( f )(x) = fc(x)}. Recall (9.14). Applying the
CauchySchwarz inequality and the fact that pX is a probability measure on X, we get
1/2
1/2
R(sgn( f ))
 'R(fc) <
\ fp(x)\2 dp
Xc
<
d PX
Xc
\ fp(x)\2 dPx
.
We then use Lemma 10.10 and (10.5) to find that
R(sgn(f ))  R( fc) < 2 { C {$(0 )  $(ff(x))) dpx
Let x e Xc. Iffp(x) > 0, then fc(x) = 1 and f (x) < 0. By Theorem 10.8, f~(x) > 0 and $ is
strictly decreasing on (ro,0]. So $(f (x)) > $(0) in this case. In the same way, if fp(x) < 0,
then f (x) > 0. By Theorem 10.8, f+(x) < 0 and $ is strictly increasing on [0, +ro). So $(f (x))
> $(0). Finally, iffp(x) = 0, by (10.6), ff (x) = 0 and then $(0)  $(fp (x)) = 0.
In all three cases we have $ (0)  $ (fp (x)) < $ (f (x))  $ (fp (x)). Hence,
{$(0 )  $(f(x))) dpx <
{$(f (x))  $( f(x))) dpx
X
<
{$(f (x))  $(f(x))) dpx = Ef(f )  *(f).
X
This proves the first desired bound with c$ = 2/VC.
If R( f c) = 0, then y = f c (x ) almost surely and nx = 1 or 0 almost everywhere. This means
that f p (x) = 1 and f p (x )\ = \f p (x) 2 almost everywhere. Using this with respect to relation
(9.14), we see that R(sgn(f))  R( f c) = jXc i f P (x ) i d pX. Then the above procedure yields the
second bound with cf = C4 .
10.2
Since regularized classifiers are obtained by composing the sgn function with a realvalued
function f: X ^ R, we may improve the error estimates by replacing image values of f by their
projection onto [1,1]. This section develops this idea.
Definition 10.11 The projection operator n on the space of measurable functions f : X ^ R
is defined by
1
if f (x) > 1
n(f )(x)
1
if f (x) < 1
f (x) if 1 < f (x) < 1 .
Trivially, sgn(n(f)) = sgn(f). Lemma 10.7 tells us that f(y(n(f ))(x)) < f (yf (x)). Then
f(n(f)) <f(f) and f(n(f)) < f(f).
(10 .8 )
Together with Theorem 10.5, this implies that if f "(0) > 0,
R(sgn(fY))  R(fc) < C*7 Ef(n(fY))  f(ff ).
Thus the analysis for the excess misclassification error o fi s reduced into that for the
excess generalization error f (n(fY))  f (ff ). We carry out the latter analysis in the next
two sections.
The following result is similar to Theorem 8.3.
Theorem 10.12 Let f be a classifying loss, fY be defined by (10.1), and fY e HK. Then
f (n( fY))  f (f ) is bounded by
IY IlK
(f f) + YII f y
+
{{ (f y)  f (f f)]  (f y)  f (f f)]}
(10.9)
{[f (n( f fy))  f (f f)]  [4 (n( f fy))  f (f f)]}} .
Proof. The proof follows from (10.8) using the procedure in the proof of Theorem 8.3.
The function fY e HK in Theorem 10.12 is called a regularizing function. It is arbitrarily chosen and depends on
y. One standard choice is the function
f
(n( f ly)) 
(f y)
(10.10)
D(y
0)
of
fY
In this section we estimate the regularized error V(y, 0) of fY. This estimate follows from
estimates for 0( fy) 0( ff). Define the function * : R+ ^ R+ by
*(t) = max{0 (t)\, 0 + (t)\, 0 (t)\, 0 + (t)}.
Theorem 10.13 Let 0 be a classifying loss. For any measurable function f,
0( f)  0( f00) <
*( f (x)) If (x)  f0(x)\ dp X.
X
If, in addition, 0 e C2(R), then we have
0(f)  0(f0) <
{H0"HL[1.1]
+ II0//IL[f (x), f (x)]} \f (x)  f0(x)\ dpX.
Proof. It follows from (10.3) and (10.4) that
0(f)  0(fp0) =
$(f (x))  $(f0(x)) dpx.
XBy Theorem 10.8, $ is constant on [f (x), f+ (x)]. So we need only bound for those points x
for which the value f (x) is outside this interval.
Iff (x) > f+ (x), then, by Theorem 10.8 and since f+ (x) > 1, $ is strictly increasing on
[f+ (x), f (x)]. Moreover, the convexity of $ implies that
$(f (x)) $(f(x)) < $(f (x)) (f (x) f(x))
< max {f (f (x)), f+ (f (x))}  f (x) f0(x)\
< * (I f (x)l)\f (x) f(x)\.
Similarly, iff (x) < f(x), then, by Theorem 10.8 again and sincef (x) < 1,
$ is strictly decreasing on [f (x), f(x)], and
$(f (x)) $(f(x)) < $+(f (x)) (f (x) f(x)) < *(f (x))\f (x) f(x)\ Thus, we have
$(f (x)) $(f(x)) < *(f (x)) \f (x) f(x)\.
This gives the first bound.
If 0 e C2 (R), so is $. Then $'( ff (x)) = 0 sinceff (x) is a minimum of $. When f (x) > f +
(x), using Taylors expansion,
,
,
,
ff (x)
$(f (x)) $(f(x)) = $'(f0(x))(f (x) f(x)) + (f (x) t) $"(t) dt
f(x)
<11 $ ^L^\f0(x), f (x)]
\f (x) fp(x)\ .
10.4
fY
lY^i=1 f(z i)  E(f) with f the random variable on (Z , p) given by f(z ) = t (yf Y (x)) t (yf t (x)). To
bound this quantity using the Bernstein inequalities, we need to control the variance. We do
so by means of the following constant determined by t and p.
Definition 10.15 The variancing power T = t^,p of the pair (t, p) is defined to be the
maximal number T in [0 ,1 ] such that for some constant C1 > 0 and any measurable function f :
X ^ [1,1],
E{ (t(yf (x))  $(yft(x))f\ < Cx {*( f)  t( ft)) .
(10.12)
2
Since (10.12) always holds with T = 0 and C1 = (t(1)) , the variancing power Tt,p is well
defined.
Example 10.16 For fys(t) = (1  t)2 we have T$, p = 1 for any probability measure .
Proof. For tls(t) = (1  t)2 we know that tls(yf (x)) = (y  f (x))2 and f/ = fp. Hence (10.12) is
valid with T = 1 and C1 = sup(x,y)eZ
(y  f (x) + y  fp (x))2 < 16.
quantity
fy Ile), t(\\ fy lloo)}( fy)
E H fy )  E H ft)
5By + 2t(l) 3m
is bounded by
l0g
(j
+
2 Ci log (2/8) m
1/(2T)
+ E t ( fy )  E t ( ft ).
Therefore, f (a) < f (a*) = 0 for all a e R+. This is true for any b > 0. So the inequality holds.
Proof. Write the random variable h(z ) = (yf Y (x ))  (yft (x)) on (Z , p) as h = H1 + &, where
Hi := t(yfy(x))  t(yn(fy)(x)), %2 := t(yn(fY)(x))  tfX).
(10.13)
Prob
zsZm
l
m
m
) E(Hi)
i=i
>
< exp
2
(a2(Hi) + 3BY)
The first part f1 is a random variable satisfying 0 < f1 < BY .Applying the oneside
Bernstein inequality to f1, we obtain, for any e > 0,
Solving the quadratic equation for e given by
me2
2
7
\ = log
8
2 a2 (fi) + 3BYe
we see that for any 0 < 8 < 1, there exists a subset U1 of Zm with measure at least 1  2 such
that for every z e U1,
)  E(0
i=1
2
1 BY log(2/8) + J(iBY log(2/5))2 + 2ma &) log(2 /8)
2
m
m
But a2(^1) < E(f2) < BYE(f1). Therefore,
+ E(1 ), Vz e U1 .
3m
m t ^)  E(5o < 5 BY log(2/8)
i=1
1
m
7 q 1 q*
a b < aq +bq , Va, b > 0.
q
q*
<
2 log(2 /8 )a 2 fe) m
20 (1 )
log(2 /8 ) 3m
m
)  Efe)
i=1
Next we consider f2. This is a random variable bounded by $ (1). Applying the oneside
Bernstein inequality as above, we obtain another subset U2 of Zm with measure at least 1  
such that for every z e U2,
The definition of the variancing power T gives a2(^2) < E(f) < C1(E(f2)r. Applying Lemma
10.17 to q =
,q* = 7 , a = y/2log(2/8)C1/m, and
2 log(2 /8 )a 2 fe) m
^  ^ ^2log(2/8)C1
1/(2t)
+ ^ Efe).
b = V{E(f2)}r, we obtain
1
m
m
)  Efe)
i=1
<
20 (1 ) log(2 /8 ) 3m
2 log(2 /8 )C1
m
1/(2T)
+ E(&).
Hence, for all z e U2 ,Combining these inequalities for f1 and %2 with the fact that E(f 1) +
E(%2) = E(f) = ^( fY)  $( fp ), we conclude that for all z e U1 n U2,
<
5B V
10.5
fP
<P\
4(n(ft))  4(f4)
Ez (n( fZ\y))  Ez (fp) , involves the function fz Y and thus runs over a set of functions. To
bound it, we use  as we have already done in similar cases  a probability inequality for a
function set in terms of the covering numbers of the set.
The following probability inequality can be proved using the oneside Bernstein inequality
as in Lemma 3.18.
Prob
zsZm
d  m mkiM
V XT + eT
1_ T
e1 2
< exp
me2 T
2(c + 3 Be1T)
Lemma 10.19 Let % be a random variable on Z with mean x and variance a 2. Assume
that x > 0, \% x\< B almost everywhere, and a2 < cxT for some 0 < T < 2 and c, B >
0. Then, for every e > 0,
holds.
Also, the following inequality for a function set can be proved in the same way as Lemma
3.19.
Lemma 10.20 Let 0 < T < 1, c, B > 0, and G be a set of functions on Z such that for every
g e G, E(g) > 0, g  E(g)\\^oo < B, and E(g2) < c(E(g ))T .
Prob
zeZm
sup
geG
E(g )  mz
m=i g (z i
) V(E(g ))T +
eT
>
4e1 T/ 2
me2 T
2(c + 3 Be1T)
< N(G, e) exp
Then, for all e > 0,
We can now derive the sample error bounds along the same lines we followed in the
previous chapter for the regression problem.
Prob
zeZm
sup
f eBR
\E*(n(D)  Ef(f)  f(n(f ))  *(f) (Ef(n(f ))  Ef(f*))' + eT
< 4e1 T / 2
> 1  N Bi,
e
Rf'(1)\
exp
me
2T
2 C1 + 3 f(1)e
1 T
+2
1/(2r)
Hn(ftY)')  EH
2r
logN Bi,
me
'RW (1 )U2Ci + 3 ft(1 )e
< log 8 .
(10.15)
the quantity
E1Tf t (fY )  Ef t (ft )  Ef t (fY )  Ef t (ft )
is bounded by
log (4/8) +
1/(2 T )
+ E* (fY )
 E* (ft ).
m
2 C1 log (4/8)
5BY + 24(1) 3m
Proof. Lemma 10.17 implies that for 0 < T < 1,
Combining these two bounds with (10.9), we see that for z e W(R), with confidence 1 5,
(n( fty )) * (ft ))
(2 C1 log (4/5) \1/(2'T)
E*(n(fty)) t(ft) + Y II fy IlK < D(Y, t) + 2
+ 12e*(m, R, 5/2) + B + 2 (^ log (4/5) +
+ t (fy) t (ft).
This gives the desired bound.
i=1
By Lemma 10.23, W(V<t(0)/Y) = Zm. Taking R := Vt(0)7y", we can derive a weak error
bound, as we did in Section 8.3. But we can do better. Abound for the norm  ftY K improving
that of Lemma 10.23 can be shown to hold with high probability. To show this is the target of
the next section. Note that we could now wrap the results in this and the two preceding
sections into a single statement bounding the excess misclassification error ^(sgn(ftY))
R(fc). We actually do that, in Corollary 10.25, once we have obtained a better bound for the
norm  ftY K.
10.6
In this section we derive bounds for t (n( fty))t (ft), improving those that would follow
from the preceding sections, at the cost of a few mild assumptions.
Theorem 10.24 Assume the following with positive constants p, C0, C^, A, q > 1 , and /3 <
1.
ii; (i) K satisfies logN (B\, n) < C0(1/n)p. f(t) < C'f\t\q for all t (1,1).
iii; V(y, 0) < Ayp for each y > 0.
Choose Y = mZ with Z = p+q(1p)/2. Then, for all 0 < n < 2 and all 0 < 8 < 1, with
confidence 1  8,
2
1  pC r
p + q(1  P)/2 2  T + p
s :=
2(1 + p)
Z  1/(2  T + p) 2(1  s)
1P
1 2
and Cn is a constant depending on n and the constants in conditions (i)(iii), but not
on m or 8.
The following corollary follows from Theorems 10.5 and 10.24.
Corollary 10.25 Under the hypothesis and with the notations of Theorem 10.24, if
f"(0) > 0, then, with confidence at least 1  8, we have
R(sgn( fiy ))  R( fc) < c^JCn log 2 m0.
When the kernel is C on X c Rn, we know (cf. Theorem 5.1(i)) that p in Theorem 10.24(i)
can be arbitrarily small. We thus get the following result.
Corollary 10.26 Assume that K is C onX x X and f(t) < Cf\t\q for all t ^ (1,1) and
some q > 1. If V(y, f) < Ay? for all Y > 0 and some 0 < p < 1, choose Y = mZ with Z =
p+qd_p )/2. Then for any 0 < n < 1 and 0 < 8 < 1, with confidence 1  8,
2
Theorem 10.2 follows from Corollary 10.26, Corollary 10.14, and Theorem 9.21 by taking
fY = f defined in (10.10).
Theorem 10.3 is a consequence of Corollary 10.26 and Theorem 10.5. In this case, in
addition, we can take q = 2 , which implies Z = 1 .
The proof of Theorem 10.24 will follow from several lemmas. The idea is to find a radius R
such that W (R) is close to Zm with high probability.
First we establish a bound for the number e* (m, R, 8).
Lemma 10.27 Assume K satisfies log N (B1, n) < C0(1 /nf for some p > 0. Then for R > 1
and 0 <8 < 2 ,the quantity e*(m, R, 8) defined by (10.15) can be bounded by
e*(m,R,8) < C2 (^
(
RP \1/(1+ ) RP \!/(2 +p)
where C2 := (60(1) + 8 C1 + 1 )(C0 + 1)(0/(1) + 1).
Proof. Using the covering number assumption, we see from (10.15) that e*(m, R, 8) < A, where
A is the unique positive number e satisfying
log 8.
/R<^(1 )lY
me2T
p
V e /  2 C1 + 4&(1)e1T
We can rewrite this equation as
m
e2T +p  44(1) log(1/8) e +p  2 C1 log(1/8)ep 3m
 4<p(~l)C ^,(1)iy lt  2CCo (R^(1)iy = 0 .
3mm
A < max
160(1) log(1/8) 3m
1/(2r)
8 C1 log(1/8A
m
/16<M1)C0
3m
1/(1+p)
(Rl^Dlf) ' , 8CC {Rl^(1)\)p
1/(2T +p) '
m
Applying Lemma 7.2 with d = 4 to this equation, we find that the solution A satisfies
/(
A
2
Z, Z
1 q
 +  (1
2 + 4(
P)
1
2
q/
+2
2 C1
log (4/8)\
1/(2t)
Since $(t )
<
max{0(
),$(\\
fY
C^ \t \q
fY
BY
fY
R>
V .
+24C2m1/(2T +P)RP/(1+P) + ^C^CqKAq/2YV(P1)/2
Taking Y = m Z, we see that we can choose
/
log
+ 1,
+ C3
log 4
m r.
.
4\ t / 2
Z1
2 (2 T + p)
Jn := log2 max
Lemma 10.29 Under the assumptions of Theorem 10.24, take Y = mZ for some Z > 0
andletm > (C2KA)l/(z(l^')'1. Then,forany n > 0 and0 < 8 < 1, the set W(R*) has
measure at least l  Jn8, where R* = C4mr ,
and
r* := max
Z 1/(2 T + p) , r,
+n
1
2(1 s)
The constant C4 is given by
C4 = (STC^) 2 (f(0) + l) + Jn (5yC2) 2 C3 (log 4 )' .
Proof. Let J be a positive integer that will be determined later. Define a sequence {R( j) }j=0 by
R( 0 ) =
(0)/Y and Rj = am (R( jr> )s + bm for
(J
R )
(a
m)
1+s+s2++sJ1
J1
+ (Rm)1+s+s2++si1 j=0
m
0
Tr
(0(0))
/2
m2 s ,
m2
2 ( 1 s)
(LZ1/(2T +P) \ SJ
/.
1 2
Combining the bounds for the two terms, we have R(J} <
5TC^)2 (0(0) + 1) + J (5/C2)2
/
r
J
C3 (log(4/5)) m .Taking J to be Jn, we have 2 > max{Z, (2x+P)n} and we finish the proof.
The proof of Theorem 10.24 follows from Lemmas 10.27 and 10.29 and Proposition 10.22.
The constant Cn can be explicitly obtained.
1 2
There is a difference in the learning rates given by Theorem 10.2 (where the best rate is 1  e)
and Theorem 9.26 (where the rate can be arbitrarily close to 1). This motivates the idea of
improving the learning rates stated in this chapter by imposing some conditions on the
measures. In this section we introduce one possible such condition.
Definition 10.30 Let 0 < q <. We say that p has Tsybakov noise exponent q if there exists a
constant cq > 0 such that for all t > 0 ,
px ({x e X :  fp(x)\ < cqt}) < tq.
(10.18)
All distributions have at least noise exponent 0 since t0 = 1. Deterministic distributions
(which satisfy  fp(x)\ = 1 ) have noise exponent q = with
c
= 1.
The Tsybakov noise condition improves the variancing power r^,p. Let us show this for the
hinge loss.
Lemma 10.31 Let 0 < q < .Ifp has Tsybakov noise exponent qwith (10.18) valid, then,
for every function f : X ^[1,1],
t
i
/ 1 \q/(q+ 1 )
( 11 )
*h(f )  *h (f)q/(q+ 1
holds.
Proof. Since f (x) e [1,1], we have $h(yf (x))  $h(yfc (x)) = y( fc(x)  f (x)). It follows that
(f ) (fc) = (fc(x)  f (x))fp(x) dpx =
\ fc(x) f (x) \ fp(x)\dpx
XX
and
E{ ($h(yf (x))  $h(yfc(x)))2} = J \ fc (x)  f (x)\2 d px.
Let t > 0 and separate the domain X into two sets: X+ := { x e X :  fp(x)\ > cqt} and X: = { x e X :  fp(x)\ < cqt}. On X+ we have  fc(x)  f (x) 2 < 2 fc(x)  f (x) 1 fpc^. On X we
have  fc(x)  f (x) 2 < 4. It follows from
c
qt
(10.18) that
h
h
f
2 (E^ (f )  E* (f c))
2
c qt
h
2 (E* (f )  E(f c)) c qt
 f c(x )  f (x)2 d PX <
^ + 4PX(X t  )
X
q
+ 4t .
Choosing t = { (E^ h ( f )  E^h(fc))/(2cq)}l/i'q+V} yields the desired bound.
Lemma 10.31 tells us that the variancing power t^h,P of the hinge loss equals q+Y when
the measure p has Tsybakov noise exponent q. Combining this with Corollary 10.26 gives the
following result on improved learning rates for measures satisfying the Tsybakov noise
condition.
Theorem 10.32 Under the assumption of Theorem 10.2, if p has Tsybakov noise
exponent q with 0 < q <<x, then, for any 0 < e < 2 and 0 < 8 < 1, with confidence 1 8, we have
2 1
R(sgn(fz?))  R( fc) < C log  8m
where 0 = min j f+ p, q^  e J and C is a constant independent ofm and 8.
1
In Theorem 10.32, the learning rate can be arbitrarily close to 1 when q is sufficiently
large.
10.8
General expositions of convex loss functions for classification can be found in [14,31].
Theorem 10.5, the use of the projection operator, and some estimates for the regularized
error were provided in [31]. The error decomposition for regularization schemes was
introduced in [145].
The convergence of the support vector machine (SVM) 1norm soft margin classifier for
general probability distributions (without separability conditions) was established in [121]
when HK is dense in C(X) (such a kernel K is called
universal). Convergence rates in this situation were derived in [154]. For further results and
references on convergence rates, see the thesis [140].
The error analysis in this chapter is taken from [142], where more technical and better
error bounds are provided by means of the local Rademacher process, empirical covering
numbers, and the entropy integral [84, 132]. The Tsybakov noise condition of Section 10.7
was introduced in [131].
The iteration technique used in the proof of Lemma 10.29 was given in [122] (see also
[144]).
SVMs have many modifications for various purposes in different fields [134]. These include
qnorm soft margin classifiers [31,77], multiclass SVMs [4, 32, 75, 139], vSVMs [108], linear
programming SVMs [26, 96, 98, 146], maximum entropy discrimination [65], and oneclass
SVMs [107, 128].
We conclude with some brief comments on current trends.
Learning theory is a rapidly growing field. Many people are working on both its
foundations and its applications, from different points of view. This work develops the theory
but also leaves many open questions. Here we mention some involving regularization
schemes [48].
i; Feature selection. One purpose is to understand structures of highdimensional data. Topics
include manifold learning or semisupervised learning [15, 23, 27, 34, 45, 97] and
dimensionality reduction (see the introduction [55] of a special issue and references
therein). Another purpose is to determine important features (variables) of functions
defined on hugedimensional spaces. Two approaches are the filter method and the
wrapper method [69]. Regularization schemes for this purpose include those in [56, 58]
and a least squarestype algorithm in [93] that learns gradients as vectorvalued functions
[89].
fz,Y ,S
arginf inf
aeS
1
feH
Ka
68 ,
134], and their error with respect to the step size has been
analyzed for the least squares regression in [112 ] and for regularized classification
with a general classifying loss in [151]. Error analysis for online schemes with
varying regularization parameters is performed in [127] and
References
[149].
1; R.A. Adams. Sobolev Spaces. Academic Press, 1975.
2; C.A. Aliprantis and O. Burkinshaw. Principles of Real Analysis. Academic Press, 3rd
edition, 1998.
3; F. Alizadeh and D. Goldfarb. Secondorder cone programming. Math. Program., 95:351,
2003.
4; E.L. Allwein, R.E. Schapire, and Y. Singer. Reducing multiclass to binary: a unifying
approach for margin classifiers. J. Mach. Learn. Res., 1:113141, 2000.
5; N. Alon, S. BenDavid, N. CesaBianchi, and D. Haussler. Scalesensitive dimensions,
uniform convergence and learnability. J. ACM, 44:615631, 1997.
6; M. Anthony and P. Bartlett. Neural Network Learning: Theoretical Foundations.
Cambridge University Press, 1999.
7; M. Anthony and N. Biggs. Computational Learning Theory. Cambridge University Press,
1992.
8; M. Anthony and J. ShaweTaylor. A result of Vapnik with applications. Discrete Appl.
Math., 47:207217, 1993.
9; N. Aronszajn. Theory of reproducing kernels. Trans. Amer. Math. Soc., 68:337404, 1950.
10; A.R. Barron. Complexity regularization with applications to artificial neural networks. In G.
Roussas, editor, Nonparametric Functional Estimation, pages 561576. Kluwer
Academic Publishers 1990.
11; R.G. Bartle. The Elements of Real Analysis. John Wiley & Sons, 2nd edition, 1976.
12; P.L. Bartlett. The sample complexity of pattern classification with neural networks: the size
of the weights is more important than the size of the network. IEEE Trans. Inform.
Theory, 44:525536, 1998.
13; P.L. Bartlett, O. Bousquet, and S. Mendelson. Local Rademacher complexities. Ann. Stat.,
33:14971537, 2005.
14; P.L. Bartlett, M.I. Jordan, and J.D. McAuliffe. Convexity, classification, and risk bounds. J.
Amer. Stat. Ass., 101:138156, 2006.
15; M. Belkin and P. Niyogi. Semisupervised learning on Riemannian manifolds. Mach.
Learn., 56:209239, 2004.
16; J. Bergh and J. Lofstrom. Interpolation Spaces: An Introduction. SpringerVerlag, 1976.
17; P. Binev, A. Cohen, W. Dahmen, R. DeVore, and V. Temlyakov. Universal algorithms for
learning theory. Part I: piecewise constant functions. J. Mach. Learn. Res., 6:12971321,
2005.
18; C.M. Bishop. Neural Networks for Pattern Recognition. Cambridge University Press,
1995.
19; L. Blum, F. Cucker, M. Shub, and S. Smale. Complexity and Real Computation.
SpringerVerlag, 1998.
20; B.E. Boser, I. Guyon, and V. Vapnik. A training algorithm for optimal margin classifiers. In
Proceedings of the Fifth Annual Workshop of Computational Learning Theory,
pages 144152. Association for Computing Machinery, New York, 1992.
21; S. Boucheron, O. Bousquet, and G. Lugosi. Concentration inequalities. In O. Bousquet, U.
von Luxburg, and G. Ratsch, editors, Advanced Lectures in Machine Learning, pages
208240. SpringerVerlag, 2004.
22; S. Boucheron, G. Lugosi, and P. Massart. A sharp concentration inequality with applications
in random combinatorics and learning. Random Struct. Algorithms, 16:277292, 2000.
23; O. Bousquet, O. Chapelle, and M. Hein. Measure based regularizations. In S. Thrun, L.K.
Saul, and B. Scholkopf, editors, Advances in Neural Information Processing
Systems, volume 16, pages 12211228. MIT Press, 2004.
24; O. Bousquet and A. Elisseeff. Stability and generalization. J. Mach. Learn. Res., 2:499526, 2002.
25; S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
26; P.S. Bradley and O.L. Mangasarian. Massive data discrimination via linear support vector
machines. Optimi. Methods and Softw., 13:110, 2000.
27; A. Caponnetto andS. Smale. Risk bounds for random regression graphs. To appear at
Found Comput. Math.
28; N. CesaBianchi, P.M. Long, and M.K. Warmuth. Worstcase quadratic loss bounds for
prediction using linear functions and gradient descent. IEEE Trans. Neural Networks,
7:604619, 1996.
29; N. CesaBianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University
Press, 2006.
30; O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee. Choosing multiple parameters for
support vector machines. Mach. Learn., 46:131159, 2002.
31; D.R. Chen, Q. Wu, Y. Ying, and D.X. Zhou. Support vector machine soft margin classifiers:
error analysis. J. Mach. Learn. Res., 5:11431175, 2004.
32; D.R. Chen and D.H. Xiang. The consistency of multicategory support vector machines.
Adv. Comput. Math., 24:155169, 2006.
33; D.A. Cohn, Z. Ghahramani, and M.I. Jordan. Active learning with statistical models. J. Artif.
Intell. Res., 4:129145, 1996.
34; R.R. Coifman, S. Lafon, A.B. Lee, M. Maggioni, B. Nadler, F. Warner, and S.W. Zucker.
Geometric diffusions as a tool for harmonic analysis and structure definition of data:
diffusion maps. Proc. Natl. Acad. Sci., 102:74267431, 2005.
35; C. Cortes and V. Vapnik. Supportvector networks. Mach. Learn., 20:273297, 1995.
36; D.D. Cox. Approximation of least squares regression on nested subspaces. Ann. Stat.,
16:713732, 1988.
37; N. Cristianini and J. ShaweTaylor. An Introduction to Support Vector Machines.
Cambridge University Press, 2000.
38; F. Cucker and S. Smale. Best choices for regularization parameters in learning theory.
Found. Comput. Math., 2:413428, 2002.
39; F. Cucker and S. Smale. On the mathematical foundations of learning. Bull. Amer. Math.
Soc., 39:149, 2002.
40; L. Debnath and P. Mikusinski. Introduction to Hilbert Spaces with Applications.
Academic Press, 2nd edition, 1999.
41; C. de Boor, K. Hollig, and S. Riemenschneider. Box Splines. SpringerVerlag, 1993.
42; E. De Vito, A. Caponnetto, and L. Rosasco. Model selection for regularized leastsquares
algorithm in learning theory. Found. Comput. Math., 5:5985, 2005.
43; E. De Vito, L. Rosasco, A. Caponnetto, U. de Giovannini, and F. Odone. Learning from
examples as an inverse problem. J. Mach. Learn. Res., 6:883904, 2005.
44; L. Devroye, L. Gyorfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition.
SpringerVerlag, 1996.
45; D.L. Donoho and C. Grimes. Hessian eigenmaps: locally linear embedding techniques for
highdimensional data. Proc. Natl. Acad. Sci., 100:55915596,
2003.
46; R.M. Dudley, E. Gine, and J. Zinn. Uniform and universal GlivenkoCantelli classes. J.
Theor. Prob., 4:485510, 1991.
47; D.E. Edmunds and H. Triebel. Function Spaces, Entropy Numbers, Diferential
Operators. Cambridge University Press, 1996.
48; H.W. Engl, M. Hanke, and A. Neubauer. Regularization of Inverse Problems, volume
375 of Mathematics and Its Applications. Kluwer, 1996.
49; T. Evgeniou and M. Pontil. Regularized multitask learning. In C.E. Brodley, editor, Proc.
17th SIGKDD Conf. Knowledge Discovery and Data Mining, Association for
Computing Machinery, New York, 2004.
50; T. Evgeniou, M. Pontil, andT. Poggio. Regularization networks and support vector
machines. Adv. Comput. Math., 13:150, 2000.
51; J. Forster and M.K. Warmuth. Relative expected instantaneous loss bounds. J. Comput.
Syst. Sci., 64:76102, 2002.
52; Y. Freund and R.E. Shapire. A decisiontheoretic generalization of online learning and an
application to boosting. J. Comput. Syst. Sci., 55:119139, 1997.
53; F. Girosi, M. Jones, and T. Poggio. Regularization theory and neural networks architectures.
Neural Comp., 7:219269, 1995.
54; G. Golub, M. Heat, and G. Wahba. Generalized crossvalidation as a method for choosing a
good ridge parameter. Technometrics, 21:215223, 1979.
55; I. Guyon and A. Ellisseeff. An introduction to variable and feature selection. J. Mach.
Learn. Res., 3:11571182, 2003.
56; I. Guyon, J. Weston, S. Barnhill, and V. Vapnik. Gene selection for cancer classification
using support vector machines. Mach. Learn., 46:389422, 2002.
89; C.A. Micchelli and M. Pontil. On learning vectorvalued functions. Neural Comp., 17:177204, 2005.
90; C.A. Micchelli, M. Pontil, Q. Wu, and D.X. Zhou. Error bounds for learning the kernel.
Preprint, 2006.
91; M. Mignotte. Mathematics for Computer Algebra. SpringerVerlag, 1992.
92; T.M. Mitchell. Machine Learning. McGrawHill, 1997.
93; S. Mukherjee and D.X. Zhou. Learning coordinate covariances via gradients. J. Mach.
Learn. Res., 7:519549, 2006.
94; FJ. Narcowich, J.D. Ward, and H. Wendland. Refined error estimates for radial basis function
interpolation. Constr. Approx., 19:541564, 2003.
95; P. Niyogi. The Informational Complexity of Learning. Kluwer Academic Publishers,
1998.
96; P. Niyogi and F. Girosi. On the relationship between generalization error, hypothesis
complexity and sample complexity for radial basis functions. Neural Comput., 8:819842, 1996.
97; P. Niyogi, S. Smale, and S. Weinberger. Finding the homology of submanifolds with high
confidence from random samples. Preprint, 2004.
98; J.P. Pedroso and N. Murata. Support vector machines with different norms: motivation,
formulations and results. Pattern Recognit. Lett., 22:12631272, 2001 .
99; I. Pinelis. Optimum bounds for the distributions of martingales in Banach spaces. Ann.
Probab., 22:16791706, 1994.
100; A. Pinkus. Nwidths in Approximation Theory. SpringerVerlag, 1996.
101; A. Pinkus. Strictly positive definite kernels on a real inner product space. Adv. Comput.
Math., 20:263271, 2004.
102; T. Poggio, V. Torre, and C. Koch. Computational vision and regularization theory. Nature,
317:314319, 1985.
103; D. Pollard. Convergence of Stochastic Processes. SpringerVerlag, 1984.
104; A. Rakhlin, D. Panchenko, and S. Mukherjee. Risk bounds for mixture density estimation.
ESAIM: Prob. Stat., 9:220229, 2005.
105; R. Schaback. Reconstruction of multivariate functions from scattered data. Manuscript,
1997.
106; I.J. Schoenberg. Metric spaces and completely monotone functions. Ann. Math., 39:811841, 1938.
107;B. Scholkopf and A.J. Smola. Learning with Kernels: Support Vector Machines,
Regularization, Optimization, and Beyond. MIT Press, 2002.
108; B. Scholkopf, A.J. Smola, R.C. Williamson, andP.L. Bartlett. New support vector algorithms.
Neural Comp., 12:12071245, 2000.
109;I.R. Shafarevich. Basic Algebraic Geometry. 1: Varieties in Projective Space.
SpringerVerlag, 2nd edition, 1994.
110; J. ShaweTaylor, P.L. Bartlet, R.C. Williamson, and M. Anthony. Structural risk minimization
over data dependent hierarchies. IEEE Trans. Inform. Theory, 44:19261940, 1998.
111; J. ShaweTaylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge
University Press, 2004.
112; S. Smale and Y. Yao. Online learning algorithms. Found. Comput. Math., 6:145170,
2006.
113; S. Smale and D.X. Zhou. Estimating the approximation error in learning theory. Anal.
Appl., 1:1741, 2003.
114; S. Smale and D.X. Zhou. Shannon sampling and function reconstruction from point values.
Bull. Amer. Math. Soc., 41:279305, 2004.
115; S. Smale and D.X. Zhou. Shannon sampling II: Connections to learning theory. Appl.
Comput. Harmonic Anal., 19:285302, 2005.
116; S. Smale and D.X. Zhou. Learning theory estimates via integral operators and their
approximations. To appear in Constr. Approx.
117; A. Smola, B. Scholkopf, and R. Herbricht. A generalized representer theorem. Comput.
Learn. Theory, 14:416426, 2001.
118; A. Smola, B. Scholkopf, and K.R. Muller. The connection between regularization operators
and support vector kernels. Neural Networks, 11:637649, 1998.
119; M. Sousa Lobo, L. Vandenberghe, S. Boyd, andH. Lebret. Applications of second order
cone programming. Linear Algebra Appl., 284:193228, 1998.
120;E.M. Stein. Singular Integrals and Diferentiability Properties of Functions.
Princeton University Press, 1970.
121; I. Steinwart. Support vector machines are universally consistent. J. Complexity, 18:768791, 2002.
122; I. Steinwart and C. Scovel. Fast rates for support vector machines. In P. Auer and R. Meir,
editors, Proc. 18th Ann. Conf. Learn. Theory, pages 279294, Springer
2005.
123; H.W. Sun. Mercer theorem for RKHS on noncompact sets. J. Complexity, 21:337349,
2005.
124; R.S. Sutton andA.G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
125; J.A.K. Suykens, T. Van Gestel, J. De Brabanter, B. De Moor, and J. Vandewalle. Least
Squares Support Vector Machines. World Scientific, 2002.
126; M. Talagrand. New concentration inequalities in product spaces. Invent. Math., 126:505563, 1996.
127; P. Tarres and Y. Yao. Online learning as stochastic approximations of regularization paths.
Preprint, 2005.
128; D.M.J. Tax and R.P.W. Duin. Support vector domain description. Pattern Recognit. Lett.,
20:11911199, 1999.
129;M.E. Taylor. Partial Diferential Equations I: Basic Theory, volume 115 of Applied
Mathematical Sciences. SpringerVerlag, 1996.
130; A.N. Tikhonov and V.Y. Arsenin. Solutions of IllPosed Problems. W.H. Winston, 1977.
131; A.B. Tsybakov. Optimal aggregation of classifiers in statistical learning. Ann. Stat.,
32:135166, 2004.
132; A.W. van der Vaart and J.A. Wellner. Weak Convergence and Empirical Processes.
SpringerVerlag, 1996.
133;V.N. Vapnik. Estimation of Dependences Based on Empirical Data. SpringerVerlag,
1982.
134; V. Vapnik. Statistical Learning Theory. John Wiley & Sons, 1998.
135; V. N. Vapnik and A. Ya. Chervonenkis. On the uniform convergence of relative frequencies
of events to their probabilities. Theory Prob. Appl., 16:264280,1971.
136; M. Vidyasagar. Learning and Generalization. SpringerVerlag, 2003.
137;G. Wahba. Spline Models for Observational Data. SIAM, 1990.
138; G. Wahba. Support vector machines, reproducing kernel Hilbert spaces and the
randomized GACV. In B. Scholkopf, C. Burges, and A. Smola, editors, Advances in Kernel
Methods  Support Vector Learning, pages 6988. MIT Press, 1999.
139; J. Weston and C. Watkins. Multiclass support vector machines. Technical Report CSDTR9804, Department of Computer Science, Royal Holloway, University of London, 1998.
140; Q. Wu. Classification and regularization in learning theory. PhD thesis, City University of
Hong Kong, 2005.
141; Q. Wu, Y. Ying, and D.X. Zhou. Learning theory: from regression to classification. In K.
Jetter, M. Buhmann, W. Haussmann, R. Schaback, and J. Stoeckler, editors, Topics in
Multivariate Approximation and Interpolation, volume 12 of Studies in
Computational Mathematics, pages 257290. Elsevier, 2006.
142; Q. Wu, Y. Ying, and D.X. Zhou. Multikernel regularized classifiers. To appear in J.
Complexity.
143; Q. Wu, Y. Ying, and D.X. Zhou. Learning rates of leastsquare regularized regression.
Found. Comput. Math., 6:171192, 2006.
144; Q. Wu and D.X. Zhou. SVM soft margin classifiers: linear programming versus quadratic
programming. Neural Comp., 17:11601187, 2005.
145; Q. Wu and D.X. Zhou. Analysis of support vector machine classification. J. Comput. Anal.
Appl., 8:99119, 2006.
146; Q. Wu and D.X. Zhou. Learning with sample dependent hypothesis spaces. Preprint, 2006.
147; Z. Wu and R. Schaback. Local error estimates for radial basis function interpolation of
scattered data. IMA J. Numer. Anal., 13:1327, 1993.
148; Y. Yang and A. R. Barron. Informationtheoretic determination of minimax rates of
convergence. Ann. Stat., 27:15641599, 1999.
149; G.B. Ye and D.X. Zhou. Fully online classification by regularization. To appear at Appl.
Comput. Harmonic Anal.
150; Y. Ying and D.X. Zhou. Learnability of Gaussians with flexible variances. To appear at J.
Mach. Learn. Res.
151; Y. Ying and D.X. Zhou. Online regularized classification algorithms. To appear in IEEE
Trans. Inform. Theory, 52:47754788, 2006.
152; T. Zhang. On the dual formulation of regularized linear systems with convex risks.
Machine Learning, 46:91129, 2002.
153; T. Zhang. Leaveoneout bounds for kernel methods. Neural Comp., 15:13971437, 2003.
154; T. Zhang. Statistical behavior and consistency of classification methods based on convex
risk minimization. Ann. Stat., 32:5685, 2004.
155; D.X. Zhou. The covering number in learning theory. J. Complexity, 18:739767, 2002.
156; D.X. Zhou. Capacity of reproducing kernel spaces in learning theory. IEEE Trans. Inform.
Index
p
12(XN), 90 i (XN), 90 Lip(s), 73 Lip*(s, C(X)), 74 Lip*(s, L (X)), 74 LK , 56 LP(X ), 18 L (X ), 19
Lz, 10
M(S, n), 101 N (S, n),37 O (n), 30 00, 162 0h, 165 0ls, 162 ns(K), 84 R(f ) 5 P, 5
PX , 6 p(yx), 6 sgn, 159
P2,6 T, 8
7x ,5 Xr,t, 74 Y, 5 Z ,5 Zm, 8
Bayes rule, 160 Bennetts inequality, 38 Bernsteins inequality, 40, 42 best fit, 2
biasvariance problem, 13, 127 bounded linear map, 21 box spline, 28
Chebyshevs inequality, 38 classification algorithms, 160 classifier binary, 157 regularized,
164 compact linear map, 21 confidence, 42 constraints, 33 convex function, 33 convex
programming, 33 convex set, 33 convolution, 19 covering number, 37
defect, 10 distortion, 110 divided difference, 74 domination of measures, 110
efficient algorithm, 33 enet, 78 ERM, 50 error
approximation, 12
approximation (associated with ty), 70 empirical, 8
empirical (associated with p), 162 empirical (associated with ty), 51 excess generalization, 12
excess misclassification, 188 generalization, 5
generalization (associated with p), 162
generalization (associated with ty), 51
in H, 11
local, 6, 161
misclassification, 157
regularized, 134 regularized (associated with p), 162 regularized empirical, 134 regularized
empirical (associated with
P), 162
expected value, 5
feasible points, 33 feasible set, 33 feature map, 70 Fourier coefficients, 55 Fourier transform,
19 nonnegative, 26 positive, 26 full measure, 10 function
completely monotonic, 21 even, 26 measurable, 19
generalized Bennetts inequality, 40, 42 Gramian, 22
Hoeffdings inequality, 40, 42 homogeneous polynomials, 17, 29 hypothesis space, 9 convex,
46
interpolation space, 63
Kfunctional, 63 kernel, 56 box spline, 28 dot product, 24 Mercer, 22 spline, 27
translation invariant, 26 universal, 212
Lagrange interpolation polynomials, 84
Lagrange multiplier, 151
least squares, 1, 2
left derivative, 34
linear programming, 33
localizing function, 190
loss