

Learning Theory

An Approximation Theory Viewpoint


Felipe Cucker and Ding-Xuan Zhou



Series Editors


The Cambridge Monographs on Applied and Computational Mathematics reflect the crucial role of mathematical and computational techniques in
contemporary science. The series publishes expositions on all aspects of applicable and numerical
mathematics, with an emphasis on new developments in this fast-moving area of research.
State-of-the-art methods and algorithms as well as modern mathematical descriptions of physical and
mechanical ideas are presented in a manner suited to graduate research students and professionals alike.
Sound pedagogical presentation is a prerequisite. It is intended that books in the series will serve to inform a
new generation of researchers.
Within the series will be published titles in the Library of Computational Mathematics, published under the
auspices of the Foundations of Computational Mathematics organisation. Learning Theory: An Approximation
Theory Viewpoint is the first title within this new subseries.
The Library of Computational Mathematics is edited by the following editorial board: Felipe Cucker
(Managing Editor), Ron DeVore, Nick Higham, Arieh Iserles, David Mumford, Allan Pinkus, Jim Renegar, Mike

Also in this series:

A Practical Guide to Pseudospectral Methods, Bengt Fornberg
Dynamical Systems and Numerical Analysis, A. M. Stuart and A. R. Humphries
Level Set Methods, J. A. Sethian
The Numerical Solution of Integral Equations of the Second Kind, Kendall E. Atkinson
Orthogonal Rational Functions, Adhemar Bultheel, Pablo Gonzalez-Vera, Erik Hendriksen, and Olav Njastad
Theory of Composites, Graeme W. Milton
Geometry and Topology for Mesh Generation, Herbert Edelsbrunner
Schwarz-Christoffel Mapping, Tobin A. Driscoll and Lloyd N. Trefethen
High-Order Methods for Incompressible Fluid, M. O. Deville, E. H. Mund, and P. Fisher
Practical Extrapolation Methods, Avram Sidi
Generalized Riemann Problems in Computational Fluid Dynamics, M. Ben-Artzi and J. Falcovitz
Radial Basis Functions, Martin Buhmann

Learning Theory: An Approximation Theory Viewpoint

Felipe Cucker
City University of Hong Kong

Ding-Xuan Zhou
City University of Hong Kong


Cambridge, New York, Melbourne, Madrid, Cape Town, Singapore, São Paulo
Cambridge University Press
The Edinburgh Building, Cambridge CB2 8RU, UK
Published in the United States of America by Cambridge University Press, New York
www.cambridge.org
Information on this title: www.cambridge.org/9780521865593
Cambridge University Press 2007
This publication is in copyright. Subject to statutory exception and to the provision of relevant collective licensing
agreements, no reproduction of any part may take place without the written permission of Cambridge University Press.
First published in print format 2007
978-0-511-27407-7 eBook (EBL)
0-511-27407-6 eBook (EBL)
978-0-521-86559-3 hardback

0-521-86559-X hardback

Cambridge University Press has no responsibility for the persistence or accuracy of URLs for external or
third-party internet websites referred to in this publication, and does not guarantee that any content on
such websites is, or will remain, accurate or appropriate.



1 The framework of learning
1.1 Introduction
1.2 A formal setting
1.3 Hypothesis spaces and target functions
1.4 Sample, approximation, and generalization errors
1.5 The bias-variance problem
1.6 The remainder of this book
1.7 References and additional remarks
2 Basic hypothesis spaces
2.1 First examples of hypothesis space
2.2 Reminders I
2.3 Hypothesis spaces associated with Sobolev spaces
2.4 Reproducing Kernel Hilbert Spaces
2.5 Some Mercer kernels
2.6 Hypothesis spaces associated with an RKHS
2.7 Reminders II
2.8 On the computation of empirical target functions
2.9 References and additional remarks
3 Estimating the sample error
3.1 Exponential inequalities in probability
3.2 Uniform estimates on the defect
3.3 Estimating the sample error
3.4 Convex hypothesis spaces
3.5 References and additional remarks
4 Polynomial decay of the approximation error
4.1 Reminders III
4.2 Operators defined by a kernel
4.3 Mercer's theorem
4.4 RKHSs revisited
4.5 Characterizing the approximation error in RKHSs
4.6 An example
4.7 References and additional remarks
5 Estimating covering numbers
5.1 Reminders IV
5.2 Covering numbers for Sobolev smooth kernels
5.3 Covering numbers for analytic kernels
5.4 Lower bounds for covering numbers
5.5 On the smoothness of box spline kernels
5.6 References and additional remarks
6 Logarithmic decay of the approximation error
6.1 Polynomial decay of the approximation error for $C^\infty$ kernels
6.2 Measuring the regularity of the kernel
6.3 Estimating the approximation error in RKHSs
6.4 Proof of Theorem 6.1
6.5 References and additional remarks
7 On the bias-variance problem
7.1 A useful lemma
7.2 Proof of Theorem 7.1
7.3 A concrete example of bias-variance
7.4 References and additional remarks
8 Least squares regularization
8.1 Bounds for the regularized error
8.2 On the existence of target functions
8.3 A first estimate for the excess generalization error
8.4 Proof of Theorem 8.1
8.5 Reminders V
8.6 Compactness and regularization
8.7 References and additional remarks
9 Support vector machines for classification
9.1 Binary classifiers
9.2 Regularized classifiers
9.3 Optimal hyperplanes: the separable case
9.4 Support vector machines
9.5 Optimal hyperplanes: the nonseparable case
9.6 Error analysis for separable measures
9.7 Weakly separable measures
9.8 References and additional remarks
10 General regularized classifiers
10.1 Bounding the misclassification error in terms of the generalization error
10.2 Projection and error decomposition
10.3 Bounds for the regularized error $\mathcal{D}(\gamma, \phi)$ of $f_\gamma$
10.4 Bounds for the sample error term involving $f_\gamma$
10.5 Bounds for the sample error term involving $f_{\mathbf{z},\gamma}^{\phi}$
10.6 Stronger error bounds
10.7 Improving learning rates by imposing noise conditions
10.8 References and additional remarks


Foreword

This book by Felipe Cucker and Ding-Xuan Zhou provides solid mathematical foundations and
new insights into the subject called learning theory.
Some years ago, Felipe and I were trying to find something about brain science and
artificial intelligence starting from literature on neural nets. It was in this setting that we
encountered the beautiful ideas and fast algorithms of learning theory. Eventually we were
motivated to write on the mathematical foundations of this new area of science.
I have found this arena, with its new challenges and growing number of applications, to be
exciting. For example, the unification of dynamical systems and learning theory is a major
problem. Another problem is to develop a comparative study of the useful algorithms currently
available and to give unity to these algorithms. How can one talk about the best algorithm
or find the most appropriate algorithm for a particular task when there are so many desirable
features, with their associated trade-offs? How can one see the working of aspects of the
human brain and machine vision in the same framework?
I know both authors well. I visited Felipe in Barcelona more than 13 years ago for several
months, and when I took a position in Hong Kong in 1995, I asked him to join me. There
Lenore Blum, Mike Shub, Felipe, and I finished a book on real computation and complexity. I
returned to the USA in 2001, but Felipe continues his job at the City University of Hong Kong.
Despite the distance we have continued to write papers together. I came to know Ding-Xuan
as a colleague in the math department at City University. We have written a number of papers
together on various aspects of learning theory. It gives me great pleasure to continue to work
with both mathematicians. I am proud of our joint accomplishments.
I leave to the authors the task of describing the contents of their book. I will give some
personal perspective on and motivation for what they are doing.
Computational science demands an understanding of fast, robust algorithms. The same applies to modern theories of
artificial and human intelligence. Part of this understanding is a complexity-theoretic analysis.
Here I am not speaking of a literal count of arithmetic operations (although that is a byproduct), but rather of the question: What sample size yields a given accuracy? Better yet,
describe the error of a computed hypothesis as a function of the number of examples, the
desired confidence, the complexity of the task to be learned, and variants of the algorithm. If
the answer is given in terms of a mathematical theorem, the practitioner may not find the
result useful. On the other hand, it is important for workers in the field or leaders in
laboratories to have some background in theory, just as economists depend on knowledge of
economic equilibrium theory. Most important, however, is the role of mathematical
foundations and analysis of algorithms as a precursor to research into new algorithms, and
into old algorithms in new and different settings.
I have great confidence that many learning-theory scientists will profit from this book.
Moreover, scientists with some mathematical background will find in this account a fine
introduction to the subject of learning theory.


Stephen Smale
Chicago

Preface

Broadly speaking, the goal of (mainstream) learning theory is to approximate a function (or
some function features) from data samples, perhaps perturbed by noise. To attain this goal,
learning theory draws on a variety of diverse subjects. It relies on statistics, whose purpose is
precisely to infer information from random samples. It also relies on approximation theory,
since our estimate of the function must belong to a prespecified class, and therefore the
ability of this class to approximate the function accurately is of the essence. And algorithmic
considerations are critical because our estimate of the function is the outcome of algorithmic
procedures, and the efficiency of these procedures is crucial in practice. Ideas from all these
areas have blended together to form a subject whose many successful applications have
triggered its rapid growth during the past two decades.
This book aims to give a general overview of the theoretical foundations of learning
theory. It is not the first to do so. Yet we wish to emphasize a viewpoint that has drawn little
attention in other expositions, namely, that of approximation theory. This emphasis fulfills two
purposes. First, we believe it provides a balanced view of the subject. Second, we expect to
attract mathematicians working on related fields who find the problems raised in learning
theory close to their interests.
While writing this book, we faced a dilemma common to the writing of any book in
mathematics: to strike a balance between clarity and conciseness. In particular, we faced the
problem of finding a suitable degree of self-containment for a book relying on a variety of
subjects. Our solution to this problem consists of a number of sections, all called Reminders,
where several basic notions and results are briefly reviewed using a unified notation.
We are indebted to several friends and colleagues who have helped us in many
ways. Steve Smale deserves a special mention. We first became interested in
learning theory as a result of his interest in the subject, and much of the material
in this book comes from or evolved from joint papers we wrote with him. Qiang Wu,
Yiming Ying, Fangyan Lu, Hongwei Sun, Di-Rong Chen, Song Li, Luoqing Li,
Bingzheng Li, Lizhong Peng, and Tiangang Lei regularly attended our weekly
seminars on learning theory at City University of Hong Kong, where we exposed
early drafts of the contents of this book. They, and José Luis Balcázar, read
preliminary versions and were very generous in their feedback. We are indebted
also to David Tranah and the staff of Cambridge University Press for their patience
and willingness to help. We have also been supported by the University Grants
Council of Hong Kong through the grants CityU 1087/02P, 103303, and


The framework of learning


We begin by describing some cases of learning, simplified to the extreme, to convey an
intuition of what learning is.
Case 1.1 Among the most used instances of learning (although not necessarily with this
name) is linear regression. This amounts to finding a straight line that best approximates a
functional relationship presumed to be implicit in a set of data points in $\mathbb{R}^2$,
$\{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}$ (Figure 1.1). The yardstick used to measure how good an approximation a
given line $Y = aX + b$ is, is called least squares. The best line is the one that minimizes
\[ Q(a, b) = \sum_{i=1}^{m} (y_i - a x_i - b)^2. \]

Figure 1.1
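As an illustration of Case 1.1, the following minimal sketch computes the minimizer of $Q(a, b)$ from the standard closed-form expressions for simple linear regression. The function names `fit_line` and `Q` and the four data points are hypothetical choices made for this example only; they do not come from the text.

```python
def fit_line(points):
    """Return (a, b) minimizing Q(a, b) = sum_i (y_i - a*x_i - b)^2."""
    m = len(points)
    mean_x = sum(x for x, _ in points) / m
    mean_y = sum(y for _, y in points) / m
    # Centered sums of squares and cross products.
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    if sxx == 0:
        raise ValueError("all x_i coincide; the minimizer is not unique")
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def Q(a, b, points):
    """Least squares objective of Case 1.1."""
    return sum((y - a * x - b) ** 2 for x, y in points)


# Hypothetical data lying close to the line Y = 2X + 1.
data = [(0.0, 1.1), (1.0, 2.9), (2.0, 5.2), (3.0, 6.8)]
a, b = fit_line(data)
print(a, b, Q(a, b, data))
```

Perturbing either coefficient away from the returned pair can only increase `Q`, which is a quick sanity check on any implementation of this kind.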

Case 1.2 Case 1.1 readily extends to a classical situation in science, namely, that of learning a
physical law by curve fitting to data. Assume that the law at hand, an unknown function
$f : \mathbb{R} \to \mathbb{R}$, has a specific form and that the space of all functions with this form can be
parameterized by $N$ real numbers. For instance, if $f$ is assumed to be a polynomial of degree $d$,
then $N = d + 1$ and the parameters are the unknown coefficients $w_0, \ldots, w_d$ of $f$. In this case,
finding the best fit by the least squares method estimates the unknown $f$ from a set of
pairs $\{(x_1, y_1), \ldots, (x_m, y_m)\}$. If the measurements generating this set were exact, then $y_i$ would
be equal to $f(x_i)$. However, in general one expects the values $y_i$ to be affected by noise. That
is, $y_i = f(x_i) + \varepsilon$, where $\varepsilon$ is a random variable (which may depend on $x_i$) with mean zero. One
then computes the vector of coefficients $w$ such that the value
\[ \sum_{i=1}^{m} (f_w(x_i) - y_i)^2, \quad \text{with } f_w(x) = \sum_{j=0}^{d} w_j x^j, \]
is minimized, where, typically, $m > N$. In general, the minimum value above is not 0. To solve
this minimization problem, one uses the least squares technique, a method going back to
Gauss and Legendre that is computationally efficient and relies on numerical linear algebra.
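The polynomial fit just described can be sketched as follows. This is only one possible route, assumed here for brevity: it forms the normal equations $(V^T V) w = V^T y$ for the design matrix $V_{ij} = x_i^j$ and solves them by Gaussian elimination. The names `solve_linear_system` and `fit_polynomial` are hypothetical, and numerical libraries usually prefer orthogonal factorizations for better conditioning.

```python
def solve_linear_system(A, rhs):
    """Solve A w = rhs by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [rhs[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("singular normal equations")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution on the upper triangular system.
    w = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (M[i][n] - tail) / M[i][i]
    return w


def fit_polynomial(xs, ys, d):
    """Coefficients w_0, ..., w_d minimizing sum_i (f_w(x_i) - y_i)^2."""
    V = [[x ** j for j in range(d + 1)] for x in xs]  # V[i][j] = x_i^j
    VtV = [[sum(row[j] * row[k] for row in V) for k in range(d + 1)]
           for j in range(d + 1)]
    Vty = [sum(row[j] * y for row, y in zip(V, ys)) for j in range(d + 1)]
    return solve_linear_system(VtV, Vty)


# Hypothetical noisy measurements of f(x) = 1 - 2x + 3x^2, with m = 6 > N = 3.
xs_demo = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys_demo = [17.1, 5.9, 1.05, 2.1, 8.9, 22.0]
print(fit_polynomial(xs_demo, ys_demo, 2))
```

With exact (noise-free) values the routine recovers the coefficients up to rounding; with noisy values the minimum of the sum of squares is positive, as the text notes.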
Since the values $y_i$ are affected by noise, one might take as starting point, instead of the
unknown $f$, a family of probability measures $\varepsilon_x$ on $\mathbb{R}$ varying with $x \in \mathbb{R}$. The only requirement
on these measures is that for all $x \in \mathbb{R}$, the mean of $\varepsilon_x$ is $f(x)$. Then $y_i$ is randomly drawn
from $\varepsilon_{x_i}$. In some contexts the $x_i$, rather than being chosen, are also generated by a
probability measure $\rho_X$ on $\mathbb{R}$. Thus, the starting point could even be a single measure $\rho$ on
$\mathbb{R} \times \mathbb{R}$, capturing both the measure $\rho_X$ and the measures $\varepsilon_x$ for $x \in \mathbb{R}$, from which the pairs
$(x_i, y_i)$ are randomly drawn.
A more general form of the functions in our approximating class could be given by
\[ f_w(x) = \sum_{i=1}^{N} w_i \phi_i(x), \]
where the $\phi_i$ are the elements of a basis of a specific function space, not necessarily of
polynomials.
Case 1.3 The training of neural networks is an extension of Case 1.2. Roughly speaking, a
neural network is a directed graph containing some input nodes, some output nodes, and
some intermediate nodes where certain functions are computed. If $X$ denotes the input space
(whose elements are fed to the input nodes) and $Y$ the output space (of possible elements
returned by the output nodes), a neural network computes a function from $X$ to $Y$. The
literature on neural networks shows a variety of choices for $X$ and $Y$, which can be continuous
or discrete, as well as for the functions computed at the intermediate nodes. A common
feature of all neural nets, though, is the dependence of these functions on a set of
parameters, usually called weights, $w = \{w_j\}_{j \in J}$. This set determines the function $f_w : X \to Y$
computed by the network.
Neural networks are trained to learn functions. As in Case 1.2, there is a target function
$f : X \to Y$, and the network is given a set of randomly chosen pairs $(x_1, y_1), \ldots, (x_m, y_m)$ in $X \times Y$.
Then, training algorithms select a set of weights $w$ attempting to minimize some distance
from $f_w$ to the target function $f : X \to Y$.
Case 1.4 A standard example of pattern recognition involves handwritten characters. Consider
the problem of classifying handwritten letters of the English alphabet. Here, elements in our
space $X$ could be matrices with entries in the interval $[0,1]$, each entry representing a pixel in
a certain gray scale of a digitized photograph of the handwritten letter or some features
extracted from the letter. We may take $Y$ to be
\[ Y = \Big\{ y \in \mathbb{R}^{26} \;\Big|\; y = \sum_{i=1}^{26} \lambda_i e_i \text{ such that } \sum_{i=1}^{26} \lambda_i = 1 \Big\}. \]
Here $e_i$ is the $i$th coordinate vector in $\mathbb{R}^{26}$, each coordinate corresponding to a letter. If $\Delta \subset Y$ is
the set of points $y$ as above such that $0 \le \lambda_i \le 1$, for $i = 1, \ldots, 26$, one can interpret a point in
$\Delta$ as a probability measure on the set $\{A, B, C, \ldots, X, Y, Z\}$. The problem is to learn the ideal
function $f : X \to Y$ that associates, to a given handwritten letter $x$, a linear combination of the
$e_i$ with coefficients $\{\mathrm{Prob}\{x = A\}, \mathrm{Prob}\{x = B\}, \ldots, \mathrm{Prob}\{x = Z\}\}$. Unambiguous letters are
mapped into a coordinate vector, and in the (pure) classification problem $f$ takes values on
these $e_i$. Learning $f$ means finding a sufficiently good approximation of $f$ within a given
prescribed class.
The approximation of $f$ is constructed from a set of samples of handwritten letters, each of
them with a label in $Y$. The set $\{(x_1, y_1), \ldots, (x_m, y_m)\}$ of these $m$ samples is randomly drawn
from $X \times Y$ according to a measure $\rho$ on $X \times Y$. This measure satisfies $\rho(X \times \Delta) = 1$. In
addition, in practice, it is concentrated around the set of pairs $(x, y)$ with $y = e_i$ for some
$1 \le i \le 26$. That is, the occurring elements $x \in X$ are handwritten letters and not, say, a digitized
image of the Mona Lisa. The function $f$ to be learned is the regression function $f_\rho$ of $\rho$.
That is, $f_\rho(x)$ is the average of the $y$ values of $\{x\} \times Y$ (we are more precise about $\rho$ and the
regression function in Section 1.2).
Case 1.5 A standard approach for approximating characteristic (or indicator) functions of sets
is known as PAC learning (from probably approximately correct). Let $T$ (the target
concept) be a subset of $\mathbb{R}^n$ and $\rho_X$ be a probability measure on $\mathbb{R}^n$ that we assume is not
known in advance. Intuitively, a set $S \subset \mathbb{R}^n$ approximates $T$ when the symmetric difference
$S \Delta T = (S \setminus T) \cup (T \setminus S)$ is small, that is, has a small measure. Note that if $f_S$ and $f_T$ denote the
characteristic functions of $S$ and $T$, respectively, this measure, called the error of $S$, is
$\int_{\mathbb{R}^n} |f_S - f_T| \, d\rho_X$. Note that since the functions take values in $\{0, 1\}$ only, this integral coincides
with $\int_{\mathbb{R}^n} (f_S - f_T)^2 \, d\rho_X$.
Let $\mathcal{C}$ be a class of subsets of $\mathbb{R}^n$ and assume that $T \in \mathcal{C}$. One strategy for constructing an
approximation of $T$ in $\mathcal{C}$ is the following. First, draw points $x_1, \ldots, x_m \in \mathbb{R}^n$ according to $\rho_X$ and
label each of them with 1 or 0 according to whether they belong to $T$. Second, compute any
function $f_S : \mathbb{R}^n \to \{0, 1\}$, $f_S \in \mathcal{C}$, that coincides with this labeling over $\{x_1, \ldots, x_m\}$. Such a
function will provide a good approximation $S$ of $T$ (small error with respect to $\rho_X$) as long as
$m$ is large enough and $\mathcal{C}$ is not too wild. Thus the measure $\rho_X$ is used in both capacities,
governing the sample drawing and measuring the error set $S \Delta T$.
A major goal in PAC learning is to estimate how large $m$ needs to be to obtain an
$\varepsilon$-approximation of $T$ with probability at least $1 - \delta$ as a function of $\varepsilon$ and $\delta$.
The situation described above is noise free since each randomly drawn point $x_i \in \mathbb{R}^n$ is
correctly labeled. Extensions of PAC learning allowing for labeling mistakes with small
probability exist.
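The two-step strategy of Case 1.5 can be sketched on a toy instance. Everything specific here is an assumption made for the example: $n = 1$, the class $\mathcal{C}$ consists of the intervals $[0, s]$, the target is $T = [0, 0.3]$, and $\rho_X$ is the uniform measure on $[0, 1]$, so that the error of $S = [0, s]$ is simply $|t - s|$. The function names are hypothetical.

```python
import random


def draw_labeled_sample(m, t, rng):
    """Draw x_1..x_m uniformly on [0, 1]; label each by membership in T = [0, t]."""
    xs = [rng.random() for _ in range(m)]
    return [(x, 1 if x <= t else 0) for x in xs]


def consistent_threshold(sample):
    """Return s such that S = [0, s] agrees with every label in the sample."""
    positives = [x for x, label in sample if label == 1]
    return max(positives) if positives else 0.0


def error_of(s, t):
    """Uniform measure of the symmetric difference of [0, s] and [0, t]."""
    return abs(t - s)


rng = random.Random(0)  # fixed seed so that runs are repeatable
t = 0.3
for m in (10, 100, 1000):
    sample = draw_labeled_sample(m, t, rng)
    s = consistent_threshold(sample)
    print(m, error_of(s, t))
```

The learner here picks one particular consistent hypothesis (the smallest interval containing all positive examples); any other hypothesis in the class that agrees with the labels would also fit the strategy described above.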
Case 1.6 (Monte Carlo integration) An early instance of randomization in algorithmics
appeared in numerical integration. Let $f : [0,1]^n \to \mathbb{R}$. One way of approximating the
integral $\int_{x \in [0,1]^n} f(x) \, dx$ consists of randomly drawing points $x_1, \ldots, x_m \in [0,1]^n$ and computing
\[ I_m(f) = \frac{1}{m} \sum_{i=1}^{m} f(x_i). \]
Under mild conditions on the regularity of $f$, $I_m(f) \to \int f$ with probability 1; that is, for all $\varepsilon > 0$,
\[ \lim_{m \to \infty} \mathop{\mathrm{Prob}}_{x_1, \ldots, x_m} \Big\{ \Big| I_m(f) - \int_{x \in [0,1]^n} f(x) \, dx \Big| > \varepsilon \Big\} = 0. \]
Again we find the theme of learning an object (here a single real number, although defined in
a nontrivial way through $f$) from a sample. In this case the measure governing the sample is
known (the measure in $[0,1]^n$ inherited from the standard Lebesgue measure on $\mathbb{R}^n$), but the
same idea can be used for an unknown measure. If $\rho_X$ is a probability measure on $X \subset \mathbb{R}^n$, a
domain or manifold, $I_m(f)$ will approximate $\int_{x \in X} f(x) \, d\rho_X$ for large $m$ with high probability as
long as the points $x_1, \ldots, x_m$ are drawn from $X$ according to the measure $\rho_X$. Note that no noise
is involved here. An extension of this idea to include noise is, however, possible.
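A minimal sketch of the estimator $I_m(f)$ of Case 1.6 follows. The integrand $f(x) = x_1^2 + x_2^2$ on $[0,1]^2$, whose exact integral is $2/3$, and the name `monte_carlo_integral` are hypothetical choices for illustration.

```python
import random


def monte_carlo_integral(f, n, m, rng):
    """Estimate the integral of f over [0, 1]^n by I_m(f) = (1/m) * sum_i f(x_i)."""
    total = 0.0
    for _ in range(m):
        x = [rng.random() for _ in range(n)]  # one point drawn uniformly in [0, 1]^n
        total += f(x)
    return total / m


def f(x):
    # Hypothetical integrand; its integral over [0, 1]^2 equals 2/3.
    return x[0] ** 2 + x[1] ** 2


rng = random.Random(0)  # fixed seed so that runs are repeatable
estimate = monte_carlo_integral(f, n=2, m=100_000, rng=rng)
print(estimate, abs(estimate - 2 / 3))
```

The deviation from $2/3$ shrinks roughly like $1/\sqrt{m}$, which is the usual behavior of such averages when the integrand has finite variance.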
A common characteristic of Cases 1.2-1.5 is the existence of both an unknown function
$f : X \to Y$ and a probability measure allowing one to randomly draw points in $X \times Y$. That
measure can be on $X$ (Case 1.5), on $Y$ varying with $x \in X$ (Cases 1.2 and 1.3), or on the
product $X \times Y$ (Case 1.4). The only requirement it satisfies is that, if for $x \in X$ a point $y \in Y$
can be randomly drawn, then the expected value of $y$ is $f(x)$. That is, the noise is centered at
zero. Case 1.6 does not follow this pattern. However, we have included it since it is a well-known
algorithm and shares the flavor of learning an unknown object from random data.
The development in this book, for reasons of unity and generality, is based on a single
measure on $X \times Y$. However, one should keep in mind the distinction between inputs $x \in X$
and outputs $y \in Y$.

A formal setting


Since we want to study learning from random sampling, the primary object in our
development is a probability measure $\rho$ governing the sampling that is not known in advance.
Let $X$ be a compact metric space (e.g., a domain or a manifold in Euclidean space) and
$Y = \mathbb{R}^k$. For convenience we will take $k = 1$ for the time being. Let $\rho$ be a Borel probability
measure on $Z = X \times Y$ whose regularity properties will be assumed as required. In the
following we try to utilize concepts formed naturally and solely from $X$, $Y$, and $\rho$.
Throughout this book, if $\xi$ is a random variable (i.e., a real-valued function on a probability
space $Z$), we will use $E(\xi)$ to denote the expected value (or average, or mean) of $\xi$ and $\sigma^2(\xi)$
to denote its variance. Thus
\[ E(\xi) = \int_Z \xi(z) \, d\rho \quad \text{and} \quad \sigma^2(\xi) = E\big((\xi - E(\xi))^2\big) = E(\xi^2) - (E(\xi))^2. \]
A central concept in the next few chapters is the generalization error (or least squares
error or, if there is no risk of ambiguity, simply error) of $f$, for $f : X \to Y$, defined by
\[ \mathcal{E}(f) = \mathcal{E}_\rho(f) = \int_Z (f(x) - y)^2 \, d\rho. \]
For each input $x \in X$ and output $y \in Y$, $(f(x) - y)^2$ is the error incurred through the use of $f$ as a
model for the process producing $y$ from $x$. This is a local error. By integrating over $X \times Y$ (w.r.t.
$\rho$, of course) we average out this local error over all pairs $(x, y)$. Hence the word error for
$\mathcal{E}(f)$.
The problem posed is: What is the $f$ that minimizes the error $\mathcal{E}(f)$? To answer this
question we note that the error $\mathcal{E}(f)$ naturally decomposes as a sum. For every $x \in X$, let
$\rho(y \mid x)$ be the conditional (w.r.t. $x$) probability measure on $Y$. Let also $\rho_X$ be the marginal
probability measure of $\rho$ on $X$, that is, the measure on $X$ defined by $\rho_X(S) = \rho(\pi^{-1}(S))$,
where $\pi : X \times Y \to X$ is the projection. For every integrable function $\varphi : X \times Y \to \mathbb{R}$ a version of
Fubini's theorem relates $\rho$, $\rho(y \mid x)$, and $\rho_X$ as follows:
\[ \int_{X \times Y} \varphi(x, y) \, d\rho = \int_X \Big( \int_Y \varphi(x, y) \, d\rho(y \mid x) \Big) \, d\rho_X. \]
This breaking of $\rho$ into the measures $\rho(y \mid x)$ and $\rho_X$ corresponds to looking at $Z$ as a product
of an input domain $X$ and an output set $Y$. In what follows, unless otherwise specified,
integrals are to be understood as being over $\rho$, $\rho(y \mid x)$, or $\rho_X$.
Define $f_\rho : X \to Y$ by
\[ f_\rho(x) = \int_Y y \, d\rho(y \mid x). \]
The function $f_\rho$ is called the regression function of $\rho$. For each $x \in X$, $f_\rho(x)$ is the average of
the $y$ coordinate of $\{x\} \times Y$ (in topological terms, the average of $y$ on the fiber of $x$).
Regularity hypotheses on $\rho$ will induce regularity properties on $f_\rho$.
We will assume throughout this book that $f_\rho$ is bounded.
Fix $x \in X$ and consider the function from $Y$ to $\mathbb{R}$ mapping $y$ into $(y - f_\rho(x))$. Since the
expected value of this function is 0, its variance is
\[ \sigma^2(x) = \int_Y (y - f_\rho(x))^2 \, d\rho(y \mid x). \]
Now average over $X$, to obtain
\[ \sigma_\rho^2 = \int_X \sigma^2(x) \, d\rho_X = \mathcal{E}(f_\rho). \]
The number $\sigma_\rho^2$ is a measure of how well conditioned $\rho$ is, analogous to the
notion of condition number in numerical linear algebra.
Remark 1.7
(i) It is important to note that whereas $\rho$ and $f_\rho$ are generally unknown, $\rho_X$ is known in
some situations and can even be the Lebesgue measure on $X$ inherited from Euclidean
space (as in Cases 1.2 and 1.6).
(ii) In the remainder of this book, if formulas do not make sense or $\infty$ appears, then the
assertions where these formulas occur should be considered vacuous.
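Proposition 1.8 For every $f : X \to Y$,
\[ \mathcal{E}(f) = \int_X (f(x) - f_\rho(x))^2 \, d\rho_X + \sigma_\rho^2. \]
Proof From the definition of $f_\rho(x)$, for each $x \in X$, $\int_Y (f_\rho(x) - y) \, d\rho(y \mid x) = 0$. Therefore,
\[ \begin{aligned}
\mathcal{E}(f) &= \int_Z (f(x) - f_\rho(x) + f_\rho(x) - y)^2 \, d\rho \\
&= \int_X (f(x) - f_\rho(x))^2 \, d\rho_X + \int_X \int_Y (f_\rho(x) - y)^2 \, d\rho(y \mid x) \, d\rho_X \\
&\quad + 2 \int_X \int_Y (f(x) - f_\rho(x)) (f_\rho(x) - y) \, d\rho(y \mid x) \, d\rho_X \\
&= \int_X (f(x) - f_\rho(x))^2 \, d\rho_X + \sigma_\rho^2 + 2 \int_X (f(x) - f_\rho(x)) \Big( \int_Y (f_\rho(x) - y) \, d\rho(y \mid x) \Big) \, d\rho_X \\
&= \int_X (f(x) - f_\rho(x))^2 \, d\rho_X + \sigma_\rho^2. \qquad \square
\end{aligned} \]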
The first term on the right-hand side of Proposition 1.8 provides an average (over $X$) of the
error suffered from the use of $f$ as a model for $f_\rho$. In addition, since $\sigma_\rho^2$ is independent of $f$,
Proposition 1.8 implies that $f_\rho$ has the smallest possible error among all functions $f : X \to Y$.
Thus $\sigma_\rho^2$ represents a lower bound on the error $\mathcal{E}$ and it is due solely to our primary object, the
measure $\rho$. Thus, Proposition 1.8 supports the following statement:
The goal is to learn (i.e., to find a good approximation of) $f_\rho$ from random
samples on $Z$.
Throughout this book, the square $\square$ denotes the end of a proof or the fact that no proof is given.
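Proposition 1.8 can be checked exactly on a small example. The sketch below assumes a hypothetical measure $\rho$ supported on five points of $X \times Y$ (with $X = \{0, 1, 2\}$), so that all integrals reduce to finite sums, and uses exact rational arithmetic. The names `marginal`, `regression_function`, and `error` are illustrative only.

```python
from fractions import Fraction as F

# Hypothetical finite measure rho on Z = X x Y: rho[(x, y)] is the probability of (x, y).
rho = {
    (0, 0): F(1, 8), (0, 2): F(1, 8),
    (1, 1): F(1, 4), (1, 3): F(1, 4),
    (2, 5): F(1, 4),
}


def marginal(rho):
    """Marginal measure rho_X on X."""
    rho_X = {}
    for (x, _), p in rho.items():
        rho_X[x] = rho_X.get(x, 0) + p
    return rho_X


def regression_function(rho):
    """f_rho(x): average of y with respect to the conditional measure rho(y|x)."""
    rho_X = marginal(rho)
    sums = {}
    for (x, y), p in rho.items():
        sums[x] = sums.get(x, 0) + p * y
    return {x: sums[x] / rho_X[x] for x in rho_X}


def error(f, rho):
    """Generalization error: integral of (f(x) - y)^2 with respect to rho."""
    return sum(p * (f[x] - y) ** 2 for (x, y), p in rho.items())


rho_X = marginal(rho)
f_rho = regression_function(rho)
sigma2_rho = error(f_rho, rho)      # sigma_rho^2 = E(f_rho)

f = {0: F(0), 1: F(3), 2: F(4)}     # an arbitrary candidate function on X
lhs = error(f, rho)
rhs = sum(rho_X[x] * (f[x] - f_rho[x]) ** 2 for x in rho_X) + sigma2_rho
print(lhs, rhs)                     # both sides of Proposition 1.8
```

For this measure the regression function takes the values 1, 2, 5 and $\sigma_\rho^2 = 3/4$; both sides of the identity equal $7/4$ for the candidate `f` above.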
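We now consider sampling. Let
\[ \mathbf{z} \in Z^m, \quad \mathbf{z} = ((x_1, y_1), \ldots, (x_m, y_m)) \]
be a sample in $Z^m$, that is, $m$ examples independently drawn according to $\rho$. Here $Z^m$
denotes the $m$-fold Cartesian product of $Z$. We define the empirical error of $f$ (w.r.t. $\mathbf{z}$) to be
\[ \mathcal{E}_{\mathbf{z}}(f) = \frac{1}{m} \sum_{i=1}^{m} (f(x_i) - y_i)^2. \]
If $\xi$ is a random variable on $Z$, we denote the empirical mean of $\xi$ (w.r.t. $\mathbf{z}$) by $E_{\mathbf{z}}(\xi)$. Thus,
with $z_i = (x_i, y_i)$,
\[ E_{\mathbf{z}}(\xi) = \frac{1}{m} \sum_{i=1}^{m} \xi(z_i). \]
For any function $f : X \to Y$ we denote by $f_Y$ the function
\[ f_Y : X \times Y \to Y, \qquad (x, y) \mapsto f(x) - y. \]
With these notations we may write $\mathcal{E}(f) = E(f_Y^2)$ and $\mathcal{E}_{\mathbf{z}}(f) = E_{\mathbf{z}}(f_Y^2)$. We have
already remarked that the expected value of $(f_\rho)_Y$ is 0; we now remark that its
variance is $\sigma_\rho^2$.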
Remark 1.9 Consider the PAC learning setting discussed in Case 1.5 where $X = \mathbb{R}^n$ and $T$ is a
subset of $\mathbb{R}^n$. (Note, in this case, that $X$ is not compact. In fact, most of the results in this book do not
require compactness of $X$ but only completeness and separability.) The measure $\rho_X$ described there can be extended to a measure $\rho$ on $Z$ by
defining, for $A \subset Z$,
\[ \rho(A) = \rho_X(\{x \in X \mid (x, f_T(x)) \in A\}), \]
where, we recall, $f_T$ is the characteristic function of the set $T$. The
marginal measure of $\rho$ on $X$ is our original $\rho_X$. In addition, $\sigma_\rho^2 = 0$, the error $\mathcal{E}$
specializes to the error mentioned in Case 1.5, and the regression function $f_\rho$
of $\rho$ coincides with $f_T$ except for a set of measure zero in $X$.

Hypothesis spaces and target functions


Learning processes do not take place in a vacuum. Some structure needs to be present at the
beginning of the process. In our formal development, we assume that this structure takes the
form of a class of functions (e.g., a space of polynomials, of splines, etc.). The goal of the
learning process is thus to find the best approximation of $f_\rho$ within this class.
Let $C(X)$ be the Banach space of continuous functions on $X$ with the norm
\[ \|f\|_\infty = \sup_{x \in X} |f(x)|. \]
We consider a subset $\mathcal{H}$ of $C(X)$, in what follows called hypothesis space, where algorithms
will work to find, as well as is possible, the best approximation for $f_\rho$. A main choice in this
book is a compact, infinite-dimensional subset of $C(X)$, but we will also consider closed balls
in finite-dimensional subspaces of $C(X)$ and whole linear spaces.
If $f_\rho \in \mathcal{H}$, simplifications will occur, but in general we will not even assume that $f_\rho \in C(X)$
and we will have to consider a target function $f_{\mathcal{H}}$ in $\mathcal{H}$. Define $f_{\mathcal{H}}$ to be any function
minimizing the error $\mathcal{E}(f)$ over $f \in \mathcal{H}$, namely, any optimizer of
\[ \min_{f \in \mathcal{H}} \int_Z (f(x) - y)^2 \, d\rho. \]
Notice that since $\mathcal{E}(f) = \int_X (f - f_\rho)^2 \, d\rho_X + \sigma_\rho^2$, $f_{\mathcal{H}}$ is also an optimizer of
\[ \min_{f \in \mathcal{H}} \int_X (f - f_\rho)^2 \, d\rho_X. \]
Let $\mathbf{z} \in Z^m$ be a sample. We define the empirical target function $f_{\mathcal{H},\mathbf{z}} = f_{\mathbf{z}}$ to be a
function minimizing the empirical error $\mathcal{E}_{\mathbf{z}}(f)$ over $f \in \mathcal{H}$, that is, an optimizer of
\[ \min_{f \in \mathcal{H}} \frac{1}{m} \sum_{i=1}^{m} (f(x_i) - y_i)^2. \tag{1.1} \]
Note that although $f_{\mathbf{z}}$ is not produced by an algorithm, it is close to algorithmic. The statement of
the minimization problem (1.1) depends on $\rho$ only through its dependence on $\mathbf{z}$, but once $\mathbf{z}$ is
given, so is (1.1), and its solution $f_{\mathbf{z}}$ can be looked for without further involvement of $\rho$. In
contrast to $f_{\mathcal{H}}$, $f_{\mathbf{z}}$ is empirical from its dependence on the sample $\mathbf{z}$. Note finally that $\mathcal{E}(f_{\mathbf{z}})$
and $\mathcal{E}_{\mathbf{z}}(f)$ are different objects.
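To make the empirical target function concrete, here is a minimal sketch in which the hypothesis space is finite, so that the minimization over the hypothesis space is a plain search. The hypothesis space (lines through the origin with slopes $0, 0.5, \ldots, 3$), the sample, and the names `empirical_error`, `empirical_target`, and `make_line` are all hypothetical.

```python
def empirical_error(f, z):
    """Empirical error (1/m) * sum_i (f(x_i) - y_i)^2 for a sample z = [(x_1, y_1), ...]."""
    return sum((f(x) - y) ** 2 for x, y in z) / len(z)


def empirical_target(hypotheses, z):
    """Return a minimizer of the empirical error over a finite hypothesis space."""
    return min(hypotheses, key=lambda f: empirical_error(f, z))


def make_line(a):
    """Hypothesis x -> a*x; the slope is stored only so the result can be inspected."""
    def f(x):
        return a * x
    f.slope = a
    return f


# Hypothetical finite hypothesis space: slopes 0, 0.5, ..., 3.
H = [make_line(0.5 * k) for k in range(7)]

# Hypothetical sample, roughly following y = 2x.
z = [(0.0, 0.1), (1.0, 1.8), (2.0, 4.3), (3.0, 5.9)]

f_z = empirical_target(H, z)
print(f_z.slope, empirical_error(f_z, z))
```

Only the sample enters the computation; the measure that generated it plays no further role, which is the point made in the paragraph above.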
We next prove that $f_{\mathcal{H}}$ and $f_{\mathbf{z}}$ exist under a mild condition on $\mathcal{H}$.
Definition 1.10 Let $f : X \to Y$ and $\mathbf{z} \in Z^m$. The defect of $f$ (w.r.t. $\mathbf{z}$) is
\[ L_{\mathbf{z}}(f) = L_{\rho,\mathbf{z}}(f) = \mathcal{E}(f) - \mathcal{E}_{\mathbf{z}}(f). \]
Notice that the theoretical error $\mathcal{E}(f)$ cannot be measured directly, whereas $\mathcal{E}_{\mathbf{z}}(f)$ can. A
bound on $L_{\mathbf{z}}(f)$ becomes useful since it allows one to bound the actual error from an observed
quantity. Such bounds are the object of Theorems 3.8 and 3.10.
Let $f_1, f_2 \in C(X)$. Toward the proof of the existence of $f_{\mathcal{H}}$ and $f_{\mathbf{z}}$, we first estimate the
quantity $|L_{\mathbf{z}}(f_1) - L_{\mathbf{z}}(f_2)|$
linearly by $\|f_1 - f_2\|_\infty$ for almost all $\mathbf{z} \in Z^m$ (a Lipschitz estimate). We recall that a set $U \subseteq Z$ is
said to be full measure when $Z \setminus U$ has measure zero.
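Proposition 1.11 If, for $j = 1, 2$, $|f_j(x) - y| \le M$ on a full measure set $U \subseteq Z$ then, for all $\mathbf{z} \in U^m$,
\[ |L_{\mathbf{z}}(f_1) - L_{\mathbf{z}}(f_2)| \le 4M \|f_1 - f_2\|_\infty. \]
Proof. First note that since
\[ (f_1(x) - y)^2 - (f_2(x) - y)^2 = (f_1(x) + f_2(x) - 2y)(f_1(x) - f_2(x)), \]
we have
\[ \begin{aligned}
|\mathcal{E}(f_1) - \mathcal{E}(f_2)| &= \Big| \int_Z (f_1(x) + f_2(x) - 2y)(f_1(x) - f_2(x)) \, d\rho \Big| \\
&\le \int_Z |(f_1(x) - y) + (f_2(x) - y)| \, \|f_1 - f_2\|_\infty \, d\rho \\
&\le 2M \|f_1 - f_2\|_\infty.
\end{aligned} \]
Also, for all $\mathbf{z} \in U^m$, we have
\[ \begin{aligned}
|\mathcal{E}_{\mathbf{z}}(f_1) - \mathcal{E}_{\mathbf{z}}(f_2)| &= \frac{1}{m} \Big| \sum_{i=1}^{m} (f_1(x_i) + f_2(x_i) - 2y_i)(f_1(x_i) - f_2(x_i)) \Big| \\
&\le \frac{1}{m} \sum_{i=1}^{m} |(f_1(x_i) - y_i) + (f_2(x_i) - y_i)| \, \|f_1 - f_2\|_\infty \\
&\le 2M \|f_1 - f_2\|_\infty.
\end{aligned} \]
Thus
\[ |L_{\mathbf{z}}(f_1) - L_{\mathbf{z}}(f_2)| = |\mathcal{E}(f_1) - \mathcal{E}_{\mathbf{z}}(f_1) - \mathcal{E}(f_2) + \mathcal{E}_{\mathbf{z}}(f_2)| \le 4M \|f_1 - f_2\|_\infty. \qquad \square \]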
Remark 1.12 Notice that for bounding $|\mathcal{E}_{\mathbf{z}}(f_1) - \mathcal{E}_{\mathbf{z}}(f_2)|$ in this proof, in contrast to the bound for
$|\mathcal{E}(f_1) - \mathcal{E}(f_2)|$, the use of the $\|\cdot\|_\infty$ norm is crucial. Nothing less will do.
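Corollary 1.13 Let $\mathcal{H} \subseteq C(X)$ and $\rho$ be such that, for all $f \in \mathcal{H}$, $|f(x) - y| \le M$ almost
everywhere. Then $\mathcal{E}, \mathcal{E}_{\mathbf{z}} : \mathcal{H} \to \mathbb{R}$ are continuous.
Proof. The proof follows from the bounds $|\mathcal{E}(f_1) - \mathcal{E}(f_2)| \le 2M \|f_1 - f_2\|_\infty$ and
$|\mathcal{E}_{\mathbf{z}}(f_1) - \mathcal{E}_{\mathbf{z}}(f_2)| \le 2M \|f_1 - f_2\|_\infty$ shown in the proof of Proposition 1.11. $\square$
Corollary 1.14 Let $\mathcal{H} \subseteq C(X)$ be compact and such that for all $f \in \mathcal{H}$, $|f(x) - y| \le M$ almost
everywhere. Then $f_{\mathcal{H}}$ and $f_{\mathbf{z}}$ exist.
Proof. The proof follows from the compactness of $\mathcal{H}$ and the continuity of $\mathcal{E}, \mathcal{E}_{\mathbf{z}} : C(X) \to \mathbb{R}$. $\square$
Remark 1.15
(i) The functions $f_{\mathcal{H}}$ and $f_{\mathbf{z}}$ are not necessarily unique. However, we see a uniqueness result
for $f_{\mathcal{H}}$ in Section 3.4 when $\mathcal{H}$ is convex.
(ii) Note that the requirement of $\mathcal{H}$ to be compact is what allows Corollary 1.14 to be proved
and therefore guarantees the existence of $f_{\mathcal{H}}$ and $f_{\mathbf{z}}$. Other consequences (e.g., the
finiteness of covering numbers) follow in subsequent chapters.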


Sample, approximation, and generalization errors

For a given hypothesis space H, the error in H of a function f e H is the normalized error
En(f) = E(f) - E(fy).
Note that Ey(f) > 0 for all f e H and that Ey(fy) = 0. Also note that E(fy) and Ey(f) are
different objects.
Continuing the discussion after Proposition 1.8, it follows from our definitions and
proposition that
fz - fp) dpX + a p = E(fz) = H(fz) + E(fH).
The quantities in (1.2) are the main characters in this book. We have already noted that a2 is a
lower bound on the error E that is solely due to the measure p. The generalization error E(fz)
of fz depends on p, H, the sample z, and the scheme (1.1) defining fz. The squared distance fX
(fz fp)2 dpX is the excess generalization error of fz. A goal of this book is to show that
under some hypotheses on p and H, this excess generalization error becomes arbitrarily small
with high probability as the sample size m tends to infinity.
Now consider the sum $E_H(f_z) + E(f_H)$. The second term in this sum depends on the choice
of $H$ but is independent of sampling. We will call it the approximation error. Note that this
approximation error is the sum
$$A(H) + \sigma_\rho^2,$$
where $A(H) = \int_X (f_H - f_\rho)^2 \, d\rho_X$. Therefore, $\sigma_\rho^2$ is a lower bound for the approximation error.
The first term, $E_H(f_z)$, is called the sample error or estimation error.
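The decomposition (1.2) can be checked by hand on a very small problem. The following sketch is only an illustration under invented assumptions: a three-point space X with uniform marginal, a made-up regression function `f_rho`, noise taking the values ±0.5, and a hypothesis space of constant functions on a finite grid. None of these choices come from the text.

```python
import random

# Hypothetical finite setting: X = {0, 1, 2} with uniform marginal rho_X.
# Given x, y equals f_rho(x) - 0.5 or f_rho(x) + 0.5 with equal probability.
X = [0, 1, 2]
f_rho = {0: 0.0, 1: 1.0, 2: 0.5}            # regression function (made up)
outcomes = [(x, f_rho[x] + s, 1.0 / 6.0) for x in X for s in (-0.5, 0.5)]
sigma2 = 0.25                                # sigma_rho^2 = E[(y - f_rho(x))^2]

def error(f):
    """Least squares error E(f) = integral of (f(x) - y)^2 with respect to rho."""
    return sum(p * (f(x) - y) ** 2 for x, y, p in outcomes)

def empirical_error(f, z):
    """Empirical error E_z(f) on a sample z = [(x_1, y_1), ..., (x_m, y_m)]."""
    return sum((f(x) - y) ** 2 for x, y in z) / len(z)

def sq_dist_to_f_rho(f):
    """Integral over X of (f - f_rho)^2 with respect to the uniform marginal."""
    return sum((f(x) - f_rho[x]) ** 2 for x in X) / len(X)

# Hypothesis space H: constant functions with values on a grid in [0, 1].
H = [(lambda x, c=c: c) for c in [i / 100.0 for i in range(101)]]

f_H = min(H, key=error)                      # target function in H

rng = random.Random(0)
z = []
for _ in range(20):
    x = rng.choice(X)
    z.append((x, f_rho[x] + rng.choice((-0.5, 0.5))))
f_z = min(H, key=lambda f: empirical_error(f, z))   # empirical target function

sample_error = error(f_z) - error(f_H)       # E_H(f_z)
approximation_error = error(f_H)             # E(f_H) = A(H) + sigma_rho^2
print(error(f_z), sample_error + approximation_error)
```

In this toy setting the sample error is nonnegative by construction, and the approximation error never drops below `sigma2`, in line with the remarks above.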
Equation (1.2) thus splits our goal above (to estimate $\int_X (f_z - f_\rho)^2 \, d\rho_X$ or, equivalently,
$E(f_z)$) into two different problems corresponding to finding estimates for the sample and
approximation errors. The way these problems depend on the measure $\rho$ calls for different
methods and assumptions in their analysis.
The second problem (to estimate $A(H)$) is independent of the sample $z$. But it depends
heavily on the regression function $f_\rho$. The worse behaved $f_\rho$ is (e.g., the more it oscillates), the
more difficult it will be to approximate $f_\rho$ well with functions in $H$. Consequently, all bounds for
$A(H)$ will depend on some parameter measuring the behavior of $f_\rho$.
The first problem (to estimate the sample error $E_H(f_z)$) is posed on the space $H$, and its
dependence on $\rho$ is through the sample $z$. In contrast with the approximation error, it is
essentially independent of $f_\rho$. Consequently, bounds for $E_H(f_z)$ will not depend on properties
of $f_\rho$. However, due to their dependence on the random sample $z$, they will hold with only a
certain confidence. That is, the bound will depend on a parameter $\delta$ and will hold with a
confidence of at least $1 - \delta$.
This discussion extends to some algorithmic issues. Although dependence on the behavior
of $f_\rho$ seems unavoidable in the estimates of the approximation error (and hence of the
generalization error $E(f_z)$ of $f_z$), such a dependence is undesirable in the design of the
algorithmic procedures leading to $f_z$ (e.g., the selection of $H$). Ultimately, the goal is to be
able, given a sample $z$, to select a hypothesis space $H$ and compute the resulting $f_z$ without
assumptions on $f_\rho$ and then to exhibit bounds on $\int_X (f_z - f_\rho)^2 \, d\rho_X$ that are, with high
probability, reasonably good to the extent that $f_\rho$ is well behaved. Yet, in many situations,
the choice of some parameter related to the selection of $H$ is performed with methods that,
although satisfactory in practice, lack a proper theoretical justification. For these methods, our
best theoretical results rely on information about $f_\rho$.


The bias-variance problem

For fixed H the sample error decreases when the number m of examples increases (as we see
in Theorem 3.14). Fix m instead. Then, typically, the approximation error will decrease when
enlarging H, but the sample error will increase. The bias-variance problem consists of
choosing the size of H when m is fixed so that the error E(fz) is minimized with high
probability. Roughly speaking, the bias of a solution f coincides with the approximation
error, and its variance with the sample error. This is common terminology:
A model which is too simple, or too inflexible, will have a large bias, while one which has too much flexibility in
relation to the particular data set will have a large variance. Bias and variance are complementary quantities,
and the best generalization [i.e. the smallest error] is obtained when we have the best compromise between
the conflicting requirements of small bias and small variance.


Thus, a space H that is too small will yield a large bias, whereas one that is too large will yield a
large variance. Several parameters (radius of balls, dimension, etc.) determine the size of H,
and different instances of the bias-variance problem are obtained by fixing all of them except
one and minimizing the error over this nonfixed parameter.
Failing to find a good compromise between bias and variance leads to what is called
underfitting (large bias) or overfitting (large variance). As an example, consider Case 1.2
and the curve C in Figure 1.2(a) with the set of sample points and assume we want to
approximate that curve with a polynomial of degree d (the parameter d determines in our
case the dimension of H). If d is too small, say d = 2, we obtain a curve as in Figure 1.2(b)
which necessarily underfits the data points. If d is too large, we can tightly fit the data,
but this overfitting yields a curve as in Figure 1.2(c). In terms of the error decomposition
(1.2) this overfitting corresponds to a small approximation error but large sample error.

[Figure 1.2]

[Footnote to the quotation above: [18], p. 332.]
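The polynomial example above can be explored numerically. The sketch below is an illustration only: the curve `target`, the noise level, the sample size m = 8, and the degrees 1, 3, 7 are all invented, and the least squares fit is computed with plain normal equations (adequate for such small degrees, not a recommended general method). For each degree it records the empirical error on the sample and a discretized squared distance to the target curve.

```python
import math
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / M[i][i]
    return x

def fit_polynomial(xs, ys, d):
    """Coefficients of the least squares polynomial of degree <= d."""
    A = [[sum(x ** (i + j) for x in xs) for j in range(d + 1)] for i in range(d + 1)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(d + 1)]
    return solve(A, b)

def evaluate(coeffs, x):
    return sum(c * x ** i for i, c in enumerate(coeffs))

def mean_sq_error(coeffs, xs, ys):
    return sum((evaluate(coeffs, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def target(x):
    """Hypothetical curve to be learned."""
    return math.sin(math.pi * x)

rng = random.Random(1)
m = 8
xs = [-1 + 2 * i / (m - 1) for i in range(m)]
ys = [target(x) + rng.gauss(0, 0.1) for x in xs]     # noisy sample
grid = [-1 + 2 * i / 200 for i in range(201)]

results = {}
for d in (1, 3, 7):
    w = fit_polynomial(xs, ys, d)
    empirical = mean_sq_error(w, xs, ys)
    distance = mean_sq_error(w, grid, [target(x) for x in grid])
    results[d] = (empirical, distance)
    print(d, empirical, distance)
```

Because the spaces of polynomials are nested, the empirical error cannot increase with d, and with d = m - 1 the polynomial interpolates the sample. How the distance to the target behaves depends on the noise and on the chosen curve.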

As another example of overfitting, consider the PAC learning situation in Case 1.5 with $C$
consisting of all subsets of $\mathbb{R}^n$. Consider also a sample $\{(x_1, 1), \dots, (x_k, 1), (x_{k+1}, 0), \dots, (x_m, 0)\}$.
The characteristic function of the set $S = \{x_1, \dots, x_k\}$ has zero sample error, but its
approximation error is the measure (w.r.t. $\rho_X$) of the set $T \,\Delta\, \{x_1, \dots, x_k\}$, which equals the
measure of $T$ as long as $\rho_X$ has no points with positive probability mass.


The remainder of this book

In Chapter 2 we describe some common choices for the hypothesis space H. One of them,
derived from the use of reproducing kernel Hilbert spaces (RKHSs), will be systematically used
in the remainder of the book.
The focus of Chapter 3 is on estimating the sample error. We want to estimate how close
one may expect $f_z$ and $f_H$ to be, depending on the size of the sample and with a given
confidence. Or, equivalently,
How many examples do we need to draw to assert, with a confidence greater than
$1 - \delta$, that $\int_X (f_z - f_H)^2 \, d\rho_X$ is not more than $\varepsilon$?
Our main result in Chapter 3, Theorem 3.3, gives an answer.
Chapter 4 characterizes the measures $\rho$ and some families $\{H_R\}_{R>0}$ of hypothesis spaces
for which $A(H_R)$ tends to zero with polynomial decay; that is, $A(H_R) = O(R^{-\theta})$ for some $\theta > 0$.
These families of hypothesis spaces are defined using RKHSs. Consequently, the chapter
opens with several results on these spaces, including a proof of Mercer's theorem.
The bounds for the sample error in Chapter 3 are in terms of a specific measure of the size
of the hypothesis space H, namely, its covering numbers. This measure is not explicit for all of
the common choices of H. In Chapter 5 we give bounds for these covering numbers for most
of the spaces H introduced in Chapter 2. These bounds are in terms of explicit geometric
parameters of H (e.g., dimension, diameter, smoothness, etc.).
In Chapter 6 we continue along the lines of Chapter 4. We first show some conditions
under which the approximation error can decay as $O(R^{-\theta})$ only if $f_\rho$ is $C^\infty$. Then we show a
polylogarithmic decay in the approximation error of hypothesis spaces defined via RKHSs for
some common instances of these spaces.
Chapter 7 gives a solution to the bias-variance problem for a particular family of
hypothesis spaces (and under some assumptions on fp).
Chapter 8 describes a new setting, regularization, in which the hypothesis space is no
longer required to be compact and argues some equivalence with the setting described
above. In this new setting the computation of the empirical target function is algorithmically
very simple. The notion of excess generalization error has a natural version, and a bound for it
is exhibited.
A special case of learning is that in which Y is finite and, most particularly, when it has
two elements (cf. Case 1.5). Learning problems of this kind are called classification
problems as opposed to the ones with Y = R, which are called regression problems. For
classification problems it is possible to take advantage of the special structure of Y to devise
learning schemes that perform better than simply specializing the schemes used for
regression problems. One such scheme, known as the support vector machine, is
described, and its error analyzed, in Chapter 9. Chapter 10 gives a detailed analysis for
natural extensions of the support vector machine.
We have begun Chapters 3-10 with brief introductions. Our intention is that maybe after
reading Chapter 2, a reader can form an accurate idea of the contents of this book simply by
reading these introductions.


References and additional remarks

The setting described in Section 1.2 was first considered in learning theory by V. Vapnik and
his collaborators. An account of Vapnik's work can be found in [134].
For the bias-variance problem in the context of learning theory see [18, 54] and the
references therein.
There is a vast literature in learning theory dealing with the sample error. A pair of
representative books for this topic are [6, 44].
Probably the first studies of the two terms in the error decomposition (1.2) were [96] and

In this book we will not go deeper into the details of PAC learning. A standard reference for
this is [67].
Other (but not all) books dealing with diverse mathematical aspects of learning theory are
[7, 29, 37, 57, 59, 61, 92, 95, 107, 111, 124, 125, 132, 133, 136, 137]. In addition, a number
of scientific journals publish papers on learning theory. Two devoted wholly to the theory as
developed in this book are Journal of Machine Learning Research and Machine
Finally, we want to mention that the exposition and structure of this chapter largely follow [39].


Basic hypothesis spaces

In this chapter we describe several examples of hypothesis spaces. One of these examples
(or, rather, a family of them) - a subset H of an RKHS - will be systematically used in the
remainder of this book.


First examples of hypothesis space

Example 2.1 (Homogeneous polynomials) Let $H_d = H_d(\mathbb{R}^{n+1})$ be the linear space of
homogeneous polynomials of degree $d$ in $x_0, x_1, \dots, x_n$. Let $X = S(\mathbb{R}^{n+1})$, the n-dimensional unit
sphere. An element in $H_d$ defines a function from $X$ to $\mathbb{R}$ and can be written as
$$f = \sum_{|\alpha| = d} w_\alpha x^\alpha.$$
Here, $\alpha = (\alpha_0, \dots, \alpha_n) \in \mathbb{N}^{n+1}$ is a multi-index, $|\alpha| = \alpha_0 + \cdots + \alpha_n$, and $x^\alpha = x_0^{\alpha_0} \cdots x_n^{\alpha_n}$.
Thus, $H_d$ is a finite-dimensional vector space. We may consider $H = \{f \in H_d \mid \|f\|_\infty \le 1\}$ to be
a hypothesis space. Because of the scaling $f(\lambda x) = \lambda^d f(x)$, taking the bound $\|f\|_\infty \le 1$
causes no loss.
Example 2.2 (Finite-dimensional function spaces) This generalizes the previous example. Let
$\phi_1, \dots, \phi_N \in C(X)$ and $E$ be the linear subspace of $C(X)$ spanned by $\{\phi_1, \dots, \phi_N\}$. Here we may
take $H = \{f \in E \mid \|f\|_\infty \le R\}$ for some $R > 0$.
The next two examples deal with infinite-dimensional linear spaces. To describe them
better, we first remind the reader of a few basic notions and notations.


Reminders I

We first recall some commonly used spaces of functions.

We have already defined $C(X)$. Recall that this is the Banach space of bounded continuous
functions on $X$ with the norm
$$\|f\|_{C(X)} = \|f\|_\infty = \sup_{x \in X} |f(x)|.$$
When $X \subseteq \mathbb{R}^n$, for $s \in \mathbb{N}$, we denote by $C^s(X)$ the space of functions on $X$ that are $s$ times
differentiable and whose $s$th partial derivatives $D^\alpha f$ are continuous. This is also a Banach
space with the norm
$$\|f\|_{C^s(X)} = \max_{|\alpha| \le s} \|D^\alpha f\|_\infty.$$
Here, $\alpha \in \mathbb{N}^n$ and $D^\alpha f$ is the partial derivative $\partial^{|\alpha|} f / \partial x_1^{\alpha_1} \cdots \partial x_n^{\alpha_n}$.
The space $C^\infty(X)$ is the intersection of the spaces $C^s(X)$ for $s \in \mathbb{N}$. We do not define any
norm on $C^\infty(X)$ and consider it only as a linear space.
Let $\nu$ be a Borel measure on $X$, $p \in [1, \infty)$, and $L$ be the linear space of functions $f : X \to Y$
such that the integral
$$\int_X |f(x)|^p \, d\nu$$
exists. The space $L^p_\nu(X)$ is defined to be the quotient of $L$ under the equivalence relation $\equiv$
given by
$$f \equiv g \iff \int_X |f(x) - g(x)|^p \, d\nu = 0.$$
This is a Banach space with the norm
$$\|f\|_{L^p_\nu(X)} = \Big( \int_X |f(x)|^p \, d\nu \Big)^{1/p}.$$
If $p = 2$, $L^2_\nu(X)$ is actually a Hilbert space with the scalar product
$$\langle f, g \rangle_{L^2_\nu(X)} = \int_X f(x) g(x) \, d\nu.$$
When there is no risk of confusion we write $\langle\,,\rangle_\nu$ and $\|\cdot\|_\nu$ instead of $\langle\,,\rangle_{L^2_\nu(X)}$ and $\|\cdot\|_{L^2_\nu(X)}$. In
addition, when $\nu = \rho_X$ we simply write $\|\cdot\|_\rho$ instead of the more cumbersome $\|\cdot\|_{\rho_X}$.
Note that elements in $L^p_\nu(X)$ are classes of functions. In general, however, one abuses
language and refers to them as functions on $X$. For instance, we say that $f \in L^p_\nu(X)$ is
continuous when there exists a continuous function in the class of $f$.
The support of a measure $\nu$ on $X$ is the smallest closed subset $X_\nu$ of $X$ such that $\nu(X \setminus X_\nu) = 0$.
A function $f : X \to \mathbb{R}$ is measurable when, for all $\alpha \in \mathbb{R}$, the set $\{x \in X \mid f(x) \le \alpha\}$ is a
Borel subset of $X$.
The space $L^\infty_\nu(X)$ is defined to be the set of all measurable functions on $X$ such that
$$\|f\|_{L^\infty_\nu(X)} := \sup_{x \in X_\nu} |f(x)| < \infty.$$
Each element in $L^\infty_\nu(X)$ is a class of functions that are identical on $X_\nu$.
A measure $\nu$ is finite when $\nu(X) < \infty$. Also, we say that $\nu$ is nondegenerate when, for
each nonempty open subset $U \subseteq X$, $\nu(U) > 0$. Note that $\nu$ is nondegenerate if and only if $X_\nu = X$.
If $\nu$ is finite and nondegenerate, then we have a well-defined injection $C(X) \to L^p_\nu(X)$, for
all $1 \le p \le \infty$.
When $\nu$ is the Lebesgue measure, we sometimes denote $L^p_\nu(X)$ by $L^p(X)$ or, if there is no
risk of confusion, simply by $L^p$.
We next briefly recall some basics about the Fourier transform. The Fourier transform
$F : L^1(\mathbb{R}^n) \to L^\infty(\mathbb{R}^n)$ is defined by
$$F(f)(w) = \int_{\mathbb{R}^n} e^{-i w \cdot x} f(x) \, dx.$$
The function $F(f)$ is well defined and continuous on $\mathbb{R}^n$ (note, however, that $F(f)$ is a
complex-valued function). One major property of the Fourier transform in $L^1(\mathbb{R}^n)$ is the convolution
property
$$F(f * g) = F(f) F(g),$$
where $f * g$ denotes the convolution of $f$ and $g$ defined by
$$(f * g)(x) = \int_{\mathbb{R}^n} f(x - u) g(u) \, du.$$
The extension of the Fourier transform to $L^2(\mathbb{R}^n)$ requires some caution. Let $C_0(\mathbb{R}^n)$ denote
the space of continuous functions on $\mathbb{R}^n$ with compact support. Clearly, $C_0(\mathbb{R}^n) \subseteq L^1(\mathbb{R}^n) \cap L^2(\mathbb{R}^n)$.
In addition, $C_0(\mathbb{R}^n)$ is dense in $L^2(\mathbb{R}^n)$. Thus, for any $f \in L^2(\mathbb{R}^n)$, there exists $\{\phi_k\}_{k \ge 1} \subseteq C_0(\mathbb{R}^n)$
such that $\|\phi_k - f\| \to 0$ when $k \to \infty$. One can prove that for any such sequence
$(\phi_k)$, the sequence $(F(\phi_k))$ converges to the same element in $L^2(\mathbb{R}^n)$. We denote this
element by $F(f)$ and we say that it is the Fourier transform of $f$. The notation $\hat{f}$ instead of $F(f)$
is often used.
The following result summarizes the main properties of $F : L^2(\mathbb{R}^n) \to L^2(\mathbb{R}^n)$.
Theorem 2.3 (Plancherel's theorem) For $f \in L^2(\mathbb{R}^n)$:
(i) $F(f)(w) = \lim_{k \to \infty} \int_{[-k,k]^n} e^{-i w \cdot x} f(x) \, dx$, where the convergence is for the
norm in $L^2(\mathbb{R}^n)$.
(ii) $\|F(f)\| = (2\pi)^{n/2} \|f\|$.
(iii) $f(x) = \lim_{k \to \infty} \frac{1}{(2\pi)^n} \int_{[-k,k]^n} e^{i w \cdot x} F(f)(w) \, dw$, where the convergence is
for the norm in $L^2(\mathbb{R}^n)$. If $f \in L^1(\mathbb{R}^n) \cap L^2(\mathbb{R}^n)$, then the convergence holds almost
everywhere.
(iv) The map $F : L^2(\mathbb{R}^n) \to L^2(\mathbb{R}^n)$ is an isomorphism of Hilbert spaces.
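Property (ii) can be sanity-checked numerically in dimension n = 1. The sketch below assumes the test function f(x) = exp(-x^2/2) (chosen here only because it decays fast), truncates all integrals to [-10, 10], and uses the trapezoid rule; the helper names `integrate` and `fourier` are made up for the illustration.

```python
import math

def f(x):
    """Test function (an arbitrary choice): a Gaussian."""
    return math.exp(-x * x / 2.0)

def integrate(g, a, b, steps):
    """Composite trapezoid rule for the integral of g over [a, b]."""
    h = (b - a) / steps
    total = 0.5 * (g(a) + g(b))
    for i in range(1, steps):
        total += g(a + i * h)
    return total * h

def fourier(w, limit=10.0, steps=400):
    """F(f)(w) for the even function f: the sine part of exp(-i w x) integrates to zero."""
    return integrate(lambda x: math.cos(w * x) * f(x), -limit, limit, steps)

norm_f_sq = integrate(lambda x: f(x) ** 2, -10.0, 10.0, 400)
norm_Ff_sq = integrate(lambda w: fourier(w) ** 2, -10.0, 10.0, 400)

# Plancherel with n = 1 predicts ||F(f)||^2 = 2 * pi * ||f||^2.
ratio = norm_Ff_sq / norm_f_sq
print(ratio, 2 * math.pi)
```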
(III) Our third reminder is about compactness.
It is well known that a subset of Rn is compact if and only if it is closed and bounded. This
is not true for subsets of C (X). Yet a characterization of compact subsets of C(X) in similar
terms is still possible.
A subset $S$ of $C(X)$ is said to be equicontinuous at $x \in X$ when for every $\varepsilon > 0$ there
exists a neighborhood $V$ of $x$ such that for all $y \in V$ and $f \in S$,
$|f(x) - f(y)| < \varepsilon$. The set $S$ is said to be equicontinuous when it is so at every $x$ in $X$.
Theorem 2.4 (Arzelà-Ascoli theorem) Let $X$ be compact and $S$ be a subset of $C(X)$. Then
$S$ is a compact subset of $C(X)$ if and only if $S$ is closed, bounded, and equicontinuous.

The fact that every closed ball in $\mathbb{R}^n$ is compact is not true in Hilbert space. However, we
will use the fact that closed balls in a Hilbert space $H$ are weakly compact. That is, every
sequence $\{f_n\}_{n \in \mathbb{N}}$ in a closed ball $B$ in $H$ has a weakly convergent subsequence $\{f_{n_k}\}_{k \in \mathbb{N}}$, or,
in other words, there is some $f \in B$ such that
$$\lim_{k \to \infty} \langle f_{n_k}, g \rangle = \langle f, g \rangle, \quad \forall g \in H.$$
We close this section with a discussion of completely monotonic functions. This
discussion is on a less general topic than the preceding contents of these reminders.
A function $f : [0, \infty) \to \mathbb{R}$ is completely monotonic if it is continuous on $[0, \infty)$, $C^\infty$ on
$(0, \infty)$, and, for all $r > 0$ and $k \ge 0$, $(-1)^k f^{(k)}(r) \ge 0$.
We will use the following characterization of completely monotonic functions.
Proposition 2.5 A function $f : [0, \infty) \to \mathbb{R}$ is completely monotonic if and only if, for
all $t \in (0, \infty)$,
$$f(t) = \int_0^\infty e^{-t\sigma} \, d\nu(\sigma),$$
where $\nu$ is a finite Borel measure on $[0, \infty)$.

2.3 Hypothesis spaces associated with Sobolev spaces

Definition 2.6 Let $J : E \to F$ be a linear map between the Banach spaces $E$ and $F$. We say that
$J$ is bounded when there exists $b \in \mathbb{R}$ such that for all $x \in E$ with $\|x\| = 1$, $\|J(x)\| \le b$. The
operator norm of $J$ is
$$\|J\| = \sup_{\|x\| = 1} \|J(x)\|.$$
If $J$ is not bounded, then we write $\|J\| = \infty$. We say that $J$ is compact when the closure $\overline{J(B)}$
of $J(B)$ is compact for any bounded set $B \subseteq E$.
Example 2.7 (Sobolev spaces) Let $X$ be a domain in $\mathbb{R}^n$ with smooth boundary. For every
$s \in \mathbb{N}$ we can define an inner product in $C^\infty(X)$ by
$$\langle f, g \rangle_s = \int_X \sum_{|\alpha| \le s} D^\alpha f \, D^\alpha g.$$
Here we are integrating with respect to the Lebesgue measure $\mu$ on $X$ inherited from
Euclidean space. We will denote by $\|\cdot\|_s$ the norm induced by $\langle\,,\rangle_s$. Notice that when $s = 0$,
the inner product above coincides with that of $L^2_\mu(X)$. That is, $\|\cdot\|_0 = \|\cdot\|_\mu$. We define the
Sobolev space $H^s(X)$ to be the completion of $C^\infty(X)$ with respect to the norm $\|\cdot\|_s$. The
Sobolev embedding theorem asserts that for all $r \in \mathbb{N}$ and all $s > n/2 + r$, the inclusion
$$J_s : H^s(X) \to C^r(X)$$
is well defined and bounded. In particular, for all $s > n/2$, the inclusion
$$J_s : H^s(X) \to C(X)$$
is well defined and bounded. From Rellich's theorem it follows that if $X$ is compact, this last
embedding is compact as well. Thus, if $B_R$ denotes the closed ball of radius $R$ in $H^s(X)$ we may
take $H_{R,s} = H = \overline{J_s(B_R)}$.


Reproducing Kernel Hilbert Spaces

Definition 2.8 Let $X$ be a metric space. We say that $K : X \times X \to \mathbb{R}$ is symmetric when
$K(x, t) = K(t, x)$ for all $x, t \in X$ and that it is positive semidefinite when for all finite sets
$\mathbf{x} = \{x_1, \dots, x_k\} \subseteq X$ the $k \times k$ matrix $K[\mathbf{x}]$ whose $(i, j)$ entry is $K(x_i, x_j)$ is positive semidefinite. We
say that $K$ is a Mercer kernel if it is continuous, symmetric, and positive semidefinite. The
matrix $K[\mathbf{x}]$ above is called the Gramian of $K$ at $\mathbf{x}$.
For the remainder of this section we fix a compact metric space $X$ and a Mercer kernel
$K : X \times X \to \mathbb{R}$. Note that the positive semidefiniteness implies that $K(x, x) \ge 0$ for each
$x \in X$. We define
$$C_K := \sup_{x \in X} \sqrt{K(x, x)}.$$
Then
$$C_K = \sup_{x, t \in X} \sqrt{|K(x, t)|}$$
since, by the positive semidefiniteness of the matrix $K[\{x, t\}]$, for all $x, t \in X$,
$(K(x, t))^2 \le K(x, x) K(t, t)$.
For $x \in X$, we denote by $K_x$ the function
$$K_x : X \to \mathbb{R}, \qquad t \mapsto K(x, t).$$
The main result of this section is given in the following theorem.
Theorem 2.9 There exists a unique Hilbert space $(H_K, \langle\,,\rangle_{H_K})$ of functions on $X$
satisfying the following conditions:
(i) for all $x \in X$, $K_x \in H_K$,
(ii) the span of the set $\{K_x \mid x \in X\}$ is dense in $H_K$, and
(iii) for all $f \in H_K$ and $x \in X$, $f(x) = \langle K_x, f \rangle_{H_K}$.
Moreover, $H_K$ consists of continuous functions and the inclusion $I_K : H_K \to C(X)$ is
bounded with $\|I_K\| \le C_K$.
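Proof. Let $H_0$ be the span of the set $\{K_x \mid x \in X\}$. We define an inner product in $H_0$ as
$$\langle f, g \rangle = \sum_{i=1}^{s} \sum_{j=1}^{r} \alpha_i \beta_j K(x_i, t_j), \quad \text{for } f = \sum_{i=1}^{s} \alpha_i K_{x_i}, \ g = \sum_{j=1}^{r} \beta_j K_{t_j}.$$
The conditions for the inner product can be easily checked. For example, if $\langle f, f \rangle = 0$, then
for each $t \in X$ the positive semidefiniteness of the Gramian of $K$ at the subset $\{x_i\}_{i=1}^{s} \cup \{t\}$
tells us that for each $\varepsilon \in \mathbb{R}$
$$\sum_{i,j=1}^{s} \alpha_i K(x_i, x_j) \alpha_j + 2 \sum_{i=1}^{s} \alpha_i K(x_i, t) \varepsilon + \varepsilon^2 K(t, t) \ge 0.$$
However, $\sum_{i,j=1}^{s} \alpha_i K(x_i, x_j) \alpha_j = \langle f, f \rangle = 0$. By letting $\varepsilon$ be arbitrarily small (of either sign),
we see that $f(t) = \sum_{i=1}^{s} \alpha_i K(x_i, t) = 0$. This is true for each $t \in X$; hence $f$ is the zero function.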
Let $H_K$ be the completion of $H_0$ with the associated norm. It is easy to check that $H_K$
satisfies the three conditions in the statement. We need only prove that it is unique. So,
assume $H$ is another Hilbert space of functions on $X$ satisfying the conditions noted. We want
to show that
$$H = H_K \quad \text{and} \quad \langle\,,\rangle_H = \langle\,,\rangle_{H_K}. \qquad (2.1)$$
We first observe that $H_0 \subseteq H$. Also, for any $x, t \in X$, $\langle K_x, K_t \rangle_H = K(x, t) = \langle K_x, K_t \rangle_{H_K}$. By
linearity, for every $f, g \in H_0$, $\langle f, g \rangle_H = \langle f, g \rangle_{H_K}$. Since both $H$ and $H_K$ are completions of $H_0$,
(2.1) follows from the uniqueness of the completion.
To see the remaining assertion consider $f \in H_K$ and $x \in X$. Then
$$|f(x)| = |\langle K_x, f \rangle_{H_K}| \le \|f\|_{H_K} \|K_x\|_{H_K} = \|f\|_{H_K} \sqrt{K(x, x)}.$$
This implies that $\|f\|_\infty \le C_K \|f\|_{H_K}$ and, thus, $\|I_K\| \le C_K$. Therefore, convergence in $\|\cdot\|_{H_K}$
implies convergence in $\|\cdot\|_\infty$, and this shows that $f$ is continuous since $f$ is the limit of
elements in $H_0$ that are continuous.
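The inner product on the span of the functions K_x and the resulting pointwise bound can be tried out directly. The sketch below assumes a one-dimensional Gaussian kernel and arbitrary coefficients and nodes, all chosen only for illustration; `inner` and `evaluate` are hypothetical helper names.

```python
import math

def K(x, t, c=1.0):
    """A Mercer kernel on the real line (Gaussian, chosen for illustration)."""
    return math.exp(-((x - t) ** 2) / c ** 2)

def inner(alpha, xs, beta, ts):
    """<f, g> for f = sum_i alpha_i K_{x_i} and g = sum_j beta_j K_{t_j}."""
    return sum(a * b * K(x, t) for a, x in zip(alpha, xs) for b, t in zip(beta, ts))

def evaluate(alpha, xs, t):
    """f(t) for f = sum_i alpha_i K_{x_i}."""
    return sum(a * K(x, t) for a, x in zip(alpha, xs))

xs = [0.0, 0.3, 1.0]           # arbitrary nodes
alpha = [1.0, -2.0, 0.5]       # arbitrary coefficients
norm_f = math.sqrt(inner(alpha, xs, alpha, xs))

points = [i / 10.0 for i in range(-10, 21)]

# Reproducing property: <K_x, f> equals f(x).
repro_gap = max(abs(inner([1.0], [x], alpha, xs) - evaluate(alpha, xs, x)) for x in points)

# Pointwise bound |f(x)| <= sqrt(K(x, x)) * ||f||_K, up to rounding.
bound_ok = all(
    abs(evaluate(alpha, xs, x)) <= math.sqrt(K(x, x)) * norm_f + 1e-12 for x in points
)
print(repro_gap, bound_ok)
```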
In what follows, to reduce the amount of notation, we will write $\langle\,,\rangle_K$ instead of $\langle\,,\rangle_{H_K}$ and
$\|\cdot\|_K$ instead of $\|\cdot\|_{H_K}$.
Definition 2.10 The Hilbert space $H_K$ in Theorem 2.9 is said to be a reproducing kernel
Hilbert space (RKHS). Property (iii) in Theorem 2.9 is referred to as the reproducing property.


Some Mercer kernels

In this section we discuss some families of Mercer kernels on subsets of Rn. In most cases,
checking the symmetry and continuity of a given kernel K will be straightforward. Checking
that K is positive semidefinite will be more involved.
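The first family of Mercer kernels we look at is that of dot product kernels. Let
$X = \{x \in \mathbb{R}^n : \|x\| \le R\}$ be a ball of $\mathbb{R}^n$ with radius $R > 0$. A dot product kernel is a function
$K : X \times X \to \mathbb{R}$ given by
$$K(x, y) = \sum_{d=0}^{\infty} a_d (x \cdot y)^d,$$
where $a_d \ge 0$ and $\sum_{d} a_d R^{2d} < \infty$.
Proposition 2.11 Dot product kernels are Mercer kernels on $X$.
Proof. The kernel $K$ is obviously symmetric and continuous on $X \times X$. To check its positive
semidefiniteness, recall that the multinomial coefficients associated with the pairs $(d, \alpha)$,
$$C^d_\alpha = \frac{d!}{\alpha_1! \cdots \alpha_n!},$$
satisfy
$$(x \cdot y)^d = \sum_{|\alpha| = d} C^d_\alpha x^\alpha y^\alpha, \quad \forall x, y \in \mathbb{R}^n.$$
Let $\{x_1, \dots, x_k\} \subseteq X$. Then, for all $c_1, \dots, c_k \in \mathbb{R}$,
$$\sum_{i,j=1}^{k} c_i c_j K(x_i, x_j) = \sum_{d=0}^{\infty} a_d \sum_{|\alpha| = d} C^d_\alpha \Big( \sum_{i=1}^{k} c_i x_i^\alpha \Big)^2 \ge 0.$$
Therefore, $K$ is a Mercer kernel.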

An explicit example is the linear polynomial kernel.

Example 2.12 Let $X$ be a subset of $\mathbb{R}$ containing at least two points, and $K$ the Mercer kernel
on $X$ given by $K(x, y) = 1 + xy$. Then $H_K$ is the space of linear functions and $\{1, x\}$ forms
an orthonormal basis of $H_K$.
Proof. Note that for $a \in X$, $K_a$ is the function $1 + ax$ of the variable $x$ in $X \subseteq \mathbb{R}$. Take $a \ne b \in X$.
By the definition of the inner product in $H_K$,
$$\|K_a - K_b\|_K^2 = \langle K_a - K_b, K_a - K_b \rangle_K = K(a, a) - 2K(a, b) + K(b, b) = 1 + a^2 - 2(1 + ab) + 1 + b^2 = (a - b)^2.$$
But $(K_a - K_b)(x) = (1 + ax) - (1 + bx) = (a - b)x$. So
$\|K_a - K_b\|_K^2 = \|(a - b)x\|_K^2 = (a - b)^2 \|x\|_K^2$. It follows that $\|x\|_K = 1$.
In the same way,
$$\langle K_a, K_a - K_b \rangle_K = K(a, a) - K(a, b) = (1 + a^2) - (1 + ab) = a(a - b).$$
But $(K_a - K_b)(x) = (a - b)x$ and $K_a(x) = 1 + ax$. So
$$\langle K_a, K_a - K_b \rangle_K = \langle 1 + ax, (a - b)x \rangle_K = (a - b) \langle 1, x \rangle_K + a(a - b) \|x\|_K^2.$$
Since $\|x\|_K^2 = 1$, we have $(a - b) \langle 1, x \rangle_K = 0$. But $a - b \ne 0$, and hence $\langle 1, x \rangle_K = 0$.
Similarly, $\langle K_a, K_a \rangle_K = K(a, a) = 1 + a^2$ can also be written as
$$\langle 1 + ax, 1 + ax \rangle_K = \|1\|_K^2 + 2a \langle 1, x \rangle_K + a^2 \|x\|_K^2 = \|1\|_K^2 + a^2.$$
Hence $\|1\|_K = 1$. We have thus proved that $\{1, x\}$ is an orthonormal system of $H_K$.
Finally, each function $K_c = 1 + cx$ with $c \in X$ is contained in $\mathrm{span}\{1, x\}$, which is a closed
subspace of $H_K$. Therefore, $\mathrm{span}\{1, x\} = H_K$ and $\{1, x\}$ is an orthonormal basis of $H_K$.
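The orthonormality of {1, x} means that a function u + v x has squared norm u^2 + v^2 in this space. The sketch below checks this for one function written as a combination of kernel sections K_a; the points and coefficients are arbitrary illustrative values.

```python
def K(x, y):
    """Linear polynomial kernel on a subset of the real line."""
    return 1.0 + x * y

def norm_sq(alpha, pts):
    """Squared norm of f = sum_i alpha_i K_{a_i}, computed from the Gramian."""
    return sum(a * b * K(x, y) for a, x in zip(alpha, pts) for b, y in zip(alpha, pts))

pts = [0.5, -1.0, 2.0]          # points a_i in X (arbitrary)
alpha = [1.5, 0.25, -0.75]      # coefficients (arbitrary)

# f(x) = sum_i alpha_i (1 + a_i x) = u + v x
u = sum(alpha)
v = sum(a * x for a, x in zip(alpha, pts))

lhs = norm_sq(alpha, pts)       # norm from the kernel
rhs = u ** 2 + v ** 2           # norm from the orthonormal basis {1, x}
print(lhs, rhs)
```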

The above extends to the multivariate case under a slightly stronger assumption on X .
Example 2.13 Let $X \subseteq \mathbb{R}^n$ contain 0 and the coordinate vectors $e_j$, $j = 1, \dots, n$. Let $K$ be the
Mercer kernel on $X$ given by $K(x, y) = 1 + x \cdot y$. Then $H_K$ is the space of linear functions and
$\{1, x_1, x_2, \dots, x_n\}$ forms an orthonormal basis of $H_K$.
Proof. Note that $K_v(x) = 1 + v \cdot x = 1 + v_1 x_1 + \cdots + v_n x_n$ with $v = (v_1, \dots, v_n) \in X \subseteq \mathbb{R}^n$.
We argue as in Example 2.12. For each $1 \le j \le n$,
$$\|K_{e_j} - K_0\|_K^2 = K(e_j, e_j) - 2K(e_j, 0) + K(0, 0) = 1.$$
But $(K_{e_j} - K_0)(x) = (1 + x_j) - 1 = x_j$, and therefore $1 = \|K_{e_j} - K_0\|_K = \|x_j\|_K$. One can prove,
similarly, that $\langle 1, x_j \rangle_K = 0$ and $\langle 1, 1 \rangle_K = 1$. Now consider $i \ne j$:
$$\langle K_{e_i}, K_{e_j} \rangle_K = K(e_i, e_j) = 1 = \langle 1 + x_i, 1 + x_j \rangle_K = \|1\|_K^2 + \langle 1, x_i \rangle_K + \langle 1, x_j \rangle_K + \langle x_i, x_j \rangle_K.$$
Since we have shown that $\langle 1, x_i \rangle_K = \langle 1, x_j \rangle_K = 0$ and $\|1\|_K = 1$, it follows that $\langle x_i, x_j \rangle_K = 0$.
We have thus proved that $\{1, x_1, x_2, \dots, x_n\}$ is an orthonormal system of $H_K$. But each function
$K_v = 1 + v \cdot x$ with $v \in X$ is contained in $\mathrm{span}\{1, x_1, x_2, \dots, x_n\}$, which is a closed subspace of
$H_K$. Therefore, $\mathrm{span}\{1, x_1, x_2, \dots, x_n\} = H_K$ and $\{1, x_1, x_2, \dots, x_n\}$ is an orthonormal basis of $H_K$.

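The second family is that of translation invariant kernels, given by
$$K(x, y) = k(x - y),$$
where $k$ is an even function on $\mathbb{R}^n$, that is, $k(-x) = k(x)$ for all $x \in \mathbb{R}^n$.
We say that the Fourier transform $\hat{k}$ of $k$ is nonnegative (respectively, positive) when it
is real valued and $\hat{k}(\xi) \ge 0$ (respectively, $\hat{k}(\xi) > 0$) for all $\xi \in \mathbb{R}^n$.
Proposition 2.14 Let $k \in L^2(\mathbb{R}^n)$ be continuous and even. Suppose the Fourier transform
of $k$ is nonnegative. Then the kernel $K(x, y) = k(x - y)$ is a Mercer kernel on $\mathbb{R}^n$ and
hence a Mercer kernel on any subset $X$ of $\mathbb{R}^n$.
Proof. We need only show the positive semidefiniteness. To do so, for any $x_1, \dots, x_m \in \mathbb{R}^n$ and
$c_1, \dots, c_m \in \mathbb{R}$, we apply the inverse Fourier transform
$$k(x) = (2\pi)^{-n} \int_{\mathbb{R}^n} \hat{k}(\xi) e^{i x \cdot \xi} \, d\xi$$
to get
$$\sum_{j,\ell=1}^{m} c_j c_\ell K(x_j, x_\ell) = \sum_{j,\ell=1}^{m} c_j c_\ell (2\pi)^{-n} \int_{\mathbb{R}^n} \hat{k}(\xi) e^{i x_j \cdot \xi} \, \overline{e^{i x_\ell \cdot \xi}} \, d\xi = (2\pi)^{-n} \int_{\mathbb{R}^n} \hat{k}(\xi) \Big| \sum_{j=1}^{m} c_j e^{i x_j \cdot \xi} \Big|^2 d\xi \ge 0,$$
where $|\cdot|$ means the module in $\mathbb{C}$ and $\bar{z}$ is the complex conjugate of $z$. Thus, $K$ is a Mercer
kernel on any subset of $\mathbb{R}^n$.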

Example 2.15 (A spline kernel) Let $k$ be the univariate function supported on $[-2, 2]$ given by
$k(x) = 1 - |x|/2$ for $-2 \le x \le 2$. Then the kernel $K$ defined by $K(x, y) = k(x - y)$ is a Mercer
kernel on any subset $X$ of $\mathbb{R}$.
Proof. One can easily check that $2k(x)$ equals the convolution of the characteristic function
$\chi_{[-1,1]}$ with itself. But $\hat{\chi}_{[-1,1]}(\xi) = 2 \sin \xi / \xi$. Thus, $\hat{k}(\xi) = 2 (\sin \xi / \xi)^2 \ge 0$ and the Mercer property
follows from Proposition 2.14.
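The closed form for the Fourier transform of the hat function can be compared with a direct numerical integration. This is a rough sketch: the trapezoid rule, the number of steps, and the sample frequencies are arbitrary choices made for the illustration.

```python
import math

def k(x):
    """Hat function supported on [-2, 2]."""
    return max(0.0, 1.0 - abs(x) / 2.0)

def k_hat_numeric(xi, steps=4000):
    """Fourier transform of k at xi by the trapezoid rule.

    Since k is even, the transform reduces to the integral of k(x) cos(xi x) over [-2, 2].
    """
    h = 4.0 / steps
    total = 0.0
    for i in range(steps + 1):
        x = -2.0 + i * h
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * k(x) * math.cos(xi * x)
    return total * h

def k_hat_closed_form(xi):
    """The expression 2 (sin xi / xi)^2 from the proof (for xi != 0)."""
    return 2.0 * (math.sin(xi) / xi) ** 2

frequencies = [0.5, 1.0, 2.0, 3.7]
gaps = [abs(k_hat_numeric(xi) - k_hat_closed_form(xi)) for xi in frequencies]
print(gaps)
```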

Remark 2.16 Note that the kernel $K$ defined in Example 2.15 is given by
$$K(x, y) = \begin{cases} 1 - \dfrac{|x - y|}{2} & \text{if } |x - y| \le 2, \\ 0 & \text{otherwise,} \end{cases}$$
and therefore $C_K = 1$.
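Multivariate splines can also be used to construct translation-invariant kernels. Take
$B = [b_1 \ b_2 \ \dots \ b_q]$ to be an $n \times q$ matrix (called the direction set) such that $q \ge n$ and the $n \times n$
submatrix $B_0 = [b_1 \ b_2 \ \dots \ b_n]$ is invertible. Define
$$M_{B_0} = \frac{1}{|\det B_0|} \chi_{\mathrm{par}(B_0)}$$
to be the normalized characteristic function of the parallelepiped
$$\mathrm{par}(B_0) := \Big\{ \sum_{j=1}^{n} t_j b_j \ \Big| \ |t_j| \le \tfrac{1}{2}, \ 1 \le j \le n \Big\}$$
spanned by the vectors $b_1, \dots, b_n$ in $\mathbb{R}^n$. Then the (centered) box spline $M_B$ can be
inductively defined by
$$M_{[b_1 \ b_2 \ \dots \ b_{n+j}]}(x) = \int_{-1/2}^{1/2} M_{[b_1 \ b_2 \ \dots \ b_{n+j-1}]}(x - t b_{n+j}) \, dt$$
for $j = 1, \dots, q - n$. One can check by induction that its Fourier transform satisfies
$$\hat{M}_B(\xi) = \prod_{j=1}^{q} \frac{\sin(\xi \cdot b_j / 2)}{\xi \cdot b_j / 2}.$$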
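Example 2.17 (A box spline kernel) Let $B = [b_1 \ b_2 \ \dots \ b_q]$ be an $n \times q$ matrix where
$[b_1 \ b_2 \ \dots \ b_n]$ is invertible. Choose $k(x) = (M_B * M_B)(x)$ to be the box
spline with direction set $[B, B]$. Then, for all $\xi \in \mathbb{R}^n$,
$$\hat{k}(\xi) = \prod_{j=1}^{q} \Big( \frac{\sin(\xi \cdot b_j / 2)}{\xi \cdot b_j / 2} \Big)^2 \ge 0,$$
and the kernel $K(x, y) = k(x - y)$ is a Mercer kernel on any subset $X$ of $\mathbb{R}^n$.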

An interesting class of translation invariant kernels is provided by radial basis functions.
Here the kernel takes the form $K(x, y) = f(\|x - y\|^2)$ for a univariate function $f$ on $[0, +\infty)$.
The following result allows us to verify positive semidefiniteness easily for this type of kernel.
Proposition 2.18 Let $X \subseteq \mathbb{R}^n$, $f : [0, \infty) \to \mathbb{R}$ and $K : X \times X \to \mathbb{R}$ defined by
$K(x, y) = f(\|x - y\|^2)$. If $f$ is completely monotonic, then $K$ is positive semidefinite.

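Proof. By Proposition 2.5, there is a finite Borel measure $\nu$ on $[0, \infty)$ for which
$$f(t) = \int_0^\infty e^{-t\sigma} \, d\nu(\sigma)$$
for all $t \in [0, \infty)$. It follows that
$$K(x, y) = f(\|x - y\|^2) = \int_0^\infty e^{-\sigma \|x - y\|^2} \, d\nu(\sigma).$$
Now note that for each $\sigma \in [0, \infty)$, the Fourier transform of $e^{-\sigma \|x\|^2}$ equals
$(\sqrt{\pi/\sigma})^n e^{-\|\xi\|^2 / 4\sigma}$. Hence,
$$e^{-\sigma \|x\|^2} = (2\pi)^{-n} \int_{\mathbb{R}^n} \Big( \sqrt{\frac{\pi}{\sigma}} \Big)^n e^{-\|\xi\|^2 / 4\sigma} e^{i x \cdot \xi} \, d\xi.$$
Therefore, reasoning as in the proof of Proposition 2.14, we have, for all $\mathbf{x} = (x_1, \dots, x_m) \in X^m$,
$$\sum_{\ell, j = 1}^{m} c_\ell c_j K(x_\ell, x_j) = \int_0^\infty (2\pi)^{-n} \int_{\mathbb{R}^n} \Big( \sqrt{\frac{\pi}{\sigma}} \Big)^n e^{-\|\xi\|^2 / 4\sigma} \Big| \sum_{\ell=1}^{m} c_\ell e^{i x_\ell \cdot \xi} \Big|^2 d\xi \, d\nu(\sigma) \ge 0.$$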
Corollary 2.19 Let $c > 0$. The following functions are Mercer kernels on any subset
$X \subseteq \mathbb{R}^n$:
(i) (Gaussian) $K(x, t) = e^{-\|x - t\|^2 / c^2}$.
(ii) (Inverse multiquadrics) $K(x, t) = (c^2 + \|x - t\|^2)^{-\alpha}$ with $\alpha > 0$.
Proof. Clearly, both kernels are continuous and symmetric. In (i) $K$ is positive semidefinite by
Proposition 2.18 with $f(r) = e^{-r/c^2}$. The same is true for (ii) taking $f(r) = (c^2 + r)^{-\alpha}$.
Remark 2.20 The kernels of (i) and (ii) in Corollary 2.19 satisfy $C_K = 1$ and $C_K = c^{-\alpha}$,
respectively.
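As an informal check of the corollary and of the constants in the remark, one can build Gramians of both kernels at random points and evaluate random quadratic forms. Everything concrete below (dimension 2, six points, c = 1.5, alpha = 0.75, the helper names) is an assumption made for the illustration, and a finite number of random trials is of course no proof of positive semidefiniteness.

```python
import math
import random

def sq_dist(x, t):
    return sum((a - b) ** 2 for a, b in zip(x, t))

def gaussian(x, t, c=1.5):
    return math.exp(-sq_dist(x, t) / c ** 2)

def inverse_multiquadric(x, t, c=1.5, alpha=0.75):
    return (c ** 2 + sq_dist(x, t)) ** (-alpha)

def gramian(K, points):
    """Matrix with (i, j) entry K(x_i, x_j)."""
    return [[K(x, t) for t in points] for x in points]

def quadratic_form(G, coeffs):
    """sum_{i,j} c_i c_j G[i][j]."""
    return sum(ci * G[i][j] * cj for i, ci in enumerate(coeffs) for j, cj in enumerate(coeffs))

rng = random.Random(0)
points = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(6)]

smallest = {}
for name, K in (("gaussian", gaussian), ("inverse_multiquadric", inverse_multiquadric)):
    G = gramian(K, points)
    forms = [quadratic_form(G, [rng.gauss(0, 1) for _ in points]) for _ in range(200)]
    smallest[name] = min(forms)

p = points[0]
C_K_gaussian = math.sqrt(gaussian(p, p))              # sqrt(K(x, x)), the same for every x
C_K_multiquadric = math.sqrt(inverse_multiquadric(p, p))
print(smallest, C_K_gaussian, C_K_multiquadric)
```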
A key example of a finite-dimensional RKHS induced by a Mercer kernel follows. Unlike in
the case of the Mercer kernels of Corollary 2.19, we will not use Proposition 2.18 to show

Example 2.1 (continued) Recall that $H_d = H_d(\mathbb{R}^{n+1})$ is the linear space of homogeneous
polynomials of degree $d$ in $x_0, x_1, \dots, x_n$. Its dimension (the number of coefficients of a
polynomial $f \in H_d$) is
$$N = \binom{n + d}{n}.$$
The number $N$ is exponential in $n$ and $d$. We notice, however, that in some situations one may
consider a linear space of polynomials with a given monomial structure; that is, only a
prespecified set of monomials may appear.
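The dimension count can be confirmed by brute force for small cases. The sketch below enumerates all multi-indices of total degree d in n + 1 variables and compares the count with the standard binomial formula C(n + d, n); the function names are made up for the illustration.

```python
from itertools import product
from math import comb

def multi_indices(num_vars, d):
    """All alpha in N^num_vars with |alpha| = d (brute-force enumeration)."""
    return [alpha for alpha in product(range(d + 1), repeat=num_vars) if sum(alpha) == d]

def dimension(n, d):
    """Number of monomials of degree d in the n + 1 variables x_0, ..., x_n."""
    return len(multi_indices(n + 1, d))

dims = {(n, d): dimension(n, d) for n in range(1, 4) for d in range(1, 5)}
for (n, d), value in sorted(dims.items()):
    print(n, d, value, comb(n + d, n))
```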
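We can make $H_d$ an inner product space by taking
$$\langle f, g \rangle_W = \sum_{|\alpha| = d} w_\alpha v_\alpha (C^d_\alpha)^{-1}$$
for $f, g \in H_d$, $f = \sum w_\alpha x^\alpha$, $g = \sum v_\alpha x^\alpha$, where $C^d_\alpha = d!/(\alpha_0! \cdots \alpha_n!)$ is the multinomial
coefficient. This inner product, which we call the Weyl inner
product, is natural and has an important invariance property. Let $O(n + 1)$ be the orthogonal
group in $\mathbb{R}^{n+1}$, that is, the group of $(n + 1) \times (n + 1)$ real matrices whose action on $\mathbb{R}^{n+1}$
preserves the inner product on $\mathbb{R}^{n+1}$,
$$\sigma(x) \cdot \sigma(y) = x \cdot y, \quad \text{for all } x, y \in \mathbb{R}^{n+1} \text{ and all } \sigma \in O(n + 1).$$
The action of $O(n + 1)$ on $\mathbb{R}^{n+1}$ induces an action of $O(n + 1)$ on $H_d$. For $f \in H_d$ and
$\sigma \in O(n + 1)$ we define $\sigma f \in H_d$ by $\sigma f(x) = f(\sigma^{-1}(x))$. The invariance property of $\langle\,,\rangle_W$, called
orthogonal invariance, is that for all $f, g \in H_d$,
$$\langle \sigma(f), \sigma(g) \rangle_W = \langle f, g \rangle_W.$$
Note that if $\|f\|_W$ denotes the norm induced by $\langle\,,\rangle_W$, then
$$|f(x)| \le \|f\|_W \|x\|^d,$$
where $\|x\|$ is the standard norm of $x \in \mathbb{R}^{n+1}$. This follows from taking the action of $\sigma \in O(n + 1)$
such that $\sigma(x) = (\|x\|, 0, \dots, 0)$.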

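Let $X = S(\mathbb{R}^{n+1})$ and
$$K : X \times X \to \mathbb{R}, \qquad (x, t) \mapsto (x \cdot t)^d.$$
Let also
$$\Phi : X \to \mathbb{R}^N, \qquad x \mapsto \big( x^\alpha (C^d_\alpha)^{1/2} \big)_{|\alpha| = d}.$$
Then, for $x, t \in X$, we have
$$\Phi(x) \cdot \Phi(t) = \sum_{|\alpha| = d} x^\alpha t^\alpha C^d_\alpha = (x \cdot t)^d = K(x, t).$$
This equality enables us to prove that $K$ is positive semidefinite. For $t_1, \dots, t_k \in X$, the entry in
row $i$ and column $j$ of $K[\mathbf{t}]$ is $\Phi(t_i) \cdot \Phi(t_j)$. Therefore, if $M$ denotes the matrix whose $j$th column is
$\Phi(t_j)$, we have that $K[\mathbf{t}] = M^T M$, from which the positivity of $K[\mathbf{t}]$ follows. Since $K$ is clearly
continuous and symmetric, we conclude that $K$ is a Mercer kernel.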
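The identity between the dot product of the feature vectors and the kernel value is easy to test numerically. The sketch below assumes n = 2, d = 3 and two arbitrary points normalized to the unit sphere; the helper names (`multi_indices`, `multinomial`, `phi`) are invented for the illustration.

```python
import math
from itertools import product

def multi_indices(num_vars, d):
    """All alpha in N^num_vars with |alpha| = d."""
    return [alpha for alpha in product(range(d + 1), repeat=num_vars) if sum(alpha) == d]

def multinomial(d, alpha):
    """Multinomial coefficient d! / (alpha_0! ... alpha_n!)."""
    value = math.factorial(d)
    for a in alpha:
        value //= math.factorial(a)
    return value

def monomial(x, alpha):
    return math.prod(xi ** a for xi, a in zip(x, alpha))

def phi(x, d):
    """Feature map x -> (x^alpha * sqrt(C^d_alpha)) over all |alpha| = d."""
    return [monomial(x, alpha) * math.sqrt(multinomial(d, alpha))
            for alpha in multi_indices(len(x), d)]

def normalize(x):
    norm = math.sqrt(sum(xi * xi for xi in x))
    return [xi / norm for xi in x]

d = 3
x = normalize([1.0, -2.0, 0.5])      # arbitrary points on the unit sphere of R^3
t = normalize([0.3, 0.4, -1.2])

feature_dot = sum(a * b for a, b in zip(phi(x, d), phi(t, d)))
kernel_value = sum(a * b for a, b in zip(x, t)) ** d
print(feature_dot, kernel_value)
```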
The next proposition shows the RKHS associated with K.
Proposition 2.21 $H_d = H_K$ as function spaces and inner product spaces.
Proof. We know from the proof of Theorem 2.9 that $H_K$ is the completion of $H_0$, the span of
$\{K_x \mid x \in X\}$. Since $H_0 \subseteq H_d$ and $H_d$ has finite dimension, the same holds for $H_0$. But then $H_0$ is
complete and we deduce that
$$H_K = H_0 \subseteq H_d.$$
The map $V : \mathbb{R}^{n+1} \to \mathbb{R}^N$ defined by $V(x) = (x^\alpha)_{|\alpha| = d}$ is a well-known object in algebraic
geometry, where it is called a Veronese embedding. We note here that the map $\Phi$ defined
above is related to $V$, since for every $x \in X$, $\Phi(x) = D V(x)$, where $D$ is the diagonal matrix
with entries $(C^d_\alpha)^{1/2}$. The image of $\mathbb{R}^{n+1}$ by the Veronese embedding is an algebraic variety
called the Veronese variety, which is known to be nondegenerate, that is, to span all of $\mathbb{R}^N$.
This implies that $H_K = H_d$ as vector spaces. We now show that they are actually the same
inner product space.
By definition of the inner product in $H_0$, for all $x, t \in X$,
$$\langle K_x, K_t \rangle_{H_0} = K(x, t) = \sum_{|\alpha| = d} C^d_\alpha x^\alpha t^\alpha.$$
On the other hand, since $K_x(w) = \sum_{|\alpha| = d} C^d_\alpha x^\alpha w^\alpha$, we know that the Weyl
inner product of $K_x$ and $K_t$ satisfies
$$\langle K_x, K_t \rangle_W = \sum_{|\alpha| = d} (C^d_\alpha)^{-1} C^d_\alpha x^\alpha \, C^d_\alpha t^\alpha = \sum_{|\alpha| = d} C^d_\alpha x^\alpha t^\alpha = \langle K_x, K_t \rangle_K.$$
We conclude that since the polynomials $K_x$ span all of $H_0$, the inner product in $H_K = H_0$ is the
Weyl inner product.


Hypothesis spaces associated with an RKHS

We now proceed with the last example in this chapter.

Proposition 2.22 Let $K$ be a Mercer kernel on a compact metric space $X$, and $H_K$ its
RKHS. For all $R > 0$, the ball $B_R := \{f \in H_K : \|f\|_K \le R\}$ is a closed subset of $C(X)$.
Proof. Suppose $\{f_n\} \subseteq B_R$ converges in $C(X)$ to a function $f \in C(X)$. Then, for all $x \in X$,
$$f(x) = \lim_{n \to \infty} f_n(x).$$
Since a closed ball of a Hilbert space is weakly compact, we have that $B_R$ is weakly compact.
Therefore, there exists a subsequence $\{f_{n_k}\}_{k \in \mathbb{N}}$ of $\{f_n\}$ and an element $\tilde{f} \in B_R$ such that
$$\lim_{k \to \infty} \langle f_{n_k}, g \rangle_K = \langle \tilde{f}, g \rangle_K, \quad \forall g \in H_K.$$
For each $x \in X$, take $g = K_x$ to obtain
$$\lim_{k \to \infty} f_{n_k}(x) = \lim_{k \to \infty} \langle f_{n_k}, K_x \rangle_K = \langle \tilde{f}, K_x \rangle_K = \tilde{f}(x).$$
But $\lim_{k \to \infty} f_{n_k}(x) = f(x)$, so we have $f(x) = \tilde{f}(x)$ for every point $x \in X$. Hence, as continuous
functions on $X$, $f = \tilde{f}$. Therefore, $f \in B_R$. This shows that $B_R$ is closed as a subset of $C(X)$.
Proposition 2.23 Let $K$ be a Mercer kernel on a compact metric space $X$, and $H_K$ be
its RKHS. For all $R > 0$, the set $I_K(B_R)$ is compact.
Proof. By the Arzelà-Ascoli theorem (Theorem 2.4) it suffices to prove that $B_R$ is
equicontinuous.
Since $X$ is compact, so is $X \times X$. Therefore, since $K$ is continuous on $X \times X$, $K$ must be
uniformly continuous on $X \times X$. It follows that for any $\varepsilon > 0$, there exists $\delta > 0$ such that for
all $x, y, y' \in X$ with $d(y, y') \le \delta$,
$$|K(x, y) - K(x, y')| \le \varepsilon.$$
For $f \in B_R$ and $y, y' \in X$ with $d(y, y') \le \delta$, we have
$$|f(y) - f(y')| = |\langle f, K_y - K_{y'} \rangle_K| \le \|f\|_K \|K_y - K_{y'}\|_K \le R \big( K(y, y) - K(y, y') + K(y', y') - K(y', y) \big)^{1/2} \le R \sqrt{2\varepsilon}.$$
Example 2.24 (Hypothesis spaces associated with an RKHS) Let $X$ be compact and
$K : X \times X \to \mathbb{R}$ be a Mercer kernel. By Proposition 2.23, for all $R > 0$ we may consider $I_K(B_R)$
to be a hypothesis space. Here and in what follows $B_R$ denotes the closed ball of radius $R$
centered on the origin.


Reminders II

The general nonlinear programming problem is the problem of finding $x \in \mathbb{R}^n$ to solve the following minimization problem:
\[
\begin{aligned}
\min\ & f(x) \\
\text{s.t.}\ & g_i(x) \le 0, \quad i = 1, \ldots, m, \\
& h_j(x) = 0, \quad j = 1, \ldots, p,
\end{aligned}
\tag{2.2}
\]
where $f, g_i, h_j : \mathbb{R}^n \to \mathbb{R}$. The function $f$ is called the objective function, and the equalities and inequalities on $g_i$ and $h_j$ are called the constraints. Points $x \in \mathbb{R}^n$ satisfying the constraints are feasible and the subset of $\mathbb{R}^n$ of all feasible points is the feasible set.
Although stating this problem in all its generality leads to some conceptual clarity, it
would seem that the search for an efficient algorithm to solve it is hopeless. A vast amount of
research has thus focused on particular cases and the emphasis has been on those cases for
which efficient algorithms exist. We do not develop here the complexity theory giving formal
substance to the notion of efficiency - we do not need such a development; instead we
content ourselves with understanding the notion of efficiency according to its intuitive
meaning: an efficient algorithm is one that computes its outcome in a reasonably short time
for reasonably long inputs. This property can be found in practice and studied in theory (via
several well-developed measures of complexity available to complexity theorists).
One example of a well-studied case is linear programming. This is the case in which
both the objective function and the constraints are linear. It is also a case in which efficient
algorithms exist (and have been both used in practice and studied in theory). A much more
general case for which efficient algorithms exist is that of convex programming.
A subset $S$ of a linear space $H$ is said to be convex when, for all $x, y \in S$ and all $\lambda \in [0,1]$, $\lambda x + (1 - \lambda)y \in S$.
A function $f$ on a convex domain $S$ is said to be convex if, for all $\lambda \in [0,1]$ and all $x, y \in S$, $f(\lambda x + (1 - \lambda)y) \le \lambda f(x) + (1 - \lambda) f(y)$. If $S$ is an interval on $\mathbb{R}$, then, for $x_0, x \in S$, $x_0 < x$, we have
\[ f(x_0 + \lambda(x - x_0)) = f(\lambda x + (1 - \lambda)x_0) \le \lambda f(x) + (1 - \lambda) f(x_0), \]
that is, for $\lambda \in (0,1]$,
\[ \frac{f(x_0 + \lambda(x - x_0)) - f(x_0)}{\lambda(x - x_0)} \le \frac{f(x) - f(x_0)}{x - x_0}. \]
This means that the function $t \mapsto (f(t) - f(x_0))/(t - x_0)$ is increasing in the interval $(x_0, x]$. Hence, the right derivative
\[ f'_+(x_0) := \lim_{t \to x_0^+} \frac{f(t) - f(x_0)}{t - x_0} \]
exists. In the same way we see that the left derivative
\[ f'_-(x_0) := \lim_{t \to x_0^-} \frac{f(t) - f(x_0)}{t - x_0} \]
exists. These two derivatives, in addition, satisfy $f'_-(x_0) \le f'_+(x_0)$ whenever $x_0$ is a point in the interior of $S$. Hence, both $f'_-$ and $f'_+$ are nondecreasing in $S$.
In addition to those listed above, convex functions satisfy other properties. We highlight the fact that the sum of convex functions is convex and that if a function $f$ is convex and $C^2$ then its Hessian $D^2 f(x)$ at $x$ is positive semidefinite for all $x$ in its domain.
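The two facts used above (the defining inequality and the monotonicity of the difference quotient) can be checked numerically on a grid. The following is only an illustrative sketch: the helper names `difference_quotient` and `satisfies_convexity_on_grid`, the grid, and the test functions are assumptions made for this example, and a finite grid check can refute convexity but never prove it.

```python
def difference_quotient(f, x0, t):
    """Slope of the chord of f between x0 and t (requires t != x0)."""
    return (f(t) - f(x0)) / (t - x0)


def satisfies_convexity_on_grid(f, grid, tol=1e-12):
    """Test f(lam*x + (1-lam)*y) <= lam*f(x) + (1-lam)*f(y) on grid pairs.

    Returns False as soon as one violation is found. A True result only
    says that no violation was found on this particular grid.
    """
    lams = [0.0, 0.25, 0.5, 0.75, 1.0]
    for x in grid:
        for y in grid:
            for lam in lams:
                z = lam * x + (1 - lam) * y
                if f(z) > lam * f(x) + (1 - lam) * f(y) + tol:
                    return False
    return True


def square(t):
    return t * t          # a convex function


def neg_square(t):
    return -t * t         # a concave function, so the check should fail


grid = [i / 4 for i in range(-8, 9)]   # points of [-2, 2] with step 1/4
x0 = 0.5
# Chord slopes from x0 to points approaching x0 from the right, listed
# from the closest point outward; for a convex f they are nondecreasing.
slopes = [difference_quotient(square, x0, x0 + s) for s in (0.125, 0.25, 0.5, 1.0)]
```

For `square` the chord slope from `x0` to `x0 + s` equals `2*x0 + s`, so the list `slopes` is increasing, in line with the existence of the right derivative as the limit of a monotone quotient.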

The convex programming problem is the problem of finding x e Rn to solve (2.2) with f
and gi convex functions and hj linear. As we have remarked, efficient algorithms for the
convex programming problem exist. In particular, when f and the gi are quadratic functions,
the corresponding programming problem, called the convex quadratic programming
problem, can be solved by even more efficient algorithms. In fact, convex quadratic
programs are a particular case of second-order cone programs. And second-order cone
programming today provides an example of the success of interior point methods: very large
amounts of input data can be efficiently dealt with, and commercial code is available. (For
references see Section 2.9).

2.8 On the computation of empirical target functions

A remarkable property of the hypothesis space H = IK(BR), where BR is the ball of radius R in
an RKHS HK, is the fact that the optimization problem of computing the empirical target
function fz reduces to a convex programming problem.
Let $K$ be a Mercer kernel, and $H_K$ its associated RKHS. Let $z \in Z^m$. Denote by $H_{K,z}$ the finite-dimensional subspace of $H_K$ spanned by $\{K_{x_1}, \ldots, K_{x_m}\}$ and let $P$ be the orthogonal projection $P : H_K \to H_{K,z}$.
Proposition 2.25 Let $B \subseteq H_K$. If $f \in H_K$ is a minimizer of $\mathcal{E}_z$ in $B$, then $P(f)$ is a minimizer of $\mathcal{E}_z$ in $P(B)$, the image of $B$ under $P$.
Proof. For all $f \in B$ and all $i = 1, \ldots, m$, $\langle f, K_{x_i}\rangle_K = \langle P(f), K_{x_i}\rangle_K$. Since both $f$ and $P(f)$ are in $H_K$, the reproducing property implies that
\[ f(x_i) = \langle f, K_{x_i}\rangle_K = \langle P(f), K_{x_i}\rangle_K = (P(f))(x_i). \]
It follows that $\mathcal{E}_z(f) = \mathcal{E}_z(P(f))$. Taking $f$ to be a minimizer of $\mathcal{E}_z$ in $B$ proves the statement.
Corollary 2.26 Let $B \subseteq H_K$ be such that $P(B) \subseteq B$. If $\mathcal{E}_z$ can be minimized in $B$ then such a minimizer can be chosen in $P(B)$.

Corollary 2.26 shows that in many situations - for example, when B is convex - the
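empirical target function $f_z$ may be chosen in $H_{K,z}$. Recall from Theorem 2.9 that the norm $\|\cdot\|_K$ restricted to $H_{K,z}$ is given by
\[ \Big\| \sum_{i=1}^m c_i K_{x_i} \Big\|_K^2 = \sum_{i,j=1}^m c_i K(x_i, x_j) c_j = c^T K[x] c. \]
Therefore, when $B = B_R$, we may take $f_z = \sum_{i=1}^m c_i^* K_{x_i}$, where $c^* \in \mathbb{R}^m$ is a solution of the following problem:
\[
\begin{aligned}
\min\ & \frac{1}{m} \sum_{j=1}^m \Big( \sum_{i=1}^m c_i K(x_i, x_j) - y_j \Big)^2 \\
\text{s.t.}\ & c^T K[x] c \le R^2.
\end{aligned}
\]
Note that this is a convex quadratic programming problem and, therefore, can be efficiently solved.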
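As a small illustration of this reduction, the sketch below solves the problem for a handful of points in pure Python. It does not use an interior point method: it relies on the observation that the stationarity condition with multiplier $\gamma \ge 0$ is satisfied by $c(\gamma) = (K[x] + m\gamma I)^{-1} y$, and it bisects on $\gamma$ until $c^T K[x] c \le R^2$. The Gaussian kernel, the data, the parameter values, the helper names, and the assumption that $K[x]$ is positive definite are all choices made for this example only.

```python
import math


def gram_matrix(xs, sigma):
    """Gram matrix K[x] of a Gaussian kernel (an assumed choice of Mercer kernel)."""
    return [[math.exp(-((s - t) ** 2) / (2 * sigma ** 2)) for t in xs] for s in xs]


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = M[i][n] - sum(M[i][k] * x[k] for k in range(i + 1, n))
        x[i] = s / M[i][i]
    return x


def quad_form(K, c):
    """c^T K c, the squared RKHS norm of sum_i c_i K_{x_i}."""
    n = len(c)
    return sum(c[i] * K[i][j] * c[j] for i in range(n) for j in range(n))


def empirical_error(K, c, y):
    """(1/m) sum_j (sum_i c_i K(x_i, x_j) - y_j)^2."""
    m = len(y)
    return sum((sum(c[i] * K[i][j] for i in range(m)) - y[j]) ** 2 for j in range(m)) / m


def coefficients(K, y, gamma):
    """c(gamma) = (K + m*gamma*I)^{-1} y."""
    m = len(y)
    A = [[K[i][j] + (m * gamma if i == j else 0.0) for j in range(m)] for i in range(m)]
    return solve(A, y)


def fit_in_ball(K, y, R):
    """Bisect on gamma so that c(gamma)^T K c(gamma) <= R^2 (norm decreases in gamma)."""
    lo, hi = 1e-10, 1.0
    c = coefficients(K, y, lo)
    if quad_form(K, c) <= R * R:
        return c                      # the (almost) unconstrained fit is feasible
    while quad_form(K, coefficients(K, y, hi)) > R * R:
        hi *= 2.0                     # enlarge gamma until the constraint holds
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if quad_form(K, coefficients(K, y, mid)) > R * R:
            lo = mid
        else:
            hi = mid
    return coefficients(K, y, hi)     # hi always satisfies the constraint


xs = [0.0, 0.25, 0.5, 0.75, 1.0]
ys = [0.0, 1.0, 0.0, -1.0, 0.0]
K = gram_matrix(xs, sigma=0.5)
c_star = fit_in_ball(K, ys, R=1.0)
```

The returned vector `c_star` plays the role of $c^*$ above, so that $f_z = \sum_i c^*_i K_{x_i}$. For problems of realistic size a dedicated convex quadratic or second-order cone solver would be the natural choice.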


References and additional remarks

An exhaustive exposition of the Fourier transform can be found in [120].

For a proof of Plancherel's theorem see section 4.11 of [40]. The Arzelà-Ascoli theorem is proved, for example, in section 11.4 of [70] or section 9 of [2]. Proposition 2.5 is shown in [106] (together with a more difficult converse). For extensions to conditionally positive semidefinite kernels generated by radial basis functions see [86].
The definition of $H^s(X)$ can be extended to $s \in \mathbb{R}$, $s > 0$ (called fractional Sobolev spaces), using a Fourier transform argument [120]. We will do so in Section 5.1. References for Sobolev spaces are [1, 129], and [47] for embedding theorems.
A substantial amount of the theory of RKHSs was surveyed by N. Aronszajn [9]. On page
344 of this reference, Theorem 2.9, in essence, is attributed to E. H. Moore.

The special dot product kernel $K(x, y) = (c + x \cdot y)^d$ for some $c \ge 0$ and $d \in \mathbb{N}$ was
introduced into the field of statistical learning theory by Vapnik (see, e.g., [134]). General dot
product kernels are described in [118]; see also [101] and [79]. Spline kernels are discussed
extensively in [137].
Chapter 14 of [19] is a reference for the unitary and orthogonal invariance of $\langle\ ,\ \rangle_W$. A
reference for the nondegeneracy of the Veronese variety mentioned in Proposition 2.21 is
section 4.4 of [109].
A comprehensive introduction to convex optimization is the book [25]. For second-order
cone programming see the articles [3, 119].
For more families of Mercer kernels in learning theory see [107]. More examples of
box splines can be found in [41]. Reducing the computation of fz from HK to HK,z is
ensured by representer theorems [137]. For a general form of these theorems see

Estimating the sample error


The main result in this chapter provides bounds for the sample error of a compact and convex
hypothesis space. We have already noted that with m fixed, the sample error increases with
the size of H. The bounds we deduce in this chapter show this behavior with respect to a
particular measure for the size of H: its capacity as measured by covering numbers.
Definition 3.1 Let $S$ be a metric space and $\eta > 0$. We define the covering number $N(S, \eta)$ to be the minimal $\ell \in \mathbb{N}$ such that there exist $\ell$ disks in $S$ with radius $\eta$ covering $S$. When $S$ is compact this number is finite.
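For a finite metric space, an upper bound on the covering number can be obtained greedily. The sketch below is only an illustration under assumptions of this example: the function name `greedy_cover`, the use of closed disks, and the sample set are not from the text, and the greedy count bounds $N(S, \eta)$ from above without computing it exactly.

```python
def greedy_cover(points, eta, dist):
    """Greedily choose centers among `points` so that closed disks of
    radius eta around them cover all points.

    The number of centers returned is an upper bound for the covering
    number of the finite metric space (points, dist), not its exact value.
    """
    centers = []
    uncovered = list(points)
    while uncovered:
        center = uncovered[0]          # first point not yet covered
        centers.append(center)
        uncovered = [p for p in uncovered if dist(p, center) > eta]
    return centers


# Eleven equally spaced points of [0, 1] with the usual distance on R.
points = [i / 10 for i in range(11)]
centers = greedy_cover(points, 0.25, lambda a, b: abs(a - b))
```

On this example the greedy procedure uses four disks, while three disks (centered at 0.2, 0.7 and 1.0) already cover the set, which shows that the greedy count can overestimate the covering number.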
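Definition 3.2 Let $M > 0$ and $\rho$ be a probability measure on $Z$. We say that a set $H$ of functions from $X$ to $\mathbb{R}$ is $M$-bounded when
\[ \sup_{f \in H} |f(x) - y| \le M \]
holds almost everywhere on $Z$.

Theorem 3.3 Let $H$ be a compact and convex subset of $C(X)$. If $H$ is $M$-bounded, then, for all $\varepsilon > 0$,
\[ \operatorname{Prob}_{z \in Z^m}\{\mathcal{E}_H(f_z) \le \varepsilon\} \ge 1 - N\Big(H, \frac{\varepsilon}{12M}\Big) \exp\Big\{-\frac{m\varepsilon}{300M^2}\Big\}. \]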


Exponential inequalities in probability

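Write the sample error $\mathcal{E}_H(f_z) = \mathcal{E}(f_z) - \mathcal{E}(f_H)$ as
\[ \mathcal{E}_H(f_z) = \mathcal{E}(f_z) - \mathcal{E}_z(f_z) + \mathcal{E}_z(f_z) - \mathcal{E}_z(f_H) + \mathcal{E}_z(f_H) - \mathcal{E}(f_H). \]
Since $f_z$ minimizes $\mathcal{E}_z$ in $H$, $\mathcal{E}_z(f_z) - \mathcal{E}_z(f_H) \le 0$. Then a bound for $\mathcal{E}_H(f_z)$ follows from bounds for $\mathcal{E}(f_z) - \mathcal{E}_z(f_z)$ and $\mathcal{E}_z(f_H) - \mathcal{E}(f_H)$. For $f : X \to \mathbb{R}$ consider a random variable $\xi$ on $Z$ given by $\xi(z) = (f(x) - y)^2$, where $z = (x,y) \in Z$. Then $\mathcal{E}_z(f) - \mathcal{E}(f) = \frac{1}{m}\sum_{i=1}^m \xi(z_i) - E(\xi) = E_z(\xi) - E(\xi)$. The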
rate of convergence of this quantity is the subject of some well-known inequalities in
probability theory.
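If $\xi$ is a nonnegative random variable and $t > 0$, then $\xi \ge \xi\chi_{\{\xi \ge t\}} \ge t\chi_{\{\xi \ge t\}}$, where $\chi_J$ denotes the characteristic function of $J$. Noting that $\operatorname{Prob}\{\xi \ge t\} = E(\chi_{\{\xi \ge t\}})$, we obtain Markov's inequality,
\[ \operatorname{Prob}\{\xi \ge t\} \le \frac{E(\xi)}{t}. \]
Applying Markov's inequality to $(\xi - E(\xi))^2$ for an arbitrary random variable $\xi$ yields Chebyshev's inequality, for any $t > 0$,
\[ \operatorname{Prob}\{|\xi - E(\xi)| \ge t\} = \operatorname{Prob}\{(\xi - E(\xi))^2 \ge t^2\} \le \frac{\sigma^2(\xi)}{t^2}. \]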




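One particular use of Chebyshev's inequality is for sums of independent random variables. If $\xi$ is a random variable on a probability space $Z$ with mean $E(\xi) = \mu$ and variance $\sigma^2(\xi) = \sigma^2$, then, for all $\varepsilon > 0$,
\[ \operatorname{Prob}_{z \in Z^m}\Big\{ \Big| \frac{1}{m}\sum_{i=1}^m \xi(z_i) - \mu \Big| \ge \varepsilon \Big\} \le \frac{\sigma^2}{m\varepsilon^2}. \]
This inequality provides a simple form of the weak law of large numbers since it shows that when $m \to \infty$, $\frac{1}{m}\sum_{i=1}^m \xi(z_i) \to \mu$ with probability 1.

For any $0 < \delta < 1$ and by taking $\varepsilon = \sqrt{\sigma^2/(m\delta)}$ in the inequality above it follows that with confidence $1 - \delta$,
\[ \Big| \frac{1}{m}\sum_{i=1}^m \xi(z_i) - \mu \Big| \le \sqrt{\frac{\sigma^2}{m\delta}}. \tag{3.1} \]
The goal of this section is to extend inequality (3.1) to show a faster rate of decay. Typical bounds with confidence $1 - \delta$ will be of the form $c(\log(2/\delta)/m)^{\frac{1}{2}+\theta}$ with $0 < \theta < \frac{1}{2}$ depending on the variance of $\xi$. The improvement in the error is seen both in its dependence on $\delta$ - from $2/\delta$ to $\log(2/\delta)$ - and in its dependence on $m$ - from $m^{-\frac{1}{2}}$ to $m^{-(\frac{1}{2}+\theta)}$. Note that $\{\xi_i = \xi(z_i)\}_{i=1}^m$ are independent random variables with the same mean and variance.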
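Proposition 3.4 (Bennett) Let $\{\xi_i\}_{i=1}^m$ be independent random variables on a probability space $Z$ with means $\{\mu_i\}$ and variances $\{\sigma_i^2\}$. Set $\Sigma^2 := \sum_{i=1}^m \sigma_i^2$. If for each $i$, $|\xi_i - \mu_i| \le M$ holds almost everywhere, then for every $\varepsilon > 0$ we have
\[ \operatorname{Prob}\Big\{ \sum_{i=1}^m [\xi_i - \mu_i] > \varepsilon \Big\} \le \exp\Big\{ -\frac{\varepsilon}{M}\Big[ \Big(1 + \frac{\Sigma^2}{M\varepsilon}\Big)\log\Big(1 + \frac{M\varepsilon}{\Sigma^2}\Big) - 1 \Big] \Big\}. \]
Proof. Without loss of generality, we assume $\mu_i = 0$. Then the variance of $\xi_i$ is $\sigma_i^2 = E(\xi_i^2)$.
Let $c$ be an arbitrary positive constant that will be determined later. Then
\[ I := \operatorname{Prob}\Big\{ \sum_{i=1}^m [\xi_i - \mu_i] > \varepsilon \Big\} = \operatorname{Prob}\Big\{ \exp\Big\{ \sum_{i=1}^m c\xi_i \Big\} > e^{c\varepsilon} \Big\}. \]
By Markov's inequality and the independence of $\{\xi_i\}$, we have
\[ I \le e^{-c\varepsilon} E\Big( \exp\Big\{ \sum_{i=1}^m c\xi_i \Big\} \Big) = e^{-c\varepsilon} \prod_{i=1}^m E\big(e^{c\xi_i}\big). \tag{3.2} \]
Since $|\xi_i| \le M$ almost everywhere and $E(\xi_i) = 0$, the Taylor expansion for $e^x$ yields
\[ E\big(e^{c\xi_i}\big) = 1 + \sum_{\ell=2}^{+\infty} \frac{c^\ell E(\xi_i^\ell)}{\ell!} \le 1 + \sum_{\ell=2}^{+\infty} \frac{c^\ell M^{\ell-2}\sigma_i^2}{\ell!}. \]
Using $1 + t \le e^t$, it follows that
\[ E\big(e^{c\xi_i}\big) \le \exp\Big\{ \sum_{\ell=2}^{+\infty} \frac{c^\ell M^{\ell-2}\sigma_i^2}{\ell!} \Big\} = \exp\Big\{ \frac{e^{cM} - 1 - cM}{M^2}\sigma_i^2 \Big\}, \]
and therefore
\[ I \le \exp\Big\{ -c\varepsilon + \frac{e^{cM} - 1 - cM}{M^2}\Sigma^2 \Big\}. \]
Now choose the constant $c$ to be the minimizer of the bound on the right-hand side above:
\[ c = \frac{1}{M}\log\Big(1 + \frac{M\varepsilon}{\Sigma^2}\Big). \]
That is, $e^{cM} - 1 = M\varepsilon/\Sigma^2$. With this choice,
\[ I \le \exp\Big\{ -\frac{\varepsilon}{M}\Big[ \Big(1 + \frac{\Sigma^2}{M\varepsilon}\Big)\log\Big(1 + \frac{M\varepsilon}{\Sigma^2}\Big) - 1 \Big] \Big\}. \]
This proves the desired inequality.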
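Let $g : [0, +\infty) \to \mathbb{R}$ be given by
\[ g(\lambda) := (1 + \lambda)\log(1 + \lambda) - \lambda. \]
Then Bennett's inequality asserts that
\[ \operatorname{Prob}\Big\{ \sum_{i=1}^m [\xi_i - \mu_i] > \varepsilon \Big\} \le \exp\Big\{ -\frac{\Sigma^2}{M^2}\, g\Big(\frac{M\varepsilon}{\Sigma^2}\Big) \Big\}. \tag{3.3} \]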

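Proposition 3.5 Let $\{\xi_i\}_{i=1}^m$ be independent random variables on a probability space $Z$ with means $\{\mu_i\}$ and variances $\{\sigma_i^2\}$ and satisfying $|\xi_i(z) - E(\xi_i)| \le M$ for each $i$ and almost all $z \in Z$. Set $\Sigma^2 := \sum_{i=1}^m \sigma_i^2$. Then for every $\varepsilon > 0$,
(Generalized Bennett's inequality)
\[ \operatorname{Prob}\Big\{ \sum_{i=1}^m [\xi_i - \mu_i] > \varepsilon \Big\} \le \exp\Big\{ -\frac{\varepsilon}{2M}\log\Big(1 + \frac{M\varepsilon}{\Sigma^2}\Big) \Big\}. \]
(Bernstein)
\[ \operatorname{Prob}\Big\{ \sum_{i=1}^m [\xi_i - \mu_i] > \varepsilon \Big\} \le \exp\Big\{ -\frac{\varepsilon^2}{2\big(\Sigma^2 + \frac{1}{3}M\varepsilon\big)} \Big\}. \]
(Hoeffding)
\[ \operatorname{Prob}\Big\{ \sum_{i=1}^m [\xi_i - \mu_i] > \varepsilon \Big\} \le \exp\Big\{ -\frac{\varepsilon^2}{2mM^2} \Big\}. \]
Proof. The first inequality follows from (3.3) and the inequality
\[ g(\lambda) \ge \frac{\lambda}{2}\log(1 + \lambda), \quad \forall \lambda \ge 0. \tag{3.4} \]
To verify (3.4), define a $C^2$ function $f$ on $[0, \infty)$ by
\[ f(\lambda) := 2\log(1 + \lambda) - 2\lambda + \lambda\log(1 + \lambda). \]
We can see that $f(0) = 0$, $f'(0) = 0$, and $f''(\lambda) = \lambda(1 + \lambda)^{-2} \ge 0$ for $\lambda \ge 0$. Hence $f(\lambda) \ge 0$ and
\[ \log(1 + \lambda) - \lambda \ge -\frac{1}{2}\lambda\log(1 + \lambda), \quad \forall \lambda \ge 0. \]
It follows that
\[ g(\lambda) = \lambda\log(1 + \lambda) + \log(1 + \lambda) - \lambda \ge \frac{\lambda}{2}\log(1 + \lambda), \quad \forall \lambda \ge 0. \]
This verifies (3.4) and then the generalized Bennett's inequality.
Since $g(\lambda) \ge 0$, we find that the function $h$ defined on $[0, \infty)$ by $h(\lambda) = (6 + 2\lambda)g(\lambda) - 3\lambda^2$ satisfies similar conditions: $h(0) = h'(0) = 0$, and $h''(\lambda) = (4/(1 + \lambda))g(\lambda) \ge 0$. Hence $h(\lambda) \ge 0$ for $\lambda \ge 0$ and
\[ g(\lambda) \ge \frac{3\lambda^2}{6 + 2\lambda}. \]
Applying this to (3.3), we get the proof of Bernstein's inequality.
To prove Hoeffding's inequality, we follow the proof of Proposition 3.4 and use (3.2). As the exponential function is convex and $-M \le \xi_i \le M$ almost surely,
\[ e^{c\xi_i} \le \frac{c\xi_i - (-cM)}{2cM}\, e^{cM} + \frac{cM - c\xi_i}{2cM}\, e^{-cM} \]
holds almost everywhere. It follows from $E(\xi_i) = 0$ and the Taylor expansion for $e^x$ that
\[ E\big(e^{c\xi_i}\big) \le \frac{1}{2}e^{-cM} + \frac{1}{2}e^{cM} = \sum_{j=0}^{\infty} \frac{(cM)^{2j}}{(2j)!} \le \sum_{j=0}^{\infty} \frac{((cM)^2/2)^j}{j!} = \exp\Big\{ \frac{(cM)^2}{2} \Big\}. \]
This, together with (3.2), implies that $I \le \exp\{-c\varepsilon + m(cM)^2/2\}$. Choose $c = \varepsilon/(mM^2)$. Then $I \le \exp\{-\varepsilon^2/(2mM^2)\}$.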

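Bounds for the distance between empirical mean and expected value follow from Proposition 3.5.
Corollary 3.6 Let $\xi$ be a random variable on a probability space $Z$ with mean $E(\xi) = \mu$ and variance $\sigma^2(\xi) = \sigma^2$, and satisfying $|\xi(z) - E(\xi)| \le M$ for almost all $z \in Z$. Then for all $\varepsilon > 0$,
(Generalized Bennett)
\[ \operatorname{Prob}_{z \in Z^m}\Big\{ \frac{1}{m}\sum_{i=1}^m \xi(z_i) - \mu \ge \varepsilon \Big\} \le \exp\Big\{ -\frac{m\varepsilon}{2M}\log\Big(1 + \frac{M\varepsilon}{\sigma^2}\Big) \Big\}. \]
(Bernstein)
\[ \operatorname{Prob}_{z \in Z^m}\Big\{ \frac{1}{m}\sum_{i=1}^m \xi(z_i) - \mu \ge \varepsilon \Big\} \le \exp\Big\{ -\frac{m\varepsilon^2}{2\big(\sigma^2 + \frac{1}{3}M\varepsilon\big)} \Big\}. \]
(Hoeffding)
\[ \operatorname{Prob}_{z \in Z^m}\Big\{ \frac{1}{m}\sum_{i=1}^m \xi(z_i) - \mu \ge \varepsilon \Big\} \le \exp\Big\{ -\frac{m\varepsilon^2}{2M^2} \Big\}. \]
Proof. Apply Proposition 3.5 to the random variables $\{\xi_i = \xi(z_i)/m\}$ that satisfy $|\xi_i - E(\xi_i)| \le M/m$, $\sigma^2(\xi_i) = \sigma^2/m^2$, and $\sum_{i=1}^m \sigma_i^2 = \sigma^2/m$.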
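To get a feeling for how the three bounds of Corollary 3.6 compare with Bennett's bound (3.3) after the same rescaling, one can evaluate them numerically. The following is a sketch only: the function names and the parameter values tried are assumptions of this example, and `bennett` is (3.3) with $M$ replaced by $M/m$ and $\Sigma^2$ by $\sigma^2/m$.

```python
import math


def g(lam):
    """g(lambda) = (1 + lambda) log(1 + lambda) - lambda."""
    return (1.0 + lam) * math.log1p(lam) - lam


def bennett(m, eps, M, var):
    """Bound (3.3) for the empirical mean of m samples."""
    return math.exp(-(m * var / M ** 2) * g(M * eps / var))


def generalized_bennett(m, eps, M, var):
    return math.exp(-(m * eps / (2.0 * M)) * math.log1p(M * eps / var))


def bernstein(m, eps, M, var):
    return math.exp(-m * eps ** 2 / (2.0 * (var + M * eps / 3.0)))


def hoeffding(m, eps, M):
    return math.exp(-m * eps ** 2 / (2.0 * M ** 2))


def compare(m, eps, M, var):
    """Return the four bounds for one choice of parameters."""
    return {
        "bennett": bennett(m, eps, M, var),
        "generalized_bennett": generalized_bennett(m, eps, M, var),
        "bernstein": bernstein(m, eps, M, var),
        "hoeffding": hoeffding(m, eps, M),
    }


bounds = compare(m=200, eps=0.1, M=1.0, var=0.05)
```

Because $g(\lambda) \ge \frac{\lambda}{2}\log(1+\lambda)$ and $g(\lambda) \ge 3\lambda^2/(6+2\lambda)$, the value of `bennett` never exceeds those of `generalized_bennett` and `bernstein`. Hoeffding's bound ignores the variance, so it is typically weaker than the other bounds when $\sigma^2$ is small compared with $M^2$.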

Remark 3.7 Each estimate given in Corollary 3.6 is said to be a one-side probability inequality. The same bound holds true when "$\ge \varepsilon$" is replaced by "$\le -\varepsilon$". By taking the union of these two events we obtain a two-side probability inequality stating that $\operatorname{Prob}_{z \in Z^m}\big\{ \big| \frac{1}{m}\sum_{i=1}^m \xi(z_i) - \mu \big| \ge \varepsilon \big\}$ is bounded by twice the bound occurring in the corresponding one-side inequality.
Recall the definition of the defect function $L_z(f) = \mathcal{E}(f) - \mathcal{E}_z(f)$. Our first main result, Theorem 3.8, states a bound for $\operatorname{Prob}\{L_z(f) \ge -\varepsilon\}$ for a single function $f : X \to Y$. This bound follows from Hoeffding's bound in Corollary 3.6 by taking $\xi = -(f(x) - y)^2$ satisfying $|\xi| \le M^2$ when $f$ is $M$-bounded.
Theorem 3.8 Let $M > 0$ and $f : X \to Y$ be $M$-bounded. Then, for all $\varepsilon > 0$,
\[ \operatorname{Prob}_{z \in Z^m}\{L_z(f) \ge -\varepsilon\} \ge 1 - \exp\Big\{ -\frac{m\varepsilon^2}{2M^4} \Big\}. \]
Remark 3.9
(i) Note that the confidence (i.e., the right-hand side in the inequality above) is positive and approaches 1 exponentially quickly with $m$.
(ii) A case implying the $M$-boundedness for $f$ is the following. Define
\[ M_\rho = \inf\big\{ \bar{M} \ge 0 \mid \{(x,y) \in Z : |y - f_\rho(x)| \ge \bar{M}\} \text{ has measure zero} \big\}. \]
Then take $M = P + M_\rho$, where $P \ge \|f - f_\rho\|_\infty = \sup_{x \in X} |f(x) - f_\rho(x)|$.
(iii) It follows from Theorem 3.8 that for any $0 < \delta < 1$, $\mathcal{E}_z(f) - \mathcal{E}(f) \le M^2\sqrt{2\log(1/\delta)/m}$ with confidence $1 - \delta$.
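The passage from Theorem 3.8 to the confidence statement amounts to setting $\delta = \exp\{-m\varepsilon^2/(2M^4)\}$ and solving for $\varepsilon$. A minimal numerical sketch follows. The function names and the sample values are assumptions of this example.

```python
import math


def failure_probability(m, M, eps):
    """exp(-m eps^2 / (2 M^4)), the term subtracted from 1 in Theorem 3.8."""
    return math.exp(-m * eps ** 2 / (2.0 * M ** 4))


def deviation_at_confidence(m, M, delta):
    """eps = M^2 sqrt(2 log(1/delta) / m), so that failure_probability equals delta."""
    return M ** 2 * math.sqrt(2.0 * math.log(1.0 / delta) / m)


m, M, delta = 1000, 1.0, 0.05
eps = deviation_at_confidence(m, M, delta)
recovered_delta = failure_probability(m, M, eps)
```

Here `recovered_delta` agrees with `delta` up to rounding, and quadrupling `m` halves the deviation, reflecting the $m^{-1/2}$ rate.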


Uniform estimates on the defect

The second main result in this chapter extends Theorem 3.8 to families of functions.
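Theorem 3.10 Let $H$ be a compact $M$-bounded subset of $C(X)$. Then, for all $\varepsilon > 0$,
\[ \operatorname{Prob}_{z \in Z^m}\Big\{ \sup_{f \in H} L_z(f) \le \varepsilon \Big\} \ge 1 - N\Big(H, \frac{\varepsilon}{8M}\Big)\exp\Big\{ -\frac{m\varepsilon^2}{8M^4} \Big\}. \]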
Notice the resemblance to Theorem 3.8. The only essential difference is in the covering
number, which takes into account the extension from a single f to the family H. This has the
effect of requiring the sample size m to increase accordingly to achieve the confidence level
of Theorem 3.8.
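Lemma 3.11 Let $H = S_1 \cup \ldots \cup S_\ell$ and $\varepsilon > 0$. Then
\[ \operatorname{Prob}_{z \in Z^m}\Big\{ \sup_{f \in H} L_z(f) \ge \varepsilon \Big\} \le \sum_{j=1}^{\ell} \operatorname{Prob}_{z \in Z^m}\Big\{ \sup_{f \in S_j} L_z(f) \ge \varepsilon \Big\}. \]
Proof. The proof follows from the equivalence
\[ \sup_{f \in H} L_z(f) \ge \varepsilon \iff \exists j \le \ell \ \text{s.t.} \ \sup_{f \in S_j} L_z(f) \ge \varepsilon \]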
and the fact that the probability of a union of events is bounded by the sum of the
probabilities of those events.

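Proof of Theorem 3.10 Let $\ell = N(H, \frac{\varepsilon}{4M})$ and consider $f_1, \ldots, f_\ell$ such that the disks $D_j$ centered at $f_j$ and with radius $\frac{\varepsilon}{4M}$ cover $H$. Let $U$ be a full measure set on which $\sup_{f \in H} |f(x) - y| \le M$. By Proposition 1.11, for all $z \in U^m$ and all $f \in D_j$,
\[ |L_z(f) - L_z(f_j)| \le 4M\|f - f_j\|_\infty \le 4M\frac{\varepsilon}{4M} = \varepsilon. \]
Since this holds for all $z \in U^m$ and all $f \in D_j$, we get
\[ \sup_{f \in D_j} L_z(f) \ge 2\varepsilon \implies L_z(f_j) \ge \varepsilon. \]
We conclude that for $j = 1, \ldots, \ell$,
\[ \operatorname{Prob}_{z \in Z^m}\Big\{ \sup_{f \in D_j} L_z(f) \ge 2\varepsilon \Big\} \le \operatorname{Prob}_{z \in Z^m}\{L_z(f_j) \ge \varepsilon\} \le \exp\Big\{ -\frac{m\varepsilon^2}{2M^4} \Big\}, \]
where the last inequality follows from Hoeffding's bound in Corollary 3.6 for $\xi = -(f(x) - y)^2$ on $Z$. The statement now follows from Lemma 3.11 by replacing $\varepsilon$ by $\frac{\varepsilon}{2}$.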

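Remark 3.12 Hoeffding's inequality can be seen as a quantitative instance of the law of large numbers. An abstract uniform version of this law can be extracted from the proof of Theorem 3.10.
Proposition 3.13 Let $F$ be a family of functions from a probability space $Z$ to $\mathbb{R}$ and $d$ a metric on $F$. Let $U \subset Z$ be of full measure and $B, L > 0$ such that
(i) $|\xi(z)| \le B$ for all $\xi \in F$ and all $z \in U$, and
(ii) $|L_z(\xi_1) - L_z(\xi_2)| \le L\, d(\xi_1, \xi_2)$ for all $\xi_1, \xi_2 \in F$ and all $z \in U^m$, where
\[ L_z(\xi) = \int_Z \xi(z) - \frac{1}{m}\sum_{i=1}^m \xi(z_i). \]
Then, for all $\varepsilon > 0$,
\[ \operatorname{Prob}_{z \in Z^m}\Big\{ \sup_{\xi \in F} |L_z(\xi)| \le \varepsilon \Big\} \ge 1 - N\Big(F, \frac{\varepsilon}{2L}\Big)\, 2\exp\Big\{ -\frac{m\varepsilon^2}{8B^2} \Big\}. \]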


Estimating the sample error

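How good an approximation of $f_H$ can we expect $f_z$ to be? In other words, how small can we expect the sample error $\mathcal{E}_H(f_z)$ to be? The third main result in this chapter, Theorem 3.14, gives the first answer.
Theorem 3.14 Let $H$ be a compact $M$-bounded subset of $C(X)$. Then, for all $\varepsilon > 0$,
\[ \operatorname{Prob}_{z \in Z^m}\{\mathcal{E}_H(f_z) \le \varepsilon\} \ge 1 - \Big( N\Big(H, \frac{\varepsilon}{16M}\Big) + 1 \Big)\exp\Big\{ -\frac{m\varepsilon^2}{32M^4} \Big\}. \]
Proof. Recall that $\mathcal{E}_H(f_z) \le \mathcal{E}(f_z) - \mathcal{E}_z(f_z) + \mathcal{E}_z(f_H) - \mathcal{E}(f_H)$.
By Theorem 3.8 applied to the single function $f_H$ and $\frac{\varepsilon}{2}$, we know that $\mathcal{E}_z(f_H) - \mathcal{E}(f_H) \le \frac{\varepsilon}{2}$ with probability at least $1 - \exp\{-\frac{m\varepsilon^2}{8M^4}\}$. On the other hand, Theorem 3.10 with $\varepsilon$ replaced by $\frac{\varepsilon}{2}$ tells us that with probability at least $1 - N(H, \frac{\varepsilon}{16M})\exp\{-\frac{m\varepsilon^2}{32M^4}\}$,
\[ \sup_{f \in H} L_z(f) = \sup_{f \in H}\{\mathcal{E}(f) - \mathcal{E}_z(f)\} \le \frac{\varepsilon}{2} \]
holds, which implies in particular that $\mathcal{E}(f_z) - \mathcal{E}_z(f_z) \le \frac{\varepsilon}{2}$. Combining these two bounds, we know that $\mathcal{E}_H(f_z) \le \varepsilon$ with probability at least
\[ 1 - N\Big(H, \frac{\varepsilon}{16M}\Big)\exp\Big\{ -\frac{m\varepsilon^2}{32M^4} \Big\} - \exp\Big\{ -\frac{m\varepsilon^2}{8M^4} \Big\} \ge 1 - \Big( N\Big(H, \frac{\varepsilon}{16M}\Big) + 1 \Big)\exp\Big\{ -\frac{m\varepsilon^2}{32M^4} \Big\}. \]
This is the desired bound.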

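Remark 3.15 Theorem 3.14 helps us deal with the question posed in Section 1.3. Given $\varepsilon, \delta > 0$, to ensure that
\[ \operatorname{Prob}_{z \in Z^m}\{\mathcal{E}_H(f_z) \le \varepsilon\} \ge 1 - \delta, \]
it is sufficient that the number $m$ of examples satisfies
\[ m \ge \frac{32M^4}{\varepsilon^2}\Big\{ \ln\Big(1 + N\Big(H, \frac{\varepsilon}{16M}\Big)\Big) + \ln\Big(\frac{1}{\delta}\Big) \Big\}. \tag{3.5} \]
To prove this, take $\delta = \{N(H, \frac{\varepsilon}{16M}) + 1\}\exp\{-\frac{m\varepsilon^2}{32M^4}\}$ and solve for $m$. Note, furthermore, that (3.5) gives a relation between the three basic variables $\varepsilon$, $\delta$, and $m$.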
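The relation (3.5) is easy to evaluate once a value (or an upper bound) for the covering number is available. In the sketch below the covering number is simply passed in as a number. The function names and the sample values are assumptions of this example, not quantities taken from the text.

```python
import math


def sample_size(eps, delta, M, covering_number):
    """Smallest integer m satisfying (3.5) for the given covering number."""
    bound = (32.0 * M ** 4 / eps ** 2) * (
        math.log(1.0 + covering_number) + math.log(1.0 / delta)
    )
    return math.ceil(bound)


def failure_bound(m, eps, M, covering_number):
    """(N + 1) exp(-m eps^2 / (32 M^4)), the term subtracted from 1 in Theorem 3.14."""
    return (covering_number + 1.0) * math.exp(-m * eps ** 2 / (32.0 * M ** 4))


eps, delta, M, N_cover = 0.1, 0.05, 1.0, 1000.0   # hypothetical values
m_needed = sample_size(eps, delta, M, N_cover)
```

With `m_needed` examples, `failure_bound(m_needed, eps, M, N_cover)` is at most `delta`. Note the quadratic dependence on $1/\varepsilon$ and the merely logarithmic dependence on the covering number and on $1/\delta$.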


Convex hypothesis spaces

The dependency on $\varepsilon$ in Theorem 3.14 is quadratic. Our next goal is to show that when the hypothesis space $H$ is convex, this dependency is linear. This is Theorem 3.3. Its Corollary 3.17 estimates directly $\|f_z - f_H\|_\rho$ as well.
Toward the proof of Theorem 3.3, we show an additional property of convex hypothesis spaces. From the discussion in Section 1.3 it follows that for a convex $H$, there exists a function $f_H$ in $H$ whose distance in $L^2$ to $f_\rho$ is minimal. We next prove that if $H$ is convex and $\rho_X$ is nondegenerate, then $f_H$ is unique.
Lemma 3.16 Let $H$ be a convex subset of $C(X)$ such that $f_H$ exists. Then $f_H$ is unique as an element in $L^2$ and, for all $f \in H$,
\[ \int_X (f_H - f)^2 \le \mathcal{E}_H(f). \]
In particular, if $\rho_X$ is not degenerate, then $f_H$ is unique in $H$.
Proof. Let $s = \overline{f_H f}$ be the segment of line with extremities $f_H$ and $f$. Since $H$ is convex, $s \subset H$. And, since $f_H$ minimizes the distance in $L^2$ to $f_\rho$ over $H$, we have that for all $g \in s$, $\|f_H - f_\rho\|_\rho \le \|g - f_\rho\|_\rho$. This means that for each $t \in [0,1]$,
\[ \|f_H - f_\rho\|_\rho^2 \le \|tf + (1 - t)f_H - f_\rho\|_\rho^2 = \|f_H - f_\rho\|_\rho^2 + 2t\langle f - f_H, f_H - f_\rho\rangle_\rho + t^2\|f - f_H\|_\rho^2. \]
By taking $t$ to be small enough, we see that $\langle f - f_H, f_H - f_\rho\rangle_\rho \ge 0$. That is, the angle $\widehat{f_\rho f_H f}$ is obtuse, which implies (note that the squares are crucial)
\[ \|f_H - f\|_\rho^2 \le \|f - f_\rho\|_\rho^2 - \|f_H - f_\rho\|_\rho^2; \]
that is,
\[ \int_X (f_H - f)^2 \le \mathcal{E}(f) - \mathcal{E}(f_H) = \mathcal{E}_H(f). \]
This proves the desired inequality. The uniqueness of $f_H$ follows by considering the line segment joining two minimizers $f_H'$ and $f_H''$. Reasoning as above, one can show that both angles $\widehat{f_\rho f_H' f_H''}$ and $\widehat{f_\rho f_H'' f_H'}$ are obtuse. This is possible only if $f_H' = f_H''$.

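Corollary 3.17 With the hypotheses of Theorem 3.3, for all $\varepsilon > 0$,
\[ \operatorname{Prob}_{z \in Z^m}\Big\{ \int_X (f_z - f_H)^2 \le \varepsilon \Big\} \ge 1 - N\Big(H, \frac{\varepsilon}{12M}\Big)\exp\Big\{ -\frac{m\varepsilon}{300M^2} \Big\}. \]

Now, in addition to convexity and $M$-boundedness, assume that $H$ is a compact subset of $C(X)$, so that the covering numbers $N(H, \eta)$ make sense and are finite. The main stepping stone toward the proof of Theorem 3.3 is a ratio probability inequality.
We first give such an inequality for a single random variable.
Lemma 3.18 Suppose a random variable $\xi$ on $Z$ satisfies $E(\xi) = \mu \ge 0$, and $|\xi - \mu| \le B$ almost everywhere. If $E(\xi^2) \le cE(\xi)$, then, for every $\varepsilon > 0$ and $0 < \alpha \le 1$,
\[ \operatorname{Prob}_{z \in Z^m}\Big\{ \frac{\mu - \frac{1}{m}\sum_{i=1}^m \xi(z_i)}{\sqrt{\mu + \varepsilon}} > \alpha\sqrt{\varepsilon} \Big\} \le \exp\Big\{ -\frac{\alpha^2 m\varepsilon}{2c + \frac{2}{3}B} \Big\}. \]
Proof. Since $\xi$ satisfies $|\xi - \mu| \le B$, the one-side Bernstein inequality in Corollary 3.6 implies that
\[ \operatorname{Prob}_{z \in Z^m}\Big\{ \frac{\mu - \frac{1}{m}\sum_{i=1}^m \xi(z_i)}{\sqrt{\mu + \varepsilon}} > \alpha\sqrt{\varepsilon} \Big\} \le \exp\Big\{ -\frac{\alpha^2 m(\mu + \varepsilon)\varepsilon}{2\big(\sigma^2(\xi) + \frac{1}{3}B\alpha\sqrt{\mu + \varepsilon}\sqrt{\varepsilon}\big)} \Big\}. \]
Here $\sigma^2(\xi) \le E(\xi^2) \le cE(\xi) = c\mu$. Then we find that
\[ \sigma^2(\xi) + \frac{B}{3}\alpha\sqrt{\mu + \varepsilon}\sqrt{\varepsilon} \le c\mu + \frac{B}{3}(\mu + \varepsilon) \le \Big(c + \frac{B}{3}\Big)(\mu + \varepsilon). \]
This yields the desired inequality.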
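We next give a ratio probability inequality involving a set of functions.
Lemma 3.19 Let $G$ be a set of functions on $Z$ and $c > 0$ such that for each $g \in G$, $E(g) \ge 0$, $E(g^2) \le cE(g)$, and $|g - E(g)| \le B$ almost everywhere. Then, for every $\varepsilon > 0$ and $0 < \alpha \le 1$, we have
\[ \operatorname{Prob}_{z \in Z^m}\Big\{ \sup_{g \in G} \frac{E(g) - E_z(g)}{\sqrt{E(g) + \varepsilon}} > 4\alpha\sqrt{\varepsilon} \Big\} \le N(G, \alpha\varepsilon)\exp\Big\{ -\frac{\alpha^2 m\varepsilon}{2c + \frac{2}{3}B} \Big\}. \]
Proof. Let $\{g_j\}_{j=1}^J \subset G$ with $J = N(G, \alpha\varepsilon)$ be such that $G$ is covered by balls in $C(Z)$ centered on $g_j$ with radius $\alpha\varepsilon$.
Applying Lemma 3.18 to $\xi = g_j$ for each $j$, we have
\[ \operatorname{Prob}_{z \in Z^m}\Big\{ \frac{E(g_j) - E_z(g_j)}{\sqrt{E(g_j) + \varepsilon}} > \alpha\sqrt{\varepsilon} \Big\} \le \exp\Big\{ -\frac{\alpha^2 m\varepsilon}{2c + \frac{2}{3}B} \Big\}. \]
For each $g \in G$, there is some $j$ such that $\|g - g_j\|_{C(Z)} \le \alpha\varepsilon$. Then $|E_z(g) - E_z(g_j)|$ and $|E(g) - E(g_j)|$ are both bounded by $\alpha\varepsilon$. Hence
\[ \frac{|E_z(g) - E_z(g_j)|}{\sqrt{E(g) + \varepsilon}} \le \alpha\sqrt{\varepsilon} \quad\text{and}\quad \frac{|E(g) - E(g_j)|}{\sqrt{E(g) + \varepsilon}} \le \alpha\sqrt{\varepsilon}. \]
The latter implies that
\[ E(g_j) + \varepsilon = E(g_j) - E(g) + E(g) + \varepsilon \le \alpha\sqrt{\varepsilon}\sqrt{E(g) + \varepsilon} + (E(g) + \varepsilon) \le \sqrt{\varepsilon}\sqrt{E(g) + \varepsilon} + (E(g) + \varepsilon) \le 2(E(g) + \varepsilon). \]
It follows that $\sqrt{E(g_j) + \varepsilon} \le 2\sqrt{E(g) + \varepsilon}$. We have thus seen that $(E(g) - E_z(g))/\sqrt{E(g) + \varepsilon} > 4\alpha\sqrt{\varepsilon}$ implies $(E(g_j) - E_z(g_j))/\sqrt{E(g) + \varepsilon} > 2\alpha\sqrt{\varepsilon}$ and hence $(E(g_j) - E_z(g_j))/\sqrt{E(g_j) + \varepsilon} > \alpha\sqrt{\varepsilon}$. Therefore,
\[ \operatorname{Prob}_{z \in Z^m}\Big\{ \sup_{g \in G} \frac{E(g) - E_z(g)}{\sqrt{E(g) + \varepsilon}} > 4\alpha\sqrt{\varepsilon} \Big\} \le \sum_{j=1}^J \operatorname{Prob}_{z \in Z^m}\Big\{ \frac{E(g_j) - E_z(g_j)}{\sqrt{E(g_j) + \varepsilon}} > \alpha\sqrt{\varepsilon} \Big\}, \]
which is bounded by $J\exp\{-\alpha^2 m\varepsilon/(2c + \frac{2}{3}B)\}$.
We are in a position to prove Theorem 3.3.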
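Proof of Theorem 3.3 Consider the function set
\[ G = \{(f(x) - y)^2 - (f_H(x) - y)^2 : f \in H\}. \]
Each function $g$ in $G$ satisfies $E(g) = \mathcal{E}_H(f) \ge 0$. Since $H$ is $M$-bounded, we have $-M^2 \le g(z) \le M^2$ almost everywhere. It follows that $|g - E(g)| \le B := 2M^2$ almost everywhere. Observe that
\[ g(z) = (f(x) - f_H(x))[(f(x) - y) + (f_H(x) - y)], \quad z = (x,y) \in Z. \]
It follows that $|g(z)| \le 2M|f(x) - f_H(x)|$ and $E(g^2) \le 4M^2\int_X (f - f_H)^2$. Taken together with Lemma 3.16, this implies $E(g^2) \le 4M^2\mathcal{E}_H(f) = cE(g)$ with $c = 4M^2$. Thus, all the conditions in Lemma 3.19 hold true and we can draw the following conclusion from the identity $E_z(g) = \mathcal{E}_{H,z}(f)$: for every $\varepsilon > 0$ and $0 < \alpha \le 1$, with probability at least
\[ 1 - N(G, \alpha\varepsilon)\exp\Big\{ -\frac{\alpha^2 m\varepsilon}{8M^2 + \frac{2}{3}2M^2} \Big\}, \]
\[ \sup_{f \in H} \frac{\mathcal{E}_H(f) - \mathcal{E}_{H,z}(f)}{\sqrt{\mathcal{E}_H(f) + \varepsilon}} \le 4\alpha\sqrt{\varepsilon} \]
holds, and, therefore, for all $f \in H$, $\mathcal{E}_H(f) \le \mathcal{E}_{H,z}(f) + 4\alpha\sqrt{\varepsilon}\sqrt{\mathcal{E}_H(f) + \varepsilon}$. Take $\alpha = \sqrt{2}/8$ and $f = f_z$. Since $\mathcal{E}_{H,z}(f_z) \le 0$ by definition of $f_z$, we have
\[ \mathcal{E}_H(f_z) \le \sqrt{\varepsilon/2}\sqrt{\mathcal{E}_H(f_z) + \varepsilon}. \]
Solving the quadratic equation about $\sqrt{\mathcal{E}_H(f_z)}$, we have $\mathcal{E}_H(f_z) \le \varepsilon$. Indeed, writing $u = \mathcal{E}_H(f_z) \ge 0$ and squaring gives $2u^2 - \varepsilon u - \varepsilon^2 \le 0$, that is, $(2u + \varepsilon)(u - \varepsilon) \le 0$, which forces $u \le \varepsilon$.
Finally, by the inequality
\[ \|g_1 - g_2\|_{C(Z)} = \|(f_1(x) - f_2(x))[(f_1(x) - y) + (f_2(x) - y)]\|_{C(Z)} \le 2M\|f_1 - f_2\|_{C(X)}, \]
it follows that
\[ N(G, \alpha\varepsilon) \le N\Big(H, \frac{\alpha\varepsilon}{2M}\Big). \]
The desired inequality now follows by taking $\alpha = \sqrt{2}/8$.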

Remark 3.20 Note that to obtain Theorem 3.3, convexity was only used in the proof of Lemma 3.16. But the inequality proved in this lemma may hold true in other situations as well. A case that stands out is when $f_\rho \in H$. In this case $f_H = f_\rho$ and the inequality in Lemma 3.16 is trivial.


References and additional remarks

The exposition in this chapter largely follows [39]. The derivation of Theorem 3.3 deviates
from that paper - hence the slightly different constants.
The probability inequalities given in Section 3.1 are standard in the literature on the law of large numbers or central limit theorems (e.g., [21, 103, 132]).

There is a vast literature on further extensions of the inequalities in Section 3.2 stated in
terms of empirical covering numbers and other capacity measures [13, 71] (called
concentration inequalities) that is outside the scope of this book. We mention the
McDiarmid inequality [83], the Talagrand inequality [126,22], and probability inequalities in
Banach spaces [99].
The inequalities in Section 3.4 are improvements of those in the Vapnik-Chervonenkis theory [135]. In particular, Lemmas 3.18 and 3.19 are a covering number version of an inequality given by Anthony and Shawe-Taylor [8]. The convexity of the hypothesis space
plays a central role in improving the sample error bounds, as in Theorem 3.3. This can be
seen in [10, 12, 74].
A natural question about the sample error is whether upper bounds such as those in
Theorem 3.3 are tight. In this regard, lower bounds called minimax rates of convergence
can be obtained (see, e.g., [148]).
The ideas described in this chapter can be developed for a more general class of learning
algorithms known as empirical risk minimization (ERM) or structural risk minimization
algorithms [17, 60,110]. We devote the remainder of this section to briefly describe some
aspects of this development.
The greater generality of ERM comes from the fact that algorithms in this class minimize empirical errors with respect to a loss function $\psi : \mathbb{R} \to \mathbb{R}_+$. The loss function measures how the sample value $y$ approximates the function value $f(x)$ by evaluating $\psi(y - f(x))$.
Definition 3.21 We say that $\psi : \mathbb{R} \to \mathbb{R}_+$ is a regression loss function if it is even, convex, and continuous and $\psi(0) = 0$.
For $(x, y) \in Z$, the value $\psi(y - f(x))$ is the local error suffered from the use of $f$ as a model for the process producing $y$ at $x$. The condition $\psi(0) = 0$ ensures a zero error when $y = f(x)$. Examples of regression loss functions include the least squares loss and Vapnik's $\varepsilon$-insensitive norm.
Example 3.22 The least squares loss corresponds to the loss function $\psi(t) = t^2$. For $\varepsilon > 0$, the $\varepsilon$-insensitive norm is the loss function defined by
\[ \psi(t) = \psi_\varepsilon(t) = \begin{cases} |t| - \varepsilon & \text{if } |t| > \varepsilon, \\ 0 & \text{otherwise.} \end{cases} \]
Given a regression loss function $\psi$ one defines its associated generalization error by
\[ \mathcal{E}^\psi(f) = \int_Z \psi(y - f(x))\, d\rho \]
and, given $z \in Z^m$ as well, its associated empirical error by
\[ \mathcal{E}_z^\psi(f) = \frac{1}{m}\sum_{i=1}^m \psi(y_i - f(x_i)). \]
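The two loss functions of Example 3.22 and the empirical error they induce can be written down directly. The sketch below is only an illustration: the function names, the toy sample, and the model `identity` are assumptions of this example.

```python
def least_squares_loss(t):
    """psi(t) = t^2."""
    return t * t


def eps_insensitive_loss(t, eps):
    """psi_eps(t) = |t| - eps if |t| > eps, and 0 otherwise."""
    return max(abs(t) - eps, 0.0)


def empirical_error(loss, f, sample):
    """(1/m) * sum of loss(y_i - f(x_i)) over the sample z = ((x_i, y_i))."""
    return sum(loss(y - f(x)) for x, y in sample) / len(sample)


def identity(x):
    return x                      # hypothetical model f(x) = x


# Toy sample with residuals y - f(x) equal to 0.0625, 0.5 and -1.0.
sample = [(0.0, 0.0625), (1.0, 1.5), (2.0, 1.0)]

err_ls = empirical_error(least_squares_loss, identity, sample)
err_eps = empirical_error(lambda t: eps_insensitive_loss(t, 0.125), identity, sample)
```

With $\varepsilon = 0.125$ the first residual falls inside the insensitive zone and contributes nothing to `err_eps`, while the least squares loss penalizes every nonzero residual. Both losses are even, convex, continuous, and vanish at 0, as Definition 3.21 requires.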

As in our development, given a hypothesis class $H$, these errors allow one to define a target function $f_H^\psi$ and empirical target function $f_z^\psi$ and to derive a decomposition bounding the excess generalization error
\[ \mathcal{E}^\psi(f_z^\psi) - \mathcal{E}^\psi(f_H^\psi) \le \big\{\mathcal{E}^\psi(f_z^\psi) - \mathcal{E}_z^\psi(f_z^\psi)\big\} + \big\{\mathcal{E}_z^\psi(f_H^\psi) - \mathcal{E}^\psi(f_H^\psi)\big\}. \tag{3.6} \]
The second term on the right-hand side of (3.6) converges to zero, with high probability when $m \to \infty$, and its convergence rate can be estimated by standard probability inequalities.
The first term on the right-hand side of (3.6) is more involved. If one writes $\xi_z(z) = \psi(y - f_z^\psi(x))$, then $\mathcal{E}^\psi(f_z^\psi) - \mathcal{E}_z^\psi(f_z^\psi) = \int_Z \xi_z(z)\,d\rho - \frac{1}{m}\sum_{i=1}^m \xi_z(z_i)$. But $\xi_z$ is not a single random variable; it depends on the sample $z$. Therefore, the usual law of large numbers does not guarantee the convergence of this first term. One major goal of classical statistical learning theory [134] is to estimate this error term (i.e., $\mathcal{E}^\psi(f_z^\psi) - \mathcal{E}_z^\psi(f_z^\psi)$). The collection of ideas and techniques used to get such estimates, known as the theory of uniform convergence, plays the role of a uniform law of large numbers. To see why, consider the quantity
\[ \sup_{f \in H} |\mathcal{E}_z^\psi(f) - \mathcal{E}^\psi(f)|, \tag{3.7} \]
which bounds the first term on the right-hand side of (3.6), hence providing (together with bounds for the second term) an estimate for the sample error $\mathcal{E}^\psi(f_z^\psi) - \mathcal{E}^\psi(f_H^\psi)$. The theory of uniform convergence studies the convergence of this quantity. It characterizes those function sets $H$ such that the quantity (3.7) tends to zero in probability as $m \to \infty$.
Definition 3.23 We say that a set $H$ of real-valued functions on a metric space $X$ is uniform Glivenko-Cantelli (UGC) if for every $\varepsilon > 0$,
\[ \lim_{\ell \to +\infty} \sup_\rho \operatorname{Prob}\Big\{ \sup_{m \ge \ell} \sup_{f \in H} \Big| \frac{1}{m}\sum_{i=1}^m f(x_i) - \int_X f(x)\,d\rho \Big| \ge \varepsilon \Big\} = 0, \]
where the supremum is taken with respect to all Borel probability distributions $\rho$ on $X$, and Prob denotes the probability with respect to the samples $x_1, x_2, \ldots$ independently drawn according to such a distribution $\rho$.
The UGC property can be characterized by the $V_\gamma$ dimensions of $H$, as has been done in
Definition 3.24 Let $H$ be a set of functions from $X$ to $[0,1]$ and $\gamma > 0$. We say that $A \subset X$ is $V_\gamma$ shattered by $H$ if there is a number $\alpha \in \mathbb{R}$ with the following property: for every subset $E$ of $A$ there exists some function $f_E \in H$ such that $f_E(x) \le \alpha - \gamma$ for every $x \in A \setminus E$, and $f_E(x) \ge \alpha + \gamma$ for every $x \in E$. The $V_\gamma$ dimension of $H$, $V_\gamma(H)$, is the maximal cardinality of a set $A \subset X$ that is $V_\gamma$ shattered by $H$.
The concept of $V_\gamma$ dimension is related to many other quantities involving capacity of function sets studied in approximation theory or functional analysis: covering numbers, entropy numbers, VC dimensions, packing numbers, metric entropy, and others.
The following characterization of the UGC property is given in [5].
Theorem 3.25 Let $H$ be a set of functions from $X$ to $[0,1]$. Then $H$ is UGC if and only if the $V_\gamma$ dimension of $H$ is finite for every $\gamma > 0$.
Theorem 3.25 may be used to verify the convergence of ERM schemes when the
hypothesis space H is a noncompact UGC set such as the union of unit balls of reproducing
kernel Hilbert spaces associated with a set of Mercer kernels. In particular, for the Gaussian
kernels with flexible variances, the UGC property holds [150].
Many fundamental problems about the UGC property remain to be solved. As an example,
consider the empirical covering numbers.
Definition 3.26 For $x = (x_i)_{i=1}^m \in X^m$ and $H \subset C(X)$, the $\ell^\infty$-empirical covering number $N_\infty(H, x, \eta)$ is the covering number of $H|_x := \{(f(x_i))_{i=1}^m : f \in H\}$ as a subset of $\mathbb{R}^m$ with the following metric. For $f, g \in C(X)$ we take $d_x(f, g) = \max_{i \le m} |f(x_i) - g(x_i)|$. The metric entropy of $H$ is defined as
\[ H_m(H, \eta) = \sup_{x \in X^m} \log N_\infty(H, x, \eta), \quad m \in \mathbb{N},\ \eta > 0. \]
It is known [46] that a set $H$ of functions from $X$ to $[0,1]$ is UGC if and only if, for every $\eta > 0$, $\lim_{m \to \infty} H_m(H, \eta)/m = 0$. In this case, one has $H_m(H, \eta) = O(\log^2 m)$ for every $\eta > 0$. It is conjectured in [5] that $H_m(H, \eta) = O(\log m)$ is true for every $\eta > 0$. A weak form is, Is it true that for some $\alpha \in [1,2)$, every UGC set $H$ satisfies
\[ H_m(H, \eta) = O(\log^\alpha m), \quad \forall \eta > 0? \]

Polynomial decay of the approximation error

We continue to assume that X is a compact metric space (which may be a compact subset of
Rn). Let K be a Mercer kernel on X and HK be its induced RKHS. We observed in Section 2.6
that the space HK,R = IK(BR) may be considered as a hypothesis space. Here IK denotes the
inclusion IK : HK ^ C(X). When R increases the quantity
\[ A(f_\rho, R) := \inf_{f \in H_{K,R}} \mathcal{E}(f) - \mathcal{E}(f_\rho) = \inf_{f \in H_{K,R}} \|f - f_\rho\|_{L^2_{\rho_X}}^2 \]
(which coincides with the approximation error modulo $\sigma_\rho^2$) decreases. The main result in this chapter characterizes the measures $\rho$ and kernels $K$ for which this decay is polynomial, that is, $A(f_\rho, R) = O(R^{-\theta})$ with $\theta > 0$.
Theorem 4.1 Suppose $\rho$ is a Borel probability measure on $Z$. Let $K$ be a Mercer kernel on $X$ and $L_K : L^2_{\rho_X} \to L^2_{\rho_X}$ be the operator given by
\[ L_K f(x) = \int_X K(x, t) f(t)\, d\rho_X(t), \quad x \in X. \]
Let $\theta > 0$. If $f_\rho \in \operatorname{Range}(L_K^{\theta/(4+2\theta)})$, that is, $f_\rho = L_K^{\theta/(4+2\theta)}(g)$ for some $g \in L^2_{\rho_X}$, then $A(f_\rho, R) \le 2^{2+\theta}\|g\|_\rho^{2+\theta} R^{-\theta}$. Conversely, if $\rho_X$ is nondegenerate and $A(f_\rho, R) \le CR^{-\theta}$ for some constants $C$ and $\theta$, then $f_\rho$ lies in the range of $L_K^{\theta/(4+2\theta) - \varepsilon}$ for all $\varepsilon > 0$.
Although Theorem 4.1 may be applied to spline kernels (see Section 4.6), we show in Theorem 6.2 that for $C^\infty$ kernels (e.g., the Gaussian kernel) and under some conditions on $\rho_X$, the approximation error decay cannot reach the order $A(f_\rho, R) = O(R^{-\theta})$ unless $f_\rho$ is $C^\infty$ itself. Instead, also in Chapter 6, we derive logarithmic orders like $A(f_\rho, R) = O((\log R)^{-\theta})$ for analytic kernels and Sobolev smooth regression functions.

Reminders III


We recall some basic facts about Hilbert spaces.

A sequence $\{\phi_n\}_{n \ge 1}$ in a Hilbert space $H$ is said to be a complete orthonormal system (or an orthonormal basis) if the following conditions hold:
(i) for all $n \ne m \ge 1$, $\langle \phi_n, \phi_m\rangle = 0$,
(ii) for all $n \ge 1$, $\|\phi_n\| = 1$, and
(iii) for all $f \in H$, $f = \sum_{n=1}^{\infty} \langle f, \phi_n\rangle \phi_n$.
A sequence satisfying (i) and (ii) only is said to be an orthonormal system. The numbers $\langle f, \phi_n\rangle$ are the Fourier coefficients of $f$ in the basis $\{\phi_n\}_{n \ge 1}$. It is easy to see that these coefficients are unique since, if $f = \sum a_n\phi_n$, $a_n = \langle f, \phi_n\rangle$ for all $n \ge 1$.
Theorem 4.2 (Parseval's theorem) If $\{\phi_n\}$ is an orthonormal system of a Hilbert space $H$, then, for all $f \in H$, $\sum_n \langle f, \phi_n\rangle^2 \le \|f\|^2$. Equality holds for all $f \in H$ if and only if $\{\phi_n\}$ is complete.

We defined compactness of an operator in Section 2.3. We next recall some other basic
properties of linear operators and a main result for operators satisfying them.
Definition 4.3 A linear operator $L: H \to H$ on a Hilbert space $H$ is said to be self-adjoint if, for all $f, g \in H$, $\langle Lf, g\rangle = \langle f, Lg\rangle$. It is said to be positive (respectively strictly positive) if it is self-adjoint and, for all nontrivial $f \in H$, $\langle Lf, f\rangle \ge 0$ (respectively $\langle Lf, f\rangle > 0$).
Theorem 4.4 (Spectral theorem) Let $L$ be a compact self-adjoint linear operator on a Hilbert space $H$. Then there exists in $H$ an orthonormal basis $\{\phi_1, \phi_2, \ldots\}$ consisting of eigenvectors of $L$. If $\lambda_k$ is the eigenvalue corresponding to $\phi_k$, then either the set $\{\lambda_k\}$ is finite or $\lambda_k \to 0$ when $k \to \infty$. In addition, $\max_{k\ge 1}|\lambda_k| = \|L\|$. If, in addition, $L$ is positive, then $\lambda_k \ge 0$ for all $k \ge 1$, and if $L$ is strictly positive, then $\lambda_k > 0$ for all $k \ge 1$.

We close this section by defining the power of a self-adjoint, positive, compact, linear operator. If $L$ is such an operator and $\theta > 0$, then $L^\theta$ is the operator defined by
\[ L^\theta\Big(\sum_k c_k\phi_k\Big) = \sum_k c_k\lambda_k^{\theta}\phi_k. \]
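The definition of the operator power can be checked on a small example. The sketch below is only an illustration under stated assumptions: the Hilbert space is taken to be $\mathbb{R}^2$, the operator is the hypothetical matrix $[[2,1],[1,2]]$ with hand-computed eigenpairs, and the names `EIGENPAIRS` and `operator_power` are ours. It applies $L^\theta$ through the eigen-expansion and lets one confirm that $L^{1/2}\circ L^{1/2}$ agrees with $L$.

```python
import math

# Hypothetical toy operator on R^2 (standing in for a Hilbert space H):
# L = [[2, 1], [1, 2]] has orthonormal eigenvectors phi_1, phi_2
# with eigenvalues 3 and 1.
EIGENPAIRS = [
    (3.0, (1 / math.sqrt(2), 1 / math.sqrt(2))),
    (1.0, (1 / math.sqrt(2), -1 / math.sqrt(2))),
]

def inner(u, v):
    """Euclidean inner product on R^2."""
    return sum(a * b for a, b in zip(u, v))

def operator_power(theta, f):
    """Apply L^theta to f via L^theta f = sum_k lambda_k^theta <f, phi_k> phi_k."""
    out = [0.0] * len(f)
    for lam, phi in EIGENPAIRS:
        coeff = (lam ** theta) * inner(f, phi)
        out = [o + coeff * p for o, p in zip(out, phi)]
    return out

f = [1.0, 0.0]
Lf = operator_power(1.0, f)  # same as multiplying f by [[2, 1], [1, 2]]
half_twice = operator_power(0.5, operator_power(0.5, f))  # L^(1/2) applied twice
```

Because the power acts only on the eigenvalues, composing two half powers multiplies each coefficient by $\lambda_k^{1/2}\lambda_k^{1/2} = \lambda_k$.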




Operators defined by a kernel

In the remainder of this chapter we consider $\nu$, a finite Borel measure on $X$, and $L^2_\nu(X)$, the Hilbert space of square integrable functions on $X$. Note that $\nu$ can be any Borel measure. Significant particular cases are the Lebesgue measure $\mu$ and the marginal measure $\rho_X$ of Chapter 1.
Let $K: X \times X \to \mathbb{R}$ be a continuous function. Then the linear map
\[ L_K : L^2_\nu(X) \to C(X) \]
given by the following integral transform
\[ (L_K f)(x) = \int_X K(x,t) f(t)\,d\nu(t), \qquad x \in X, \]
is well defined. Composition with the inclusion $C(X) \hookrightarrow L^2_\nu(X)$ yields a linear operator $L_K: L^2_\nu(X) \to L^2_\nu(X)$, which, abusing notation, we also denote by $L_K$.
The function $K$ is said to be the kernel of $L_K$, and several properties of $L_K$ follow from properties of $K$. Recall the definitions of $C_K$ and $K_x$ introduced in Section 2.4.
Proposition 4.5 If $K$ is continuous, then $L_K: L^2_\nu(X) \to C(X)$ is well defined and compact. In addition, $\|L_K\| \le \sqrt{\nu(X)}\,C_K^2$. Here $\nu(X)$ denotes the measure of $X$.
Proof. To see that $L_K$ is well defined, we need to show that $L_K f$ is continuous for every $f \in L^2_\nu(X)$. To do so, we consider $f \in L^2_\nu(X)$ and $x_1, x_2 \in X$. Then
\[
\begin{aligned}
|(L_K f)(x_1) - (L_K f)(x_2)| &= \Big|\int_X (K(x_1,t) - K(x_2,t)) f(t)\,d\nu(t)\Big| \\
&\le \|K_{x_1} - K_{x_2}\|_{L^2_\nu(X)}\,\|f\|_{L^2_\nu(X)} \quad \text{(by Cauchy-Schwarz)} \\
&\le \sqrt{\nu(X)}\,\max_{t\in X}|K(x_1,t) - K(x_2,t)|\,\|f\|_{L^2_\nu(X)}.
\end{aligned}
\tag{4.1}
\]
Since $K$ is continuous and $X$ is compact, $K$ is uniformly continuous. This implies the continuity of $L_K f$. The assertion $\|L_K\| \le \sqrt{\nu(X)}\,C_K^2$ follows from the inequality
\[ |(L_K f)(x)| \le \sqrt{\nu(X)}\,\sup_{t\in X}|K(x,t)|\,\|f\|_{L^2_\nu(X)}, \]
which is proved as above.
Finally, to see that $L_K$ is compact, let $(f_n)$ be a bounded sequence in $L^2_\nu(X)$. Since $\|L_K f\|_\infty \le \sqrt{\nu(X)}\,C_K^2\,\|f\|_{L^2_\nu(X)}$, we have that $(L_K f_n)$ is uniformly bounded. By (4.1) we have that the sequence $(L_K f_n)$ is equicontinuous. By the Arzelà-Ascoli theorem (Theorem 2.4), $(L_K f_n)$ contains a uniformly convergent subsequence.

Two more important properties of $L_K$ follow from properties of $K$. Recall that we say $K$ is positive semidefinite if, for all finite sets $\{x_1, \ldots, x_k\} \subset X$, the $k \times k$ matrix $K[\mathbf{x}]$ whose $(i,j)$ entry is $K(x_i, x_j)$ is positive semidefinite.
Proposition 4.6
(i) If $K$ is symmetric, then $L_K: L^2_\nu(X) \to L^2_\nu(X)$ is self-adjoint.
(ii) If, in addition, $K$ is positive semidefinite, then $L_K$ is positive.
Proof. Part (i) follows easily from Fubini's theorem and the symmetry of $K$. For Part (ii), just note that
\[
\int_X\!\int_X K(x,t) f(x) f(t)\,d\nu(x)\,d\nu(t)
= \lim_{k\to\infty} \frac{(\nu(X))^2}{k^2}\sum_{i,j=1}^{k} K(x_i,x_j) f(x_i) f(x_j)
= \lim_{k\to\infty} \frac{(\nu(X))^2}{k^2}\, f_{\mathbf{x}}^{T} K[\mathbf{x}] f_{\mathbf{x}},
\]
where, for all $k \ge 1$, $x_1, \ldots, x_k \in X$ is a set of points conveniently chosen and $f_{\mathbf{x}} = (f(x_1), \ldots, f(x_k))^T$. Since $K[\mathbf{x}]$ is positive semidefinite the result follows.

Theorem 4.7 Let $K: X \times X \to \mathbb{R}$ be a Mercer kernel. There exists an orthonormal basis $\{\phi_1, \phi_2, \ldots\}$ of $L^2_\nu(X)$ consisting of eigenfunctions of $L_K$. If $\lambda_k$ is the eigenvalue corresponding to $\phi_k$, then either the set $\{\lambda_k\}$ is finite or $\lambda_k \to 0$ when $k \to \infty$. In addition, $\lambda_k \ge 0$ for all $k \ge 1$, $\max_{k\ge 1}\lambda_k = \|L_K\|$, and if $\lambda_k \ne 0$, then $\phi_k$ can be chosen to be continuous on $X$.
Proof. By Propositions 4.5 and 4.6, $L_K: L^2_\nu(X) \to L^2_\nu(X)$ is a self-adjoint, positive, compact operator. Theorem 4.4 yields all the statements except the continuity of the $\phi_k$.
To prove this last fact, use the fact that $\phi_k = (1/\lambda_k) L_K(\phi_k)$ in $L^2_\nu(X)$. Then we can choose the eigenfunction to be $(1/\lambda_k) L_K(\phi_k)$, which is a continuous function.

In what follows we fix a Mercer kernel and let $\{\phi_k \in L^2_\nu(X)\}$ be an orthonormal basis of $L^2_\nu(X)$ consisting of eigenfunctions of $L_K$. We call the $\phi_k$ orthonormal eigenfunctions. Denote by $\lambda_k$, $k \ge 1$, the eigenvalue of $L_K$ corresponding to $\phi_k$. If $\lambda_k > 0$, the function $\phi_k$ is continuous. In addition, it lies in the RKHS $H_K$. This is so since
\[ \phi_k(x) = \frac{1}{\lambda_k} L_K(\phi_k)(x) = \frac{1}{\lambda_k}\int_X K(x,t)\phi_k(t)\,d\nu(t), \]
and, thus, $\phi_k$ can be approximated by elements in the span of $\{K_x \mid x \in X\}$. Theorem 2.9 then shows that $\phi_k \in H_K$. In fact,
\[ \|\phi_k\|_K = \frac{1}{\lambda_k}\Big\|\int_X K_t\,\phi_k(t)\,d\nu(t)\Big\|_K \le \frac{1}{\lambda_k}\int_X \|K_t\|_K\,|\phi_k(t)|\,d\nu(t) \le \frac{C_K}{\lambda_k}\sqrt{\nu(X)} < \infty. \]
We shall assume, without loss of generality, that $\lambda_k \ge \lambda_{k+1}$ for all $k \ge 1$. Using the eigenfunctions $\{\phi_k\}$, we can find an orthonormal system of the RKHS $H_K$.
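Theorem 4.8 Let $\nu$ be a Borel measure on $X$, and $K: X \times X \to \mathbb{R}$ a Mercer kernel. Let $\lambda_k$ be the $k$th eigenvalue of $L_K$, and $\phi_k$ the corresponding orthonormal eigenfunction. Then $\{\sqrt{\lambda_k}\phi_k : \lambda_k > 0\}$ forms an orthonormal system in $H_K$.
Proof. We apply the reproducing property stated in Theorem 2.9. Assume $\lambda_i, \lambda_j > 0$; we have
\[
\begin{aligned}
\langle \sqrt{\lambda_i}\phi_i, \sqrt{\lambda_j}\phi_j\rangle_K
&= \Big\langle \frac{1}{\sqrt{\lambda_i}}\int_X K(\cdot,y)\phi_i(y)\,d\nu(y),\ \sqrt{\lambda_j}\phi_j\Big\rangle_K \\
&= \frac{\sqrt{\lambda_j}}{\sqrt{\lambda_i}}\int_X \phi_i(y)\,\langle K_y, \phi_j\rangle_K\,d\nu(y) \\
&= \frac{\sqrt{\lambda_j}}{\sqrt{\lambda_i}}\int_X \phi_i(y)\phi_j(y)\,d\nu(y)
= \frac{\sqrt{\lambda_j}}{\sqrt{\lambda_i}}\langle \phi_i, \phi_j\rangle_{L^2_\nu(X)} = \delta_{ij}.
\end{aligned}
\]
It follows that $\{\sqrt{\lambda_k}\phi_k : \lambda_k > 0\}$ forms an orthonormal system in $H_K$.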

Remark 4.9 When $\nu$ is nondegenerate, one can easily see from the definition of the integral operator that $L_K$ has no eigenvalue 0 if and only if $H_K$ is dense in $L^2_\nu(X)$.
In fact, the orthonormal system above forms an orthonormal basis of $H_K$ when $\rho_X$ is nondegenerate. This will be proved in Section 4.4. Toward this end, we next prove Mercer's theorem.


Mercer's theorem

If $f \in L^2_\nu(X)$ and $\{\phi_1, \phi_2, \ldots\}$ is an orthonormal basis of $L^2_\nu(X)$, $f$ can be uniquely written as $f = \sum_{k\ge 1} a_k\phi_k$, and, when the basis has infinitely many functions, the partial sums $\sum_{k=1}^{N} a_k\phi_k$ converge to $f$ in $L^2_\nu(X)$. If this convergence also holds in $C(X)$, we say that the series converges uniformly to $f$. Also, we say that a series $\sum a_k$ converges absolutely if the series $\sum |a_k|$ is convergent.
When $L_K$ has only finitely many positive eigenvalues $\{\lambda_k\}_{k=1}^{m}$, $K(x,t) = \sum_{k=1}^{m}\lambda_k\phi_k(x)\phi_k(t)$.
Theorem 4.10 Let $\nu$ be a Borel, nondegenerate measure on $X$, and $K: X \times X \to \mathbb{R}$ a Mercer kernel. Let $\lambda_k$ be the $k$th positive eigenvalue of $L_K$, and $\phi_k$ the corresponding continuous orthonormal eigenfunction. For all $x, t \in X$,
\[ K(x,t) = \sum_{k\ge 1}\lambda_k\phi_k(x)\phi_k(t), \]
where the convergence is absolute (for each $(x,t) \in X \times X$) and uniform (on $X \times X$).
Proof. By Theorem 4.8, the sequence $\{\sqrt{\lambda_k}\phi_k\}_{k\ge 1}$ is an orthonormal system of $H_K$. Let $x \in X$. The Fourier coefficients of the function $K_x \in H_K$ with respect to this system are
\[ \langle \sqrt{\lambda_k}\phi_k, K_x\rangle_K = \sqrt{\lambda_k}\,\phi_k(x), \]
where Theorem 2.9(iii) is used. Then, by Parseval's theorem, we know that
\[ \sum_{k\ge 1}|\sqrt{\lambda_k}\phi_k(x)|^2 = \sum_{k\ge 1}\lambda_k|\phi_k(x)|^2 \le \|K_x\|_K^2 = K(x,x) \le C_K^2. \]
Hence the series $\sum_{k\ge 1}\lambda_k|\phi_k(x)|^2$ converges. This is true for each point $x \in X$.
Now we fix a point $x \in X$. When the basis $\{\phi_k\}_{k\ge 1}$ has infinitely many functions, the estimate above, together with the Cauchy-Schwarz inequality, tells us that for each $t \in X$,
\[
\Big|\sum_{k=m}^{m+\ell}\lambda_k\phi_k(x)\phi_k(t)\Big|
\le \Big(\sum_{k=m}^{m+\ell}\lambda_k|\phi_k(t)|^2\Big)^{1/2}\Big(\sum_{k=m}^{m+\ell}\lambda_k|\phi_k(x)|^2\Big)^{1/2}
\le C_K\Big(\sum_{k=m}^{m+\ell}\lambda_k|\phi_k(x)|^2\Big)^{1/2},
\]
which tends to zero uniformly (for $t \in X$). Hence the series $\sum_{k\ge 1}\lambda_k\phi_k(x)\phi_k(t)$ (as a function of $t$) converges absolutely and uniformly on $X$ to a continuous function $g_x$. On the other hand, as a function in $L^2_\nu(X)$, $K_x$ can be expanded by means of an orthonormal basis consisting of $\{\phi_k\}$ and an orthonormal basis $\Psi$ of the nullspace of $L_K$. For $f \in \Psi$,
\[ \langle K_x, f\rangle_{L^2_\nu(X)} = \int_X K(x,y) f(y)\,d\nu = 0. \]
Hence the expansion of $K_x$ is
\[ K_x = \sum_{k\ge 1}\langle K_x, \phi_k\rangle_{L^2_\nu(X)}\phi_k = \sum_{k\ge 1} L_K(\phi_k)(x)\,\phi_k = \sum_{k\ge 1}\lambda_k\phi_k(x)\,\phi_k. \]
Thus, as functions in $L^2_\nu(X)$, $K_x = g_x$. Since $\nu$ is nondegenerate, $K_x$ and $g_x$ are equal on a dense subset of $X$. But both are continuous functions and, therefore, must be equal for each $t \in X$. It follows that for any $x, t \in X$, the series $\sum_{k\ge 1}\lambda_k\phi_k(x)\phi_k(t)$ converges to $K(x,t)$. Since the limit function $K(x,t)$ is continuous, we know that the series must converge uniformly on $X \times X$.

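Corollary 4.11 The sum $\sum_{k}\lambda_k$ is convergent and
\[ \sum_{k\ge 1}\lambda_k = \int_X K(x,x)\,d\nu \le \nu(X)\,C_K^2. \]
Moreover, for all $k \ge 1$, $\lambda_k \le \nu(X)\,C_K^2/k$.

Proof. Taking $x = t$ in Theorem 4.10, we get $K(x,x) = \sum_{k\ge 1}\lambda_k\phi_k(x)^2$. Integrating on both sides of this equality gives
\[ \sum_{k\ge 1}\lambda_k\int_X \phi_k(x)^2\,d\nu = \int_X K(x,x)\,d\nu \le \nu(X)\,C_K^2. \]
However, since $\{\phi_1, \phi_2, \ldots\}$ is an orthonormal basis, $\int_X \phi_k^2\,d\nu = 1$ for all $k \ge 1$ and the first statement follows. The second statement holds true because the assumption $\lambda_k \ge \lambda_j$ for $j > k$ tells us that $k\lambda_k \le \sum_{j=1}^{k}\lambda_j \le \nu(X)\,C_K^2$.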
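Mercer's expansion and the trace identity of Corollary 4.11 can be verified by hand when $X$ is finite. The sketch below is an illustration under assumptions of our own choosing: $X = \{0, 1\}$, $\nu$ puts mass $1/2$ on each point, and $K$ is the matrix $[[2,1],[1,2]]$; the eigenpairs were computed by hand and the helper names are hypothetical.

```python
# Toy setting (an assumption for illustration): X = {0, 1}, nu gives mass 1/2
# to each point, and K is a symmetric positive semidefinite kernel on X.
X = [0, 1]
NU = {0: 0.5, 1: 0.5}
K = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 2.0}

def L_K(f):
    """Integral operator: (L_K f)(x) = sum_t K(x, t) f(t) nu({t})."""
    return {x: sum(K[(x, t)] * f[t] * NU[t] for t in X) for x in X}

def inner_L2(f, g):
    """Inner product of L^2_nu(X)."""
    return sum(f[x] * g[x] * NU[x] for x in X)

# Eigenvalues and L^2_nu-orthonormal eigenfunctions, worked out by hand.
eigenpairs = [(1.5, {0: 1.0, 1: 1.0}), (0.5, {0: 1.0, 1: -1.0})]

def mercer_sum(x, t):
    """Right-hand side of Mercer's expansion: sum_k lambda_k phi_k(x) phi_k(t)."""
    return sum(lam * phi[x] * phi[t] for lam, phi in eigenpairs)

trace = sum(lam for lam, _ in eigenpairs)            # sum of eigenvalues
diag_integral = sum(K[(x, x)] * NU[x] for x in X)    # integral of K(x, x) d nu
```

In this example $\nu(X) = 1$ and $C_K^2 = 2$, so the bound $\lambda_k \le \nu(X)C_K^2/k$ reads $1.5 \le 2$ and $0.5 \le 1$.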


RKHSs revisited

In this section we show that the RKHS $H_K$ has an orthonormal basis $\{\sqrt{\lambda_k}\phi_k\}$ derived from the integral operator $L_K$ (and thus dependent on the measure $\nu$).
Theorem 4.12 Let $\nu$ be a Borel, nondegenerate measure on $X$, and $K: X \times X \to \mathbb{R}$ a Mercer kernel. Let $\lambda_k$ be the $k$th positive eigenvalue of $L_K$, and $\phi_k$ the corresponding continuous orthonormal eigenfunction. Then $\{\sqrt{\lambda_k}\phi_k : \lambda_k > 0\}$ is an orthonormal basis of $H_K$.
Proof. By Theorem 4.8, $\{\sqrt{\lambda_k}\phi_k : \lambda_k > 0\}$ is an orthonormal system in $H_K$. To prove the completeness we need only show that for each $x \in X$, $K_x$ lies in the closed span of this orthonormal system. Complete this system to form an orthonormal basis $\{\sqrt{\lambda_k}\phi_k : \lambda_k > 0\} \cup \{\psi_j : j \ge 0\}$ of $H_K$. By Parseval's theorem,
\[ \|K_x\|_K^2 = \sum_k \langle K_x, \sqrt{\lambda_k}\phi_k\rangle_K^2 + \sum_j \langle K_x, \psi_j\rangle_K^2. \]
Therefore, to show that $K_x$ lies in the closed span of $\{\sqrt{\lambda_k}\phi_k : \lambda_k > 0\}$, it is enough to require that
\[ \|K_x\|_K^2 = \sum_k \langle K_x, \sqrt{\lambda_k}\phi_k\rangle_K^2, \]
that is, that
\[ K(x,x) = \sum_k \lambda_k\langle K_x, \phi_k\rangle_K^2 = \sum_k \lambda_k\phi_k(x)^2. \]
Theorem 4.10 with $x = t$ yields this identity for each $x \in X$.

Since the RKHS $H_K$ is independent of the measure $\nu$, it follows that when $\nu$ is nondegenerate and $\dim H_K = \infty$, $L_K$ has infinitely many positive eigenvalues $\lambda_k$, $k \ge 1$, and
\[ H_K = \Big\{ f = \sum_{k=1}^{\infty} a_k\sqrt{\lambda_k}\phi_k : \{a_k\}_{k=1}^{\infty} \in \ell^2 \Big\}. \]
When $\dim H_K = m < \infty$, $L_K$ has only $m$ positive (repeated) eigenvalues. In this case,
\[ H_K = \Big\{ f = \sum_{k=1}^{m} a_k\sqrt{\lambda_k}\phi_k : (a_1, \ldots, a_m) \in \mathbb{R}^m \Big\}. \]
In both cases, the map
\[ L_K^{1/2}: L^2_\nu(X) \to H_K, \qquad \sum_k a_k\phi_k \mapsto \sum_k a_k\sqrt{\lambda_k}\phi_k \]
defines an isomorphism of Hilbert spaces between the closed span of $\{\phi_k : \lambda_k > 0\}$ in $L^2_\nu(X)$ and $H_K$. In addition, considered as an operator on $L^2_\nu(X)$, $L_K^{1/2}$ is the square root of $L_K$ in the sense that $L_K = L_K^{1/2} \circ L_K^{1/2}$ (hence the notation $L_K^{1/2}$). This yields the following corollary.
Corollary 4.13 Let $\nu$ be a Borel, nondegenerate measure on $X$, and $K: X \times X \to \mathbb{R}$ a Mercer kernel. Then $H_K = L_K^{1/2}(L^2_\nu(X))$. That is, every function $f \in H_K$ can be written as $f = L_K^{1/2}g$ for some $g \in L^2_\nu(X)$ with $\|f\|_K = \|g\|_{L^2_\nu(X)}$.

A different approach to the orthonormal basis $\{\sqrt{\lambda_k}\phi_k\}$ is to regard it as a function on $X$ with values in $\ell^2$.
Theorem 4.14 The map
\[ \Phi: X \to \ell^2, \qquad x \mapsto \big(\sqrt{\lambda_k}\,\phi_k(x)\big)_{k\ge 1} \]
is well defined and continuous, and satisfies
\[ K(x,t) = \langle \Phi(x), \Phi(t)\rangle. \]
Proof. For every $x \in X$, by Mercer's theorem $\sum_k \lambda_k\phi_k^2(x)$ converges to $K(x,x)$. This shows that $\Phi(x) \in \ell^2$. Also by Mercer's theorem, for every $x, t \in X$,
\[ K(x,t) = \sum_k \lambda_k\phi_k(x)\phi_k(t) = \langle \Phi(x), \Phi(t)\rangle_{\ell^2}. \]
It remains only to prove that $\Phi: X \to \ell^2$ is continuous. For any $x, t \in X$,
\[ \|\Phi(x) - \Phi(t)\|_{\ell^2}^2 = \langle \Phi(x), \Phi(x)\rangle_{\ell^2} + \langle \Phi(t), \Phi(t)\rangle_{\ell^2} - 2\langle \Phi(x), \Phi(t)\rangle_{\ell^2} = K(x,x) + K(t,t) - 2K(x,t), \]
which tends to zero when $x$ tends to $t$ by the continuity of $K$.
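The two identities in this proof can be checked on a finite example. The sketch below is an illustration only: it assumes the hypothetical two-point space $X = \{0,1\}$ with $\nu(\{0\}) = \nu(\{1\}) = 1/2$ and kernel matrix $[[2,1],[1,2]]$, whose eigenpairs were computed by hand, so that $\Phi$ takes values in $\mathbb{R}^2$ instead of $\ell^2$.

```python
import math

# Toy setting (an assumption for illustration): X = {0, 1}, nu gives mass 1/2
# to each point, and K is the kernel matrix [[2, 1], [1, 2]].
K = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 2.0}

# Eigenvalues and L^2_nu-orthonormal eigenfunctions of L_K, worked out by hand.
EIGENPAIRS = [(1.5, {0: 1.0, 1: 1.0}), (0.5, {0: 1.0, 1: -1.0})]

def feature_map(x):
    """Phi(x) = (sqrt(lambda_k) * phi_k(x))_k, a finite sequence in this toy case."""
    return [math.sqrt(lam) * phi[x] for lam, phi in EIGENPAIRS]

def inner(u, v):
    """Inner product of two finite sequences."""
    return sum(a * b for a, b in zip(u, v))

def squared_distance(x, t):
    """||Phi(x) - Phi(t)||^2 computed directly from the feature vectors."""
    diff = [a - b for a, b in zip(feature_map(x), feature_map(t))]
    return inner(diff, diff)
```

Here `squared_distance(0, 1)` should agree with $K(0,0) + K(1,1) - 2K(0,1) = 2$.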

4.5 Characterizing the approximation error in RKHSs

In this section we prove Theorem 4.1. It actually follows from a more general characterization
of the decay of the approximation error given using interpolation spaces.
Definition 4.15 Let $(B, \|\cdot\|)$ and $(H, \|\cdot\|_H)$ be Banach spaces and assume $H$ is a subspace of $B$. The K-functional $K: B \times (0, \infty) \to \mathbb{R}$ of the pair $(B, H)$ is defined, for $a \in B$ and $t > 0$, by
\[ K(a,t) := \inf_{b\in H}\{\|a - b\| + t\|b\|_H\}. \tag{4.2} \]
It can easily be seen that for fixed $a \in B$, the function $K(a,t)$ of $t$ is continuous, nondecreasing, and bounded by $\|a\|$ (take $b = 0$ in (4.2)). When $H$ is dense in $B$, $K(a,t)$ tends to zero as $t \to 0$. The interpolation spaces for the pair $(B, H)$ are defined in terms of the convergence rate of this function.
For $0 < r < 1$, the interpolation space $(B, H)_r$ consists of all the elements $a \in B$ such that the norm
\[ \|a\|_r := \sup_{t>0}\{K(a,t)/t^r\} \]
is finite.
Theorem 4.16 Let $(B, \|\cdot\|)$ be a Banach space, and $(H, \|\cdot\|_H)$ a subspace, such that $\|b\| \le C_0\|b\|_H$ for all $b \in H$ and a constant $C_0 > 0$. Let $0 < r < 1$. If $a \in (B, H)_r$, then, for all $R > 0$,
\[ \mathcal{A}(a, R) := \inf_{\|b\|_H \le R}\{\|a - b\|^2\} \le \|a\|_r^{2/(1-r)}R^{-2r/(1-r)}. \]
Conversely, if $\mathcal{A}(a, R) \le CR^{-2r/(1-r)}$ for all $R > 0$, then $a \in (B, H)_r$ and $\|a\|_r \le 2C^{(1-r)/2}$.
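Proof. Consider the function $f(t) := K(a,t)/t$. It is continuous on $(0, +\infty)$. Since $K(a,t) \le \|a\|$, $\inf_{t>0}\{f(t)\} = 0$.
Fix $R > 0$. If $\sup_{t>0}\{f(t)\} \ge R$, then, for any $0 < \varepsilon < 1$, there exists some $t_{R,\varepsilon} \in (0, +\infty)$ such that
\[ f(t_{R,\varepsilon}) = \frac{K(a, t_{R,\varepsilon})}{t_{R,\varepsilon}} = (1 - \varepsilon)R. \]
By the definition of the K-functional, we can find $b_\varepsilon \in H$ such that
\[ \|a - b_\varepsilon\| + t_{R,\varepsilon}\|b_\varepsilon\|_H \le K(a, t_{R,\varepsilon})/(1 - \varepsilon). \]
It follows that
\[ \|b_\varepsilon\|_H \le \frac{K(a, t_{R,\varepsilon})}{(1-\varepsilon)t_{R,\varepsilon}} = R \qquad\text{and}\qquad \|a - b_\varepsilon\| \le \frac{K(a, t_{R,\varepsilon})}{1 - \varepsilon}. \]
But the definition of the norm $\|a\|_r$ implies that
\[ \frac{K(a, t_{R,\varepsilon})}{t_{R,\varepsilon}^{\,r}} \le \|a\|_r. \]
Therefore
\[ \|a - b_\varepsilon\| \le \Big[\frac{K(a,t_{R,\varepsilon})}{(1-\varepsilon)t_{R,\varepsilon}}\Big]^{-r/(1-r)}\Big[\frac{K(a,t_{R,\varepsilon})}{(1-\varepsilon)t_{R,\varepsilon}^{\,r}}\Big]^{1/(1-r)} \le R^{-r/(1-r)}\Big(\frac{1}{1-\varepsilon}\Big)^{1/(1-r)}\|a\|_r^{1/(1-r)}. \]
Thus,
\[ \mathcal{A}(a,R) \le \inf_{0<\varepsilon<1}\{\|a - b_\varepsilon\|^2\} \le \|a\|_r^{2/(1-r)}R^{-2r/(1-r)}; \]
that is, the desired error estimate holds in this case.

Turn now to the case where $\sup_{t>0}\{f(t)\} < R$. Then, for any $0 < \varepsilon < 1 - \sup_{u>0}\{f(u)\}/R$ and any $t > 0$, there exists some $b_{t,\varepsilon} \in H$ such that
\[ \|a - b_{t,\varepsilon}\| + t\|b_{t,\varepsilon}\|_H \le K(a,t)/(1 - \varepsilon). \]
This implies that
\[ \|b_{t,\varepsilon}\|_H \le \frac{K(a,t)}{(1-\varepsilon)t} \le \frac{1}{1-\varepsilon}\sup_{u>0}\{f(u)\} < R \qquad\text{and}\qquad \|a - b_{t,\varepsilon}\| \le \frac{K(a,t)}{1-\varepsilon}. \]
Hence,
\[ \mathcal{A}(a,R) \le \inf_{t>0}\{\|a - b_{t,\varepsilon}\|^2\} \le \Big(\inf_{t>0}\{K(a,t)\}/(1-\varepsilon)\Big)^2 \le \Big(\inf_{t>0}\{\|a\|_r t^r\}/(1-\varepsilon)\Big)^2 = 0. \]
This again proves the desired error estimate. Hence the first statement of the theorem holds.
Conversely, suppose that $\mathcal{A}(a,R) \le CR^{-2r/(1-r)}$ for all $R > 0$. Let $t > 0$. Choose $R_t = (\sqrt{C}/t)^{1-r}$. Then, for any $\varepsilon > 0$, we can find $b_{t,\varepsilon} \in H$ such that
\[ \|b_{t,\varepsilon}\|_H \le R_t \qquad\text{and}\qquad \|a - b_{t,\varepsilon}\|^2 \le CR_t^{-2r/(1-r)}(1 + \varepsilon)^2. \]
It follows that
\[ K(a,t) \le \|a - b_{t,\varepsilon}\| + t\|b_{t,\varepsilon}\|_H \le \sqrt{C}R_t^{-r/(1-r)}(1 + \varepsilon) + tR_t \le 2(1 + \varepsilon)C^{(1-r)/2}t^r. \]
Since $\varepsilon$ can be arbitrarily small, we have
\[ K(a,t) \le 2C^{(1-r)/2}t^r. \]
Thus, $\|a\|_r = \sup_{t>0}\{K(a,t)/t^r\} \le 2C^{(1-r)/2} < \infty$.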

The proof shows that if $a \in H$, then $\mathcal{A}(a,R) = 0$ for $R > \|a\|_H$. In addition, $\mathcal{A}(a,R) = O(R^{-2r/(1-r)})$ if and only if $a \in (B, H)_r$. A special case of Theorem 4.16 characterizes the decay of $\mathcal{A}$ for RKHSs.
Corollary 4.17 Suppose $\rho$ is a Borel probability measure on $Z$. Let $\theta > 0$. Then $\mathcal{A}(f_\rho, R) = O(R^{-\theta})$ if and only if $f_\rho \in (L^2_{\rho_X}, H_K^+)_{\theta/(2+\theta)}$, where $H_K^+$ is the closed subspace of $H_K$ spanned by the orthonormal system $\{\sqrt{\lambda_k}\phi_k : \lambda_k > 0\}$ given in Theorem 4.8 with the measure $\rho_X$.
Proof. Take $B = L^2_{\rho_X}$ and $H = H_K^+$ with the norm inherited from $H_K$. Then $\|b\| \le \sqrt{\lambda_1}\|b\|_H$ for all $b \in H$. The statement now follows from Theorem 4.16 taking $r = \theta/(2 + \theta)$.
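The choice $r = \theta/(2+\theta)$ is exactly what turns the exponents of Theorem 4.16 into those of Theorem 4.1. A minimal sketch of this bookkeeping, using exact rational arithmetic (the function names are ours and purely illustrative):

```python
from fractions import Fraction

def interpolation_index(theta):
    """r = theta / (2 + theta), the index used in Corollary 4.17."""
    theta = Fraction(theta)
    return theta / (2 + theta)

def decay_exponent(r):
    """Exponent 2r / (1 - r) of 1/R in the error bound of Theorem 4.16."""
    return 2 * r / (1 - r)

def norm_exponent(r):
    """Exponent 2 / (1 - r) of the interpolation norm in Theorem 4.16."""
    return 2 / (1 - r)

def constant_exponent(r):
    """Exponent (1 - r) / 2 of C in the converse bound of Theorem 4.16."""
    return (1 - r) / 2
```

For every $\theta > 0$ one gets $2r/(1-r) = \theta$, $2/(1-r) = 2 + \theta$, and $(1-r)/2 = 1/(2+\theta)$, which are the exponents appearing in the proof of Theorem 4.1.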

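Remark 4.18 When $\rho_X$ is nondegenerate, $H_K^+ = H_K$ by Theorem 4.12.

Recall that the space $L_K^r(L^2_\nu(X))$ is $\{\sum_{\lambda_k>0} a_k\lambda_k^r\phi_k : \{a_k\} \in \ell^2\}$ with the norm
\[ \Big\|\sum_{\lambda_k>0} a_k\lambda_k^r\phi_k\Big\|_{L_K^r(L^2_\nu(X))} = \|\{a_k\}\|_{\ell^2} = \Big(\sum_{\lambda_k>0}|a_k|^2\Big)^{1/2}. \]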

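Proof of Theorem 4.1 Take $H_K^+$ as in Corollary 4.17 and $r = \theta/(2 + \theta)$.

If $f_\rho \in \mathrm{Range}(L_K^{\theta/(4+2\theta)})$, then $f_\rho = L_K^{\theta/(4+2\theta)}g$ for some $g \in L^2_{\rho_X}$. Without loss of generality, we may take $g = \sum_{\lambda_k>0} a_k\phi_k$. Then $\|g\|^2 = \sum_{\lambda_k>0} a_k^2 < \infty$ and $f_\rho = \sum_{\lambda_k>0} a_k\lambda_k^{\theta/(4+2\theta)}\phi_k$.
We show that $f_\rho \in (L^2_{\rho_X}, H_K^+)_r$. Indeed, for every $t \le \sqrt{\lambda_1}$, there exists some $N \in \mathbb{N}$ such that
\[ \lambda_{N+1} < t^2 \le \lambda_N. \]
Choose $f = \sum_{k=1}^{N} a_k\lambda_k^{\theta/(4+2\theta)}\phi_k \in H_K$. We can see from Theorem 4.8 that
\[ \|f\|_K^2 = \Big\|\sum_{k=1}^{N} a_k\lambda_k^{\theta/(4+2\theta)-1/2}\sqrt{\lambda_k}\phi_k\Big\|_K^2 = \sum_{k=1}^{N} a_k^2\lambda_k^{-2/(2+\theta)} \le \lambda_N^{-2/(2+\theta)}\|g\|^2. \]
In addition,
\[ \|f_\rho - f\|_{L^2_{\rho_X}}^2 = \Big\|\sum_{k>N} a_k\lambda_k^{\theta/(4+2\theta)}\phi_k\Big\|_{L^2_{\rho_X}}^2 = \sum_{k>N} a_k^2\lambda_k^{\theta/(2+\theta)} \le \lambda_{N+1}^{\theta/(2+\theta)}\|g\|^2. \]
Let $K$ be the K-functional for the pair $(L^2_{\rho_X}, H_K^+)$. Then
\[ K(f_\rho, t) \le \|f_\rho - f\|_{L^2_{\rho_X}} + t\|f\|_K \le \lambda_{N+1}^{\theta/(4+2\theta)}\|g\| + t\lambda_N^{-1/(2+\theta)}\|g\|. \]
By the choice of $N$, we have
\[ K(f_\rho, t) \le 2\|g\|t^{\theta/(2+\theta)} = 2\|g\|t^r. \]
Since $K(f_\rho, t) \le \|f_\rho\|_{L^2_{\rho_X}} \le \lambda_1^{r/2}\|g\|$, we can also see that for $t > \sqrt{\lambda_1}$, $K(f_\rho, t)/t^r \le \|g\|$ holds. Therefore, $f_\rho \in (L^2_{\rho_X}, H_K^+)_r$ and $\|f_\rho\|_r \le 2\|g\|$. It follows from Theorem 4.16 that
\[ \mathcal{A}(f_\rho, R) = \inf_{f\in H_K^+,\ \|f\|_K\le R}\|f - f_\rho\|_{L^2_{\rho_X}}^2 \le (2\|g\|)^{2/(1-r)}R^{-2r/(1-r)} = 2^{2+\theta}\|g\|^{2+\theta}R^{-\theta}. \]
Conversely, if $\rho_X$ is nondegenerate and $\mathcal{A}(f_\rho, R) \le CR^{-\theta}$ for some constant $C$ and all $R > 0$, then Theorem 4.12 states that $H_K^+ = H_K$. This, together with Theorem 4.16 and the polynomial decay of $\mathcal{A}(f_\rho, R)$, implies that $f_\rho \in (L^2_{\rho_X}, H_K)_r$ and $\|f_\rho\|_r \le 2C^{1/(2+\theta)}$.
Let $m \in \mathbb{N}$. There exists a function $f_m \in H_K$ such that
\[ \|f_\rho - f_m\|_{L^2_{\rho_X}} + 2^{-m}\|f_m\|_K \le 4C^{1/(2+\theta)}2^{-mr}. \]
Then
\[ \|f_\rho - f_m\|_{L^2_{\rho_X}} \le 4C^{1/(2+\theta)}2^{-mr} \qquad\text{and}\qquad \|f_m\|_K \le 4C^{1/(2+\theta)}2^{m(1-r)}. \]
Write $f_\rho = \sum_k c_k\phi_k$ and $f_m = \sum_k b_k^{(m)}\phi_k$. Then, for all $0 < \varepsilon < r$,
\[ \sum_{2^{-2m}\le\lambda_k<2^{-2(m-1)}}\frac{c_k^2}{\lambda_k^{r-2\varepsilon}} \le 2\sum_{2^{-2m}\le\lambda_k<2^{-2(m-1)}}\frac{(c_k - b_k^{(m)})^2}{\lambda_k^{r-2\varepsilon}} + 2\sum_{2^{-2m}\le\lambda_k<2^{-2(m-1)}}\frac{(b_k^{(m)})^2}{\lambda_k}\lambda_k^{1-r+2\varepsilon}, \]
which can be bounded by
\[ 2^{1+2m(r-2\varepsilon)}\|f_\rho - f_m\|_{L^2_{\rho_X}}^2 + 2^{1+2(1-m)(1-r+2\varepsilon)}\|f_m\|_K^2 \le C^{2/(2+\theta)}2^{5-4\varepsilon m} + C^{2/(2+\theta)}2^{5+2(1-r)+4\varepsilon(1-m)}. \]
Summing the two geometric series over $m \ge 1$, we obtain
\[ \sum_{\lambda_k<1}\frac{c_k^2}{\lambda_k^{r-2\varepsilon}} \le C^{2/(2+\theta)}\,\frac{2^5 + 2^{7-2r}\,16^{\varepsilon}}{16^{\varepsilon} - 1} < \infty. \]
This means that $f_\rho \in \mathrm{Range}(L_K^{(r-2\varepsilon)/2}) = \mathrm{Range}(L_K^{\theta/(4+2\theta)-\varepsilon})$.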

An example


In this section we describe a simple example for the approximation error in RKHSs.
Example 4.19 Let $X = [-1,1]$, and let $K$ be the spline kernel given in Example 2.15, that is, $K(x,y) = \max\{1 - |x - y|/2, 0\}$. We claim that $H_K$ is the Sobolev space $H^1(X)$ with the following equivalent inner product:
\[ \langle f, g\rangle_K = \langle f', g'\rangle_{L^2[-1,1]} + \tfrac{1}{2}\,(f(-1) + f(1))\,(g(-1) + g(1)). \tag{4.3} \]
Assume now that $\rho_X$ is the Lebesgue measure. For $\theta > 0$ and a function $f_\rho \in L^2[-1,1]$, we also claim that $\mathcal{A}(f_\rho, R) = O(R^{-\theta})$ if and only if $\|f_\rho(x + t) - f_\rho(x)\|_{L^2[-1,1-t]} = O(t^{\theta/(2+\theta)})$.
To prove the first claim, note that we know from Example 2.15 that $K$ is a Mercer kernel. Also, $K_x \in H^1(X)$ for any $x \in X$. To show that (4.3) is the inner product in $H_K$, it is sufficient to prove that $\langle f, K_x\rangle_K = f(x)$ for any $f \in H^1(X)$ and $x \in X$. To see this, note that $K_x' = \tfrac{1}{2}\chi_{[-1,x)} - \tfrac{1}{2}\chi_{(x,1]}$ and $K_x(-1) + K_x(1) = 1$. Then,
\[ \langle f, K_x\rangle_K = \frac{1}{2}\int_{-1}^{x} f'(y)\,dy - \frac{1}{2}\int_{x}^{1} f'(y)\,dy + \frac{1}{2}\,(f(-1) + f(1)) = f(x). \]
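The reproducing identity above can be checked numerically. The following sketch is only an illustration: it assumes the inner product (4.3) with boundary weight $1/2$, approximates the integral of $f'K_x'$ by a midpoint rule, and uses a test function of our own choosing; all names are hypothetical.

```python
def K(x, y):
    """Spline kernel of Example 4.19 on X = [-1, 1]."""
    return max(1.0 - abs(x - y) / 2.0, 0.0)

def dKx(x, y):
    """Derivative of K_x at y (for y != x): +1/2 left of x, -1/2 right of x."""
    return 0.5 if y < x else -0.5

def inner_with_Kx(f, df, x, n=20000):
    """<f, K_x>_K = int_{-1}^{1} f'(y) K_x'(y) dy + (1/2)(f(-1)+f(1))(K_x(-1)+K_x(1)).

    The integral is approximated with the midpoint rule on n cells.
    """
    h = 2.0 / n
    integral = 0.0
    for i in range(n):
        y = -1.0 + (i + 0.5) * h
        integral += df(y) * dKx(x, y) * h
    boundary = 0.5 * (f(-1.0) + f(1.0)) * (K(x, -1.0) + K(x, 1.0))
    return integral + boundary

def f(y):
    """Smooth test function (an arbitrary choice)."""
    return y ** 3 + y ** 2

def df(y):
    """Derivative of the test function."""
    return 3 * y ** 2 + 2 * y
```

Up to quadrature error, `inner_with_Kx(f, df, x)` should return `f(x)` for any `x` in $[-1,1]$.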
To prove the second claim, we use Theorem 4.16 with $B = L^2(X)$ and $H = H_K = H^1(X)$. Our conclusion follows from the following statement: for $0 < r < 1$ and $f \in L^2$, $K(f,t) = O(t^r)$ if and only if $\|f(x + t) - f(x)\|_{L^2[-1,1-t]} = O(t^r)$.
To verify the sufficiency of this statement, define the function $f_t: X \to \mathbb{R}$ by
\[ f_t(x) = \frac{1}{t}\int_0^t f(x + h)\,dh. \]
Taking norms of functions on the variable $x$, we can see that
\[ \|f - f_t\|_{L^2} = \Big\|\frac{1}{t}\int_0^t (f(x + h) - f(x))\,dh\Big\|_{L^2} \le \frac{1}{t}\int_0^t \|f(x + h) - f(x)\|_{L^2}\,dh \le \frac{1}{t}\int_0^t Ch^r\,dh = \frac{C}{r+1}t^r \]
and
\[ \|f_t\|_{H^1(X)} = \Big\|\frac{1}{t}(f(x + t) - f(x))\Big\|_{L^2} \le Ct^{r-1}. \]
Hence
\[ K(f,t) \le \|f - f_t\|_{L^2} + t\|f_t\|_{H^1(X)} \le (C/(r+1) + C)\,t^r. \]
Conversely, if $K(f,t) \le Ct^r$, then, for any $g \in H^1(X)$, we have
\[ \|f(x + t) - f(x)\|_{L^2} = \|f(x + t) - g(x + t) + g(x + t) - g(x) + g(x) - f(x)\|_{L^2}, \]
which can be bounded by
\[ 2\|f - g\|_{L^2} + \Big\|\int_0^t g'(x + h)\,dh\Big\|_{L^2} \le 2\|f - g\|_{L^2} + \int_0^t \|g'(x + h)\|_{L^2}\,dh \le 2\|f - g\|_{L^2} + t\|g\|_{H^1(X)}. \]
Taking the infimum over $g \in H^1(X)$, we see that
\[ \|f(x + t) - f(x)\|_{L^2} \le 2K(f,t) \le 2Ct^r. \]

This proves the statement and, with it, the second claim.


References and additional remarks

For a proof of the spectral theorem for compact operators see, for example, [73] and Section
4.10 of [40].
Mercer's theorem was originally proved [85] for $X = [0,1]$ and $\nu$ the Lebesgue measure. Proofs for this simple case can also be found in [63, 73].
Theorems 4.10 and 4.12 are for general nondegenerate measures $\nu$ on a compact space $X$. For an extension to a noncompact space $X$ see [123].
The map $\Phi$ in Theorem 4.14 is called the feature map in the literature on learning theory [37, 107, 134]. More general characterizations for the decay of the approximation error being of type $O(\varphi(R))$ with $\varphi$ decreasing on $(0, +\infty)$ can be derived from the literature on approximation theory (e.g., [87, 94]) by means of K-functionals and moduli of smoothness. For interpolation spaces see [16].
RKHSs generated by general spline kernels are described in [137]. In the proof of Example 4.19 we have used a standard technique in approximation theory (see [78]). Here the function needs to be extended outside $[-1,1]$ for defining $f_t$, or the norm $L^2$ should be taken on $[-1, 1-t]$. For simplicity, we have omitted this discussion.
The characterization of the approximation error described in Section 4.5 is taken from
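Consider the approximation for the ERM scheme with a general loss function $\psi$ in Section 3.5. The target function $f_H^\psi$ minimizes the generalization error $\mathcal{E}^\psi$ over $H$. If we minimize instead over the set of all measurable functions we obtain a version (w.r.t. $\psi$) of the regression function.
Definition 4.20 Given the regression loss function $\psi$, the $\psi$-regression function is given by
\[ f_\rho^\psi(x) = \operatorname*{argmin}_{t\in\mathbb{R}}\int_Y \psi(y - t)\,d\rho(y \mid x), \qquad x \in X. \]
The approximation error (w.r.t. $\psi$) associated with the hypothesis space $H$ is defined as
\[ \mathcal{E}^\psi(f_H^\psi) - \mathcal{E}^\psi(f_\rho^\psi) = \min_{f\in H}\mathcal{E}^\psi(f) - \mathcal{E}^\psi(f_\rho^\psi). \]
Proposition 4.21 Let $\psi$ be a regression loss function. If $\varepsilon \ge 0$ is the largest zero of $\psi$ and $|y| \le M$ almost surely, then, for all $x \in X$, $f_\rho^\psi(x) \in [-M - \varepsilon, M + \varepsilon]$.

The approximation error can be estimated as follows [141].

Theorem 4.22 Assume $|y - f(x)| \le M$ and $|y - f_\rho^\psi(x)| \le M$ almost surely.
(i) If $\psi$ is a regression loss function satisfying, for some $0 < s \le 1$,
\[ \sup_{t,t'\in[-M,M]}\frac{|\psi(t) - \psi(t')|}{|t - t'|^s} = C < \infty, \tag{4.4} \]
then
\[ \mathcal{E}^\psi(f) - \mathcal{E}^\psi(f_\rho^\psi) \le C\int_X |f(x) - f_\rho^\psi(x)|^s\,d\rho_X \le C\|f - f_\rho^\psi\|_{L^1_{\rho_X}}^{s} \le C\|f - f_\rho^\psi\|_{L^2_{\rho_X}}^{s}. \]
(ii) If $\psi$ is $C^1$ on $[-M, M]$ and its derivative satisfies (4.4), then
\[ \mathcal{E}^\psi(f) - \mathcal{E}^\psi(f_\rho^\psi) \le C\|f - f_\rho^\psi\|_{L^{1+s}_{\rho_X}}^{1+s} \le C\|f - f_\rho^\psi\|_{L^2_{\rho_X}}^{1+s}. \]
(iii) If $\psi''(u) \ge c > 0$ for every $u \in [-M, M]$, then
\[ \mathcal{E}^\psi(f) - \mathcal{E}^\psi(f_\rho^\psi) \ge \frac{c}{2}\|f - f_\rho^\psi\|_{L^2_{\rho_X}}^{2}. \]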

Estimating covering numbers

The bounds for the sample error described in Chapter 3 are in terms of, among other
quantities, some covering numbers. In this chapter, we provide estimates for these covering
numbers when we take a ball in an RKHS as a hypothesis space. Our estimates are given in
terms of the regularity of the kernel. As a particular case, we obtain the following.
Theorem 5.1 Let $X$ be a compact subset of $\mathbb{R}^n$, and $\mathrm{Diam}(X) := \max_{x,y\in X}\|x - y\|$ its diameter.
(i) If $K \in C^s(X \times X)$ for some $s > 0$ and $X$ has piecewise smooth boundary, then there is $C > 0$ depending on $X$ and $s$ only such that
\[ \ln\mathcal{N}(I_K(B_R), \eta) \le C\Big(\frac{R}{\eta}\Big)^{2n/s}, \qquad \forall\, 0 < \eta \le R/2. \]
(ii) If $K(x,y) = \exp\{-\|x - y\|^2/\sigma^2\}$ for some $\sigma > 0$, then, for all $0 < \eta \le R/2$,
\[ \ln\mathcal{N}(I_K(B_R), \eta) \le n\Big(32 + \frac{640\,n\,(\mathrm{Diam}(X))^2}{\sigma^2}\Big)^{n+1}\Big(\ln\frac{R}{\eta}\Big)^{n+1}. \]
If, moreover, $X$ contains a cube in the sense that $X \supseteq x^* + [-\gamma/2, \gamma/2]^n$ for some $x^* \in X$ and $\gamma > 0$, then, for all $0 < \eta \le R/2$,
\[ \ln\mathcal{N}(I_K(B_R), \eta) \ge C_1\Big(\ln\frac{R}{\eta}\Big)^{n/2}. \]
Here $C_1$ is a positive constant depending only on $\sigma$ and $\gamma$.
Part (i) of Theorem 5.1 follows from Theorem 5.5 and Lemma 5.6. It shows how the covering number decreases as the index $s$ of the Sobolev smooth kernel increases. A case where the hypothesis of Part (ii) applies is that of the box spline kernels described in Example 2.17. We show this is so in Proposition 5.25.
When the kernel is analytic, better than Sobolev smoothness for any index $s > 0$, one can see from Part (i) that $\ln\mathcal{N}(I_K(B_R), \eta)$ decays at a rate faster than $(R/\eta)^\varepsilon$ for any $\varepsilon > 0$. Hence one would expect a decay rate such as $(\ln(R/\eta))^s$ for some $s$. This is exactly what Part (ii) of Theorem 5.1 shows for Gaussian kernels. The lower bound stated in Part (ii) also tells us that the upper bound is almost sharp. The proof for Part (ii) is given in Corollaries 5.14 and 5.24 together with Proposition 5.13, where an explicit formula for the constant $C_1$ can be found.


Reminders IV

To prove the main results of this chapter, we use some basic knowledge from function spaces
and approximation theory.
Approximation theory studies the approximation of functions by functions in some good
family - for example, polynomials, splines, wavelets, radial basis functions, ridge functions.
The quality of the approximation usually depends on, in addition to the size of the
approximating family, the regularity of the approximated function. In this section, we describe
some common measures of regularity for functions.
(I) Consider functions on an arbitrary metric space $(X, d)$. Let $0 < s \le 1$. We say that a continuous function $f$ on $X$ is Lipschitz-$s$ when there exists a constant $C > 0$ such that for all $x, y \in X$,
\[ |f(x) - f(y)| \le C(d(x,y))^s. \]
We denote by $\mathrm{Lip}(s)$ the space of all Lipschitz-$s$ functions with the norm
\[ \|f\|_{\mathrm{Lip}(s)} := |f|_{\mathrm{Lip}(s)} + \|f\|_{C(X)}, \]
where $|\cdot|_{\mathrm{Lip}(s)}$ is the seminorm
\[ |f|_{\mathrm{Lip}(s)} := \sup_{x\ne y\in X}\frac{|f(x) - f(y)|}{(d(x,y))^s}. \]
This is a Banach space. The regularity of a function $f \in \mathrm{Lip}(s)$ is measured by the index $s$. The bigger the index $s$, the higher the regularity of $f$.

(II) When $X$ is a subset of a Euclidean space $\mathbb{R}^n$, we can consider a more general measure of regularity. This can be done by means of various orders of divided differences.
Let $X$ be a closed subset of $\mathbb{R}^n$ and $f: X \to \mathbb{R}$. For $r \in \mathbb{N}$, $t \in \mathbb{R}^n$, and $x \in X$ such that $x, x + t, \ldots, x + rt \in X$, define the divided difference
\[ \Delta_t^r f(x) := \sum_{j=0}^{r}\binom{r}{j}(-1)^{r-j}f(x + jt). \]
In particular, when $r = 1$, $\Delta_t^1 f(x) = f(x + t) - f(x)$. Divided differences can be used to characterize various types of function spaces. Let
\[ X_{r,t} = \{x \in X \mid x, x + t, \ldots, x + rt \in X\}. \]
For $0 < s < r$ and $1 \le p < \infty$, the generalized Lipschitz space $\mathrm{Lip}^*(s, L^p(X))$ consists of functions $f$ in $L^p(X)$ for which the seminorm
\[ |f|_{\mathrm{Lip}^*(s,L^p(X))} := \sup_{t\in\mathbb{R}^n}\|t\|^{-s}\Big\{\int_{X_{r,t}}|\Delta_t^r f(x)|^p\,dx\Big\}^{1/p} \]
is finite. This is a Banach space with the norm
\[ \|f\|_{\mathrm{Lip}^*(s,L^p(X))} := |f|_{\mathrm{Lip}^*(s,L^p(X))} + \|f\|_{L^p(X)}. \]
The space $\mathrm{Lip}^*(s, C(X))$ is defined in a similar way, taking
\[ |f|_{\mathrm{Lip}^*(s,C(X))} := \sup_{t\in\mathbb{R}^n}\|t\|^{-s}\sup_{x\in X_{r,t}}|\Delta_t^r f(x)| \]
and
\[ \|f\|_{\mathrm{Lip}^*(s,C(X))} := |f|_{\mathrm{Lip}^*(s,C(X))} + \|f\|_{C(X)}. \]
Clearly, $\|f\|_{\mathrm{Lip}^*(s,L^p(X))} \le (\mu(X))^{1/p}\|f\|_{\mathrm{Lip}^*(s,C(X))}$ for all $p < \infty$ when $X$ is compact.
When $X$ has piecewise smooth boundary, each function $f \in \mathrm{Lip}^*(s, L^p(X))$ can be extended to $\mathbb{R}^n$. If we still denote this extension by $f$, then there exists a constant $C_{X,s,p}$ depending on $X$, $s$, and $p$ such that for all $f \in \mathrm{Lip}^*(s, L^p(X))$,
\[ \|f\|_{\mathrm{Lip}^*(s,L^p(\mathbb{R}^n))} \le C_{X,s,p}\|f\|_{\mathrm{Lip}^*(s,L^p(X))}. \]
When $p = \infty$, we write $\|f\|_{\mathrm{Lip}^*(s)}$ instead of $\|f\|_{\mathrm{Lip}^*(s,C(X))}$. In this case, under some mild regularity condition for $X$ (e.g., when the boundary of $X$ is piecewise smooth), for $s$ not an integer, $s = \ell + s_0$ with $\ell \in \mathbb{N}$, and $0 < s_0 < 1$, $\mathrm{Lip}^*(s, C(X))$ consists of continuous functions on $X$ such that $D^\alpha f \in \mathrm{Lip}(s_0)$ for any $\alpha = (\alpha_1, \ldots, \alpha_n) \in \mathbb{N}^n$ with $|\alpha| \le \ell$. This holds, in particular, if $X = [0,1]^n$ or $\mathbb{R}^n$. In this case, $\mathrm{Lip}^*(s, C(X)) = C^s(X)$ when $s$ is not an integer, and $C^s(X) \subset \mathrm{Lip}^*(s, C(X))$ when $s$ is an integer. Also, $C^2(X) \subset \mathrm{Lip}(1) \subset \mathrm{Lip}^*(1, C(X))$, the last being known as the Zygmund class.
Again, the regularity of a function f e Lip*(s, Lp(X)) is measured by the index s. The bigger
the index s, the higher the regularity of f.
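The $r$th order difference used in (II) is easy to compute. The sketch below treats only the one-dimensional case and is meant as an illustration; the function name is ours.

```python
from math import comb

def divided_difference(f, r, t, x):
    """Delta_t^r f(x) = sum_{j=0}^{r} C(r, j) (-1)^(r-j) f(x + j t), one-dimensional case."""
    return sum(comb(r, j) * (-1) ** (r - j) * f(x + j * t) for j in range(r + 1))

def square(u):
    """Test function u -> u^2."""
    return u * u
```

For $r = 1$ this reduces to $f(x+t) - f(x)$. For the square function, the second-order difference equals $2t^2$ for every $x$, and the third-order difference vanishes, reflecting that differences of order $r$ annihilate polynomials of degree less than $r$.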

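(III) When $p = 2$ and $X = \mathbb{R}^n$, it is also natural to measure the regularity of functions in $L^2(\mathbb{R}^n)$ by means of the Fourier transform. Let $s > 0$. The fractional Sobolev space $H^s(\mathbb{R}^n)$ consists of functions in $L^2(\mathbb{R}^n)$ such that the following norm is finite:
\[ \|f\|_{H^s(\mathbb{R}^n)} := \Big\{\frac{1}{(2\pi)^n}\int_{\mathbb{R}^n}(1 + \|\xi\|^2)^s\,|\hat{f}(\xi)|^2\,d\xi\Big\}^{1/2}. \]
When $s \in \mathbb{N}$, $H^s(\mathbb{R}^n)$ coincides with the Sobolev space defined in Section 2.3.
Note that for a function $f \in H^s(\mathbb{R}^n)$, its regularity $s$ is tied to the decay of $\hat{f}$. The larger $s$ is, the faster the decay of $\hat{f}$ is. These subspaces of $L^2(\mathbb{R}^n)$ and those described in (II) are related as follows. For any $\varepsilon > 0$,
\[ H^s(\mathbb{R}^n) \subset \mathrm{Lip}^*(s, L^2(\mathbb{R}^n)) \subset H^{s-\varepsilon}(\mathbb{R}^n). \]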

Covering numbers for Sobolev smooth kernels

Recall that if $E$ is a Banach space and $R > 0$, we denote $B_R(E) = \{x \in E : \|x\| \le R\}$. If the space $E$ is clear from the context, we simply write $B_R$.
Lemma 5.2 Let $E \subset C(X)$ be a Banach space. For all $\eta, R > 0$,
\[ \mathcal{N}(B_R, \eta) = \mathcal{N}(B_1, \eta/R). \]
Proof. The proof follows from the fact that $\{B(f_1, \eta), \ldots, B(f_k, \eta)\}$ is a covering of $B_R$ if and only if $\{B(f_1/R, \eta/R), \ldots, B(f_k/R, \eta/R)\}$ is a covering of $B_1$.
It follows from Lemma 5.2 that it is enough to estimate covering numbers of the unit ball. We start with balls in finite-dimensional spaces.
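Theorem 5.3 Let $E$ be a finite-dimensional Banach space, $N = \dim E$, and $R > 0$. For $0 < \eta < R$,
\[ \mathcal{N}(B_R, \eta) \le \Big(\frac{2R}{\eta} + 1\Big)^N \tag{5.1} \]
and, for $\eta \ge R$, $\mathcal{N}(B_R, \eta) = 1$.
Proof. Choose a basis $\{e_j\}_{j=1}^{N}$ of $E$. Define a norm $|\cdot|$ on $\mathbb{R}^N$ by
\[ |x| := \Big\|\sum_{j=1}^{N} x_je_j\Big\|, \qquad x = (x_1, \ldots, x_N) \in \mathbb{R}^N. \]
Let $\eta > 0$. Suppose that $\mathcal{N}(B_R, \eta) > ((2R/\eta) + 1)^N$. Then $B_R$ cannot be covered by $((2R/\eta) + 1)^N$ balls with radius $\eta$. Hence we can find elements $f^{(1)}, \ldots, f^{(\ell)}$ in $B_R$ such that
\[ \ell > \Big(\frac{2R}{\eta} + 1\Big)^N \qquad\text{and}\qquad f^{(j)} \notin \bigcup_{i=1}^{j-1}B(f^{(i)}, \eta), \quad \forall j \in \{2, \ldots, \ell\}. \]
Therefore, for $i \ne j$, $\|f^{(i)} - f^{(j)}\| > \eta$.
Set $f^{(j)} = \sum_{m=1}^{N} x_m^{(j)}e_m \in B_R$ and $x^{(j)} = (x_1^{(j)}, \ldots, x_N^{(j)}) \in \mathbb{R}^N$. Then, for $i \ne j$,
\[ |x^{(i)} - x^{(j)}| > \eta. \]
Also, $|x^{(j)}| \le R$.
Denote by $B_r$ the ball of radius $r > 0$ centered on the origin in $(\mathbb{R}^N, |\cdot|)$. Then
\[ \bigcup_{j=1}^{\ell}\Big\{x^{(j)} + \frac{\eta}{2}B_1\Big\} \subseteq B_{R+\eta/2}, \]
and the sets in this union are disjoint. Therefore, if $\mu$ denotes the Lebesgue measure on $\mathbb{R}^N$,
\[ \mu\Big(\bigcup_{j=1}^{\ell}\Big\{x^{(j)} + \frac{\eta}{2}B_1\Big\}\Big) = \sum_{j=1}^{\ell}\mu\Big(x^{(j)} + \frac{\eta}{2}B_1\Big) \le \mu\big(B_{R+\eta/2}\big). \]
It follows that
\[ \ell\Big(\frac{\eta}{2}\Big)^N\mu(B_1) \le \Big(R + \frac{\eta}{2}\Big)^N\mu(B_1), \]
and thereby
\[ \ell \le \Big(\frac{R + \eta/2}{\eta/2}\Big)^N = \Big(\frac{2R}{\eta} + 1\Big)^N. \]
This is a contradiction. Therefore we must have, for all $\eta > 0$,
\[ \mathcal{N}(B_R, \eta) \le \Big(\frac{2R}{\eta} + 1\Big)^N. \]
If, in addition, $\eta \ge R$, then $B_R$ can be covered by the ball with radius $\eta$ centered on the origin and hence $\mathcal{N}(B_R, \eta) = 1$.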
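To get a feeling for the size of the bound in Theorem 5.3, one can compare it with an explicit covering. The sketch below is an illustration under an assumption of our own: the norm on $\mathbb{R}^N$ is taken to be the sup-norm, for which a product grid gives a covering of $B_R$ by $\lceil R/\eta\rceil^N$ balls of radius $\eta$. The function names are hypothetical.

```python
import math
import random

def volumetric_bound(R, eta, N):
    """Upper bound (2R/eta + 1)^N from Theorem 5.3 (for 0 < eta < R)."""
    return (2.0 * R / eta + 1.0) ** N

def grid_centers_1d(R, eta):
    """Centers of k = ceil(R/eta) intervals of half-width eta that cover [-R, R]."""
    k = math.ceil(R / eta)
    return [-R + eta * (2 * i + 1) for i in range(k)]

def nearest_center_distance(point, centers_1d):
    """Sup-norm distance from `point` to the nearest center of the product grid."""
    return max(min(abs(c - p) for c in centers_1d) for p in point)
```

With `R = 1`, `eta = 0.3`, and `N = 3`, the grid uses $4^3 = 64$ balls, well below the bound $(2R/\eta + 1)^N$, and every point of the cube $[-R, R]^N$ lies within sup-norm distance $\eta$ of some grid center.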

The study of covering numbers is a standard topic in the field of function spaces. The asymptotic behavior of the covering numbers for Sobolev spaces is a well-known result. For example, the ball $B_R(\mathrm{Lip}^*(s, C([0,1]^n)))$ in the generalized Lipschitz space on $[0,1]^n$ satisfies
\[ C_s\Big(\frac{R}{\eta}\Big)^{n/s} \le \ln\mathcal{N}\big(B_R(\mathrm{Lip}^*(s, C([0,1]^n))), \eta\big) \le C_s'\Big(\frac{R}{\eta}\Big)^{n/s}, \tag{5.2} \]
where the positive constants $C_s$ and $C_s'$ depend only on $s$ and $n$ (i.e., they are independent of $R$ and $\eta$).
We will not prove the bound (5.2) here in all its generality (references for a proof can be found in Section 5.6). However, to give an idea of the methods involved in such a proof, we deal next with the special case $0 < s < 1$ and $n = 1$. Recall that in this case, $\mathrm{Lip}^*(s, C([0,1])) = \mathrm{Lip}(s)$. Since $B_1(\mathrm{Lip}(s)) \subset B_1(C([0,1]))$, we have $\mathcal{N}(B_1(\mathrm{Lip}(s)), \eta) = 1$ for all $\eta \ge 1$.
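Proposition 5.4 Let $0 < s < 1$ and $X = [0,1]$. Then, for all $0 < \eta \le \frac{1}{4}$,
\[ \frac{1}{8}\Big(\frac{1}{2\eta}\Big)^{1/s} \le \ln\mathcal{N}(B_1(\mathrm{Lip}(s)), \eta) \le 4\Big(\frac{4}{\eta}\Big)^{1/s}. \]
The restriction $\eta \le \frac{1}{4}$ is required only for the lower bound.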
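Proof. We first deal with the upper bound. Set $\varepsilon = (\eta/4)^{1/s}$. Define $\mathbf{x} = \{x_i = i\varepsilon\}_{i=1}^{d}$, where $d = \lfloor 1/\varepsilon\rfloor$ denotes the integer part of $1/\varepsilon$. Then $\mathbf{x}$ is an $\varepsilon$-net of $X = [0,1]$ (i.e., for all $x \in X$, the distance from $x$ to $\mathbf{x}$ is at most $\varepsilon$). If $f \in B_1(\mathrm{Lip}(s))$, then $\|f\|_{C(X)} \le 1$ and $-1 \le f(x_i) \le 1$ for all $i = 1, \ldots, d$. Hence, $(\nu_i - 1)\frac{\eta}{2} \le f(x_i) \le \nu_i\frac{\eta}{2}$ for some $\nu_i \in J := \{-m + 1, \ldots, m\}$, where $m$ is the smallest integer greater than $\frac{2}{\eta}$. For $\nu = (\nu_1, \ldots, \nu_d) \in J^d$ define
\[ V_\nu := \Big\{f \in B_1(\mathrm{Lip}(s)) \ \Big|\ (\nu_i - 1)\frac{\eta}{2} \le f(x_i) \le \nu_i\frac{\eta}{2} \text{ for } i = 1, \ldots, d\Big\}. \]
Then $B_1(\mathrm{Lip}(s)) \subseteq \bigcup_{\nu\in J^d}V_\nu$. If $f, g \in V_\nu$, then, for each $i \in \{1, \ldots, d\}$,
\[ \max_{|x - x_i|\le\varepsilon}|f(x) - g(x)| \le |f(x_i) - g(x_i)| + \max_{|x - x_i|\le\varepsilon}|f(x) - f(x_i)| + \max_{|x - x_i|\le\varepsilon}|g(x) - g(x_i)| \le \frac{\eta}{2} + 2\varepsilon^s = \eta. \]
Therefore, $V_\nu$ has diameter at most $\eta$ as a subset of $C(X)$. That is, $\{V_\nu\}_{\nu\in J^d}$ is an $\eta$-covering of $B_1(\mathrm{Lip}(s))$. What is left is to count nonempty sets $V_\nu$.
If $V_\nu$ is nonempty, then $V_\nu$ contains some function $f \in B_1(\mathrm{Lip}(s))$. Since $-\frac{\eta}{2} \le f(x_{i+1}) - \nu_{i+1}\frac{\eta}{2} \le 0$ and $-\frac{\eta}{2} \le f(x_i) - \nu_i\frac{\eta}{2} \le 0$, we have
\[ -\frac{\eta}{2} \le f(x_{i+1}) - f(x_i) - (\nu_{i+1} - \nu_i)\frac{\eta}{2} \le \frac{\eta}{2}. \]
It follows that for each $i = 1, \ldots, d - 1$,
\[ |\nu_{i+1} - \nu_i|\frac{\eta}{2} - \frac{\eta}{2} \le |f(x_{i+1}) - f(x_i)| \le |x_{i+1} - x_i|^s \le \varepsilon^s. \]
This yields $|\nu_{i+1} - \nu_i| \le \frac{2}{\eta}\big(\varepsilon^s + \frac{\eta}{2}\big) = \frac{3}{2}$ and then $\nu_{i+1} \in \{\nu_i - 1, \nu_i, \nu_i + 1\}$. Since $\nu_1$ has $2m$ possible values, the number of nonempty $V_\nu$ is at most $2m \cdot 3^{d-1}$. Therefore,
\[ \ln\mathcal{N}(B_1(\mathrm{Lip}(s)), \eta) \le \ln(2m \cdot 3^{d-1}) = \ln 2 + (d - 1)\ln 3 + \ln m \le \ln 2 + \frac{1}{\varepsilon}\ln 3 + \ln\Big(\frac{2}{\eta} + 1\Big). \]
But $\ln(1 + t) \le t$ for all $t \ge 0$, so we have
\[ \ln\mathcal{N}(B_1(\mathrm{Lip}(s)), \eta) \le \ln 2 + \Big(\frac{4}{\eta}\Big)^{1/s}\ln 3 + \frac{2}{\eta}, \]
and hence
\[ \ln\mathcal{N}(B_1(\mathrm{Lip}(s)), \eta) \le 4\Big(\frac{4}{\eta}\Big)^{1/s}. \]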
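The counting step of the upper bound is easy to evaluate numerically. The sketch below is an illustration only: it computes the count $2m \cdot 3^{d-1}$ with $\varepsilon = (\eta/4)^{1/s}$, $d = \lfloor 1/\varepsilon\rfloor$, and $m$ the smallest integer greater than $2/\eta$, and compares its logarithm with $4(4/\eta)^{1/s}$. The function names are ours.

```python
import math

def lipschitz_cover_count(eta, s):
    """Count 2 * m * 3^(d-1) of possibly nonempty cells V_nu in the proof.

    eps = (eta/4)^(1/s), d = floor(1/eps), m = smallest integer greater than 2/eta.
    """
    eps = (eta / 4.0) ** (1.0 / s)
    d = math.floor(1.0 / eps)
    m = math.floor(2.0 / eta) + 1
    return 2 * m * 3 ** (d - 1)

def stated_upper_bound(eta, s):
    """Right-hand side 4 * (4/eta)^(1/s) of the upper bound."""
    return 4.0 * (4.0 / eta) ** (1.0 / s)
```

The count is an exact Python integer, so `math.log` can be applied to it directly even when $d$ is in the thousands.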

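We now prove the lower bound. Set $\varepsilon = (2\eta)^{1/s}$ and $\mathbf{x}$ as above. For $i = 1, \ldots, d - 1$, define $f_i$ to be the hat function of height $\frac{\eta}{2}$ on the interval $[x_i - \varepsilon, x_i + \varepsilon]$; that is,
\[ f_i(x_i + t) = \begin{cases} \dfrac{\eta}{2} - \dfrac{\eta}{2\varepsilon}t & \text{if } 0 \le t < \varepsilon, \\[4pt] \dfrac{\eta}{2} + \dfrac{\eta}{2\varepsilon}t & \text{if } -\varepsilon \le t < 0, \\[4pt] 0 & \text{if } t \notin [-\varepsilon, \varepsilon]. \end{cases} \]
Note that $f_i(x_j) = \frac{\eta}{2}\delta_{ij}$.
For every nonempty subset $I$ of $\{1, \ldots, d - 1\}$ we define
\[ f_I(x) = \sum_{i\in I}f_i(x). \]
If $I_1 \ne I_2$, there is some $i \in (I_1 \setminus I_2) \cup (I_2 \setminus I_1)$ and
\[ \|f_{I_1} - f_{I_2}\|_{C(X)} \ge |f_{I_1}(x_i) - f_{I_2}(x_i)| = \frac{\eta}{2}. \]
It follows that $\mathcal{N}(B_1(\mathrm{Lip}(s)), \eta)$ is at least the number of nonempty subsets of $\{1, \ldots, d - 1\}$ (i.e., $2^{d-1} - 1$), provided that each $f_I$ lies in $B_1(\mathrm{Lip}(s))$. Let us prove that this is the case.
Observe that $f_I$ is piecewise linear on each $[x_i, x_{i+1}]$ and its values on $x_i$ are either $\frac{\eta}{2}$ or 0. Hence $\|f_I\|_{C(X)} \le \frac{\eta}{2} \le \frac{1}{2}$. To evaluate the Lipschitz-$s$ seminorm of $f_I$ we take $x, x + t \in X$ with $t > 0$. If $t \ge \varepsilon$, then
\[ \frac{|f_I(x + t) - f_I(x)|}{t^s} \le \frac{2\|f_I\|_{C(X)}}{\varepsilon^s} \le \frac{\eta}{\varepsilon^s} = \frac{1}{2}. \]
If $t < \varepsilon$ and $x_i \notin (x, x + t)$ for all $i \le d - 1$, then $f_I$ is linear on $[x, x + t]$ with slope at most $\frac{\eta}{2\varepsilon}$ and hence
\[ \frac{|f_I(x + t) - f_I(x)|}{t^s} \le \frac{(\eta/2\varepsilon)t}{t^s} = \frac{\eta}{2\varepsilon}t^{1-s} \le \frac{\eta}{2}\varepsilon^{-s} = \frac{1}{4}. \]
If $t < \varepsilon$ and $x_i \in (x, x + t)$ for some $i \le d - 1$, then
\[ |f_I(x + t) - f_I(x)| \le |f_I(x + t) - f_I(x_i)| + |f_I(x_i) - f_I(x)| \le \frac{\eta}{2\varepsilon}(x + t - x_i) + \frac{\eta}{2\varepsilon}(x_i - x) = \frac{\eta}{2\varepsilon}t \]
and hence
\[ \frac{|f_I(x + t) - f_I(x)|}{t^s} \le \frac{\eta}{2\varepsilon}t^{1-s} \le \frac{\eta}{2}\varepsilon^{-s} = \frac{1}{4}. \]
Thus, in all three cases, $|f_I|_{\mathrm{Lip}(s)} \le \frac{1}{2}$, and therefore $\|f_I\|_{\mathrm{Lip}(s)} \le 1$. This shows $f_I \in B_1(\mathrm{Lip}(s))$.
Finally, since $\eta \le \frac{1}{4}$, we have $\varepsilon \le \frac{1}{2}$ and $d \ge 2$. It follows that $\mathcal{N}(B_1(\mathrm{Lip}(s)), \eta) \ge 2^{d-1} - 1 \ge 2^{d-2} \ge 2^{(1/2\varepsilon)-2}$, which implies
\[ \ln\mathcal{N}(B_1(\mathrm{Lip}(s)), \eta) \ge \frac{1}{2}\ln 2\,\Big(\frac{1}{2\eta}\Big)^{1/s} - \ln 4. \]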
Now we can give some upper bounds for the covering number of balls in RKHSs. The
bounds depend on the regularity of the Mercer kernel. When the kernel K has Sobolev or
generalized Lipschitz regularity, we can show that the RKHS HK can be embedded into a
generalized Lipschitz space. Then an estimate for the covering number follows.
Theorem 5.5 Let X be a closed subset of Rn, and K: X x X ^ R be a Mercer kernel. If s
> 0 and K e L i p*(s, C(X x X)), then HK c Li p*(2, C(X)) and, for allr e N, r > s,
f Hup*(s/2) <7 2r+lWK Li p*(s) II f IlK, Yf 6 HK .
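Proof. Let $s < r \in \mathbb{N}$ and $f \in H_K$. Let $x, t \in \mathbb{R}^n$ be such that $x, x+t, \dots, x+rt \in X$. By the reproducing property,
$$\Delta_t^r f(x) = \sum_{j=0}^{r} \binom{r}{j} (-1)^{r-j} \langle K_{x+jt}, f \rangle_K = \Big\langle \sum_{j=0}^{r} \binom{r}{j} (-1)^{r-j} K_{x+jt},\, f \Big\rangle_K.$$
It follows from the Cauchy-Schwarz inequality that
$$|\Delta_t^r f(x)| \le \|f\|_K \Big\{ \sum_{j=0}^{r} \binom{r}{j} (-1)^{r-j} \sum_{i=0}^{r} \binom{r}{i} (-1)^{r-i} K(x+jt, x+it) \Big\}^{1/2} = \|f\|_K \Big\{ \sum_{j=0}^{r} \binom{r}{j} (-1)^{r-j} \Delta^r_{(0,t)} K(x+jt, x) \Big\}^{1/2}.$$
Here $(0, t)$ denotes the vector in $\mathbb{R}^{2n}$ where the first $n$ components are zero.
By hypothesis, $K \in \mathrm{Lip}^*(s, C(X \times X))$. Hence
$$\big|\Delta^r_{(0,t)} K(x+jt, x)\big| \le |K|_{\mathrm{Lip}^*(s)} \|t\|^s.$$
This yields
$$|\Delta_t^r f(x)| \le \|f\|_K \Big\{ \sum_{j=0}^{r} \binom{r}{j} |K|_{\mathrm{Lip}^*(s)} \|t\|^s \Big\}^{1/2} \le \sqrt{2^r |K|_{\mathrm{Lip}^*(s)}}\; \|f\|_K \|t\|^{s/2}.$$
Therefore,
$$|f|_{\mathrm{Lip}^*(s/2)} \le \sqrt{2^r |K|_{\mathrm{Lip}^*(s)}}\; \|f\|_K.$$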

Combining this inequality with the fact (cf. Theorem 2.9) that

5.3 Covering numbers for analytic kernels

$\|f\|_\infty \le \sqrt{\|K\|_\infty}\, \|f\|_K$, we conclude that $f \in \mathrm{Lip}^*(s/2, C(X))$ and
$$\|f\|_{\mathrm{Lip}^*(s/2)} \le \sqrt{2^{r+1} \|K\|_{\mathrm{Lip}^*(s)}}\; \|f\|_K.$$
The proof of the theorem is complete.

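Lemma 5.6 Let $n \in \mathbb{N}$, $s > 0$, $D \ge 1$, and $x^* \in \mathbb{R}^n$. If $X \subseteq x^* + D[0,1]^n$ and $X$ has piecewise smooth boundary then
$$N\big(B_R(\mathrm{Lip}^*(s, C(X))), \eta\big) \le N\big(B_{C_{X,s} D^s R}(\mathrm{Lip}^*(s, C([0,1]^n))), \eta/2\big),$$
where $C_{X,s}$ is a constant depending only on $X$ and $s$.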
Proof. For $f \in C([0,1]^n)$ define $f^* \in C(x^* + D[0,1]^n)$ by
$$f^*(x) = f\Big(\frac{x - x^*}{D}\Big).$$
Then $f^* \in \mathrm{Lip}^*(s, C(x^* + D[0,1]^n))$ if and only if $f \in \mathrm{Lip}^*(s, C([0,1]^n))$. Moreover,
$$D^{-s} \|f\|_{\mathrm{Lip}^*(s, C([0,1]^n))} \le \|f^*\|_{\mathrm{Lip}^*(s, C(x^* + D[0,1]^n))} \le \|f\|_{\mathrm{Lip}^*(s, C([0,1]^n))}.$$
Since $X \subseteq x^* + D[0,1]^n$ and $X$ has piecewise smooth boundary, there is a constant $C_{X,s}$ depending only on $X$ and $s$ such that
$$B_R(\mathrm{Lip}^*(s, C(X))) \subseteq \big\{ f^*|_X : f^* \in B_{C_{X,s} R}(\mathrm{Lip}^*(s, C(x^* + D[0,1]^n))) \big\} \subseteq \big\{ f^*|_X : f \in B_{C_{X,s} D^s R}(\mathrm{Lip}^*(s, C([0,1]^n))) \big\}.$$
Let $\{f_1, \dots, f_N\}$ be an $\eta/2$-net of $B_{C_{X,s} D^s R}(\mathrm{Lip}^*(s, C([0,1]^n)))$ and $N$ its covering number. For each $j = 1, \dots, N$ take a function $g_j^*|_X \in B_R(\mathrm{Lip}^*(s, C(X)))$ with $g_j \in B_{C_{X,s} D^s R}(\mathrm{Lip}^*(s, C([0,1]^n)))$ and $\|g_j - f_j\|_{\mathrm{Lip}^*(s, C([0,1]^n))} \le \eta/2$ if it exists. Then $\{g_j^*|_X : j = 1, \dots, N\}$ provides an $\eta$-net of $B_R(\mathrm{Lip}^*(s, C(X)))$: each $f \in B_R(\mathrm{Lip}^*(s, C(X)))$ can be written as the restriction $g^*|_X$ to $X$ of some function $g \in B_{C_{X,s} D^s R}(\mathrm{Lip}^*(s, C([0,1]^n)))$, so there is some $j$ such that $\|g - f_j\|_{\mathrm{Lip}^*(s, C([0,1]^n))} \le \eta/2$. This implies that
$$\|f - g_j^*|_X\|_{\mathrm{Lip}^*(s, C(X))} \le \|g^* - g_j^*\|_{\mathrm{Lip}^*(s, C(x^* + D[0,1]^n))} \le \|g - g_j\|_{\mathrm{Lip}^*(s, C([0,1]^n))} \le \eta.$$
This proves the statement.
Proof of Theorem 5.1(i) Recall that $C^s(X) \subset \mathrm{Lip}^*(s, C(X))$ for any $s > 0$. Then, by Theorem 5.5 with $s < r \le s + 1$, the assumption $K \in C^s(X \times X)$ implies that
$$I_K(B_R) \subseteq B_{\sqrt{2^{s+2} \|K\|_{\mathrm{Lip}^*(s)}}\, R}\big(\mathrm{Lip}^*(s/2, C(X))\big).$$
This, together with (5.2) and Lemma 5.6 with $D \ge \mathrm{Diam}(X)$, shows Theorem 5.1(i).
When $s$ is not an integer, and the boundary of $X$ is piecewise smooth, $C^s(X) = \mathrm{Lip}^*(s, C(X))$. As a corollary of Theorem 5.5, we have the following.
Proposition 5.7 Let $X$ be a closed subset of $\mathbb{R}^n$ with piecewise smooth boundary, and $K: X \times X \to \mathbb{R}$ a Mercer kernel. If $s > 0$ is not an even integer and $K \in C^s(X \times X)$, then $H_K \subset C^{s/2}(X)$ and
$$\|f\|_{\mathrm{Lip}^*(s/2)} \le \sqrt{2^{r+1} \|K\|_{\mathrm{Lip}^*(s)}}\; \|f\|_K, \qquad \forall f \in H_K.$$

Theorem 5.5 and the upper bound in (5.2) yield upper-bound estimates for the covering
numbers of RKHSs when the Mercer kernel has Sobolev regularity.
Theorem 5.8 Let $X$ be a closed subset of $\mathbb{R}^n$ with piecewise smooth boundary, and $K: X \times X \to \mathbb{R}$ a Mercer kernel. Let $s > 0$ be such that $K$ belongs to $\mathrm{Lip}^*(s, C(X \times X))$. Then, for all $0 < \eta \le R$,
$$\ln N(I_K(B_R), \eta) \le C \Big(\frac{R}{\eta}\Big)^{2n/s},$$
where $C$ is a constant independent of $R$ and $\eta$.

It is natural to expect covering numbers to have smaller upper bounds when the kernel is
analytic, a regularity stronger than Sobolev smoothness. Proving this is our next step.


Covering numbers for analytic kernels

In this section we continue our discussion of the covering numbers of balls of RKHSs and
provide better estimates for analytic kernels. We consider a convolution kernel $K$ given by
$K(x, t) = k(x - t)$, where $k$ is an even function in $L^2(\mathbb{R}^n)$ and $\hat{k}(\xi) > 0$ almost everywhere on
$\mathbb{R}^n$. Let $X = [0,1]^n$. Then $K$ is a Mercer kernel on $X$. Our purpose here is to bound the covering
number $N(I_K(B_R), \eta)$ when $k$ is analytic.
We will use the Lagrange interpolation polynomials. Denote by $\Pi_s(\mathbb{R})$ the space of real
polynomials in one variable of degree at most $s$. Let $t_0, \dots, t_s \in \mathbb{R}$ be different. We say that $w_{l,s} \in \Pi_s(\mathbb{R})$, $l = 0, \dots, s$, are the Lagrange interpolation polynomials with interpolating points
$\{t_0, \dots, t_s\}$ when $\sum_{l=0}^{s} w_{l,s}(t) \equiv 1$ and
$$w_{l,s}(t_m) = \delta_{l,m}, \qquad l, m \in \{0, 1, \dots, s\}.$$


5 Estimating covering numbers

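It is easy to check that
$$w_{l,s}(t) = \prod_{j \in \{0,1,\dots,s\} \setminus \{l\}} \frac{t - t_j}{t_l - t_j}$$
satisfy these conditions.
We consider the set of interpolating points $\{0, 1/s, 2/s, \dots, 1\}$ and univariate functions
$\{w_{l,s}(t)\}_{l=0}^{s}$ defined by
$$w_{l,s}(t) := \sum_{j=l}^{s} \frac{st(st-1)\cdots(st-j+1)}{j!} \binom{j}{l} (-1)^{j-l}.$$
Since
$$\sum_{l=0}^{j} \binom{j}{l} (-1)^{j-l} z^l = (z-1)^j,$$
the following functions of the variable $z$ are equal:
$$\sum_{l=0}^{s} w_{l,s}(t)\, z^l = \sum_{j=0}^{s} \frac{st(st-1)\cdots(st-j+1)}{j!} (z-1)^j. \qquad (5.4)$$
In particular, $\sum_{l=0}^{s} w_{l,s}(t) \equiv 1$. In addition, it can be easily checked that
$$w_{l,s}(m/s) = \delta_{l,m}, \qquad l, m \in \{0, 1, \dots, s\}.$$
This means that the $w_{l,s}$ are the Lagrange interpolation polynomials, and hence
$$w_{l,s}(t) = \prod_{j \in \{0,1,\dots,s\} \setminus \{l\}} \frac{t - j/s}{l/s - j/s} = \prod_{j \in \{0,1,\dots,s\} \setminus \{l\}} \frac{st - j}{l - j}.$$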
The norm of these polynomials (as elements in C([0,1])) can be estimated as follows:
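Lemma 5.9 Let $s \in \mathbb{N}$, $l \in \{0, 1, \dots, s\}$. Then, for all $t \in [0,1]$,
$$|w_{l,s}(t)| \le s \binom{s}{l}.$$
Proof. Let $m \in \{0, 1, \dots, s-1\}$ and $st \in (m, m+1)$. Then for $l \in \{0, 1, \dots, m-1\}$,
$$|w_{l,s}(t)| = \frac{\big|\prod_{j=0}^{l-1}(st-j) \prod_{j=l+1}^{m}(st-j) \prod_{j=m+1}^{s}(st-j)\big|}{l!\,(s-l)!} \le \frac{(m+1)!\,(s-m)!}{(st-l)\, l!\,(s-l)!} \le s \binom{s}{l},$$
where the last step uses $st - l > 1$ and $(m+1)!\,(s-m)! \le (m+1)\, s! \le s \cdot s!$.
When $l \in \{m+1, \dots, s\}$,
$$|w_{l,s}(t)| = \frac{\big|\prod_{j=0}^{m}(st-j) \prod_{j=m+1}^{l-1}(st-j) \prod_{j=l+1}^{s}(st-j)\big|}{l!\,(s-l)!} \le \frac{(m+1)!\,(s-m)!}{(l-m)\, l!\,(s-l)!} \le s \binom{s}{l}.$$
The case $l = m$ can be dealt with in the same way.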
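The properties of the polynomials $w_{l,s}$ used above can be checked numerically. The following is a minimal sketch, assuming the product formula for the equally spaced nodes $0, 1/s, \dots, 1$; the helper name `lagrange_w`, the choice $s = 6$, and the sampling grid are illustrative assumptions, not part of the text.

```python
from math import comb, isclose

def lagrange_w(l, s, t):
    """w_{l,s}(t) for the nodes 0, 1/s, ..., 1, via the product formula."""
    value = 1.0
    for j in range(s + 1):
        if j != l:
            value *= (s * t - j) / (l - j)
    return value

s = 6                                   # illustrative degree
grid = [i / 200 for i in range(201)]    # sample points in [0, 1]

# Partition of unity: the w_{l,s} sum to 1 at every sampled t.
partition_ok = all(
    isclose(sum(lagrange_w(l, s, t) for l in range(s + 1)), 1.0, abs_tol=1e-9)
    for t in grid
)

# Interpolation property: w_{l,s}(m/s) is 1 if l == m and 0 otherwise.
delta_ok = all(
    isclose(lagrange_w(l, s, m / s), 1.0 if l == m else 0.0, abs_tol=1e-9)
    for l in range(s + 1) for m in range(s + 1)
)

# Lemma 5.9: |w_{l,s}(t)| <= s * C(s, l) on the sampled points of [0, 1].
bound_ok = all(
    abs(lagrange_w(l, s, t)) <= s * comb(s, l) + 1e-12
    for l in range(s + 1) for t in grid
)

print(partition_ok, delta_ok, bound_ok)
```

A grid check of this kind only samples $[0,1]$; it illustrates the lemma but does not replace the proof.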

We now turn to the multivariate case. Denote $X_N := \{0, 1, \dots, N\}^n$. The multivariate
polynomials $\{w_{\alpha,N}(x)\}_{\alpha \in X_N}$ are defined as
$$w_{\alpha,N}(x) = \prod_{j=1}^{n} w_{\alpha_j,N}(x_j), \qquad x = (x_1, \dots, x_n), \quad \alpha = (\alpha_1, \dots, \alpha_n). \qquad (5.6)$$
We use the polynomials in (5.6) as a family of multivariate polynomials, not as interpolation
polynomials any more. For these polynomials, we have the following result.
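Lemma 5.10 Let $x \in [0,1]^n$ and $N \in \mathbb{N}$. Then
$$\sum_{\alpha \in X_N} |w_{\alpha,N}(x)| \le (N 2^N)^n \qquad (5.7)$$
and, for $\theta \in [-\tfrac12, \tfrac12]^n$,
$$\Big| e^{-i\theta \cdot Nx} - \sum_{\alpha \in X_N} w_{\alpha,N}(x)\, e^{-i\theta \cdot \alpha} \Big| \le n \Big(1 + \frac{1}{2^N}\Big)^{n-1} \max_{1 \le j \le n} |\theta_j|^N. \qquad (5.8)$$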
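Proof. The bound (5.7) follows directly from Lemma 5.9.
To derive the second bound (5.8), we first consider the univariate case. Let $t \in [0,1]$. Then
the univariate function $z^{Nt}$ is analytic on the region $|z - 1| \le \tfrac12$. On this region,
$$z^{Nt} = (1 + (z-1))^{Nt} = \sum_{j=0}^{\infty} \frac{Nt(Nt-1)\cdots(Nt-j+1)}{j!} (z-1)^j.$$
It follows that for $\eta \in [-\tfrac12, \tfrac12]$ and $z = e^{-i\eta}$,
$$\Big| e^{-i\eta \cdot Nt} - \sum_{j=0}^{N} \frac{Nt(Nt-1)\cdots(Nt-j+1)}{j!} (e^{-i\eta} - 1)^j \Big| = \Big| \sum_{j=N+1}^{\infty} \frac{Nt(Nt-1)\cdots(Nt-j+1)}{j!} (e^{-i\eta} - 1)^j \Big| \le |\eta|^N.$$
Here each coefficient with $j > N$ has absolute value at most $1$ and $|e^{-i\eta} - 1| \le |\eta| \le \tfrac12$, so the tail is at most $\sum_{j > N} |\eta|^j \le 2|\eta|^{N+1} \le |\eta|^N$.
This, together with (5.4) for $z = e^{-i\eta}$, implies
$$\Big| e^{-i\eta \cdot Nt} - \sum_{l=0}^{N} w_{l,N}(t)\, e^{-i\eta l} \Big| \le |\eta|^N \qquad (5.9)$$
and hence
$$\Big| \sum_{l=0}^{N} w_{l,N}(t)\, e^{-i\eta l} \Big| \le 1 + |\eta|^N \le 1 + \frac{1}{2^N}. \qquad (5.10)$$
Now we can derive the bound in the multivariate case. Let $\theta \in [-\tfrac12, \tfrac12]^n$. Then $e^{-i\theta \cdot Nx} = \prod_{m=1}^{n} e^{-i\theta_m N x_m}$. We approximate $e^{-i\theta_m N x_m}$ by
$\sum_{\alpha_m=0}^{N} w_{\alpha_m,N}(x_m)\, e^{-i\theta_m \alpha_m}$ for $m = 1, 2, \dots, n$. We have
$$e^{-i\theta \cdot Nx} - \sum_{\alpha \in X_N} w_{\alpha,N}(x)\, e^{-i\theta \cdot \alpha} = \sum_{m=1}^{n} \Big\{ \prod_{s=1}^{m-1} e^{-i\theta_s N x_s} \Big\} \Big\{ \prod_{s=m+1}^{n} \sum_{\alpha_s=0}^{N} w_{\alpha_s,N}(x_s)\, e^{-i\theta_s \alpha_s} \Big\} \Big( e^{-i\theta_m N x_m} - \sum_{\alpha_m=0}^{N} w_{\alpha_m,N}(x_m)\, e^{-i\theta_m \alpha_m} \Big).$$
Applying (5.9) to the last term and (5.10) to the middle term, we see that this expression can
be bounded by
$$\sum_{m=1}^{n} \Big(1 + \frac{1}{2^N}\Big)^{n-m} |\theta_m|^N \le n \Big(1 + \frac{1}{2^N}\Big)^{n-1} \max_{1 \le j \le n} |\theta_j|^N.$$
Thus, bound (5.8) holds.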
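As an illustration of Lemma 5.10, the sketch below evaluates both sides of (5.7) and (5.8) for one choice of $N$, $n$, $x$ and $\theta$. It assumes the product formula for $w_{l,N}$ on the nodes $0, 1/N, \dots, 1$; the function names `lagrange_w`, `multi_w`, `check_lemma` and the sample values are illustrative assumptions.

```python
import cmath
from itertools import product

def lagrange_w(l, s, t):
    """Univariate w_{l,s}(t) for the nodes 0, 1/s, ..., 1 (product formula)."""
    value = 1.0
    for j in range(s + 1):
        if j != l:
            value *= (s * t - j) / (l - j)
    return value

def multi_w(alpha, N, x):
    """Multivariate w_{alpha,N}(x) as the product of univariate factors, as in (5.6)."""
    value = 1.0
    for a_j, x_j in zip(alpha, x):
        value *= lagrange_w(a_j, N, x_j)
    return value

def check_lemma(N, n, x, theta):
    """Return (lhs of 5.7, rhs of 5.7, lhs of 5.8, rhs of 5.8)."""
    X_N = list(product(range(N + 1), repeat=n))
    abs_sum = sum(abs(multi_w(a, N, x)) for a in X_N)
    approx = sum(
        multi_w(a, N, x) * cmath.exp(-1j * sum(th * a_j for th, a_j in zip(theta, a)))
        for a in X_N
    )
    exact = cmath.exp(-1j * N * sum(th * x_j for th, x_j in zip(theta, x)))
    bound_57 = (N * 2 ** N) ** n
    bound_58 = n * (1 + 2.0 ** (-N)) ** (n - 1) * max(abs(th) for th in theta) ** N
    return abs_sum, bound_57, abs(exact - approx), bound_58

# Sample point with theta in [-1/2, 1/2]^2 (illustrative values).
abs_sum, bound_57, err, bound_58 = check_lemma(5, 2, (0.3, 0.85), (0.5, -0.4))
print(abs_sum <= bound_57, err <= bound_58)
```

The exponential is approximated far more accurately than (5.8) requires at this sample point, which is consistent with the bound being a worst-case estimate over $x \in [0,1]^n$.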


We can now state estimates on the covering number $N(I_K(B_R), \eta)$ for a convolution-type
kernel $K(x, t) = k(x - t)$. The following function measures the regularity of the kernel
function $k$:
$$T_k(N) := n^3 \Big(1 + \frac{1}{2^N}\Big)^{2n-2} (2\pi)^{-n} \max_{1 \le j \le n} \int_{\xi \in [-N/2, N/2]^n} \hat{k}(\xi) \Big(\frac{|\xi_j|}{N}\Big)^{2N} d\xi + \big(1 + (N 2^N)^n\big)^2 (2\pi)^{-n} \int_{\xi \notin [-N/2, N/2]^n} \hat{k}(\xi)\, d\xi.$$
The domain of this function is split into two parts. In the first part, $\xi \in [-N/2, N/2]^n$, and
therefore, for $j = 1, \dots, n$, $(|\xi_j|/N)^N \le 2^{-N}$; hence this first part decays exponentially quickly as
$N$ becomes large. In the second part, $\xi \notin [-N/2, N/2]^n$, and therefore $\|\xi\|$ is large when $N$ is
large. The decay of $\hat{k}$ (which is equivalent to the regularity of $k$; see Part (III) in Section 5.1)
yields the fast decay of Tk on this second part. For more details and examples of bounding Tk
(N) by means of the decay of k (or, equivalently, the regularity of k), see Corollaries 5.12 and
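Theorem 5.11 Assume that $k$ is an even function in $L^2(\mathbb{R}^n)$ and $\hat{k}(\xi) > 0$ almost
everywhere on $\mathbb{R}^n$. Let $K(x, t) = k(x - t)$ for $x, t \in [0,1]^n$. Suppose
$\lim_{N \to \infty} T_k(N) = 0$. Then, for $0 < \eta \le R$,
$$\ln N(I_K(B_R), \eta) \le (N+1)^n \ln\Big( 8 \sqrt{k(0)}\, (N+1)^{n/2} (N 2^N)^n \frac{R}{\eta} \Big) \qquad (5.11)$$
holds, where $N$ is any integer satisfying
$$T_k(N) \le \Big(\frac{\eta}{2R}\Big)^2. \qquad (5.12)$$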
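Proof. By Proposition 2.14, $K$ is a Mercer kernel on $\mathbb{R}^n$. Let $f \in B_R$. Then, by the reproducing
property of $K$, $f(x) = \langle f, K_x \rangle_K$. Recall that $X_N = \{0, 1, \dots, N\}^n$. For $x \in [0,1]^n$, we have
$$\Big| f(x) - \sum_{\alpha \in X_N} f\Big(\frac{\alpha}{N}\Big) w_{\alpha,N}(x) \Big| = \Big| \Big\langle f,\; K_x - \sum_{\alpha \in X_N} w_{\alpha,N}(x) K_{\alpha/N} \Big\rangle_K \Big| \le \|f\|_K \{Q_N(x)\}^{1/2},$$
where $\{Q_N(x)\}^{1/2}$ is the $H_K$-norm of the function $K_x - \sum_{\alpha \in X_N} w_{\alpha,N}(x) K_{\alpha/N}$. It is explicitly given by
$$Q_N(x) := k(0) - 2 \sum_{\alpha \in X_N} w_{\alpha,N}(x)\, k\Big(x - \frac{\alpha}{N}\Big) + \sum_{\alpha, \beta \in X_N} w_{\alpha,N}(x)\, k\Big(\frac{\alpha - \beta}{N}\Big) w_{\beta,N}(x).$$
By the evenness of $k$ and the inverse Fourier transform,
$$k\Big(x - \frac{\alpha}{N}\Big) = (2\pi)^{-n} \int_{\mathbb{R}^n} \hat{k}(\xi)\, e^{i\xi \cdot (x - \alpha/N)}\, d\xi,$$
we obtain
$$Q_N(x) = (2\pi)^{-n} \int_{\mathbb{R}^n} \hat{k}(\xi) \Big| 1 - \sum_{\alpha \in X_N} w_{\alpha,N}(x)\, e^{i\xi \cdot (x - \alpha/N)} \Big|^2 d\xi = (2\pi)^{-n} \int_{\mathbb{R}^n} \hat{k}(\xi) \Big| e^{-i(\xi/N) \cdot Nx} - \sum_{\alpha \in X_N} w_{\alpha,N}(x)\, e^{-i(\xi/N) \cdot \alpha} \Big|^2 d\xi.$$
Now we separate this integral into two parts, one with $\xi \in [-N/2, N/2]^n$ and the other with $\xi \notin [-N/2, N/2]^n$. For the first region, (5.8) in Lemma 5.10 with $\theta = \xi/N$ tells us that
$$\int_{\xi \in [-N/2, N/2]^n} \hat{k}(\xi) \Big| e^{-i(\xi/N) \cdot Nx} - \sum_{\alpha \in X_N} w_{\alpha,N}(x)\, e^{-i(\xi/N) \cdot \alpha} \Big|^2 d\xi \le n^2 \Big(1 + \frac{1}{2^N}\Big)^{2n-2} \int_{\xi \in [-N/2, N/2]^n} \hat{k}(\xi) \max_{1 \le j \le n} \Big(\frac{|\xi_j|}{N}\Big)^{2N} d\xi.$$
For the second region, we apply (5.7) in Lemma 5.10 and obtain
$$\int_{\xi \notin [-N/2, N/2]^n} \hat{k}(\xi) \Big| e^{-i(\xi/N) \cdot Nx} - \sum_{\alpha \in X_N} w_{\alpha,N}(x)\, e^{-i(\xi/N) \cdot \alpha} \Big|^2 d\xi \le \big(1 + (N 2^N)^n\big)^2 \int_{\xi \notin [-N/2, N/2]^n} \hat{k}(\xi)\, d\xi.$$
Combining the two cases above, we have
$$Q_N(x) \le n^2 \Big(1 + \frac{1}{2^N}\Big)^{2n-2} (2\pi)^{-n} \int_{\xi \in [-N/2, N/2]^n} \hat{k}(\xi) \max_{1 \le j \le n} \Big(\frac{|\xi_j|}{N}\Big)^{2N} d\xi + \big(1 + (N 2^N)^n\big)^2 (2\pi)^{-n} \int_{\xi \notin [-N/2, N/2]^n} \hat{k}(\xi)\, d\xi \le T_k(N).$$
Hence
$$k(0) - 2 \sum_{\alpha \in X_N} w_{\alpha,N}(x)\, k\Big(x - \frac{\alpha}{N}\Big) + \sum_{\alpha, \beta \in X_N} w_{\alpha,N}(x)\, k\Big(\frac{\alpha - \beta}{N}\Big) w_{\beta,N}(x) \le T_k(N).$$
Since $N$ satisfies (5.12), we have
$$\Big\| f(x) - \sum_{\alpha \in X_N} f\Big(\frac{\alpha}{N}\Big) w_{\alpha,N}(x) \Big\|_{C(X)} \le \|f\|_K\, \frac{\eta}{2R} \le \frac{\eta}{2}.$$
Also, by the reproducing property, $|f(\alpha/N)| = |\langle f, K_{\alpha/N} \rangle_K| \le \|f\|_K \sqrt{K(\alpha/N, \alpha/N)} \le R \sqrt{k(0)}$. Hence
$$\big\| \{f(\alpha/N)\}_{\alpha \in X_N} \big\|_{\ell^2(X_N)} \le R \sqrt{k(0)}\, (N+1)^{n/2}.$$
Here $\ell^2(X_N)$ is the $\ell^2$ space of sequences $\{x(\alpha)\}_{\alpha \in X_N}$ indexed by $X_N$.
Apply Theorem 5.3 to the ball of radius $r := R \sqrt{k(0)}\, (N+1)^{n/2}$ in the finite-dimensional
space $\ell^2(X_N)$ and $\varepsilon = \eta / (2 (N 2^N)^n)$. Then there are $\{c^l : l = 1, \dots, \lfloor (2r/\varepsilon + 1)^{\# X_N} \rfloor\} \subset \ell^2(X_N)$ such
that for any $d \in \ell^2(X_N)$ with $\|d\|_{\ell^2(X_N)} \le r$, we can find some $l$ satisfying
$$\|d - c^l\|_{\ell^2(X_N)} \le \varepsilon.$$
This, together with Lemma 5.10, yields
$$\Big\| \sum_{\alpha \in X_N} d_\alpha w_{\alpha,N}(x) - \sum_{\alpha \in X_N} c^l_\alpha w_{\alpha,N}(x) \Big\|_{C(X)} \le \|d - c^l\|_{\ell^\infty(X_N)} \Big\| \sum_{\alpha \in X_N} |w_{\alpha,N}(x)| \Big\|_{C(X)} \le (N 2^N)^n \varepsilon \le \frac{\eta}{2}.$$
Here $\ell^\infty(X_N)$ is the $\ell^\infty$ space of sequences $\{x(\alpha)\}_{\alpha \in X_N}$ indexed by $X_N$ that satisfies the
relationship $\|c\|_{\ell^\infty(X_N)} \le \|c\|_{\ell^2(X_N)}$ for all $c \in \ell^\infty(X_N)$.
Thus, with $d = \{f(\alpha/N)\}$, we see that $\| f(x) - \sum_{\alpha \in X_N} c^l_\alpha w_{\alpha,N}(x) \|_{C(X)}$ can be bounded by
$$\Big\| f(x) - \sum_{\alpha \in X_N} f\Big(\frac{\alpha}{N}\Big) w_{\alpha,N}(x) \Big\|_{C(X)} + \Big\| \sum_{\alpha \in X_N} d_\alpha w_{\alpha,N}(x) - \sum_{\alpha \in X_N} c^l_\alpha w_{\alpha,N}(x) \Big\|_{C(X)} \le \eta.$$
We have covered $I_K(B_R)$ by balls with centers $\sum_{\alpha \in X_N} c^l_\alpha w_{\alpha,N}(x)$ and radius $\eta$. Therefore,
$$N(I_K(B_R), \eta) \le \Big(\frac{2r}{\varepsilon} + 1\Big)^{\# X_N}.$$
That is,
$$\ln N(I_K(B_R), \eta) \le (N+1)^n \ln\Big(\frac{2r}{\varepsilon} + 1\Big) \le (N+1)^n \ln\Big( 8 \sqrt{k(0)}\, (N+1)^{n/2} (N 2^N)^n \frac{R}{\eta} \Big).$$
The proof of Theorem 5.11 is complete.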
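The last step of the proof is pure arithmetic and can be traced with a short sketch. It assumes the quantities named in the proof, $r = R\sqrt{k(0)}(N+1)^{n/2}$ and $\varepsilon = \eta/(2(N2^N)^n)$; the function names and the sample values are illustrative, and the code does not check that $N$ satisfies (5.12).

```python
from math import log, sqrt

def log_net_size(N, n, R, eta, k0):
    """(N+1)^n * ln(2r/eps + 1): log of the number of centers used in the proof."""
    r = R * sqrt(k0) * (N + 1) ** (n / 2)
    eps = eta / (2 * (N * 2 ** N) ** n)
    return (N + 1) ** n * log(2 * r / eps + 1)

def log_covering_bound(N, n, R, eta, k0):
    """Right-hand side of (5.11); N is assumed (not verified) to satisfy (5.12)."""
    return (N + 1) ** n * log(
        8 * sqrt(k0) * (N + 1) ** (n / 2) * (N * 2 ** N) ** n * R / eta
    )

# Illustrative values: n = 1, N = 4, R = 1, eta = 0.1, k(0) = 1.
net = log_net_size(4, 1, 1.0, 0.1, 1.0)
bound = log_covering_bound(4, 1, 1.0, 0.1, 1.0)
print(net <= bound)
```

The simplified form (5.11) is slightly larger than the net size because $2r/\varepsilon + 1$ is replaced by $2 \cdot (2r/\varepsilon)$, which is valid whenever $2r/\varepsilon \ge 1$.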

To see how to handle the function Tk (N) measuring the regularity of the kernel, and then
to estimate the covering number, we turn to the example of Gaussian kernels.
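Corollary 5.12 Let $\sigma > 0$, $X = [0,1]^n$, and $K(x, y) = k(x - y)$ with
$$k(x) = \exp\Big\{ -\frac{\|x\|^2}{\sigma^2} \Big\}, \qquad x \in \mathbb{R}^n.$$
Then, for $0 < \eta \le R$,
$$\ln N(I_K(B_R), \eta) \le \Big( 3 \ln\frac{R}{\eta} + \frac{54n}{\sigma^2} + 6 \Big)^n \Big( (6n+1) \ln\frac{R}{\eta} + \frac{90n^2}{\sigma^2} + 11n + 3 \Big)$$
holds. In particular, when $0 < \eta < R \exp\{-(90n^2/\sigma^2) - 11n - 3\}$, we have
$$\ln N(I_K(B_R), \eta) \le 4^n (6n+2) \Big( \ln\frac{R}{\eta} \Big)^{n+1}.$$
Proof. It is well known that
$$\hat{k}(\xi) = (\sigma \sqrt{\pi})^n e^{-\sigma^2 \|\xi\|^2 / 4}.$$
Hence $\hat{k}(\xi) > 0$ for any $\xi \in \mathbb{R}^n$.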
Let us estimate the function Tk. For the first part, with 1 < j < n, we have
a Vne-a2f2/4 dfi < 1
when l = j. Hence (2n)-n

( d
\N J

a *Jn r
2n -N/2

\/n \ a Nj

2 \2N / 2N + 1\N+(1/2)

1/(6(2N +1))


aN J

\ 2e )

V2N + 1

As for the second term of Tk, we have

If we apply Stirlings formula, this expression can be bounded by
(2nyn (
(a^fe-^l|2/4 dj

j=1 Jfjt[-N/2N/2]
e-a2 |j|2/4 dfj
< n^n T'e-(a2/4)(t2-t/2)e-(a2/4)(t/2) dt N
h [-N/2N/2]n
e-a N(N-1)/16_^e-a N/16
- vne
a2 e
(a /16)N
_(a 2/i^w 2
ea Jn
If we combine these two estimates, the function Tk satisfies
/16)N 2

T (N)


< nil +


1 \ 2n-2
< * 1 + 2n /
Notice that when N > n + 3,
(1 + (N2n)n)2 - 21-2n+4Nn

eNf + ark"+(N 2N n*e-(a

.It follows that

Tk(N) < n e(
Tk (N) < n5e



, +------------------2
16en ln2
<e+ 2-nN
1 1
< e +-----; max ,
<' + a Jn
16n 2n
When N > 2ln | + 5 and N > | In | + f - In(a*Jn)/(nln2), we know that each term in the
estimates for Tk is bounded by (n/(2R))2/2. Hence (5.12) holds.
Finally, we choose the smallest N satisfying
80n ln 2

+ 3 ln + 5.
Then, by checking the cases a > 1 and a < 1, we see that (5.12) is valid for any 0 < n < R.
By Theorem 5.11,
ln N (IK(BR), n) <
80n ln 2
3 ln2 +
3 ln +
)((2ln^ nN + ln R + ln8)
R 90n2
(6n + 1) ln + 22+ 11 n + 3
Choose N > 80n ln2/a2. Then
This proves (5.15).
When 0 < n < Re(902/a2)-11-3, we have

R \ n+1

ln n '
This yields the last inequality in the statement.
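To get a feeling for the size of these bounds, the sketch below evaluates the two estimates of Corollary 5.12 as reconstructed above: the general bound and the simplified bound for small $\eta$. The function names and the sample parameters are illustrative assumptions.

```python
from math import exp, log

def gaussian_log_covering_bound(R, eta, sigma, n):
    """General bound of Corollary 5.12 for k(x) = exp(-|x|^2 / sigma^2)."""
    L = log(R / eta)
    first = (3 * L + 54 * n / sigma ** 2 + 6) ** n
    second = (6 * n + 1) * L + 90 * n ** 2 / sigma ** 2 + 11 * n + 3
    return first * second

def small_radius_bound(R, eta, n):
    """Simplified bound 4^n (6n+2) (ln(R/eta))^(n+1), stated for small eta."""
    return 4 ** n * (6 * n + 2) * log(R / eta) ** (n + 1)

def small_radius_threshold(R, sigma, n):
    """Radius below which the simplified bound is claimed to apply."""
    return R * exp(-(90 * n ** 2 / sigma ** 2) - 11 * n - 3)

R, sigma, n = 1.0, 1.0, 1
eta = exp(-110.0)                      # below the threshold exp(-104) for these values
general = gaussian_log_covering_bound(R, eta, sigma, n)
simplified = small_radius_bound(R, eta, n)
print(eta < small_radius_threshold(R, sigma, n), general <= simplified)
```

Both bounds grow like a power of $\ln(R/\eta)$, in contrast with the power of $R/\eta$ in Theorem 5.8 for kernels of finite smoothness.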

One can easily derive from the covering number estimates for X = [0,1]n such estimates
for an arbitrary X.
Proposition 5.13 Let $K(x, y) = k(x - y)$ be a translation-invariant Mercer kernel on $\mathbb{R}^n$
and $X \subseteq \mathbb{R}^n$. Let $A > 0$ and $K_A$ be the Mercer kernel on $X_A = [0,1]^n$ given by
$$K_A(x, y) = k(A(x - y)), \qquad x, y \in [0,1]^n.$$
(i) If $X \subseteq x^* + [-A/2, A/2]^n$ for some $x^* \in X$, then, for all $\eta, R > 0$,
$$N(I_K(B_R), \eta) \le N(I_{K_A}(B_R), \eta).$$
(ii) If $X \supseteq x^* + [-A/2, A/2]^n$ for some $x^* \in X$, then, for all $\eta, R > 0$,
$$N(I_K(B_R), \eta) \ge N(I_{K_A}(B_R), \eta).$$
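Proof. (i) Denote $t_0 = (\tfrac12, \tfrac12, \dots, \tfrac12) \in [0,1]^n$. Let $g = \sum_{i=1}^{m} c_i K_{x_i} \in I_K(B_R)$. Then
$$\|g\|_K^2 = \sum_{i,j=1}^{m} c_i c_j k(x_i - x_j) = \sum_{i,j=1}^{m} c_i c_j k\Big( A \Big( \frac{x_i - x^*}{A} + t_0 - \Big( \frac{x_j - x^*}{A} + t_0 \Big) \Big) \Big) = \sum_{i,j=1}^{m} c_i c_j K_A\Big( \frac{x_i - x^*}{A} + t_0,\; \frac{x_j - x^*}{A} + t_0 \Big) = \Big\| \sum_{i=1}^{m} c_i (K_A)_{\frac{x_i - x^*}{A} + t_0} \Big\|_{K_A}^2,$$
the last line by the definition of $K_A$. Since $g \in I_K(B_R)$, we have
$$\sum_{i=1}^{m} c_i (K_A)_{\frac{x_i - x^*}{A} + t_0} \in I_{K_A}(B_R).$$
If $\{f_1, \dots, f_N\}$ is an $\eta$-net of $I_{K_A}(B_R)$ on $X_A = [0,1]^n$ with $N = N(I_{K_A}(B_R), \eta)$, then there is some $j \in \{1, \dots, N\}$ such that
$$\Big\| \sum_{i=1}^{m} c_i (K_A)_{\frac{x_i - x^*}{A} + t_0} - f_j \Big\|_{C([0,1]^n)} \le \eta.$$
This means that
$$\sup_{t \in [0,1]^n} \Big| \sum_{i=1}^{m} c_i K_A\Big( t,\; \frac{x_i - x^*}{A} + t_0 \Big) - f_j(t) \Big| \le \eta.$$
Take $t = ((x - x^*)/A) + t_0$. When $x \in X \subseteq x^* + [-A/2, A/2]^n$, we have $t \in [0,1]^n$. Hence
$$\sup_{x \in X} \Big| \sum_{i=1}^{m} c_i K_A\Big( \frac{x - x^*}{A} + t_0,\; \frac{x_i - x^*}{A} + t_0 \Big) - f_j\Big( \frac{x - x^*}{A} + t_0 \Big) \Big| \le \eta.$$
This is the same as
$$\sup_{x \in X} \Big| \sum_{i=1}^{m} c_i k(x - x_i) - f_j\Big( \frac{x - x^*}{A} + t_0 \Big) \Big| = \sup_{x \in X} \Big| \sum_{i=1}^{m} c_i K_{x_i}(x) - f_j\Big( \frac{x - x^*}{A} + t_0 \Big) \Big| \le \eta.$$
This shows that if we define $f_j^*(x) := f_j(((x - x^*)/A) + t_0)$, the set $\{f_1^*, \dots, f_N^*\}$ is an $\eta$-net of the
function set $\{\sum_{i=1}^{m} c_i K_{x_i} \in I_K(B_R)\}$ in $C(X)$. Since this function set is dense in $I_K(B_R)$, we have
$N(I_K(B_R), \eta) \le N = N(I_{K_A}(B_R), \eta)$.
(ii) If $X \supseteq x^* + [-A/2, A/2]^n$ and $\{g_1, \dots, g_N\}$ is an $\eta$-net of $I_K(B_R)$ with $N = N(I_K(B_R), \eta)$,
then, for each $g \in B_R$, we can find some $j \in \{1, \dots, N\}$ such that $\|g - g_j\|_{C(X)} \le \eta$.
Let $f = \sum_{i=1}^{m} c_i (K_A)_{t_i} \in I_{K_A}(B_R)$. Then, for any $t \in X_A$,
$$f(t) = \sum_{i=1}^{m} c_i k(A(t - t_i)) = \sum_{i=1}^{m} c_i k(x - x_i) = \sum_{i=1}^{m} c_i K_{x_i}(x),$$
where $x = x^* + A(t - t_0) \in X$ and $x_i = x^* + A(t_i - t_0) \in X$. It
follows from this expression that
$$\|f\|_{K_A}^2 = \sum_{i,j=1}^{m} c_i c_j K_A(t_i, t_j) = \sum_{i,j=1}^{m} c_i c_j k(A(t_i - t_j)) = \sum_{i,j=1}^{m} c_i c_j K(x_i, x_j) = \|g\|_K^2 \le R^2,$$
where $g = \sum_{i=1}^{m} c_i K_{x_i}$. So $g \in I_K(B_R)$ and we have $\|g - g_j\|_{C(X)} = \sup_{x \in X} |g(x) - g_j(x)| \le \eta$ for some $j \in \{1, \dots, N\}$. But for $x = x(t) \in X$,
$$g(x) = \sum_{i=1}^{m} c_i k(x - x_i) = \sum_{i=1}^{m} c_i k(A(t - t_i)) = f(t).$$
It follows that
$$\sup_{t \in [0,1]^n} \big| f(t) - g_j(x^* + A(t - t_0)) \big| \le \sup_{x \in X} |g(x) - g_j(x)| \le \eta.$$
This shows that if we define $g_j^*(t) := g_j(x^* + A(t - t_0))$, the set $\{g_1^*, \dots, g_N^*\}$ is an $\eta$-net of
$I_{K_A}(B_R)$ in $C([0,1]^n)$.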

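If we take $A = 2\,\mathrm{Diam}(X)$ to be twice the diameter of $X$, then the condition $X \subseteq x^* + [-A/2, A/2]^n$ holds for any $x^* \in X$. If, moreover, $k(x) = \exp\{-\|x\|^2/\sigma^2\}$ then $K_A(x, y) = \exp\{-\|x - y\|^2/(\sigma^2/A^2)\}$. In this situation, Corollary 5.12 and Proposition 5.13 yield the following upper bound
for the covering numbers of Gaussian kernels.
Corollary 5.14 Let $\sigma > 0$, $X \subset \mathbb{R}^n$ with $\mathrm{Diam}(X) < \infty$, and $K(x, y) = \exp\{-\|x - y\|^2/\sigma^2\}$. Then, for
any $0 < \eta \le R$, we have
$$\ln N(I_K(B_R), \eta) \le \Big( 3 \ln\frac{R}{\eta} + \frac{216 n (\mathrm{Diam}(X))^2}{\sigma^2} + 6 \Big)^n \Big( (6n+1) \ln\frac{R}{\eta} + \frac{360 n^2 (\mathrm{Diam}(X))^2}{\sigma^2} + 11n + 3 \Big).$$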

We now note that the upper bound in Part (ii) of Theorem 5.1 follows from Corollary 5.14.
We next apply Theorem 5.11 to kernels with exponentially decaying Fourier transforms.
Theorem 5.15 Let $k$ be as in Theorem 5.11, and assume that for some constants $C_0 > 0$
and $\lambda > n(6 + 2 \ln 4)$,
$$\hat{k}(\xi) \le C_0\, e^{-\lambda \|\xi\|}, \qquad \forall \xi \in \mathbb{R}^n.$$
ln N(IK(BR), n) <
ln (1/A)
ln + 1 + C1 n
ln (1/A) + 1
ln + C2 n
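Denote $A := \max\{1/e^{\lambda},\, 4^n/e^{\lambda/2}\}$. Then, for $0 < \eta \le 2R \sqrt{C_0}\, A^{(2n-1)/4}$, the estimate (5.19) displayed above
holds, where
$$C_1 := 1 + \frac{2 \ln(32 C_0)}{\ln(1/A)}, \qquad C_2 := \ln\Big( 8 \sqrt{C_0}\, 2^{n/2} \big(A^{-3/8} 2^n\big)^{C_1} \Big).$$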


( f ) l - j l df - c f
e-^" N
f e[-N/2,N/2]n \N J
Jf e[-N/2,N/ 2 ] n
e '"""I -TT | d f
NNNn-1 i
\fj\Ne-\fj\ dfj
< n2_C N Nn-1N! - XN+1NN
< 2C0V2n2(1/12)+1

the last inequality by Stirlings formula. Hence the first term of Tk (N) is at most
\ 2 n-2



- 4C0 Nn-(1/2) ^
1 \N +1
Nn-(1/2) (eX)N +1
Here we have bounded the constant term as follows
n3 (1 + 2 N f" (2n)-2V2n2(1/12)+1 = ^il+i/^)
n- 1

(1 + (1/2))2\


2n J





8n 7


< 4, for all n e N.

Proof. Let N e N and 1 - j n. Since \fj\/N < 1 for f e [-N/2, N/2]n, we have

(f) df - Co I
f e[-N/2,N/2]n
J ||f ||>N/2
For the other term in Tk (N), we have
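To estimate this integral, we recall spherical coordinates in $\mathbb{R}^n$:
$$\xi_1 = r \cos\theta_1,$$
$$\xi_2 = r \sin\theta_1 \cos\theta_2,$$
$$\xi_3 = r \sin\theta_1 \sin\theta_2 \cos\theta_3,$$
$$\vdots$$
$$\xi_{n-1} = r \sin\theta_1 \sin\theta_2 \cdots \sin\theta_{n-2} \cos\theta_{n-1},$$
$$\xi_n = r \sin\theta_1 \sin\theta_2 \cdots \sin\theta_{n-2} \sin\theta_{n-1},$$
where $r \in (0, \infty)$, $\theta_1, \dots, \theta_{n-2} \in [0, \pi)$, and $\theta_{n-1} \in [0, 2\pi)$. For a radial
function $f(\|\xi\|)$ we have
$$\int_{r_1 \le \|\xi\| \le r_2} f(\|\xi\|)\, d\xi = \omega_{n-1} \int_{r_1}^{r_2} f(r)\, r^{n-1}\, dr,$$
where
$$\omega_{n-1} = \int_0^{2\pi} \int_0^{\pi} \cdots \int_0^{\pi} \sin^{n-2}\theta_1 \sin^{n-3}\theta_2 \cdots \sin\theta_{n-2}\; d\theta_1\, d\theta_2 \cdots d\theta_{n-2}\, d\theta_{n-1} = 2\pi \prod_{j=1}^{n-2} \int_0^{\pi} \sin^{n-j-1}\theta_j\, d\theta_j = \frac{2\pi^{n/2}}{\Gamma(n/2)}.$$
Applying this with $f(\|\xi\|) = e^{-\lambda \|\xi\|}$, we find that
$$\int_{\|\xi\| \ge N/2} e^{-\lambda \|\xi\|}\, d\xi = \frac{2\pi^{n/2}}{\Gamma(n/2)} \int_{N/2}^{\infty} e^{-\lambda r} r^{n-1}\, dr.$$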

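By repeatedly integrating by parts, we see that
$$\int_{N/2}^{\infty} e^{-\lambda r} r^{n-1}\, dr = \frac{1}{\lambda} \Big(\frac{N}{2}\Big)^{n-1} e^{-\lambda N/2} + \frac{n-1}{\lambda} \int_{N/2}^{\infty} e^{-\lambda r} r^{n-2}\, dr = \sum_{j=1}^{n} \frac{1}{\lambda^j} \frac{(n-1)!}{(n-j)!} \Big(\frac{N}{2}\Big)^{n-j} e^{-\lambda N/2}.$$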
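The closed form produced by repeated integration by parts, namely $\int_a^\infty e^{-\lambda r} r^{n-1}\,dr = e^{-\lambda a}\sum_{j=1}^{n} \frac{(n-1)!}{(n-j)!}\frac{a^{n-j}}{\lambda^j}$ with $a = N/2$, can be compared against direct numerical integration. This is a minimal sketch; the Simpson rule, the truncation of the infinite interval, and the sample values are illustrative choices and not part of the text.

```python
from math import exp, factorial

def tail_closed_form(lam, n, a):
    """Closed form of the integral of exp(-lam*r) * r^(n-1) over [a, infinity)."""
    total = sum(
        factorial(n - 1) / factorial(n - j) * a ** (n - j) / lam ** j
        for j in range(1, n + 1)
    )
    return total * exp(-lam * a)

def tail_simpson(lam, n, a, width=60.0, steps=20000):
    """Composite Simpson rule on [a, a + width/lam]; the neglected tail is of order exp(-width)."""
    b = a + width / lam
    h = (b - a) / steps

    def f(r):
        return exp(-lam * r) * r ** (n - 1)

    total = f(a) + f(b)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3

# Illustrative values: lambda = 2, n = 3, a = N/2 with N = 3.
closed = tail_closed_form(2.0, 3, 1.5)
numeric = tail_simpson(2.0, 3, 1.5)
print(abs(closed - numeric) <= 1e-6 * closed)
```

For $n = 1$ the sum reduces to the single term $e^{-\lambda a}/\lambda$, which is the elementary exponential integral.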
< C0^ V __N)n-je-*N/2
V(n/2) j=i (2n)j (n - j)! \ 2j

< Co^ V (n ~ 1)! 2-nNn-je~XN/

V(n/2) j=l (n - j)!j!
It follows that the second term in Tk (N) is bounded by
Therefore, returning to (5.20), since X > 2n,
T(n/2) = X (n - j)! \ 2
$ $[-N/2,N/2]' n-j
A 1 (n - 1)! /N\
k($) d$ < Co
Combining the two bounds above, we have
1 \N+ 1
1 N +1
4n N
Tk (N) < 4CoNn-1 /2
+ 4CoN3 n -^2
< CN3nA
3n N
Since X > n(6 + 2ln 4), the definition of A yields
A < max
e(2n ln 4 + n) en ln 4+3n
Since xe x < e 1 for all x e (o, (x>), we have
. 3n
= ^Ne-N/6n ln(1/A)^
N 3n AN/2 = ^ AN/6"^

< 1.
(1 + (N2N)"f (2n)-2C/12N"~le~XN/2 < 4CoN3n (eC)"
Then, for N > 4n/ln(1/A),
Tk (N) < 8CoAN/2.
Thus, for o < n < 2R^/COA(2" 1 )/4 , we may take N e N such that N > 4n/ln(1/A) and N > 2
to obtain


5.4 Lower bounds for covering numbers

8CoAN/2 <
y < 8CoA(N-1 )/2 . (5.22)Under this choice, (5.12) holds. Then, by Theorem 5.11,
lnN (IK(BR), n) < (N + 1)" ln(^8y/k(0j(N + 1)n/2(N2N)nRj.
4 R + ln .
ln(1/A) n
Now, by (5.22),
Also, since
(N + 1)n/2(N2n)" < 2n/2(A-3/82")N,
lnN(IK(BR), n) <
,_Ri 0 , 2ln(32Co)
ln + 2 +
ln(1/A) n
ln (1/A)
ln 8 vk(0 )2 n / 2
we have
(A-3/82")1+2ln(32Co)/ln(1/A) (R/n)(4/ ln(1/A))+1
Finally observe that

\k (0)| = \(2n)k(f) df | < (2 n)~n J Coe- ^ 1 1 df < C0 .

Then (5.19) follows.

We can apply Theorem 5.15 to inverse multiquadric kernels.

Corollary 5.16 Let $c > n(6 + 2 \ln 4)$, $\alpha > n/2$, and
$$k(x) = (c^2 + \|x\|^2)^{-\alpha}, \qquad x \in \mathbb{R}^n.$$

lnN(I (B )
K R n) < I JJ)ln n + 1 + CV ((nTA + 1 1 l n " + C 2
Then there is a constant C0 depending only on a such that for 0 < n < 2R^C0A(2n-1)/4,
holds, where A = max{1/ec,4n/ec/2} and Ci, C2 are the constants defined in Theorem
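Proof. For any $\varepsilon > 0$, we know that there are positive constants $C_0 \ge 1$, depending only on $\alpha$,
and $C_\varepsilon^*$, depending only on $\alpha$ and $\varepsilon$, such that
$$C_\varepsilon^*\, e^{-(c + \varepsilon) \|\xi\|} \le \hat{k}(\xi) \le C_0\, e^{-c \|\xi\|}, \qquad \forall \xi \in \mathbb{R}^n. \qquad (5.23)$$
Then we can apply Theorem 5.15 with $\lambda = c$, and the desired estimate follows.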


Lower bounds for covering numbers

In this section we continue our discussion of the covering numbers of balls in RKHSs and
provide some lower-bound estimates. This is done by bounding the related packing numbers.
Definition 5.17 Let $S$ be a compact set in a metric space and $\eta > 0$. The packing number
$M(S, \eta)$ is the largest integer $m \in \mathbb{N}$ such that there exist $m$ points $x_1, \dots, x_m \in S$ being $\eta$-separated; that is, the distance between $x_i$ and $x_j$ is greater than $\eta$ if $i \ne j$.
Covering and packing numbers are closely related.
Proposition 5.18 For any $\eta > 0$,
$$M(S, 2\eta) \le N(S, \eta) \le M(S, \eta).$$
Proof. Let $k = M(S, 2\eta)$ and $\{a_1, \dots, a_k\}$ be a set of $2\eta$-separated points in $S$. Then, by the
triangle inequality, no closed ball of radius $\eta$ can contain more than one $a_i$. This shows that
$N(S, \eta) \ge k$.
To prove the other inequality, let $k = M(S, \eta)$ and $\{a_1, \dots, a_k\}$ be a set of $\eta$-separated
points in $S$. Then, the balls $B(a_i, \eta)$ cover $S$. Otherwise, there would exist a point $a_{k+1}$ whose
distance to $a_j$, $j = 1, \dots, k$, was greater than $\eta$ and one would have $M(S, \eta) \ge k + 1$.
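The second half of the proof rests on the fact that an $\eta$-separated set that cannot be enlarged is automatically an $\eta$-net. The sketch below illustrates this on a finite set of points of $[0,1]$, using a greedy selection; the point set, the value of $\eta$, and the function name `greedy_separated` are illustrative assumptions. The greedy set is maximal with respect to inclusion, not necessarily of maximum cardinality.

```python
def greedy_separated(points, eta, dist):
    """Scan the points once, keeping each one that is more than eta away from all kept so far."""
    chosen = []
    for p in points:
        if all(dist(p, q) > eta for q in chosen):
            chosen.append(p)
    return chosen

def dist(a, b):
    return abs(a - b)

points = [i / 50 for i in range(51)]    # finite stand-in for the compact set S
eta = 0.13

centers = greedy_separated(points, eta, dist)

# The selected points are pairwise more than eta apart ...
is_separated = all(
    dist(p, q) > eta for i, p in enumerate(centers) for q in centers[:i]
)
# ... and every point of S lies within eta of some selected point.
is_net = all(any(dist(p, c) <= eta for c in centers) for p in points)

print(len(centers), is_separated, is_net)
```

Any point that the greedy scan rejects is, by construction, within $\eta$ of an already selected point, which is exactly the covering argument in the proof.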

The lower bounds for the packing numbers are presented in terms of the Gramian matrix
$$K[\mathbf{x}] = \big(K(x_i, x_j)\big)_{i,j=1}^{m},$$
where $\mathbf{x} := \{x_1, \dots, x_m\}$ is a set of points in $X$. Denote by $\|K[\mathbf{x}]^{-1}\|_2$ the norm of $K[\mathbf{x}]^{-1}$ (if it
exists) as an operator on $\mathbb{R}^m$ with the 2-norm.
We use nodal functions in the RKHS $H_K$ to provide lower bounds of covering numbers.
They are used in the next chapter as well to construct interpolation schemes to estimate the
approximation error.
Definition 5.19 Let $\mathbf{x} := \{x_1, \dots, x_m\} \subseteq X$. We say that $\{u_i\}_{i=1}^{m}$ is a set of nodal functions associated
with the nodes $x_1, \dots, x_m$ if $u_i \in \mathrm{span}\{K_{x_1}, \dots, K_{x_m}\}$ and
$$u_i(x_j) = \delta_{ij}.$$
The following result characterizes the existence of nodal functions.
Proposition 5.20 Let $K$ be a Mercer kernel on $X$ and $\mathbf{x} := \{x_1, \dots, x_m\} \subseteq X$. Then the
following statements are equivalent:
(i) The nodal functions $\{u_i\}_{i=1}^{m}$ exist.
(ii) The functions $\{K_{x_i}\}_{i=1}^{m}$ are linearly independent.
(iii) The Gramian matrix $K[\mathbf{x}]$ is invertible.
(iv) There exists a set of functions $\{f_i\}_{i=1}^{m} \subset H_K$ such that $f_i(x_j) = \delta_{ij}$ for $i, j = 1, \dots, m$.
In this case, the nodal functions are uniquely given by
$$u_i(x) = \sum_{j=1}^{m} \big(K[\mathbf{x}]^{-1}\big)_{ij} K_{x_j}(x), \qquad i = 1, \dots, m.$$
Moreover, for each $x \in X$, the vector $(u_i(x))_{i=1}^{m}$ is the unique minimizer in $\mathbb{R}^m$ of the
quadratic function $Q$ given by
$$Q(w) = \sum_{i,j=1}^{m} w_i K(x_i, x_j) w_j - 2 \sum_{i=1}^{m} w_i K(x, x_i) + K(x, x), \qquad w \in \mathbb{R}^m.$$
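Proof. (i) $\Rightarrow$ (ii). The nodal function property implies that the nodal functions $\{u_i\}$ are linearly
independent. Hence (i) implies (ii), since the $m$-dimensional space $\mathrm{span}\{u_i\}_{i=1}^{m}$ is
contained in $\mathrm{span}\{K_{x_i}\}_{i=1}^{m}$.
(ii) $\Rightarrow$ (iii). A solution $d = (d_1, \dots, d_m) \in \mathbb{R}^m$ of the linear system
$$K[\mathbf{x}]\, d = 0$$
satisfies
$$\Big\| \sum_{j=1}^{m} d_j K_{x_j} \Big\|_K^2 = \sum_{i=1}^{m} d_i \sum_{j=1}^{m} K(x_i, x_j)\, d_j = 0.$$
Then the linear independence of $\{K_{x_j}\}_{j=1}^{m}$ implies that the linear system has only the zero
solution; that is, $K[\mathbf{x}]$ is invertible.
(iii) $\Rightarrow$ (iv). When $K[\mathbf{x}]$ is invertible, the functions $\{f_i\}_{i=1}^{m}$ given by $f_i = \sum_{j=1}^{m} (K[\mathbf{x}]^{-1})_{ij} K_{x_j}$ satisfy
$$f_i(x_j) = \sum_{l=1}^{m} \big(K[\mathbf{x}]^{-1}\big)_{il} K(x_l, x_j) = \big(K[\mathbf{x}]^{-1} K[\mathbf{x}]\big)_{ij} = \delta_{ij}.$$
These are the desired functions.
(iv) $\Rightarrow$ (i). Let $P_{\mathbf{x}}$ be the orthogonal projection from $H_K$ onto $\mathrm{span}\{K_{x_i}\}_{i=1}^{m}$. Then for $i, j = 1, \dots, m$,
$$P_{\mathbf{x}}(f_i)(x_j) = \langle P_{\mathbf{x}}(f_i), K_{x_j} \rangle_K = \langle f_i, K_{x_j} \rangle_K = f_i(x_j) = \delta_{ij}.$$
So $\{u_i = P_{\mathbf{x}}(f_i)\}_{i=1}^{m}$ are the desired nodal functions. The uniqueness of the nodal functions
follows from the invertibility of the Gramian matrix $K[\mathbf{x}]$.
Since the quadratic form $Q$ can be written as $Q(w) = w^T K[\mathbf{x}] w - 2 b^T w + K(x, x)$ with the
positive definite matrix $K[\mathbf{x}]$ and the vector $b = (K(x, x_i))_{i=1}^{m}$, we know that the minimizer
$w^*$ of $Q$ in $\mathbb{R}^m$ is given by the linear system $K[\mathbf{x}] w^* = b$, which is exactly
$(u_i(x))_{i=1}^{m}$.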
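The construction of nodal functions can be tried out numerically. The following is a minimal sketch for a Gaussian kernel on the line, assuming four hand-picked nodes and a small Gaussian-elimination helper in place of a linear algebra library; the names `solve`, `kernel`, `nodal_values`, `Q` and all numerical values are illustrative assumptions.

```python
from math import exp

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting (inputs are not modified)."""
    m = len(A)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, m):
            factor = M[r][col] / M[col][col]
            for c in range(col, m + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * m
    for r in range(m - 1, -1, -1):
        z[r] = (M[r][m] - sum(M[r][c] * z[c] for c in range(r + 1, m))) / M[r][r]
    return z

def kernel(x, y, sigma=0.5):
    """Gaussian kernel exp(-(x-y)^2 / sigma^2) on the real line."""
    return exp(-(x - y) ** 2 / sigma ** 2)

nodes = [0.0, 0.3, 0.55, 0.9]
gram = [[kernel(p, q) for q in nodes] for p in nodes]

def nodal_values(x):
    """Vector (u_i(x))_i, obtained by solving K[x] w = b with b_i = K(x, x_i)."""
    return solve(gram, [kernel(x, p) for p in nodes])

def Q(w, x):
    """Quadratic function of Proposition 5.20 at the point x."""
    m = len(nodes)
    quad = sum(w[i] * gram[i][j] * w[j] for i in range(m) for j in range(m))
    lin = sum(w[i] * kernel(x, nodes[i]) for i in range(m))
    return quad - 2 * lin + kernel(x, x)

# u_i(x_j) should reproduce the Kronecker delta at the nodes.
delta_error = max(
    abs(nodal_values(nodes[j])[i] - (1.0 if i == j else 0.0))
    for i in range(len(nodes)) for j in range(len(nodes))
)

# The vector (u_i(x)) should minimize Q: small perturbations increase Q.
x = 0.42
w_star = nodal_values(x)
perturbed = [w_star[:] for _ in nodes]
for i, w in enumerate(perturbed):
    w[i] += 0.01
is_minimizer = all(Q(w_star, x) <= Q(w, x) for w in perturbed)

print(delta_error < 1e-8, is_minimizer)
```

Solving $K[\mathbf{x}] w = b$ directly avoids forming the inverse matrix, and the value $Q(w^*)$ is the squared $H_K$-distance from $K_x$ to $\mathrm{span}\{K_{x_i}\}$.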
When the RKHS has finite dimension $l$, then, for any $m \le l$, we can find nodal functions
$\{u_j\}_{j=1}^{m}$ associated with some subset $\mathbf{x} = \{x_1, \dots, x_m\} \subseteq X$, whereas for $m > l$ no such nodal
functions exist. When $\dim H_K = \infty$, then, for any $m \in \mathbb{N}$, we can find a subset $\mathbf{x} = \{x_1, \dots, x_m\} \subseteq X$ that possesses a set of nodal functions.
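Theorem 5.21 Let $K$ be a Mercer kernel on $X$, $m \in \mathbb{N}$, and $\mathbf{x} = \{x_1, \dots, x_m\} \subseteq X$ such that $K[\mathbf{x}]$ is invertible. Then
$$M(I_K(B_R), \eta) \ge 2^m - 1$$
for all $\eta > 0$ satisfying
$$\|K[\mathbf{x}]^{-1}\|_2 < \frac{1}{m} \Big(\frac{R}{\eta}\Big)^2.$$
Proof. By Proposition 5.20, the set of nodal functions $\{u_j(x)\}_{j=1}^{m}$ associated with $\mathbf{x}$ exists and
can be expressed by
$$u_i(x) = \sum_{j=1}^{m} \big(K[\mathbf{x}]^{-1}\big)_{ij} K_{x_j}(x), \qquad i = 1, \dots, m.$$
For each nonempty subset $J$ of $\{1, \dots, m\}$, we define the function $u_J(x) := \sum_{j \in J} \eta' u_j(x)$,
where $\eta' > \eta$ satisfies $\|K[\mathbf{x}]^{-1}\|_2 \le \frac{1}{m} (R/\eta')^2$. These $2^m - 1$ functions are $\eta$-separated in $C(X)$.
For $J_1 \ne J_2$, there exists some $j_0 \in \{1, \dots, m\}$ lying in one of the sets $J_1$, $J_2$, but not in the
other. Hence
$$\|u_{J_1} - u_{J_2}\|_\infty \ge |u_{J_1}(x_{j_0}) - u_{J_2}(x_{j_0})| \ge \eta' > \eta.$$
What is left is to show that the functions $u_J$ lie in $B_R$. To see this, take $\emptyset \ne J \subseteq \{1, \dots, m\}$. Then
$$\|u_J\|_K^2 = \eta'^2 \Big\langle \sum_{j \in J} \sum_{l=1}^{m} \big(K[\mathbf{x}]^{-1}\big)_{jl} K_{x_l},\; \sum_{j' \in J} \sum_{s=1}^{m} \big(K[\mathbf{x}]^{-1}\big)_{j's} K_{x_s} \Big\rangle_K = \eta'^2 \sum_{j, j' \in J} \sum_{l=1}^{m} \big(K[\mathbf{x}]^{-1}\big)_{jl} \sum_{s=1}^{m} \big(K[\mathbf{x}]^{-1}\big)_{j's} \big(K[\mathbf{x}]\big)_{sl} = \eta'^2 \sum_{j \in J} \big(K[\mathbf{x}]^{-1} e\big)_j$$
$$\le \eta'^2 \big\|K[\mathbf{x}]^{-1} e\big\|_{\ell^1(J)} \le \eta'^2 \sqrt{m}\, \big\|K[\mathbf{x}]^{-1} e\big\|_{\ell^2(J)} \le \eta'^2 \sqrt{m}\, \|K[\mathbf{x}]^{-1}\|_2 \sqrt{m} = \eta'^2 m \|K[\mathbf{x}]^{-1}\|_2,$$
where $e$ is the vector in $\ell^2(J)$ with all components 1 (extended by zero outside $J$). It follows that $\|u_J\|_K \le \eta' \sqrt{m}\, \|K[\mathbf{x}]^{-1}\|_2^{1/2}$,
and $u_J \in B_R$ since $\|K[\mathbf{x}]^{-1}\|_2 \le \frac{1}{m} (R/\eta')^2$.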
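The packing set built in this proof can be reproduced on a small example. The sketch below assumes a Gaussian kernel with four nodes on the line and replaces the operator norm $\|K[\mathbf{x}]^{-1}\|_2$ by the Frobenius norm, which is an upper bound for it, so that the condition on $\eta'$ holds a fortiori; the helper names and all numerical values are illustrative assumptions.

```python
from itertools import combinations
from math import exp, sqrt

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting (inputs are not modified)."""
    m = len(A)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, m):
            factor = M[r][col] / M[col][col]
            for c in range(col, m + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * m
    for r in range(m - 1, -1, -1):
        z[r] = (M[r][m] - sum(M[r][c] * z[c] for c in range(r + 1, m))) / M[r][r]
    return z

def kernel(x, y, sigma=0.5):
    return exp(-(x - y) ** 2 / sigma ** 2)

nodes = [0.0, 0.3, 0.55, 0.9]
m = len(nodes)
R = 1.0
gram = [[kernel(p, q) for q in nodes] for p in nodes]

# Inverse Gramian, one linear solve per unit vector.
cols = [solve(gram, [1.0 if i == j else 0.0 for i in range(m)]) for j in range(m)]
inv = [[cols[j][i] for j in range(m)] for i in range(m)]

# The Frobenius norm bounds the 2-norm from above, so this eta' is admissible.
frob = sqrt(sum(v * v for row in inv for v in row))
eta_prime = R / sqrt(m * frob)

subsets = [J for size in range(1, m + 1) for J in combinations(range(m), size)]

def norm_sq(J):
    """Squared H_K norm of u_J = eta' * sum_{j in J} u_j."""
    return eta_prime ** 2 * sum(inv[j][jp] for j in J for jp in J)

def value_at_node(J, i):
    """u_J(x_i), evaluated through the kernel expansion of the nodal functions."""
    return eta_prime * sum(inv[j][l] * gram[l][i] for j in J for l in range(m))

all_in_ball = all(norm_sq(J) <= R ** 2 + 1e-9 for J in subsets)
min_gap = min(
    max(abs(value_at_node(J1, i) - value_at_node(J2, i)) for i in range(m))
    for J1, J2 in combinations(subsets, 2)
)

print(len(subsets), all_in_ball, abs(min_gap - eta_prime) < 1e-8)
```

With $m = 4$ nodes this produces $2^4 - 1 = 15$ functions in $B_R$ whose pairwise sup-norm distance, already at the nodes, is at least $\eta'$.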

Thus lower bounds for packing numbers and covering numbers can be obtained in terms
of the norm of the inverse of the Gramian matrix. The latter can be estimated for convolution-type kernels, that is, kernels $K(x, y) = k(x - y)$ for some function $k$ in $\mathbb{R}^n$, in terms of the
Fourier transform $\hat{k}$ of $k$.
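Proposition 5.22 Suppose $K(x, y) = k(x - y)$ is a Mercer kernel on $X = [0,1]^n$ and the
Fourier transform of $k$ is positive; that is,
$$\hat{k}(\xi) > 0, \qquad \forall \xi \in \mathbb{R}^n.$$
For $N \in \mathbb{N}$, if $X_N := \{0, 1, \dots, N-1\}^n$ and $\mathbf{x} = \{\alpha/N\}_{\alpha \in X_N}$, then
$$\|K[\mathbf{x}]^{-1}\|_2 \le N^{-n} \Big( \inf_{\xi \in [-N\pi, N\pi]^n} \hat{k}(\xi) \Big)^{-1}.$$
Proof. By the inverse Fourier transform,
$$k(x) = (2\pi)^{-n} \int_{\mathbb{R}^n} \hat{k}(\xi)\, e^{i x \cdot \xi}\, d\xi,$$
we know that for any vector $c := (c_\alpha)_{\alpha \in X_N}$,
$$c^T K[\mathbf{x}]\, c = \sum_{\alpha, \beta \in X_N} c_\alpha c_\beta (2\pi)^{-n} \int_{\mathbb{R}^n} \hat{k}(\xi)\, e^{i((\alpha/N) - (\beta/N)) \cdot \xi}\, d\xi = (2\pi)^{-n} \int_{\mathbb{R}^n} \hat{k}(\xi) \Big| \sum_{\alpha \in X_N} c_\alpha e^{i \alpha \cdot \xi / N} \Big|^2 d\xi = (2\pi)^{-n} N^n \int_{\mathbb{R}^n} \hat{k}(N\xi) \Big| \sum_{\alpha \in X_N} c_\alpha e^{i \alpha \cdot \xi} \Big|^2 d\xi.$$
Bounding from below the integral over the subset $[-\pi, \pi]^n$, we see that
$$c^T K[\mathbf{x}]\, c \ge (2\pi)^{-n} N^n \Big( \inf_{\eta \in [-N\pi, N\pi]^n} \hat{k}(\eta) \Big) \int_{[-\pi, \pi]^n} \Big| \sum_{\alpha \in X_N} c_\alpha e^{i \alpha \cdot \xi} \Big|^2 d\xi = \|c\|_{\ell^2(X_N)}^2\, N^n \inf_{\xi \in [-N\pi, N\pi]^n} \hat{k}(\xi).$$
It follows that the smallest eigenvalue of the matrix $K[\mathbf{x}]$ is at least
$$N^n \inf_{\xi \in [-N\pi, N\pi]^n} \hat{k}(\xi),$$
from which the estimate for the norm of the inverse matrix follows.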
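The eigenvalue estimate can be probed numerically in one dimension. The sketch below assumes the Gaussian $k(x) = \exp\{-x^2/\sigma^2\}$, whose Fourier transform under the convention used in the proof is $\hat{k}(\xi) = \sigma\sqrt{\pi}\, e^{-\sigma^2 \xi^2/4}$, and compares Rayleigh quotients $c^T K[\mathbf{x}] c / \|c\|^2$ with the lower bound $N \inf \hat{k}$. The test vectors, the seed, and the parameter values are illustrative assumptions.

```python
import random
from math import exp, pi, sqrt

sigma, N = 0.2, 6
nodes = [a / N for a in range(N)]                      # x = {alpha / N}, n = 1
gram = [[exp(-(p - q) ** 2 / sigma ** 2) for q in nodes] for p in nodes]

def k_hat(xi):
    """Fourier transform of exp(-x^2 / sigma^2) with k(x) = (2 pi)^-1 * integral of k_hat(xi) e^{i x xi}."""
    return sigma * sqrt(pi) * exp(-sigma ** 2 * xi ** 2 / 4)

# k_hat is even and decreasing in |xi|, so its infimum on [-N pi, N pi] sits at the endpoint.
lower = N * k_hat(N * pi)

def rayleigh(c):
    quad = sum(c[i] * gram[i][j] * c[j] for i in range(N) for j in range(N))
    return quad / sum(v * v for v in c)

rng = random.Random(0)
vectors = [[rng.uniform(-1, 1) for _ in range(N)] for _ in range(200)]
vectors.append([(-1) ** i for i in range(N)])          # oscillating vector, close to the worst case
smallest_quotient = min(rayleigh(c) for c in vectors)

print(lower, smallest_quotient, smallest_quotient >= lower)
```

Sampling Rayleigh quotients only gives an upper estimate of the smallest eigenvalue, so the comparison illustrates the inequality rather than measuring how sharp it is.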
Combining Theorem 5.21 and Proposition 5.22, we obtain the following result.
Theorem 5.23 Suppose $K(x, y) = k(x - y)$ is a Mercer kernel on $X = [0,1]^n$ and the Fourier
transform of $k$ is positive. Then, for $N \in \mathbb{N}$,
$$\ln N(I_K(B_R), \eta/2) \ge \ln M(I_K(B_R), \eta) \ge \ln 2\, \{N^n - 1\},$$
provided $N$ satisfies
$$\inf_{\xi \in [-N\pi, N\pi]^n} \hat{k}(\xi) \ge \Big(\frac{\eta}{R}\Big)^2.$$
As an example, we use Theorem 5.23 to give lower bounds for covering numbers of balls
in RKHSs in the case of Gaussian kernels.
Corollary 5.24 Let $\sigma > 0$, $n \in \mathbb{N}$, and
$$k(x) = \exp\Big\{ -\frac{\|x\|^2}{\sigma^2} \Big\}, \qquad x \in \mathbb{R}^n.$$
Set $X = [0,1]^n$, and let the kernel $K$ be given by
$$K(x, t) = k(x - t), \qquad x, t \in [0,1]^n.$$
Then, for 0 < n < R (o *Jn/2)n/2e no2n2/&,
+ ln(os/n) ln 2 .
ln N(IK(BR), n) > ln 2
on n


ln R
} n| 2

- + 1 ln2
Proof. Since $\hat{k}$ is positive, we may use Theorem 5.23.


On the smoothness of box spline kernels

The only result of this section, Proposition 5.25, shows a way to construct box spline kernels
with a prespecified smoothness r e N.
Proposition 5.25 Let $B_0 = [b_1, \dots, b_n]$ be an invertible $n \times n$ matrix. Let $B = [B_0\ B_0\ \dots\ B_0]$ be an $s$-fold copy of $B_0$ and $k(x) = (M_B * M_B)(x)$ be induced by the convolution of the
box spline $M_B$ with itself. Finally, let $X \subseteq \mathbb{R}^n$, and let $K: X \times X \to \mathbb{R}$ be defined by $K(x, y) = k(x - y)$. Then $K \in C^r(X \times X)$ for all $r < 2s - n$.

sin((f b})/2) (H b} )/2


Proof. By Example 2.17, the Fourier transform k of k(x) = (MB * MB)(x) satisfiesTo get the
smoothness of K we estimate the decay of k. First we observe that the function t ^ (sin t)/t
satisfies, for all t e (-1 ,1 ),

< <
sin t
1 + |t|
sin t

1 +1 '
Also, when t e (-1,1), |(sin t)/t| < 1. Hence, for all t e R,
It follows that for all % e Rn,
'sin((% bj)/2)\2
j=A (% bj)/2
' ~ nn=1 (1 +\ ( % bj)/2\2)'
j=1 V
% bj
n ( nf\
1 A 2

+ "4" 1 > 1 + z nj = 1 + 41
If we denote n = Bj% , then % bj = bj % = nj and
(%) <
(1 + y % y2)-s.
min{1, |A0 12 /4}
1 +1 M 2 I I % y 2
But | | n y 2 = ||Bj%y2 >|A0|2||%||2, where X0 is the smallest (in modulus) eigenvalue of B0. It
follows that for all % e Rn,

(1 + y % y ?k(%)| d% <
min{1, |A0 12 /4}
1 + y% y2
d % < TO
Therefore, for any p < 2s - |,


and, thus, k e Hp(Rn). By the Sobolev embedding theorem, K e Cr(X x

X) for all r < 2s n.

References and additional remarks

Properties of function spaces on bounded domains X are discussed in [120]. In particular, one can find conditions on X (such as
having a minimally smooth boundary) for the extension of function classes on a bounded domain X to the corresponding classes on
Estimating covering numbers for various function spaces is a standard theme in the fields of function spaces [47] and
approximation theory [78, 100]. The upper and lower bounds (5.2) for generalized Lipschitz spaces and, more generally, Triebel-Lizorkin spaces can be found in [47].

The upper bounds for covering numbers of balls of RKHSs associated with Sobolev smooth kernels described in Section 5.2
(Theorem 5.8) and the lower bounds given in Section 5.4 (Theorem 5.21) can be found in [156]. The bounds for analytic translation
invariant kernels discussed in Section 5.3 are taken from [155].
The bounds (5.23) for the Fourier transform of the inverse multiquadrics can be found in [82] and [105], where properties of
nodal functions and Proposition 5.20 can also be found.
For estimates of smoothness of general box splines sharper than those in Proposition 5.25, see [41].

Logarithmic decay of the approximation error

In Chapter 4 we characterized the regression functions and kernels for which the approximation error has a decay of order $O(R^{-\theta})$. This characterization was in terms of the integral operator $L_K$ and interpolation spaces. In this chapter we continue this discussion.

We first show, in Theorem 6.2, that for a $C^\infty$ kernel $K$ (and under a mild condition on $\rho_X$) the approximation error can decay
as $O(R^{-\theta})$ only if $f_\rho$ is $C^\infty$ as well. Since the latter is too strong a requirement on $f_\rho$, we now focus on regression functions and
kernels for which a logarithmic decay in the approximation error holds. Our main result, Theorem 6.7, is very general and allows for
several applications. The following result, which will be proved in Section 6.4, shows some such consequences for our two main examples of
analytic kernels.
Theorem 6.1 Let $X$ be a compact subset of $\mathbb{R}^n$ with piecewise smooth boundary and $f_\rho \in H^s(X)$ with $s > 0$. Let $\sigma, c > 0$.
(i) (Gaussian) For $K(x, t) = e^{-\|x - t\|^2/\sigma^2}$ we have
$$\inf_{\|g\|_K \le R} \|f_\rho - g\|_{L^2(X)} \le C (\ln R)^{-s/8}, \qquad R \ge 1,$$
where $C$ is a constant independent of $R$. When $s > n/2$,
$$\inf_{\|g\|_K \le R} \|f_\rho - g\|_{C(X)} \le C (\ln R)^{(n/16) - (s/8)}, \qquad R \ge 1.$$
(ii) (Inverse multiquadrics) For $K(x, t) = (c^2 + \|x - t\|^2)^{-\alpha}$ with $\alpha > 0$ we have
$$\inf_{\|g\|_K \le R} \|f_\rho - g\|_{L^2(X)} \le C (\ln R)^{-s/2}, \qquad R \ge 1,$$
where $C$ is a constant independent of $R$. When $s > n/2$,
$$\inf_{\|g\|_K \le R} \|f_\rho - g\|_{C(X)} \le C (\ln R)^{(n/4) - (s/2)}, \qquad R \ge 1.$$
The quantity inf ||g|^ < R || fp g!^2(x) is not the approximation error (-op;) A( fp, R)
unless px is the Lebesgue measure p. It is, however, possible to obtain bounds on A( fp, R)
using Theorem 6.1. If s > f, one can use the bound in C(X) for bounding inf ug^ < R || fp glL
(X) for an arbitrary px. This is so since sup{|| f ||L; (X) : PX is a probability measure on X} = ||f
Uc(X). In the general case,
ugii'<R U fP gUL2 X (X) < DPP UgUnf<R U fP
where Vpp denotes the operator norm of the identity
L;(X) -ii L2 X (X).
We call $D_{\mu\rho}$ the distortion of $\rho$ (with respect to $\mu$). It measures how much $\rho_X$ distorts the
ambient measure $\mu$. It is often reasonable to suppose that the distortion $D_{\mu\rho}$ is finite.
Since $\rho$ is not known, neither, in general, is $D_{\mu\rho}$. In some cases, however, the context may
provide some information about $D_{\mu\rho}$. An important case is the one in which, despite $\rho$ not
being known, we do know $\rho_X$. In this case $D_{\mu\rho}$ may be derived.
In Theorem 6.1 we assume Sobolev regularity only for the approximated function $f_\rho$. To
have better approximation orders, more information about $\rho$ should be used: for instance,
analyticity of $f_\rho$ or degeneracy of the marginal distribution $\rho_X$.




Polynomial decay of the approximation error for

In this section we use Corollary 4.17, Theorem 5.5, and the embedding relation (5.1) to prove
that a $C^\infty$ kernel cannot yield a polynomial decay in the approximation error unless $f_\rho$ is $C^\infty$
itself, assuming a mild condition on the measure $\rho_X$.
We say that a measure $\nu$ dominates the Lebesgue measure on $X$ when $d\nu(x) \ge C_0\, dx$ for some constant $C_0 > 0$.
Theorem 6.2 Assume $X \subseteq \mathbb{R}^n$ has piecewise smooth
boundary and $K$ is a $C^\infty$ Mercer kernel on $X$. Assume as well that $\rho_X$ dominates the
Lebesgue measure on $X$. If for some $\theta > 0$
$$A(f_\rho, R) := \inf_{\|g\|_K \le R} \|f_\rho - g\|_{L^2_{\rho_X}(X)}^2 = O(R^{-\theta}),$$
then $f_\rho$ is $C^\infty$ on $X$.

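Proof. Since $\rho_X$ dominates the Lebesgue measure $\mu$ we have that $\rho_X$ is nondegenerate.
Hence, $\mathcal{H}_K^+ = \mathcal{H}_K$ by Remark 4.18. By Corollary 4.17, our decay assumption
implies that $f_\rho \in (L^2_{\rho_X}(X), \mathcal{H}_K)_{\theta/(2+\theta)}$. We show that for all
$s > 0$, $f_\rho \in \mathrm{Lip}^*(s, L^2_\mu(X))$. To do so, we take $r \in \mathbb{N}$,
$r \ge 2s(2+\theta)/\theta > s$, and $t \in \mathbb{R}^n$. Let $g \in \mathcal{H}_K$ and
$x \in X_{r,t}$. Then
$\Delta_t^r f_\rho(x) = \Delta_t^r (f_\rho - g)(x) + \Delta_t^r g(x) = \sum_{j=0}^{r} \binom{r}{j} (-1)^{r-j} (f_\rho - g)(x + jt) + \Delta_t^r g(x).$
Let $\ell = 2s(2+\theta)/\theta$. Using the triangle inequality and the definition of
$\mathrm{Lip}^*(\ell/2, C(X))$, it follows that
$\Big\{ \int_{X_{r,t}} |\Delta_t^r f_\rho(x)|^2 \, dx \Big\}^{1/2} \le \sum_{j=0}^{r} \binom{r}{j} \|f_\rho - g\|_{L^2_\mu(X)} + \|\Delta_t^r g\|_{L^2_\mu(X_{r,t})}$
$\le 2^r \|f_\rho - g\|_{L^2_\mu(X)} + \sqrt{\mu(X)}\, |g|_{\mathrm{Lip}^*(\ell/2, C(X))} \|t\|^{\ell/2}.$
Since $K$ is $C^\infty$, $g \in \mathcal{H}_K$, and $r > \ell/2$, we can apply Theorem 5.5 to
deduce that
$\|g\|_{\mathrm{Lip}^*(\ell/2, C(X))} \le \sqrt{2^{r+1} \|K\|_{\mathrm{Lip}^*(\ell)}}\, \|g\|_K.$
Also, $d\rho_X(x) \ge C_0\,dx$ implies that
$\|f_\rho - g\|_{L^2_\mu(X)} \le (1/\sqrt{C_0}) \|f_\rho - g\|_{L^2_{\rho_X}(X)}$ and
$\mu(X) \le 1/C_0$. By taking the infimum over $g \in \mathcal{H}_K$, we see that
$\Big\{ \int_{X_{r,t}} |\Delta_t^r f_\rho(x)|^2 \, dx \Big\}^{1/2} \le \frac{1}{\sqrt{C_0}} \inf_{g \in \mathcal{H}_K} \Big\{ 2^r \|f_\rho - g\|_{L^2_{\rho_X}(X)} + \sqrt{2^{r+1} \|K\|_{\mathrm{Lip}^*(\ell)}}\, \|g\|_K \|t\|^{\ell/2} \Big\}$
$\le \frac{2^r + \sqrt{2^{r+1} \|K\|_{\mathrm{Lip}^*(\ell)}}}{\sqrt{C_0}} \inf_{g \in \mathcal{H}_K} \Big\{ \|f_\rho - g\|_{L^2_{\rho_X}(X)} + \|t\|^{\ell/2} \|g\|_K \Big\}.$
Since $f_\rho \in (L^2_{\rho_X}(X), \mathcal{H}_K)_{\theta/(2+\theta)}$, by the definition of the
interpolation space in terms of the K-functional, we have
$\inf_{g \in \mathcal{H}_K} \Big\{ \|f_\rho - g\|_{L^2_{\rho_X}(X)} + \|t\|^{\ell/2} \|g\|_K \Big\} = K(f_\rho, \|t\|^{\ell/2}) \le C_\theta \|t\|^{\ell\theta/(2(2+\theta))} = C_\theta \|t\|^{s},$
where $C_\theta$ may be taken as the norm of $f_\rho$ in the interpolation space. It follows that
$|f_\rho|_{\mathrm{Lip}^*(s, L^2_\mu(X))} = \sup_{t \in \mathbb{R}^n} \|t\|^{-s} \Big\{ \int_{X_{r,t}} |\Delta_t^r f_\rho(x)|^2 \, dx \Big\}^{1/2} \le \frac{2^r + \sqrt{2^{r+1} \|K\|_{\mathrm{Lip}^*(\ell)}}}{\sqrt{C_0}}\, C_\theta < \infty.$
Therefore, $f_\rho \in \mathrm{Lip}^*(s, L^2_\mu(X))$. By (5.1) this implies $f_\rho \in C^d(X)$ for
any integer $d < s - n/2$. But $s$ can be arbitrarily large, from which it follows that
$f_\rho \in C^\infty(X)$.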


Measuring the regularity of the kernel

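The approximation error depends not only on the regularity of the approximated function but
also on the regularity of the Mercer kernel. We next measure the regularity of a Mercer kernel
$K$ on a finite set of points $\mathbf{x} = \{x_1, \ldots, x_m\} \subset X$. To this end, we
introduce the following function:
$\varepsilon_K(\mathbf{x}) := \sup_{x \in X} \inf_{w \in \mathbb{R}^m} \Big\{ K(x,x) - 2\sum_{i=1}^{m} w_i K(x, x_i) + \sum_{i,j=1}^{m} w_i K(x_i, x_j) w_j \Big\}^{1/2}. \qquad (6.1)$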

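We show that by choosing $w_i$ appropriately, one has $\varepsilon_K(\mathbf{x}) \to 0$ as
$\mathbf{x}$ becomes dense in $X$. It is the order of decay of $\varepsilon_K(\mathbf{x})$ with
respect to the density of $\mathbf{x}$ in $X$ that now measures the regularity of functions in
$\mathcal{H}_K$. The faster the decay, the more regular the functions.
As an example to see how $\varepsilon_K(\mathbf{x})$ measures the regularity of functions in
$\mathcal{H}_K$, suppose that for some $0 < s \le 1$, the kernel $K$ is Lip($s$); that is,
$|K(x, y) - K(x, t)| \le C (d(y, t))^s, \quad \forall x, y, t \in X,$
where $C$ is a constant independent of $x, y, t$. Define the number
$d_{\mathbf{x}} := \max_{x \in X} \min_{i \le m} d(x, x_i)$
to measure the density of $\mathbf{x}$ in $X$. Let $x \in X$. Choose $x_\ell \in \mathbf{x}$ such
that $d(x, x_\ell) \le d_{\mathbf{x}}$. Set the coefficients $\{w_j\}_{j=1}^m$ as $w_\ell = 1$, and
$w_j = 0$ if $j \ne \ell$. Then
$K(x,x) - 2\sum_{i=1}^{m} w_i K(x, x_i) + \sum_{i,j=1}^{m} w_i K(x_i, x_j) w_j = K(x,x) - 2K(x, x_\ell) + K(x_\ell, x_\ell).$
The Lip($s$) regularity and the symmetry of $K$ yield
$K(x,x) - 2K(x, x_\ell) + K(x_\ell, x_\ell) \le 2C (d(x, x_\ell))^s \le 2C d_{\mathbf{x}}^s.$
Taking square roots and the supremum over $x \in X$, we get
$\varepsilon_K(\mathbf{x}) \le \sqrt{2C}\, d_{\mathbf{x}}^{s/2}.$
In particular, if $X = [0,1]$ and $\mathbf{x} = \{j/N\}_{j=0}^{N}$, then $d_{\mathbf{x}} \le 1/(2N)$,
and therefore
$\varepsilon_K(\mathbf{x}) \le 2^{(1-s)/2} \sqrt{C}\, N^{-s/2}.$
We obtain a polynomial decay with exponent $s/2$.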
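The Lip($s$) example can be checked numerically. The sketch below is an illustration only, not part of the original text: it assumes the kernel $K(x,y) = e^{-|x-y|}$ on $[0,1]$ (a Mercer kernel that is Lip(1) with $C = 1$), and the function names are hypothetical. It evaluates the quadratic form of (6.1) at the nearest-node weights $w = e_\ell$, which gives an upper bound for the power function, and compares it with $\sqrt{2C}\,(1/(2N))^{s/2}$.

```python
import math

def laplace_kernel(x, y):
    """Assumed example kernel K(x, y) = exp(-|x - y|); Lip(1) with constant C = 1."""
    return math.exp(-abs(x - y))

def nearest_node_form(kernel, nodes, x):
    """K(x,x) - 2K(x,x_l) + K(x_l,x_l) for the node x_l nearest to x.

    This is the quadratic form in (6.1) evaluated at w = e_l, so it is an
    upper bound for the infimum over all weight vectors w.
    """
    x_l = min(nodes, key=lambda node: abs(x - node))
    return kernel(x, x) - 2.0 * kernel(x, x_l) + kernel(x_l, x_l)

def power_function_upper_bound(kernel, nodes, test_points):
    """Upper bound for eps_K(nodes), with the sup over X replaced by a finite grid."""
    return max(math.sqrt(max(nearest_node_form(kernel, nodes, x), 0.0))
               for x in test_points)

def lipschitz_bound(C, s, N):
    """The bound sqrt(2C) * d^(s/2) with d = 1/(2N) for the nodes j/N on [0, 1]."""
    return math.sqrt(2.0 * C) * (1.0 / (2.0 * N)) ** (s / 2.0)

if __name__ == "__main__":
    test_points = [i / 1000 for i in range(1001)]
    for N in (4, 16, 64):
        nodes = [j / N for j in range(N + 1)]
        print(N,
              power_function_upper_bound(laplace_kernel, nodes, test_points),
              lipschitz_bound(1.0, 1.0, N))
```

Both printed columns shrink as $N$ grows, and the first never exceeds the second, in line with the decay of order $N^{-s/2}$.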

When K is Cs, the function eK (x) decays as O(d). For analytic kernels, eK (x) often
decays exponentially. In this section we derive these decaying rates for the function tK (x).
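Recall the Lagrange interpolation polynomials $\{w_{l,s-1}(t)\}_{l=0}^{s-1}$ on $[0,1]$ with
interpolating points $\{0, 1/(s-1), \ldots, 1\}$ defined by (5.3) with $s$ replaced by $s - 1$.
For any polynomial $p$ of degree at most $s - 1$,
$\sum_{l=0}^{s-1} w_{l,s-1}(t)\, p(l/(s-1)) = p(t).$
This, together with the Taylor expansion of $f$ at $t$,
$f(y) = \sum_{j=0}^{s-1} \frac{f^{(j)}(t)}{j!} (y - t)^j + R_s(f)(y, t),$
implies that
$\sum_{l=0}^{s-1} w_{l,s-1}(t) \big( f(l/(s-1)) - f(t) \big) = \sum_{l=0}^{s-1} w_{l,s-1}(t)\, R_s(f)(l/(s-1), t).$
Here $R_s(f)$ is the linear operator representing the remainder of the Taylor expansion and
$|R_s(f)(y, t)| = \Big| \frac{1}{(s-1)!} \int_t^y (y - u)^{s-1} f^{(s)}(u)\, du \Big| \le \frac{|y - t|^s}{s!} \|f^{(s)}\|_\infty \le \frac{1}{s!} \|f^{(s)}\|_\infty.$
Using Lemma 5.9, we now obtain
$\Big| \sum_{l=0}^{s-1} w_{l,s-1}(t) \big( f(l/(s-1)) - f(t) \big) \Big| \le \frac{(s-1) 2^{s-1}}{s!} \|f^{(s)}\|_\infty. \qquad (6.2)$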

Recall also the multivariate Lagrange interpolation polynomials
Wa ,s-1 (X) = ]~[ Waj ,s-1 (Xj ),
x = (xi, . . . , xn) , a = ( a i, . . . , an) e{0 , . . . , s - 1 }n
defined by (5.6).
Now we can estimate eK(x) for Cs kernels as follows.
-K(x,y) e C( [ 0,1] 2n) .
da (
d| a |
- Ka x y) =
dy- ---dyr
Theorem 6.3 LetX = [0,1]n, s e N, and K be a Mercer kernel on X such that for each a e
Nn with | a | < s,
CK (x) <
22sn+1 ss+2n
C (X xX )
Then, for N > sand x = {a/N }ae{0,1,..., N -1 }n, we have

Proof. Let x e X. Then x = N + 1-11 for some fi e { 0, 1, . . . , N - s + 1}n and t e [0,1]n. Choose
the coefficients in the definition of eK (x) to be

wa =

Then we can see that the expression in the definition of eK is

fi + Y'
K (x, x) - 2 ^2
WY ,s-1 (t)K x,
Y e{0,...,s-1}n
WY ,s-1(t)K
Y ,ne{0,...,s- 1}n
fi + Y fi + n
wv,s-1 (t).
wy ,s-1 (t) if a = fi + Y, Y e{ 0, . . . , s - 1}n 0 otherwise.

Y ,s-1(t) K (xx) - K x

WY ,s-1(t)
Y e{0,...,s-1}
Y e{0,...,s-1}n

w,J.1(t) K(, i + l ) - K (,x

e{0 ... s- 1 }n wY,s-1(t) = 1. So the above expression equals
Yi,s-1(ti) g
y-~1 - g(ti)
Yi =0
(s - 1)2 s!
d sf
d tf
Using Equation (6.2) for the univariate function g(z) = f (Y1/(s - 1) , . . . , (Yi-1)f(s - 1) , z ,
t j + i , . . . , tn) with z e [0,1] and all the other variables fixed, we get
~\ w s-1(t) f
' Y1
- 1
Vs - 1 . s
1) 2



d sf


s-1 s-1 s-1
Yi-1 Yi
i+1, ... , tt
< ((s - 1)2s-1)n-1



Using Lemma 5.9 for Yj, j = i, we conclude that for a function f on [0,1]n and for i =
1 , . . . , n , Y e{0,...,s-1}'
WY ,s-1(t) f
1)2s-1)n s!
C ( [0 ,1 ]n)



- f (t )
Replacing Y /(s - 1) by u each time for one i e { 1 , , n}, we obtain
(1 + ((s - 1)2s-1)n'j


-Un\ ((s - 1)2s-1)n S s \ s s!
d sK
d ys
C (X xX )

Applying this estimate to the functions f (t) = K (x, ^+(sN


^ and K , ^+(sN~V>t^j, we find that

the expression for cK can be bounded by

This bound is valid for each x e X. Therefore, we obtain the required estimate for eK (x) by
taking the supremum for x e X.

The behavior of the quantity $\varepsilon_K(\mathbf{x})$ is better if the kernel is of convolution
type, that is, if $K(x, y) = k(x - y)$ for an analytic function $k$.
Theorem 6.4 Let X = [0,1] and K (x, y) = k (x y) be a Mercer kernel on X with
k(f) < Coek^', Vf e Rn
for some constants C0 > 0 and X >4 + 2n ln4. Then, for x = {N}ae{0,1,...,N1}n with N >
4n/ ln min{eX, 4neX/2}, we have
\ 1 4n 1\N/ 2
fK (x)
<4 C 0 max ex

Proof. Let $X_N := \{0, \ldots, N - 1\}^n$. For a fixed $x \in X$, choose the coefficients $w_i$
in (6.1) to be $w_{\alpha,N}(x)$. Then the expression in the definition of
$\varepsilon_K(\mathbf{x})$ becomes $Q_N(x)$ given by (5.13). It follows that
$\varepsilon_K(\mathbf{x}) \le \sup_{x \in X} Q_N(x)$. Hence by (5.14),
$\varepsilon_K(\mathbf{x}) \le T_k(N).$

But the assumption on the kernel here verifies the condition in Theorem 5.15. Thus,
we can apply the estimate (5.21) for $T_k(N)$ to draw our conclusion here.

6.3 Estimating the approximation error in RKHSs

Recall the nodal functions $\{u_i = u_{i,\mathbf{x}}\}_{i=1}^{m}$ associated with a finite subset
$\mathbf{x}$ of $X$, given by (5.25). We use them in the RKHS $\mathcal{H}_K$ on a compact metric
space $(X, d)$ to construct an interpolation scheme. This scheme is defined as follows:
$I_{\mathbf{x}}(f)(x) = \sum_{i=1}^{m} f(x_i)\, u_i(x), \quad x \in X, \ f \in C(X). \qquad (6.3)$
It satisfies $I_{\mathbf{x}}(f)(x_i) = f(x_i)$ for $i = 1, \ldots, m$.
The error of the interpolation scheme for functions in $\mathcal{H}_K$ can be estimated as follows.
Proposition 6.5 Let $K$ be a Mercer kernel and $\mathbf{x} = \{x_1, \ldots, x_m\} \subset X$ such
that $K[\mathbf{x}]$ is invertible. Define the interpolation scheme $I_{\mathbf{x}}$ by (6.3).
Then, for $f \in \mathcal{H}_K$,
$\|I_{\mathbf{x}}(f) - f\|_{C(X)} \le \varepsilon_K(\mathbf{x}) \|f\|_K$
and
$\|I_{\mathbf{x}}(f)\|_K \le \|f\|_K.$
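Proof. For $x \in X$,
$I_{\mathbf{x}}(f)(x) - f(x) = \sum_{i=1}^{m} f(x_i) u_i(x) - f(x) = \Big\langle \sum_{i=1}^{m} u_i(x) K_{x_i} - K_x, f \Big\rangle_K,$
the second equality by the reproducing property of $\mathcal{H}_K$ applied to $f$. By the
Cauchy-Schwarz inequality in $\mathcal{H}_K$,
$|I_{\mathbf{x}}(f)(x) - f(x)| \le \Big\| \sum_{i=1}^{m} u_i(x) K_{x_i} - K_x \Big\|_K \|f\|_K.$
Since $\langle K_s, K_t \rangle_K = K(s, t)$, we have
$\Big\| \sum_{i=1}^{m} u_i(x) K_{x_i} - K_x \Big\|_K^2 = K(x,x) - 2\sum_{i=1}^{m} u_i(x) K(x, x_i) + \sum_{i,j=1}^{m} u_i(x) K(x_i, x_j) u_j(x).$
By Proposition 5.20, the quadratic function
$Q(w) = K(x,x) - 2\sum_{i=1}^{m} w_i K(x, x_i) + \sum_{i,j=1}^{m} w_i K(x_i, x_j) w_j$
is minimized over $\mathbb{R}^m$ at $(u_i(x))_{i=1}^{m}$. Therefore,
$\Big\| \sum_{i=1}^{m} u_i(x) K_{x_i} - K_x \Big\|_K \le \varepsilon_K(\mathbf{x}).$
It follows that
$|I_{\mathbf{x}}(f)(x) - f(x)| \le \varepsilon_K(\mathbf{x}) \|f\|_K.$
This proves the estimate for $\|I_{\mathbf{x}}(f) - f\|_{C(X)}$.
To prove the other inequality, note that, since $I_{\mathbf{x}}(f) \in \mathcal{H}_K$ and
$I_{\mathbf{x}}(f)(x_i) = f(x_i)$, for $i = 1, \ldots, m$,
$0 = I_{\mathbf{x}}(f)(x_i) - f(x_i) = \langle K_{x_i}, I_{\mathbf{x}}(f) - f \rangle_K.$
This means that $I_{\mathbf{x}}(f) - f$ is orthogonal to $\mathrm{span}\{K_{x_i}\}_{i=1}^{m}$. Hence
$I_{\mathbf{x}}(f)$ is the orthogonal projection of $f$ onto $\mathrm{span}\{K_{x_i}\}_{i=1}^{m}$
and therefore $\|I_{\mathbf{x}}(f)\|_K \le \|f\|_K$.
Proposition 6.5 bounds the interpolation error $\|I_{\mathbf{x}}(f) - f\|_{C(X)}$ in terms of the
regularity of $K$ and the density of $\mathbf{x}$ in $X$ measured by $\varepsilon_K(\mathbf{x})$,
when the approximated function $f$ lies in the RKHS (i.e., $f \in \mathcal{H}_K$).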
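A small numerical sketch of Proposition 6.5 follows. It is an illustration under stated assumptions rather than the book's construction: the Gaussian kernel with width 0.5, the five equispaced nodes, and all function names are chosen here for the example, and the nodal functions are assumed to be $u(x) = K[\mathbf{x}]^{-1} (K(x, x_i))_{i}$, so that $I_{\mathbf{x}}(f) = \sum_j c_j K_{x_j}$ with $K[\mathbf{x}] c = (f(x_i))_i$ and $\|I_{\mathbf{x}}(f)\|_K^2 = c^T K[\mathbf{x}] c$.

```python
import math

def gaussian_kernel(x, y, sigma=0.5):
    """Assumed example kernel exp(-(x - y)^2 / sigma^2) on [0, 1]."""
    return math.exp(-((x - y) ** 2) / sigma ** 2)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(A[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def gram(kernel, nodes):
    """The matrix K[x] with entries K(x_i, x_j)."""
    return [[kernel(p, q) for q in nodes] for p in nodes]

def interpolate(kernel, nodes, f):
    """Return the interpolant I_x(f) and its squared RKHS norm c^T K[x] c."""
    values = [f(p) for p in nodes]
    coeffs = solve(gram(kernel, nodes), values)

    def interpolant(x):
        return sum(c * kernel(x, p) for c, p in zip(coeffs, nodes))

    norm_sq = sum(c * v for c, v in zip(coeffs, values))
    return interpolant, norm_sq

def power_function_at(kernel, nodes, x):
    """Square root of min_w Q(w) at the point x, attained at w = K[x]^{-1} k_x."""
    k_x = [kernel(x, p) for p in nodes]
    u = solve(gram(kernel, nodes), k_x)
    q = kernel(x, x) - sum(ui * ki for ui, ki in zip(u, k_x))
    return math.sqrt(max(q, 0.0))

if __name__ == "__main__":
    nodes = [0.0, 0.25, 0.5, 0.75, 1.0]
    t = 0.3
    f = lambda x: gaussian_kernel(x, t)   # f = K_t, so ||f||_K^2 = K(t, t)
    interp, norm_sq = interpolate(gaussian_kernel, nodes, f)
    print("||I_x f||_K^2 =", norm_sq, " ||f||_K^2 =", gaussian_kernel(t, t))
    for x in (0.1, 0.3, 0.6, 0.9):
        print(x, abs(interp(x) - f(x)), power_function_at(gaussian_kernel, nodes, x))
```

Taking $f = K_t$ keeps $\|f\|_K$ explicit, so both inequalities of the proposition can be compared directly with computed quantities.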
In the remainder of this section, we deal with the interpolation error when the
approximated function is from a larger function space (e.g., a Sobolev space), not necessarily
from $\mathcal{H}_K$. To this end, we need to know how large $\mathcal{H}_K$ is compared with the
space where the approximated function lies. For a convolution-type kernel, that is, a kernel
$K(x, y) = k(x - y)$ with $\hat{k}(\xi) > 0$, this depends on how slowly $\hat{k}$ decays. So, in
addition to the function $\varepsilon_K$ measuring the smoothness of $K$, we use the function
$A_k : \mathbb{R}_+ \to \mathbb{R}_+$ defined by
$A_k(r) := \Big( \inf_{\xi \in [-r\pi, r\pi]^n} \hat{k}(\xi) \Big)^{-1/2},$
measuring the speed of decay of $\hat{k}$. Note that $A_k$ is a nondecreasing function. We also
use the function $T_k(N)$, which measures the regularity of $K$, introduced in Section 5.3.

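Lemma 6.6 Let $k \in L^2(\mathbb{R}^n)$ be an even function with $\hat{k}(\xi) > 0$, and $K$ be
the kernel on $X = [0,1]^n$ given by $K(x,y) = k(x - y)$. For $f \in L^2(\mathbb{R}^n)$ and
$M \le N \in \mathbb{N}$, we define $f_M \in L^2(\mathbb{R}^n)$ by
$\hat{f_M}(\xi) = \hat{f}(\xi)$ if $\xi \in [-M\pi, M\pi]^n$, and $\hat{f_M}(\xi) = 0$ otherwise.
Then, for $\mathbf{x} = \{0, 1/N, \ldots, (N-1)/N\}^n$, we have
(i) $\|I_{\mathbf{x}}(f_M)\|_K \le \|f\|_{L^2} A_k(N)$.
(ii) $\|f_M - I_{\mathbf{x}}(f_M)\|_{C(X)} \le \|f\|_{L^2} A_k(M) T_k(N)$.
(iii) $\|f - f_M\|^2_{L^2(X)} \le (2\pi)^{-n} \int_{\xi \notin [-M\pi, M\pi]^n} |\hat{f}(\xi)|^2 \, d\xi \to 0$ (as $M \to \infty$).
Proof.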
(i) For i, j e XN : = {0, 1, . . . , N - 1} andXi = i/N e x, expression (5.25) for the nodal function Hi
associated with x gives
(Hi, Hj )K =
(K [x]-1 )is(K [x]-1)jt Kxs, Kxt )K
= E (K[x]-1)s(K[x]-1 )jt (K[x])ts = (K[x]-1 )j.
\ix (g)iiK =
E g(Xi)Hi (x)
Then, for g e C(X), we have
= E g(xi)g(xj) (Hi, Hj)K = (g|x)TK[x ] 1 (g|x),

i,j eX

where g|x is the vector (g(xi))ieXN e RN. It follows that

l|Ix ( g ) l l K < |K [x]-112 Ilg|x 112'2 (XN ) = |K [x]-112
lg(xt )|2.
We now apply this analysis to the function fM satisfying fM () = 0 for e [Nn, Nn]n \ [Mn,
Mn]n to obtain
| fM (xj) |2

= E (2n)-n J ()ei(j/Ny d
jeXN e[-M n M n ]n
< E (2n)~n f J (N )ej Nnd
jeXN e[-n n ]
fM (N)Nn\2 d <Nn\\ f 2 -

< (2n)-n n
e[-n ,n ]
I Ix ( fM )\K <|K [x ] - 1 \ 2 N n | f Il L


But, by Proposition 5.22, ||tf [x]- 1 ||2 <N-n(Ak(N))2. Therefore, l|Ix( fM)k <llf II2^k(N).
This proves the statement in (i).
fM (x)-Ix( fM )(x) = (2n)n
e[-M n M n ]
f () eix -J2 u(x)eaj d.
,ix ,
By the Cauchy-Schwarz inequality,

2 11/2
j- d
ii; Let x e X. Then
The first term on the right is bounded by || f ||L2 Ak(M), since k(f) > A-2(M) for f e [Mn,Mn]n. The second term is
\ 1/2
k (0) - 2^2 uj (x)k (x - xj) + '22 ui (x)k (xi - xj )uj (x) ,


which can be bounded by Tk (N) according to (5.14) with {0, 1, . . . , N}n replaced by {0, 1, . . . ,
N - 1}n. Therefore,

Il fM Ix ( fM )ll c (X ) < Il f L Ak (M )Tk (N ).

iii; By Plancherels formula (Theorem 2.3),
II f fM II%2(W) = (2n)n
()\2 d.

4\M n M n ]n
Thus, all the statements hold true.

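Lemma 6.6 provides quantitative estimates for the interpolation error:
$\|f - I_{\mathbf{x}}(f_M)\|_{L^2(X)} \le \|f - f_M\|_{L^2(X)} + \|f\|_{L^2} A_k(M) T_k(N)$
with
$\|I_{\mathbf{x}}(f_M)\|_K \le \|f\|_{L^2} A_k(N).$
Choose $N = N(M) \ge M$ such that $A_k(M) T_k(N) \to 0$ as $M \to +\infty$. We then have
$\|f - I_{\mathbf{x}}(f_M)\|_{L^2(X)} \to 0$. Also, the RKHS norm of $I_{\mathbf{x}}(f_M)$ is
asymptotically controlled by $A_k(N)$.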
We can now state the main estimates for the approximation error for balls in the RKHS
$\mathcal{H}_K$ on $X = [0,1]^n$. Denote by $A_k^{-1}$ the inverse function of the nondecreasing
function $A_k$,
$A_k^{-1}(R) := \max\{r > 0 : A_k(r) \le R\}, \quad \text{for } R > 1/\hat{k}(0).$
Theorem 6.7 Let $X = [0,1]^n$, $s > 0$, and $f \in H^s(\mathbb{R}^n)$. Then, for $R > \|f\|_2$,
$\inf_{\|g\|_K \le R} \|f - g\|_{L^2(X)} \le \inf_{0 < M \le N_R} \big\{ A_k(M) \|f\|_2 T_k(N_R) + \|f\|_s (\pi M)^{-s} \big\},$
where $N_R = \lfloor A_k^{-1}(R/\|f\|_2) \rfloor$, the integer part of $A_k^{-1}(R/\|f\|_2)$. If
$s > n/2$, then
inf II f -g\\c(X)y <
Ak(M)||f ||2T*(NR) + _\ Mm)-s .
l l g IK < R

0 < M < NR
s - n/2
Proof. Take $N$ to be $N_R$.
Let $M \in (0, N]$. Set the function $f_M$ as in Lemma 6.6. Then, by Lemma 6.6,
$\|I_{\mathbf{x}}(f_M)\|_K \le \|f\|_2 A_k(N) \le R$ and
$\|f - I_{\mathbf{x}}(f_M)\|_{L^2(X)} \le \|f_M - I_{\mathbf{x}}(f_M)\|_{C(X)} + \|f - f_M\|_{L^2(X)}$
$\le A_k(M) \|f\|_2 T_k(N) + \|f\|_s (\pi M)^{-s}.$
If s > |, then
ll f - fM 1C (X) < (2n)-n
V(m H
s - n/2
Hence the second statement of the theorem also holds.
Corollary 6.8 Let X = [0,1]n, s > 0, and f e Hs(Rn). If for some a1, a2, Ci, C2 > 0, one has
k(%) > C1(1 + %|)-a1, V% e R'and
Tk (N) < C2 N-2 ,

VN e N,

then, for R >(1 + 1/Ci)(1 + 1/(0))|| f ||2 ,

ii < R f - g^* > < |C3f 2 + C-2 "1 IIf 4 (fi;)" ,
where C3 := 22 C2 C-2 2 M (21 /Ci + (Vnn)ai Ci)1 /2 and

(a1 +2 s)

if a1



2s ai ,
if a1 + 2s < 2a2.
If, in addition, s > n/2, then


I / R \-Y'

3\l f ||2 +
. -------= ~
Vs - n / 2
v'= ( aaa+ fai + s - n > a
I 2a-n,
if ai + 2s - n < 2a2.
Proof. By the assumption on the lower bound of k, we find
Ak (r) <
(1 + Vn r)

It follows that for R/|| f ||2 - max{1/C1,1/k(0)},

A- ^ R / H f 2) - C^1 (R/||f ||2)2M and
NR - 2C2M (R/|| f 2)2/a1.
Tk (N) < C2 ([A-1(R/|| f H2)]) 2 < 2a2C2(C1R/\f Ik)-22/a1.
Then, by Theorem 6.7,
(1 + nnM ) \1/2
inf || f - g\\L2(X) < inf IlgllK < R
0 < M < NR
( C,R \ -2 a2/a1
22 C2 f
II f 112 +llf \\s(nM )-s
Take M = 2 C^ (R/ f \\2 ) with y as in the statement. Then, M < NR, and we can see that
gin R f - g2(X) < Nf 2 + C-!'/"1 1 f -I fy.
This proves the first statement of the corollary. The second statement can be proved in the
same way.

Corollary 6.9 Let X = [0,1]n, s > 0, and f e Hs(Rn). If for some a1, a2, Si, S2, Ci, C2 > 0, one
Tk (N) < C2 exp {-S2Na2}, VN e N, then, for R > (1 + A/C1)|| f 2,
f -gL2(X) < 2= f 2 + SsM2sns/2ll f lls lnR + ln
IlglK <R
where A, B are constants depending only on a1, a2, S1, S2, and
-Y s
k(f) > C1 exp {-S1H |a1},
Vf e Rn
if a
1 < a2
2 if 1 > 2.
If, in addition, s > 2, then, for R > (1 + A/C1)\\ f 2,
Y( 2 -s)
C2 B
C1 \ry2
inf f - g H C(X) < B' = f ||2 + f ||s ln R + ln
IlglK<R 1
C1IU2 +s
f 2
where B' is a constant depending only on a1, a2, S1, S2, n, and s.
Proof. By the assumption on the lower bound of k, we find
Ak (r) < -= exp j y (nn r)1 J .
It follows that for R/|| f ||2 > max{1/Ci, 1/k(0)},
-1/ n i i i ^ M

R/U f


> ln




R Ci
U f U2

and its integer part NR satisfies

1 (2
R-C1 \1M
nn 1 Uf U2

Also, for R > U f U2 exp{(2/51)1 (2nn)a1}/C[,

Tk(NR) < C2 exp { -2[A-1 (R/U f U2)]2} < C2 exp {-2R2}.
Then, by Theorem 6.7, for R > (1 + (1 + exp{(2/51)(2nn)a1}/C1))|| f ||2,
llgu^ <R
inf II f - guL2(X) < inf R C^f}2 exp 51 (nnM)a - 2R2
0 <M <R C
+ u f IU(nM)Take
r1/121/1-1 (


M = ^----------------------- - ln
inf u f - guL2(x) < C2^12 exp
11^ <R

V u2 f 2112
2 151 /1

llgk <R
ln R-CI V2,"1 + 5f,2V/2|| f |jj(ln R-C11


is given in our statement. For R > || f ||2A/CT, M < R < A- 1 (R/| f ||2)

holds.Therefore,ifR > exp J(522-25-2/a1 (nn)-a2)a1/(Ya1 -2) J ||f ||2/C1, we have

C3 + sf12 V/2 j f IU} (ln R + ln fi)
where C3 := supx>1 x-Ysexp{-8xa2/a1} and 8 = 52-2-15-2/ai (^nn)-a2. This proves the first
statement of the corollary. The second statement follows using the same argument.

Proof of Theorem 6.1


Now we can apply Corollary 6.9 to verify Theorem 6.1. We may assume that $X \subseteq [0,1]^n$.
Since $X$ has piecewise smooth boundary, every function $f \in H^s(X)$ can be extended to a
function $F \in H^s(\mathbb{R}^n)$ such that $\|F\|_{H^s(\mathbb{R}^n)} \le C_X \|f\|_{H^s(X)}$,
where the constant $C_X$ depends only on $X$, not on $f \in H^s(X)$.
i; From (5.17) and (5.18), we know that the condition of Corollary 6.9 is satisfied with
C1 = (a^)11, 81 = ^, ax = 2 and
C2 > 0, a2 = 1,82 = ln min{16n, 2n}.
Then the bounds given in Corollary 6.9 hold. Since a1 > a2, we have Y = 8 and the first
statement of Theorem 6.1 follows with bounds depending on CX .
ii; By (5.23), we see that the condition of Corollary 6.9 is valid. Moreover,
Ci > 0,
81 = c + e, a1 = 1
2 > 0, a2 = 1, 82 = lnmin ec,
Then a1 = a2 and the bounds of Corollary 6.9 hold with Y = 5. This yields the second
statement of Theorem 6.1.


References and additional remarks

Logarithmic decays of the approximation error can be characterized by general interpolation
spaces [16], as done for polynomial decays in Chapter 4. However, sharp bounds for the
decay of the K-functional are hard to obtain. For example, it is unknown how far the power
index in Theorem 6.1 can be improved.
The function $\varepsilon_K(\mathbf{x})$ defined by (6.1) is called the power function in the
literature on radial basis functions. It was introduced by Madych and Nelson [82], and
extensively used by Wu and Schaback [147], and it plays an important role in error estimates for
scattered data interpolation using radial basis functions. In that literature (e.g., [147, 66]),
the interpolation scheme (6.3) is essential. What is different in learning theory is the
presence of an RKHS $\mathcal{H}_K$, not necessarily a Sobolev space.
Theorem 6.1 was proved in [113], and the approach in Section 6.3 was presented in

On the bias-variance problem

Let $K$ be a Mercer kernel, and $\mathcal{H}_K$ its induced RKHS. Assume that
(i) $K \in C^s(X \times X)$, and
(ii) the regression function $f_\rho$ satisfies $f_\rho \in \mathrm{Range}(L_K^{\theta/(4+2\theta)})$,
for some $\theta > 0$.
Fix a sample size $m$ and a confidence $1 - \delta$, with $0 < \delta < 1$. To each $R > 0$ we
associate a hypothesis space $\mathcal{H} = \mathcal{H}_{K,R}$, and we can consider
$f_{\mathcal{H}}$ and, for $z \in Z^m$, $f_z$. The bias-variance problem consists of finding the
value of $R$ that minimizes a natural bound for the error $\mathcal{E}(f_z)$ (with confidence
$1 - \delta$). This value of $R$ determines a particular hypothesis space in the family of such
spaces parameterized by $R$, or, to use a terminology common in the learning literature, it
selects a model.
Theorem 7.1 Let $K$ be a Mercer kernel on $X \subset \mathbb{R}^n$ satisfying conditions (i) and
(ii) above.
(i) We exhibit, for each $m \in \mathbb{N}$ and $\delta \in [0,1)$, a function
$E_{m,\delta} = E : \mathbb{R}_+ \to \mathbb{R}$
such that for all $R > 0$ and randomly chosen $z \in Z^m$,
$\int_X (f_z - f_\rho)^2 \, d\rho_X \le E(R)$
with confidence $1 - \delta$.
(ii) There is a unique minimizer $R_*$ of $E(R)$.
(iii) When $m \to \infty$, we have $R_* \to \infty$ and $E(R_*) \to 0$.
The proof of Theorem 7.1 relies on the main results of Chapters 3, 4, and 5. We show in
Section 7.3 that $R_*$ and $E(R_*)$ have the asymptotic expressions
$R_* = O\big(m^{1/((2+\theta)(1+2n/s))}\big)$ and $E(R_*) = O\big(m^{-\theta/((2+\theta)(1+2n/s))}\big)$.
It follows from the proof of Theorem 7.1 that $R_*$ may be easily computed from $m$, $\delta$,
$\|I_K\|$, $M_\rho$, $\|f_\rho\|_\infty$, $\|g\|_{L^2_{\rho_X}}$, and $\theta$. Here
$g \in L^2_{\rho_X}(X)$ is such that $L_K^{\theta/(4+2\theta)}(g) = f_\rho$. Note that this
requires substantial information about $\rho$ and, in particular, about $f_\rho$. The next
chapter provides an alternative approach to the one considered thus far whose corresponding
bias-variance problem can be solved without information on $\rho$.


A useful lemma

The following lemma will be used here and in Chapter 8.

Lemma 7.2 Let $c_1, c_2, \ldots, c_\ell > 0$ and $s > q_1 > q_2 > \cdots > q_{\ell-1} > 0$. Then
the equation
$x^s - c_1 x^{q_1} - c_2 x^{q_2} - \cdots - c_{\ell-1} x^{q_{\ell-1}} - c_\ell = 0$
has a unique positive solution $x^*$. In addition,
$x^* \le \max\big\{ (\ell c_1)^{1/(s-q_1)}, (\ell c_2)^{1/(s-q_2)}, \ldots, (\ell c_{\ell-1})^{1/(s-q_{\ell-1})}, (\ell c_\ell)^{1/s} \big\}.$
Proof. We prove the first assertion by induction on $\ell$. If $\ell = 1$, then the equation is
$x^s - c_1 = 0$, which has a unique positive solution.
For $\ell > 1$ let $\varphi(x) = x^s - c_1 x^{q_1} - c_2 x^{q_2} - \cdots - c_{\ell-1} x^{q_{\ell-1}} - c_\ell$.
Then, taking the derivative with respect to $x$,
$\varphi'(x) = s x^{s-1} - q_1 c_1 x^{q_1 - 1} - \cdots - c_{\ell-1} q_{\ell-1} x^{q_{\ell-1} - 1}$
$= s x^{q_{\ell-1} - 1} \Big( x^{s - q_{\ell-1}} - \frac{q_1 c_1}{s} x^{q_1 - q_{\ell-1}} - \cdots - \frac{c_{\ell-1} q_{\ell-1}}{s} \Big)$
$=: s x^{q_{\ell-1} - 1} \psi(x).$
By the induction hypothesis, $\varphi'$ has a unique positive zero, namely the unique positive
zero $\bar{x}$ of $\psi$. Since $\psi(0) < 0$ and $\lim_{x \to +\infty} \psi(x) = +\infty$, we
deduce that $\psi(x) < 0$ for $x \in [0, \bar{x})$ and $\psi(x) > 0$ for
$x \in (\bar{x}, +\infty)$. This implies that $\varphi'(x) < 0$ for $x \in (0, \bar{x})$ and
$\varphi'(x) > 0$ for $x \in (\bar{x}, +\infty)$. Therefore, $\varphi$ is strictly decreasing on
$[0, \bar{x})$ and strictly increasing on $(\bar{x}, +\infty)$. But $\varphi(0) < 0$ and, hence,
$\varphi(\bar{x}) < 0$. Since $\varphi$ is strictly increasing on $(\bar{x}, +\infty)$ and
$\lim_{x \to +\infty} \varphi(x) = +\infty$, we conclude that $\varphi$ has a unique zero $x^*$ on
$(\bar{x}, +\infty)$, which is its unique positive zero. The shape of $\varphi$ is as in
Figure 7.1. This proves the first statement.

Figure 7.1

To prove the second statement, let $x > \max\{ (\ell c_i)^{1/(s - q_i)} \mid i = 1, \ldots, \ell \}$,
where we set $q_\ell = 0$. Then, for $i = 1, \ldots, \ell$, $c_i < \frac{1}{\ell} x^{s - q_i}$. It
follows that
$\sum_{i=1}^{\ell} c_i x^{q_i} < \sum_{i=1}^{\ell} \frac{1}{\ell} x^{s - q_i} x^{q_i} = x^s,$
that is, $\varphi(x) > 0$.
Remark 7.3 Note that given $c_1, c_2, \ldots, c_\ell$ and $s, q_1, q_2, \ldots, q_{\ell-1}$, one
can efficiently compute (a good approximation of) $x^*$ using algorithms such as Newton's method.
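Remark 7.3 can be illustrated with a few lines of code. The sketch below is not from the original text and uses bisection rather than Newton's method, because Lemma 7.2 supplies a ready-made bracket: the polynomial-like function is negative at 0 and nonnegative at the upper bound of the lemma. The function names are hypothetical.

```python
def lemma_7_2_upper_bound(s, qs, cs):
    """max over i of (l * c_i)^(1 / (s - q_i)), with q_l = 0 as in the proof."""
    l = len(cs)
    exponents = list(qs) + [0.0]
    return max((l * c) ** (1.0 / (s - q)) for c, q in zip(cs, exponents))

def phi(x, s, qs, cs):
    """x^s - c_1 x^{q_1} - ... - c_{l-1} x^{q_{l-1}} - c_l."""
    exponents = list(qs) + [0.0]
    return x ** s - sum(c * x ** q for c, q in zip(cs, exponents))

def positive_root(s, qs, cs, tol=1e-12, max_iter=200):
    """Approximate the unique positive root by bisection on [0, upper bound]."""
    lo, hi = 0.0, lemma_7_2_upper_bound(s, qs, cs)
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if phi(mid, s, qs, cs) < 0.0:
            lo = mid      # the root lies to the right of mid
        else:
            hi = mid      # the root lies at or to the left of mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

if __name__ == "__main__":
    # x^2 - x - 2 = 0 has the unique positive root x = 2.
    print(positive_root(2.0, [1.0], [1.0, 2.0]))
```

The same routine applies to an equation of the form $c_0 v^{d+1} - c_1 v^d - c_2 = 0$ after dividing by $c_0$, that is, with `s = d + 1`, `qs = [d]`, and `cs = [c1 / c0, c2 / c0]`.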


Proof of Theorem 7.1

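We first describe the natural bound we plan to minimize. Recall that $\mathcal{E}(f_z)$ equals the
sum $\mathcal{E}_{\mathcal{H}}(f_z) + \mathcal{E}(f_{\mathcal{H}})$ of the sample and
approximation errors, or, equivalently,
$\int_X (f_z - f_\rho)^2 \, d\rho_X = \mathcal{E}_{\mathcal{H}}(f_z) + \int_X (f_{\mathcal{H}} - f_\rho)^2 \, d\rho_X.$
We first want to bound the sample error. Let
$M = M(R) := \|I_K\| R + M_\rho + \|f_\rho\|_\infty. \qquad (7.1)$
Then, for all $f \in \mathcal{H}_{K,R}$, $|f(x) - y| \le M$ almost everywhere since
$|f(x) - y| \le |f(x)| + |y| \le |f(x)| + |y - f_\rho(x)| + |f_\rho(x)| \le \|I_K\| R + M_\rho + \|f_\rho\|_\infty.$
The sample error $\varepsilon$ satisfies, with confidence $1 - \delta$, by Theorem 3.3,
$\mathcal{N}\Big(\mathcal{H}_{K,R}, \frac{\varepsilon}{12M}\Big) e^{-m\varepsilon/(300M^2)} \ge \delta$
and therefore, by Theorem 5.1 (i) with $\eta = \varepsilon/(12M)$ (which applies due to
assumption (i)),
$\exp\Big\{ C \Big( \frac{12MR}{\varepsilon} \Big)^{2n/s} \Big\} e^{-m\varepsilon/(300M^2)} \ge \delta$
(here $C$ is a constant determined by $\mathrm{Diam}(X)$, $n$, $s$, and
$\|K\|_{C^s(X \times X)}$, and independent of $R$, $\varepsilon$, and $M$) or
$\frac{m\varepsilon}{300M^2} - \ln\frac{1}{\delta} - C \Big( \frac{12M^2}{\|I_K\| \varepsilon} \Big)^{2n/s} \le 0,$
where we have also used that $R\|I_K\| \le M$. Write $v = \varepsilon/M^2$. Multiplying by
$v^{2n/s}$, the inequality above takes the form
$c_0 v^{d+1} - c_1 v^d - c_2 \le 0, \qquad (7.2)$
where $d = 2n/s$, $c_0 = m/300$, $c_1 = \ln(1/\delta)$, and $c_2 = C (12/\|I_K\|)^d$.
If we take the equality in (7.2), we obtain an equation that, by Lemma 7.2, has exactly
one positive solution for $v$. Let $v^*(m, \delta)$ be this solution. Then
$\varepsilon(R) = M^2 v^*(m, \delta)$ is the best bound we can obtain from Theorem 3.3 for the
sample error.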
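Now consider the approximation error. Owing to assumption (ii), Theorem 4.1 applies to yield
$\mathcal{A}(f_\rho, R) \le 2^{2+\theta} \|g\|^{2+\theta} R^{-\theta} =: A(R),$
where $g \in L^2_{\rho_X}(X)$ is such that $L_K^{\theta/(4+2\theta)}(g) = f_\rho$ and
$\mathcal{A}(f_\rho, R) = \inf_{f \in \mathcal{H}_{K,R}} \mathcal{E}(f) - \mathcal{E}(f_\rho) = \inf_{\|f\|_K \le R} \|f - f_\rho\|^2_{L^2_{\rho_X}}.$
We can therefore take $E(R) = A(R) + \varepsilon(R)$ and Part (i) is proved.
We now proceed with Part (ii). For a point $R > 0$ to be a minimum of $A(R) + \varepsilon(R)$,
it is necessary that $A'(R) + \varepsilon'(R) = 0$. Taking derivatives and noting that by (7.1),
$M'(R) = \|I_K\|$, we get
$A'(R) = -2^{2+\theta} \|g\|^{2+\theta} \theta R^{-\theta-1}$
and $\varepsilon'(R) = 2M \|I_K\| v^*(m, \delta)$.
Therefore, writing $Q = 1/R$, it is necessary that
$\big(2^{2+\theta} \|g\|^{2+\theta} \theta\big) Q^{\theta+2} - \big(2(M_\rho + \|f_\rho\|_\infty) \|I_K\| v^*(m, \delta)\big) Q - 2\|I_K\|^2 v^*(m, \delta) = 0. \qquad (7.3)$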


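By Lemma 7.2, it follows that there is a unique positive solution $Q_*$ of (7.3) and, thus, a
unique positive solution $R_*$ of $A'(R) + \varepsilon'(R) = 0$. This solution is the only
minimum of $E$ since $E(R) \to \infty$ when $R \to 0$ and when $R \to \infty$.
We finally prove Part (iii). Note that by Lemma 7.2, the solution of the equation induced by
(7.2) satisfies
$v^*(m, \delta) \le \max\Big\{ \frac{600 \ln(1/\delta)}{m}, \Big( \frac{600C}{m} \Big( \frac{12}{\|I_K\|} \Big)^d \Big)^{1/(d+1)} \Big\}.$
Therefore, $v^*(m, \delta) \to 0$ when $m \to \infty$. Also, since $1/R_*$ is a root of (7.3),
Lemma 7.2 applies again to yield
$\frac{1}{R_*} \le \max\Big\{ \Big( \frac{4(M_\rho + \|f_\rho\|_\infty) \|I_K\| v^*(m, \delta)}{2^{2+\theta} \|g\|^{2+\theta} \theta} \Big)^{1/(\theta+1)}, \Big( \frac{4\|I_K\|^2 v^*(m, \delta)}{2^{2+\theta} \|g\|^{2+\theta} \theta} \Big)^{1/(\theta+2)} \Big\},$
from which it follows that $R_* \to \infty$ when $m \to \infty$. Note that this implies that
$\lim_{m \to \infty} A(R_*) \le \lim_{m \to \infty} 2^{2+\theta} \|g\|^{2+\theta} R_*^{-\theta} = 0.$
Finally, since $Q_*$ is a solution of equation (7.3),
$\big(2^{2+\theta} \|g\|^{2+\theta} \theta\big) \frac{Q_*^{\theta+2}}{v^*(m, \delta)} - 2(M_\rho + \|f_\rho\|_\infty) \|I_K\| Q_* - 2\|I_K\|^2 = 0,$
and therefore $v^*(m, \delta) R_*^2 = v^*(m, \delta)/Q_*^2 \to 0$ when $m \to \infty$, and, by (7.1),
$\lim_{m \to \infty} \varepsilon(R_*) = \lim_{m \to \infty} M^2 v^*(m, \delta) = \lim_{m \to \infty} (\|I_K\| R_* + M_\rho + \|f_\rho\|_\infty)^2 v^*(m, \delta) = 0.$
This finishes the proof of the theorem.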

A concrete example of bias-variance


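Let $R \ge 1$ in the proof of Theorem 7.1. Then
$M \le (\|I_K\| + M_\rho + \|f_\rho\|_\infty) R$ and we may take
$\varepsilon(R) = (\|I_K\| + M_\rho + \|f_\rho\|_\infty)^2 R^2 v^*(m, \delta)$
as an upper bound for the sample error with confidence $1 - \delta$. Hence, under conditions (i)
and (ii), we may choose
$E(R) = (\|I_K\| + M_\rho + \|f_\rho\|_\infty)^2 R^2 v^*(m, \delta) + 2^{2+\theta} \|g\|^{2+\theta} R^{-\theta}.$
With this choice,
$R_* = \Big( \frac{\theta}{2} \Big)^{1/(2+\theta)} 2\|g\| \, (\|I_K\| + M_\rho + \|f_\rho\|_\infty)^{-2/(2+\theta)} \, (v^*(m, \delta))^{-1/(2+\theta)}$
tends to infinity as $m$ does so and
$E(R_*) = \Big( 1 + \frac{2}{\theta} \Big) \Big( \frac{\theta}{2} \Big)^{2/(2+\theta)} 4\|g\|^2 \, (\|I_K\| + M_\rho + \|f_\rho\|_\infty)^{2\theta/(2+\theta)} \, (v^*(m, \delta))^{\theta/(2+\theta)} = O\big( m^{-\theta/((2+\theta)(1+2n/s))} \big)$
as $m \to \infty$.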
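The closed form for $R_*$ can be sanity-checked numerically. In the sketch below (an illustration, not part of the original text) the symbols `B`, `G`, `theta`, and `v` are stand-ins for $\|I_K\| + M_\rho + \|f_\rho\|_\infty$, $\|g\|$, $\theta$, and $v^*(m, \delta)$; the numerical values are arbitrary assumptions, chosen so that $R_* \ge 1$.

```python
def error_bound(R, B, G, theta, v):
    """E(R) = B^2 R^2 v + 2^(2+theta) G^(2+theta) R^(-theta)."""
    return B ** 2 * R ** 2 * v + 2.0 ** (2 + theta) * G ** (2 + theta) * R ** (-theta)

def optimal_radius(B, G, theta, v):
    """R_* = (theta/2)^(1/(2+theta)) * 2G * B^(-2/(2+theta)) * v^(-1/(2+theta))."""
    p = 1.0 / (2.0 + theta)
    return (theta / 2.0) ** p * 2.0 * G * B ** (-2.0 * p) * v ** (-p)

def optimal_error(B, G, theta, v):
    """E(R_*) = (1 + 2/theta) (theta/2)^(2/(2+theta)) 4 G^2 B^(2 theta/(2+theta)) v^(theta/(2+theta))."""
    p = 1.0 / (2.0 + theta)
    return ((1.0 + 2.0 / theta) * (theta / 2.0) ** (2.0 * p) * 4.0 * G ** 2
            * B ** (2.0 * theta * p) * v ** (theta * p))

if __name__ == "__main__":
    B, G, theta, v = 3.0, 1.0, 1.0, 1e-4   # hypothetical values
    R_star = optimal_radius(B, G, theta, v)
    print("R_* =", R_star)
    print("E(R_*) from the formula:", optimal_error(B, G, theta, v))
    print("E evaluated at R_*:     ", error_bound(R_star, B, G, theta, v))
```

Shrinking `v` (which corresponds to increasing the sample size $m$) makes `R_star` grow and the optimal bound shrink, as stated in Part (iii) of Theorem 7.1.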

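Example 7.4 Let $K$ be the spline kernel on $X = [-1,1]$ given in Example 4.19. If $\rho_X$ is
the Lebesgue measure and $\|f_\rho(x + t) - f_\rho(x)\|_{L^2([-1, 1-t])} = O(t^\theta)$ for some
$\theta > 0$, then $\mathcal{A}(f_\rho, R) = O(R^{-\theta})$. Take $s = 1/2$ and $n = 1$, so that
$1 + 2n/s = 5$. Then we have
$E(R_*) = O\big( m^{-\theta/(5(2+\theta))} \big).$
When $\theta$ is sufficiently large, $\|f_z - f_\rho\|^2_{L^2} = O\big( m^{-(1/5)+\varepsilon} \big)$
for an arbitrarily small $\varepsilon$.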


References and additional remarks

In this chapter we have considered a form of the bias-variance problem that optimizes the
parameter R, fixing all the others. One can consider other forms of the bias-variance problem
by optimizing other parameters. For instance, in Example 2.24, one can consider the degree
of smoothness of the kernel K. The smoother K is, the smaller HK is. Therefore, the sample
error decreases and the approximation error increases with a parameter reflecting this
We have already discussed the bias-variance problem in Section 1.5. Further ideas on this
problem can be found in Chapter 9 of [18] and in [95].
Bounds for the roots of real and complex polynomials such as those in Lemma 7.2
are a standard theme in algebra going back to Gauss. A reference for several such
bounds is [91]. Theorem 7.1 was originally proved in [39].

Least squares


We now abandon the setting of a compact hypothesis space adopted thus far and change the
perspective slightly. We will consider as a hypothesis space an RKHS $\mathcal{H}_K$ but we will
add a penalization term in the error to avoid overfitting, as in the setting of compact
hypothesis spaces.
In what follows, we consider as a hypothesis space $\mathcal{H} = \mathcal{H}_K$ (that is,
$\mathcal{H}$ is a whole linear space) and the regularized error $\mathcal{E}_\gamma$ defined by
$\mathcal{E}_\gamma(f) = \int_Z (f(x) - y)^2 \, d\rho + \gamma \|f\|_K^2$
for a fixed $\gamma > 0$. For a sample $z$, the regularized empirical error
$\mathcal{E}_{z,\gamma}$ is defined by
$\mathcal{E}_{z,\gamma}(f) = \frac{1}{m} \sum_{i=1}^{m} (y_i - f(x_i))^2 + \gamma \|f\|_K^2.$
One can consider a target function $f_\gamma$ minimizing $\mathcal{E}_\gamma(f)$ over
$\mathcal{H}_K$ and an empirical target $f_{z,\gamma}$ minimizing $\mathcal{E}_{z,\gamma}$ over
$\mathcal{H}_K$. We prove in Section 8.2 the existence and uniqueness of these target and
empirical target functions. One advantage of this new approach, which becomes apparent from the
results in this section, is that the empirical target function can be given an explicit form,
readily computable, in terms of the sample $z$, the parameter $\gamma$, and the kernel $K$.
Our discussion of Sections 1.4 and 1.5 remains valid in this context and the following
questions concerning $f_{z,\gamma}$ require an answer: Given $\gamma > 0$, how large is the excess
generalization error $\mathcal{E}(f_{z,\gamma}) - \mathcal{E}(f_\rho)$? Which value of $\gamma$
minimizes the excess generalization error? The main result of this chapter provides some answer
to these questions.
Theorem 8.1 Assume that $K$ satisfies $\log \mathcal{N}(B_1, \eta) \le C_0 (1/\eta)^{s^*}$ for some
$s^* > 0$, and $\rho$ satisfies $f_\rho \in \mathrm{Range}(L_K^{\theta/2})$ for some
$0 < \theta \le 1$. Take $\gamma_* = m^{-\zeta}$ with $\zeta < 1/(1 + s^*)$. Then, for every
$0 < \delta < 1$ and $m \ge m_\delta$, with confidence $1 - \delta$,
$\int_X \big( f_{z,\gamma_*}(x) - f_\rho(x) \big)^2 \, d\rho_X \le C_0' \log(2/\delta)\, m^{-\theta\zeta}$
holds. Here $C_0'$ is a constant depending only on $s^*$, $\zeta$, $C_K$, $M$, $C_0$, and
$\|L_K^{-\theta/2} f_\rho\|$, and $m_\delta$ depends also on $\delta$. We may take
$m_\delta := \max\big\{ (108/C_0)^{1/s^*} (\log(2/\delta))^{1+1/s^*}, \ (1/(2c))^{2/(\zeta - 1/(1+s^*))} \big\},$
where $c = (2C_K + 5)(108 C_0)^{1/(1+s^*)}$.
At the end of this chapter, in Section 8.6, we show that the regularization approach just
introduced and the minimization in compact hypothesis spaces considered thus far are closely
related.
The parameter $\gamma$ is said to be the regularization parameter. The whole approach
outlined above is called a regularization scheme.
Note that $\gamma_*$ can be computed from knowledge of $m$ and $s^*$ only. No information on
$f_\rho$ is required. The next example shows a simple situation where Theorem 8.1 applies and
yields bounds on the generalization error from a simple assumption on $f_\rho$.
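Example 8.2 Let $K$ be the spline kernel on $X = [-1,1]$ given in Example 4.19. If $\rho_X$ is
the Lebesgue measure and $\|f_\rho(x + t) - f_\rho(x)\|_{L^2([-1, 1-t])} = O(t^\theta)$ for some
$0 < \theta \le 1$, then, by the conclusion of Example 4.19 and Theorem 4.1,
$f_\rho \in \mathrm{Range}(L_K^{(\theta - \varepsilon)/2})$ for any $\varepsilon > 0$.
Theorem 5.8 also tells us that $\log \mathcal{N}(B_1, \eta) \le C_0 (1/\eta)^2$. So, we may take
$s^* = 2$. Choose $\gamma_* = m^{-\zeta}$ with $\zeta < 1/3$. Then Theorem 8.1, applied with
$\theta - \varepsilon$ in place of $\theta$, yields
$\mathcal{E}(f_{z,\gamma_*}) - \mathcal{E}(f_\rho) = \|f_{z,\gamma_*} - f_\rho\|^2_{L^2} = O\big( \log(2/\delta)\, m^{-(\theta - \varepsilon)\zeta} \big)$
with confidence $1 - \delta$.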

Bounds for the regularized error


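Let $X$, $K$, $f_\gamma$, and $f_{z,\gamma} = f_z$ be as above. Assume, for the time being, that
$f_\gamma$ and $f_{z,\gamma}$ exist.
Theorem 8.3 Let $f_\gamma \in \mathcal{H}_K$ and $f_z$ be as above. Then
$\mathcal{E}(f_z) - \mathcal{E}(f_\rho) \le \mathcal{E}(f_z) - \mathcal{E}(f_\rho) + \gamma \|f_z\|_K^2,$
which can be bounded by
$\big\{ \mathcal{E}(f_\gamma) - \mathcal{E}(f_\rho) + \gamma \|f_\gamma\|_K^2 \big\} + \big\{ \mathcal{E}(f_z) - \mathcal{E}_z(f_z) + \mathcal{E}_z(f_\gamma) - \mathcal{E}(f_\gamma) \big\}. \qquad (8.1)$

Proof. Write $\mathcal{E}(f_z) - \mathcal{E}(f_\rho) + \gamma \|f_z\|_K^2$ as
$\big\{ \mathcal{E}(f_z) - \mathcal{E}_z(f_z) \big\} + \big\{ (\mathcal{E}_z(f_z) + \gamma \|f_z\|_K^2) - (\mathcal{E}_z(f_\gamma) + \gamma \|f_\gamma\|_K^2) \big\}$
$+ \big\{ \mathcal{E}_z(f_\gamma) - \mathcal{E}(f_\gamma) \big\} + \big\{ \mathcal{E}(f_\gamma) - \mathcal{E}(f_\rho) + \gamma \|f_\gamma\|_K^2 \big\}.$

The definition of $f_z$ implies that the second term is at most zero. Hence
$\mathcal{E}(f_z) - \mathcal{E}(f_\rho) + \gamma \|f_z\|_K^2$ is bounded by (8.1).

The first term in (8.1) is the regularized error of $f_\gamma$. We denote it by
$\mathcal{D}(\gamma)$; that is,
$\mathcal{D}(\gamma) := \mathcal{E}(f_\gamma) - \mathcal{E}(f_\rho) + \gamma \|f_\gamma\|_K^2 = \inf_{f \in \mathcal{H}_K} \big\{ \mathcal{E}(f) - \mathcal{E}(f_\rho) + \gamma \|f\|_K^2 \big\}.$
Note that by Proposition 1.8,
$\mathcal{D}(\gamma) = \|f_\gamma - f_\rho\|_\rho^2 + \gamma \|f_\gamma\|_K^2 \ge \|f_\gamma - f_\rho\|_\rho^2.$
We call the second term in (8.1) the sample error (this use of the expression differs
slightly from the one in Section 1.4).
In this section we give bounds for the regularized error. The bounds (Proposition 8.5
below) easily follow from the next general result.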
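Theorem 8.4 Let $H$ be a Hilbert space and $A$ a self-adjoint, positive compact operator
on $H$. Let $s > 0$ and $\gamma > 0$. Then
(i) For all $a \in H$, the minimizer $\hat{b}$ of the optimization problem
$\min_{b \in H} \big( \|b - a\|^2 + \gamma \|A^{-s} b\|^2 \big)$ exists and is given by
$\hat{b} = (A^{2s} + \gamma\, \mathrm{Id})^{-1} A^{2s} a.$
(ii) For $0 < \theta \le 2s$,
$\|\hat{b} - a\| \le \gamma^{\theta/(2s)} \|A^{-\theta} a\|,$
where we define $\|A^{-\theta} a\| = \infty$ if $a \notin \mathrm{Range}(A^\theta)$.
(iii) For $0 < \theta \le s$,
$\min_{b \in H} \big( \|b - a\|^2 + \gamma \|A^{-s} b\|^2 \big) \le \gamma^{\theta/s} \|A^{-\theta} a\|^2.$
(iv) For $s < \theta \le 3s$,
$\|A^{-s}(\hat{b} - a)\| \le \gamma^{\theta/(2s) - 1/2} \|A^{-\theta} a\|.$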
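Proof. First note that replacing $A$ by $A^s$ and $\theta/s$ by $\theta$ we can reduce the problem
to the case $s = 1$, where $A^{-\theta} a = [(A^s)^2]^{-\theta/(2s)} a$.
(i) Consider
$\varphi(b) = \|b - a\|^2 + \gamma \|A^{-1} b\|^2.$
If a point $\hat{b}$ minimizes $\varphi$, then it must be a zero of the derivative $D\varphi$
whose value at $b \in \mathrm{Range}(A)$ satisfies
$\varphi(b + \varepsilon f) - \varphi(b) = \langle D\varphi(b), \varepsilon f \rangle + o(\varepsilon)$
for $f \in \mathrm{Range}(A)$. But
$\varphi(b + \varepsilon f) - \varphi(b) = 2\langle b - a, \varepsilon f \rangle + 2\gamma \langle A^{-2} b, \varepsilon f \rangle + \varepsilon^2 \|f\|^2 + \varepsilon^2 \gamma \|A^{-1} f\|^2.$
So $\hat{b}$ satisfies $(\mathrm{Id} + \gamma A^{-2}) \hat{b} = a$, which implies
$\hat{b} = (\mathrm{Id} + \gamma A^{-2})^{-1} a = (A^2 + \gamma\, \mathrm{Id})^{-1} A^2 a$. Note that
the operator $\mathrm{Id} + \gamma A^{-2}$ is invertible since it is the sum of the identity and a
positive (but maybe unbounded) operator.
We use the method from Chapter 4 to prove the remaining statements. If
$\lambda_1 \ge \lambda_2 \ge \ldots$ denote the eigenvalues of $A^2$ corresponding to normalized
eigenvectors $\{\phi_k\}$, then
$\hat{b} = \sum_{k \ge 1} \frac{\lambda_k}{\lambda_k + \gamma} a_k \phi_k,$
where $a = \sum_{k \ge 1} a_k \phi_k$. It follows that
$\hat{b} - a = \sum_{k \ge 1} \frac{-\gamma}{\lambda_k + \gamma} a_k \phi_k.$
Assume $\|A^{-\theta} a\| = \big\{ \sum_k a_k^2 / \lambda_k^\theta \big\}^{1/2} < \infty$.
(ii) For $0 < \theta \le 2$, we have
$\|\hat{b} - a\|^2 = \sum_{k \ge 1} \Big( \frac{\gamma}{\lambda_k + \gamma} \Big)^2 a_k^2 = \gamma^\theta \sum_{k \ge 1} \Big( \frac{\gamma}{\lambda_k + \gamma} \Big)^{2-\theta} \Big( \frac{\lambda_k}{\lambda_k + \gamma} \Big)^{\theta} \frac{a_k^2}{\lambda_k^\theta} \le \gamma^\theta \|A^{-\theta} a\|^2.$
(iii) For $0 < \theta \le 1$,
$A^{-1} \hat{b} = \sum_{k \ge 1} \frac{\sqrt{\lambda_k}}{\lambda_k + \gamma} a_k \phi_k$. Hence
$\|\hat{b} - a\|^2 + \gamma \|A^{-1} \hat{b}\|^2 = \sum_{k \ge 1} \Big( \frac{\gamma^2}{(\lambda_k + \gamma)^2} + \frac{\gamma \lambda_k}{(\lambda_k + \gamma)^2} \Big) a_k^2 = \sum_{k \ge 1} \frac{\gamma}{\lambda_k + \gamma} a_k^2,$
which is bounded by
$\gamma^\theta \sum_{k \ge 1} \Big( \frac{\gamma}{\lambda_k + \gamma} \Big)^{1-\theta} \Big( \frac{\lambda_k}{\lambda_k + \gamma} \Big)^{\theta} \frac{a_k^2}{\lambda_k^\theta} \le \gamma^\theta \|A^{-\theta} a\|^2.$
(iv) When $1 < \theta \le 3$, we find that
$A^{-1}(\hat{b} - a) = \sum_{k \ge 1} \frac{-\gamma}{(\lambda_k + \gamma) \sqrt{\lambda_k}} a_k \phi_k.$
It follows that
$\|A^{-1}(\hat{b} - a)\|^2 = \sum_{k \ge 1} \frac{\gamma^2}{(\lambda_k + \gamma)^2 \lambda_k} a_k^2 = \gamma^{\theta - 1} \sum_{k \ge 1} \Big( \frac{\gamma}{\lambda_k + \gamma} \Big)^{3-\theta} \Big( \frac{\lambda_k}{\lambda_k + \gamma} \Big)^{\theta - 1} \frac{a_k^2}{\lambda_k^\theta} \le \gamma^{\theta - 1} \|A^{-\theta} a\|^2.$
Thus all the estimates have been verified.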
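The statements of Theorem 8.4 can be tested in a finite-dimensional toy setting. The sketch below is only an illustration under assumptions made here: $H = \mathbb{R}^3$, $A$ is a diagonal matrix with positive entries `mu` (so every power of $A$ acts coordinatewise), and the function names and numerical values are hypothetical.

```python
import math

def minimizer(mu, a, s, gamma):
    """b_hat = (A^{2s} + gamma Id)^{-1} A^{2s} a for A = diag(mu)."""
    return [m ** (2 * s) / (m ** (2 * s) + gamma) * ak for m, ak in zip(mu, a)]

def objective(b, mu, a, s, gamma):
    """||b - a||^2 + gamma ||A^{-s} b||^2 for A = diag(mu)."""
    fit = sum((bk - ak) ** 2 for bk, ak in zip(b, a))
    penalty = sum((bk / m ** s) ** 2 for bk, m in zip(b, mu))
    return fit + gamma * penalty

def weighted_norm(v, mu, power):
    """||A^{power} v|| for A = diag(mu); negative powers are allowed."""
    return math.sqrt(sum((m ** power * vk) ** 2 for m, vk in zip(mu, v)))

if __name__ == "__main__":
    mu, a, s, gamma = [1.0, 0.5, 0.1], [1.0, -2.0, 0.5], 1.0, 0.01
    b_hat = minimizer(mu, a, s, gamma)
    diff = [bk - ak for bk, ak in zip(b_hat, a)]
    # (ii) with theta = 1.5
    print(weighted_norm(diff, mu, 0.0),
          gamma ** (1.5 / (2 * s)) * weighted_norm(a, mu, -1.5))
    # (iii) with theta = 0.7
    print(objective(b_hat, mu, a, s, gamma),
          gamma ** (0.7 / s) * weighted_norm(a, mu, -0.7) ** 2)
    # (iv) with theta = 2.5
    print(weighted_norm(diff, mu, -s),
          gamma ** (2.5 / (2 * s) - 0.5) * weighted_norm(a, mu, -2.5))
```

In each printed pair the left value is the quantity being bounded and the right value is the bound from the theorem.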

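Bounds for the regularized error $\mathcal{D}(\gamma)$ follow from Theorem 8.4.

Proposition 8.5 Let $X \subset \mathbb{R}^n$ be a compact domain and $K$ a Mercer kernel such that
$f_\rho \in \mathrm{Range}(L_K^{\theta/2})$ for some $0 < \theta \le 2$. Then
$\|f_\gamma - f_\rho\|_\rho^2 = \mathcal{E}(f_\gamma) - \mathcal{E}(f_\rho) \le \gamma^\theta \|L_K^{-\theta/2} f_\rho\|_\rho^2.$
For $0 < \theta \le 1$,
$\mathcal{D}(\gamma) = \mathcal{E}(f_\gamma) - \mathcal{E}(f_\rho) + \gamma \|f_\gamma\|_K^2 \le \gamma^\theta \|L_K^{-\theta/2} f_\rho\|_\rho^2.$
For $1 < \theta \le 3$,
$\|f_\gamma - f_\rho\|_K \le \gamma^{(\theta - 1)/2} \|L_K^{-\theta/2} f_\rho\|_\rho.$
Proof. Apply Theorem 8.4 with $H = L^2_{\rho_X}(X)$, $s = 1$, $A = L_K^{1/2}$, and $a = f_\rho$, and
use that $\|L_K^{-1/2} f\| = \|A^{-1} f\| = \|f\|_K$. We know that $f_\gamma$ is the minimizer of
$\min_{f \in \mathcal{H}_K} \big( \|f - f_\rho\|^2 + \gamma \|f\|_K^2 \big) = \min_{f \in L^2_{\rho_X}(X)} \big( \|f - f_\rho\|^2 + \gamma \|f\|_K^2 \big)$
since $\|f\|_K = \infty$ for $f \notin \mathcal{H}_K$. Our conclusion follows from Theorem 8.4 and
Proposition 1.8.


On the existence of target functions

Let $X$, $K$, $f_\gamma$, and $f_{z,\gamma} = f_z$ be as above. Since the hypothesis space
$\mathcal{H}_K$ is not compact, the existence of $f_\gamma$ and $f_{z,\gamma}$ is not obvious.
The goal of this section is to prove that both $f_\gamma$ and $f_{z,\gamma}$ exist and are
unique. In addition, we show that $f_{z,\gamma}$ is easily computable from $\gamma$, the sample
$z$, and the kernel $K$ on the compact metric space $X$.
Proposition 8.6 Let $\nu = \rho_X$ in the definition of the integral operator $L_K$. For all
$\gamma > 0$ the function $f_\gamma = (L_K + \gamma\, \mathrm{Id})^{-1} L_K f_\rho$ is the unique
minimizer of $\mathcal{E}_\gamma$ over $\mathcal{H}_K$.
Proof. Apply Theorem 8.4 with $H = L^2_{\rho_X}(X)$, $s = 1$, $A = L_K^{1/2}$, and $a = f_\rho$.
Since, for all $f \in \mathcal{H}_K$, $\|f\|_K = \|L_K^{-1/2} f\|_{L^2_{\rho_X}(X)}$, we have
$\|b - a\|^2 + \gamma \|A^{-s} b\|^2 = \|b - f_\rho\|^2_{L^2_{\rho_X}(X)} + \gamma \|b\|_K^2 = \mathcal{E}_\gamma(b) - \mathcal{E}(f_\rho).$
Thus, the minimizer $f_\gamma$ is the minimizer $\hat{b}$ in Theorem 8.4, and the proposition
follows.
Proposition 8.7 Let $z \in Z^m$ and $\gamma > 0$. The empirical target function can be expressed as
$f_z(x) = \sum_{i=1}^{m} a_i K(x, x_i),$
where $a = (a_1, \ldots, a_m)$ is the unique solution of the well-posed linear system in
$\mathbb{R}^m$
$(\gamma m\, \mathrm{Id} + K[\mathbf{x}])\, a = y.$
Here, we recall that $K[\mathbf{x}]$ is the $m \times m$ matrix whose $(i,j)$ entry is
$K(x_i, x_j)$, $\mathbf{x} = (x_1, \ldots, x_m) \in X^m$, and $y = (y_1, \ldots, y_m) \in Y^m$ such
that $z = ((x_1, y_1), \ldots, (x_m, y_m))$.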
Proof. Let $H(f) = \frac{1}{m}\sum_{i=1}^m (y_i - f(x_i))^2 + \gamma \|f\|_K^2$. Take $\nu$ to be a Borel, nondegenerate measure on $X$ and $L_K$ to be the corresponding integral operator.
Let $\{\phi_k\}_{k \ge 1}$ be an orthonormal basis of $L^2_\nu(X)$ consisting of eigenfunctions of $L_K$, and let $\{\lambda_k\}_{k \ge 1}$ be their corresponding eigenvalues. By Theorem 4.12, we can then write, for any $f \in \mathcal{H}_K$, $f = \sum_{\lambda_k > 0} c_k \phi_k$ with $\|f\|_K^2 = \sum_{\lambda_k > 0} c_k^2/\lambda_k$.
For every $k$ with $\lambda_k > 0$,
$$\frac{\partial H}{\partial c_k} = -\frac{2}{m}\sum_{i=1}^m (y_i - f(x_i))\,\phi_k(x_i) + 2\gamma\,\frac{c_k}{\lambda_k}.$$
If $f$ is a minimum of $H$, then, for each $k$ with $\lambda_k > 0$, we must have $\partial H/\partial c_k = 0$ or, solving for $c_k$,
$$c_k = \lambda_k \sum_{i=1}^m a_i \phi_k(x_i),$$
where $a_i = (y_i - f(x_i))/(\gamma m)$. Thus,
$$f(x) = \sum_{\lambda_k > 0} c_k \phi_k(x) = \sum_{\lambda_k > 0} \lambda_k \sum_{i=1}^m a_i \phi_k(x_i)\phi_k(x) = \sum_{i=1}^m a_i \sum_{\lambda_k > 0} \lambda_k \phi_k(x_i)\phi_k(x) = \sum_{i=1}^m a_i K(x_i, x),$$
where we have applied Theorem 4.10 in the last equality. Replacing $f(x_i)$ in the definition of $a_i$ above we obtain
$$a_i = \frac{y_i - \sum_{j=1}^m a_j K(x_j, x_i)}{\gamma m}.$$
Multiplying both sides by $\gamma m$ and writing the result in matrix form we obtain $(\gamma m\,\mathrm{Id} + K[x])a = y$, and this system is well posed since $K[x]$ is positive semidefinite and the result of adding a positive semidefinite matrix and a positive multiple of the identity is positive definite.
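The linear system of Proposition 8.7 can be solved directly for small samples. The following is a minimal sketch, not the authors' implementation: the Gaussian kernel, the sample values, and all function names (`gaussian_kernel`, `solve_linear_system`, `fit_regularized_least_squares`, `evaluate`) are illustrative assumptions. It builds $K[x]$, solves $(\gamma m\,\mathrm{Id} + K[x])a = y$ by Gaussian elimination, and evaluates $f_z(x) = \sum_i a_i K(x, x_i)$.

```python
import math

def gaussian_kernel(s, t, width=1.0):
    """Gaussian kernel on the real line (an illustrative Mercer kernel)."""
    return math.exp(-((s - t) ** 2) / (2.0 * width ** 2))

def solve_linear_system(A, b):
    """Solve A a = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are left untouched.
    M = [list(A[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    a = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(M[i][k] * a[k] for k in range(i + 1, n))
        a[i] = (M[i][n] - tail) / M[i][i]
    return a

def fit_regularized_least_squares(xs, ys, gamma, kernel=gaussian_kernel):
    """Return the coefficients a solving (gamma*m*Id + K[x]) a = y."""
    m = len(xs)
    gram = [[kernel(xs[i], xs[j]) for j in range(m)] for i in range(m)]
    system = [[gram[i][j] + (gamma * m if i == j else 0.0) for j in range(m)]
              for i in range(m)]
    return solve_linear_system(system, ys)

def evaluate(xs, coeffs, x, kernel=gaussian_kernel):
    """Evaluate f_z(x) = sum_i a_i K(x, x_i)."""
    return sum(a * kernel(x, xi) for a, xi in zip(coeffs, xs))

# Hypothetical sample z = ((x_i, y_i)) and regularization parameter.
xs = [0.0, 0.5, 1.0, 1.5]
ys = [0.0, 0.4, 0.9, 1.0]
gamma = 0.1
coeffs = fit_regularized_least_squares(xs, ys, gamma)
```

A quick sanity check is the identity used in the proof, $a_i = (y_i - f_z(x_i))/(\gamma m)$, which the computed coefficients should satisfy up to rounding.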


A first estimate for the excess generalization error

In this section we bound the confidence for the sample error to be small enough. The main result is Theorem 8.10.
In what follows we assume that
$$M = \inf\bigl\{\bar{M} \ge 0 \mid \{(x, y) \in Z : |y| \ge \bar{M}\} \text{ has measure zero}\bigr\}$$
is finite. Note that $|y| \le M$ and $|f_\rho(x)| \le M$ almost surely.
For $R > 0$ let $B_R = \{f \in \mathcal{H}_K : \|f\|_K \le R\}$. Recall that for each $f \in \mathcal{H}_K$, $\|f\|_\infty \le C_K \|f\|_K$, where $C_K = \sup_{x \in X}\sqrt{K(x,x)}$.
For the sample error estimates, we require the confidence $\mathcal{N}(B_1, \eta)\exp\{-m\eta/54\}$ to be at most $\delta$. So we define the following quantity to realize this confidence.
Definition 8.8 Let $g = g_{K,m} : \mathbb{R}_+ \to \mathbb{R}$ be the function given by
$$g(\eta) = \log \mathcal{N}(B_1, \eta) - \frac{m\eta}{54}.$$
The function $g$ is strictly decreasing in $(0, +\infty)$ with $g(0) = +\infty$ and $g(+\infty) = -\infty$. Also, $g(1) = -m/54$. Moreover, $\lim_{\eta \to \varepsilon^+} \mathcal{N}(B_1, \eta) = \mathcal{N}(B_1, \varepsilon)$ for all $\varepsilon > 0$. Therefore, for $0 < \delta < 1$, the inequality
$$g(\eta) \le \log \delta$$
has a unique minimal solution $v^*(m, \delta)$. Moreover,
$$\lim_{m \to \infty} v^*(m, \delta) = 0.$$
More quantitatively, when $K$ is $C^s$ on $X \subset \mathbb{R}^n$, $\log \mathcal{N}(B_1, \eta) \le C_0 (1/\eta)^{s^*}$ with $s^* = 2n/s$ (cf. Theorem 5.1(i)). In this case the following decay holds.
Lemma 8.9 If the Mercer kernel $K$ satisfies $\log \mathcal{N}(B_1, \eta) \le C_0 (1/\eta)^{s^*}$ for some $s^* > 0$, then
$$v^*(m, \delta) \le \max\left\{\frac{108 \log(1/\delta)}{m},\ \left(\frac{108 C_0}{m}\right)^{1/(1+s^*)}\right\}.$$
Proof. Observe that $g(\eta) \le h(\eta) := C_0 (1/\eta)^{s^*} - m\eta/54$. Since $h$ is also strictly decreasing and continuous on $(0, +\infty)$, we can take $\Delta$ to be the unique positive solution of the equation $h(t) = \log \delta$. We know that $v^*(m, \delta) \le \Delta$. The equation $h(t) = \log \delta$ can be expressed as
$$t^{1+s^*} - \frac{54 \log(1/\delta)}{m}\, t^{s^*} - \frac{54 C_0}{m} = 0.$$
Then Lemma 7.2 with $d = 2$ yields $\Delta \le \max\{108 \log(1/\delta)/m,\ (108 C_0/m)^{1/(1+s^*)}\}$. This verifies the bound for $v^*(m, \delta)$.
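The bound in Lemma 8.9 can be checked numerically. The sketch below is an illustration under assumed values of $m$, $\delta$, $C_0$, and $s^*$ (none of which come from the text): it finds the root $\Delta$ of $h(t) = \log\delta$ by bisection, using only that $h$ is strictly decreasing, and compares it with the maximum appearing in the lemma. The function names are hypothetical.

```python
import math

def h(t, m, c0, s_star):
    """h(t) = C0 * (1/t)^{s*} - m*t/54, the upper bound for g in the proof."""
    return c0 * t ** (-s_star) - m * t / 54.0

def solve_h_equals_log_delta(m, delta, c0, s_star):
    """Bisection for the unique t > 0 with h(t) = log(delta)."""
    target = math.log(delta)
    lo, hi = 1e-12, 1.0
    # h is strictly decreasing, so enlarge hi until h(hi) falls below the target.
    while h(hi, m, c0, s_star) > target:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if h(mid, m, c0, s_star) > target:
            lo = mid
        else:
            hi = mid
    return hi

def lemma_bound(m, delta, c0, s_star):
    """max{108 log(1/delta)/m, (108 C0/m)^{1/(1+s*)}} from Lemma 8.9."""
    return max(108.0 * math.log(1.0 / delta) / m,
               (108.0 * c0 / m) ** (1.0 / (1.0 + s_star)))

# Assumed illustrative parameters.
m, delta, c0, s_star = 1000, 0.05, 2.0, 0.5
root = solve_h_equals_log_delta(m, delta, c0, s_star)
bound = lemma_bound(m, delta, c0, s_star)
```

For these values the computed root lies below the bound, as the lemma predicts; since $v^*(m,\delta) \le \Delta$, the same bound then holds for $v^*$.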

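Theorem 8.10 For all $\gamma \in (0, 1]$ and $0 < \delta < 1$, with confidence $1 - \delta$,
$$\mathcal{E}(f_z) - \mathcal{E}(f_\rho) \le \frac{2(C_K + 3)^2 M^2\, v^*(m, \delta/2)}{\gamma} + \left(\frac{8 C_K^2 \log(2/\delta)}{m\gamma} + 6M + 4\right)\mathcal{D}(\gamma) + \frac{(48 M^2 + 6M)\log(2/\delta)}{m}$$
holds.
Theorem 8.10 will follow from some lemmas and propositions given in the remainder of this section. Before proceeding with these results, however, we note that from Theorem 8.10, a convergence property for the regularized scheme follows.
Corollary 8.11 Let $0 < \delta < 1$ be arbitrary. Take $\gamma = \gamma(m)$ to satisfy $\gamma(m) \to 0$, $\lim_{m \to \infty} m\gamma(m) \ge 1$, and $\gamma(m)/v^*(m, \delta/2) \to +\infty$. If $\mathcal{D}(\gamma) \to 0$, then, for any $\varepsilon > 0$, there is some $M_{\delta,\varepsilon} \in \mathbb{N}$ such that with confidence $1 - \delta$,
$$\mathcal{E}(f_z) - \mathcal{E}(f_\rho) \le \varepsilon, \qquad \forall m \ge M_{\delta,\varepsilon}$$
holds.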

As an example, for $C^s$ kernels on $X \subset \mathbb{R}^n$, the decay of $v^*(m, \delta)$ shown in Lemma 8.9 with $s^* = 2n/s$ yields the following convergence rate.
Corollary 8.12 Assume that $K$ satisfies $\log \mathcal{N}(B_1, \eta) \le C_0 (1/\eta)^{s^*}$ for some $s^* > 0$, and $\rho$ satisfies $f_\rho \in \mathrm{Range}(L_K^{\theta/2})$ for some $0 < \theta \le 1$. Then, for all $\gamma \in (0, 1]$ and all $0 < \delta < 1$, with confidence $1 - \delta$,
$$\int_X (f_z(x) - f_\rho(x))^2\, d\rho_X \le C_1 \left(\frac{\log(2/\delta)}{m\gamma} + \frac{1}{m^{1/(1+s^*)}\gamma} + \gamma^{\theta} + \frac{\log(2/\delta)}{m}\right)$$
holds, where $C_1$ is a constant depending only on $s^*$, $C_K$, $M$, $C_0$, and $\|L_K^{-\theta/2} f_\rho\|$. If $\gamma = m^{-1/((1+\theta)(1+s^*))}$, then the convergence rate is
$$\int_X (f_z(x) - f_\rho(x))^2\, d\rho_X \le 6 C_1 \log(2/\delta)\, m^{-\theta/((1+\theta)(1+s^*))}.$$
Proof. The proof is an easy consequence of Theorem 8.10, Lemma 8.9, and Proposition 8.5.

For $C^\infty$ kernels, $s^*$ can be arbitrarily small. Then the decay rate exhibited in Corollary 8.12 is $m^{-(1/2)+\varepsilon}$ for any $\varepsilon > 0$, achieved with $\theta = 1$. We improve Theorem 8.10 in the next section, where more satisfactory bounds (with decay rate $m^{-1+\varepsilon}$) are presented. The basic ideas of the proof are included in this section.
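The choice of $\gamma$ in Corollary 8.12 balances the capacity term $1/(m^{1/(1+s^*)}\gamma)$ against the approximation term $\gamma^\theta$. The short sketch below, with assumed values of $m$, $\theta$, and $s^*$ and hypothetical function names, evaluates both terms at $\gamma = m^{-1/((1+\theta)(1+s^*))}$ and computes the resulting exponent $\theta/((1+\theta)(1+s^*))$.

```python
def corollary_8_12_exponent(theta, s_star):
    """Exponent theta / ((1 + theta)(1 + s*)) in the rate of Corollary 8.12."""
    return theta / ((1.0 + theta) * (1.0 + s_star))

def balancing_gamma(m, theta, s_star):
    """gamma = m^{-1/((1+theta)(1+s*))}, which equalizes the two terms below."""
    return m ** (-1.0 / ((1.0 + theta) * (1.0 + s_star)))

# Assumed illustrative values: theta = 1 and a small s* (a very smooth kernel).
m, theta, s_star = 10_000, 1.0, 0.01
gamma = balancing_gamma(m, theta, s_star)
capacity_term = 1.0 / (m ** (1.0 / (1.0 + s_star)) * gamma)
approximation_term = gamma ** theta
exponent = corollary_8_12_exponent(theta, s_star)
```

With $\theta = 1$ the exponent stays just below $1/2$ however small $s^*$ is, which is the limitation that the refined analysis of the next section removes.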
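To move toward the proof of Theorem 8.10, we write the sample error as
$$\mathcal{E}(f_z) - \mathcal{E}_z(f_z) + \mathcal{E}_z(f_\gamma) - \mathcal{E}(f_\gamma) = \left\{E(\xi_1) - \frac{1}{m}\sum_{i=1}^m \xi_1(z_i)\right\} + \left\{\frac{1}{m}\sum_{i=1}^m \xi_2(z_i) - E(\xi_2)\right\}, \qquad (8.4)$$
where
$$\xi_1 := (f_z(x) - y)^2 - (f_\rho(x) - y)^2 \quad\text{and}\quad \xi_2 := (f_\gamma(x) - y)^2 - (f_\rho(x) - y)^2.$$
The second term on the right-hand side of (8.4) is about the random variable $\xi_2$ on $Z$. Since its mean $E(\xi_2) = \mathcal{E}(f_\gamma) - \mathcal{E}(f_\rho)$ is nonnegative, we may apply the Bernstein inequality to estimate this term. To do so, however, we need bounds for $\|f_\gamma\|_\infty$.
Lemma 8.13 For all $\gamma > 0$,
$$\|f_\gamma\|_K \le \sqrt{\mathcal{D}(\gamma)/\gamma} \quad\text{and}\quad \|f_\gamma\|_\infty \le C_K \sqrt{\mathcal{D}(\gamma)/\gamma}.$$
Proof. Since $f_\gamma$ is a minimizer of (8.2), we know that
$$\gamma \|f_\gamma\|_K^2 \le \mathcal{E}(f_\gamma) - \mathcal{E}(f_\rho) + \gamma \|f_\gamma\|_K^2 = \mathcal{D}(\gamma).$$
Thus, the first inequality holds. The second follows from $\|f_\gamma\|_\infty \le C_K \|f_\gamma\|_K$.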

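Proposition 8.14 For every $0 < \delta < 1$, with confidence at least $1 - \delta$,
$$\frac{1}{m}\sum_{i=1}^m \xi_2(z_i) - E(\xi_2) \le \left(\frac{4 C_K^2 \log(1/\delta)}{m\gamma} + 3M + 1\right)\mathcal{D}(\gamma) + \frac{(24 M^2 + 3M)\log(1/\delta)}{m}$$
holds.

Proof. From the definition of $\xi_2$, it follows that
$$\xi_2 = (f_\gamma(x) - f_\rho(x))\{(f_\gamma(x) - y) + (f_\rho(x) - y)\}.$$
Almost everywhere, since $|f_\rho(x)| \le M$, we have
$$|\xi_2| \le (\|f_\gamma\|_\infty + M)(\|f_\gamma\|_\infty + 3M) \le c := (\|f_\gamma\|_\infty + 3M)^2.$$
Hence $|\xi_2(z) - E(\xi_2)| \le B := 2c$. Moreover, we have
$$E(\xi_2^2) = E\bigl((f_\gamma(x) - f_\rho(x))^2\{(f_\gamma(x) - y) + (f_\rho(x) - y)\}^2\bigr) \le \|f_\gamma - f_\rho\|_\rho^2\,(\|f_\gamma\|_\infty + 3M)^2,$$
which implies that $\sigma^2(\xi_2) \le E(\xi_2^2) \le c\,\mathcal{D}(\gamma)$. Now we apply the one-side Bernstein inequality in Corollary 3.6 to $\xi_2$. It asserts that for any $t > 0$,
$$\frac{1}{m}\sum_{i=1}^m \xi_2(z_i) - E(\xi_2) \le t$$
with confidence at least
$$1 - \exp\left\{-\frac{m t^2}{2\bigl(\sigma^2(\xi_2) + \frac{1}{3} B t\bigr)}\right\} \ge 1 - \exp\left\{-\frac{m t^2}{2c\bigl(\mathcal{D}(\gamma) + \frac{2}{3} t\bigr)}\right\}.$$
Choose $t^*$ to be the unique positive solution of the quadratic equation
$$-\frac{m t^2}{2c\bigl(\mathcal{D}(\gamma) + \frac{2}{3} t\bigr)} = \log \delta.$$
Then, with confidence $1 - \delta$,
$$\frac{1}{m}\sum_{i=1}^m \xi_2(z_i) - E(\xi_2) \le t^*$$
holds. But
$$t^* = \left(\frac{2c}{3}\log(1/\delta) + \sqrt{\Bigl(\frac{2c}{3}\log(1/\delta)\Bigr)^2 + 2cm\log(1/\delta)\,\mathcal{D}(\gamma)}\right)\Big/ m \le \frac{4c\log(1/\delta)}{3m} + \sqrt{2c\log(1/\delta)\,\mathcal{D}(\gamma)/m}.$$
By Lemma 8.13, $c \le 2 C_K^2 \mathcal{D}(\gamma)/\gamma + 18 M^2$. It follows that
$$t^* \le \frac{8 C_K^2 \log(1/\delta)}{3m\gamma}\,\mathcal{D}(\gamma) + \frac{24 M^2 \log(1/\delta)}{m} + 2 C_K\sqrt{\frac{\log(1/\delta)}{m\gamma}}\,\mathcal{D}(\gamma) + 6M\sqrt{\frac{\log(1/\delta)\,\mathcal{D}(\gamma)}{m}}$$
and, therefore, using $2ab \le a^2 + b^2$ on the last two terms, that
$$t^* \le \left(\frac{4 C_K^2 \log(1/\delta)}{m\gamma} + 3M + 1\right)\mathcal{D}(\gamma) + \frac{(24 M^2 + 3M)\log(1/\delta)}{m}.$$
This implies the desired estimate.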
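The first term on the right-hand side of (8.4) is more difficult to deal with because $\xi_1$ involves the sample $z$ through $f_z$. We use a result from Chapter 3, Lemma 3.19, to bound this term by means of a covering number. For $R > 0$, define $\mathcal{F}_R$ to be the set of functions from $Z$ to $\mathbb{R}$
$$\mathcal{F}_R := \bigl\{(f(x) - y)^2 - (f_\rho(x) - y)^2 : f \in B_R\bigr\}.$$

Proposition 8.15 For all $\varepsilon > 0$ and $R \ge M$,
$$\mathop{\mathrm{Prob}}_{z \in Z^m}\left\{\sup_{f \in B_R} \frac{\mathcal{E}(f) - \mathcal{E}(f_\rho) - (\mathcal{E}_z(f) - \mathcal{E}_z(f_\rho))}{\sqrt{\mathcal{E}(f) - \mathcal{E}(f_\rho) + \varepsilon}} \le \sqrt{\varepsilon}\right\} \ge 1 - \mathcal{N}\left(B_1, \frac{\varepsilon}{(C_K + 3)^2 R^2}\right)\exp\left\{-\frac{m\varepsilon}{54 (C_K + 3)^2 R^2}\right\}.$$
Proof. Consider the set $\mathcal{F}_R$. Each function $g \in \mathcal{F}_R$ has the form $g(z) = (f(x) - y)^2 - (f_\rho(x) - y)^2$ with $f \in B_R$. Hence $E(g) = \mathcal{E}(f) - \mathcal{E}(f_\rho) = \|f - f_\rho\|_\rho^2 \ge 0$, $E_z(g) = \mathcal{E}_z(f) - \mathcal{E}_z(f_\rho)$, and
$$g(z) = (f(x) - f_\rho(x))\{(f(x) - y) + (f_\rho(x) - y)\}.$$
Since $\|f\|_\infty \le C_K \|f\|_K \le C_K R$ and $|f_\rho(x)| \le M$ almost everywhere, we find
$$|g(z)| \le (C_K R + M)(C_K R + 3M) \le c := (C_K R + 3M)^2.$$
So we have $|g(z) - E(g)| \le B := 2c$ almost everywhere.
In addition,
$$E(g^2) = E\bigl((f(x) - f_\rho(x))^2\{(f(x) - y) + (f_\rho(x) - y)\}^2\bigr) \le (C_K R + 3M)^2\,\|f - f_\rho\|_\rho^2.$$
Thus, $E(g^2) \le c\,E(g)$ for each $g \in \mathcal{F}_R$.
Applying Lemma 3.19 with $\alpha = 1/4$ to the function set $\mathcal{F}_R$, we deduce that
$$\sup_{f \in B_R} \frac{\mathcal{E}(f) - \mathcal{E}(f_\rho) - (\mathcal{E}_z(f) - \mathcal{E}_z(f_\rho))}{\sqrt{\mathcal{E}(f) - \mathcal{E}(f_\rho) + \varepsilon}} = \sup_{g \in \mathcal{F}_R} \frac{E(g) - \frac{1}{m}\sum_{i=1}^m g(z_i)}{\sqrt{E(g) + \varepsilon}} \le \sqrt{\varepsilon}$$
with confidence at least
$$1 - \mathcal{N}(\mathcal{F}_R, \varepsilon/4)\exp\left\{-\frac{m\varepsilon/16}{2c + \frac{2}{3}B}\right\} \ge 1 - \mathcal{N}(\mathcal{F}_R, \varepsilon/4)\exp\left\{-\frac{m\varepsilon}{54 (C_K + 3)^2 R^2}\right\}.$$
Here we have used the expressions for $c$, $B = 2c$, and the restriction $R \ge M$.
What is left is to bound the covering number $\mathcal{N}(\mathcal{F}_R, \varepsilon/4)$. To do so, we note that
$$\bigl|(f_1(x) - y)^2 - (f_2(x) - y)^2\bigr| \le \|f_1 - f_2\|_\infty\,\bigl|(f_1(x) - y) + (f_2(x) - y)\bigr|.$$
But $|y| \le M$ almost surely, and $\|f\|_\infty \le C_K \|f\|_K \le C_K R$ for each $f \in B_R$. Therefore, almost surely,
$$\bigl|(f_1(x) - y)^2 - (f_2(x) - y)^2\bigr| \le 2(M + C_K R)\,\|f_1 - f_2\|_\infty, \qquad \forall f_1, f_2 \in B_R.$$
Since an $\eta/(2(MR + C_K R^2))$-covering of $B_1$ yields an $\eta/(2(M + C_K R))$-covering of $B_R$, and vice versa, we see that for any $\eta > 0$, an $\eta/(2(MR + C_K R^2))$-covering of $B_1$ provides an $\eta$-covering of $\mathcal{F}_R$. That is,
$$\mathcal{N}(\mathcal{F}_R, \eta) \le \mathcal{N}\left(B_1, \frac{\eta}{2(MR + C_K R^2)}\right), \qquad \forall \eta > 0.$$
But $R \ge M$ and $8(1 + C_K) \le (C_K + 3)^2$. So our desired estimate follows.
Now we can derive the error bounds. For $R > 0$, denote
$$W(R) := \{z \in Z^m : \|f_z\|_K \le R\}.$$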
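Proposition 8.16 For all $0 < \delta < 1$ and $R \ge M$, there is a set $V_R \subset Z^m$ with $\rho(V_R) \le \delta$ such that for all $z \in W(R) \setminus V_R$, the regularized error $\mathcal{E}(f_z) - \mathcal{E}(f_\rho) + \gamma\|f_z\|_K^2$ is bounded by
$$2(C_K + 3)^2 R^2\, v^*(m, \delta/2) + \left(\frac{8 C_K^2 \log(2/\delta)}{m\gamma} + 6M + 4\right)\mathcal{D}(\gamma) + \frac{(48 M^2 + 6M)\log(2/\delta)}{m}.$$
Proof. Note that $\sqrt{\mathcal{E}(f) - \mathcal{E}(f_\rho) + \varepsilon}\,\sqrt{\varepsilon} \le \frac{1}{2}(\mathcal{E}(f) - \mathcal{E}(f_\rho)) + \varepsilon$. Using the quantity $v^*(m, \delta)$, Proposition 8.15 with $\varepsilon = (C_K + 3)^2 R^2\, v^*(m, \delta/2)$ tells us that there is a set $V_R^{(1)} \subset Z^m$ of measure at most $\delta/2$ such that
$$\mathcal{E}(f) - \mathcal{E}(f_\rho) - (\mathcal{E}_z(f) - \mathcal{E}_z(f_\rho)) \le \tfrac{1}{2}(\mathcal{E}(f) - \mathcal{E}(f_\rho)) + (C_K + 3)^2 R^2\, v^*(m, \delta/2), \qquad \forall f \in B_R,\ z \in Z^m \setminus V_R^{(1)}.$$
In particular, when $z \in W(R) \setminus V_R^{(1)}$, $f_z \in B_R$ and
$$E(\xi_1) - \frac{1}{m}\sum_{i=1}^m \xi_1(z_i) = \mathcal{E}(f_z) - \mathcal{E}(f_\rho) - (\mathcal{E}_z(f_z) - \mathcal{E}_z(f_\rho)) \le \tfrac{1}{2}(\mathcal{E}(f_z) - \mathcal{E}(f_\rho)) + (C_K + 3)^2 R^2\, v^*(m, \delta/2).$$
Now apply Proposition 8.14 with $\delta$ replaced by $\delta/2$. We can find another set $V_R^{(2)} \subset Z^m$ of measure at most $\delta/2$ such that for all $z \in Z^m \setminus V_R^{(2)}$,
$$\frac{1}{m}\sum_{i=1}^m \xi_2(z_i) - E(\xi_2) \le \left(\frac{4 C_K^2 \log(2/\delta)}{m\gamma} + 3M + 1\right)\mathcal{D}(\gamma) + \frac{(24 M^2 + 3M)\log(2/\delta)}{m}.$$
Combining these two bounds with (8.4), we see that for all $z \in W(R) \setminus (V_R^{(1)} \cup V_R^{(2)})$,
$$\mathcal{E}(f_z) - \mathcal{E}_z(f_z) + \mathcal{E}_z(f_\gamma) - \mathcal{E}(f_\gamma) \le \tfrac{1}{2}(\mathcal{E}(f_z) - \mathcal{E}(f_\rho)) + (C_K + 3)^2 R^2\, v^*(m, \delta/2) + \left(\frac{4 C_K^2 \log(2/\delta)}{m\gamma} + 3M + 1\right)\mathcal{D}(\gamma) + \frac{(24 M^2 + 3M)\log(2/\delta)}{m}.$$
This inequality, together with Theorem 8.3, tells us that for all $z \in W(R) \setminus (V_R^{(1)} \cup V_R^{(2)})$,
$$\mathcal{E}(f_z) - \mathcal{E}(f_\rho) + \gamma\|f_z\|_K^2 \le \mathcal{D}(\gamma) + \tfrac{1}{2}(\mathcal{E}(f_z) - \mathcal{E}(f_\rho)) + (C_K + 3)^2 R^2\, v^*(m, \delta/2) + \left(\frac{4 C_K^2 \log(2/\delta)}{m\gamma} + 3M + 1\right)\mathcal{D}(\gamma) + \frac{(24 M^2 + 3M)\log(2/\delta)}{m}.$$
This gives the desired bound with $V_R = V_R^{(1)} \cup V_R^{(2)}$.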
To prove Theorem 8.10, we still need an $R$ satisfying $W(R) = Z^m$.
Lemma 8.17 For all $\gamma > 0$ and almost all $z \in Z^m$,
$$\|f_z\|_K \le \frac{M}{\sqrt{\gamma}}.$$
Proof. Since $f_z$ minimizes $\mathcal{E}_{z,\gamma}$, we have
$$\gamma\|f_z\|_K^2 \le \mathcal{E}_{z,\gamma}(f_z) \le \mathcal{E}_{z,\gamma}(0) = \frac{1}{m}\sum_{i=1}^m (y_i - 0)^2 \le M^2,$$
the last almost surely. Therefore, $\|f_z\|_K \le M/\sqrt{\gamma}$ for almost all $z \in Z^m$.

Lemma 8.17 says that $W(M/\sqrt{\gamma}) = Z^m$ up to a set of measure zero (we ignore this null set later). Take $R := M/\sqrt{\gamma} \ge M$. Theorem 8.10 follows from Proposition 8.16.


Proof of Theorem 8.1

In this section we improve the excess generalization error estimate of Theorem 8.10. The method in the previous section was rough because we used the bound $\|f_z\|_K \le M/\sqrt{\gamma}$ shown in Lemma 8.17. This is much worse than the bound for $f_\gamma$ given in Lemma 8.13, namely, $\|f_\gamma\|_K \le \sqrt{\mathcal{D}(\gamma)/\gamma}$. Yet we expect the minimizer $f_z$ of $\mathcal{E}_{z,\gamma}$ to be a good approximation of the minimizer $f_\gamma$ of $\mathcal{E}_\gamma$. In particular, we expect $\|f_z\|_K$ also to be bounded by, essentially, $\sqrt{\mathcal{D}(\gamma)/\gamma}$. We prove that this is the case with high probability by applying Proposition 8.16 iteratively. As a consequence, we obtain better bounds for the excess generalization error.
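Lemma 8.18 For all $0 < \delta < 1$ and $R \ge M$, there is a set $V_R \subset Z^m$ with $\rho(V_R) \le \delta$ such that
$$W(R) \subseteq W(a_m R + b_m) \cup V_R,$$
where $a_m := (2 C_K + 5)\sqrt{v^*(m, \delta/2)/\gamma}$ and
$$b_m := \left(2 C_K\sqrt{\frac{2\log(2/\delta)}{m\gamma}} + \sqrt{6M + 4}\right)\sqrt{\mathcal{D}(\gamma)/\gamma} + (7M + 1)\sqrt{\frac{2\log(2/\delta)}{m\gamma}}.$$
Proof. By Proposition 8.16, there is a set $V_R \subset Z^m$ with $\rho(V_R) \le \delta$ such that for all $z \in W(R) \setminus V_R$,
$$\gamma\|f_z\|_K^2 \le 2(C_K + 3)^2 R^2\, v^*(m, \delta/2) + \left(\frac{8 C_K^2 \log(2/\delta)}{m\gamma} + 6M + 4\right)\mathcal{D}(\gamma) + \frac{(48 M^2 + 6M)\log(2/\delta)}{m}.$$
This implies that
$$\|f_z\|_K \le a_m R + b_m, \qquad \forall z \in W(R) \setminus V_R,$$
with $a_m$ and $b_m$ as given in our statement. In other words, $W(R) \setminus V_R \subseteq W(a_m R + b_m)$.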

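Lemma 8.19 Assume that $K$ satisfies $\log \mathcal{N}(B_1, \eta) \le C_0 (1/\eta)^{s^*}$ for some $s^* > 0$. Take $\gamma = m^{-\zeta}$ with $\zeta < 1/(1 + s^*)$. For all $0 < \delta < 1$ and $m \ge m_\delta$, with confidence $1 - 3\delta/(1/(1 + s^*) - \zeta)$,
$$\|f_z\|_K \le C_2\sqrt{\log(2/\delta)}\left(\sqrt{\mathcal{D}(\gamma)/\gamma} + 1\right)$$
holds. Here $C_2 > 0$ is a constant that depends only on $s^*$, $\zeta$, $C_K$, $C_0$, and $M$, and $m_\delta \in \mathbb{N}$ depends also on $\delta$.
Proof. By Lemma 8.9, when $m \ge (108/C_0)^{1/s^*}(\log(2/\delta))^{1+1/s^*}$,
$$v^*(m, \delta/2) \le (108 C_0/m)^{1/(1+s^*)}$$
holds. It follows that
$$a_m \le (2 C_K + 5)(108 C_0)^{1/(2+2s^*)}\, m^{\zeta/2 - 1/(2+2s^*)}.$$
Denote $c := (2 C_K + 5)(108 C_0)^{1/(2+2s^*)}$. When $m \ge (1/(2c))^{2/(\zeta - 1/(1+s^*))}$, we have $a_m \le \frac{1}{2}$.
Choose $m_\delta := \max\bigl\{(108/C_0)^{1/s^*}(\log(2/\delta))^{1+1/s^*},\ (1/(2c))^{2/(\zeta - 1/(1+s^*))}\bigr\}$. Define a sequence $\{R^{(j)}\}_{j \in \mathbb{N}}$ by $R^{(0)} = M/\sqrt{\gamma}$ and, for $j \ge 1$,
$$R^{(j)} = a_m R^{(j-1)} + b_m.$$
Then Lemma 8.17 proves $W(R^{(0)}) = Z^m$, and Lemma 8.18 asserts that for each $j \ge 1$, $W(R^{(j-1)}) \subseteq W(R^{(j)}) \cup V_{R^{(j-1)}}$ with $\rho(V_{R^{(j-1)}}) \le \delta$. Apply this inclusion for $j = 1, 2, \ldots, J$, with $J$ satisfying $2/(1/(1 + s^*) - \zeta) \le J \le 3/(1/(1 + s^*) - \zeta)$. We see that
$$Z^m = W(R^{(0)}) \subseteq W(R^{(1)}) \cup V_{R^{(0)}} \subseteq \cdots \subseteq W(R^{(J)}) \cup \Bigl(\bigcup_{j=0}^{J-1} V_{R^{(j)}}\Bigr).$$
It follows that the measure of the set $W(R^{(J)})$ is at least $1 - J\delta \ge 1 - 3\delta/(1/(1 + s^*) - \zeta)$. By the definition of the sequence, we have
$$R^{(J)} = a_m^J R^{(0)} + b_m \sum_{j=0}^{J-1} a_m^j \le M c^J m^{J(\zeta/2 - 1/(2+2s^*)) + \zeta/2} + 2 b_m \le M c^J + 2 b_m.$$
Here we have used (8.8) and $a_m \le \frac{1}{2}$ in the first inequality, and then the restriction $J \ge 2/(1/(1 + s^*) - \zeta) > \zeta/(1/(1 + s^*) - \zeta)$ in the second inequality. Note that $c^J \le ((2 C_K + 5)(108 C_0 + 1))^{3/(1/(1+s^*) - \zeta)}$. Since $\gamma = m^{-\zeta}$, $b_m$ can be bounded as
$$b_m \le \sqrt{2\log(2/\delta)}\left\{\bigl(2 C_K + \sqrt{6M + 4}\bigr)\sqrt{\mathcal{D}(\gamma)/\gamma} + 7M + 1\right\}.$$
Thus, $R^{(J)} \le C_2\sqrt{\log(2/\delta)}\bigl(\sqrt{\mathcal{D}(\gamma)/\gamma} + 1\bigr)$ with $C_2$ depending only on $s^*$, $\zeta$, $C_K$, $C_0$, and $M$.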
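The iteration in the proof of Lemma 8.19 is an affine recursion whose contraction factor is at most $1/2$, so the large starting radius $M/\sqrt{\gamma}$ is forgotten geometrically and only a multiple of $b_m$ survives. The sketch below illustrates this with assumed numbers; the values of `r0`, `a_m`, `b_m`, and `steps` are hypothetical and are not derived from any kernel or sample.

```python
def iterate_radius(r0, a_m, b_m, steps):
    """Apply R^{(j)} = a_m * R^{(j-1)} + b_m for j = 1, ..., steps."""
    radii = [r0]
    for _ in range(steps):
        radii.append(a_m * radii[-1] + b_m)
    return radii

# Assumed illustrative values: a large initial radius and a_m = 1/2.
r0, a_m, b_m, steps = 100.0, 0.5, 1.0, 12
radii = iterate_radius(r0, a_m, b_m, steps)

# Closed form: R^{(J)} = a_m^J R^{(0)} + b_m * sum_{j<J} a_m^j.
closed_form = a_m ** steps * r0 + b_m * sum(a_m ** j for j in range(steps))
upper_bound = a_m ** steps * r0 + 2.0 * b_m
```

Each application of Lemma 8.18 costs a set of measure at most $\delta$, which is why the proof keeps the number of steps $J$ bounded by a constant independent of $m$.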

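Proof of Theorem 8.1 Applying Proposition 8.16 with $R := C_2\sqrt{\log(2/\delta)}\bigl(\sqrt{\mathcal{D}(\gamma)/\gamma} + 1\bigr)$, and using that $\gamma = m^{-\zeta}$ and $\mathcal{D}(\gamma) \le \gamma^{\theta}\|L_K^{-\theta/2} f_\rho\|^2$, we deduce from (8.7) that for $m \ge m_\delta$ and all $z \in W(R) \setminus V_R$,
$$\mathcal{E}(f_z) - \mathcal{E}(f_\rho) \le 2(C_K + 3)^2 C_2' \log(2/\delta)\, m^{\zeta(1-\theta)} (108 C_0)^{1/(1+s^*)} m^{-1/(1+s^*)} + C_2' \log(2/\delta)\, m^{-\theta\zeta} \le C_3 \log(2/\delta)\, m^{-\theta\zeta}$$
holds. Here $C_2'$, $C_3$ are positive constants depending only on $s^*$, $\zeta$, $C_K$, $C_0$, $M$, and $\|L_K^{-\theta/2} f_\rho\|$.
By Lemma 8.19, the set $W(R)$ has measure at least $1 - 3\delta/(1/(1 + s^*) - \zeta)$ when $m \ge m_\delta$. Replacing $3\delta/(1/(1 + s^*) - \zeta) + \delta$ (the last $\delta$ from the bound on $V_R$) by $\delta$ and taking the constant of Theorem 8.1 to be the resulting $C_3$, our conclusion follows.

When $s^* \to 0$ and $\theta \to 1$, we see that the convergence rate $\theta\zeta$ can be arbitrarily close to 1.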


Reminders V

We use the following result on Lagrange multipliers.

Proposition 8.20 Let $U$ be a Hilbert space and $F$, $H$ be real-valued $C^1$ functions on $U$. Let $c \in U$ be a solution of the problem
$$\min F(f) \quad \text{s.t. } H(f) \le 0.$$
Then, there exist real numbers $\mu$, $\lambda$, not both zero, such that
$$\mu\,DF(c) + \lambda\,DH(c) = 0. \qquad (8.9)$$
Here $D$ means derivative. Furthermore, if $H(c) < 0$, then $\lambda = 0$. Finally, if either $H(c) < 0$ or $DH(c) \ne 0$, then $\mu \ne 0$ and $\lambda/\mu \ge 0$.

If $\mu \ne 0$ above, we can take $\mu = 1$ and we call the resulting $\lambda$ the Lagrange multiplier of the problem at $c$.


Compactness and regularization

Let $z = (z_1, \ldots, z_m)$ with $z_i = (x_i, y_i) \in X \times Y$ for $i = 1, \ldots, m$. We also write $x = (x_1, \ldots, x_m)$ and $y = (y_1, \ldots, y_m)$. Assume that $y \ne 0$ and $K[x]$ is invertible. Let $a^* = K[x]^{-1} y$, $f^* = \sum_{i=1}^m a_i^* K_{x_i}$, and $R_0^2 = \|f^*\|_K^2 = y^T K[x]^{-1} y$.
Let $E_z(\gamma)$ and $E_z(R)$ be the problems
$$\min \frac{1}{m}\sum_{i=1}^m (f(x_i) - y_i)^2 + \gamma\|f\|_K^2 \quad \text{s.t. } f \in \mathcal{H}_K$$
and
$$\min \frac{1}{m}\sum_{i=1}^m (f(x_i) - y_i)^2 \quad \text{s.t. } f \in B(\mathcal{H}_K, R),$$
respectively, where $R, \gamma > 0$.

In Proposition 8.7 and Corollary 1.14 we have seen that the minimizers $f_{z,\gamma}$ and $f_{z,R}$ of $E_z(\gamma)$ and $E_z(R)$, respectively, exist. A natural question is, What is the relationship between the problems $E_z(\gamma)$ and $E_z(R)$ and their minimizers? The main result in this section answers this question.
Theorem 8.21 There exists a decreasing global homeomorphism $\Lambda_z : (0, +\infty) \to (0, R_0)$ satisfying
(i) for all $\gamma > 0$, $f_{z,\gamma}$ is the minimizer of $E_z(\Lambda_z(\gamma))$, and
(ii) for all $R \in (0, R_0)$, $f_{z,R}$ is the minimizer of $E_z(\Lambda_z^{-1}(R))$.
To prove Theorem 8.21, we use Proposition 8.20 for the problems $E_z(R)$ with $U = \mathcal{H}_K$, $F(f) = \frac{1}{m}\sum_{i=1}^m (f(x_i) - y_i)^2$, and $H(f) = \|f\|_K^2 - R^2$. Note that for $x \in X$, the mapping
$$\mathcal{H}_K \to \mathbb{R}, \qquad f \mapsto f(x) = \langle f, K_x\rangle_K$$
is a bounded linear functional and therefore $C^1$ with derivative $K_x$. It follows that $F$ is $C^1$ as well and $DF(f) = \frac{2}{m}\sum_{i=1}^m (f(x_i) - y_i) K_{x_i}$. Also, $H$ is $C^1$ and $DH(f) = 2f$. Define
$$\Lambda_z : (0, +\infty) \to (0, R_0), \qquad \gamma \mapsto \|f_{z,\gamma}\|_K.$$
Also, for each $R \in (0, R_0)$, choose one minimizer $f_{z,R}$ of $E_z(R)$ and let
$$\Gamma_z : (0, R_0) \to (0, +\infty), \qquad R \mapsto \text{the Lagrange multiplier of } E_z(R) \text{ at } f_{z,R}.$$
Lemma 8.22 The function $\Gamma_z$ is well defined.
Proof. We apply Proposition 8.20 to the problem $E_z(R)$ and claim that $f_{z,R}$ is not the zero function. Otherwise, $H(f_{z,R}) = H(0) < 0$, which implies $DF(f_{z,R}) = DF(0) = -\frac{2}{m}\sum_{i=1}^m y_i K_{x_i} = 0$ by (8.9), contradicting the invertibility of $K[x]$.
Since $f_{z,R}$ is not the zero function, $DH(f_{z,R}) = 2 f_{z,R} \ne 0$. Also, $DF(f_{z,R}) = \frac{2}{m}\sum_{i=1}^m (f_{z,R}(x_i) - y_i) K_{x_i} \ne 0$, since $K[x]$ is invertible. By Proposition 8.20, $\mu \ne 0$ and $\lambda \ne 0$. Taking $\mu = 1$, we conclude that the Lagrange multiplier $\lambda$ is positive.

Proposition 8.23
(i) For all $\gamma > 0$, $f_{z,\gamma}$ is the minimizer of $E_z(\Lambda_z(\gamma))$.
(ii) Let $R \in (0, R_0)$. Then $f_{z,R}$ is the minimizer of $E_z(\Gamma_z(R))$.
Proof. Assume that
$$\frac{1}{m}\sum_{i=1}^m (f(x_i) - y_i)^2 < \frac{1}{m}\sum_{i=1}^m (f_{z,\gamma}(x_i) - y_i)^2$$
for some $f \in B(\mathcal{H}_K, \Lambda_z(\gamma))$. Then
$$\frac{1}{m}\sum_{i=1}^m (f(x_i) - y_i)^2 + \gamma\|f\|_K^2 < \frac{1}{m}\sum_{i=1}^m (f_{z,\gamma}(x_i) - y_i)^2 + \gamma\Lambda_z(\gamma)^2 = \frac{1}{m}\sum_{i=1}^m (f_{z,\gamma}(x_i) - y_i)^2 + \gamma\|f_{z,\gamma}\|_K^2,$$
contradicting the requirement that $f_{z,\gamma}$ minimizes the objective function of $E_z(\gamma)$. This proves Part (i).
For Part (ii) note that the proof of Lemma 8.22 yields $\mu = 1$ and $\lambda = \Gamma_z(R) > 0$. Since $f_{z,R}$ minimizes $E_z(R)$, by Proposition 8.20, $f_{z,R}$ satisfies
$$D(F + \lambda H)(f_{z,R}) = D(F + \lambda\|f\|_K^2)(f_{z,R}) = 0;$$
that is, the derivative of the objective function of $E_z(\lambda)$ vanishes at $f_{z,R}$. Since this function is convex and $E_z(\lambda)$ is an unconstrained problem, we conclude that $f_{z,R}$ is the minimizer of $E_z(\lambda) = E_z(\Gamma_z(R))$.

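Proposition 8.24 $\Lambda_z$ is a decreasing global homeomorphism with inverse $\Gamma_z$.

Proof. Since $K$ is a Mercer kernel, the matrix $K[x]$ is positive definite by the invertibility assumption. So there exist an orthogonal matrix $P$ and a diagonal matrix $D$ such that $K[x] = P D P^{-1}$. Moreover, the main diagonal entries $d_1, \ldots, d_m$ of $D$ are positive. Let $y' = P^{-1} y$. Then, by Proposition 8.7, $f_{z,\gamma} = \sum_{i=1}^m a_i K_{x_i}$ with $a$ satisfying $(\gamma m\,\mathrm{Id} + D) P^{-1} a = P^{-1} y = y'$. It follows that
$$P^{-1} a = \left(\frac{y_i'}{\gamma m + d_i}\right)_{i=1}^m$$
and, using $P^T = P^{-1}$,
$$\Lambda_z(\gamma) = \|f_{z,\gamma}\|_K = \sqrt{a^T K[x] a} = \sqrt{(P^{-1} a)^T D\, P^{-1} a} = \left(\sum_{i=1}^m \frac{d_i\, y_i'^2}{(\gamma m + d_i)^2}\right)^{1/2},$$
which is positive for all $\gamma \in [0, +\infty)$ since $y \ne 0$ by assumption. Differentiating with respect to $\gamma$,
$$\Lambda_z'(\gamma) = -\frac{m}{\|f_{z,\gamma}\|_K}\sum_{i=1}^m \frac{d_i\, y_i'^2}{(\gamma m + d_i)^3}.$$
This expression is negative for all $\gamma \in [0, +\infty)$. This shows that $\Lambda_z$ is strictly decreasing in its domain. The first statement now follows since $\Lambda_z$ is continuous, $\Lambda_z(0) = R_0$, and $\Lambda_z(\gamma) \to 0$ when $\gamma \to \infty$.
To prove the second statement, consider $\gamma > 0$. Then, by Proposition 8.23(i) and (ii),
$$f_{z,\gamma} = f_{z,\Lambda_z(\gamma)} = f_{z,\Gamma_z(\Lambda_z(\gamma))}.$$
To prove that $\gamma = \Gamma_z(\Lambda_z(\gamma))$, it is thus enough to prove that for $\gamma, \gamma' \in (0, +\infty)$, if $f_{z,\gamma} = f_{z,\gamma'}$, then $\gamma = \gamma'$. To do so, let $i$ be such that $y_i' \ne 0$ (such an $i$ exists since $y \ne 0$). Since the coefficient vectors for $f_{z,\gamma}$ and $f_{z,\gamma'}$ are the same $a$ with $P^{-1} a = (y_i'/(\gamma m + d_i))_{i=1}^m$, we have in particular
$$\frac{y_i'}{\gamma m + d_i} = \frac{y_i'}{\gamma' m + d_i},$$
whence it follows that $\gamma = \gamma'$.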
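The closed form for $\Lambda_z$ obtained in the proof makes its monotonicity easy to observe numerically. The sketch below works directly in the diagonalized coordinates, so no eigendecomposition is needed; the eigenvalues `d` and rotated labels `y_prime` are assumed illustrative values, not data from the text, and `lambda_z` is a hypothetical name.

```python
import math

def lambda_z(gamma, d, y_prime):
    """Lambda_z(gamma) = sqrt(sum_i d_i * y'_i^2 / (gamma*m + d_i)^2)."""
    m = len(d)
    return math.sqrt(sum(di * yi ** 2 / (gamma * m + di) ** 2
                         for di, yi in zip(d, y_prime)))

# Hypothetical eigenvalues d_i > 0 of K[x] and rotated labels y' = P^{-1} y.
d = [2.0, 0.7, 0.1]
y_prime = [1.0, -0.5, 0.3]

# R_0 = sqrt(y^T K[x]^{-1} y) = sqrt(sum_i y'_i^2 / d_i) in these coordinates.
r0 = math.sqrt(sum(yi ** 2 / di for di, yi in zip(d, y_prime)))
gammas = [0.0, 0.01, 0.1, 1.0, 10.0, 100.0]
values = [lambda_z(g, d, y_prime) for g in gammas]
```

The values start at $R_0$ for $\gamma = 0$ and decrease strictly toward 0, which is the behavior that makes $\Lambda_z$ a homeomorphism onto $(0, R_0)$.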

Corollary 8.25 For all $R < R_0$, the minimizer $f_{z,R}$ of $E_z(R)$ is unique.
Proof. Let $\gamma = \Gamma_z(R)$. Then $f_{z,R} = f_{z,\gamma}$ by Proposition 8.23(ii). Now use that $f_{z,\gamma}$ is unique.
Theorem 8.21 now follows from Propositions 8.23 and 8.24 and Corollary 8.25.
Remark 8.26 Let $E(\gamma)$ and $E(R)$ be the problems
$$\min \int (f(x) - y)^2\, d\rho + \gamma\|f\|_K^2 \quad \text{s.t. } f \in \mathcal{H}_K$$
and
$$\min \int (f(x) - y)^2\, d\rho \quad \text{s.t. } f \in B(\mathcal{H}_K, R),$$
respectively, where $R, \gamma > 0$. Denote by $f_\gamma$ and $f_R$ their minimizers, respectively. Also, let $R_1 = \|f_\rho\|_K$ if $f_\rho \in \mathcal{H}_K$ and $R_1 = \infty$ otherwise. A development similar to the one in this section shows the existence of a decreasing global homeomorphism
$$\Lambda : (0, +\infty) \to (0, R_1)$$
satisfying
(i) for all $\gamma > 0$, $f_\gamma$ is the minimizer of $E(\Lambda(\gamma))$, and
(ii) for all $R \in (0, R_1)$, $f_R$ is the minimizer of $E(\Lambda^{-1}(R))$.
Here $f_R$ is the target function $f_{\mathcal{H}}$ when $\mathcal{H}$ is $I_K(B_R)$.


References and additional remarks

The problem of approximating a function from sparse data is often ill posed. A standard
approach to dealing with ill-posedness is regularization theory [36,54,64,102,130].
Regularization schemes with RKHSs were introduced to learning theory in [137] using spline
kernels and in [53,134,133] using general Mercer kernels. A key feature of RKHSs is ensuring
that the minimizer of the regularization scheme can be found in the subspace spanned by
{KXi }m=1. Hence, the minimization over the possibly infinite-dimensional function space HK
is reduced to minimization over a finite-dimensional space [50]. This follows from the
reproducing property in RKHSs. We have devoted Section 2.8 to this feature. It is extended to
other contexts in the next two chapters.
The error analysis for the least squares regularization scheme was considered in [38] in
terms of covering numbers. The distance between fz,Y and fY was studied in [24] using stability
analysis. In [153], using leave-one-out techniques,
E (E fy)) <
2C2 2

inf E(f ) + Y IIf IlK

it was proved that
In [42, 43] a functional analysis approach was employed to show that for any
confidence 1 8,
E(fZ,y) - E(fy)I < m
(1 + ^CY) (1 +72log(2/5))

< 8 < 1 , with

Parts (i) and (ii) of Proposition 8.5 were given in [39]. Part (iii) with $1 < \theta \le 2$ was proved in [115], and the extension to $2 < \theta \le 3$ was shown by Minh in the appendix to [116]. In [115], a modified McDiarmid inequality was used to derive error bounds in the metric induced by $\|\cdot\|_K$. If $f_\rho$ is in the range of $L_K$, then, for any $0 < \delta < 1$ with confidence $1 - \delta$,
holds, where $C$ is a constant independent of $m$ and $\delta$. In [116] a Bennett inequality for vector-valued random variables with values in Hilbert spaces is applied, which yields better error bounds. If $f_\rho$ is in the range of $L_K$, then we have
$$\|f_{z,\gamma} - f_\rho\|_{L^2_{\rho_X}}^2 \le C\,(\log(4/\delta))^2\,(1/m)^{2/3} \quad\text{by taking } \gamma = \log(4/\delta)\,(1/m)^{1/3}.$$


These results are capacity-independent error bounds. The error analysis presented in this
chapter is capacity dependent and was mainly done in [143]. When fp e HK and s* < 2, the
learning rate given by Theorem 8.1 is better than capacity-independent ones.
A proof of Proposition 8.20 can be found, for instance, in [11].
For some applications, such as signal processing, inverse problems, and numerical analysis, the data $(x_i)_{i=1}^m$ may be deterministic, not randomly drawn according to $\rho_X$. Then the regularization scheme $\inf_{f \in \mathcal{H}_K} \mathcal{E}_{z,\gamma}(f)$ involves only the random data $(y_i)_{i=1}^m$. For active learning [33, 81], the data $(x_i)_{i=1}^m$ are drawn according to a user-defined distribution that is different from $\rho_X$. Such schemes and their connections to richness of data have been studied in [114].


Support vector machines for classification

In the previous chapters we have dealt with the problem of learning a function $f : X \to Y$ when $Y = \mathbb{R}$. We have described algorithms producing an approximation $f_z$ of $f$ from a given sample $z \in Z^m$ and we have measured the quality of this approximation with the generalization error $\mathcal{E}$ as a ruler.
Although this setting applies to a good number of situations arising in practice, there are
quite a few that can be better approached. One paramount example is that described in Case
1.5. Recall that in this case we dealt with a space Y consisting of two elements (in Case 1.5
they were 0 and 1). Problems consisting of learning a binary (or finitely) valued function are
called classification problems. They occur frequently in practice (e.g., the determination,
from a given sample of clinical data, of whether a patient suffers a certain disease), and they
will be the subject of this (and the next) chapter.
A binary classifier on a compact metric space $X$ is a function $f : X \to \{1, -1\}$. To provide some continuity in our notation, we denote $Y = \{-1, 1\}$ and keep $Z = X \times Y$. Classification problems thus consist of learning binary classifiers. To measure the quality of our approximations, an appropriate notion of error is essential.
Definition 9.1 Let $\rho$ be a probability distribution on $Z := X \times Y$. The misclassification error $\mathcal{R}(f)$ for a classifier $f : X \to Y$ is defined to be the probability of a wrong prediction, that is, the measure of the event $\{f(x) \ne y\}$,
$$\mathcal{R}(f) := \mathop{\mathrm{Prob}}_{z \in Z}\{f(x) \ne y\} = \int_X \mathop{\mathrm{Prob}}_{y \in Y}(y \ne f(x) \mid x)\, d\rho_X. \qquad (9.1)$$
Our target concept (in the sense of Case 1.5) is the set $T := \{x \in X \mid \mathrm{Prob}\{y = 1 \mid x\} > \frac{1}{2}\}$, since the conditional distribution at $x$ is a binary distribution.
One goal of this chapter is to describe an approach to producing classifiers from samples
(and an RKHS HK) known as support vector machines.

Needless to say, we are interested in bounding the misclassification error of the classifiers obtained in this way. We do so for a certain class of noise-free measures that we call weakly separable. Roughly speaking (a formal definition follows in Section 9.7), these are measures for which there exists a function $f_{sp} \in \mathcal{H}_K$ such that $x \in T \iff f_{sp}(x) > 0$ and satisfy a decay condition near the boundary of $T$. The situation is as in Figure 9.1, where the rectangle is the set $X$, the dashed regions represent the set $T$, and their boundaries are the zero set of $f_{sp}$.
Support vector machines produce classifiers from a sample $z$, a real number $\gamma > 0$ (a regularization parameter as in Chapter 8), and an RKHS $\mathcal{H}_K$. Let us denote by $F_{z,\gamma}$ such a classifier. One major result in this chapter is the following (a more detailed statement is given in Theorem 9.26 below).
Theorem 9.2 Assume $\rho$ is weakly separable by $\mathcal{H}_K$. Let $B_1$ denote the unit ball in $\mathcal{H}_K$.
(i) If $\log \mathcal{N}(B_1, \eta) \le C_0 (1/\eta)^p$ for some $p, C_0 > 0$ and all $\eta > 0$, then, taking $\gamma = m^{-\beta}$ (for some $\beta > 0$), we have, with confidence $1 - \delta$,
$$\mathcal{R}(F_{z,\gamma}) \le C \left(\frac{1}{m}\right)^{r} \left(\log\frac{2}{\delta}\right)^2,$$
where $r$ and $C$ are positive constants independent of $m$ and $\delta$.
(ii) If $\log \mathcal{N}(B_1, \eta) \le C_0 (\log(1/\eta))^p$ for some $p, C_0 > 0$ and all $0 < \eta < 1$, then, for sufficiently large $m$ and some $\beta > 0$, taking $\gamma = m^{-\beta}$, we have, with confidence $1 - \delta$,
$$\mathcal{R}(F_{z,\gamma}) \le C \left(\frac{1}{m}\right)^{r} (\log m)^{p} \left(\log\frac{2}{\delta}\right)^2,$$
where $r$ and $C$ are positive constants independent of $m$ and $\delta$.

We note that the exponent $\beta$ in both parts of Theorem 9.2, unfortunately, depends on $\rho$. Details of this dependence are made explicit in Section 9.7, where Theorem 9.26 is proved.

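Binary classifiers

Just as in Chapter 1, where we saw that the real-valued function minimizing the error is the regression function $f_\rho$, we may wonder which binary classifier minimizes $\mathcal{R}$. The answer is simple. For a function $f : X \to \mathbb{R}$ define
$$\mathrm{sgn}(f)(x) = \begin{cases} 1 & \text{if } f(x) \ge 0 \\ -1 & \text{if } f(x) < 0. \end{cases}$$
Also, let $K_\rho := \{x \in X : f_\rho(x) = 0\}$ and $\kappa_\rho = \rho_X(K_\rho)$.
Proposition 9.3
(i) For any classifier $f$,
$$\mathcal{R}(f) = \frac{1}{2}\kappa_\rho + \int_{X \setminus K_\rho} \mathrm{Prob}(y \ne f(x) \mid x)\, d\rho_X.$$
(ii) $\mathcal{R}$ is minimized by any classifier coinciding on $X \setminus K_\rho$ with
$$f_c := \mathrm{sgn}(f_\rho).$$
Proof. Since $Y = \{1, -1\}$, we have $f_\rho(x) = \mathrm{Prob}_Y(y = 1 \mid x) - \mathrm{Prob}_Y(y = -1 \mid x)$. This means that
$$f_c(x) = \mathrm{sgn}(f_\rho)(x) = \begin{cases} 1 & \text{if } \mathrm{Prob}_Y(y = 1 \mid x) \ge \mathrm{Prob}_Y(y = -1 \mid x) \\ -1 & \text{if } \mathrm{Prob}_Y(y = 1 \mid x) < \mathrm{Prob}_Y(y = -1 \mid x). \end{cases}$$
For any classifier $f$ and any $x \in K_\rho$, we have $\mathrm{Prob}_Y(y \ne f(x) \mid x) = \frac{1}{2}$. Hence statement (i) holds.
For the second statement, we observe that for $x \in X \setminus K_\rho$, $\mathrm{Prob}_Y(y = f_c(x) \mid x) > \mathrm{Prob}_Y(y \ne f_c(x) \mid x)$. Then, for any classifier $f$, we have either $f(x) = f_c(x)$ or
$$\mathrm{Prob}(y \ne f(x) \mid x) = \mathrm{Prob}(y = f_c(x) \mid x) > \mathrm{Prob}(y \ne f_c(x) \mid x).$$
Hence $\mathcal{R}(f) \ge \mathcal{R}(f_c)$, and equality holds if and only if $f$ and $f_c$ are equal almost everywhere on $X \setminus K_\rho$.
The classifier $f_c$ is called Bayes rule.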
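Proposition 9.3 can be verified by brute force on a finite space. The sketch below assumes a hypothetical three-point $X$ with made-up marginal and conditional probabilities, and it adopts the convention $\mathrm{sgn}(0) = 1$ (an assumption of this example). It enumerates all $2^3$ classifiers and compares their misclassification errors with that of $\mathrm{sgn}(f_\rho)$.

```python
from itertools import product

# Hypothetical distribution on a three-point space X = {0, 1, 2}:
# marginal[x] = rho_X({x}), p_plus[x] = Prob(y = 1 | x).
marginal = [0.5, 0.3, 0.2]
p_plus = [0.9, 0.5, 0.2]

def sgn(value):
    """Sign with the convention sgn(0) = 1 (an assumption of this sketch)."""
    return 1 if value >= 0 else -1

def misclassification_error(classifier):
    """R(f) = sum_x rho_X(x) * Prob(y != f(x) | x)."""
    total = 0.0
    for x, label in enumerate(classifier):
        wrong = 1.0 - p_plus[x] if label == 1 else p_plus[x]
        total += marginal[x] * wrong
    return total

# f_rho(x) = Prob(y = 1 | x) - Prob(y = -1 | x) = 2 * Prob(y = 1 | x) - 1.
regression = [2.0 * p - 1.0 for p in p_plus]
bayes_rule = tuple(sgn(v) for v in regression)
errors = {f: misclassification_error(f) for f in product([1, -1], repeat=3)}
best_error = min(errors.values())
```

Here the middle point has $f_\rho = 0$, so it belongs to $K_\rho$: flipping the label there leaves the error unchanged, and it contributes exactly $\frac{1}{2}\kappa_\rho$ to every classifier's error.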
Remark 9.4 The role played by the quantity $\kappa_\rho$ is reminiscent of that played by $\sigma_\rho^2$ in the regression setting. Note that $\kappa_\rho$ depends only on $\rho$. Therefore, its occurrence in Proposition 9.3(i) - just as that of $\sigma_\rho^2$ in Proposition 1.8 - is independent of $f$. In this sense, it yields a lower bound for the misclassification error and is, again, a measure of how well conditioned $\rho$ is.
As $\rho$ is unknown, the best classifier $f_c$ cannot be found directly. The goal of classification algorithms is to find classifiers that approximate Bayes rule $f_c$ from samples $z \in Z^m$.
A possible strategy to achieve this goal could consist of fixing an RKHS $\mathcal{H}_K$ and a $\gamma > 0$, finding the minimizer $f_{z,\gamma}$ of the regularized empirical error (as described in Chapter 8), that is,
$$f_{z,\gamma} = \mathop{\mathrm{argmin}}_{f \in \mathcal{H}_K} \frac{1}{m}\sum_{i=1}^m (f(x_i) - y_i)^2 + \gamma\|f\|_K^2, \qquad (9.3)$$
and then taking the function $\mathrm{sgn}(f_{z,\gamma})$ as an approximation of $f_c$. Note that this strategy minimizes a functional on a set of real-valued continuous functions and then applies the sgn function to the computed minimizer to obtain a classifier.
A different strategy consists of first taking signs to obtain the set $\{\mathrm{sgn}(f) \mid f \in \mathcal{H}_K\}$ of classifiers and then minimizing an empirical error over this set. To see which empirical error we want to minimize, note that for a classifier $f : X \to Y$,
$$\mathcal{R}(f) = \int_Z \chi_{\{f(x) \ne y\}}\, d\rho = \int_Z \chi_{\{y f(x) = -1\}}\, d\rho.$$
Then, for $f \in \mathcal{H}_K$ satisfying $f(x) \ne 0$ almost everywhere,
$$\mathcal{R}(\mathrm{sgn}(f)) = \int_Z \chi_{\{\mathrm{sgn}(f(x))\,y = -1\}}\, d\rho = \int_Z \chi_{\{\mathrm{sgn}(y f(x)) = -1\}}\, d\rho = \int_Z \chi_{\{y f(x) < 0\}}\, d\rho.$$
By discretizing the integral into a sum, given the sample $z = \{(x_i, y_i)\}_{i=1}^m \in Z^m$, one might consider a binary classifier $\mathrm{sgn}(f)$, where $f$ is a solution of
$$\mathop{\mathrm{argmin}}_{\substack{f \in \mathcal{H}_K \\ f(x) \ne 0 \text{ a.e.}}} \frac{1}{m}\sum_{i=1}^m \chi_{\{y_i f(x_i) < 0\}},$$
or, dropping the restriction that $f(x) \ne 0$ a.e. for simplicity,
$$\mathop{\mathrm{argmin}}_{f \in \mathcal{H}_K} \frac{1}{m}\sum_{i=1}^m \chi_{\{y_i f(x_i) < 0\}}. \qquad (9.4)$$
Note that in practical terms, we are again minimizing over $\mathcal{H}_K$. But we are now minimizing a different functional.
It is clear, however, that if $f$ is any minimizer of (9.4), so is $\alpha f$ for all $\alpha > 0$. This shows that the regularized version of (9.4) (regularized by adding the term $\gamma\|f\|_K^2$ to the functional to be minimized) has no solution. It also shows that we can take as minimizer a function with norm 1. We conclude that we can approximate the Bayes rule by $\mathrm{sgn}(f_z^0)$, where $f_z^0$ is given by
$$f_z^0 := \mathop{\mathrm{argmin}}_{\substack{f \in \mathcal{H}_K \\ \|f\|_K = 1}} \frac{1}{m}\sum_{i=1}^m \chi_{\{y_i f(x_i) < 0\}}. \qquad (9.5)$$
We show in the next section that although we can reduce the computation of $f_z^0$ to a nonlinear programming problem, the problem is not a convex one. Hence, we do not possess efficient algorithms to find $f_z^0$ (cf. Section 2.7). We also introduce a third approach that lies somewhere in between those leading to problems (9.3) and (9.5). This new approach then occupies us for the remainder of this (and the next) chapter. We focus on its geometric background, error analysis, and algorithmic features.
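The scale invariance that rules out a regularized version of the empirical misclassification problem is easy to see in code. In the sketch below the sample and the function `f` are hypothetical, chosen only for illustration: the fraction of points with $y_i f(x_i) < 0$ is computed for $f$ and for several positive multiples $\alpha f$.

```python
def empirical_misclassification(f, sample):
    """(1/m) * number of sample points with y_i * f(x_i) < 0."""
    return sum(1 for x, y in sample if y * f(x) < 0) / len(sample)

def scaled(f, alpha):
    """Return the function alpha * f."""
    return lambda x: alpha * f(x)

# Hypothetical sample of (x_i, y_i) pairs and a hypothetical real-valued f.
sample = [(-2.0, -1), (-0.5, 1), (0.3, 1), (1.2, 1), (2.0, -1)]
f = lambda x: x - 0.1

base = empirical_misclassification(f, sample)
rescaled = [empirical_misclassification(scaled(f, alpha), sample)
            for alpha in (1e-6, 0.5, 3.0, 1e6)]
```

Since shrinking $\alpha$ lowers a penalty $\gamma\|\alpha f\|_K^2$ without changing the empirical error, adding such a penalty to this functional gives an infimum that a nonzero minimizer cannot attain, which motivates normalizing to $\|f\|_K = 1$ instead.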



Regularized classifiers

A loss (function) is a function $\phi : \mathbb{R} \to \mathbb{R}_+$. For $(x, y) \in Z$ and $f : X \to \mathbb{R}$, the quantity $\phi(y f(x))$ measures the local error (w.r.t. $\phi$). Recall from Chapter 1 that this is the error resulting from the use of $f$ as a model for the process producing $y$ from $x$. Global errors are obtained by averaging over $Z$ and empirical errors by averaging over a sample $z \in Z^m$.
Definition 9.5 The generalization error associated with the loss $\phi$ is defined as
$$\mathcal{E}^{\phi}(f) = \int_Z \phi(y f(x))\, d\rho.$$
The empirical error associated with the loss $\phi$ and a sample $z \in Z^m$ is defined as
$$\mathcal{E}_z^{\phi}(f) := \frac{1}{m}\sum_{i=1}^m \phi(y_i f(x_i)).$$
If $f \in \mathcal{H}_K$ for some Mercer kernel $K$, then we can define regularized versions of these errors. For $\gamma > 0$, we define the regularized error
$$\mathcal{E}_\gamma^{\phi}(f) := \int_Z \phi(y f(x))\, d\rho + \gamma\|f\|_K^2$$
and the regularized empirical error
$$\mathcal{E}_{z,\gamma}^{\phi}(f) := \frac{1}{m}\sum_{i=1}^m \phi(y_i f(x_i)) + \gamma\|f\|_K^2.$$
Examples of loss functions are the misclassification loss
$$\phi_0(t) = \begin{cases} 0 & \text{if } t \ge 0 \\ 1 & \text{if } t < 0 \end{cases}$$
and the least-squares loss $\phi_{ls}(t) = (1 - t)^2$. Note that for functions $f : X \to \mathbb{R}$ and points $x \in X$ such that $f(x) \ne 0$, $\phi_0(y f(x)) = \chi_{\{y \ne \mathrm{sgn}(f(x))\}}$; that is, the local error is 1 if $y$ and $f(x)$ have different signs and 0 when the signs are the same.
Proposition 9.6 Restricted to binary classifiers, the generalization error w.r.t. $\phi_0$ is the misclassification error $\mathcal{R}$; that is, for all classifiers $f$,
$$\mathcal{R}(f) = \mathcal{E}^{\phi_0}(f).$$
In addition, the generalization error w.r.t. $\phi_{ls}$ is the generalization error $\mathcal{E}$. Similar statements hold for the empirical errors.
Proof. The first statement follows from the equalities
$$\mathcal{R}(f) = \int_Z \chi_{\{y f(x) = -1\}}\, d\rho = \int_Z \phi_0(y f(x))\, d\rho = \mathcal{E}^{\phi_0}(f).$$
For the second statement, note that the generalization error $\mathcal{E}(f)$ of $f$ satisfies
$$\mathcal{E}(f) = \int_Z (y - f(x))^2\, d\rho = \int_Z (1 - y f(x))^2\, d\rho = \mathcal{E}^{\phi_{ls}}(f),$$
since elements $y \in Y = \{1, -1\}$ satisfy $y^2 = 1$ and therefore
$$(y - f(x))^2 = (y - y^2 f(x))^2 = y^2 (1 - y f(x))^2 = (1 - y f(x))^2.$$
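Both identities in Proposition 9.6 rest only on $y \in \{1, -1\}$ and can be spot-checked. The following sketch uses hypothetical prediction values, and the value of the misclassification loss at $t = 0$ is taken as 0, which is a convention assumed here. It compares $(y - f(x))^2$ with $\phi_{ls}(y f(x))$ and checks that $\phi_0(y c)$ flags exactly the wrong predictions of a classifier output $c$.

```python
def misclassification_loss(t):
    """phi_0(t): 1 for t < 0 and 0 for t >= 0 (value at 0 is a convention here)."""
    return 1.0 if t < 0 else 0.0

def least_squares_loss(t):
    """phi_ls(t) = (1 - t)^2."""
    return (1.0 - t) ** 2

labels = (1, -1)
predictions = (-2.0, -0.5, 0.25, 1.0, 3.0)   # hypothetical values of f(x)

# (y - f(x))^2 and phi_ls(y f(x)) side by side for every label/prediction pair.
pairs = [((y - v) ** 2, least_squares_loss(y * v))
         for y in labels for v in predictions]

# For classifier outputs c in {1, -1}: phi_0(y c) is 1 exactly when c != y.
classifier_check = [(misclassification_loss(y * c), 1.0 if c != y else 0.0)
                    for y in labels for c in labels]
```

Writing a loss as a function of the single quantity $y f(x)$ is what lets the regression and classification errors be treated in one framework.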

Recall that $\mathcal{H}_{K,z}$ is the finite-dimensional subspace of $\mathcal{H}_K$ spanned by $\{K_{x_1}, \ldots, K_{x_m}\}$ and $P : \mathcal{H}_K \to \mathcal{H}_{K,z}$ is the orthogonal projection. Corollary 2.26 showed that when $\mathcal{H} = I_K(B_R)$, the empirical target function $f_z$ for the regression problem can be chosen in $\mathcal{H}_{K,z}$. Proposition 8.7 gave a similar statement for the regularized empirical target function $f_{z,\gamma}$ (and exhibited explicit expressions for the coefficients of $f_{z,\gamma}$ as a linear combination of $\{K_{x_1}, \ldots, K_{x_m}\}$). The proofs of Proposition 2.25 and Corollary 2.26 readily extend to show the following result.
Proposition 9.7 Let $K$ be a Mercer kernel on $X$, and $\phi$ a loss function. Let also $B \subseteq \mathcal{H}_K$, $\gamma > 0$, and $z \in Z^m$. If $f \in \mathcal{H}_K$ is a minimizer of $\mathcal{E}_z^{\phi}$ in $B$, then $P(f)$ is a minimizer of $\mathcal{E}_z^{\phi}$ in $P(B)$. If, in addition, $P(B) \subseteq B$ and $\mathcal{E}_z^{\phi}$ can be minimized in $B$, then such a minimizer can be chosen in $P(B)$. Similar statements hold for $\mathcal{E}_{z,\gamma}^{\phi}$.
We can use Proposition 9.7 to state the problem of computing $f_z^0$ as a nonlinear programming problem.
Corollary 9.8 We can take
$$f_z^0 := \mathop{\mathrm{argmin}}_{\substack{f \in \mathcal{H}_{K,z} \\ \|f\|_K = 1}} \frac{1}{m}\sum_{i=1}^m \chi_{\{y_i f(x_i) < 0\}}.$$
Proof. Let $f^*$ be a minimizer of $\mathcal{E}_z^{\phi_0}(f) = \frac{1}{m}\sum_{i=1}^m \chi_{\{y_i f(x_i) < 0\}}$ in $\mathcal{H}_K \cap \{f \mid \|f\|_K = 1\}$. By Proposition 9.7, $P(f^*) \in \mathcal{H}_{K,z}$ satisfies $\mathcal{E}_z^{\phi_0}(f^*) = \mathcal{E}_z^{\phi_0}(P(f^*))$.
If $P(f^*) \ne 0$, we thus have $\mathcal{E}_z^{\phi_0}(f^*) = \mathcal{E}_z^{\phi_0}(P(f^*)/\|P(f^*)\|_K)$, showing that a minimizer exists in $\mathcal{H}_{K,z} \cap \{f \mid \|f\|_K = 1\}$.
If $P(f^*) = 0$, then $\mathcal{E}_z^{\phi_0}(f^*) = \mathcal{E}_z^{\phi_0}(0) = 1$, the maximal possible error. This means that for all $f \in \mathcal{H}_K$, $\mathcal{E}_z^{\phi_0}(f) = 1$, so we may take any function in $\mathcal{H}_{K,z} \cap \{f \mid \|f\|_K = 1\}$ as a minimizer of $\mathcal{E}_z^{\phi_0}$.
Proposition 9.7 (and Corollary 9.8 when $\phi = \phi_0$) places the problem of finding minimizers of $\mathcal{E}_z^{\phi}$ or $\mathcal{E}_{z,\gamma}^{\phi}$ in the setting of the general nonlinear programming problem. But we would actually like to deal with a programming problem for which efficient algorithms exist - for instance, a convex programming problem. This is not the case, unfortunately, for the loss function $\phi_0$.

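Remark 9.9 Take $\phi = \phi_0$ and consider the problem of minimizing $\mathcal{E}_z^{\phi_0}$ on $\{f \in \mathcal{H}_K \mid \|f\|_K = 1\}$. By Corollary 9.8, we can minimize $\mathcal{E}_z^{\phi_0}$ on $\{f \in \mathcal{H}_{K,z} \mid \|f\|_K = 1\}$ and take
\[ f_z^{\phi_0} = \sum_{j=1}^m c_{z,j} K_{x_j}, \]
where
\[ c_z = (c_{z,1}, \dots, c_{z,m}) = \operatorname*{argmin}_{\substack{c \in \mathbb{R}^m \\ c^T K[x] c = 1}} \frac{1}{m} \sum_{i=1}^m \chi_{\{\sum_{j=1}^m c_j y_i K(x_i, x_j) < 0\}}. \]
Since $S_K^{m-1} = \{c \in \mathbb{R}^m \mid c^T K[x] c = 1\}$ is not a convex subset of $\mathbb{R}^m$ and $\chi_{\{\sum_{j=1}^m c_j y_i K(x_i, x_j) < 0\}}$ may not be a convex function of $c \in S_K^{m-1}$, the optimization problem of computing $c_z$ is not, in general, a convex programming problem.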
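The failure of convexity for the misclassification loss can already be seen on the real line. The sketch below is only an illustration, with hypothetical test points `a` and `b`; it applies the midpoint test that every convex function must pass.

```python
def phi_0(t):
    """Misclassification loss: 1 if t < 0, else 0."""
    return 1.0 if t < 0 else 0.0

# Midpoint test of convexity: a convex function g satisfies
# g((a + b) / 2) <= (g(a) + g(b)) / 2 for all a, b.
a, b = -3.0, 1.0
midpoint_value = phi_0((a + b) / 2)          # phi_0(-1.0)
average_value = (phi_0(a) + phi_0(b)) / 2    # (phi_0(-3.0) + phi_0(1.0)) / 2

# True here, so phi_0 fails the midpoint test and is not convex.
is_violated = midpoint_value > average_value
```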
We would thus like to replace the loss $\phi_0$ by a loss $\phi$ that, on one hand, approximates Bayes rule (for which we will require that $\phi$ is close to the misclassification loss $\phi_0$) and, on the other hand, leads to a convex programming problem. Although we could do so in the setting described in Chapter 1 (we actually did it with $f_z^{\phi_0}$ above), we instead consider the regularized setting of Chapter 8.
Definition 9.10 Let $K$ be a Mercer kernel, $\phi$ a loss function, $z \in Z^m$, and $\gamma > 0$. The regularized classifier associated with $K$, $\phi$, $z$, and $\gamma$ is defined as $\mathrm{sgn}(f_{z,\gamma}^{\phi})$, where
\[ f_{z,\gamma}^{\phi} := \operatorname*{argmin}_{f \in \mathcal{H}_K} \left\{ \frac{1}{m} \sum_{i=1}^m \phi(y_i f(x_i)) + \gamma \|f\|_K^2 \right\}. \tag{9.6} \]
Note that (9.6) is a regularization scheme like those described in Chapter 8. The constant $\gamma > 0$ is called the regularization parameter, and it is often selected as a function of $m$, $\gamma = \gamma(m)$.
Proposition 9.11 If $\phi : \mathbb{R} \to \mathbb{R}_+$ is convex, then the optimization problem induced by (9.6) is a convex programming one.
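Proof. According to Proposition 9.7, $f_{z,\gamma}^{\phi} = \sum_{j=1}^m c_{z,j} K_{x_j}$, where
\[ c_z = (c_{z,1}, \dots, c_{z,m}) = \operatorname*{argmin}_{c \in \mathbb{R}^m} \frac{1}{m} \sum_{i=1}^m \phi\Bigl( \sum_{j=1}^m y_i K(x_i, x_j) c_j \Bigr) + \gamma \sum_{i,j=1}^m c_i K(x_i, x_j) c_j. \]
For each $i = 1, \dots, m$, $\phi\bigl(\sum_{j=1}^m y_i K(x_i, x_j) c_j\bigr) = \phi\bigl(y_i (K[x] c)_i\bigr)$ is a convex function of $c \in \mathbb{R}^m$. In addition, since $K$ is a Mercer kernel, the Gramian matrix $K[x]$ is positive semidefinite. Therefore, the function $c \mapsto c^T K[x] c$ is convex. Thus, $c_z$ is the minimizer of a convex function.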
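The objective in the proof of Proposition 9.11 can be written down directly as a function of the coefficient vector $c$. The following sketch evaluates it for the hinge loss and a Gaussian kernel and runs a randomized midpoint test of convexity. The sample, the kernel width, and the value of `gamma` are hypothetical choices made only for illustration, and the test is a numerical sanity check, not a proof.

```python
import math
import random

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian kernel on the real line (a Mercer kernel)."""
    return math.exp(-((x - y) ** 2) / sigma ** 2)

def objective(c, xs, ys, gamma, loss):
    """(1/m) * sum_i loss(y_i * (K[x] c)_i) + gamma * c^T K[x] c."""
    m = len(xs)
    gram = [[gaussian_kernel(xs[i], xs[j]) for j in range(m)] for i in range(m)]
    kc = [sum(gram[i][j] * c[j] for j in range(m)) for i in range(m)]
    data_term = sum(loss(ys[i] * kc[i]) for i in range(m)) / m
    penalty = gamma * sum(c[i] * kc[i] for i in range(m))
    return data_term + penalty

hinge = lambda t: max(1.0 - t, 0.0)

random.seed(0)
xs = [-1.5, -0.4, 0.2, 1.1]   # hypothetical sample points
ys = [-1, -1, 1, 1]           # hypothetical labels
gamma = 0.1

# Count failures of g((a + b)/2) <= (g(a) + g(b))/2 over random pairs (a, b).
violations = 0
for _ in range(200):
    a = [random.uniform(-3, 3) for _ in xs]
    b = [random.uniform(-3, 3) for _ in xs]
    mid = [(u + v) / 2 for u, v in zip(a, b)]
    lhs = objective(mid, xs, ys, gamma, hinge)
    rhs = (objective(a, xs, ys, gamma, hinge) + objective(b, xs, ys, gamma, hinge)) / 2
    if lhs > rhs + 1e-9:
        violations += 1
```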

Regularized classifiers associated with general loss functions are discussed in the next chapter. In particular, we show there that the least squares loss $\phi_{\mathrm{ls}}$ yields a satisfactory algorithm from the point of view of convergence rates in its error analysis. Here we restrict our exposition to a special loss, called hinge loss,
\[ \phi_h(t) = (1 - t)_+ = \max\{1 - t, 0\}. \]
The regularized classifier associated with the hinge loss, the support vector machine, has been used extensively and appears to have a small misclassification error in practice. One nice property of the hinge loss $\phi_h$, not possessed by the least squares loss $\phi_{\mathrm{ls}}$, is the elimination of the local error when $y f(x) > 1$. This property often makes the solution $f_{z,\gamma}^{\phi_h}$ of (9.6) sparse in the representation $f_{z,\gamma}^{\phi_h} = \sum_{i=1}^m c_{z,i} K_{x_i}$. That is, most coefficients $c_{z,i}$ in this representation vanish. Hence the computation of $f_{z,\gamma}^{\phi_h}$ can, in practice, be very fast. We return to this issue at the end of Section 9.4.


Although the definition of the hinge loss may not suggest at first glance any particular reason for inducing good classifiers, it turns out that there is some geometry to explain why it may do so. We next digress on this geometry.

9.3 Optimal hyperplanes: the separable case

Suppose $X \subseteq \mathbb{R}^n$ and $z = (z_1, \dots, z_m)$ is a sample set with $z_i = (x_i, y_i)$, $i = 1, \dots, m$. Then $z$ consists of two classes with the following sets of indices: $I = \{i \mid y_i = 1\}$ and $II = \{i \mid y_i = -1\}$. Let $H$ be a hyperplane given by $w \cdot x = b$ with $w \in \mathbb{R}^n$, $\|w\| = 1$, and $b \in \mathbb{R}$. We say that $I$ and $II$ are separable by $H$ when, for $i = 1, \dots, m$,
\[ w \cdot x_i > b \quad \text{if } i \in I, \qquad w \cdot x_i < b \quad \text{if } i \in II. \]
That is, points $x_i$ corresponding to $I$ and $II$ lie on different sides of $H$. We say that $I$ and $II$ are separable (or that $z$ is so) when there exists a hyperplane $H$ separating them. As shown in Figure 9.2, if $w$ is a unit vector in $\mathbb{R}^n$, then the distance from a point $x^* \in \mathbb{R}^n$ to the plane $w \cdot x = 0$ is $\|x^*\| \, |\cos\theta| = \|w\| \|x^*\| \, |\cos\theta| = |w \cdot x^*|$. For any $b \in \mathbb{R}$, the hyperplane $H$ given by $w \cdot x = b$ is parallel to $w \cdot x = 0$ and the distance from the point $x^*$ to $H$ is $|w \cdot x^* - b|$. When $w \cdot x^* - b < 0$, the point $x^*$ lies on the side of $H$ opposite to the direction $w$.
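If $I$ and $II$ are separable by $H$, points $x_i$ with $i \in I$ satisfy $w \cdot x_i - b > 0$ and the point(s) in this set closest to $H$ is (are) at a distance $b_I(w) := \min_{i \in I} \{w \cdot x_i - b\} = \min_{i \in I} w \cdot x_i - b$. Similarly, points $x_i$ with $i \in II$ satisfy $w \cdot x_i - b < 0$ and the point(s) in this set closest to $H$ is (are) at a distance $b_{II}(w) := -\max_{i \in II} \{w \cdot x_i - b\} = b - \max_{i \in II} w \cdot x_i$.
If we shift the separating hyperplane to $w \cdot x = c(w)$ with
\[ c(w) = \frac{1}{2} \Bigl\{ \min_{i \in I} w \cdot x_i + \max_{i \in II} w \cdot x_i \Bigr\}, \]
these distances become the same and equal to
\[ \Delta(w) = \frac{1}{2} \Bigl\{ \min_{i \in I} w \cdot x_i - \max_{i \in II} w \cdot x_i \Bigr\} = \frac{1}{2} \bigl\{ b_I(w) + b + (b_{II}(w) - b) \bigr\} = \frac{1}{2} \bigl\{ b_I(w) + b_{II}(w) \bigr\} > 0. \]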

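Therefore, the two classes of points are separated by the hyperplane $w \cdot x = c(w)$ and
\[ w \cdot x_i - c(w) \ge \min_{i \in I} w \cdot x_i - c(w) = \frac{1}{2} \Bigl\{ \min_{i \in I} w \cdot x_i - \max_{i \in II} w \cdot x_i \Bigr\} = \Delta(w) \quad \text{if } i \in I, \]
\[ w \cdot x_i - c(w) \le \max_{i \in II} w \cdot x_i - c(w) = \frac{1}{2} \Bigl\{ \max_{i \in II} w \cdot x_i - \min_{i \in I} w \cdot x_i \Bigr\} = -\Delta(w) \quad \text{if } i \in II. \]
Moreover, there exist points from $z$ on both hyperplanes $w \cdot x = c(w) \pm \Delta(w)$ (see the accompanying figure).
The quantity $\Delta(w)$ is called the margin associated with the direction $w$, and the set $\{x \mid w \cdot x = c(w)\}$ is the associated separating hyperplane.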

Different directions w induce different separating hyperplanes. In Figure 9.3, one can
rotate w such that a hyperplane with smaller angle still separates the data, and such a
separating hyperplane will have a larger margin (see Figure 9.4).
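The quantities $c(w)$ and $\Delta(w)$ are straightforward to compute for a given unit direction. The sketch below uses a hypothetical planar sample and two hypothetical directions `w1` and `w2` to illustrate that rotating $w$ can enlarge the margin; the helper name `margin_and_offset` is an assumption made for this example.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def margin_and_offset(w, sample):
    """Return (c(w), Delta(w)) for a unit direction w and a sample [(x, y), ...]."""
    lo = min(dot(w, x) for x, y in sample if y == 1)    # min over class I
    hi = max(dot(w, x) for x, y in sample if y == -1)   # max over class II
    return (lo + hi) / 2, (lo - hi) / 2

# Hypothetical separable sample in the plane.
sample = [((2.0, 1.0), 1), ((3.0, 2.0), 1), ((0.0, 0.0), -1), ((-1.0, 1.0), -1)]

w1 = (1.0, 0.0)
c1, delta1 = margin_and_offset(w1, sample)

# A rotated unit direction separating the same data.
w2 = (1.0 / math.sqrt(2.0), 1.0 / math.sqrt(2.0))
c2, delta2 = margin_and_offset(w2, sample)

# Every sample point lies at distance at least Delta(w) from the shifted hyperplane.
separated = all(y * (dot(w1, x) - c1) >= delta1 for x, y in sample)
```

For this sample the direction `w1` gives margin 1, while the rotated direction `w2` gives a strictly larger one.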

Figure 9.4
Any hyperplane in $\mathbb{R}^n$ induces a classifier. If its equation is $w \cdot x - b = 0$, then the function $x \mapsto \mathrm{sgn}(w \cdot x - b)$ is such a classifier. This reasoning suggests that the best classifier among those induced in this way may be that for which the direction $w$ yields a separating hyperplane with the largest possible margin $\Delta(w)$. Given $z$, such a direction is obtained by solving the optimization problem
\[ \max_{\|w\| = 1} \Delta(w) \]
or, in other words,
\[ \max_{\|w\| = 1} \frac{1}{2} \Bigl\{ \min_{y_i = 1} w \cdot x_i - \max_{y_i = -1} w \cdot x_i \Bigr\}. \tag{9.8} \]
If $w^*$ is a maximizer of (9.8) with $\Delta(w^*) > 0$, then $w^* \cdot x = c(w^*)$ is called the optimal hyperplane and $\Delta(w^*)$ is called the (maximal) margin of the sample.
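Theorem 9.12 If $z$ is separable with $I$ and $II$ both nonempty, then the optimization problem (9.8) has a unique solution $w^*$, $\Delta(w^*) > 0$, and the optimal separating hyperplane is given by $w^* \cdot x = c(w^*)$.
Proof. The function $\Delta : \mathbb{R}^n \to \mathbb{R}$ defined by
\[ \Delta(w) = \frac{1}{2} \Bigl\{ \min_{y_i = 1} w \cdot x_i - \max_{y_i = -1} w \cdot x_i \Bigr\} = \min_{i \in I} w \cdot \bar{x}_i + \min_{i \in II} w \cdot \bar{x}_i, \]
where
\[ \bar{x}_i = \begin{cases} \frac{1}{2} x_i & \text{if } y_i = 1 \\ -\frac{1}{2} x_i & \text{if } y_i = -1, \end{cases} \]
is continuous. Therefore, $\Delta$ achieves a maximum value over the compact set $\{w \in \mathbb{R}^n \mid \|w\| \le 1\}$. The maximum cannot be achieved in the interior of this set; for $w^*$ with $\|w^*\| < 1$, we have
\[ \Delta\Bigl( \frac{w^*}{\|w^*\|} \Bigr) = \frac{1}{\|w^*\|} \Delta(w^*) > \Delta(w^*). \]
Furthermore, the maximum cannot be attained at two different points. Otherwise, for two maximizers $w_1^* \neq w_2^*$, we would have, for any $i \in I$ and $j \in II$,
\[ w_1^* \cdot \bar{x}_i + w_1^* \cdot \bar{x}_j \ge \Delta(w_1^*), \qquad w_2^* \cdot \bar{x}_i + w_2^* \cdot \bar{x}_j \ge \Delta(w_2^*) = \Delta(w_1^*), \]
which implies
\[ \bigl( \tfrac{1}{2} w_1^* + \tfrac{1}{2} w_2^* \bigr) \cdot \bar{x}_i + \bigl( \tfrac{1}{2} w_1^* + \tfrac{1}{2} w_2^* \bigr) \cdot \bar{x}_j \ge \Delta(w_1^*). \]
That is, $\tfrac{1}{2} w_1^* + \tfrac{1}{2} w_2^*$ would be another maximizer, lying in the interior, which is not possible.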

For the optimal hyperplane $w^* \cdot x = c(w^*)$, all the vectors $x_i$ satisfy
\[ y_i \bigl( w^* \cdot x_i - c(w^*) \bigr) \ge \Delta(w^*) \]
no matter whether $y_i = 1$ or $y_i = -1$. The vectors $x_i$ for which equality holds are called support vectors. From Figure 9.4, we see that these are points lying on the two separating hyperplanes $w^* \cdot x = c(w^*) \pm \Delta(w^*)$. The classifier $\mathbb{R}^n \to Y$ associated with $w^*$ is given by
\[ x \mapsto \mathrm{sgn}\bigl( w^* \cdot x - c(w^*) \bigr). \]


9.4 Support vector machines

When $z$ is separable, we can obtain a classifier by solving (9.8) and then taking, if $w^*$ is the computed solution, the classifier $x \mapsto \mathrm{sgn}(w^* \cdot x - c(w^*))$. We can also solve an equivalent form of (9.8).
Theorem 9.13 Assume (9.8) has a solution $w^*$ with $\Delta(w^*) > 0$. Then $w^* = \bar{w}/\|\bar{w}\|$, where $\bar{w}$ is a solution of
\[ \min_{w \in \mathbb{R}^n,\, b \in \mathbb{R}} \|w\|^2 \quad \text{s.t.} \quad y_i (w \cdot x_i - b) \ge 1, \quad i = 1, \dots, m. \tag{9.9} \]
Moreover, $\Delta(w^*) = 1/\|\bar{w}\|$ is the margin.
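Proof. A minimizer $(\bar{w}, \bar{b})$ of the quadratic function $\|w\|^2$ subject to the linear constraints exists. Recall that
\[ \Delta(w) = \frac{1}{2} \Bigl\{ \min_{y_i = 1} w \cdot x_i - \max_{y_i = -1} w \cdot x_i \Bigr\}. \]
Then
\[ \Delta\Bigl( \frac{\bar{w}}{\|\bar{w}\|} \Bigr) = \frac{1}{2} \Bigl\{ \min_{y_i = 1} \frac{\bar{w}}{\|\bar{w}\|} \cdot x_i - \max_{y_j = -1} \frac{\bar{w}}{\|\bar{w}\|} \cdot x_j \Bigr\} \ge \frac{1}{\|\bar{w}\|}, \]
since $\bar{w} \cdot x_i - \bar{b} \ge 1$ when $y_i = 1$, and $\bar{w} \cdot x_j - \bar{b} \le -1$ when $y_j = -1$.
We claim that $\Delta(w_0) \le 1/\|\bar{w}\|$ for each unit vector $w_0$. If this is so, we can conclude from Theorem 9.12 that $\Delta(\bar{w}/\|\bar{w}\|) = 1/\|\bar{w}\| = \Delta(w^*)$ and $w^* = \bar{w}/\|\bar{w}\|$.
Suppose, to the contrary, that for some unit vector $w_0 \in \mathbb{R}^n$, $\Delta(w_0) > 1/\|\bar{w}\|$ holds. Consider the vector $w = w_0/\Delta(w_0)$ together with
\[ b = \frac{\frac{1}{2} \bigl( \min_{y_i = 1} w_0 \cdot x_i + \max_{y_j = -1} w_0 \cdot x_j \bigr)}{\Delta(w_0)}. \]
They satisfy
\[ w \cdot x_i - b = \frac{w_0 \cdot x_i - \frac{1}{2} \bigl( \min_{y_i = 1} w_0 \cdot x_i + \max_{y_j = -1} w_0 \cdot x_j \bigr)}{\Delta(w_0)} \ge 1 \quad \text{if } y_i = 1, \]
\[ w \cdot x_j - b = \frac{w_0 \cdot x_j - \frac{1}{2} \bigl( \min_{y_i = 1} w_0 \cdot x_i + \max_{y_j = -1} w_0 \cdot x_j \bigr)}{\Delta(w_0)} \le -1 \quad \text{if } y_j = -1. \]
But $\|w\|^2 = \|w_0\|^2/\Delta(w_0)^2 = 1/\Delta(w_0)^2 < \|\bar{w}\|^2$, which is in contradiction with $\bar{w}$ being a minimizer of (9.9).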
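The rescaling used in the proof of Theorem 9.13 can be traced numerically: starting from a unit direction $w_0$ with positive margin, the pair $(w, b) = (w_0/\Delta(w_0), c(w_0)/\Delta(w_0))$ is feasible for (9.9) and has norm $1/\Delta(w_0)$. The sketch below assumes a hypothetical planar sample and a hypothetical direction `w0`; it checks feasibility only and does not solve (9.9).

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Hypothetical separable sample and a unit direction separating it.
sample = [((2.0, 1.0), 1), ((3.0, 2.0), 1), ((0.0, 0.0), -1), ((-1.0, 1.0), -1)]
w0 = (1.0 / math.sqrt(2.0), 1.0 / math.sqrt(2.0))

lo = min(dot(w0, x) for x, y in sample if y == 1)
hi = max(dot(w0, x) for x, y in sample if y == -1)
c0, delta0 = (lo + hi) / 2, (lo - hi) / 2     # c(w0) and Delta(w0)

# Rescale as in the proof: w = w0 / Delta(w0), b = c(w0) / Delta(w0).
w = tuple(a / delta0 for a in w0)
b = c0 / delta0

# Values y_i (w . x_i - b); feasibility for (9.9) means each is at least 1.
constraint_values = [y * (dot(w, x) - b) for x, y in sample]
norm_w = math.sqrt(dot(w, w))                 # equals 1 / Delta(w0)
```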

Thus, in the separable case, we can proceed by solving either the optimization problem (9.8) or that given by (9.9). The resulting classifier is called the hard margin classifier, and its margin is given by $\Delta(w^*)$ with $w^*$ the solution of (9.8) or by $1/\|\bar{w}\|$ with $\bar{w}$ the solution of (9.9).
It follows from Theorem 9.12 that there are at least $n$ support vectors. In most applications of the support vector machine, the number of support vectors is much smaller than the sample size $m$. This makes the algorithm solving (9.9) run faster.
Support vector machines (SVMs) consist of a family of efficient classification algorithms: the SVM hard margin classifier (9.9), which works for separable data, the SVM soft margin classifier (9.10) for nonseparable data (see next section), and the general SVM algorithm (9.6) associated with the hinge loss $\phi_h$ and a general Mercer kernel $K$. The first two classifiers can be expressed in terms of the linear kernel $K(x, y) = x \cdot y + 1$, whereas the general SVM involves general Mercer kernels: the polynomial kernel $(x \cdot y + 1)^d$ with $d \in \mathbb{N}$ or Gaussians $\exp\{-\|x - y\|^2/\sigma^2\}$ with $\sigma > 0$. These SVM algorithms share a special feature caused by the hinge loss $\phi_h$: the solution $f_{z,\gamma}^{\phi_h} = \sum_{i=1}^m c_{z,i} K_{x_i}$ often has a sparse vector of coefficients $c_z = (c_{z,1}, \dots, c_{z,m})$, which makes the algorithm computing $c_z$ run faster.
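The three kernels named above are easy to write down explicitly. The sketch below is illustrative only: the function names, the default parameters `d=2` and `sigma=1.0`, and the sample points are assumptions made for this example.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def linear_kernel(x, y):
    """K(x, y) = x . y + 1."""
    return dot(x, y) + 1.0

def polynomial_kernel(x, y, d=2):
    """K(x, y) = (x . y + 1)^d with d a positive integer."""
    return (dot(x, y) + 1.0) ** d

def gaussian_kernel(x, y, sigma=1.0):
    """K(x, y) = exp(-||x - y||^2 / sigma^2) with sigma > 0."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / sigma ** 2)

def gram_matrix(kernel, points):
    """Gramian matrix K[x] = (K(x_i, x_j))_{i,j}."""
    return [[kernel(p, q) for q in points] for p in points]

points = [(1.0, 2.0), (3.0, 4.0), (0.0, -1.0)]   # hypothetical sample points
gram = gram_matrix(gaussian_kernel, points)
```

The Gramian matrix built this way is the matrix $K[x]$ that appears in the finite-dimensional form of (9.6).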

9.5 Optimal hyperplanes: the nonseparable case

In the nonseparable situation, there are no $w \in \mathbb{R}^n$ and $b \in \mathbb{R}$ such that the points in $z$ can be separated into two classes with $y_i = 1$ and $y_i = -1$ by the hyperplane $w \cdot x = b$. In this case, we look for the soft margin classifier. This is defined by introducing slack variables $\xi = (\xi_1, \dots, \xi_m)$ and considering the problem
\[ \min_{w \in \mathbb{R}^n,\, b \in \mathbb{R},\, \xi \in \mathbb{R}^m} \|w\|^2 + \frac{1}{\gamma m} \sum_{i=1}^m \xi_i \quad \text{s.t.} \quad y_i (w \cdot x_i - b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, m. \tag{9.10} \]
Here $\gamma > 0$ is a regularization parameter. If $(\bar{w}, \bar{b}, \bar{\xi})$ is a solution of (9.10), then its associated soft margin classifier is defined by $x \mapsto \mathrm{sgn}(\bar{w} \cdot x - \bar{b})$.
The hard margin problem (9.9) in the separable case can be seen as a special case of the soft margin one (9.10) corresponding to $1/\gamma = \infty$, in which case all solutions have $\bar{\xi} = 0$.
We claimed at the end of Section 9.2 that the regularized classifier associated with the hinge loss was related to our previous discussion of margins and separating hyperplanes. To see why this is so we next show that the soft margin classifier is a special example of (9.6).
Recall that the hinge loss $\phi_h$ is defined by
\[ \phi_h(t) = (1 - t)_+ = \max\{1 - t, 0\}. \]
If $(\bar{w}, \bar{b}, \bar{\xi})$ is a solution of (9.10), then we must have $\bar{\xi}_i = (1 - y_i(\bar{w} \cdot x_i - \bar{b}))_+$; that is, $\bar{\xi}_i = \phi_h(y_i(\bar{w} \cdot x_i - \bar{b}))$. Hence, (9.10) can be expressed by means of the loss $\phi_h$ as
\[ \min_{w \in \mathbb{R}^n,\, b \in \mathbb{R}} \frac{1}{m} \sum_{i=1}^m \phi_h\bigl( y_i (w \cdot x_i - b) \bigr) + \gamma \|w\|^2. \]
If we consider the linear Mercer kernel $K$ on $\mathbb{R}^n \times \mathbb{R}^n$ given by $K(x, y) = x \cdot y$, then $\mathcal{H}_K = \{K_w = w \cdot x \mid w \in \mathbb{R}^n\}$, $\|K_w\|_K^2 = \|w\|^2$, and (9.10) can be written as
\[ \min_{f \in \mathcal{H}_K,\, b \in \mathbb{R}} \frac{1}{m} \sum_{i=1}^m \phi_h\bigl( y_i (f(x_i) - b) \bigr) + \gamma \|f\|_K^2. \tag{9.11} \]
The scheme (9.11) is the same as (9.6) with the linear kernel except for the constant term $b$, called offset.^5
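The passage from the slack-variable problem (9.10) to its hinge-loss form can be checked for any fixed pair $(w, b)$: the smallest feasible slacks are the hinge losses, and the two objectives differ only by the factor $\gamma$. The sketch below assumes a hypothetical planar sample and hypothetical values of `w`, `b`, and `gamma`.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def hinge(t):
    """Hinge loss (1 - t)_+."""
    return max(1.0 - t, 0.0)

# Hypothetical nonseparable-style sample and a fixed candidate (w, b).
sample = [((2.0, 1.0), 1), ((0.5, 0.0), 1), ((0.0, 0.0), -1), ((1.5, 1.0), -1)]
w, b, gamma = (1.0, 0.0), 1.0, 0.25
m = len(sample)

# Smallest slacks satisfying y_i (w . x_i - b) >= 1 - xi_i and xi_i >= 0.
xi = [hinge(y * (dot(w, x) - b)) for x, y in sample]

# Objective of (9.10) at (w, b, xi).
soft_margin_value = dot(w, w) + sum(xi) / (gamma * m)

# Hinge-loss form: (1/m) sum_i phi_h(y_i (w . x_i - b)) + gamma ||w||^2.
hinge_value = sum(xi) / m + gamma * dot(w, w)
```

Multiplying the objective of (9.10) by $\gamma$ gives the hinge-loss objective, so the two problems have the same minimizers in $(w, b)$.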
One motivation to consider scheme (9.6) with an arbitrary Mercer kernel is the expectation of separating data by surfaces instead of hyperplanes only. Let $f$ be a function on $\mathbb{R}^n$, and $f(x) = 0$ the corresponding surface. The two classes $I$ and $II$ are separable by this surface if, for $i = 1, \dots, m$,
\[ f(x_i) > 0 \quad \text{if } i \in I, \qquad f(x_i) < 0 \quad \text{if } i \in II; \]
that is, if $y_i f(x_i) > 0$ for $i = 1, \dots, m$. This set of inequalities is an empirical version of the separation condition $y f(x) > 0$ almost surely for the probability distribution $\rho$ on $Z$. Such a separation condition is more general than the separation by hyperplanes. In order to find such a separating surface using efficient algorithms (convex optimization), we require that the function $f$ lies in an RKHS $\mathcal{H}_K$. Under such a separation condition, one may take $\gamma = 0$ and algorithm (9.6) corresponds to a hard margin classifier. This is the context of the next two sections, on error analysis.


9.6 Error analysis for separable measures

In this section we present an error analysis for scheme (9.6) with the hinge loss $\phi_h(t) = (1 - t)_+$ for separable distributions.
Definition 9.14 Let $\mathcal{H}_K$ be an RKHS of functions on $X$, and $\rho$ a probability measure on $Z = X \times Y$. We say that $\rho$ is strictly separable by $\mathcal{H}_K$ with margin $\Delta > 0$ if there is some $f_{sp} \in \mathcal{H}_K$ such that $\|f_{sp}\|_K = 1$ and $y f_{sp}(x) \ge \Delta$ almost surely.
Remark 9.15
(i) Even under the weaker condition that $y f_{sp}(x) > 0$ almost surely (which we consider in the next section), we have $y = \mathrm{sgn}(f_{sp}(x))$ almost surely. Hence, the variance $\sigma_\rho^2$ vanishes (i.e., $\rho$ is noise free) and so does $\kappa_\rho$.
(ii) As a consequence of (i), $f_c = \mathrm{sgn}(f_{sp})$.
(iii) Since $f_{sp}$ is continuous and $|f_{sp}(x)| \ge \Delta$ almost surely, it follows that if $\rho$ is strictly separable, then
\[ \rho_X\bigl( \overline{T} \cap \overline{X \setminus T} \bigr) = 0, \]
where $T = \{x \in X \mid f_c(x) = 1\}$. This implies that if $X$ is connected, $\rho_X(T) > 0$, and $\rho_X(X \setminus T) > 0$, then $\rho$ is degenerate. The situation would be as in Figure 9.5, where the two dashed regions represent the support of the set $T$, those with dots represent the support of $X \setminus T$, and the remainder of the rectangle has measure zero (for the measure $\rho_X$).

Figure 9.5

5 We could have considered the scheme (9.11) with offset. We did not do so for simplicity of exposition. References to work on the general case can be found in Section 9.8.
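Theorem 9.16 If $\rho$ is strictly separable by $\mathcal{H}_K$ with margin $\Delta$, then, for almost every $z \in Z^m$,
\[ \mathcal{E}_z^{\phi_h}(f_{z,\gamma}^{\phi_h}) \le \frac{\gamma}{\Delta^2} \]
and $\|f_{z,\gamma}^{\phi_h}\|_K \le \frac{1}{\Delta}$.
Proof. Since $f_{sp}/\Delta \in \mathcal{H}_K$, we see from the definition of $f_{z,\gamma}^{\phi_h}$ that
\[ \mathcal{E}_z^{\phi_h}(f_{z,\gamma}^{\phi_h}) + \gamma \|f_{z,\gamma}^{\phi_h}\|_K^2 \le \mathcal{E}_z^{\phi_h}\Bigl( \frac{f_{sp}}{\Delta} \Bigr) + \gamma \Bigl\| \frac{f_{sp}}{\Delta} \Bigr\|_K^2. \]
But $y (f_{sp}(x)/\Delta) \ge 1$ almost surely, that is, $1 - y (f_{sp}(x)/\Delta) \le 0$, so we have $\phi_h(y (f_{sp}(x)/\Delta)) = 0$ almost surely. It follows that $\mathcal{E}_z^{\phi_h}(f_{sp}/\Delta) = 0$. Since $\|f_{sp}/\Delta\|_K^2 = 1/\Delta^2$,
\[ \mathcal{E}_z^{\phi_h}(f_{z,\gamma}^{\phi_h}) + \gamma \|f_{z,\gamma}^{\phi_h}\|_K^2 \le \frac{\gamma}{\Delta^2} \]
holds and the statement follows.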

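The results in Chapter 8 lead us to expect the solution $f_{z,\gamma}^{\phi_h}$ of (9.6) to satisfy $\mathcal{E}^{\phi_h}(f_{z,\gamma}^{\phi_h}) \to \mathcal{E}^{\phi_h}(f_\rho^{\phi_h})$, where $f_\rho^{\phi_h}$ is a minimizer of $\mathcal{E}^{\phi_h}$. We next show that this is indeed the case. To this end, we first characterize $f_\rho^{\phi_h}$. For $x \in X$, let $\eta_x := \mathrm{Prob}_Y(y = 1 \mid x)$.
Theorem 9.17 For any measurable function $f : X \to \mathbb{R}$,
\[ \mathcal{E}^{\phi_h}(f) \ge \mathcal{E}^{\phi_h}(f_c). \]
That is, the Bayes rule $f_c$ is a minimizer $f_\rho^{\phi_h}$ of $\mathcal{E}^{\phi_h}$.
Proof. Write $\mathcal{E}^{\phi_h}(f) = \int_X \Phi_{h,x}(f(x))\, d\rho_X$, where
\[ \Phi_{h,x}(t) = \int_Y \phi_h(yt)\, d\rho(y \mid x) = \phi_h(t)\eta_x + \phi_h(-t)(1 - \eta_x). \]
When $t = f_c(x) \in \{1, -1\}$, for $y = f_c(x)$ one finds that $yt = 1$ and $\phi_h(yt) = 0$, whereas for $y = -f_c(x) \neq f_c(x)$, $yt = -1$ and $\phi_h(yt) = 2$. So $\int_Y \phi_h(yt)\, d\rho(y \mid x) = 2\, \mathrm{Prob}(y \neq f_c(x) \mid x)$ and $\Phi_{h,x}(f_c(x)) = 2\, \mathrm{Prob}(y \neq f_c(x) \mid x)$.
According to (9.2), $\mathrm{Prob}(y \neq f_c(x) \mid x) \le \mathrm{Prob}(y \neq s \mid x)$ for $s = \pm 1$. Hence, $\Phi_{h,x}(f_c(x)) \le 2\, \mathrm{Prob}(y \neq s \mid x)$ for any $s \in \{1, -1\}$.
If $t > 1$, then $\phi_h(t) = 0$ and $\Phi_{h,x}(t) = (1 + t)(1 - \eta_x) \ge 2(1 - \eta_x) \ge \Phi_{h,x}(f_c(x))$.
If $t < -1$, then $\phi_h(-t) = 0$ and $\Phi_{h,x}(t) = (1 - t)\eta_x \ge 2\eta_x \ge \Phi_{h,x}(f_c(x))$.
If $-1 \le t \le 1$, then
\[ \Phi_{h,x}(t) = (1 - t)\eta_x + (1 + t)(1 - \eta_x) \ge (1 - t)\, \tfrac{1}{2} \Phi_{h,x}(f_c(x)) + (1 + t)\, \tfrac{1}{2} \Phi_{h,x}(f_c(x)) = \Phi_{h,x}(f_c(x)). \]
Thus, we have $\Phi_{h,x}(t) \ge \Phi_{h,x}(f_c(x))$ for all $t \in \mathbb{R}$. In particular,
\[ \mathcal{E}^{\phi_h}(f) = \int_X \Phi_{h,x}(f(x))\, d\rho_X \ge \int_X \Phi_{h,x}(f_c(x))\, d\rho_X = \mathcal{E}^{\phi_h}(f_c). \]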
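The pointwise inequality at the heart of Theorem 9.17 can be explored numerically: for a fixed conditional probability $\eta_x$, the function $t \mapsto \phi_h(t)\eta_x + \phi_h(-t)(1 - \eta_x)$ should not fall below its value at the Bayes rule. The sketch below checks this on a grid. The grid, the list of probabilities, and the tie-breaking convention $f_c(x) = 1$ when $\eta_x = 1/2$ are assumptions made for this illustration.

```python
def phi_h(t):
    """Hinge loss (1 - t)_+."""
    return max(1.0 - t, 0.0)

def local_risk(t, eta):
    """phi_h(t) * eta + phi_h(-t) * (1 - eta), with eta = Prob(y = 1 | x)."""
    return phi_h(t) * eta + phi_h(-t) * (1.0 - eta)

def bayes_rule(eta):
    """Bayes rule at a point; the tie eta = 1/2 is broken towards 1 (an assumption)."""
    return 1 if eta >= 0.5 else -1

grid = [k / 100 for k in range(-300, 301)]   # t in [-3, 3]
gaps = []
for eta in [0.0, 0.1, 0.3, 0.5, 0.8, 1.0]:
    at_bayes = local_risk(bayes_rule(eta), eta)
    # Difference between the best grid value and the value at the Bayes rule.
    gaps.append(min(local_risk(t, eta) for t in grid) - at_bayes)
```

Each entry of `gaps` is zero up to rounding, in agreement with the statement that the Bayes rule minimizes the local risk.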
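When $\rho$ is strictly separable by $\mathcal{H}_K$, we see that $\mathrm{sgn}(y f_{sp}(x)) = 1$ and hence $y = \mathrm{sgn}(f_{sp}(x))$ almost surely. This means $f_c(x) = \mathrm{sgn}(f_{sp}(x))$ and $y = f_c(x)$ almost surely. In this case, we have $\mathcal{E}^{\phi_h}(f_c) = \int_Z (1 - y f_c(x))_+ = 0$. Therefore, we expect $\mathcal{E}^{\phi_h}(f_{z,\gamma}^{\phi_h}) \to 0$. To get error bounds showing that this is the case we write
\[ \mathcal{E}^{\phi_h}(f_{z,\gamma}^{\phi_h}) = \mathcal{E}^{\phi_h}(f_{z,\gamma}^{\phi_h}) - \mathcal{E}_z^{\phi_h}(f_{z,\gamma}^{\phi_h}) + \mathcal{E}_z^{\phi_h}(f_{z,\gamma}^{\phi_h}) \le \mathcal{E}^{\phi_h}(f_{z,\gamma}^{\phi_h}) - \mathcal{E}_z^{\phi_h}(f_{z,\gamma}^{\phi_h}) + \frac{\gamma}{\Delta^2}. \tag{9.12} \]
Here we have used the first inequality in Theorem 9.16. The second inequality of that theorem tells us that $f_{z,\gamma}^{\phi_h}$ lies in the set $\{f \in \mathcal{H}_K \mid \|f\|_K \le 1/\Delta\}$. So it is sufficient to estimate $\mathcal{E}^{\phi_h}(f) - \mathcal{E}_z^{\phi_h}(f)$ for functions $f$ in this set in some uniform way. We can use the same idea we used in Lemmas 3.18 and 3.19.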
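Lemma 9.18 Suppose a random variable $\xi$ satisfies $0 \le \xi \le M$. Denote $\mu = E(\xi)$. For every $\varepsilon > 0$ and $0 < \alpha \le 1$,
\[ \mathop{\mathrm{Prob}}_{z \in Z^m} \left\{ \frac{\mu - \frac{1}{m} \sum_{i=1}^m \xi(z_i)}{\sqrt{\mu + \varepsilon}} \ge \alpha \sqrt{\varepsilon} \right\} \le \exp\left\{ -\frac{3 \alpha^2 m \varepsilon}{8M} \right\}. \]
Proof. The proof follows from Lemma 3.18, since the assumption $0 \le \xi \le M$ implies $|\xi - \mu| \le M$ and $E(\xi^2) \le M E(\xi)$.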

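Lemma 9.19 Let $F$ be a subset of $C(X)$ such that $\|f\|_{C(X)} \le B$ for all $f \in F$. Then, for every $\varepsilon > 0$ and $0 < \alpha \le 1$, we have
\[ \mathop{\mathrm{Prob}}_{z \in Z^m} \left\{ \sup_{f \in F} \frac{\mathcal{E}^{\phi_h}(f) - \mathcal{E}_z^{\phi_h}(f)}{\sqrt{\mathcal{E}^{\phi_h}(f) + \varepsilon}} \ge 4\alpha\sqrt{\varepsilon} \right\} \le \mathcal{N}(F, \alpha\varepsilon) \exp\left\{ -\frac{3 \alpha^2 m \varepsilon}{8(1 + B)} \right\}. \]
Proof. Let $\{f_1, \dots, f_N\}$ be an $\alpha\varepsilon$-net for $F$ with $N = \mathcal{N}(F, \alpha\varepsilon)$. Then, for each $f \in F$, there is some $j \le N$ such that $\|f - f_j\|_{C(X)} \le \alpha\varepsilon$. Since $\phi_h$ is Lipschitz, $|\phi_h(t) - \phi_h(t')| \le |t - t'|$ for all $t, t' \in \mathbb{R}$. Therefore, $|\phi_h(y f(x)) - \phi_h(y f_j(x))| \le |y| \, \|f - f_j\|_{C(X)} \le \alpha\varepsilon$. It follows that $|\mathcal{E}^{\phi_h}(f) - \mathcal{E}^{\phi_h}(f_j)| \le \alpha\varepsilon$ and $|\mathcal{E}_z^{\phi_h}(f) - \mathcal{E}_z^{\phi_h}(f_j)| \le \alpha\varepsilon$. Hence,
\[ \frac{|\mathcal{E}^{\phi_h}(f) - \mathcal{E}^{\phi_h}(f_j)|}{\sqrt{\mathcal{E}^{\phi_h}(f) + \varepsilon}} \le \alpha\sqrt{\varepsilon} \quad \text{and} \quad \frac{|\mathcal{E}_z^{\phi_h}(f) - \mathcal{E}_z^{\phi_h}(f_j)|}{\sqrt{\mathcal{E}^{\phi_h}(f) + \varepsilon}} \le \alpha\sqrt{\varepsilon}. \]
Also,
\[ \sqrt{\mathcal{E}^{\phi_h}(f_j) + \varepsilon} \le \sqrt{\mathcal{E}^{\phi_h}(f) + \varepsilon + \mathcal{E}^{\phi_h}(f_j) - \mathcal{E}^{\phi_h}(f)} \le \sqrt{\mathcal{E}^{\phi_h}(f) + \varepsilon} + \sqrt{|\mathcal{E}^{\phi_h}(f_j) - \mathcal{E}^{\phi_h}(f)|}. \]
Since $\alpha \le 1$, we have $|\mathcal{E}^{\phi_h}(f) - \mathcal{E}^{\phi_h}(f_j)| \le \varepsilon \le \varepsilon + \mathcal{E}^{\phi_h}(f)$ and then
\[ \sqrt{\mathcal{E}^{\phi_h}(f_j) + \varepsilon} \le 2\sqrt{\mathcal{E}^{\phi_h}(f) + \varepsilon}. \tag{9.13} \]
Therefore,
\[ \frac{\mathcal{E}^{\phi_h}(f) - \mathcal{E}_z^{\phi_h}(f)}{\sqrt{\mathcal{E}^{\phi_h}(f) + \varepsilon}} \le 2\alpha\sqrt{\varepsilon} + \frac{\mathcal{E}^{\phi_h}(f_j) - \mathcal{E}_z^{\phi_h}(f_j)}{\sqrt{\mathcal{E}^{\phi_h}(f) + \varepsilon}}. \]
It follows that if $(\mathcal{E}^{\phi_h}(f) - \mathcal{E}_z^{\phi_h}(f))/\sqrt{\mathcal{E}^{\phi_h}(f) + \varepsilon} \ge 4\alpha\sqrt{\varepsilon}$ for some $f \in F$, then
\[ \frac{\mathcal{E}^{\phi_h}(f_j) - \mathcal{E}_z^{\phi_h}(f_j)}{\sqrt{\mathcal{E}^{\phi_h}(f) + \varepsilon}} \ge 2\alpha\sqrt{\varepsilon}. \]
This, together with (9.13), tells us that
\[ \frac{\mathcal{E}^{\phi_h}(f_j) - \mathcal{E}_z^{\phi_h}(f_j)}{\sqrt{\mathcal{E}^{\phi_h}(f_j) + \varepsilon}} \ge \alpha\sqrt{\varepsilon}. \]
Thus,
\[ \mathop{\mathrm{Prob}}_{z \in Z^m} \left\{ \sup_{f \in F} \frac{\mathcal{E}^{\phi_h}(f) - \mathcal{E}_z^{\phi_h}(f)}{\sqrt{\mathcal{E}^{\phi_h}(f) + \varepsilon}} \ge 4\alpha\sqrt{\varepsilon} \right\} \le \sum_{j=1}^N \mathop{\mathrm{Prob}}_{z \in Z^m} \left\{ \frac{\mathcal{E}^{\phi_h}(f_j) - \mathcal{E}_z^{\phi_h}(f_j)}{\sqrt{\mathcal{E}^{\phi_h}(f_j) + \varepsilon}} \ge \alpha\sqrt{\varepsilon} \right\}. \]
The statement now follows from Lemma 9.18 applied to the random variable $\xi = \phi_h(y f_j(x))$, for $j = 1, \dots, N$, which satisfies $0 \le \xi \le 1 + \|f_j\|_{C(X)} \le 1 + B$.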

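We can now derive error bounds for strictly separable measures. Recall that $B_1$ denotes the unit ball of $\mathcal{H}_K$ as a subset of $C(X)$.
Theorem 9.20 If $\rho$ is strictly separable by $\mathcal{H}_K$ with margin $\Delta$, then, for any $0 < \delta < 1$,
\[ \mathcal{E}^{\phi_h}(f_{z,\gamma}^{\phi_h}) \le 2\varepsilon^*(m, \delta) + \frac{2\gamma}{\Delta^2} \]
with confidence at least $1 - \delta$, where $\varepsilon^*(m, \delta)$ is the smallest positive solution of the inequality in $\varepsilon$
\[ \log \mathcal{N}\Bigl( B_1, \frac{\varepsilon\Delta}{4} \Bigr) - \frac{3 m \varepsilon}{128(1 + C_K/\Delta)} \le \log \delta. \]
In addition,
(i) If $\log \mathcal{N}(B_1, \eta) \le C_0 (1/\eta)^p$ for some $p, C_0 > 0$, and all $\eta > 0$, then
\[ \varepsilon^*(m, \delta) \le 86(1 + C_K/\Delta) \max\left\{ \frac{\log(1/\delta)}{m},\ \Bigl( \frac{C_0}{m} \Bigr)^{1/(1+p)} \Bigl( \frac{4}{\Delta} \Bigr)^{p/(1+p)} \right\}. \]
(ii) If $\log \mathcal{N}(B_1, \eta) \le C_0 (\log(1/\eta))^p$ for some $p, C_0 > 0$ and all $0 < \eta < 1$, then, for $m \ge \max\{4/\Delta, 3\}$,
\[ \varepsilon^*(m, \delta) \le \frac{(\log m)^p}{m} \bigl\{ 1 + 43(1 + C_K/\Delta)(2^p C_0 + \log(1/\delta)) \bigr\}. \]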
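Proof. We apply Lemma 9.19 to the set $F = \{f \in \mathcal{H}_K \mid \|f\|_K \le 1/\Delta\}$. Each function $f \in F$ satisfies $\|f\|_{C(X)} \le C_K \|f\|_K \le C_K/\Delta$. By Lemma 9.19, for any $0 < \alpha \le 1$, with confidence at least $1 - \mathcal{N}(F, \alpha\varepsilon) \exp\{-3\alpha^2 m \varepsilon/(8(1 + C_K/\Delta))\}$, we have
\[ \sup_{f \in F} \frac{\mathcal{E}^{\phi_h}(f) - \mathcal{E}_z^{\phi_h}(f)}{\sqrt{\mathcal{E}^{\phi_h}(f) + \varepsilon}} \le 4\alpha\sqrt{\varepsilon}. \]
In particular, the function $f_{z,\gamma}^{\phi_h}$, which belongs to $F$ by Theorem 9.16, satisfies
\[ \frac{\mathcal{E}^{\phi_h}(f_{z,\gamma}^{\phi_h}) - \mathcal{E}_z^{\phi_h}(f_{z,\gamma}^{\phi_h})}{\sqrt{\mathcal{E}^{\phi_h}(f_{z,\gamma}^{\phi_h}) + \varepsilon}} \le 4\alpha\sqrt{\varepsilon}. \]
This, together with (9.12), yields
\[ \mathcal{E}^{\phi_h}(f_{z,\gamma}^{\phi_h}) \le 4\alpha\sqrt{\varepsilon} \sqrt{\mathcal{E}^{\phi_h}(f_{z,\gamma}^{\phi_h}) + \varepsilon} + \frac{\gamma}{\Delta^2}. \]
If we denote $t = \sqrt{\mathcal{E}^{\phi_h}(f_{z,\gamma}^{\phi_h}) + \varepsilon}$, this inequality becomes $t^2 - 4\alpha\sqrt{\varepsilon}\, t - (\varepsilon + \gamma/\Delta^2) \le 0$. Solving the associated quadratic equation and taking into account that $t = \sqrt{\mathcal{E}^{\phi_h}(f_{z,\gamma}^{\phi_h}) + \varepsilon} \ge 0$, we deduce that
\[ 0 \le t \le 2\alpha\sqrt{\varepsilon} + \sqrt{(2\alpha\sqrt{\varepsilon})^2 + \varepsilon + \frac{\gamma}{\Delta^2}}. \]
Hence, using the elementary inequality $(a + b)^2 \le 2a^2 + 2b^2$, we obtain
\[ \mathcal{E}^{\phi_h}(f_{z,\gamma}^{\phi_h}) = t^2 - \varepsilon \le 2(4\alpha^2\varepsilon) + 2\Bigl( (2\alpha\sqrt{\varepsilon})^2 + \varepsilon + \frac{\gamma}{\Delta^2} \Bigr) - \varepsilon = 16\alpha^2\varepsilon + \varepsilon + \frac{2\gamma}{\Delta^2}. \]
Set $\alpha = \frac{1}{4}$ and $\varepsilon = \varepsilon^*(m, \delta)$ as in the statement. Then the confidence $1 - \mathcal{N}(F, \alpha\varepsilon) \exp\{-3\alpha^2 m \varepsilon/(8(1 + C_K/\Delta))\}$ is at least $1 - \delta$, and with this confidence we have
\[ \mathcal{E}^{\phi_h}(f_{z,\gamma}^{\phi_h}) \le 2\varepsilon^*(m, \delta) + \frac{2\gamma}{\Delta^2}. \]
(i) If $\log \mathcal{N}(B_1, \eta) \le C_0 (1/\eta)^p$, then $\varepsilon^*(m, \delta) \le \widehat{\varepsilon}^*$, where $\widehat{\varepsilon}^*$ satisfies
\[ C_0 \Bigl( \frac{4}{\Delta\varepsilon} \Bigr)^p - \frac{3 m \varepsilon}{128(1 + C_K/\Delta)} = \log \delta. \]
This equation can be written as
\[ \varepsilon^{1+p} - \frac{128(1 + C_K/\Delta)}{3m} \log\frac{1}{\delta}\, \varepsilon^p - \frac{128(1 + C_K/\Delta) C_0}{3m} \Bigl( \frac{4}{\Delta} \Bigr)^p = 0. \]
Then Lemma 7.2 with $d = 2$ yields
\[ \varepsilon^*(m, \delta) \le \widehat{\varepsilon}^* \le \max\left\{ \frac{256(1 + C_K/\Delta)}{3m} \log\frac{1}{\delta},\ \Bigl( \frac{256(1 + C_K/\Delta) C_0}{3m} \Bigr)^{1/(1+p)} \Bigl( \frac{4}{\Delta} \Bigr)^{p/(1+p)} \right\}. \]
(ii) If $\log \mathcal{N}(B_1, \eta) \le C_0 (\log(1/\eta))^p$, then $\varepsilon^*(m, \delta) \le \widehat{\varepsilon}^*$, where $\widehat{\varepsilon}^*$ satisfies
\[ C_0 \Bigl( \log\frac{4}{\Delta\varepsilon} \Bigr)^p - \frac{3 m \varepsilon}{128(1 + C_K/\Delta)} = \log \delta. \]
The function $h : \mathbb{R}_+ \to \mathbb{R}$ defined by $h(\varepsilon) = C_0 \bigl( \log\frac{4}{\Delta\varepsilon} \bigr)^p - 3m\varepsilon/(128(1 + C_K/\Delta))$ is decreasing. Take
\[ A = \max\left\{ (2^p C_0 + \log(1/\delta)) \frac{128(1 + C_K/\Delta)}{3},\ 1 \right\} \]
and $\varepsilon = A (\log m)^p/m$. Then, for $m \ge \max\{4/\Delta, 3\}$,
\[ A (\log m)^p \ge 1 \quad \text{and} \quad \log\frac{4}{\Delta} + \log m \le 2 \log m. \]
It follows that
\[ h\Bigl( \frac{A (\log m)^p}{m} \Bigr) \le C_0 \Bigl( \log\frac{4}{\Delta} + \log m \Bigr)^p - (\log m)^p \Bigl( 2^p C_0 + \log\frac{1}{\delta} \Bigr) \le -(\log m)^p \log\frac{1}{\delta} \le \log \delta. \]
Since $h$ is decreasing, we have
\[ \varepsilon^*(m, \delta) \le \widehat{\varepsilon}^* \le \frac{A (\log m)^p}{m}. \]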
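To get a feeling for the size of the quantities in Theorem 9.20(i), one can solve the defining equation numerically and compare the root with the closed-form bound. The sketch below does this by bisection. All numerical values (`m`, `delta_conf`, `margin`, `c_k`, `c_0`, `p`) are hypothetical and chosen only for illustration, and the equation solved is the one obtained by replacing $\log \mathcal{N}(B_1, \eta)$ with its assumed upper bound $C_0 (1/\eta)^p$.

```python
import math

def h(eps, m, delta_conf, margin, c_k, c_0, p):
    """C0 * (4 / (margin * eps))^p - 3 m eps / (128 (1 + C_K / margin)) - log(delta)."""
    return (c_0 * (4.0 / (margin * eps)) ** p
            - 3.0 * m * eps / (128.0 * (1.0 + c_k / margin))
            - math.log(delta_conf))

def closed_form_bound(m, delta_conf, margin, c_k, c_0, p):
    """Right-hand side of the bound in Theorem 9.20(i)."""
    a = 1.0 + c_k / margin
    return 86.0 * a * max(math.log(1.0 / delta_conf) / m,
                          (c_0 / m) ** (1.0 / (1.0 + p)) * (4.0 / margin) ** (p / (1.0 + p)))

def solve_by_bisection(params, iterations=200):
    """Find the positive root of h, which is decreasing in eps."""
    lo, hi = 1e-12, closed_form_bound(*params)
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if h(mid, *params) > 0:
            lo = mid
        else:
            hi = mid
    return hi

# Hypothetical constants: m, delta, margin, C_K, C_0, p.
params = (1000, 0.05, 0.5, 1.0, 1.0, 1.0)
root = solve_by_bisection(params)
bound = closed_form_bound(*params)
```

For these values the root is well below the closed-form bound, which reflects the slack introduced by the constants 86 and Lemma 7.2.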

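In order to obtain estimates for the misclassification error from the generalization error, we need to compare $\mathcal{R}$ with $\mathcal{E}^{\phi_h}$. This is simple.
Theorem 9.21 For any measure $\rho$ and any measurable function $f : X \to \mathbb{R}$,
\[ \mathcal{R}(\mathrm{sgn}(f)) - \mathcal{R}(f_c) \le \mathcal{E}^{\phi_h}(f) - \mathcal{E}^{\phi_h}(f_c). \]
Proof. Denote $X_c = \{x \in X \mid \mathrm{sgn}(f)(x) \neq f_c(x)\}$. By the definition of the misclassification error,
\[ \mathcal{R}(\mathrm{sgn}(f)) - \mathcal{R}(f_c) = \int_{X_c} \bigl\{ \mathrm{Prob}_Y(y \neq \mathrm{sgn}(f)(x) \mid x) - \mathrm{Prob}_Y(y \neq f_c(x) \mid x) \bigr\}\, d\rho_X. \]
For a point $x \in X_c$, we know that $\mathrm{Prob}_Y(y \neq \mathrm{sgn}(f)(x) \mid x) = \mathrm{Prob}_Y(y = f_c(x) \mid x)$. Hence, $\mathrm{Prob}_Y(y \neq \mathrm{sgn}(f)(x) \mid x) - \mathrm{Prob}_Y(y \neq f_c(x) \mid x) = f_\rho(x)$ or $-f_\rho(x)$ according to whether $f_\rho(x) \ge 0$. It follows that $|f_\rho(x)| = \mathrm{Prob}_Y(y = f_c(x) \mid x) - \mathrm{Prob}_Y(y \neq f_c(x) \mid x)$ and, therefore,
\[ \mathcal{R}(\mathrm{sgn}(f)) - \mathcal{R}(f_c) = \int_{X_c} |f_\rho(x)|\, d\rho_X. \]
By the definition of $\phi_h$,
\[ \phi_h(y f_c(x)) = (1 - y f_c(x))_+ = \begin{cases} 0 & \text{if } y = f_c(x) \\ 2 & \text{if } y \neq f_c(x). \end{cases} \]
Hence, $\mathcal{E}^{\phi_h}(f_c) = \int_X \int_Y \phi_h(y f_c(x))\, d\rho(y \mid x)\, d\rho_X = \int_X 2\, \mathrm{Prob}_Y(y \neq f_c(x) \mid x)\, d\rho_X$. Furthermore, $\mathcal{E}^{\phi_h}(f) = \int_X \int_Y \phi_h(y f(x))\, d\rho(y \mid x)\, d\rho_X$. Thus, it is sufficient for us to prove that
\[ \int_Y \phi_h(y f(x))\, d\rho(y \mid x) - 2\, \mathrm{Prob}(y \neq f_c(x) \mid x) \ge |f_\rho(x)| \quad \forall x \in X_c. \tag{9.15} \]
We prove (9.15) in two cases.
If $|f(x)| > 1$, then $x \in X_c$ implies that $\mathrm{sgn}(f(x)) \neq f_c(x)$ and $\phi_h(-f_c(x) f(x)) = 0$. Hence,
\[ \int_Y \phi_h(y f(x))\, d\rho(y \mid x) = \phi_h(f_c(x) f(x))\, \mathrm{Prob}(y = f_c(x) \mid x) = (1 - f_c(x) f(x))\, \mathrm{Prob}(y = f_c(x) \mid x). \]
Since $-f_c(x) f(x) = |f(x)| > 1$, it follows that
\[ \int_Y \phi_h(y f(x))\, d\rho(y \mid x) - 2\, \mathrm{Prob}(y \neq f_c(x) \mid x) \ge (1 + |f(x)|) \bigl\{ \mathrm{Prob}(y = f_c(x) \mid x) - \mathrm{Prob}(y \neq f_c(x) \mid x) \bigr\} = (1 + |f(x)|)\, |f_\rho(x)| \ge |f_\rho(x)|. \]
If $|f(x)| \le 1$, then
\[ \int_Y \phi_h(y f(x))\, d\rho(y \mid x) - 2\, \mathrm{Prob}(y \neq f_c(x) \mid x) = (1 - f_c(x) f(x))\, \mathrm{Prob}(y = f_c(x) \mid x) + (1 + f_c(x) f(x))\, \mathrm{Prob}(y \neq f_c(x) \mid x) - 2\, \mathrm{Prob}(y \neq f_c(x) \mid x) \]
\[ = \mathrm{Prob}(y = f_c(x) \mid x) - \mathrm{Prob}(y \neq f_c(x) \mid x) + f_c(x) f(x) \bigl( \mathrm{Prob}(y \neq f_c(x) \mid x) - \mathrm{Prob}(y = f_c(x) \mid x) \bigr) = (1 - f_c(x) f(x))\, |f_\rho(x)|. \]
But $x \in X_c$ implies that $f_c(x) f(x) \le 0$ and $1 - f_c(x) f(x) = 1 + |f(x)|$. So in this case we have
\[ \int_Y \phi_h(y f(x))\, d\rho(y \mid x) - 2\, \mathrm{Prob}(y \neq f_c(x) \mid x) = (1 + |f(x)|)\, |f_\rho(x)| \ge |f_\rho(x)|. \]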
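Inequality (9.15) is a statement about two numbers at each point $x$: the conditional probability $\eta_x = \mathrm{Prob}(y = 1 \mid x)$ and the value $t = f(x)$. The sketch below checks it on a grid of such pairs, restricted to points where $\mathrm{sgn}(t)$ disagrees with the Bayes rule. The conventions $\mathrm{sgn}(0) = 1$ and $f_c(x) = 1$ when $\eta_x = 1/2$, as well as the grid itself, are assumptions made for this illustration.

```python
def phi_h(t):
    """Hinge loss (1 - t)_+."""
    return max(1.0 - t, 0.0)

def sgn(t):
    # Convention assumed here: sgn(0) = 1.
    return 1 if t >= 0 else -1

def check_pointwise(eta, t, tol=1e-12):
    """Return True if (9.15) holds for eta = Prob(y = 1 | x) and f(x) = t."""
    f_c = 1 if eta >= 0.5 else -1                 # Bayes rule, tie broken towards 1
    f_rho = 2.0 * eta - 1.0                       # regression function for labels in {1, -1}
    prob_wrong = 1.0 - eta if f_c == 1 else eta   # Prob(y != f_c(x) | x)
    integral = phi_h(t) * eta + phi_h(-t) * (1.0 - eta)
    return integral - 2.0 * prob_wrong >= abs(f_rho) - tol

violations = 0
for k in range(0, 11):
    eta = k / 10
    f_c = 1 if eta >= 0.5 else -1
    for j in range(-300, 301):
        t = j / 100
        # Only points of X_c, where sgn(f(x)) differs from the Bayes rule, matter.
        if sgn(t) != f_c and not check_pointwise(eta, t):
            violations += 1
```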
Combining Theorems 9.20 and 9.21, we can derive bounds for the misclassification error for the support vector machine soft margin classifier for strictly separable measures satisfying $\mathcal{R}(f_c) = \mathcal{E}^{\phi_h}(f_c) = 0$.
Corollary 9.22 Assume $\rho$ is strictly separable by $\mathcal{H}_K$ with margin $\Delta$.
(i) If $\log \mathcal{N}(B_1, \eta) \le C_0 (1/\eta)^p$ for some $p, C_0 > 0$ and all $\eta > 0$, then, with confidence $1 - \delta$,
\[ \mathcal{R}(\mathrm{sgn}(f_{z,\gamma}^{\phi_h})) \le \frac{2\gamma}{\Delta^2} + 172(1 + C_K/\Delta) \max\left\{ \frac{\log(1/\delta)}{m},\ \Bigl( \frac{C_0}{m} \Bigr)^{1/(1+p)} \Bigl( \frac{4}{\Delta} \Bigr)^{p/(1+p)} \right\}. \]
(ii) If $\log \mathcal{N}(B_1, \eta) \le C_0 (\log(1/\eta))^p$ for some $p, C_0 > 0$ and all $0 < \eta < 1$, then, for $m \ge \max\{4/\Delta, 3\}$, with confidence $1 - \delta$,
\[ \mathcal{R}(\mathrm{sgn}(f_{z,\gamma}^{\phi_h})) \le \frac{2\gamma}{\Delta^2} + \frac{(\log m)^p}{m} \bigl\{ 2 + 86(1 + C_K/\Delta)(2^p C_0 + \log(1/\delta)) \bigr\}. \]
It follows from Corollary 9.22 that for strictly separable measures, we may take $\gamma = 0$. In this case, the penalized term in (9.6) vanishes and the soft margin classifier becomes a hard margin one.

9.7 Weakly separable measures

We continue our discussion on separable measures. By abandoning the positive margin $\Delta > 0$ assumption, we consider weakly separable measures.
Definition 9.23 We say that $\rho$ is weakly separable by $\mathcal{H}_K$ if there is some function $f_{sp} \in \mathcal{H}_K$ satisfying $\|f_{sp}\|_K = 1$ and $y f_{sp}(x) > 0$ almost surely. It has separation triple $(\theta, \Delta, C_0) \in (0, \infty] \times (0, \infty)^2$ if, for all $t > 0$,
\[ \rho_X \{x \in X : |f_{sp}(x)| < \Delta t\} \le C_0 t^\theta. \tag{9.16} \]
The largest $\theta$ for which there are positive constants $\Delta, C_0$ such that $(\theta, \Delta, C_0)$ is a separation triple is called the separation exponent of $\rho$ (w.r.t. $\mathcal{H}_K$ and $f_{sp}$).
Remark 9.24 Note that when $\theta = \infty$, condition (9.16) is the same as $\rho_X \{x \in X : |f_{sp}(x)| < \Delta t\} = 0$ for all $0 < t < 1$. That is, $|f_{sp}(x)| \ge \Delta$ almost surely. But $y f_{sp}(x) > 0$ almost surely. So a weakly separable measure with $\theta = \infty$ is exactly a strictly separable measure with margin $\Delta$.
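Lemma 9.25 Assume $\rho$ is weakly separable by $\mathcal{H}_K$ with separation triple $(\theta, \Delta, C_0)$. Take
\[ f_\gamma = C_0^{1/(2+\theta)} \Delta^{-\theta/(2+\theta)} \gamma^{-1/(2+\theta)} f_{sp}. \tag{9.17} \]
Then
\[ \mathcal{E}^{\phi_h}(f_\gamma) + \gamma \|f_\gamma\|_K^2 \le 2 C_0^{2/(2+\theta)} \Bigl( \frac{\gamma}{\Delta^2} \Bigr)^{\theta/(2+\theta)}. \]
Proof. Write $f_\gamma = f_{sp}/(\Delta t)$, with $t > 0$ to be determined. Since $y f_{sp}(x) > 0$ almost surely, the same holds for $y f_\gamma(x)$. Hence, $\phi_h(y f_\gamma(x)) < 1$ and $\phi_h(y f_\gamma(x)) > 0$ only if $y f_\gamma(x) < 1$, that is, if $|f_\gamma(x)| = |f_{sp}(x)/(\Delta t)| < 1$. Therefore,
\[ \mathcal{E}^{\phi_h}(f_\gamma) = \int_Z \phi_h(y f_\gamma(x))\, d\rho = \int_X \phi_h(|f_\gamma(x)|)\, d\rho_X = \int_{|f_\gamma(x)| < 1} (1 - |f_\gamma(x)|)\, d\rho_X \le \rho_X \{x \in X : |f_\gamma(x)| < 1\} = \rho_X \{x \in X : |f_{sp}(x)| < \Delta t\} \le C_0 t^\theta. \]
But $\gamma \|f_\gamma\|_K^2 = \gamma/(\Delta t)^2$. Setting $t = (\gamma/(C_0 \Delta^2))^{1/(2+\theta)}$ proves our statement.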
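The choice of $t$ at the end of the proof of Lemma 9.25 comes from balancing the two terms $C_0 t^\theta$ and $\gamma/(\Delta t)^2$. The short derivation below spells out this balancing step; it uses only the symbols of the proof and is meant as a worked check, not as part of the original argument.

```latex
% Balance the two terms C_0 t^\theta and \gamma / (\Delta t)^2.
\[
C_0 t^{\theta} = \frac{\gamma}{\Delta^2 t^2}
\iff t^{2+\theta} = \frac{\gamma}{C_0 \Delta^2}
\iff t = \Bigl( \frac{\gamma}{C_0 \Delta^2} \Bigr)^{1/(2+\theta)} .
\]
% With this value of t, each of the two terms equals
\[
C_0 t^{\theta}
= C_0 \Bigl( \frac{\gamma}{C_0 \Delta^2} \Bigr)^{\theta/(2+\theta)}
= C_0^{2/(2+\theta)} \Bigl( \frac{\gamma}{\Delta^2} \Bigr)^{\theta/(2+\theta)} ,
\]
% so their sum is bounded as
\[
\mathcal{E}^{\phi_h}(f_\gamma) + \gamma \|f_\gamma\|_K^2
\le 2\, C_0^{2/(2+\theta)} \Bigl( \frac{\gamma}{\Delta^2} \Bigr)^{\theta/(2+\theta)} .
\]
```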


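Theorem 9.26 If $\rho$ is weakly separable by $\mathcal{H}_K$ with a separation triple $(\theta, \Delta, C_0)$ then, for any $0 < \delta < 1$, with confidence $1 - \delta$, we have
\[ \mathcal{R}(\mathrm{sgn}(f_{z,\gamma}^{\phi_h})) \le 2\varepsilon^*(m, \delta, \gamma) + 8 C_0^{2/(2+\theta)} \Bigl( \frac{\gamma}{\Delta^2} \Bigr)^{\theta/(2+\theta)} + \frac{7 \log(2/\delta)}{3m}, \]
where $\varepsilon^*(m, \delta, \gamma)$ is the smallest positive solution of the inequality
\[ \log \mathcal{N}\Bigl( B_1, \frac{\varepsilon}{4R} \Bigr) - \frac{3 m \varepsilon}{128(1 + C_K R)} \le \log\frac{\delta}{2} \]
with $R = 2 C_0^{1/(2+\theta)} \Delta^{-\theta/(2+\theta)} \gamma^{-1/(2+\theta)} + \sqrt{\frac{2 \log(2/\delta)}{m\gamma}}$.
In addition,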
(i) If log N (Bu n) < C0 (1/n)p for some p, C0 > 0 and all n > 0, then, taking Y = m-P
with 0 < ) < max{20+0+2p}, we have, with confidence 1 - 8,
log N

R(s 8 f))

< e(( m



with r = min 2++0), 2++ eyi+p) , 2+0 and C a constant independent of m and 8.
(ii) If log N (Bi, n) < C0(log(1/n)f for some p, C0 > 0 and all 0 < n < 1, then, for m >
max{4 / A , 3 }, taking Y = m-(2+e>/(1+e>, we have, with confidence 1 - 8,
R(sgn(fZY)) < C ^} (log m)p (log 2 ) j
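Proof. By Theorem 9.21, it is sufficient to bound $\mathcal{E}^{\phi_h}(f_{z,\gamma}^{\phi_h})$ as stated, since $y = \mathrm{sgn}(f_{sp}(x)) = f_c(x)$ almost surely and therefore $\mathcal{E}^{\phi_h}(f_c) = 0$ and $\mathcal{R}(f_c) = 0$.
Choose $f_\gamma$ by (9.17). Decompose $\mathcal{E}^{\phi_h}(f_{z,\gamma}^{\phi_h}) + \gamma \|f_{z,\gamma}^{\phi_h}\|_K^2$ as
\[ \bigl\{ \mathcal{E}^{\phi_h}(f_{z,\gamma}^{\phi_h}) - \mathcal{E}_z^{\phi_h}(f_{z,\gamma}^{\phi_h}) \bigr\} + \bigl\{ \bigl( \mathcal{E}_z^{\phi_h}(f_{z,\gamma}^{\phi_h}) + \gamma \|f_{z,\gamma}^{\phi_h}\|_K^2 \bigr) - \bigl( \mathcal{E}_z^{\phi_h}(f_\gamma) + \gamma \|f_\gamma\|_K^2 \bigr) \bigr\} + \mathcal{E}_z^{\phi_h}(f_\gamma) + \gamma \|f_\gamma\|_K^2. \]
Since the middle term is at most zero by (9.6), we have
\[ \mathcal{E}^{\phi_h}(f_{z,\gamma}^{\phi_h}) + \gamma \|f_{z,\gamma}^{\phi_h}\|_K^2 \le \bigl\{ \mathcal{E}^{\phi_h}(f_{z,\gamma}^{\phi_h}) - \mathcal{E}_z^{\phi_h}(f_{z,\gamma}^{\phi_h}) \bigr\} + \mathcal{E}_z^{\phi_h}(f_\gamma) + \gamma \|f_\gamma\|_K^2. \]
To bound the last term, consider the random variable $\xi = \phi_h(y f_\gamma(x))$. Since $y f_\gamma(x) > 0$ almost surely, we have $0 \le \xi \le 1$. Also, $\sigma^2(\xi) \le E(\xi) = \mathcal{E}^{\phi_h}(f_\gamma)$. Apply the one-side Bernstein inequality to $\xi$ and deduce, for each $\varepsilon > 0$,
\[ \mathop{\mathrm{Prob}}_{z \in Z^m} \bigl\{ \mathcal{E}_z^{\phi_h}(f_\gamma) - \mathcal{E}^{\phi_h}(f_\gamma) \le \varepsilon \bigr\} \ge 1 - \exp\left\{ -\frac{m \varepsilon^2}{2 \bigl( \mathcal{E}^{\phi_h}(f_\gamma) + \frac{1}{3}\varepsilon \bigr)} \right\}. \]
By solving the quadratic equation
\[ \frac{m \varepsilon^2}{2 \bigl( \mathcal{E}^{\phi_h}(f_\gamma) + \frac{1}{3}\varepsilon \bigr)} = \log\frac{2}{\delta} \]
we conclude that, with confidence $1 - \frac{\delta}{2}$,
\[ \mathcal{E}_z^{\phi_h}(f_\gamma) - \mathcal{E}^{\phi_h}(f_\gamma) \le \frac{\frac{1}{3} \log\frac{2}{\delta} + \sqrt{\bigl( \frac{1}{3} \log\frac{2}{\delta} \bigr)^2 + 2 m \mathcal{E}^{\phi_h}(f_\gamma) \log\frac{2}{\delta}}}{m} \le \frac{7 \log(2/\delta)}{6m} + \mathcal{E}^{\phi_h}(f_\gamma). \]
Thus, there exists a subset $U_1$ of $Z^m$ with $\rho(U_1) \ge 1 - \frac{\delta}{2}$ such that
\[ \mathcal{E}_z^{\phi_h}(f_\gamma) \le 2 \mathcal{E}^{\phi_h}(f_\gamma) + \frac{7 \log(2/\delta)}{6m}, \quad \forall z \in U_1. \]
Then, by Lemma 9.25, for all $z \in U_1$,
\[ \mathcal{E}_z^{\phi_h}(f_{z,\gamma}^{\phi_h}) + \gamma \|f_{z,\gamma}^{\phi_h}\|_K^2 \le \mathcal{E}_z^{\phi_h}(f_\gamma) + \gamma \|f_\gamma\|_K^2 \le 4 C_0^{2/(2+\theta)} \Bigl( \frac{\gamma}{\Delta^2} \Bigr)^{\theta/(2+\theta)} + \frac{7 \log(2/\delta)}{6m}. \tag{9.18} \]
In particular, taking $R = 2 C_0^{1/(2+\theta)} \Delta^{-\theta/(2+\theta)} \gamma^{-1/(2+\theta)} + \sqrt{\frac{2 \log(2/\delta)}{m\gamma}}$, we have
\[ f_{z,\gamma}^{\phi_h} \in F = \{f \in \mathcal{H}_K : \|f\|_K \le R\}, \quad \forall z \in U_1. \]
Now apply Lemma 9.19 to the set $F$ with $\alpha = \frac{1}{4}$. We have $\mathcal{N}(F, \frac{\varepsilon}{4}) = \mathcal{N}(B_1, \frac{\varepsilon}{4R})$ and $\|f\|_{C(X)} \le C_K R$ for all $f \in F$. By Lemma 9.19, we can find a subset $U_2$ of $Z^m$ with $\rho(U_2) \ge 1 - \mathcal{N}(B_1, \frac{\varepsilon}{4R}) \exp\{-3m\varepsilon/(128(1 + C_K R))\}$ such that, for all $f \in F$,
\[ \frac{\mathcal{E}^{\phi_h}(f) - \mathcal{E}_z^{\phi_h}(f)}{\sqrt{\mathcal{E}^{\phi_h}(f) + \varepsilon}} \le \sqrt{\varepsilon}. \]
In particular, when $z \in U_1 \cap U_2$, we have $f_{z,\gamma}^{\phi_h} \in F$ and, hence,
\[ \mathcal{E}^{\phi_h}(f_{z,\gamma}^{\phi_h}) - \mathcal{E}_z^{\phi_h}(f_{z,\gamma}^{\phi_h}) \le \sqrt{\varepsilon} \sqrt{\mathcal{E}^{\phi_h}(f_{z,\gamma}^{\phi_h}) + \varepsilon} \le \frac{1}{2} \mathcal{E}^{\phi_h}(f_{z,\gamma}^{\phi_h}) + \varepsilon. \]
Take $\varepsilon = \varepsilon^*(m, \delta, \gamma)$ to be the smallest positive solution of
\[ \log \mathcal{N}\Bigl( B_1, \frac{\varepsilon}{4R} \Bigr) - \frac{3 m \varepsilon}{128(1 + C_K R)} \le \log\frac{\delta}{2}. \]
Then $\rho(U_2) \ge 1 - \frac{\delta}{2}$ and, for $z \in U_1 \cap U_2$, (9.18) implies
\[ \mathcal{E}^{\phi_h}(f_{z,\gamma}^{\phi_h}) \le \frac{1}{2} \mathcal{E}^{\phi_h}(f_{z,\gamma}^{\phi_h}) + \varepsilon^*(m, \delta, \gamma) + 4 C_0^{2/(2+\theta)} \Bigl( \frac{\gamma}{\Delta^2} \Bigr)^{\theta/(2+\theta)} + \frac{7 \log(2/\delta)}{6m}. \]
It follows that
\[ \mathcal{E}^{\phi_h}(f_{z,\gamma}^{\phi_h}) \le 2\varepsilon^*(m, \delta, \gamma) + 8 C_0^{2/(2+\theta)} \Bigl( \frac{\gamma}{\Delta^2} \Bigr)^{\theta/(2+\theta)} + \frac{7 \log(2/\delta)}{3m}. \]
Since $\rho(U_1 \cap U_2) \ge 1 - \delta$, our first statement holds. The rest of the result, statements (i) and (ii), follows from Theorem 9.20 after replacing $\Delta$ by $1/R$ and $\delta$ by $\frac{\delta}{2}$.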


9.8 References and additional remarks

The support vector machine was introduced by Vapnik and his collaborators. It appeared in [20] with polynomial kernels $K(x, y) = (1 + x \cdot y)^d$ and in [35] with general Mercer kernels. More details about the algorithm for solving optimization problem (9.11) can be found in [37, 107, 134, 152].
Proposition 9.3 and some other properties of the Bayes rule can be found in [44].
Proposition 9.7 (a representer theorem) can be found in [137]. The material in Sections 9.3-9.5 is taken from [134].
Theorem 9.17 was proved in [138] and Theorem 9.21 in [154]. The idea of comparing excess errors also appeared in [80].
The error analysis for support vector machines and strictly separable distributions was already well understood in the early works on support vector machines (see [134, 37]). The concept of weakly separable distribution was introduced, and the error analysis for such a distribution was performed, in [31].
When the support vector machine soft margin classifier contains an offset term $b$ as in (9.11), the algorithm is more flexible and more general data can be separated. But the error analysis is more complex than for scheme (9.6), which has no offset. The bound for $\|f_{z,\gamma}^{\phi_h}\|_K$ becomes larger than those shown in Theorem 9.16 and (9.18). But the approach we have used for scheme (9.6) can be applied as well and a similar error analysis can be performed. For details, see [31].

10 General regularized classifiers

In Chapter 9 we saw that solving classification problems amounts to approximating the Bayes rule $f_c$ (w.r.t. the misclassification error) and we described a learning algorithm, the support vector machine, producing such approximations from a sample $z$, a Mercer kernel $K$, and a regularization parameter $\gamma > 0$. The main result in Chapter 9 estimated the quality of the approximations obtained under a separability hypothesis on $\rho$. The classifier produced by the support vector machine is the regularized classifier associated with $z$, $K$, $\gamma$, and a particular loss function, the hinge loss $\phi_h$. Recall that for a loss function $\phi$, this regularized classifier is given by $\mathrm{sgn}(f_{z,\gamma}^{\phi})$, with
\[ f_{z,\gamma}^{\phi} := \operatorname*{argmin}_{f \in \mathcal{H}_K} \left\{ \frac{1}{m} \sum_{i=1}^m \phi(y_i f(x_i)) + \gamma \|f\|_K^2 \right\}. \tag{10.1} \]
In this chapter we extend this development in two ways. First, we remove the separability assumption. Second, we replace the hinge loss $\phi_h$ by arbitrary loss functions within a certain class. Note that it would not be of interest to consider completely arbitrary loss functions, since many such functions would lead to optimization problems (10.1) for which no efficient algorithm is known. The following definition yields an intermediate class of loss functions.
Definition 10.1 We say that $\phi : \mathbb{R} \to \mathbb{R}_+$ is a classifying loss (function) if it is convex and differentiable at 0 with $\phi'(0) < 0$, and if the smallest zero of $\phi$ is 1.
Examples of classifying loss functions are the least squares loss $\phi_{\mathrm{ls}}(t)$, the hinge loss $\phi_h$, and, for $1 \le q < \infty$, the $q$-norm (support vector machine) loss defined by $\phi_q(t) := (\phi_h(t))^q$.

Note that Proposition 9.11 implies that optimization problem (10.1) for a classifying loss function is a convex programming problem. One special feature shared by $\phi_{\mathrm{ls}}$, $\phi_h$, and $\phi_2 = (\phi_h)^2$ is that their associated convex programming problems are quadratic programming problems. This allows for many efficient algorithms to be applied when computing a solution of (10.1). Note that $\phi_{\mathrm{ls}}$ differs from $\phi_2$ by the addition of a symmetric part on the right of 1.
Figure 10.1 shows the shape of some of these classifying loss functions (together with that of $\phi_0$).

Our goal, as in previous chapters, is to understand how close $\mathrm{sgn}(f_{z,\gamma}^{\phi})$ is to $f_c$ (w.r.t. the misclassification error). In other words, we want to estimate the excess misclassification error $\mathcal{R}(\mathrm{sgn}(f_{z,\gamma}^{\phi})) - \mathcal{R}(f_c)$. Note that in Chapter 9 we had $\mathcal{R}(f_c) = 0$ because of the separability assumption. This is no longer the case. The main result in this chapter, Theorem 10.24, achieves this goal for various kernels $K$ and classifying loss functions.

The following two theorems, easily derived from Theorem 10.24, become specific for $C^\infty$ kernels and the hinge loss $\phi_h$ and the least squares loss $\phi_{\mathrm{ls}}$, respectively.

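Theorem 10.2 Assume that $X \subset \mathbb{R}^n$, $K$ is $C^\infty$ in $X \times X$, and, for some $\beta > 0$,

$$\inf_{f \in \mathcal{H}_K} \big\{ \|f - f_c\|_{L^1_{\rho_X}} + \gamma \|f\|_K^2 \big\} = O(\gamma^{\beta}). \qquad (10.2)$$

Choose $\gamma = m^{-2/(\beta+1)}$. Then, for any $0 < \varepsilon < 1$ and $0 < \delta < 1$, with confidence $1 - \delta$,

$$\mathcal{R}(\mathrm{sgn}(f^{\phi_h}_{z,\gamma})) - \mathcal{R}(f_c) \le \widetilde{C} \log\frac{2}{\delta} \Big(\frac{1}{m}\Big)^{\theta}$$

holds, where $\theta = \min\big\{ \frac{2\beta}{1+\beta}, \frac{1}{2} - \varepsilon \big\}$ and $\widetilde{C}$ is a constant depending on $\varepsilon$ but
not on $m$ or $\delta$.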
Condition (10.2) measures how quickly $f_c$ is approximated by functions from $\mathcal{H}_K$ in the
metric $L^1_{\rho_X}$. When $\mathcal{H}_K$ is dense in $L^1_{\rho_X}(X)$, the quantity on the left-hand side of (10.2) tends to zero as $\gamma \to 0$.
What (10.2) requires is a certain decay for this
convergence. This can be stated as some interpolation space condition for the function $f_c$.
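Theorem 10.3 Assume that $K$ is $C^\infty$ in $X \times X$ and that for some $\beta > 0$,

$$\inf_{f \in \mathcal{H}_K} \big\{ \|f - f_\rho\|^2_{L^2_{\rho_X}} + \gamma \|f\|_K^2 \big\} = O(\gamma^{\beta}).$$

Choose $\gamma = m^{-1}$. Then, for any $0 < \varepsilon < 1$ and $0 < \delta < 1$, with confidence $1 - \delta$,

$$\mathcal{R}(\mathrm{sgn}(f^{\phi_{ls}}_{z,\gamma})) - \mathcal{R}(f_c) \le \widetilde{C} \Big(\log\frac{2}{\delta}\Big)^{1/2} \Big(\frac{1}{m}\Big)^{(1/2)\min\{\beta,\, 1-\varepsilon\}}$$

holds, where $\widetilde{C}$ is a constant depending on $\varepsilon$ but not on $m$ or $\delta$.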
Again, the exponents $\beta$ in Theorems 10.2 and 10.3 depend on the measure $\rho$. We note,
however, that in the latter this exponent occurs only in the bounds; that is, the regularization
parameter $\gamma$ can be chosen without knowing $\beta$ and, actually, without any knowledge about $\rho$.
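To make the last remark concrete, the sketch below evaluates the rate of Theorem 10.3 for a few values of the exponent. It assumes the bound has the shape $(\log(2/\delta))^{1/2} (1/m)^{(1/2)\min\{\beta, 1-\varepsilon\}}$ with the constant omitted; the function names are hypothetical. The regularization parameter $\gamma = 1/m$ is computed without reference to the exponent.

```python
import math

def ls_rate_exponent(beta, eps):
    """Exponent of 1/m in Theorem 10.3: (1/2) * min(beta, 1 - eps)."""
    if beta <= 0 or not (0 < eps < 1):
        raise ValueError("need beta > 0 and 0 < eps < 1")
    return 0.5 * min(beta, 1.0 - eps)

def ls_bound_shape(m, delta, beta, eps):
    """(log(2/delta))^(1/2) * (1/m)^exponent: the bound without its constant."""
    return math.sqrt(math.log(2.0 / delta)) * m ** (-ls_rate_exponent(beta, eps))

def ls_gamma(m):
    """Regularization parameter of Theorem 10.3; it does not depend on beta."""
    return 1.0 / m

for beta in (0.25, 0.5, 1.0, 2.0):
    print(beta, ls_rate_exponent(beta, 0.1), ls_bound_shape(10**4, 0.05, beta, 0.1))
```

The exponent saturates at $(1-\varepsilon)/2$ once the approximation exponent exceeds $1 - \varepsilon$, so a better approximation order beyond that point does not improve this particular rate.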

10.1 Bounding the misclassification error in terms of the generalization error

The classification algorithm induced by (10.1) is a regularization scheme. Thus, we expect
that our knowledge from Chapter 8 can be used in its analysis. Note, however, that the
minimized errors (the generalization error in Chapter 8 and the error with respect to the loss
$\phi$ here) are different and, naturally enough, so are their minimizers.
Definition 10.4 Denote by $f^{\phi}_{\rho} : X \to \mathbb{R}$ any measurable function minimizing the generalization error with respect to $\phi$; for
example, for almost all $x \in X$,

$$f^{\phi}_{\rho}(x) := \operatorname*{argmin}_{t \in \mathbb{R}} \int_Y \phi(yt)\, d\rho(y \mid x) = \operatorname*{argmin}_{t \in \mathbb{R}} \big\{ \phi(t)\eta_x + \phi(-t)(1 - \eta_x) \big\}.$$

Our goal in this chapter is to show that under some mild conditions, for any classifying loss $\phi$ satisfying $\phi''(0) > 0$, we have
$\mathcal{E}^{\phi}(f^{\phi}_{z,\gamma}) - \mathcal{E}^{\phi}(f^{\phi}_{\rho}) \to 0$ with high confidence as $m \to \infty$ and $\gamma = \gamma(m) \to 0$. We saw in Chapter 9 that this is the case for $\phi = \phi_h$ and
weakly separable measures. We begin in this section by extending Theorem 9.21.

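Theorem 10.5 Let $\phi$ be a classifying loss such that $\phi''(0)$ exists and is positive. Then there is
a constant $c_\phi > 0$ such that for all measurable functions $f : X \to \mathbb{R}$,

$$\mathcal{R}(\mathrm{sgn}(f)) - \mathcal{R}(f_c) \le c_\phi \sqrt{\mathcal{E}^{\phi}(f) - \mathcal{E}^{\phi}(f^{\phi}_{\rho})}.$$

If $\mathcal{R}(f_c) = 0$, then the bound can be improved to

$$\mathcal{R}(\mathrm{sgn}(f)) - \mathcal{R}(f_c) \le c_\phi \big( \mathcal{E}^{\phi}(f) - \mathcal{E}^{\phi}(f^{\phi}_{\rho}) \big).$$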

To prove Theorem 10.5, we want to understand the behavior of $f^{\phi}_{\rho}$. To this end, we
introduce an auxiliary function. In what follows, fix a classifying loss $\phi$.
Definition 10.6 Define the localizing function $\Phi = \Phi_x : \mathbb{R} \cup \{\pm\infty\} \to \mathbb{R}_+$ to be the function
associated with $\phi$, $\rho$, and $x$ given by

$$\Phi(t) = \eta_x \phi(t) + (1 - \eta_x)\phi(-t).$$

The following property of classifying loss functions follows immediately from their
convexity. Denote by $\phi'_-$ (respectively, $\phi'_+$) the left derivative (respectively, right derivative) of $\phi$.
Figure 10.2 shows the localizing functions corresponding to $\phi_0$, $\phi_1$, and $\phi_2$ for $\eta(x) = 0.75$.
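The localizing function can be explored numerically. The sketch below, with hypothetical helper names, evaluates $\Phi_x(t) = \eta_x \phi(t) + (1-\eta_x)\phi(-t)$ for $\eta_x = 0.75$ and locates its minimizer on a grid, assuming the standard hinge loss $\max\{1-t, 0\}$ and the least squares loss $(1-t)^2$.

```python
import numpy as np

def hinge(t):
    """Hinge loss max(1 - t, 0) (standard form, assumed here)."""
    return np.maximum(1.0 - t, 0.0)

def least_squares(t):
    """Least squares loss (1 - t)^2."""
    return (1.0 - t) ** 2

def localizing(phi, eta, t):
    """Phi_x(t) = eta * phi(t) + (1 - eta) * phi(-t), with eta = eta_x."""
    return eta * phi(t) + (1.0 - eta) * phi(-t)

def minimizer_on_grid(phi, eta, grid=np.linspace(-2.0, 2.0, 4001)):
    """Grid approximation of a minimizer of the localizing function."""
    return grid[np.argmin(localizing(phi, eta, grid))]

eta = 0.75
t_ls = minimizer_on_grid(least_squares, eta)  # close to 2 * eta - 1 = 0.5
t_h = minimizer_on_grid(hinge, eta)           # close to 1
print(t_ls, t_h)
```

For the least squares loss the minimizer is $2\eta_x - 1 = f_\rho(x)$, while for the hinge loss it sits at the endpoint 1 whenever $\eta_x > 1/2$, which matches the sign of $f_\rho(x)$.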
Lemma 10.7 A classifying loss $\phi$ is strictly decreasing on $(-\infty, 1]$ and nondecreasing on $(1, +\infty)$.
It satisfies $\phi'_+(t) < 0$ for $t \in (-\infty, 1)$ and $\phi'_-(t) \ge 0$ for $t \in (1, +\infty)$.
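Note that if $\phi$ is a classifying loss, then, for all $x \in X$, $\Phi_x$ is convex and $\mathcal{E}^{\phi}(f) = \int_X \Phi_x(f(x))\, d\rho_X$. Denote

$$f^-_{\rho}(x) := \sup\{ t \in \mathbb{R} \mid \Phi'_-(t) = \eta_x \phi'_-(t) - (1 - \eta_x)\phi'_+(-t) < 0 \}$$

and

$$f^+_{\rho}(x) := \inf\{ t \in \mathbb{R} \mid \Phi'_+(t) = \eta_x \phi'_+(t) - (1 - \eta_x)\phi'_-(-t) > 0 \}.$$

The convexity of $\Phi$ implies that $f^-_{\rho}(x) \le f^+_{\rho}(x)$.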
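Theorem 10.8 Let $\phi$ be a classifying loss function and $x \in X$.
(i) The convex function $\Phi_x$ is strictly decreasing on $(-\infty, f^-_{\rho}(x)]$, strictly increasing on
$[f^+_{\rho}(x), +\infty)$, and constant on $[f^-_{\rho}(x), f^+_{\rho}(x)]$.
(ii) $f^{\phi}_{\rho}(x)$ is a minimizer of $\Phi_x$ and can be taken to be any value in $[f^-_{\rho}(x), f^+_{\rho}(x)]$.
(iii) The following holds:

$$\begin{cases} 0 \le f^-_{\rho}(x) \le f^{\phi}_{\rho}(x) & \text{if } f_\rho(x) > 0, \\ f^{\phi}_{\rho}(x) \le f^+_{\rho}(x) \le 0 & \text{if } f_\rho(x) < 0, \\ f^-_{\rho}(x) \le 0 \le f^+_{\rho}(x) & \text{if } f_\rho(x) = 0. \end{cases}$$

(iv) $f^-_{\rho}(x) \le 1$ and $f^+_{\rho}(x) \ge -1$.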
Proof. (i) Since $\Phi = \Phi_x$ is convex, its one-side derivatives are both well defined and nondecreasing,
and $\Phi'_-(t) \le \Phi'_+(t)$ for every $t \in \mathbb{R}$. Then $\Phi$ is strictly decreasing on the interval $(-\infty, f^-_{\rho}(x)]$,
since $\Phi'_-(t) < 0$ on this interval. In the same way, $\Phi'_+(t) > 0$ for $t > f^+_{\rho}(x)$, so $\Phi$ is strictly
increasing on $[f^+_{\rho}(x), \infty)$.
For $t \in [f^-_{\rho}(x), f^+_{\rho}(x)]$, we have $0 \le \Phi'_-(t) \le \Phi'_+(t) \le 0$. Hence $\Phi$ is constant on $[f^-_{\rho}(x), f^+_{\rho}(x)]$ and
its value on this interval is its minimum.
(ii) Let $x \in X$. If we denote

$$\mathcal{E}^{\phi}(f \mid x) := \int_Y \phi(y f(x))\, d\rho(y \mid x) = \eta_x \phi(f(x)) + (1 - \eta_x)\phi(-f(x)),$$

then $\mathcal{E}^{\phi}(f \mid x) = \Phi(f(x))$. It follows that $f^{\phi}_{\rho}(x)$, which minimizes $\mathcal{E}^{\phi}(\cdot \mid x)$, is also a minimizer of $\Phi$.
(iii) Observe that

$$f_\rho(x) = \eta_x - (1 - \eta_x) = 2\big(\eta_x - \tfrac{1}{2}\big).$$

Since $\phi$ is differentiable at 0, so is $\Phi$, and $\Phi'(0) = (2\eta_x - 1)\phi'(0) = f_\rho(x)\phi'(0)$. We now reason
by cases and use that $\phi'(0) < 0$. When $f_\rho(x) > 0$, we have $\Phi'_-(0) = \Phi'(0) < 0$ and $f^-_{\rho}(x) \ge 0$.
When $f_\rho(x) < 0$, $\Phi'_+(0) = \Phi'(0) > 0$ and $f^+_{\rho}(x) \le 0$ hold. Finally, when $f_\rho(x) = 0$, we have $\Phi'_-(0) = \Phi'_+(0) = 0$,
which implies $f^-_{\rho}(x) \le 0$ and $f^+_{\rho}(x) \ge 0$.
(iv) When $t > 1$, Lemma 10.7 tells us that $\phi'_-(t) \ge 0$ and $\phi'_+(-t) < 0$. Hence $\Phi'_-(t) \ge 0$ and $f^-_{\rho}(x) \le 1$.
In the same way, $f^+_{\rho}(x) \ge -1$ follows from $\phi'_+(t) < 0$ and $\phi'_-(-t) \ge 0$ for $t < -1$.
Assumption 10.9 It follows from Theorem 10.8 that $f^{\phi}_{\rho}$ can be chosen to satisfy

$$|f^{\phi}_{\rho}(x)| \le 1 \quad \text{and} \quad f^{\phi}_{\rho}(x) = 0 \ \text{if } f_\rho(x) = 0. \qquad (10.6)$$

In the remainder of this chapter we assume, without loss of generality, that $f^{\phi}_{\rho}$ satisfies (10.6).

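Lemma 10.10 Let $\phi$ be a classifying loss such that $\phi''(0)$ exists and is positive. Then
there is a constant $C = C_\phi > 0$ such that for all $x \in X$,

$$\Phi(0) - \Phi(f^{\phi}_{\rho}(x)) \ge C \big(\eta_x - \tfrac{1}{2}\big)^2.$$

Proof. By the definition of $\phi''(0)$, there exists some $1 \ge c_0 > 0$ such that for all $t \in [-c_0, c_0]$,

$$\Big| \frac{\phi'(t) - \phi'(0)}{t} - \phi''(0) \Big| \le \frac{\phi''(0)}{2}.$$

This implies that

$$\phi'(0) + \phi''(0)t - \frac{\phi''(0)}{2}|t| \le \phi'(t) \le \phi'(0) + \phi''(0)t + \frac{\phi''(0)}{2}|t|, \quad \forall t \in [-c_0, c_0].$$

Let $x \in X$. Consider the case $\eta_x > \frac{1}{2}$ first.
Denote $\Delta = \min\big\{ -\frac{\phi'(0)}{\phi''(0)}(\eta_x - \frac{1}{2}),\ c_0 \big\}$. For $0 \le t \le c_0$,

$$\Phi'(t) = \eta_x \phi'(t) - (1 - \eta_x)\phi'(-t) \le (2\eta_x - 1)\phi'(0) + \frac{3}{2}\phi''(0)t.$$

Thus, for $0 \le t \le \Delta \le -\frac{\phi'(0)}{\phi''(0)}(\eta_x - \frac{1}{2})$, we have

$$\Phi'(t) \le (2\eta_x - 1)\phi'(0) - \frac{3}{2}\phi'(0)\big(\eta_x - \tfrac{1}{2}\big) = \frac{\phi'(0)}{2}\big(\eta_x - \tfrac{1}{2}\big) < 0.$$

Therefore $\Phi$ is strictly decreasing on the interval $[0, \Delta]$. But $f^{\phi}_{\rho}(x)$ is its minimal point, so

$$\Phi(0) - \Phi(f^{\phi}_{\rho}(x)) \ge \Phi(0) - \Phi(\Delta) \ge -\frac{\phi'(0)}{2}\big(\eta_x - \tfrac{1}{2}\big)\Delta.$$

When $-\frac{\phi'(0)}{\phi''(0)}(\eta_x - \frac{1}{2}) \le c_0$, we have $\Delta = -\frac{\phi'(0)}{\phi''(0)}(\eta_x - \frac{1}{2})$. When $-\frac{\phi'(0)}{\phi''(0)}(\eta_x - \frac{1}{2}) > c_0$, we have $\Delta = c_0 \ge 2c_0(\eta_x - \frac{1}{2})$. In both cases, we have

$$\Phi(0) - \Phi(f^{\phi}_{\rho}(x)) \ge \min\Big\{ -\phi'(0)c_0,\ \frac{(\phi'(0))^2}{2\phi''(0)} \Big\} \big(\eta_x - \tfrac{1}{2}\big)^2.$$

That is, the desired inequality holds with

$$C = \min\Big\{ -\phi'(0)c_0,\ \frac{(\phi'(0))^2}{2\phi''(0)} \Big\}.$$

The proof for $\eta_x < \frac{1}{2}$ is similar: one estimates $\Phi'(t)$ for $t < 0$.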

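Proof of Theorem 10.5 Denote $X_c = \{x \in X \mid \mathrm{sgn}(f)(x) \ne f_c(x)\}$. Recall (9.14). Applying the
Cauchy-Schwarz inequality and the fact that $\rho_X$ is a probability measure on $X$, we get

$$\mathcal{R}(\mathrm{sgn}(f)) - \mathcal{R}(f_c) \le \Big\{ \int_{X_c} |f_\rho(x)|^2\, d\rho_X \Big\}^{1/2} \Big\{ \int_{X_c} d\rho_X \Big\}^{1/2} \le \Big\{ \int_{X_c} |f_\rho(x)|^2\, d\rho_X \Big\}^{1/2}.$$

We then use Lemma 10.10 and (10.5) to find that

$$\mathcal{R}(\mathrm{sgn}(f)) - \mathcal{R}(f_c) \le 2 \Big\{ \frac{1}{C} \int_{X_c} \big( \Phi(0) - \Phi(f^{\phi}_{\rho}(x)) \big)\, d\rho_X \Big\}^{1/2}.$$

Let $x \in X_c$. If $f_\rho(x) > 0$, then $f_c(x) = 1$ and $f(x) < 0$. By Theorem 10.8, $f^-_{\rho}(x) \ge 0$ and $\Phi$ is
strictly decreasing on $(-\infty, 0]$. So $\Phi(f(x)) \ge \Phi(0)$ in this case. In the same way, if $f_\rho(x) < 0$,
then $f(x) \ge 0$. By Theorem 10.8, $f^+_{\rho}(x) \le 0$ and $\Phi$ is strictly increasing on $[0, +\infty)$. So $\Phi(f(x)) \ge \Phi(0)$.
Finally, if $f_\rho(x) = 0$, by (10.6), $f^{\phi}_{\rho}(x) = 0$ and then $\Phi(0) - \Phi(f^{\phi}_{\rho}(x)) = 0$.

In all three cases we have $\Phi(0) - \Phi(f^{\phi}_{\rho}(x)) \le \Phi(f(x)) - \Phi(f^{\phi}_{\rho}(x))$. Hence,

$$\int_{X_c} \big\{ \Phi(0) - \Phi(f^{\phi}_{\rho}(x)) \big\}\, d\rho_X \le \int_{X_c} \big\{ \Phi(f(x)) - \Phi(f^{\phi}_{\rho}(x)) \big\}\, d\rho_X \le \int_{X} \big\{ \Phi(f(x)) - \Phi(f^{\phi}_{\rho}(x)) \big\}\, d\rho_X = \mathcal{E}^{\phi}(f) - \mathcal{E}^{\phi}(f^{\phi}_{\rho}).$$

This proves the first desired bound with $c_\phi = 2/\sqrt{C}$.
If $\mathcal{R}(f_c) = 0$, then $y = f_c(x)$ almost surely and $\eta_x = 1$ or $0$ almost everywhere. This means
that $|f_\rho(x)| = 1$ and $|f_\rho(x)| = |f_\rho(x)|^2$ almost everywhere. Using this with respect to relation
(9.14), we see that $\mathcal{R}(\mathrm{sgn}(f)) - \mathcal{R}(f_c) = \int_{X_c} |f_\rho(x)|^2\, d\rho_X$. Then the above procedure yields the
second bound with $c_\phi = 4/C$.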

10.2 Projection and error decomposition


Since regularized classifiers are obtained by composing the sgn function with a real-valued
function $f : X \to \mathbb{R}$, we may improve the error estimates by replacing image values of $f$ by their
projection onto $[-1, 1]$. This section develops this idea.
Definition 10.11 The projection operator $\pi$ on the space of measurable functions $f : X \to \mathbb{R}$
is defined by

$$\pi(f)(x) = \begin{cases} 1 & \text{if } f(x) > 1, \\ -1 & \text{if } f(x) < -1, \\ f(x) & \text{if } -1 \le f(x) \le 1. \end{cases}$$

Trivially, $\mathrm{sgn}(\pi(f)) = \mathrm{sgn}(f)$. Lemma 10.7 tells us that $\phi(y(\pi(f))(x)) \le \phi(y f(x))$. Then

$$\mathcal{E}^{\phi}(\pi(f)) \le \mathcal{E}^{\phi}(f) \quad \text{and} \quad \mathcal{E}^{\phi}_z(\pi(f)) \le \mathcal{E}^{\phi}_z(f). \qquad (10.8)$$
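The effect of the projection can be checked numerically. The following sketch applies the projection to synthetic function values (random numbers standing in for $f(x_i)$, an assumption made only for illustration) and verifies on that sample that signs are preserved and that the hinge and least squares losses do not increase.

```python
import numpy as np

def project(f_values):
    """Projection pi(f): clip function values to the interval [-1, 1]."""
    return np.clip(f_values, -1.0, 1.0)

def hinge(t):
    """Hinge loss max(1 - t, 0) (standard form, assumed here)."""
    return np.maximum(1.0 - t, 0.0)

def least_squares(t):
    """Least squares loss (1 - t)^2."""
    return (1.0 - t) ** 2

rng = np.random.default_rng(0)
f_vals = rng.normal(scale=2.0, size=1000)   # synthetic values f(x_i)
y = rng.choice([-1.0, 1.0], size=1000)      # synthetic labels y_i
pf_vals = project(f_vals)

same_sign = np.array_equal(np.sign(pf_vals), np.sign(f_vals))
hinge_ok = np.all(hinge(y * pf_vals) <= hinge(y * f_vals) + 1e-12)
ls_ok = np.all(least_squares(y * pf_vals) <= least_squares(y * f_vals) + 1e-12)
print(same_sign, hinge_ok, ls_ok)
```

Since the inequality holds pointwise, averaging over the sample gives the empirical version of (10.8).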
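Together with Theorem 10.5, this implies that if $\phi''(0) > 0$,

$$\mathcal{R}(\mathrm{sgn}(f^{\phi}_{z,\gamma})) - \mathcal{R}(f_c) \le c_\phi \sqrt{\mathcal{E}^{\phi}(\pi(f^{\phi}_{z,\gamma})) - \mathcal{E}^{\phi}(f^{\phi}_{\rho})}.$$

Thus the analysis for the excess misclassification error of $f^{\phi}_{z,\gamma}$ is reduced into that for the
excess generalization error $\mathcal{E}^{\phi}(\pi(f^{\phi}_{z,\gamma})) - \mathcal{E}^{\phi}(f^{\phi}_{\rho})$. We carry out the latter analysis in the next
two sections.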
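The following result is similar to Theorem 8.3.
Theorem 10.12 Let $\phi$ be a classifying loss, $f^{\phi}_{z,\gamma}$ be defined by (10.1), and $f_\gamma \in \mathcal{H}_K$. Then
$\mathcal{E}^{\phi}(\pi(f^{\phi}_{z,\gamma})) - \mathcal{E}^{\phi}(f^{\phi}_{\rho}) + \gamma \|f^{\phi}_{z,\gamma}\|_K^2$ is bounded by

$$\big\{ \mathcal{E}^{\phi}(f_\gamma) - \mathcal{E}^{\phi}(f^{\phi}_{\rho}) + \gamma \|f_\gamma\|_K^2 \big\} + \big\{ [\mathcal{E}^{\phi}_z(f_\gamma) - \mathcal{E}^{\phi}_z(f^{\phi}_{\rho})] - [\mathcal{E}^{\phi}(f_\gamma) - \mathcal{E}^{\phi}(f^{\phi}_{\rho})] \big\} + \big\{ [\mathcal{E}^{\phi}(\pi(f^{\phi}_{z,\gamma})) - \mathcal{E}^{\phi}(f^{\phi}_{\rho})] - [\mathcal{E}^{\phi}_z(\pi(f^{\phi}_{z,\gamma})) - \mathcal{E}^{\phi}_z(f^{\phi}_{\rho})] \big\}. \qquad (10.9)$$

Proof. The proof follows from (10.8) using the procedure in the proof of Theorem 8.3.
The function $f_\gamma \in \mathcal{H}_K$ in Theorem 10.12 is called a regularizing function. It is arbitrarily chosen and depends on
$\gamma$. One standard choice is the function

$$\widetilde{f}_\gamma := \operatorname*{argmin}_{f \in \mathcal{H}_K} \big\{ \mathcal{E}^{\phi}(f) + \gamma \|f\|_K^2 \big\}. \qquad (10.10)$$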
The first term on the right-hand side of (10.9) is estimated in the next section. It is the
regularized error (w.r.t. $\phi$) of $f_\gamma$,

$$\mathcal{D}(\gamma, \phi) := \mathcal{E}^{\phi}(f_\gamma) - \mathcal{E}^{\phi}(f^{\phi}_{\rho}) + \gamma \|f_\gamma\|_K^2. \qquad (10.11)$$

The second and third terms on the right-hand side of (10.9) decompose the sample error
$\mathcal{E}^{\phi}(\pi(f^{\phi}_{z,\gamma})) - \mathcal{E}^{\phi}(f_\gamma)$. The second term is about a single random variable involving only one
function $f_\gamma$ and is easy to handle; we bound it in Section 10.4. The third term is more complex.
In the form presented, the function $\pi(f^{\phi}_{z,\gamma})$ is projected from $f^{\phi}_{z,\gamma}$. This projection maintains the
misclassification error: $\mathcal{R}(\mathrm{sgn}(\pi(f^{\phi}_{z,\gamma}))) = \mathcal{R}(\mathrm{sgn}(f^{\phi}_{z,\gamma}))$. However, it causes the random variable
$\phi(y\,\pi(f^{\phi}_{z,\gamma})(x))$ to be bounded by $\phi(-1)$, a bound that is often smaller than that for $\phi(y f^{\phi}_{z,\gamma}(x))$. This
allows for improved bounds for the sample error for classification algorithms. We bound this
third term in Section 10.5.

10.3 Bounds for the regularized error





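In this section we estimate the regularized error $\mathcal{D}(\gamma, \phi)$ of $f_\gamma$. This estimate follows from
estimates for $\mathcal{E}^{\phi}(f_\gamma) - \mathcal{E}^{\phi}(f^{\phi}_{\rho})$. Define the function $\Psi : \mathbb{R}_+ \to \mathbb{R}_+$ by

$$\Psi(t) = \max\{ |\phi'_-(t)|, |\phi'_+(t)|, |\phi'_-(-t)|, |\phi'_+(-t)| \}.$$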
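Theorem 10.13 Let $\phi$ be a classifying loss. For any measurable function $f$,

$$\mathcal{E}^{\phi}(f) - \mathcal{E}^{\phi}(f^{\phi}_{\rho}) \le \int_X \Psi(|f(x)|)\, |f(x) - f^{\phi}_{\rho}(x)|\, d\rho_X.$$

If, in addition, $\phi \in C^2(\mathbb{R})$, then we have

$$\mathcal{E}^{\phi}(f) - \mathcal{E}^{\phi}(f^{\phi}_{\rho}) \le \int_X \max\big\{ \|\phi''\|_{L^\infty[-1,1]},\ \|\phi''\|_{L^\infty[-|f(x)|,|f(x)|]} \big\}\, |f(x) - f^{\phi}_{\rho}(x)|^2\, d\rho_X.$$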
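Proof. It follows from (10.3) and (10.4) that

$$\mathcal{E}^{\phi}(f) - \mathcal{E}^{\phi}(f^{\phi}_{\rho}) = \int_X \Phi(f(x)) - \Phi(f^{\phi}_{\rho}(x))\, d\rho_X.$$

By Theorem 10.8, $\Phi$ is constant on $[f^-_{\rho}(x), f^+_{\rho}(x)]$. So we need only bound for those points $x$
for which the value $f(x)$ is outside this interval.
If $f(x) > f^+_{\rho}(x)$, then, by Theorem 10.8 and since $f^+_{\rho}(x) \ge -1$, $\Phi$ is strictly increasing on
$[f^+_{\rho}(x), f(x)]$. Moreover, the convexity of $\Phi$ implies that

$$\Phi(f(x)) - \Phi(f^{\phi}_{\rho}(x)) \le \Phi'_-(f(x))\,(f(x) - f^{\phi}_{\rho}(x)) \le \max\{ |\phi'_-(f(x))|, |\phi'_+(-f(x))| \}\, |f(x) - f^{\phi}_{\rho}(x)| \le \Psi(|f(x)|)\, |f(x) - f^{\phi}_{\rho}(x)|.$$

Similarly, if $f(x) < f^-_{\rho}(x)$, then, by Theorem 10.8 again and since $f^-_{\rho}(x) \le 1$,
$\Phi$ is strictly decreasing on $[f(x), f^-_{\rho}(x)]$, and

$$\Phi(f(x)) - \Phi(f^{\phi}_{\rho}(x)) \le \Phi'_+(f(x))\,(f(x) - f^{\phi}_{\rho}(x)) \le \Psi(|f(x)|)\, |f(x) - f^{\phi}_{\rho}(x)|.$$

Thus, we have

$$\Phi(f(x)) - \Phi(f^{\phi}_{\rho}(x)) \le \Psi(|f(x)|)\, |f(x) - f^{\phi}_{\rho}(x)|.$$

This gives the first bound.
If $\phi \in C^2(\mathbb{R})$, so is $\Phi$. Then $\Phi'(f^{\phi}_{\rho}(x)) = 0$ since $f^{\phi}_{\rho}(x)$ is a minimum of $\Phi$. When $f(x) > f^+_{\rho}(x)$,
using Taylor's expansion,

$$\Phi(f(x)) - \Phi(f^{\phi}_{\rho}(x)) = \Phi'(f^{\phi}_{\rho}(x))(f(x) - f^{\phi}_{\rho}(x)) + \int_{f^{\phi}_{\rho}(x)}^{f(x)} (f(x) - t)\, \Phi''(t)\, dt \le \|\Phi''\|_{L^\infty[f^{\phi}_{\rho}(x), f(x)]}\, |f(x) - f^{\phi}_{\rho}(x)|^2.$$

Now, since $f^{\phi}_{\rho}(x) \in [-1, 1]$, use that

$$\|\Phi''\|_{L^\infty[f^{\phi}_{\rho}(x), f(x)]} \le \max\big\{ \|\phi''\|_{L^\infty[-1,1]},\ \|\phi''\|_{L^\infty[-|f(x)|,|f(x)|]} \big\}$$

to get the desired result.
The case $f(x) < f^-_{\rho}(x)$ is dealt with in the same way.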

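Corollary 10.14 Let $\phi$ be a classifying loss with $\|\phi'\|_\infty < \infty$. For any
measurable function $f$,

$$\mathcal{E}^{\phi}(f) - \mathcal{E}^{\phi}(f^{\phi}_{\rho}) \le \|\phi'\|_\infty\, \|f - f^{\phi}_{\rho}\|_{L^1_{\rho_X}}.$$

If $\phi \in C^2(\mathbb{R})$ and $\|\phi''\|_\infty < \infty$, then we have

$$\mathcal{E}^{\phi}(f) - \mathcal{E}^{\phi}(f^{\phi}_{\rho}) \le \|\phi''\|_\infty\, \|f - f^{\phi}_{\rho}\|^2_{L^2_{\rho_X}}.$$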
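10.4 Bounds for the sample error term involving $f_\gamma$

In this section, we bound the sample error term in (10.9) involving $f_\gamma$, that is,

$$\big[ \mathcal{E}^{\phi}_z(f_\gamma) - \mathcal{E}^{\phi}_z(f^{\phi}_{\rho}) \big] - \big[ \mathcal{E}^{\phi}(f_\gamma) - \mathcal{E}^{\phi}(f^{\phi}_{\rho}) \big].$$

This can be written as $\frac{1}{m}\sum_{i=1}^{m} \xi(z_i) - E(\xi)$ with $\xi$ the random variable on $(Z, \rho)$ given by $\xi(z) = \phi(y f_\gamma(x)) - \phi(y f^{\phi}_{\rho}(x))$. To
bound this quantity using the Bernstein inequalities, we need to control the variance. We do
so by means of the following constant determined by $\phi$ and $\rho$.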
Definition 10.15 The variancing power $\tau = \tau_{\phi,\rho}$ of the pair $(\phi, \rho)$ is defined to be the
maximal number $\tau$ in $[0, 1]$ such that for some constant $C_1 > 0$ and any measurable function $f : X \to [-1, 1]$,

$$E\big\{ \big( \phi(y f(x)) - \phi(y f^{\phi}_{\rho}(x)) \big)^2 \big\} \le C_1 \big( \mathcal{E}^{\phi}(f) - \mathcal{E}^{\phi}(f^{\phi}_{\rho}) \big)^{\tau}. \qquad (10.12)$$

Since (10.12) always holds with $\tau = 0$ and $C_1 = (\phi(-1))^2$, the variancing power $\tau_{\phi,\rho}$ is well defined.
Example 10.16 For $\phi_{ls}(t) = (1 - t)^2$ we have $\tau_{\phi,\rho} = 1$ for any probability measure $\rho$.
Proof. For $\phi_{ls}(t) = (1 - t)^2$ we know that $\phi_{ls}(y f(x)) = (y - f(x))^2$ and $f^{\phi}_{\rho} = f_\rho$. Hence (10.12) is
valid with $\tau = 1$ and $C_1 = \sup_{(x,y) \in Z} (y - f(x) + y - f_\rho(x))^2 \le 16$.

In general, $\tau$ depends on the convexity of $\phi$ and on how much noise $\rho$ contains.
In particular, for the q-norm loss $\phi_q$, $\tau_{\phi_q,\rho} = 1$ when $1 < q \le 2$ and $0 < \tau_{\phi_q,\rho} < 1$ when $q > 2$.

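Lemma 10.17 Let $q, q^* > 1$ be such that $\frac{1}{q} + \frac{1}{q^*} = 1$. Then

$$a \cdot b \le \frac{1}{q} a^{q} + \frac{1}{q^*} b^{q^*}, \quad \forall a, b > 0.$$

Proof. Let $b > 0$. Define a function $f : \mathbb{R}_+ \to \mathbb{R}$ by $f(a) = a \cdot b - \frac{1}{q} a^{q} - \frac{1}{q^*} b^{q^*}$. This satisfies

$$f'(a) = b - a^{q-1}, \qquad f''(a) = -(q - 1)a^{q-2} < 0, \quad \forall a > 0.$$

Hence $f$ is a concave function on $\mathbb{R}_+$ and takes its maximum value at the unique point $a^* = b^{1/(q-1)}$ where $f'(a^*) = 0$. But $q^* = \frac{q}{q-1}$ and

$$f(a^*) = a^* \cdot b - \frac{1}{q}(a^*)^{q} - \frac{1}{q^*} b^{q^*} = b^{q/(q-1)} - \frac{1}{q} b^{q/(q-1)} - \Big(1 - \frac{1}{q}\Big) b^{q/(q-1)} = 0.$$

Therefore, $f(a) \le f(a^*) = 0$ for all $a \in \mathbb{R}_+$. This is true for any $b > 0$. So the inequality holds.

Proposition 10.18 Let $\tau = \tau_{\phi,\rho}$ and $B_\gamma := \max\{ \phi(\|f_\gamma\|_\infty), \phi(-\|f_\gamma\|_\infty) \}$. For any $0 < \delta < 1$, with confidence $1 - \delta$, the quantity

$$\big[ \mathcal{E}^{\phi}_z(f_\gamma) - \mathcal{E}^{\phi}_z(f^{\phi}_{\rho}) \big] - \big[ \mathcal{E}^{\phi}(f_\gamma) - \mathcal{E}^{\phi}(f^{\phi}_{\rho}) \big]$$

is bounded by

$$\frac{5B_\gamma + 2\phi(-1)}{3m} \log\frac{2}{\delta} + \Big( \frac{2C_1 \log(2/\delta)}{m} \Big)^{1/(2-\tau)} + \mathcal{E}^{\phi}(f_\gamma) - \mathcal{E}^{\phi}(f^{\phi}_{\rho}).$$

Proof. Write the random variable $\xi(z) = \phi(y f_\gamma(x)) - \phi(y f^{\phi}_{\rho}(x))$ on $(Z, \rho)$ as $\xi = \xi_1 + \xi_2$, where

$$\xi_1 := \phi(y f_\gamma(x)) - \phi(y\,\pi(f_\gamma)(x)), \qquad \xi_2 := \phi(y\,\pi(f_\gamma)(x)) - \phi(y f^{\phi}_{\rho}(x)).$$

The first part $\xi_1$ is a random variable satisfying $0 \le \xi_1 \le B_\gamma$. Applying the one-side
Bernstein inequality to $\xi_1$, we obtain, for any $\varepsilon > 0$,

$$\mathop{\mathrm{Prob}}_{z \in Z^m} \Big\{ \frac{1}{m}\sum_{i=1}^{m} \xi_1(z_i) - E(\xi_1) > \varepsilon \Big\} \le \exp\Big\{ -\frac{m\varepsilon^2}{2\big( \sigma^2(\xi_1) + \frac{1}{3}B_\gamma \varepsilon \big)} \Big\}.$$

Solving the quadratic equation for $\varepsilon$ given by

$$\frac{m\varepsilon^2}{2\big( \sigma^2(\xi_1) + \frac{1}{3}B_\gamma \varepsilon \big)} = \log\frac{2}{\delta},$$

we see that for any $0 < \delta < 1$, there exists a subset $U_1$ of $Z^m$ with measure at least $1 - \frac{\delta}{2}$ such
that for every $z \in U_1$,

$$\frac{1}{m}\sum_{i=1}^{m} \xi_1(z_i) - E(\xi_1) \le \frac{ \frac{1}{3}B_\gamma \log(2/\delta) + \sqrt{ \big( \frac{1}{3}B_\gamma \log(2/\delta) \big)^2 + 2m\sigma^2(\xi_1)\log(2/\delta) } }{m}.$$

But $\sigma^2(\xi_1) \le E(\xi_1^2) \le B_\gamma E(\xi_1)$. Therefore,

$$\frac{1}{m}\sum_{i=1}^{m} \xi_1(z_i) - E(\xi_1) \le \frac{5B_\gamma \log(2/\delta)}{3m} + E(\xi_1), \quad \forall z \in U_1.$$

Next we consider $\xi_2$. This is a random variable bounded by $\phi(-1)$. Applying the one-side
Bernstein inequality as above, we obtain another subset $U_2$ of $Z^m$ with measure at least $1 - \frac{\delta}{2}$
such that for every $z \in U_2$,

$$\frac{1}{m}\sum_{i=1}^{m} \xi_2(z_i) - E(\xi_2) \le \frac{2\phi(-1)\log(2/\delta)}{3m} + \sqrt{ \frac{2\log(2/\delta)\,\sigma^2(\xi_2)}{m} }.$$

The definition of the variancing power $\tau$ gives $\sigma^2(\xi_2) \le E(\xi_2^2) \le C_1 (E(\xi_2))^{\tau}$. Applying Lemma
10.17 to $q = \frac{2}{2-\tau}$, $q^* = \frac{2}{\tau}$, $a = \sqrt{2\log(2/\delta)C_1/m}$, and $b = \sqrt{(E(\xi_2))^{\tau}}$, we obtain

$$\sqrt{ \frac{2\log(2/\delta)\,\sigma^2(\xi_2)}{m} } \le \Big( 1 - \frac{\tau}{2} \Big) \Big( \frac{2\log(2/\delta)C_1}{m} \Big)^{1/(2-\tau)} + \frac{\tau}{2} E(\xi_2).$$

Hence, for all $z \in U_2$,

$$\frac{1}{m}\sum_{i=1}^{m} \xi_2(z_i) - E(\xi_2) \le \frac{2\phi(-1)\log(2/\delta)}{3m} + \Big( \frac{2\log(2/\delta)C_1}{m} \Big)^{1/(2-\tau)} + E(\xi_2).$$

Combining these inequalities for $\xi_1$ and $\xi_2$ with the fact that $E(\xi_1) + E(\xi_2) = E(\xi) = \mathcal{E}^{\phi}(f_\gamma) - \mathcal{E}^{\phi}(f^{\phi}_{\rho})$,
we conclude that for all $z \in U_1 \cap U_2$,

$$\big[ \mathcal{E}^{\phi}_z(f_\gamma) - \mathcal{E}^{\phi}_z(f^{\phi}_{\rho}) \big] - \big[ \mathcal{E}^{\phi}(f_\gamma) - \mathcal{E}^{\phi}(f^{\phi}_{\rho}) \big] \le \frac{5B_\gamma + 2\phi(-1)}{3m}\log\frac{2}{\delta} + \Big( \frac{2\log(2/\delta)C_1}{m} \Big)^{1/(2-\tau)} + \mathcal{E}^{\phi}(f_\gamma) - \mathcal{E}^{\phi}(f^{\phi}_{\rho}).$$

Since the measure of $U_1 \cap U_2$ is at least $1 - \delta$, this bound holds with the confidence claimed.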


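10.5 Bounds for the sample error term involving $f^{\phi}_{z,\gamma}$

The other term of the sample error in (10.9),

$$\big[ \mathcal{E}^{\phi}(\pi(f^{\phi}_{z,\gamma})) - \mathcal{E}^{\phi}(f^{\phi}_{\rho}) \big] - \big[ \mathcal{E}^{\phi}_z(\pi(f^{\phi}_{z,\gamma})) - \mathcal{E}^{\phi}_z(f^{\phi}_{\rho}) \big],$$

involves the function $f^{\phi}_{z,\gamma}$ and thus runs over a set of functions. To
bound it, we use (as we have already done in similar cases) a probability inequality for a
function set in terms of the covering numbers of the set.
The following probability inequality can be proved using the one-side Bernstein inequality
as in Lemma 3.18.
Lemma 10.19 Let $\xi$ be a random variable on $Z$ with mean $\mu$ and variance $\sigma^2$. Assume
that $\mu \ge 0$, $|\xi - \mu| \le B$ almost everywhere, and $\sigma^2 \le c\mu^{\tau}$ for some $0 \le \tau \le 2$ and $c, B \ge 0$.
Then, for every $\varepsilon > 0$,

$$\mathop{\mathrm{Prob}}_{z \in Z^m} \Big\{ \frac{ \mu - \frac{1}{m}\sum_{i=1}^{m} \xi(z_i) }{ \sqrt{\mu^{\tau} + \varepsilon^{\tau}} } > \varepsilon^{1 - \tau/2} \Big\} \le \exp\Big\{ -\frac{m\varepsilon^{2-\tau}}{2\big( c + \frac{1}{3}B\varepsilon^{1-\tau} \big)} \Big\}.$$

Also, the following inequality for a function set can be proved in the same way as Lemma
Lemma 10.20 Let $0 \le \tau \le 1$, $c, B \ge 0$, and $\mathcal{G}$ be a set of functions on $Z$ such that for every
$g \in \mathcal{G}$, $E(g) \ge 0$, $\|g - E(g)\|_{L^\infty_\rho} \le B$, and $E(g^2) \le c(E(g))^{\tau}$. Then, for all $\varepsilon > 0$,

$$\mathop{\mathrm{Prob}}_{z \in Z^m} \Big\{ \sup_{g \in \mathcal{G}} \frac{ E(g) - \frac{1}{m}\sum_{i=1}^{m} g(z_i) }{ \sqrt{(E(g))^{\tau} + \varepsilon^{\tau}} } > 4\varepsilon^{1 - \tau/2} \Big\} \le \mathcal{N}(\mathcal{G}, \varepsilon) \exp\Big\{ -\frac{m\varepsilon^{2-\tau}}{2\big( c + \frac{1}{3}B\varepsilon^{1-\tau} \big)} \Big\}.$$

We can now derive the sample error bounds along the same lines we followed in the
previous chapter for the regression problem.
Lemma 10.21 Let $\tau = \tau_{\phi,\rho}$. For any $R > 0$ and any $\varepsilon > 0$,

$$\mathop{\mathrm{Prob}}_{z \in Z^m} \Big\{ \sup_{f \in B_R} \frac{ \big[ \mathcal{E}^{\phi}(\pi(f)) - \mathcal{E}^{\phi}(f^{\phi}_{\rho}) \big] - \big[ \mathcal{E}^{\phi}_z(\pi(f)) - \mathcal{E}^{\phi}_z(f^{\phi}_{\rho}) \big] }{ \sqrt{ \big( \mathcal{E}^{\phi}(\pi(f)) - \mathcal{E}^{\phi}(f^{\phi}_{\rho}) \big)^{\tau} + \varepsilon^{\tau} } } \le 4\varepsilon^{1 - \tau/2} \Big\} \ge 1 - \mathcal{N}\Big( B_1, \frac{\varepsilon}{R|\phi'_+(-1)|} \Big) \exp\Big\{ -\frac{m\varepsilon^{2-\tau}}{2C_1 + \frac{4}{3}\phi(-1)\varepsilon^{1-\tau}} \Big\}.$$

Proof. Apply Lemma 10.20 to the function set

$$\mathcal{F}_R = \big\{ \phi(y(\pi f)(x)) - \phi(y f^{\phi}_{\rho}(x)) : f \in B_R \big\}.$$

Each function $g \in \mathcal{F}_R$ satisfies $E(g^2) \le c(E(g))^{\tau}$ for $c = C_1$ and $\|g - E(g)\|_\infty \le B := 2\phi(-1)$. Therefore, to draw our conclusion from
Lemma 10.20, we need only bound the covering number $\mathcal{N}(\mathcal{F}_R, \varepsilon)$. To do so, we note that for
$f_1, f_2 \in B_R$ and $(x, y) \in Z$, we have

$$\big| \{ \phi(y(\pi f_1)(x)) - \phi(y f^{\phi}_{\rho}(x)) \} - \{ \phi(y(\pi f_2)(x)) - \phi(y f^{\phi}_{\rho}(x)) \} \big| = \big| \phi(y(\pi f_1)(x)) - \phi(y(\pi f_2)(x)) \big| \le |\phi'_+(-1)|\, \|f_1 - f_2\|_\infty.$$

Therefore

$$\mathcal{N}(\mathcal{F}_R, \varepsilon) \le \mathcal{N}\Big( B_R, \frac{\varepsilon}{|\phi'_+(-1)|} \Big) = \mathcal{N}\Big( B_1, \frac{\varepsilon}{R|\phi'_+(-1)|} \Big),$$

proving the statement.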
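Define $\varepsilon^*(m, R, \delta)$ to be the smallest positive number $\varepsilon$ satisfying

$$\log \mathcal{N}\Big( B_1, \frac{\varepsilon}{R|\phi'_+(-1)|} \Big) - \frac{m\varepsilon^{2-\tau}}{2C_1 + \frac{4}{3}\phi(-1)\varepsilon^{1-\tau}} \le \log\delta. \qquad (10.15)$$

Then the confidence for the error $\varepsilon = \varepsilon^*(m, R, \delta)$ in Lemma 10.21 is at least $1 - \delta$.
For $R > 0$, denote

$$\mathcal{W}(R) = \{ z \in Z^m : \|f^{\phi}_{z,\gamma}\|_K \le R \}.$$

Proposition 10.22 For all $0 < \delta < 1$ and $R > 0$, there is a subset $V_R$ of $Z^m$ with measure at
most $\delta$ such that for all $z \in \mathcal{W}(R) \setminus V_R$, the quantity $\mathcal{E}^{\phi}(\pi(f^{\phi}_{z,\gamma})) - \mathcal{E}^{\phi}(f^{\phi}_{\rho}) + \gamma \|f^{\phi}_{z,\gamma}\|_K^2$ is
bounded by

$$4\mathcal{D}(\gamma, \phi) + 24\varepsilon^*(m, R, \delta/2) + \frac{10B_\gamma + 4\phi(-1)}{3m}\log\frac{4}{\delta} + 2\Big( \frac{2C_1 \log(4/\delta)}{m} \Big)^{1/(2-\tau)}.$$

Proof. Lemma 10.17 implies that for $0 < \tau \le 1$,

$$4\varepsilon^{1-\tau/2} \sqrt{ \big( \mathcal{E}^{\phi}(\pi(f)) - \mathcal{E}^{\phi}(f^{\phi}_{\rho}) \big)^{\tau} + \varepsilon^{\tau} } \le \frac{\tau}{2}\big\{ \mathcal{E}^{\phi}(\pi(f)) - \mathcal{E}^{\phi}(f^{\phi}_{\rho}) \big\} + \Big( 1 - \frac{\tau}{2} \Big) 4^{1/(1-\tau/2)}\varepsilon + 4\varepsilon \le \frac{1}{2}\big\{ \mathcal{E}^{\phi}(\pi(f)) - \mathcal{E}^{\phi}(f^{\phi}_{\rho}) \big\} + 12\varepsilon.$$

Putting this into Lemma 10.21 with $\varepsilon = \varepsilon^*(m, R, \delta/2)$, we deduce that for $z \in \mathcal{W}(R)$, with
confidence $1 - \frac{\delta}{2}$,

$$\big[ \mathcal{E}^{\phi}(\pi(f^{\phi}_{z,\gamma})) - \mathcal{E}^{\phi}(f^{\phi}_{\rho}) \big] - \big[ \mathcal{E}^{\phi}_z(\pi(f^{\phi}_{z,\gamma})) - \mathcal{E}^{\phi}_z(f^{\phi}_{\rho}) \big] \le \frac{1}{2}\big\{ \mathcal{E}^{\phi}(\pi(f^{\phi}_{z,\gamma})) - \mathcal{E}^{\phi}(f^{\phi}_{\rho}) \big\} + 12\varepsilon^*(m, R, \delta/2).$$

Proposition 10.18 with $\delta$ replaced by $\frac{\delta}{2}$ guarantees that with confidence $1 - \frac{\delta}{2}$,
the quantity

$$\big[ \mathcal{E}^{\phi}_z(f_\gamma) - \mathcal{E}^{\phi}_z(f^{\phi}_{\rho}) \big] - \big[ \mathcal{E}^{\phi}(f_\gamma) - \mathcal{E}^{\phi}(f^{\phi}_{\rho}) \big]$$

is bounded by

$$\frac{5B_\gamma + 2\phi(-1)}{3m}\log\frac{4}{\delta} + \Big( \frac{2C_1 \log(4/\delta)}{m} \Big)^{1/(2-\tau)} + \mathcal{E}^{\phi}(f_\gamma) - \mathcal{E}^{\phi}(f^{\phi}_{\rho}).$$

Combining these two bounds with (10.9), we see that for $z \in \mathcal{W}(R)$, with confidence $1 - \delta$,

$$\mathcal{E}^{\phi}(\pi(f^{\phi}_{z,\gamma})) - \mathcal{E}^{\phi}(f^{\phi}_{\rho}) + \gamma \|f^{\phi}_{z,\gamma}\|_K^2 \le \mathcal{D}(\gamma, \phi) + \frac{1}{2}\big\{ \mathcal{E}^{\phi}(\pi(f^{\phi}_{z,\gamma})) - \mathcal{E}^{\phi}(f^{\phi}_{\rho}) \big\} + 12\varepsilon^*(m, R, \delta/2) + \frac{5B_\gamma + 2\phi(-1)}{3m}\log\frac{4}{\delta} + \Big( \frac{2C_1 \log(4/\delta)}{m} \Big)^{1/(2-\tau)} + \mathcal{E}^{\phi}(f_\gamma) - \mathcal{E}^{\phi}(f^{\phi}_{\rho}).$$

Since $\mathcal{E}^{\phi}(f_\gamma) - \mathcal{E}^{\phi}(f^{\phi}_{\rho}) \le \mathcal{D}(\gamma, \phi)$, this gives the desired bound.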

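Lemma 10.23 For all $\gamma > 0$ and $z \in Z^m$,

$$\|f^{\phi}_{z,\gamma}\|_K \le \sqrt{\phi(0)/\gamma}.$$

Proof. Since $f^{\phi}_{z,\gamma}$ minimizes $\mathcal{E}^{\phi}_z(f) + \gamma \|f\|_K^2$ in $\mathcal{H}_K$, choosing $f = 0$ implies that

$$\gamma \|f^{\phi}_{z,\gamma}\|_K^2 \le \mathcal{E}^{\phi}_z(f^{\phi}_{z,\gamma}) + \gamma \|f^{\phi}_{z,\gamma}\|_K^2 \le \mathcal{E}^{\phi}_z(0) + 0 = \frac{1}{m}\sum_{i=1}^{m} \phi(0) = \phi(0).$$

Therefore, $\|f^{\phi}_{z,\gamma}\|_K \le \sqrt{\phi(0)/\gamma}$ for all $z \in Z^m$.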
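Lemma 10.23 can be checked numerically in the least squares case, where the minimizer of (10.1) has a closed form. The sketch below is an illustration under several assumptions that are not part of the text: a Gaussian kernel, synthetic one-dimensional data, and the representation $f = \sum_i c_i K_{x_i}$ with $c = (K + \gamma m I)^{-1} y$, which solves the regularized problem for $\phi_{ls}$ because $(1 - y f(x))^2 = (y - f(x))^2$ when $y = \pm 1$. Since $\phi_{ls}(0) = 1$, the lemma predicts $\|f\|_K \le \sqrt{1/\gamma}$.

```python
import numpy as np

def gaussian_gram(x, sigma=1.0):
    """Gram matrix of a Gaussian kernel on one-dimensional inputs (illustrative choice)."""
    d2 = (x[:, None] - x[None, :]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

def ls_regularized_coefficients(K, y, gamma):
    """Coefficients c of f = sum_i c_i K_{x_i} minimizing
    (1/m) * sum_i (y_i - f(x_i))^2 + gamma * ||f||_K^2."""
    m = len(y)
    return np.linalg.solve(K + gamma * m * np.eye(m), y)

rng = np.random.default_rng(0)
m, gamma = 40, 0.05
x = rng.uniform(-1.0, 1.0, size=m)
y = np.where(x + 0.3 * rng.normal(size=m) > 0, 1.0, -1.0)  # noisy synthetic labels

K = gaussian_gram(x)
c = ls_regularized_coefficients(K, y, gamma)
norm_K = np.sqrt(c @ K @ c)      # ||f||_K^2 = c^T K c
bound = np.sqrt(1.0 / gamma)     # sqrt(phi_ls(0) / gamma) with phi_ls(0) = 1
print(norm_K, bound)
```

On typical data the computed norm is well below the bound, which is consistent with the remark that a sharper norm estimate is worth pursuing.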

By Lemma 10.23, $\mathcal{W}(\sqrt{\phi(0)/\gamma}) = Z^m$. Taking $R := \sqrt{\phi(0)/\gamma}$, we can derive a weak error
bound, as we did in Section 8.3. But we can do better. A bound for the norm $\|f^{\phi}_{z,\gamma}\|_K$ improving
that of Lemma 10.23 can be shown to hold with high probability. To show this is the target of
the next section. Note that we could now wrap the results in this and the two preceding
sections into a single statement bounding the excess misclassification error $\mathcal{R}(\mathrm{sgn}(f^{\phi}_{z,\gamma})) - \mathcal{R}(f_c)$.
We actually do that, in Corollary 10.25, once we have obtained a better bound for the
norm $\|f^{\phi}_{z,\gamma}\|_K$.


10.6 Stronger error bounds

In this section we derive bounds for $\mathcal{E}^{\phi}(\pi(f^{\phi}_{z,\gamma})) - \mathcal{E}^{\phi}(f^{\phi}_{\rho})$, improving those that would follow
from the preceding sections, at the cost of a few mild assumptions.
Theorem 10.24 Assume the following with positive constants $p$, $C_0$, $C_\phi$, $A$, $q \ge 1$, and $\beta \le 1$.
(i) $K$ satisfies $\log \mathcal{N}(B_1, \eta) \le C_0 (1/\eta)^{p}$.
(ii) $\phi(t) \le C_\phi |t|^{q}$ for all $t \notin (-1, 1)$.
(iii) $\mathcal{D}(\gamma, \phi) \le A\gamma^{\beta}$ for each $\gamma > 0$.
Choose $\gamma = m^{-\zeta}$ with $\zeta = \frac{1}{\beta + q(1-\beta)/2}$. Then, for all $0 < \eta < \frac{1}{2}$ and all $0 < \delta < 1$, with
confidence $1 - \delta$,

$$\mathcal{E}^{\phi}(\pi(f^{\phi}_{z,\gamma})) - \mathcal{E}^{\phi}(f^{\phi}_{\rho}) \le C_\eta \log\frac{2}{\delta}\, m^{-\theta},$$

0 := min

1 - pC r
p + q(1 - P)/2 2 - T + p
s :=
2(1 + p)
Z - 1/(2 - T + p) 2(1 - s)
1 2

and $C_\eta$ is a constant depending on $\eta$ and the constants in conditions (i)-(iii), but not
on $m$ or $\delta$.
The following corollary follows from Theorems 10.5 and 10.24.
Corollary 10.25 Under the hypothesis and with the notations of Theorem 10.24, if
$\phi''(0) > 0$, then, with confidence at least $1 - \delta$, we have

$$\mathcal{R}(\mathrm{sgn}(f^{\phi}_{z,\gamma})) - \mathcal{R}(f_c) \le c_\phi \sqrt{C_\eta \log\frac{2}{\delta}}\; m^{-\theta/2}.$$

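When the kernel is $C^\infty$ on $X \subset \mathbb{R}^n$, we know (cf. Theorem 5.1(i)) that $p$ in Theorem 10.24(i)
can be arbitrarily small. We thus get the following result.
Corollary 10.26 Assume that $K$ is $C^\infty$ on $X \times X$ and $\phi(t) \le C_\phi |t|^{q}$ for all $t \notin (-1, 1)$ and
some $q \ge 1$. If $\mathcal{D}(\gamma, \phi) \le A\gamma^{\beta}$ for all $\gamma > 0$ and some $0 < \beta \le 1$, choose $\gamma = m^{-\zeta}$ with
$\zeta = \frac{1}{\beta + q(1-\beta)/2}$. Then for any $0 < \eta < 1$ and $0 < \delta < 1$, with confidence $1 - \delta$,

$$\mathcal{E}^{\phi}(\pi(f^{\phi}_{z,\gamma})) - \mathcal{E}^{\phi}(f^{\phi}_{\rho}) \le C_\eta \log\frac{2}{\delta}\, m^{-\theta},$$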

r := max
+ n,
Z, Z
0 : = min
p + q(1
p)/2 2 and

is a constant depending on n, but not on m or 8.

Theorem 10.2 follows from Corollary 10.26, Corollary 10.14, and Theorem 9.21 by taking
$f_\gamma = \widetilde{f}_\gamma$ defined in (10.10).
Theorem 10.3 is a consequence of Corollary 10.26 and Theorem 10.5. In this case, in
addition, we can take $q = 2$, which implies $\zeta = 1$.
The proof of Theorem 10.24 will follow from several lemmas. The idea is to find a radius $R$
such that $\mathcal{W}(R)$ is close to $Z^m$ with high probability.
First we establish a bound for the number $\varepsilon^*(m, R, \delta)$.
Lemma 10.27 Assume K satisfies log N (B1, n) < C0(1 /nf for some p > 0. Then for R > 1
and 0 <8 < 2 ,the quantity e*(m, R, 8) defined by (10.15) can be bounded by
e*(m,R,8) < C2 |(^
RP \1/(1+ ) RP \!/(2- +p)
where C2 := (60(-1) + 8 C1 + 1 )(C0 + 1)(|0/(-1)| + 1).
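Proof. Using the covering number assumption, we see from (10.15) that $\varepsilon^*(m, R, \delta) \le \Delta$, where
$\Delta$ is the unique positive number $\varepsilon$ satisfying

$$C_0 \Big( \frac{R|\phi'_+(-1)|}{\varepsilon} \Big)^{p} - \frac{m\varepsilon^{2-\tau}}{2C_1 + \frac{4}{3}\phi(-1)\varepsilon^{1-\tau}} = \log\delta.$$

We can rewrite this equation as

$$\varepsilon^{2-\tau+p} - \frac{4\phi(-1)\log(1/\delta)}{3m}\varepsilon^{1-\tau+p} - \frac{2C_1\log(1/\delta)}{m}\varepsilon^{p} - \frac{4\phi(-1)C_0 (R|\phi'_+(-1)|)^{p}}{3m}\varepsilon^{1-\tau} - \frac{2C_1 C_0 (R|\phi'_+(-1)|)^{p}}{m} = 0.$$

Applying Lemma 7.2 with $d = 4$ to this equation, we find that the solution $\Delta$ satisfies

$$\Delta \le \max\Big\{ \frac{16\phi(-1)\log(1/\delta)}{3m},\ \Big( \frac{8C_1\log(1/\delta)}{m} \Big)^{1/(2-\tau)},\ \Big( \frac{16\phi(-1)C_0 (R|\phi'_+(-1)|)^{p}}{3m} \Big)^{1/(1+p)},\ \Big( \frac{8C_1 C_0 (R|\phi'_+(-1)|)^{p}}{m} \Big)^{1/(2-\tau+p)} \Big\}.$$

Therefore the desired bound for $\varepsilon^*(m, R, \delta)$ follows.

The following lemma is a consequence of Lemma 10.27 and Proposition 10.22.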
Lemma 10.28 Under the assumptions of Theorem 10.24, choose Y = m-Z for some Z > 0.
Then, for any 0 < 8 < 1 and R > 1, there is a set VR Zm with measure at most 8 such
that for m > (C2KA)-1/(z(1-P)) ,
W(R) W(amRs + bm) U VR,
where s:= 2 (\ +p),am := 5VC2m 1 2 T+p))/2, and bm := C3 ^log
r := max
Z - 1/(2 - T) 2



Z, Z

1 q
- + - (1
2 + 4(

Here the constants are

C3 = VC2 + 2 CJ + ^-+2y/A +^/C;cK 2A 4.
Y y fz
Z,Y IiK < 4AYP + 24e*(m,R, 8/2) +
10BY + 4;(-1) 3m
log (4/8)
Proof. By Proposition 10.22, there is a set VR with measure at most 8 such that for all z e W(R) \
/(4 2t)



2 C1

log (4/8)\


Since $(t )





C^ \t \q

for each t (-1,1), we see that


||)} is bounded by C ^ (max{||


||,1}) . But the assumption D(y, $) <


AY^ implies that

II fy 11 < CK || fy ||K < CKy/VYWY < CK VA y( P - 1) / 2 .
Under this restriction it follows from Lemma 10.27 that z e W(R) for any R satisfying
/(2 T)
1= UAY + (24 C2 + 4Cl
+ 40(-1)) (10g(m/S))
1/(2 T)

V .
+24C2m-1/(2-T +P)RP/(1+P) + ^C^CqKAq/2YV(P-1)/2
Taking Y = m Z, we see that we can choose

= ^/C2m (z -1 /( 2 -T +P)) / 2 r P/( 2 (1 +P))

This proves the lemma.


+ 1,

+ C3

log 4

m r.

4\ t / 2

2 (2 T + p)
Jn := log2 max
Lemma 10.29 Under the assumptions of Theorem 10.24, take Y = m-Z for some Z > 0
andletm > (C2KA)-l/(z(l-^')'1. Then,forany n > 0 and0 < 8 < 1, the set W(R*) has
measure at least l - Jn8, where R* = C4mr ,
r* := max
Z 1/(2 T + p) , r,
2(1 s)
The constant C4 is given by
C4 = (STC^) 2 (f(0) + l) + Jn (5yC2) 2 C3 (log 4 )' .
Proof. Let J be a positive integer that will be determined later. Define a sequence {R( j) }j=0 by
R( 0 ) =
(0)/Y and Rj = am (R( j-r> )s + bm for
R )

+ (Rm)1+s+s2+-+si1 j=0

l < j < J. Then we have

(10.17)The first term on the right-hand side of (10.17) equals
1 S
1 s z1/ ( 2 T +P) 1 sJ
^5 C2
which, since
, \2





m2 s ,

< s < 2 , is bounded by

Z 1/(2T +P)

hy/C^j (0(0) + 1 )m 2o)


2 ( 1 s)

(LZ1/(2T +P) \ SJ

When 2J > max{Z, 2-1+P)v}, this upper bound is controlled by

2(1 s) +
The second term on the right-hand side of (10.17) equals
Z1/(2 T+p
1 sj
Z1 /(2 T+p)
2m 2
(C3 (log(4/5)) 1/ 2 m^)s ,
which is bounded by
Z1/(2 T+p)
m 2 ( 1 s)
(5-/C2) 2 C3 (log(4/5)) 1 /2 m
Z1 /(2 T+p)
(VC^) 2 (0 (0 ) + 1 ) mL
Z1 /(2 T+p)
m 2(1s)
J (5vC) C3 (log(4/5)) 1 /2 (VC^) 2 C3 (log(4/5)) 1 /2 mr.
Z 1/(2T+p)
mr 2(1s)
If r > Z 12/((12^sr)+p), this last expression is bounded by
If r < Z 12/((12^s)+p) , an upper bound is easier:
Z1/(2 T+p)
m 2(1s) J
(57C2)2 C3 (log(4/5))1/2.
Thus, in either case, the second term has the upper bound J (5VC^ C (log(4/5)) / mr*.

1 2

Combining the bounds for the two terms, we have R(J} <
5TC^)2 (0(0) + 1) + J (5-/C2)2
C3 (log(4/5)) m .Taking J to be Jn, we have 2 > max{Z, (2-x+P)n} and we finish the proof.
The proof of Theorem 10.24 follows from Lemmas 10.27 and 10.29 and Proposition 10.22.
The constant $C_\eta$ can be explicitly obtained.
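The iteration used in the proof of Lemma 10.29 starts from a crude radius and repeatedly applies a map of the form $R \mapsto a_m R^{s} + b_m$ with $0 < s < 1$. The sketch below runs such a recursion with made-up values of the three parameters (they are hypothetical and are not the constants of Lemma 10.28) to show how quickly a large initial radius is brought down to the order of the fixed point.

```python
import math

def iterate_radius(R0, a, b, s, steps):
    """Apply R -> a * R**s + b repeatedly, starting from R0, and record every iterate."""
    history = [R0]
    R = R0
    for _ in range(steps):
        R = a * R ** s + b
        history.append(R)
    return history

# Hypothetical parameters, for illustration only.
radii = iterate_radius(R0=1e6, a=2.0, b=1.0, s=0.5, steps=80)

# For a = 2, b = 1, s = 1/2 the fixed point solves R = 2 * sqrt(R) + 1,
# that is sqrt(R) = 1 + sqrt(2), so R = 3 + 2 * sqrt(2).
fixed_point = 3.0 + 2.0 * math.sqrt(2.0)
print(radii[:5], radii[-1], fixed_point)
```

Because the exponent is below 1, each step roughly takes a root of the previous radius, so a number of steps that is logarithmic in the size of the initial exponent already suffices to reach the final order of magnitude.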

10.7 Improving learning rates by imposing noise


There is a difference in the learning rates given by Theorem 10.2 (where the best rate is $\frac{1}{2} - \varepsilon$)
and Theorem 9.26 (where the rate can be arbitrarily close to 1). This motivates the idea of
improving the learning rates stated in this chapter by imposing some conditions on the
measures. In this section we introduce one possible such condition.
Definition 10.30 Let $0 \le q \le \infty$. We say that $\rho$ has Tsybakov noise exponent $q$ if there exists a
constant $c_q > 0$ such that for all $t > 0$,

$$\rho_X(\{ x \in X : |f_\rho(x)| \le c_q t \}) \le t^{q}. \qquad (10.18)$$

All distributions have at least noise exponent 0 since $t^0 = 1$. Deterministic distributions
(which satisfy $|f_\rho(x)| \equiv 1$) have noise exponent $q = \infty$ with $c_\infty = 1$.
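A simple example may help fix ideas. Suppose, purely for illustration, that the marginal measure is uniform on $[0, 1]$ and $f_\rho(x) = 2x - 1$. Then the set where $|f_\rho(x)| \le t$ is an interval of length $\min\{t, 1\}$, so the condition of Definition 10.30 holds with exponent $q = 1$ and $c_q = 1$. The sketch below confirms this on a fine midpoint grid standing in for the uniform measure; the helper name is hypothetical.

```python
import numpy as np

def low_margin_mass(f_rho_values, threshold):
    """Fraction of points with |f_rho(x)| <= threshold (empirical measure of the set)."""
    return float(np.mean(np.abs(f_rho_values) <= threshold))

n = 100_000
x = (np.arange(n) + 0.5) / n   # midpoint grid approximating the uniform measure on [0, 1]
f_rho = 2.0 * x - 1.0          # illustrative regression function

masses = {t: low_margin_mass(f_rho, t) for t in (0.05, 0.1, 0.5)}
print(masses)
```

Each computed mass is approximately equal to the threshold itself, in line with noise exponent 1 for this example.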
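The Tsybakov noise condition improves the variancing power $\tau_{\phi,\rho}$. Let us show this for the
hinge loss.
Lemma 10.31 Let $0 \le q \le \infty$. If $\rho$ has Tsybakov noise exponent $q$ with (10.18) valid, then,
for every function $f : X \to [-1, 1]$,

$$E\big\{ \big( \phi_h(y f(x)) - \phi_h(y f_c(x)) \big)^2 \big\} \le 8 \Big( \frac{1}{2c_q} \Big)^{q/(q+1)} \big( \mathcal{E}^{\phi_h}(f) - \mathcal{E}^{\phi_h}(f_c) \big)^{q/(q+1)}.$$

Proof. Since $f(x) \in [-1, 1]$, we have $\phi_h(y f(x)) - \phi_h(y f_c(x)) = y(f_c(x) - f(x))$. It follows that

$$\mathcal{E}^{\phi_h}(f) - \mathcal{E}^{\phi_h}(f_c) = \int_X (f_c(x) - f(x)) f_\rho(x)\, d\rho_X = \int_X |f_c(x) - f(x)|\, |f_\rho(x)|\, d\rho_X$$

and

$$E\big\{ \big( \phi_h(y f(x)) - \phi_h(y f_c(x)) \big)^2 \big\} = \int_X |f_c(x) - f(x)|^2\, d\rho_X.$$

Let $t > 0$ and separate the domain $X$ into two sets: $X_t^+ := \{ x \in X : |f_\rho(x)| > c_q t \}$ and $X_t^- := \{ x \in X : |f_\rho(x)| \le c_q t \}$.
On $X_t^+$ we have $|f_c(x) - f(x)|^2 \le 2|f_c(x) - f(x)| \frac{|f_\rho(x)|}{c_q t}$. On $X_t^-$ we
have $|f_c(x) - f(x)|^2 \le 4$. It follows from (10.18) that

$$\int_X |f_c(x) - f(x)|^2\, d\rho_X \le \frac{2\big( \mathcal{E}^{\phi_h}(f) - \mathcal{E}^{\phi_h}(f_c) \big)}{c_q t} + 4\rho_X(X_t^-) \le \frac{2\big( \mathcal{E}^{\phi_h}(f) - \mathcal{E}^{\phi_h}(f_c) \big)}{c_q t} + 4t^{q}.$$

Choosing $t = \big\{ (\mathcal{E}^{\phi_h}(f) - \mathcal{E}^{\phi_h}(f_c))/(2c_q) \big\}^{1/(q+1)}$ yields the desired bound.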
Lemma 10.31 tells us that the variancing power $\tau_{\phi_h,\rho}$ of the hinge loss equals $\frac{q}{q+1}$ when
the measure $\rho$ has Tsybakov noise exponent $q$. Combining this with Corollary 10.26 gives the
following result on improved learning rates for measures satisfying the Tsybakov noise
condition.
Theorem 10.32 Under the assumption of Theorem 10.2, if $\rho$ has Tsybakov noise
exponent $q$ with $0 \le q \le \infty$, then, for any $0 < \varepsilon < \frac{1}{2}$ and $0 < \delta < 1$, with confidence $1 - \delta$, we have

$$\mathcal{R}(\mathrm{sgn}(f^{\phi_h}_{z,\gamma})) - \mathcal{R}(f_c) \le \widetilde{C} \log\frac{2}{\delta} \Big( \frac{1}{m} \Big)^{\theta},$$

where $\theta = \min\big\{ \frac{2\beta}{1+\beta}, \frac{q+1}{q+2} - \varepsilon \big\}$ and $\widetilde{C}$ is a constant independent of $m$ and $\delta$.

In Theorem 10.32, the learning rate can be arbitrarily close to 1 when $q$ is sufficiently
large.


10.8 References and additional remarks

General expositions of convex loss functions for classification can be found in [14,31].
Theorem 10.5, the use of the projection operator, and some estimates for the regularized

error were provided in [31]. The error decomposition for regularization schemes was
introduced in [145].
The convergence of the support vector machine (SVM) 1-norm soft margin classifier for
general probability distributions (without separability conditions) was established in [121]
when $\mathcal{H}_K$ is dense in $C(X)$ (such a kernel $K$ is called universal). Convergence rates in this
situation were derived in [154]. For further results and
references on convergence rates, see the thesis [140].
The error analysis in this chapter is taken from [142], where more technical and better
error bounds are provided by means of the local Rademacher process, empirical covering
numbers, and the entropy integral [84, 132]. The Tsybakov noise condition of Section 10.7
was introduced in [131].
The iteration technique used in the proof of Lemma 10.29 was given in [122] (see also
SVMs have many modifications for various purposes in different fields [134]. These include
q-norm soft margin classifiers [31,77], multiclass SVMs [4, 32, 75, 139], v-SVMs [108], linear
programming SVMs [26, 96, 98, 146], maximum entropy discrimination [65], and one-class
SVMs [107, 128].
We conclude with some brief comments on current trends.
Learning theory is a rapidly growing field. Many people are working on both its
foundations and its applications, from different points of view. This work develops the theory
but also leaves many open questions. Here we mention some involving regularization
schemes [48].
(i) Feature selection. One purpose is to understand structures of high-dimensional data. Topics
include manifold learning or semisupervised learning [15, 23, 27, 34, 45, 97] and
dimensionality reduction (see the introduction [55] of a special issue and references
therein). Another purpose is to determine important features (variables) of functions
defined on huge-dimensional spaces. Two approaches are the filter method and the
wrapper method [69]. Regularization schemes for this purpose include those in [56, 58]
and a least squares-type algorithm in [93] that learns gradients as vector-valued functions.
(ii) Multikernel regularization schemes. Let $\mathcal{K}_\Sigma = \{ K_\sigma : \sigma \in \Sigma \}$ be a set of Mercer kernels on $X$
such as Gaussian kernels with variances $\sigma^2$ running over $(0, \infty)$. The multikernel
regularization scheme associated with $\mathcal{K}_\Sigma$ is defined as

$$f_{z,\gamma,\Sigma} = \arg \inf_{\sigma \in \Sigma}\ \inf_{f \in \mathcal{H}_{K_\sigma}} \Big\{ \frac{1}{m}\sum_{i=1}^{m} V(y_i, f(x_i)) + \gamma \|f\|^2_{K_\sigma} \Big\}.$$

Here $V : \mathbb{R}^2 \to \mathbb{R}_+$ is a general loss function. In [30] SVMs with multiple parameters are
investigated. In [76, 104] mixture-density estimation is considered and Gaussian kernels with
variance $\sigma^2$ varying on an interval $[\sigma_1^2, \sigma_2^2]$ with $0 < \sigma_1 \le \sigma_2 < \infty$ are used to derive bounds.
Multitask learning algorithms involve kernels from a convex hull of several Mercer kernels and
spaces with changing norms (e.g. [49, 62]). The learning of kernel functions is studied in [72,
88, 90].
Another related class of multikernel regularization schemes consists of schemes
generated by polynomial kernels $\{ K_d(x, y) = (1 + x \cdot y)^{d} \}$ with $d \in \mathbb{N}$. In [158] convergence
rates in the univariate case ($n = 1$) for multikernel regularized classifiers generated by
polynomial kernels are derived.

(iii) Online learning algorithms. These algorithms improve the efficiency of learning
methods when the sample size $m$ is very large. Their convergence is investigated in
[28, 51, 52, 68, 134], and their error with respect to the step size has been
analyzed for the least squares regression in [112] and for regularized classification
with a general classifying loss in [151]. Error analysis for online schemes with
varying regularization parameters is performed in [127] and

1; R.A. Adams. Sobolev Spaces. Academic Press, 1975.
2; C.A. Aliprantis and O. Burkinshaw. Principles of Real Analysis. Academic Press, 3rd
edition, 1998.
3; F. Alizadeh and D. Goldfarb. Second-order cone programming. Math. Program., 95:3-51,
4; E.L. Allwein, R.E. Schapire, and Y. Singer. Reducing multiclass to binary: a unifying
approach for margin classifiers. J. Mach. Learn. Res., 1:113-141, 2000.
5; N. Alon, S. Ben-David, N. Cesa-Bianchi, and D. Haussler. Scale-sensitive dimensions,
uniform convergence and learnability. J. ACM, 44:615-631, 1997.
6; M. Anthony and P. Bartlett. Neural Network Learning: Theoretical Foundations.
Cambridge University Press, 1999.
7; M. Anthony and N. Biggs. Computational Learning Theory. Cambridge University Press,
8; M. Anthony and J. Shawe-Taylor. A result of Vapnik with applications. Discrete Appl.
Math., 47:207-217, 1993.
9; N. Aronszajn. Theory of reproducing kernels. Trans. Amer. Math. Soc., 68:337-404, 1950.
10; A.R. Barron. Complexity regularization with applications to artificial neural networks. In G.
Roussas, editor, Nonparametric Functional Estimation, pages 561-576. Kluwer
Academic Publishers, 1990.
11; R.G. Bartle. The Elements of Real Analysis. John Wiley & Sons, 2nd edition, 1976.
12; P.L. Bartlett. The sample complexity of pattern classification with neural networks: the size
of the weights is more important than the size of the network. IEEE Trans. Inform.
Theory, 44:525-536, 1998.
13; P.L. Bartlett, O. Bousquet, and S. Mendelson. Local Rademacher complexities. Ann. Stat.,
33:1497-1537, 2005.
14; P.L. Bartlett, M.I. Jordan, and J.D. McAuliffe. Convexity, classification, and risk bounds. J.
Amer. Stat. Ass., 101:138-156, 2006.
15; M. Belkin and P. Niyogi. Semi-supervised learning on Riemannian manifolds. Mach.
Learn., 56:209-239, 2004.
16; J. Bergh and J. Lofstrom. Interpolation Spaces: An Introduction. Springer-Verlag, 1976.
17; P. Binev, A. Cohen, W. Dahmen, R. DeVore, and V. Temlyakov. Universal algorithms for
learning theory. Part I: piecewise constant functions. J. Mach. Learn. Res., 6:1297-1321,
18; C.M. Bishop. Neural Networks for Pattern Recognition. Cambridge University Press,
19; L. Blum, F. Cucker, M. Shub, and S. Smale. Complexity and Real Computation.
Springer-Verlag, 1998.
20; B.E. Boser, I. Guyon, and V. Vapnik. A training algorithm for optimal margin classifiers. In
Proceedings of the Fifth Annual Workshop of Computational Learning Theory,
pages 144-152. Association for Computing Machinery, New York, 1992.
21; S. Boucheron, O. Bousquet, and G. Lugosi. Concentration inequalities. In O. Bousquet, U.
von Luxburg, and G. Ratsch, editors, Advanced Lectures in Machine Learning, pages
208-240. Springer-Verlag, 2004.
22; S. Boucheron, G. Lugosi, and P. Massart. A sharp concentration inequality with applications
in random combinatorics and learning. Random Struct. Algorithms, 16:277-292, 2000.
23; O. Bousquet, O. Chapelle, and M. Hein. Measure based regularizations. In S. Thrun, L.K.
Saul, and B. Scholkopf, editors, Advances in Neural Information Processing
Systems, volume 16, pages 1221-1228. MIT Press, 2004.
24; O. Bousquet and A. Elisseeff. Stability and generalization. J. Mach. Learn. Res., 2:499-526, 2002.
25; S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

26; P.S. Bradley and O.L. Mangasarian. Massive data discrimination via linear support vector
machines. Optim. Methods and Softw., 13:1-10, 2000.
27; A. Caponnetto and S. Smale. Risk bounds for random regression graphs. To appear at
Found. Comput. Math.
28; N. Cesa-Bianchi, P.M. Long, and M.K. Warmuth. Worst-case quadratic loss bounds for
prediction using linear functions and gradient descent. IEEE Trans. Neural Networks,
7:604-619, 1996.
29; N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University
Press, 2006.
30; O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee. Choosing multiple parameters for
support vector machines. Mach. Learn., 46:131-159, 2002.
31; D.R. Chen, Q. Wu, Y. Ying, and D.X. Zhou. Support vector machine soft margin classifiers:
error analysis. J. Mach. Learn. Res., 5:1143-1175, 2004.
32; D.R. Chen and D.H. Xiang. The consistency of multicategory support vector machines.
Adv. Comput. Math., 24:155-169, 2006.
33; D.A. Cohn, Z. Ghahramani, and M.I. Jordan. Active learning with statistical models. J. Artif.
Intell. Res., 4:129-145, 1996.
34; R.R. Coifman, S. Lafon, A.B. Lee, M. Maggioni, B. Nadler, F. Warner, and S.W. Zucker.
Geometric diffusions as a tool for harmonic analysis and structure definition of data:
diffusion maps. Proc. Natl. Acad. Sci., 102:7426-7431, 2005.
35; C. Cortes and V. Vapnik. Support-vector networks. Mach. Learn., 20:273-297, 1995.
36; D.D. Cox. Approximation of least squares regression on nested subspaces. Ann. Stat.,
16:713-732, 1988.
37; N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines.
Cambridge University Press, 2000.
38; F. Cucker and S. Smale. Best choices for regularization parameters in learning theory.
Found. Comput. Math., 2:413-428, 2002.
39; F. Cucker and S. Smale. On the mathematical foundations of learning. Bull. Amer. Math.
Soc., 39:1-49, 2002.
40; L. Debnath and P. Mikusinski. Introduction to Hilbert Spaces with Applications.
Academic Press, 2nd edition, 1999.
41; C. de Boor, K. Hollig, and S. Riemenschneider. Box Splines. Springer-Verlag, 1993.
42; E. De Vito, A. Caponnetto, and L. Rosasco. Model selection for regularized least-squares
algorithm in learning theory. Found. Comput. Math., 5:59-85, 2005.
43; E. De Vito, L. Rosasco, A. Caponnetto, U. de Giovannini, and F. Odone. Learning from
examples as an inverse problem. J. Mach. Learn. Res., 6:883-904, 2005.
44; L. Devroye, L. Gyorfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition.
Springer-Verlag, 1996.
45; D.L. Donoho and C. Grimes. Hessian eigenmaps: locally linear embedding techniques for
high-dimensional data. Proc. Natl. Acad. Sci., 100:5591-5596,
46; R.M. Dudley, E. Gine, and J. Zinn. Uniform and universal Glivenko-Cantelli classes. J.
Theor. Prob., 4:485-510, 1991.
47; D.E. Edmunds and H. Triebel. Function Spaces, Entropy Numbers, Differential
Operators. Cambridge University Press, 1996.
48; H.W. Engl, M. Hanke, and A. Neubauer. Regularization of Inverse Problems, volume
375 of Mathematics and Its Applications. Kluwer, 1996.
49; T. Evgeniou and M. Pontil. Regularized multi-task learning. In C.E. Brodley, editor, Proc.
17th SIGKDD Conf. Knowledge Discovery and Data Mining, Association for
Computing Machinery, New York, 2004.
50; T. Evgeniou, M. Pontil, and T. Poggio. Regularization networks and support vector
machines. Adv. Comput. Math., 13:1-50, 2000.
51; J. Forster and M.K. Warmuth. Relative expected instantaneous loss bounds. J. Comput.
Syst. Sci., 64:76-102, 2002.
52; Y. Freund and R.E. Schapire. A decision-theoretic generalization of on-line learning and an
application to boosting. J. Comput. Syst. Sci., 55:119-139, 1997.
53; F. Girosi, M. Jones, and T. Poggio. Regularization theory and neural networks architectures.
Neural Comp., 7:219-269, 1995.
54; G. Golub, M. Heath, and G. Wahba. Generalized cross-validation as a method for choosing a
good ridge parameter. Technometrics, 21:215-223, 1979.
55; I. Guyon and A. Elisseeff. An introduction to variable and feature selection. J. Mach.
Learn. Res., 3:1157-1182, 2003.
56; I. Guyon, J. Weston, S. Barnhill, and V. Vapnik. Gene selection for cancer classification
using support vector machines. Mach. Learn., 46:389-422, 2002.

57; L. Györfi, M. Kohler, A. Krzyżak, and H. Walk. A Distribution-Free Theory of
Nonparametric Regression. Springer-Verlag, 2002.
58; D. Hardin, I. Tsamardinos, and C.F. Aliferis. A theoretical characterization of linear
SVM-based feature selection. In Proc. 21st Int. Conf. Machine Learning,
59; T. Hastie, R.J. Tibshirani, and J.H. Friedman. The Elements of Statistical Learning.
Springer-Verlag, 2001.
60; D. Haussler. Decision theoretic generalizations of the PAC model for neural net and other
learning applications. Inform. and Comput., 100:78-150, 1992.
61; R. Herbrich. Learning Kernel Classifiers: Theory and Algorithms. MIT Press, 2002.
62; M. Herbster. Relative loss bounds and polynomial-time predictions for the k-lms-net
algorithm. In S. Ben-David, J. Case, and A. Maruoka, editors, Proc. 15th Int. Conf.
Algorithmic Learning Theory, Springer, 2004.
63; H. Hochstadt. Integral Equations. John Wiley & Sons, 1973.
64; V.V. Ivanov. The Theory of Approximate Methods and Their Application to the
Numerical Solution of Singular Integral Equations. Noordhoff International, 1976.
65; T. Jaakkola, M. Meila, and T. Jebara. Maximum entropy discrimination. In S.A. Solla, T.K.
Leen, and K.-R. Müller, editors, Advances in Neural Information Processing Systems,
volume 12, pages 470-476. MIT Press, 2000.
66; K. Jetter, J. Stöckler, and J.D. Ward. Error estimates for scattered data interpolation on
spheres. Math. Comp., 68:733-747, 1999.
67; M.J. Kearns and U.V. Vazirani. An Introduction to Computational Learning Theory.
MIT Press, 1994.
68; J. Kivinen, A.J. Smola, and R.C. Williamson. Online learning with kernels. IEEE Trans.
Signal Processing, 52:2165-2176, 2004.
69; R. Kohavi and G. John. Wrappers for feature subset selection. Artif. Intell., 97:273-324,
70; A.N. Kolmogorov and S.V. Fomin. Introductory Real Analysis. Dover Publications, 1975.
71; V. Koltchinskii and D. Panchenko. Empirical margin distributions and bounding the
generalization error of combined classifiers. Ann. Stat., 30:1-50, 2002.
72; G.R.G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M.I. Jordan. Learning the kernel
matrix with semidefinite programming. J. Mach. Learn. Res., 5:27-72, 2004.
73; P. Lax. Functional Analysis. John Wiley & Sons, 2002.
74; W.-S. Lee, P. Bartlett, and R. Williamson. The importance of convexity in learning with
squared loss. IEEE Trans. Inform. Theory, 44:1974-1980, 1998.
75; Y. Lee, Y. Lin, and G. Wahba. Multicategory support vector machines, theory, and
application to the classification of microarray data and satellite radiance data. J. Amer.
Stat. Ass., 99:67-81, 2004.
76; J. Li and A. Barron. Mixture density estimation. In S.A. Solla, T.K. Leen, and K.-R. Müller,
editors, Advances in Neural Information Processing Systems, volume 12, pages
279-285. Morgan Kaufmann Publishers, 1999.
77; Y. Lin. Support vector machines and the Bayes rule in classification. Data Min. Knowl.
Discov., 6:259-275, 2002.
78; G.G. Lorentz. Approximation of Functions. Holt, Rinehart and Winston, 1966.
79; F. Lu and H. Sun. Positive definite dot product kernels in learning theory. Adv. Comput.
Math., 22:181-198, 2005.
80; G. Lugosi and N. Vayatis. On the Bayes-risk consistency of regularized boosting methods.
Ann. Stat., 32:30-55, 2004.
81; D.J.C. Mackay. Information-based objective functions for active data selection. Neural
Comp., 4:590-604, 1992.
82; W.R. Madych and S.A. Nelson. Bounds on multivariate polynomials and exponential error
estimates for multiquadric interpolation. J. Approx. Theory, 70:94-114, 1992.
83; C. McDiarmid. Concentration. In M. Habib et al., editors, Probabilistic Methods for
Algorithmic Discrete Mathematics, pages 195-248. Springer-Verlag, 1998.
84; S. Mendelson. Improving the sample complexity using global data. IEEE Trans. Inform.
Theory, 48:1977-1991, 2002.
85; J. Mercer. Functions of positive and negative type and their connection with the theory of
integral equations. Philos. Trans. Roy. Soc. London Ser. A, 209:415-446, 1909.
86; C.A. Micchelli. Interpolation of scattered data: distance matrices and conditionally positive
definite functions. Constr. Approx., 2:11-22, 1986.
87; C.A. Micchelli and A. Pinkus. Variational problems arising from balancing several error
criteria. Rend. Math. Appl., 14:37-86, 1994.
88; C.A. Micchelli and M. Pontil. Learning the kernel function via regularization. J. Mach.
Learn. Res., 6:1099-1125, 2005.

89; C.A. Micchelli and M. Pontil. On learning vector-valued functions. Neural Comp., 17:177-204, 2005.
90; C.A. Micchelli, M. Pontil, Q. Wu, and D.X. Zhou. Error bounds for learning the kernel.
Preprint, 2006.
91; M. Mignotte. Mathematics for Computer Algebra. Springer-Verlag, 1992.
92; T.M. Mitchell. Machine Learning. McGraw-Hill, 1997.
93; S. Mukherjee and D.X. Zhou. Learning coordinate covariances via gradients. J. Mach.
Learn. Res., 7:519-549, 2006.
94; F.J. Narcowich, J.D. Ward, and H. Wendland. Refined error estimates for radial basis function
interpolation. Constr. Approx., 19:541-564, 2003.
95; P. Niyogi. The Informational Complexity of Learning. Kluwer Academic Publishers,
96; P. Niyogi and F. Girosi. On the relationship between generalization error, hypothesis
complexity and sample complexity for radial basis functions. Neural Comp., 8:819-842, 1996.
97; P. Niyogi, S. Smale, and S. Weinberger. Finding the homology of submanifolds with high
confidence from random samples. Preprint, 2004.
98; J.P. Pedroso and N. Murata. Support vector machines with different norms: motivation,
formulations and results. Pattern Recognit. Lett., 22:1263-1272, 2001.
99; I. Pinelis. Optimum bounds for the distributions of martingales in Banach spaces. Ann.
Probab., 22:1679-1706, 1994.
100; A. Pinkus. N-widths in Approximation Theory. Springer-Verlag, 1996.
101; A. Pinkus. Strictly positive definite kernels on a real inner product space. Adv. Comput.
Math., 20:263-271, 2004.
102; T. Poggio, V. Torre, and C. Koch. Computational vision and regularization theory. Nature,
317:314-319, 1985.
103; D. Pollard. Convergence of Stochastic Processes. Springer-Verlag, 1984.
104; A. Rakhlin, D. Panchenko, and S. Mukherjee. Risk bounds for mixture density estimation.
ESAIM: Prob. Stat., 9:220-229, 2005.
105; R. Schaback. Reconstruction of multivariate functions from scattered data. Manuscript,
106; I.J. Schoenberg. Metric spaces and completely monotone functions. Ann. Math., 39:811-841, 1938.
107; B. Scholkopf and A.J. Smola. Learning with Kernels: Support Vector Machines,
Regularization, Optimization, and Beyond. MIT Press, 2002.
108; B. Scholkopf, A.J. Smola, R.C. Williamson, and P.L. Bartlett. New support vector algorithms.
Neural Comp., 12:1207-1245, 2000.
109; I.R. Shafarevich. Basic Algebraic Geometry. 1: Varieties in Projective Space.
Springer-Verlag, 2nd edition, 1994.
110; J. Shawe-Taylor, P.L. Bartlett, R.C. Williamson, and M. Anthony. Structural risk minimization
over data dependent hierarchies. IEEE Trans. Inform. Theory, 44:1926-1940, 1998.
111; J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge
University Press, 2004.
112; S. Smale and Y. Yao. Online learning algorithms. Found. Comput. Math., 6:145-170,
113; S. Smale and D.X. Zhou. Estimating the approximation error in learning theory. Anal.
Appl., 1:17-41, 2003.
114; S. Smale and D.X. Zhou. Shannon sampling and function reconstruction from point values.
Bull. Amer. Math. Soc., 41:279-305, 2004.
115; S. Smale and D.X. Zhou. Shannon sampling II: Connections to learning theory. Appl.
Comput. Harmonic Anal., 19:285-302, 2005.
116; S. Smale and D.X. Zhou. Learning theory estimates via integral operators and their
approximations. To appear in Constr. Approx.
117; A. Smola, B. Scholkopf, and R. Herbrich. A generalized representer theorem. Comput.
Learn. Theory, 14:416-426, 2001.
118; A. Smola, B. Scholkopf, and K.R. Muller. The connection between regularization operators
and support vector kernels. Neural Networks, 11:637-649, 1998.
119; M. Sousa Lobo, L. Vandenberghe, S. Boyd, and H. Lebret. Applications of second-order
cone programming. Linear Algebra Appl., 284:193-228, 1998.
120; E.M. Stein. Singular Integrals and Differentiability Properties of Functions.
Princeton University Press, 1970.
121; I. Steinwart. Support vector machines are universally consistent. J. Complexity, 18:768-791, 2002.
122; I. Steinwart and C. Scovel. Fast rates for support vector machines. In P. Auer and R. Meir,
editors, Proc. 18th Ann. Conf. Learn. Theory, pages 279-294, Springer
123; H.W. Sun. Mercer theorem for RKHS on noncompact sets. J. Complexity, 21:337-349,
124; R.S. Sutton and A.G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
125; J.A.K. Suykens, T. Van Gestel, J. De Brabanter, B. De Moor, and J. Vandewalle. Least
Squares Support Vector Machines. World Scientific, 2002.
126; M. Talagrand. New concentration inequalities in product spaces. Invent. Math., 126:505-563, 1996.
127; P. Tarres and Y. Yao. Online learning as stochastic approximations of regularization paths.
Preprint, 2005.
128; D.M.J. Tax and R.P.W. Duin. Support vector domain description. Pattern Recognit. Lett.,
20:1191-1199, 1999.
129; M.E. Taylor. Partial Differential Equations I: Basic Theory, volume 115 of Applied
Mathematical Sciences. Springer-Verlag, 1996.
130; A.N. Tikhonov and V.Y. Arsenin. Solutions of Ill-Posed Problems. W.H. Winston, 1977.
131; A.B. Tsybakov. Optimal aggregation of classifiers in statistical learning. Ann. Stat.,
32:135-166, 2004.
132; A.W. van der Vaart and J.A. Wellner. Weak Convergence and Empirical Processes.
Springer-Verlag, 1996.
133; V.N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer-Verlag,
134; V. Vapnik. Statistical Learning Theory. John Wiley & Sons, 1998.
135; V.N. Vapnik and A.Ya. Chervonenkis. On the uniform convergence of relative frequencies
of events to their probabilities. Theory Prob. Appl., 16:264-280, 1971.
136; M. Vidyasagar. Learning and Generalization. Springer-Verlag, 2003.
137; G. Wahba. Spline Models for Observational Data. SIAM, 1990.
138; G. Wahba. Support vector machines, reproducing kernel Hilbert spaces and the
randomized GACV. In B. Scholkopf, C. Burges, and A. Smola, editors, Advances in Kernel
Methods - Support Vector Learning, pages 69-88. MIT Press, 1999.
139; J. Weston and C. Watkins. Multi-class support vector machines. Technical Report
CSD-TR-98-04, Department of Computer Science, Royal Holloway, University of London, 1998.
140; Q. Wu. Classification and regularization in learning theory. PhD thesis, City University of
Hong Kong, 2005.
141; Q. Wu, Y. Ying, and D.X. Zhou. Learning theory: from regression to classification. In K.
Jetter, M. Buhmann, W. Haussmann, R. Schaback, and J. Stoeckler, editors, Topics in
Multivariate Approximation and Interpolation, volume 12 of Studies in
Computational Mathematics, pages 257-290. Elsevier, 2006.
142; Q. Wu, Y. Ying, and D.X. Zhou. Multi-kernel regularized classifiers. To appear in J.
143; Q. Wu, Y. Ying, and D.X. Zhou. Learning rates of least-square regularized regression.
Found. Comput. Math., 6:171-192, 2006.
144; Q. Wu and D.X. Zhou. SVM soft margin classifiers: linear programming versus quadratic
programming. Neural Comp., 17:1160-1187, 2005.
145; Q. Wu and D.X. Zhou. Analysis of support vector machine classification. J. Comput. Anal.
Appl., 8:99-119, 2006.
146; Q. Wu and D.X. Zhou. Learning with sample dependent hypothesis spaces. Preprint, 2006.
147; Z. Wu and R. Schaback. Local error estimates for radial basis function interpolation of
scattered data. IMA J. Numer. Anal., 13:13-27, 1993.
148; Y. Yang and A.R. Barron. Information-theoretic determination of minimax rates of
convergence. Ann. Stat., 27:1564-1599, 1999.
149; G.B. Ye and D.X. Zhou. Fully online classification by regularization. To appear at Appl.
Comput. Harmonic Anal.
150; Y. Ying and D.X. Zhou. Learnability of Gaussians with flexible variances. To appear at J.
Mach. Learn. Res.
151; Y. Ying and D.X. Zhou. Online regularized classification algorithms. To appear in IEEE
Trans. Inform. Theory, 52:4775-4788, 2006.
152; T. Zhang. On the dual formulation of regularized linear systems with convex risks.
Machine Learning, 46:91-129, 2002.
153; T. Zhang. Leave-one-out bounds for kernel methods. Neural Comp., 15:1397-1437, 2003.
154; T. Zhang. Statistical behavior and consistency of classification methods based on convex
risk minimization. Ann. Stat., 32:56-85, 2004.
155; D.X. Zhou. The covering number in learning theory. J. Complexity, 18:739-767, 2002.
156; D.X. Zhou. Capacity of reproducing kernel spaces in learning theory. IEEE Trans. Inform.
Theory, 49:1743-1752, 2003.

157; D.X. Zhou. Density problem and approximation error in learning theory. Preprint,
158; D.X. Zhou and K. Jetter. Approximation with polynomial kernels and SVM classifiers.
Adv. Comput. Math., 25:323-344, 2006.

$\mathrm{Lip}^*(s, L^2(\mathbb{R}^n)) \subset H^s(\mathbb{R}^n) \subset \mathrm{Lip}^*(s - \epsilon, L^2(\mathbb{R}^n))$.
For any integer $d < s - n/2$, each $f \in H^s(\mathbb{R}^n)$ lies in $C^d(\mathbb{R}^n)$ and $\|f\|_{C^d} \le C_d \|f\|_{H^s}$. In particular, if $s > n/2$, it
follows by taking $d = 0$ that $H^s(\mathbb{R}^n) \subset C(\mathbb{R}^n)$. Note that this is the Sobolev embedding theorem
mentioned in Section 2.3 for $X = \mathbb{R}^n$. (These facts can be easily shown using the inverse
Fourier transform when $s > n$ and $d < s - n$. When $n \ge s > n/2$ and $s - n \le d < s - n/2$, the
proofs are more involved.) Thus, if $X \subseteq \mathbb{R}^n$ has piecewise smooth boundary and $d < s - n/2$, each
function $f \in \mathrm{Lip}^*(s, L^2(X))$ can be extended to a $C^d$ function on $\mathbb{R}^n$ and there exists a constant
$C_{X,s,d}$ such that for all $f \in \mathrm{Lip}^*(s, L^2(X))$,