
Machine Learning

Linear regression - model


TYPES OF LEARNING
Topics: supervised learning, classification, regression
REVIEW
• Supervised learning is when there is a target to predict
  ‣ classification: the target is a class index t ∈ {1, ..., K}
    - example: character recognition
      ✓ x: vector of the intensities of all the pixels in the image
      ✓ t: identity of the character
  ‣ regression: the target is a real number t ∈ ℝ
    - example: predicting the value of a stock on the market
      ✓ x: vector containing information about the day's economic activity
      ✓ t: value of a stock the next day
LINEAR REGRESSION MODEL
Topics: model, bias, weights
• The linear regression model is the following:

  y(x, w) = w0 + w1 x1 + ... + wD xD        (3.1)

  where x = (x1, ..., xD)^T, w0 is the bias and w1, ..., wD are the weights

• The prediction therefore corresponds to a hyperplane of dimension D (hence a line if D = 1)
LINEAR REGRESSION MODEL
Topics: 1D example

[Figure: 1D example plotting t against x, with fits labelled M = 0 and M = 1; the annotations highlight the bias w0 (the intercept) and the contribution w1 x of the weight w1.]
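To make the 1D example concrete, here is a minimal sketch (not part of the original slides; it assumes NumPy and uses arbitrary example numbers) of the prediction y(x, w) = w0 + w1 x, where w0 is the intercept of the line and w1 its slope:

    import numpy as np

    # Hypothetical 1D example of y(x, w) = w0 + w1 * x
    w0, w1 = 0.5, 2.0                          # bias (intercept) and weight (slope)
    x = np.array([0.0, 0.25, 0.5, 0.75, 1.0])

    y = w0 + w1 * x                            # in 1D the "hyperplane" is a line
    print(y)                                   # [0.5 1.  1.5 2.  2.5]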
Machine Learning
Linear regression - basis functions
LINEAR REGRESSION MODEL
Topics: model
• The linear regression model is the following:

  y(x, w) = w0 + w1 x1 + ... + wD xD        (3.1)

  where x = (x1, ..., xD)^T

• The prediction therefore corresponds to a hyperplane of dimension D
  ‣ a hyperplane may not be flexible enough to make a good prediction
BASIS FUNCTIONS
Topics: basis functions
• A non-linearity can be introduced as follows:

  y(x, w) = w0 + Σ_{j=1}^{M−1} wj φj(x)        (3.2)

  where the φj(x) are basis functions

• Linear case: φj(x) = xj and M = D + 1
BASIS FUNCTIONS
Topics: basis functions
• To simplify the notation, we will assume that φ0(x) = 1, so that

  y(x, w) = Σ_{j=0}^{M−1} wj φj(x) = w^T φ(x)        (3.3)

  where w = (w0, ..., wM−1)^T and φ(x) = (φ0(x), ..., φM−1(x))^T
BASIS FUNCTIONS
Topics: polynomial basis functions
• Example: polynomial basis functions (1D)

  φj(x) = x^j

• We then recover polynomial regression
BASIS FUNCTIONS
Topics: Gaussian basis functions
• Example: Gaussian basis functions

  φj(x) = exp( −(x − μj)² / (2 s²) )        (3.4)

  where the μj and s must be specified (μj governs the location of the basis function in input space and s its spatial scale)
BASIS FUNCTIONS
Topics: Gaussian basis functions
• Example: Gaussian basis functions
  ‣ examples in 1D

[Figure: examples of basis functions, showing polynomials on the left, Gaussians of the form (3.4) in the centre, and sigmoidal functions of the form (3.5) on the right.]
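As an illustration (not part of the original slides; NumPy is assumed, and the values of M, μj and s below are arbitrary), a sketch of how the polynomial and Gaussian basis functions can be evaluated for a 1D input:

    import numpy as np

    def poly_basis(x, M):
        # phi_j(x) = x**j for j = 0, ..., M-1 (phi_0(x) = 1 is the constant function)
        return np.array([x ** j for j in range(M)])

    def gaussian_basis(x, mus, s):
        # phi_j(x) = exp(-(x - mu_j)**2 / (2 s**2)), preceded by the dummy phi_0(x) = 1
        return np.concatenate(([1.0], np.exp(-(x - mus) ** 2 / (2 * s ** 2))))

    x = 0.3
    print(poly_basis(x, M=4))                               # [1, 0.3, 0.09, 0.027]
    print(gaussian_basis(x, mus=np.linspace(-1, 1, 5), s=0.4))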
Machine Learning
Linear regression - maximum likelihood
MAXIMUM LIKELIHOOD
Topics: probabilistic formulation
• To train the model y(x, w), we will go through a probabilistic formulation:

  p(t | x, w, β) = N(t | y(x, w), β⁻¹)        (3.8)

  ‣ this amounts to assuming that the targets are a noisy version of the prediction of the true model,

    t = y(x, w) + ε        (3.7)

    where ε is Gaussian with mean 0 and variance β⁻¹ (β is the precision, i.e. the inverse variance)
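A minimal sketch of this noise model (not part of the original slides; NumPy is assumed and the "true" parameters are arbitrary): the targets are generated as the model's prediction plus zero-mean Gaussian noise of variance β⁻¹.

    import numpy as np

    rng = np.random.default_rng(0)
    beta = 25.0                               # precision: the noise variance is 1/beta
    w_true = np.array([0.5, 2.0])             # hypothetical true parameters (w0, w1)

    x = rng.uniform(0, 1, size=100)
    y = w_true[0] + w_true[1] * x             # deterministic part y(x, w)
    t = y + rng.normal(0.0, np.sqrt(1.0 / beta), size=x.shape)   # t = y(x, w) + eps  (3.7)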
MAXIMUM LIKELIHOOD
Topics: probabilistic formulation
• Let D = {(x1, t1), ..., (xN, tN)} be our training set
  ‣ we will also write X = {x1, ..., xN} for the inputs and t = (t1, ..., tN)^T for the targets

• Under the i.i.d. assumption, we have:

  p(t | X, w, β) = Π_{n=1}^{N} N(tn | w^T φ(xn), β⁻¹)        (3.10)
regres-
e have used (3.3). Note that innsupervised
=1 learning problems such as regres-
dwe
s.
classification),
have used (3.3).
Thus x will M
weNote
are not
always AXIMUM
that
appear
seeking
in
in the
to model
supervisedDE
set of V the distribution
RAISEMBLANCE
learning problems
conditioning
and classification), we are not seeking to model the distribution
of the
such as
variables,
input
regres-
and
of the so
input
w on Thus
les. we will dropalways
x will the explicit x from
appear in theexpressions such as p(t|x,
set of conditioning w, β)and
variables, in or-
so
ep Sujets:
the notationmaximum de vraisemblance
uncluttered. Taking theexpressions
logarithm ofsuch
the as
likelihood
now on we will drop the explicit x from p(t|x, w,function,
β) in or-
ing use of the standard form (1.46) for the univariate Gaussian, we have
keep • Lors de l’entraînement, on cherche le w maximisant la function,
the notation uncluttered. Taking the logarithm of the likelihood
aking use of the standard form (1.46) for the univariate Gaussian, we have
(log-)probabilité des# données d’entraînement
N
ln p(t|w, β) = #N ln N (tn |w φ(xn ), β
T −1
)
ln p(t|w, β) = n=1 ln N (tn |wT φ(xn ), β −1 )
Nn=1 N
= ln β − ln(2π) − βED (w) (3.11)
2N 2N
= ln β − ln(2π) − βED (w) (3.11)
2
he sum-of-squares error function is defined by 2

the sum-of-squares error function is defined by
# N
1
ED (w) = 1 # N{tn − w φ(xn )} .
T 2
(3.12)
2
ED (w) = n=1 {tn − w φ(xn )} .
T 2
(3.12)
2
n=1 HUGO LAROCHELLE
ing written down the likelihood function, we
15
can use maximum likelihood to
MAXIMUM LIKELIHOOD
Topics: maximum likelihood
• We know that the gradient of the sum of losses,

  ∇E_D(w) = − Σ_{n=1}^{N} {tn − w^T φ(xn)} φ(xn)^T,

  must be equal to 0 at the minimizing value of w, which gives

  0 = Σ_{n=1}^{N} tn φ(xn)^T − w^T ( Σ_{n=1}^{N} φ(xn) φ(xn)^T )        (3.14)
MAXIMUM LIKELIHOOD
Topics: maximum likelihood, design matrix
• Solving for w, we find that the w minimizing the sum of losses (maximizing the log-probability) is

  w_ML = (Φ^T Φ)⁻¹ Φ^T t        (3.15)

  known as the normal equations for the least squares problem, where Φ is the N × M matrix called the design matrix, whose elements are given by Φ_nj = φj(xn):

  Φ = [ φ0(x1)  φ1(x1)  ...  φM−1(x1)
        φ0(x2)  φ1(x2)  ...  φM−1(x2)
          ...     ...          ...
        φ0(xN)  φ1(xN)  ...  φM−1(xN) ]        (3.16)

  The quantity Φ† ≡ (Φ^T Φ)⁻¹ Φ^T is known as the Moore-Penrose pseudo-inverse of Φ.        (3.17)
MAXIMUM LIKELIHOOD
Topics: maximum likelihood, design matrix
• We should also verify that this is indeed a minimum of E_D(w) (and not a maximum or a saddle point)
  ‣ this is done by computing the second derivatives
  ‣ more precisely, one shows that the matrix of second derivatives (the Hessian matrix) is positive definite
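A sketch of the maximum likelihood solution (not part of the original slides; NumPy is assumed and the data are synthetic): build the design matrix Φ from the basis functions and solve the normal equations. A least-squares solver is used here instead of forming (ΦᵀΦ)⁻¹ explicitly, which is the usual numerically safer route.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 1, size=100)
    t = 0.5 + 2.0 * x + rng.normal(0.0, 0.2, size=x.shape)   # synthetic targets

    # Design matrix Phi[n, j] = phi_j(x_n), here with polynomial basis functions x**j
    Phi = np.vander(x, N=2, increasing=True)

    # w_ML = (Phi^T Phi)^(-1) Phi^T t  (3.15), computed via least squares
    w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    print(w_ml)                               # approximately [0.5, 2.0]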
Machine Learning
Linear regression - regularization
MAXIMUM LIKELIHOOD
Topics: maximum likelihood
• Maximizing the log-probability

  ln p(t | w, β) = Σ_{n=1}^{N} ln N(tn | w^T φ(xn), β⁻¹)
                 = (N/2) ln β − (N/2) ln(2π) − β E_D(w)        (3.11)

  is equivalent to minimizing the sum of squared-error losses:

  E_D(w) = (1/2) Σ_{n=1}^{N} {tn − w^T φ(xn)}²        (3.12)
REGULARIZATION
Topics: regularization, weight decay, ridge regression
• To control the risk of overfitting, we prefer to add a regularization term:

  (1/2) Σ_{n=1}^{N} {tn − w^T φ(xn)}² + (λ/2) w^T w        (3.27)

  ‣ this is equivalent to the maximum a posteriori estimate in the probabilistic formulation
  ‣ the regularization term is often called weight decay
  ‣ regression with this regularization term is also called ridge regression
REGULARIZATION
Topics: regularization, weight decay, ridge regression
• One can show that the solution (maximum a posteriori) is then:

  w = (λ I + Φ^T Φ)⁻¹ Φ^T t        (3.28)

  ‣ in the case λ = 0, we recover the maximum likelihood solution
  ‣ if λ > 0, this also gives a more numerically stable solution (even if Φ^T Φ is not invertible)
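A sketch of the regularized (MAP / ridge) solution (3.28), not part of the original slides; NumPy is assumed and the data and λ values are arbitrary:

    import numpy as np

    def ridge_fit(Phi, t, lam):
        # Closed-form MAP / ridge solution: w = (lam * I + Phi^T Phi)^(-1) Phi^T t   (3.28)
        M = Phi.shape[1]
        return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 1, size=20)
    t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, size=x.shape)
    Phi = np.vander(x, N=6, increasing=True)       # degree-5 polynomial features

    print(ridge_fit(Phi, t, lam=0.0))              # lam = 0: maximum likelihood solution
    print(ridge_fit(Phi, t, lam=1e-3))             # lam > 0: shrunk, better-conditioned weights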
Machine Learning
Linear regression - multiple predictions
TYPES OF LEARNING
Topics: supervised learning, classification, regression
REVIEW
• Supervised learning is when there is a target to predict
  ‣ classification: the target is a class index t ∈ {1, ..., K}
    - example: character recognition
      ✓ x: vector of the intensities of all the pixels in the image
      ✓ t: identity of the character
  ‣ regression: the target is now a real vector t ∈ ℝ^K (rather than a single real number)
    - example: predicting stock values on the market
      ✓ x: vector containing information about the day's economic activity
      ✓ t: the values of several stocks the next day
MULTIPLE PREDICTIONS
Topics: model for multiple predictions
• The model must now predict a vector:

  y(x, W) = W^T φ(x)        (3.31)

  where W is an M × K matrix

• Each column wk of W can be viewed as the weight vector of the model y(x, wk) for the k-th target
MULTIPLE PREDICTIONS
Topics: probabilistic formulation for multiple predictions, multitask model
• We again assume a Gaussian model,

  p(t | x, W, β) = N(t | W^T φ(x), β⁻¹ I)        (3.32)

  where the targets are assumed to be independent

• A model making multiple predictions is sometimes called a multitask model
MULTIPLE PREDICTIONS
Topics: probabilistic formulation for multiple predictions
• Let D = {(x1, t1), ..., (xN, tN)} be our training set
  ‣ we will also write X for the inputs and T for the matrix whose rows are the vectors t1, ..., tN

• Under the i.i.d. assumption, we have:

  ln p(T | X, W, β) = Σ_{n=1}^{N} ln N(tn | W^T φ(xn), β⁻¹ I)
                    = (NK/2) ln(β / 2π) − (β/2) Σ_{n=1}^{N} ||tn − W^T φ(xn)||²        (3.33)
MAXIMUM LIKELIHOOD
Topics: probabilistic formulation for multiple predictions
• One can show that the maximum likelihood estimate is:

  W_ML = (Φ^T Φ)⁻¹ Φ^T T        (3.34)

• The result can be viewed as the column-by-column concatenation of the solutions for each task:

  wk = (Φ^T Φ)⁻¹ Φ^T tk = Φ† tk        (3.35)

  where tk = (t1k, ..., tNk)^T is the N-dimensional column vector with components tnk, for n = 1, ..., N
  ‣ the solution to the regression problem therefore decouples between the different target variables
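A sketch (not part of the original slides; NumPy is assumed and the data are synthetic) checking that W_ML solves all K targets at once and that the columns decouple into K independent regressions:

    import numpy as np

    rng = np.random.default_rng(1)
    N, M, K = 200, 3, 2
    Phi = rng.normal(size=(N, M))                       # design matrix
    W_true = rng.normal(size=(M, K))                    # one column of weights per target
    T = Phi @ W_true + 0.1 * rng.normal(size=(N, K))    # target matrix, one row t_n per example

    # W_ML = (Phi^T Phi)^(-1) Phi^T T  (3.34), all K targets solved at once
    W_ml = np.linalg.lstsq(Phi, T, rcond=None)[0]

    # Same answer as a separate regression on each column of T: the problems decouple (3.35)
    w_0 = np.linalg.lstsq(Phi, T[:, 0], rcond=None)[0]
    print(np.allclose(W_ml[:, 0], w_0))                 # True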
Machine Learning
Linear regression - decision theory
LINEAR REGRESSION MODEL
Topics: probabilistic formulation
REVIEW
• In linear regression, we assume that:

  p(t | x, w, β) = N(t | y(x, w), β⁻¹)        (3.8)

• When we must make a prediction for a new input x, we then predict the mean, i.e. y(x, w)
DECISION THEORY
Topics: decision theory
• Why is predicting the mean (y(x, w)) the right thing to do?

• Decision theory will shed light on this question

• From now on, we write ŷ(x) for the prediction (decision) that we make for an input x
  ‣ ŷ(x) could be different from y(x, w)
DECISION THEORY
Topics: decision theory
• Knowing that:
  ‣ each pair (x, t) is sampled from a distribution p(x, t)
  ‣ the loss we care about is L(t, ŷ(x)) = {ŷ(x) − t}²

• The expected loss of our prediction (decision) will be

  E[L] = ∫∫ {ŷ(x) − t}² p(x, t) dx dt        (1.87)
DECISION THEORY
Topics: decision theory
• To find the optimal ŷ(x), we start by noting that:

  {ŷ(x) − t}² = {ŷ(x) − E[t|x] + E[t|x] − t}²
              = {ŷ(x) − E[t|x]}² + 2{ŷ(x) − E[t|x]}{E[t|x] − t} + {E[t|x] − t}²

• Thus:

  ∫∫ {ŷ(x) − t}² p(x, t) dx dt = ∫∫ {ŷ(x) − E[t|x]}² p(x, t) dx dt + ∫∫ {E[t|x] − t}² p(x, t) dx dt
                                 + ∫∫ 2{ŷ(x) − E[t|x]}{E[t|x] − t} p(x, t) dx dt
DECISION THEORY
Topics: decision theory
• Next, we observe that the cross-term vanishes:

  ∫∫ 2{ŷ(x) − E[t|x]}{E[t|x] − t} p(x, t) dx dt
    = ∫∫ 2{ŷ(x) − E[t|x]}{E[t|x] − t} p(t|x) p(x) dt dx
    = ∫ 2{ŷ(x) − E[t|x]} p(x) ( ∫ {E[t|x] − t} p(t|x) dt ) dx
    = ∫ 2{ŷ(x) − E[t|x]} p(x) ( E[t|x] − ∫ t p(t|x) dt ) dx = 0
DECISION THEORY
Topics: decision theory
• Therefore, since the cross-term is zero, we have:

  ∫∫ {ŷ(x) − t}² p(x, t) dx dt = ∫∫ {ŷ(x) − E[t|x]}² p(x, t) dx dt + ∫∫ {E[t|x] − t}² p(x, t) dx dt

• The minimum is thus reached when ŷ(x) = E[t|x]
DECISION THEORY
Topics: decision theory
• Since we do not know the true distribution p(x, t) (and hence neither p(t|x) nor E[t|x]), the best we can do is use our model of p(t|x):

  ŷ(x) = E[t|x] = y(x, w)

• So, if we want a small squared-error loss (in expectation), predicting the mean is a good choice according to decision theory
DECISION THEORY
Topics: decision theory
• For other choices of loss, the optimal decision will be different
  ‣ for the loss L(t, ŷ(x)) = |ŷ(x) − t|, the decision ŷ(x) should be the median of p(t|x, w)

• Fortunately, the median of a Gaussian is also its mean!
  ‣ for other choices of probabilistic model p(t|x, w), this might not be the case
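A small numerical check of this (not part of the original slides; NumPy is assumed and the skewed choice of p(t|x) is arbitrary): the prediction minimizing the average squared loss is close to the mean, while the one minimizing the average absolute loss is close to the median.

    import numpy as np

    rng = np.random.default_rng(0)
    t = rng.exponential(scale=1.0, size=100_000)          # skewed stand-in for p(t|x)

    candidates = np.linspace(0.0, 3.0, 601)               # candidate decisions yhat(x)
    sq_loss = [np.mean((c - t) ** 2) for c in candidates]
    abs_loss = [np.mean(np.abs(c - t)) for c in candidates]

    print(candidates[np.argmin(sq_loss)], np.mean(t))     # both close to 1.0 (the mean)
    print(candidates[np.argmin(abs_loss)], np.median(t))  # both close to ln 2 ~ 0.69 (the median)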
Machine Learning
Linear regression - bias-variance decomposition
GENERALIZATION
Topics: generalization
• Let us analyze the expected generalization performance of a given regression model y(x, w):

  E_(x,t)[L(t, y(x, w))] = ∫∫ {y(x, w) − t}² p(x, t) dx dt

  where p(x, t) is the true distribution of the examples (x, t)

• Change of notation: we now state explicitly, as a subscript, which random variables the expectation is taken over
GENERALIZATION
Topics: generalization
• Let us analyze the expected generalization performance of a given regression model y(x, w):

  E_(x,t)[L(t, y(x, w))] = ∫∫ {y(x, w) − t}² p(x, t) dx dt

  where p(x, t) is the true distribution of the examples (x, t)

• We are more specifically interested in the case where the model w is the one obtained after training on D
  ‣ in linear regression, this is w = (λ I + Φ^T Φ)⁻¹ Φ^T t        (3.28)
GENERALIZATION
Topics: generalization
• Let us analyze the expected generalization performance of a regression model:

  E_(x,t)[L(t, y(x; D))] = ∫∫ {y(x; D) − t}² p(x, t) dx dt

  where p(x, t) is the true distribution of the examples (x, t)

• To treat the general case, we will instead write our model as y(x; D)
  ‣ it is the model obtained after training on D
GENERALIZATION
Topics: generalization
• Let us analyze the expected generalization performance, now also in expectation over the training set:

  E_D[ E_(x,t)[L(t, y(x; D))] ] = E_D[ ∫∫ {y(x; D) − t}² p(x, t) dx dt ]

  where p(x, t) is the true distribution of the examples (x, t)

• In what follows, we write h(x) = E[t|x] = ∫ t p(t|x) dt for the best possible model, i.e. the one we are looking for (see the slides on decision theory)
BIAS-VARIANCE DECOMPOSITION
Topics: bias, variance, noise
• One can show that:

  E_D[ E_(x,t)[L(t, y(x; D))] ] = (bias)² + variance + noise        (3.41)

  where

  (bias)²  = ∫ {E_D[y(x; D)] − h(x)}² p(x) dx        (3.42)
  variance = ∫ E_D[ {y(x; D) − E_D[y(x; D)]}² ] p(x) dx        (3.43)
  noise    = ∫∫ {h(x) − t}² p(x, t) dx dt        (3.44)
BIAS-VARIANCE DECOMPOSITION
Topics: bias

  (bias)² = ∫ {E_D[y(x; D)] − h(x)}² p(x) dx        (3.42)

• The (bias)² term measures how close the "average" model produced by the learning algorithm will be to the best possible model h(x) = E[t|x], i.e. whether the learning algorithm is capable of modelling h(x)
BIAS-VARIANCE DECOMPOSITION
Topics: variance

  variance = ∫ E_D[ {y(x; D) − E_D[y(x; D)]}² ] p(x) dx        (3.43)

• The variance term measures how much the model produced by the learning algorithm varies from one training set to another
BIAS-VARIANCE DECOMPOSITION
Topics: noise

  noise = ∫∫ {h(x) − t}² p(x, t) dx dt        (3.44)

• The noise term measures how much noise there is in the target to predict, i.e. how much it varies around its conditional expectation (it does not depend on the learning algorithm)
BIAS-VARIANCE DECOMPOSITION
Topics: bias, variance
• A good learning algorithm strikes a good compromise between its bias and its variance
  ‣ if capacity increases, the bias decreases and the variance increases
  ‣ if we regularize, the variance decreases but the bias increases

• The bias-variance decomposition therefore illustrates more formally the phenomena of overfitting and underfitting (a small simulation sketch follows below)
  ‣ not enough capacity ⟹ very high bias ⟹ poor generalization
  ‣ too much capacity ⟹ very high variance ⟹ poor generalization
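As announced above, a small simulation sketch (not part of the original slides; NumPy is assumed, and the choices of h(x), noise level and model capacities are arbitrary) that estimates the three terms of (3.41)-(3.44) empirically by repeatedly drawing training sets D and refitting:

    import numpy as np

    rng = np.random.default_rng(0)

    def h(x):
        return np.sin(2 * np.pi * x)             # best possible model h(x) = E[t|x]

    noise_std = 0.3
    x_grid = np.linspace(0, 1, 200)              # points where the terms are averaged

    def fit_and_predict(degree, n=25):
        # draw one training set D and return the predictions y(x; D) of a fitted polynomial
        x = rng.uniform(0, 1, size=n)
        t = h(x) + rng.normal(0.0, noise_std, size=n)
        return np.polyval(np.polyfit(x, t, degree), x_grid)

    for degree in (1, 3, 9):                     # increasing capacity
        preds = np.array([fit_and_predict(degree) for _ in range(200)])   # 200 data sets
        avg = preds.mean(axis=0)                                          # E_D[y(x; D)]
        bias2 = np.mean((avg - h(x_grid)) ** 2)                           # (3.42)
        variance = np.mean(preds.var(axis=0))                             # (3.43)
        noise = noise_std ** 2                                            # (3.44)
        print(degree, bias2, variance, bias2 + variance + noise)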
Machine Learning
Linear regression - summary
LINEAR REGRESSION
Topics: summary of linear regression
• Model:

  y(x, w) = Σ_{j=0}^{M−1} wj φj(x) = w^T φ(x)        (3.3)

• Training:

  w = (λ I + Φ^T Φ)⁻¹ Φ^T t        (3.28)

  (maximum likelihood if λ = 0, maximum a posteriori if λ > 0)

• Hyper-parameter: λ

• Prediction: y(x, w)
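Putting the summary together, a compact end-to-end sketch (not part of the original slides; NumPy is assumed, with Gaussian basis functions and an arbitrary λ): build Φ, train with the regularized closed-form solution, and predict with y(x, w) = w^T φ(x).

    import numpy as np

    rng = np.random.default_rng(0)
    x_train = rng.uniform(0, 1, size=30)
    t_train = np.sin(2 * np.pi * x_train) + rng.normal(0.0, 0.2, size=30)

    mus, s = np.linspace(0, 1, 9), 0.15                  # Gaussian basis functions (3.4)
    def phi(x):
        # feature map with dummy basis function phi_0(x) = 1
        x = np.atleast_1d(x)
        gauss = np.exp(-(x[:, None] - mus) ** 2 / (2 * s ** 2))
        return np.column_stack([np.ones_like(x), gauss])

    lam = 1e-3                                           # hyper-parameter
    Phi = phi(x_train)
    w = np.linalg.solve(lam * np.eye(Phi.shape[1]) + Phi.T @ Phi, Phi.T @ t_train)

    print(phi(np.array([0.25, 0.5, 0.75])) @ w)          # predictions, roughly [1, 0, -1]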
