
Introduction to Deep Learning

Pablo Sprechmann and Mauricio Delbracio

Instituto de Ingeniería Eléctrica
Facultad de Ingeniería
pablo@cims.nyu.edu


Plan for the week

• Introduction and basic concepts
• Feed-forward neural networks, supervised learning
• Dictionary learning and deep learning. Unsupervised learning
• Generative models: VAEs and GANs. Self-supervised learning
• Recurrent neural networks

Evaluation:
• Two practical assignments based on Stanford's CS231N course
• Homework:
  • A third practical assignment using recurrent networks
  • An independent project using some of the techniques learned


What is machine learning?

Wikipedia:
In computer science, machine learning is a branch of artificial
intelligence whose goal is to develop techniques that allow computers
to learn.

Tom Mitchell (Machine Learning):
The field of machine learning is concerned with building programs
capable of automatically improving their performance through experience.


Perceptron

• Created by Rosenblatt in 1957, at Cornell University.
• One of the first "machines" capable of "learning".
• The model has not changed fundamentally since that time.

Rosenblatt, 1958


Programming vs. learning

Given a shape, is it composed of a single connected component?

Minsky and Papert showed that:

• A (simple) perceptron is not capable of learning to answer this
  question.
• It is possible to write a simple algorithm that does.

Minsky and Papert, Perceptrons, 1969.

Example taken from L. Bottou's talk, ICML 2015.
Is connectivity easy for us?

Example taken from L. Bottou's talk, ICML 2015.

What is easy for us?

Example taken from L. Bottou's talk, ICML 2015.
Why so much interest in machine learning?

• Connectivity has a precise mathematical specification, so one can
  imagine algorithms with guarantees.
• "Mouseness" and "cheesiness" have no such specification.
• One must either define heuristic rules or use machine learning.

Big data and computational power:
The larger the amount of data, the harder it becomes to define
effective heuristic rules, and the better machine learning works.

Example taken from L. Bottou's talk, ICML 2015.
Examples

Image recognition:
• Large intra-class variability, deformations, illumination, etc.
• It is impossible to write rules that precisely describe an object.

Examples

SPAM email detection:
• There may be no rules that are both simple and reliable.
• It may be necessary to combine a large number of weakly reliable rules.
• These rules change over time and must be adjusted.


What is machine learning?

• Instead of writing a specific program for each application:
• Collect a large number of examples specifying the desired response
  (output) for each case (input).
• A machine learning algorithm builds, from the data it receives, a
  program capable of solving the task.

The resulting programs:
• The resulting program is very different from a hand-written program.
  For example, it may contain an enormous number of numbers.
• The program must be able to produce good results on data different
  from the data it was trained on.
• If the data of interest changes, the program can be adapted by
  retraining it on new data representative of that change.


Forms of machine learning

Supervised learning
• Predict an output given an input.

Reinforcement learning
• Learn to choose actions so as to maximize some notion of cumulative
  reward.

Unsupervised learning
• Find a good representation of the data.


Forms of supervised learning

We observe a set of pairs {(x1, y1), (x2, y2), . . . , (xn, yn)} where x ∈ Rd

Regression: the desired output is a real number y ∈ R or a vector of
real numbers y ∈ Rm.
• Predicting the temperature at midday on Thursday
• Predicting the value of a stock in three months

Classification: the desired output is a categorical variable, y ∈ {1, . . . , k}.
• The simplest case is binary classification, k = 2.
• Examples: image classification, speech recognition.

Structured prediction: the desired output is an object with structure,
e.g. a graph.
• Estimating human poses in images


Supervised learning

We observe a set of pairs {(x1, y1), (x2, y2), . . . , (xn, yn)} where x ∈ Rd

Learning process:
• A model class F, such that f(x) ≈ y
• A measure of discrepancy, L(f(x), y), between the predictions f(x)
  and the targets y.
• A method for choosing f ∈ F so as to minimize (see the sketch below):

      ∑_{i=1}^n L(f(xi), yi)

Interpolation problem:
• Given x, use the nearby training data {(xi, yi)}i to predict f(x).
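To make the recipe concrete, here is a minimal NumPy sketch with an
assumed model class (linear functions f(x) = w·x + b) and an assumed
discrepancy (squared error); both choices are illustrative, not taken
from the slides.

```python
# Minimal sketch of the learning recipe: pick a model class (linear, an
# assumption for illustration), a loss (squared error), and minimize the
# empirical risk sum_i L(f(x_i), y_i) by gradient descent.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                    # n = 100 inputs in R^3
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

w, b = np.zeros(3), 0.0
for _ in range(500):
    residual = X @ w + b - y                     # f(x_i) - y_i for all i
    w -= 0.1 * (2 / len(y)) * (X.T @ residual)   # gradient step on w
    b -= 0.1 * 2 * residual.mean()               # gradient step on b

print(w, b)   # w approaches the generating coefficients (1, -2, 0.5)
```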


Curse of dimensionality

• Data x ∈ Rn in very high dimension, e.g. n = 10^6
• In high dimensions, points are isolated from one another
• Solution: use learned representations Φ(x)
• Invariance: |f(x) − f(x′)| should vary little relative to ||Φ(x) − Φ(x′)||₂


The modern pattern-recognition architecture

Audio recognition, from the early '90s until 2011:

    MFCC (fixed) → Mix of Gaussians (unsupervised) → Classifier (supervised)

Image recognition, from the early 2000s until 2011:

    SIFT / HoG (fixed, low-level features) → K-means / Sparse Coding + Pooling
    (unsupervised, mid-level features) → Classifier (supervised)

Example taken from Y. LeCun's talks.
Adaptive hierarchical representations

Deep learning:
A class of nonlinear parametric representations capable of encoding the
characteristics (or knowledge) of a problem, and of being optimized
efficiently (at enormous scale) using stochastic gradient descent
methods.



Adaptive hierarchical representations

• Each layer transforms the previous representation into one at a
  higher level of abstraction
• High-level representations are more global and more invariant
• Low-level representations are shared across the different categories

    Trainable Feature Transform → Trainable Feature Transform →
    Trainable Classifier/Predictor
    (learned internal representations)

Example taken from Y. LeCun's talks.
Adaptive hierarchical representations

    Low-Level Features → Mid-Level Features → High-Level Features →
    Trainable Classifier

Feature visualization of a convolutional net trained on ImageNet, from
[Zeiler & Fergus 2013]

Example taken from Y. LeCun's talks.
Adaptive hierarchical representations

    Trainable Feature Transform → Trainable Feature Transform →
    Trainable Classifier/Predictor
    (learned internal representations)

• A hierarchy of representations (layers) with increasing levels of
  abstraction
• Each layer is an adaptive (trainable) feature representation

• Images:
  • Pixels → edges → textons → parts → objects
• Text:
  • Characters → words → word groups → sentences → stories
• Audio:
  • Samples → spectral bands → sounds → . . . → phonemes → words


• Intermediate representations: Retina - LGN - V1 - V2 - V4 - PIT - AIT
• Neural networks are not biologically plausible


A brief history of the (most recent) resurgence of neural networks


Before 2010...

• The first convolutional network trained with backpropagation
• In the '90s it was solving real problems
• It was competitive in other domains as well, but was not used in
  production


Before 2010...

• Competitive results, but work on neural networks was a minority

Face detection [Vaillant et al. '93, '94; Osadchy et al. '03, '04, '07]
(Yann LeCun et al.)


Before 2010...

• Competitive results, but work on neural networks was a minority

Scene segmentation [Farabet et al., 2012-13]
(Yann LeCun et al.)



Before 2010...

The main objections:

• Too many parameters to learn. "I know the problem; my features are
  better!"
• Non-convex optimization? They must be crazy.
• It is a black-box model; the models cannot be interpreted.
• Results were hard to reproduce, and solid code libraries were lacking.
• They give point estimates; there is no inference.


First breakthrough: speech recognition

• Mohamed, Dahl and Hinton, 2009
• A new method for training deep networks showed good results
• It was included in Android in 2012.

(image by R. Pieters)


ImageNet classification challenge

An object-recognition competition:

• 1000 image categories; 1.4 million training images
• Krizhevsky, Sutskever, Hinton, 2012.
• An answer counts as correct if the true label is among the top 5
  predictions.

(Chart: techniques using CNNs vs. other methods.)

3.5% error with a 150-layer neural network!


After 2010...

What made this change possible?

• Enormous datasets (1.4 M images, 1000 categories)
• Better hardware: GPUs.
• Architectures: ReLU.
• Better regularization for training: Dropout.
• Better conditioning of the optimization: Batch normalization.


Evolution of the architectures


Questions:

• Are the improvements only for these datasets?
• Are the improvements only for these problems?


Are the improvements only for these datasets?

• Techniques using deep learning hold the state of the art in almost
  every vision problem.

Segmentation [Pinheiro, Collobert, Dollár, ICCV15]

Are the improvements only for these problems?

• They also apply to recurrent neural networks, RNNs
• Text representation, Neural Machine Translation

Google


Are the improvements only for these problems?

• Multiple modalities: mapping images to text.

[Vinyals et al. '14, Karpathy et al. '14, Donahue et al. '14, Kiros et al. '14, MSR '14]


Beyond supervised learning

• Separating texture from style in images

Gatys et al. 2015



Beyond supervised learning

• Reinforcement learning
• DeepMind: Atari games and AlphaGo



Credits

These slides are adapted (or most often directly taken) from presentations by
David Sontag and Andrew Zisserman.
Classification

• Suppose we are given a training set of N observations

(x1, . . . , xN ) and (y1, . . . , yN ), xi ∈ Rd, yi ∈ {−1, 1}


• Classification problem is to estimate f(x) from this data such that
  f(xi) = yi, with as few misclassifications as possible.
Decision theory

A decision rule divides the input space into regions Rk, called
decision regions, one for each class, such that all points in Rk are
assigned to class Ck. The boundaries between decision regions are
called decision boundaries or decision surfaces. Note that each
decision region need not be contiguous but could comprise some number
of disjoint regions.

Minimising the misclassification rate:

A mistake occurs when an input vector belonging to class C1 is assigned
to class C2 or vice versa. The probability of this occurring is

    p(mistake) = p(x ∈ R1, C2) + p(x ∈ R2, C1)
               = ∫_{R1} p(x, C2) dx + ∫_{R2} p(x, C1) dx.

We are free to choose the decision rule that assigns each point x to
one of the classes. Clearly, to minimize p(mistake) we should arrange
that each x is assigned to whichever class has the smaller value of the
integrand: if p(x, C1) > p(x, C2) for a given value of x, then we
should assign that x to class C1. From the product rule of probability
we have p(x, Ck) = p(Ck|x) p(x), and because the factor p(x) is common
to both terms, the minimum-error rule assigns each x to the class with
the largest posterior probability p(Ck|x).

(Figure from Bishop: the joint probabilities p(x, Ck) for two classes
plotted against x, together with a decision boundary x = x̂; the optimal
choice of x̂ is where the curves p(x, C1) and p(x, C2) cross.)

A reject option can also be introduced: inputs x for which the largest
posterior probability p(Ck|x) is less than or equal to a threshold θ
are rejected rather than classified. Setting θ = 1 ensures that all
examples are rejected, whereas with K classes, setting θ < 1/K ensures
that no examples are rejected. The reject criterion can be extended to
minimize the expected loss when a loss matrix is given, taking account
of the loss incurred when a reject decision is made.

We have broken the classification problem down into two separate
stages: the inference stage, in which we use training data to learn a
model for the posteriors p(Ck|x), and the decision stage, in which we
use these probabilities to make class assignments:

    f(x) = argmax_k p(Ck|x)

k-NN and decision theory

We can try to use the nearest neighbours to approximate this function:

    f(x) = argmax_k Nk/N

i.e. the fraction of the K neighbours belonging to each class.

Two approximations:
• conditioning at a point is relaxed to conditioning on some region
  "close" to the target point
• the probability is approximated by averaging over sample data
Supervised Learning: Overview

(Figure: inputs x pass through a learning machine to produce outputs y.)
K Nearest Neighbour (K-NN) Classifier

Algorithm
• For each test point, x, to be classified, find the K nearest
  samples in the training data
• Classify the point, x, according to the majority vote of their
  class labels

e.g. K = 3

• applicable to the multi-class case (see the sketch below)
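A minimal NumPy sketch of this algorithm, assuming Euclidean distance
(the slides leave the distance metric open; see the summary slide
later):

```python
# Sketch of the K-NN classifier: majority vote among the K nearest
# training samples (Euclidean distance is an assumption here).
import numpy as np

def knn_classify(x, X_train, y_train, K=3):
    dists = np.linalg.norm(X_train - x, axis=1)       # distance to each sample
    nearest = np.argsort(dists)[:K]                   # indices of the K nearest
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]                  # majority vote
```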
K = 1

Voronoi diagram:
• partitions the space into regions
• boundaries are equal distance from training points

Classification boundary:
• non-linear
K = 1

(Figure: training data, error = 0.0; testing data, error = 0.15.)

K = 3

(Figure: training data, error = 0.0760; testing data, error = 0.1340.)


Generalization
• The real aim of supervised learning is to do well on test data that is
not known during learning

• Choosing the values for the parameters that minimize the loss
function on the training data is not necessarily the best policy

• We want the learning machine to model the true regularities in the
  data and to ignore the noise in the data.
K = 1

(Figure: training error = 0.0; testing error = 0.15.)

K = 3

(Figure: training error = 0.0760; testing error = 0.1340.)

K = 7

(Figure: training error = 0.1320; testing error = 0.1110.)

K = 21

(Figure: training error = 0.1120; testing error = 0.0920.)


Properties and training
As K increases:
• Classification boundary becomes smoother
• Training error can increase

Choose (learn) K by cross-validation


• Split training data into training and validation
• Hold out validation data and measure error on this
Example: hand-written digit recognition

• MNIST data set
• Distance = raw pixel distance between images:

      D(A, B) = sqrt( ∑_{ij} (aij − bij)² )

• 60K training examples
• 10K testing examples
• K-NN gives 5% classification error
Summary

Advantages:
• K-NN is a simple but effective classification procedure
• Applies to multi-class classification
• Decision surfaces are non-linear
• Quality of predictions automatically improves with more training
data
• Only a single parameter, K; easily tuned by cross-validation
Summary

Disadvantages:
• What does nearest mean? Need to specify a distance metric.
• Computational cost: must store and search through the entire
training set at test time. Can alleviate this problem by thinning,
and use of efficient data structures like KD trees.
Regression

• Suppose we are given a training set of N observations
  (x1, . . . , xN) and (y1, . . . , yN), xi, yi ∈ R
• Regression problem is to estimate y(x) from this data

K-NN Regression

Algorithm
• For each test point, x, find the K nearest samples xi in the
  training data and their values yi
• Output is the mean of their values (see the sketch below):

      f(x) = (1/K) ∑_{i=1}^K yi

• Again, need to choose (learn) K
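The same nearest-neighbour machinery, now averaging targets instead of
voting (Euclidean distance assumed, as before):

```python
# Sketch of K-NN regression: average the targets of the K nearest
# training samples.
import numpy as np

def knn_regress(x, X_train, y_train, K=3):
    nearest = np.argsort(np.linalg.norm(X_train - x, axis=1))[:K]
    return y_train[nearest].mean()                    # f(x) = (1/K) * sum y_i
```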
Regression example: polynomial curve fitting (from Bishop)

• The green curve is the true function (which is not a polynomial)
• The data points are uniform in x but have noise in y.
• We will use a loss function that measures the squared error in the
  prediction of y(x) from x. The loss for the red polynomial is the sum
  of the squared vertical errors.

(Figure: noisy target values and a polynomial regression fit.)

Some fits to the data: which is best?

(Figure from Bishop: polynomial fits of increasing order; the
highest-order fit over-fits the data.)
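The effect is easy to reproduce in a few lines; the sine-curve ground
truth and the sample sizes below are assumptions echoing Bishop's
example, not values taken from the slides.

```python
# Reproducing the over-fitting effect: fit polynomials of increasing
# order M to 10 noisy samples of a sine curve, compare train/test RMS.
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=10)
x_test = np.linspace(0, 1, 100)
y_test = np.sin(2 * np.pi * x_test) + 0.2 * rng.normal(size=100)

for M in (1, 3, 9):
    w = np.polyfit(x, y, deg=M)                       # least-squares fit
    rms = lambda xs, ys: np.sqrt(np.mean((np.polyval(w, xs) - ys) ** 2))
    print(f"M={M}: train RMS {rms(x, y):.3f}, test RMS {rms(x_test, y_test):.3f}")
# The M = 9 fit drives the training error to ~0 while the test error grows.
```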
Over-fitting
• test data: a different sample from the same true function

Root‐Mean‐Square (RMS) Error:

• training error goes to zero, but test error increases with M

Trading off goodness of fit against model complexity

• If the model has as many degrees of freedom as the data, it can fit the
training data perfectly

• But the objective in ML is generalization

• Can expect a model to generalize well if it explains the training data


surprisingly well given the complexity of the model.
Polynomial Coefficients

(Table from Bishop: the fitted coefficients grow enormously in
magnitude as the polynomial order increases.)

How to prevent over-fitting? I

• Add more data than the model "complexity"
• For the 9th order polynomial:

(Figure: with more data points, the 9th order fit no longer over-fits.)

How to prevent over-fitting? II

• Regularization: penalize large coefficient values ("ridge"
  regression; see the sketch below):

      min_w  ∑_i (f(xi, w) − yi)²  +  λ ||w||²
             (loss function)          (regularization)

In practice use validation data to choose λ (not test)

• cf with KNN classification as K increases
• we will return to regularization for regression later
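A sketch of the minimizer of the penalized objective in closed form,
using the standard ridge identity w = (Φ⊤Φ + λI)⁻¹Φ⊤y (an assumption
brought in here, not a formula from the slides):

```python
# Ridge-regularized polynomial fit in closed form: minimizing
# sum_i (phi(x_i).w - y_i)^2 + lam * ||w||^2 is solved by
# w = (Phi^T Phi + lam * I)^{-1} Phi^T y.
import numpy as np

def ridge_polyfit(x, y, M, lam):
    Phi = np.vander(x, M + 1)                         # polynomial features
    A = Phi.T @ Phi + lam * np.eye(M + 1)
    return np.linalg.solve(A, Phi.T @ y)              # coefficient vector w

# Larger lam shrinks the wild high-order coefficients toward zero.
```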


Summary: How to set parameters?

Use a validation set:

Divide the total dataset into three subsets:


• Training data is used for learning the parameters of the
model.
• Validation data is not used for learning but is used for
deciding what type of model and what amount of
regularization works best.
• Test data is used to get a final, unbiased estimate of how
well the learning machine works. We expect this estimate
to be worse than on the validation data.
We could then re-divide the total dataset to get another
unbiased estimate of the true error rate.
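A sketch of the three-way split just described; the 60/20/20 fractions
are an arbitrary illustrative choice.

```python
# Split a dataset into training / validation / test subsets.
import numpy as np

def split_dataset(X, y, frac_train=0.6, frac_val=0.2, seed=0):
    idx = np.random.default_rng(seed).permutation(len(X))
    n_tr = int(frac_train * len(X))
    n_va = int((frac_train + frac_val) * len(X))
    tr, va, te = idx[:n_tr], idx[n_tr:n_va], idx[n_va:]
    return (X[tr], y[tr]), (X[va], y[va]), (X[te], y[te])
```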

• Again, need to control the complexity of the (discriminant) function
Binary Classification

Given training data (xi, yi) for i = 1 . . . N, with xi ∈ Rd and
yi ∈ {−1, 1}, learn a classifier f(x) such that

      f(xi) ≥ 0 if yi = +1
      f(xi) < 0 if yi = −1

i.e. yi f(xi) > 0 for a correct classification.
Linear separability

(Figures: examples of linearly separable and not linearly separable
data.)
Linear classifiers

A linear classifier has the form

      f(x) = w⊤x + b

(Figure: in the (x1, x2) plane, the line f(x) = 0 separates the regions
f(x) < 0 and f(x) > 0.)

• in 2D the discriminant is a line
• w is the normal to the line, and b the bias
• w is known as the weight vector
Linear classifiers

A linear classifier has the form

      f(x) = w⊤x + b

• in 3D the discriminant is a plane, and in nD it is a hyperplane

For a K-NN classifier it was necessary to 'carry' the training data
For a linear classifier, the training data is used to learn w and then
discarded
Only w is needed for classifying new data
The Perceptron Classifier

Given linearly separable data xi labelled into two categories
yi ∈ {−1, 1}, find a weight vector w such that the discriminant
function

      f(xi) = w⊤xi + b

separates the categories for i = 1, . . . , N
• how can we find this separating hyperplane?

The Perceptron Algorithm

Write the classifier as f(xi) = w̃⊤x̃i + w0 = w⊤xi,
where w = (w̃, w0) and xi = (x̃i, 1)

• Initialize w = 0
• Cycle through the data points { xi, yi }
• if xi is misclassified then w ← w + α yi xi
• Until all the data is correctly classified
For example in 2D

• Initialize w = 0
• Cycle through the data points { xi, yi }
• if xi is misclassified then w ← w + α yi xi
• Until all the data is correctly classified (see the sketch below)

(Figure: the weight vector w before and after an update on a
misclassified point xi.)

NB: after convergence, w = ∑_{i=1}^N αi xi
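A sketch of the algorithm above, with the bias absorbed by augmenting
each xi with a constant 1, so that w = (w̃, w0):

```python
# Sketch of the perceptron algorithm: update on each misclassified
# point until the (linearly separable) data is correctly classified.
import numpy as np

def perceptron(X, y, alpha=1.0, max_epochs=1000):
    Xa = np.hstack([X, np.ones((len(X), 1))])         # x_i -> (x_i, 1)
    w = np.zeros(Xa.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(Xa, y):
            if yi * (w @ xi) <= 0:                    # misclassified (or on boundary)
                w += alpha * yi * xi                  # push boundary past x_i
                mistakes += 1
        if mistakes == 0:                             # converged: data separable
            break
    return w
```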
Perceptron [Bishop, §4.1 Discriminant Functions]

(Figure 4.7 from Bishop: illustration of the convergence of the
perceptron learning algorithm, showing data points from two classes
(red and blue) in a two-dimensional feature space (φ1, φ2). The initial
parameter vector w is shown as a black arrow together with the
corresponding decision boundary; successive updates on misclassified
points rotate the boundary until the classes are separated.)
Perceptron example

(Figure: a separating line learned by the perceptron on 2D data.)

• if the data is linearly separable, then the algorithm will converge


• convergence can be slow …
• separating line close to training data
• we would prefer a larger margin for generalization
What is the best w?

• maximum margin solution: most stable under perturbations of the inputs

Support Vector Machine

linearly separable data

(Figure: the separating hyperplane w⊤x + b = 0, at distance b/||w||
from the origin, with the support vectors highlighted.)

      f(x) = ∑_i αi yi (xi⊤x) + b

where the sum runs over the support vectors.
SVM – sketch derivation

• Since w⊤x + b = 0 and c(w⊤x + b) = 0 define the same plane, we have
  the freedom to choose the normalization of w
• Choose the normalization such that w⊤x₊ + b = +1 and w⊤x₋ + b = −1
  for the positive and negative support vectors respectively
• Then the margin is given by

      (w/||w||) · (x₊ − x₋) = w⊤(x₊ − x₋) / ||w|| = 2/||w||
Support Vector Machine

linearly separable data

      Margin = 2/||w||

(Figure: hyperplanes w⊤x + b = 1, w⊤x + b = 0, w⊤x + b = −1, with the
support vectors lying on the margin hyperplanes.)
SVM – Optimization

• Learning the SVM can be formulated as an optimization:

      max_w 2/||w||   subject to   w⊤xi + b ≥ +1 if yi = +1,
                                   w⊤xi + b ≤ −1 if yi = −1,   for i = 1 . . . N

• Or equivalently

      min_w ||w||²   subject to   yi (w⊤xi + b) ≥ 1   for i = 1 . . . N

• This is a quadratic optimization problem subject to linear
  constraints and there is a unique minimum
Linear separability again: What is the best w?

• the points can be linearly separated but there is a very narrow margin
• but possibly the large margin solution is better, even though one
  constraint is violated

In general there is a trade-off between the margin and the number of
mistakes on the training data
Introduce "slack" variables ξi ≥ 0

(Figure: margin = 2/||w||; hyperplanes w⊤x + b = 1, 0, −1; a
margin-violating point at distance ξi/||w|| inside the margin and a
misclassified point beyond the hyperplane.)

• for 0 < ξi ≤ 1 the point is between the margin and the correct side
  of the hyperplane. This is a margin violation
• for ξi > 1 the point is misclassified
• ξi = 0 for points on or outside the margin (including the support
  vectors)
"Soft" margin solution

The optimization problem becomes

      min_{w ∈ Rd, ξi ∈ R+}  ||w||² + C ∑_{i=1}^N ξi
      subject to  yi (w⊤xi + b) ≥ 1 − ξi   for i = 1 . . . N

• Every constraint can be satisfied if ξi is sufficiently large
• C is a regularization parameter:
  — small C allows constraints to be easily ignored → large margin
  — large C makes constraints hard to ignore → narrow margin
  — C = ∞ enforces all constraints: hard margin
• This is still a quadratic optimization problem and there is a unique
  minimum. Note, there is only one parameter, C.
(Figure: 2D data plotted as feature x vs. feature y.)

• data is linearly separable
• but only with a narrow margin

C = ∞: hard margin
C = 10: soft margin
Optimization

Learning an SVM has been formulated as a constrained optimization
problem over w and ξ:

      min_{w ∈ Rd, ξi ∈ R+}  ||w||² + C ∑_{i=1}^N ξi
      subject to  yi (w⊤xi + b) ≥ 1 − ξi   for i = 1 . . . N

The constraint yi (w⊤xi + b) ≥ 1 − ξi can be written more concisely as

      yi f(xi) ≥ 1 − ξi

which, together with ξi ≥ 0, is equivalent to

      ξi = max (0, 1 − yi f(xi))

Hence the learning problem is equivalent to the unconstrained
optimization problem over w:

      min_{w ∈ Rd}  ||w||² + C ∑_{i=1}^N max (0, 1 − yi f(xi))
                    (regularization)   (loss function)

Loss function

      min_{w ∈ Rd}  ||w||² + C ∑_{i=1}^N max (0, 1 − yi f(xi))

(Figure: the hyperplane w⊤x + b = 0 with its support vectors.)

Points are in three categories:

1. yi f(xi) > 1: the point is outside the margin. No contribution to
   the loss.
2. yi f(xi) = 1: the point is on the margin. No contribution to the
   loss, as in the hard-margin case.
3. yi f(xi) < 1: the point violates the margin constraint. Contributes
   to the loss.
Loss functions

(Figure: losses plotted against yi f(xi).)

• SVM uses the "hinge" loss max (0, 1 − yi f(xi))
• an approximation to the 0-1 loss
Optimization continued

      min_{w ∈ Rd}  C ∑_{i=1}^N max (0, 1 − yi f(xi)) + ||w||²

(Figure: a non-convex function with a local and a global minimum.)

• Does this cost function have a unique solution?
• Does the solution depend on the starting point of an iterative
  optimization algorithm (such as gradient descent)?

If the cost function is convex, then a locally optimal point is
globally optimal (provided the optimization is over a convex set, which
it is in our case)
Convex functions

(Figures: examples of convex and non-convex functions.)

A non-negative sum of convex functions is convex

SVM:

      min_{w ∈ Rd}  C ∑_{i=1}^N max (0, 1 − yi f(xi)) + ||w||²   is convex
Gradient (or steepest) descent algorithm for SVM

To minimize a cost function C(w) use the iterative update

      w_{t+1} ← w_t − η_t ∇_w C(w_t)

where η is the learning rate.

First, rewrite the optimization problem as an average:

      min_w C(w) = (λ/2) ||w||² + (1/N) ∑_{i=1}^N max (0, 1 − yi f(xi))
                 = (1/N) ∑_{i=1}^N ( (λ/2) ||w||² + max (0, 1 − yi f(xi)) )

(with λ = 2/(NC), up to an overall scale of the problem) and
f(x) = w⊤x + b

Because the hinge loss is not differentiable, a sub-gradient is
computed
Sub-gradient for hinge loss

      L(xi, yi; w) = max (0, 1 − yi f(xi)),   f(xi) = w⊤xi + b

      ∂L/∂w = −yi xi   if yi f(xi) < 1
      ∂L/∂w = 0        otherwise
Sub-gradient descent algorithm for SVM

      C(w) = (1/N) ∑_{i=1}^N ( (λ/2) ||w||² + L(xi, yi; w) )

The iterative update is

      w_{t+1} ← w_t − η ∇_{w_t} C(w_t)
              ← w_t − η ( λ w_t + (1/N) ∑_{i=1}^N ∇_w L(xi, yi; w_t) )

where η is the learning rate.

Then each iteration t involves cycling through the training data with
the updates:

      w_{t+1} ← w_t − η (λ w_t − yi xi)   if yi f(xi) < 1
              ← w_t − η λ w_t            otherwise

In the Pegasos algorithm the learning rate is set at η_t = 1/(λt)
Pegasos – Stochastic Gradient Descent Algorithm

Randomly sample from the training data (see the sketch below)

(Figures: energy vs. iteration on a logarithmic scale, and the learned
decision boundary on 2D data.)
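A sketch of the Pegasos update just described: at step t, sample one
training point, set η_t = 1/(λt), and apply the sub-gradient step of
the regularized hinge loss (the bias term is omitted for brevity).

```python
# Sketch of Pegasos: stochastic sub-gradient descent on the
# regularized hinge loss with learning rate eta_t = 1/(lam * t).
import numpy as np

def pegasos(X, y, lam=0.1, T=10_000, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for t in range(1, T + 1):
        i = rng.integers(len(X))
        eta = 1.0 / (lam * t)
        margin_violated = y[i] * (w @ X[i]) < 1       # check before update
        w *= (1 - eta * lam)                          # shrink: regularization term
        if margin_violated:
            w += eta * y[i] * X[i]                    # hinge sub-gradient step
    return w
```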
How do we do multi-class classification?

Multi-class SVM

Simultaneously learn 3 sets of weights: w+, w-, wo

• How do we guarantee the correct labels?
• Need new constraints!

The "score" of the correct class must be better than the "score" of the
wrong classes (see the sketch below):

      w_{yi}⊤ xi ≥ w_c⊤ xi + 1 − ξi   for all c ≠ yi
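A sketch of this margin constraint expressed as a loss, in the
Crammer-Singer style (the exact formulation shown in the slide's figure
is assumed here):

```python
# Multi-class hinge loss: every wrong class is penalized when its score
# comes within a margin of 1 of the correct class's score.
import numpy as np

def multiclass_hinge_loss(W, x, yi):
    scores = W @ x                                    # one score per class
    margins = np.maximum(0, scores - scores[yi] + 1)  # violation per class
    margins[yi] = 0                                   # true class not penalized
    return margins.sum()
```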
rule will divide the input space into regions Rk called decision regions, one for
p(x, bet
class, such that all points in Rk are assigned to class Ck . The boundaries C2
decision regions are called decision boundaries or decision surfaces. Note that
C2 ) Decision
decision region need theory
not be contiguous but could comprise some number of dis
regions. We shall encounter examples of decision boundaries and decision regio
later chapters. In order to find the optimal decision rule, consider first of all the
of two classes, as in the cancer problem for instance. A mistake occurs when an
vector belonging to class C1 is assigned to class C2 or vice versa. The probabili
Minimising misclassificationthis occurring is given by
rate:
p(mistake) = p(x ∈ R1 , C2 ) + p(x ∈ R2 , C1 )
! !
= p(x, C2 ) dx + p(x, C1 ) dx. (
R1 R2
[Figure 1.24] Schematic illustration of the joint probabilities p(x, Ck) for each of two classes plotted against x, together with the decision boundary x = x̂. Values of x ≥ x̂ are classified as class C2 and hence belong to decision region R2, whereas points x < x̂ are classified as C1 and belong to R1. Errors arise from the blue, green, and red regions: for x < x̂ the errors are due to points from class C2 being misclassified as C1 (the sum of the red and green regions), and conversely for x ≥ x̂ the errors are due to points from class C1 being misclassified as C2 (the blue region). As we vary the location x̂ of the decision boundary, the combined area of the blue and green regions remains constant, whereas the size of the red region varies. The optimal choice for x̂ is where the curves for p(x, C1) and p(x, C2) cross, corresponding to x̂ = x0, because in this case the red region disappears. This is equivalent to the minimum misclassification rate decision rule, which assigns each value of x to the class having the higher posterior probability p(Ck|x).
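To make the rule concrete, here is a small numerical sketch, not from the slides: the two class-conditional Gaussians, the priors, and the grid are all made-up choices.

import numpy as np

def gauss(x, mu, sigma):
    # Gaussian density N(x | mu, sigma^2)
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

priors = np.array([0.6, 0.4])          # p(C1), p(C2) -- illustrative
means = np.array([-1.0, 2.0])
stds = np.array([1.0, 1.5])

x = np.linspace(-6.0, 8.0, 2000)
# Joint densities p(x, Ck) = p(x|Ck) p(Ck), one row per class.
joint = np.stack([gauss(x, m, s) * p for m, s, p in zip(means, stds, priors)])

# Minimum-misclassification rule: assign each x to the largest joint density
# (equivalently the largest posterior, since p(x) is common to all classes).
decision = joint.argmax(axis=0)

# p(mistake) integrates the smaller of the joint densities over x (Riemann sum).
p_mistake = joint.min(axis=0).sum() * (x[1] - x[0])
print(f"p(mistake) under the optimal rule: {p_mistake:.4f}")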
Inference and decision

We have broken the classification problem down into two separate stages:
• Inference stage: we use training data to estimate model parameters, i.e. to learn a model for the posteriors p(Ck|x).
• Decision stage: use these probabilities to make a class assignment.

Alternative approach:
• Define a function that maps inputs directly to decisions, solving the class assignment problem in one step (like an SVM).
Decision theory: three approaches

We can identify three distinct approaches to solving decision problems, all of which have been used in practical applications, given here in decreasing order of complexity.

(a) Generative models: first solve the inference problem of determining the class-conditional densities p(x|Ck) for each class Ck individually. Also separately infer the prior class probabilities p(Ck). Then use Bayes' theorem in the form

$$p(\mathcal{C}_k \mid x) = \frac{p(x \mid \mathcal{C}_k)\, p(\mathcal{C}_k)}{p(x)} \quad (1.82)$$

to find the posterior class probabilities p(Ck|x). As usual, the denominator in Bayes' theorem can be found in terms of the quantities appearing in the numerator, because

$$p(x) = \sum_k p(x \mid \mathcal{C}_k)\, p(\mathcal{C}_k) \quad (1.83)$$

Equivalently, we can model the joint distribution p(x, Ck) directly and then normalise to obtain the posterior probabilities. Having found the posteriors, we use decision theory to determine class membership for each new input x. Approaches that explicitly or implicitly model the distribution of inputs as well as outputs are known as generative models, because by sampling from them it is possible to generate synthetic data points in the input space.

(b) Discriminative models: first solve the inference problem of determining the posterior class probabilities p(Ck|x), and then subsequently use decision theory to assign each new x to one of the classes. Approaches that model the posterior probabilities directly are called discriminative models.

(c) Discriminant functions: find a function f(x), called a discriminant function, which maps each input x directly onto a class label. For instance, in the case of two classes, f(·) might be binary valued, with f = 0 representing class C1 and f = 1 representing class C2. In this case, probabilities play no role.

Relative merits:
• Generative models: estimating densities in high dimensions is difficult, but we obtain a full model of the problem and can evaluate uncertainty (a sketch of this route follows below).
• Discriminative models: model the posterior probabilities directly and require significantly fewer parameters.
• Direct approaches: the simplest approach, and it can be very effective, although we lose access to the posterior probabilities.
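A minimal sketch of the generative route (a) on synthetic 1-D data; the data, the one-Gaussian-per-class model, and all numbers are illustrative assumptions, not part of the original slides.

import numpy as np

rng = np.random.default_rng(0)
x0 = rng.normal(-1.0, 1.0, 300)   # training inputs for class C1
x1 = rng.normal(2.0, 1.5, 200)    # training inputs for class C2

# Inference: fit class-conditional Gaussians p(x|Ck) and priors p(Ck).
mus = np.array([x0.mean(), x1.mean()])
sigmas = np.array([x0.std(), x1.std()])
priors = np.array([len(x0), len(x1)]) / (len(x0) + len(x1))

def posterior(x):
    # Bayes' theorem (1.82): p(Ck|x) = p(x|Ck) p(Ck) / sum_j p(x|Cj) p(Cj)
    lik = np.exp(-0.5 * ((x - mus) / sigmas) ** 2) / (sigmas * np.sqrt(2 * np.pi))
    joint = lik * priors
    return joint / joint.sum()

# Decision stage: assign a new point to the class with the largest posterior.
x_new = 0.5
print(posterior(x_new), "-> class", posterior(x_new).argmax())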
Advantages of estimating probabilities

Minimising risk: for many applications, our objective is more complex than simply minimising the number of misclassifications; making one type of mistake can have more "cost" than the other. Consider again the medical diagnosis problem. If a patient who does not have cancer is incorrectly diagnosed as having cancer, the consequences may be some patient distress plus the need for further investigations. Conversely, if a patient with cancer is diagnosed as healthy, the result may be premature death due to lack of treatment. The consequences of these two types of mistake can be dramatically different: it would clearly be better to make fewer mistakes of the second kind, even at the expense of more mistakes of the first kind.

We formalise this by a loss matrix L whose element Lkj measures the relative "cost" incurred when an input whose true class is Ck is assigned to class Cj (where j may or may not be equal to k).

[Figure 1.25] An example of a loss matrix for the cancer treatment problem. Rows correspond to the true class, columns to the assignment made by our decision criterion. There is no loss for a correct decision, a loss of 1 if a healthy patient is diagnosed as having cancer, and a loss of 1000 if a patient having cancer is diagnosed as healthy:

                 assigned: cancer   normal
    true cancer:              0       1000
    true normal:              1          0

Minimising the expected loss: the optimal solution is the one that minimises the loss function. However, the loss depends on the true class, which is unknown. For a given input x, our uncertainty in the true class is expressed through the joint probability distribution p(x, Ck), so we seek instead to minimise the average loss, computed with respect to this distribution:

$$\mathbb{E}[L] = \sum_k \sum_j \int_{\mathcal{R}_j} L_{kj}\, p(x, \mathcal{C}_k)\, dx \quad (1.80)$$

Each x can be assigned independently to one of the decision regions Rj. Our goal is to choose the regions Rj so as to minimise (1.80), which implies that for each x we should minimise Σk Lkj p(x, Ck). As before, we can use the product rule p(x, Ck) = p(Ck|x) p(x) to eliminate the common factor p(x). The decision rule that minimises the expected loss assigns each new x to the class j for which the quantity

$$\sum_k L_{kj}\, p(\mathcal{C}_k \mid x) \quad (1.81)$$

is a minimum. This is clearly trivial to do once we know the posterior class probabilities p(Ck|x).
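A small sketch of the minimum-expected-loss rule (1.81): the loss matrix is the one from Figure 1.25, and the posterior values are made up.

import numpy as np

# Loss matrix from Figure 1.25: rows = true class, columns = decision
# (classes: 0 = cancer, 1 = normal).
L = np.array([[0, 1000],
              [1,    0]])

def decide(posteriors):
    # Choose the class j minimising sum_k L[k, j] p(Ck|x), as in (1.81).
    expected_loss = L.T @ posteriors   # one entry per candidate decision j
    return expected_loss.argmin(), expected_loss

# Even a small cancer posterior triggers the 'cancer' decision,
# because missing a cancer is 1000 times costlier than a false alarm.
print(decide(np.array([0.05, 0.95])))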
Advantages of estimating probabilities

Rejection option: avoid making a decision in difficult cases.

We classify those inputs (e.g. X-ray images) for which there is little doubt as to the correct class, while leaving a human expert to classify the more ambiguous cases. We can achieve this by introducing a threshold θ and rejecting those inputs x for which the largest of the posterior probabilities p(Ck|x) is less than or equal to θ.

[Figure 1.26] Illustration of the reject option for two classes and a single continuous input variable x: inputs x such that the larger of the two posterior probabilities p(C1|x), p(C2|x) is less than or equal to some threshold θ fall in the reject region. Note that setting θ = 1 will ensure that all examples are rejected, whereas if there are K classes then setting θ < 1/K will ensure that no examples are rejected; the fraction of examples that get rejected is controlled by the value of θ. We can easily extend the reject criterion to minimise the expected loss, when a loss matrix is given, taking account of the loss incurred when a reject decision is made.
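The reject criterion in a few lines; the threshold value below is illustrative.

import numpy as np

def classify_with_reject(posteriors, theta=0.9):
    # Reject when the largest posterior is <= theta; otherwise pick the argmax.
    p = np.asarray(posteriors)
    return "reject" if p.max() <= theta else int(p.argmax())

print(classify_with_reject([0.55, 0.45]))   # ambiguous -> 'reject'
print(classify_with_reject([0.97, 0.03]))   # confident -> class 0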
Advantages of estimating probabilities

Compensation for prior:
• Situation of unbalanced datasets
• Predicting all samples from the dominant class gives high accuracy
• We can train on a balanced set and then rescale using the correct priors (a sketch follows below)

Model combination:
• For complex applications, break the problem into smaller subproblems, each of which can be tackled by a separate module.
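A sketch of the prior correction; the posteriors and priors below are made-up numbers. Since the posterior is proportional to p(x|Ck) p(Ck), we divide out the balanced training priors, multiply by the deployment priors, and renormalise.

import numpy as np

def rescale_posteriors(p_balanced, train_priors, true_priors):
    # Posteriors are proportional to p(x|Ck) p(Ck): swap priors, renormalise.
    p = np.asarray(p_balanced) / np.asarray(train_priors) * np.asarray(true_priors)
    return p / p.sum()

# Trained on a 50/50 set, deployed where class 0 occurs only 1% of the time.
print(rescale_posteriors([0.7, 0.3], [0.5, 0.5], [0.01, 0.99]))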
Maximum likelihood estimation

Conditional log-likelihood and mean squared error

The maximum likelihood estimator can readily be generalised to the case where our goal is to estimate a conditional probability P(y | x; θ) in order to predict y given x. This is actually the most common situation because it forms the basis for most supervised learning. If X represents all our inputs and Y all our observed targets, then the conditional maximum likelihood estimator is

$$\theta_{ML} = \arg\max_{\theta} P(Y \mid X; \theta) \quad (5.62)$$

Given a model parametrised by θ, we perform maximum likelihood estimation by solving this problem. If the examples are assumed to be i.i.d., then it can be decomposed into the more convenient but equivalent optimisation problem

$$\theta_{ML} = \arg\max_{\theta} \sum_{i=1}^{m} \log P(y^{(i)} \mid x^{(i)}; \theta) \quad (5.63)$$

Linear regression may be justified as a maximum likelihood procedure. Previously, we motivated linear regression as an algorithm that learns to take an input x and produce an output value ŷ, with the mapping from x to ŷ chosen to minimise the mean squared error, a criterion that was introduced more or less arbitrarily. We now revisit linear regression from the point of view of maximum likelihood: instead of producing a single prediction ŷ, we think of the model as producing a conditional distribution p(y | x).
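A quick numerical check of this view, under the assumption of a linear-Gaussian model with fixed noise variance (data and parameters are synthetic): the conditional log-likelihood (5.63) differs from the negative mean squared error only by constants, so maximising one minimises the other.

import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=100)

def cond_log_lik(w, b, sigma=0.5):
    # Decomposition (5.63) for i.i.d. data: sum_i log N(y_i | w x_i + b, sigma^2).
    r = y - (w * x + b)
    return np.sum(-0.5 * np.log(2 * np.pi * sigma ** 2) - r ** 2 / (2 * sigma ** 2))

# The generating parameters score far higher than a bad guess; maximising
# this quantity in (w, b) is the same as minimising the MSE.
print(cond_log_lik(2.0, 1.0), cond_log_lik(0.0, 0.0))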
Maximum likelihood estimation

Binary classification problem: training data x = {x^1, ..., x^m}, y = {y^1, ..., y^m}, with labels y_j ∈ {0, 1}, and model outputs a_j = P(y_j = 1 | x_j, θ). The likelihood is

$$p(y \mid x, \theta) = \prod_j a_j^{y_j} (1 - a_j)^{1 - y_j}$$

Cross-entropy loss (this is the conditional log-likelihood; the loss to minimise is its negative):

$$L(\theta) = \sum_j y_j \log(a_j) + (1 - y_j) \log(1 - a_j)$$
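A direct transcription of this loss; the label and probability arrays are illustrative.

import numpy as np

def cross_entropy_loss(y, a, eps=1e-12):
    # Negative log-likelihood of Bernoulli outputs a_j = P(y_j = 1 | x_j, theta).
    a = np.clip(a, eps, 1 - eps)   # avoid log(0)
    return -np.sum(y * np.log(a) + (1 - y) * np.log(1 - a))

y = np.array([1.0, 0.0, 1.0, 1.0])   # labels
a = np.array([0.9, 0.2, 0.7, 0.6])   # model probabilities
print(cross_entropy_loss(y, a))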


The minimum variance unbiased estimator (MVUE) is sometimes used instead. It is

$$\hat{\sigma}^2_{ik} = \frac{1}{\left( \sum_j \delta(Y^j = y_k) \right) - 1} \sum_j \left( X_i^j - \hat{\mu}_{ik} \right)^2 \delta(Y^j = y_k) \quad (15)$$

Logistic Regression

• Learn P(Y|X) directly!

Logistic Regression is an approach to learning functions of the form f : X → Y, or P(Y|X) in the case where Y is discrete-valued, and X = ⟨X1 ... Xn⟩ is any vector containing discrete or continuous variables. In this section we will primarily consider the case where Y is a boolean variable, in order to simplify notation. Later we extend our treatment to the case where Y takes on any finite number of discrete values.

Logistic function (sigmoid): σ(z) = 1 / (1 + e^{-z}).

• Sigmoid applied to a linear function of the data.

Logistic Regression assumes a parametric form for the distribution P(Y|X), and directly estimates its parameters from the training data. The parametric model assumed by Logistic Regression in the case where Y is boolean is:

$$P(Y = 1 \mid X) = \frac{1}{1 + \exp\left( w_0 + \sum_{i=1}^{n} w_i X_i \right)} \quad (16)$$

$$P(Y = 0 \mid X) = \frac{\exp\left( w_0 + \sum_{i=1}^{n} w_i X_i \right)}{1 + \exp\left( w_0 + \sum_{i=1}^{n} w_i X_i \right)} \quad (17)$$

Features can be discrete or continuous!

Note that equation (17) follows directly from equation (16), because the sum of the two probabilities must equal 1. One highly convenient property of this form for P(Y|X) is that it leads to a simple linear expression for classification.
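Equations (16)-(17) in code, keeping their sign convention; the input vector and weights below are placeholders.

import numpy as np

def p_y_given_x(X, w0, w):
    # Per (16): P(Y=1|X) = 1 / (1 + exp(w0 + sum_i w_i X_i)); (17) is the rest.
    z = w0 + X @ w
    p1 = 1.0 / (1.0 + np.exp(z))
    return p1, 1.0 - p1

X = np.array([0.5, -1.2, 3.0])
print(p_y_given_x(X, w0=0.1, w=np.array([0.4, -0.3, 0.2])))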
Logistic function in n dimensions

• Sigmoid applied to a linear function of the data
• Features can be discrete or continuous!

[Figure] Surface plot of the logistic function of two inputs x1, x2, showing the sigmoidal transition across the linear decision boundary.
Logistic Regression: decision boundary

• Prediction: output the label Y with the highest P(Y|X).
• For binary Y, we assign the label Y = 0 if the following condition holds:

$$1 < \frac{P(Y = 0 \mid X)}{P(Y = 1 \mid X)}$$

Substituting from equations (16) and (17), this becomes

$$1 < \exp\left( w_0 + \sum_{i=1}^{n} w_i X_i \right)$$

and taking the natural log of both sides we have a linear classification rule that assigns label Y = 0 if X satisfies

$$0 < w_0 + \sum_{i=1}^{n} w_i X_i \quad (18)$$

and assigns Y = 1 otherwise. A linear classifier!

Interestingly, the parametric form of P(Y|X) used by Logistic Regression is precisely the form implied by the assumptions of a Gaussian Naive Bayes classifier. Therefore, we can view Logistic Regression as a closely related alternative to GNB.
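The resulting rule (18) as a predictor, with the same placeholder weights as above.

import numpy as np

def predict(X, w0, w):
    # Rule (18): assign Y = 0 when w0 + sum_i w_i X_i > 0, else Y = 1.
    return 0 if (w0 + X @ w) > 0 else 1

print(predict(np.array([0.5, -1.2, 3.0]), 0.1, np.array([0.4, -0.3, 0.2])))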
Maximising the conditional log-likelihood

• Bad news: no closed-form solution to maximise l(w)
• Good news: l(w) is a concave function of w!
  – No local minima
  – Concave functions are easy to optimise

Optimising a concave function: gradient ascent

• The conditional likelihood for Logistic Regression is concave!
• Gradient: $\nabla_w l(w) = \left[ \frac{\partial l(w)}{\partial w_0}, \ldots, \frac{\partial l(w)}{\partial w_n} \right]$
• Update rule, with learning rate η > 0: $w_i \leftarrow w_i + \eta \, \frac{\partial l(w)}{\partial w_i}$
• Gradient ascent is the simplest of optimisation approaches
Maximise the conditional log-likelihood: gradient ascent

Writing the model here with the exponential on the Y = 1 class, i.e. P(Y = 1 | x, w) = exp(w0 + Σi wi xi) / (1 + exp(w0 + Σi wi xi)) (this just relabels the classes relative to (16)-(17)), the conditional log-likelihood of the training set is

$$l(w) = \sum_j \left[ y^j \ln P(Y = 1 \mid x^j, w) + (1 - y^j) \ln P(Y = 0 \mid x^j, w) \right] = \sum_j \left[ y^j \left( w_0 + \sum_i w_i x_i^j \right) - \ln\left( 1 + \exp\left( w_0 + \sum_i w_i x_i^j \right) \right) \right]$$

Differentiating term by term gives the simple gradient

$$\frac{\partial l(w)}{\partial w_i} = \sum_j x_i^j \left( y^j - \hat{P}(Y = 1 \mid x^j, w) \right)$$

i.e. the feature value times the prediction error, summed over the training examples.
Gradient ascent for LR

Gradient ascent algorithm (learning rate η > 0):

do:
    for i = 1 to n (iterate over features):
        $w_i \leftarrow w_i + \eta \sum_j x_i^j \left( y^j - \hat{P}(Y = 1 \mid x^j, w) \right)$
until "change" < ε

Loop over training examples!
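A compact sketch of this training loop, using the convention of the derivation above (exponential on the Y = 1 class); the synthetic data, learning rate, iteration cap, and stopping threshold are all arbitrary choices.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                       # 200 examples, 2 features
y = (X @ np.array([1.5, -2.0]) + 0.3 > 0).astype(float)

w, w0, eta = np.zeros(2), 0.0, 0.5
for _ in range(2000):
    p1 = 1.0 / (1.0 + np.exp(-(w0 + X @ w)))        # P(Y=1 | x, w)
    grad_w = X.T @ (y - p1) / len(y)                # feature times prediction error
    grad_w0 = np.mean(y - p1)
    w, w0 = w + eta * grad_w, w0 + eta * grad_w0    # ascent step
    if max(np.abs(grad_w).max(), abs(grad_w0)) < 1e-4:   # "change" < epsilon
        break

print(w0, w)   # roughly proportional to the generating hyperplane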
Large parameters

[Figure] Plots of the sigmoid 1/(1 + e^{-ax}) on x ∈ [-5, 5] for a = 1, a = 5, and a = 10: larger weights make the sigmoid steeper, approaching a step function.
• Maximum likelihood solution: prefers higher weights
  – higher likelihood of (properly classified) examples close to the decision boundary
  – larger influence of corresponding features on the decision
  – can cause overfitting!
• Regularisation: penalise high weights
That's all MLE. How about MAP?

• One common approach is to define priors on w
  – Normal distribution, zero mean, identity covariance
  – "Pushes" parameters towards zero
• Regularisation
  – Helps avoid very large weights and overfitting
• MAP estimate:

$$w^* = \arg\max_w \ln \left[ p(w) \prod_j P(y^j \mid x^j, w) \right]$$
MAP as regularisation

With a zero-mean Gaussian prior on the weights, the log-prior is a quadratic penalty:

$$\ln p(w) \propto -\frac{\lambda}{2} \sum_i w_i^2, \qquad \frac{\partial \ln p(w)}{\partial w_i} = -\lambda w_i$$

• Add ln p(w) to the objective:
  – Quadratic penalty: drives weights towards zero
  – Adds a negative linear term to the gradients

Penalises high weights, just like we did with SVMs!
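The only change to the earlier training loop is the extra term in the gradient; λ and the tiny dataset below are illustrative.

import numpy as np

def map_gradient(w, X, y, lam=0.1):
    # Gradient of l(w) + ln p(w): the Gaussian prior's quadratic penalty
    # -lam/2 * ||w||^2 contributes the negative linear term -lam * w.
    p1 = 1.0 / (1.0 + np.exp(-(X @ w)))
    return X.T @ (y - p1) - lam * w

X = np.array([[1.0, 0.5], [1.0, -2.0]])   # first column acts as the bias feature
y = np.array([1.0, 0.0])
print(map_gradient(np.zeros(2), X, y))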
MLE vs. MAP

• Maximum conditional likelihood estimate:

$$w^*_{MLE} = \arg\max_w \sum_j \ln P(y^j \mid x^j, w)$$

• Maximum conditional a posteriori estimate:

$$w^*_{MAP} = \arg\max_w \left[ \ln p(w) + \sum_j \ln P(y^j \mid x^j, w) \right]$$
Training a single neuron

[Figure sequence: 15 frames] Gradient-based training of a single neuron on 2-D inputs (axes z1, z2), showing the decision boundary for the regularised weights w_reg (left) and the unregularised weights w (right) at successive iterations, together with the original and regularised objectives (log scale) over 50 iterations. Both runs start at w = [0, -1]; the regularised weights settle at w_reg ≈ [1, 1.1], while the unregularised weights keep growing (w = [1.9, 1.7], [2.2, 2.7], [2.5, 4], ...) as the maximum likelihood solution pushes towards ever larger weights.
Logistic regression for discrete classification

Logistic regression in the more general case, where the set of possible Y is {y1, ..., yR}:

• Define a weight vector wk for each yk, k = 1, ..., R-1:

$$P(Y = y_k \mid X) \propto \exp\left( w_{k0} + \sum_i w_{ki} X_i \right) \quad \text{for } k < R$$

• For k = R (normalisation, so no weights for this class):

$$P(Y = y_R \mid X) = 1 - \sum_{j=1}^{R-1} P(Y = y_j \mid X)$$

• Predict the class yk whose posterior P(Y = yk|X) is biggest.

Features can be discrete or continuous!
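A sketch of this parameterisation, with R-1 weight vectors plus the normalisation class; the weights and input are illustrative.

import numpy as np

def multiclass_posteriors(x, W, b):
    # P(Y=y_k|X) ∝ exp(b_k + w_k · x) for k < R; class R takes the remainder.
    scores = np.exp(b + W @ x)            # unnormalised, classes 1 .. R-1
    p = scores / (1.0 + scores.sum())     # the '1' is class R's exp(0)
    return np.append(p, 1.0 - p.sum())    # P(Y=y_R|X) by normalisation

W = np.array([[0.5, -1.0],                # R-1 = 2 weight vectors
              [1.2,  0.3]])
b = np.array([0.1, -0.2])
print(multiclass_posteriors(np.array([1.0, 2.0]), W, b))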
Information theory

The information content of observing a value x can be measured by h(x) = -log2 p(x), where the negative sign ensures that information is positive or zero; low-probability events x correspond to high information content. The choice of basis for the logarithm is arbitrary; for the moment we adopt the convention, prevalent in information theory, of using logarithms to the base of 2, in which case the units of h(x) are bits ('binary digits').

Entropy: now suppose that a sender wishes to transmit the value of a random variable to a receiver. The average amount of information that they transmit in the process is obtained by taking the expectation of h(x) with respect to the distribution p(x):

$$H[x] = -\sum_x p(x) \log_2 p(x) \quad (1.93)$$

This important quantity is called the entropy of the random variable x. Note that lim_{p→0} p ln p = 0, so we take p(x) ln p(x) = 0 whenever we encounter a value of x such that p(x) = 0.

H[x] is the expected number of bits needed to encode a randomly drawn value of x (under the most efficient code).

For a continuous variable, the discrete and continuous forms of the entropy differ by a quantity ln Δ, which diverges in the limit Δ → 0. This reflects the fact that specifying a continuous variable very precisely requires a large number of bits. For a density defined over multiple continuous variables, denoted collectively by the vector x, the differential entropy is given by

$$H[x] = -\int p(x) \ln p(x) \, dx \quad (1.104)$$

For discrete distributions, the maximum entropy configuration corresponds to an equal distribution of probabilities across the possible states. For a continuous variable, in order for the maximum entropy to be well defined it is necessary to constrain the first and second moments of p(x) as well as preserving the normalisation constraint; maximising the differential entropy under these constraints yields the Gaussian.
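Entropy of a discrete distribution in a few lines; the example distributions are arbitrary.

import numpy as np

def entropy(p):
    # H[x] = -sum_x p(x) log2 p(x), with 0 log 0 = 0 by convention.
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log2(p[nz]))

print(entropy([0.5, 0.5]))      # 1 bit: a fair coin flip
print(entropy([0.25] * 4))      # 2 bits: uniform over 4 values
print(entropy([1.0, 0.0]))      # 0 bits: no uncertainty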
Entropy

Entropy H(Y) of a random variable Y:

$$H(Y) = -\sum_k P(Y = y_k) \log_2 P(Y = y_k)$$

More uncertainty, more entropy!

• "High entropy":
  – Y is from a uniform-like distribution
  – Flat histogram
  – Values sampled from it are less predictable

• "Low entropy":
  – Y is from a varied (peaks and valleys) distribution
  – Histogram has many lows and highs
  – Values sampled from it are more predictable

Information theory interpretation: H(Y) is the expected number of bits needed to encode a randomly drawn value of Y (under the most efficient code).

[Figure] Entropy of a coin flip as a function of the probability of heads: maximal for a fair coin.

(Slide from Vibhav Gogate)
Information theory

For a joint distribution p(x, y), the conditional entropy satisfies

$$H[x, y] = H[y \mid x] + H[x] \quad (1.112)$$

where H[x, y] is the differential entropy of p(x, y) and H[x] is the differential entropy of the marginal distribution p(x). Thus the information needed to describe x and y is given by the sum of the information needed to describe x alone plus the additional information required to specify y given x.

Relative entropy and mutual information

So far we have introduced a number of concepts from information theory, including the key notion of entropy. We now start to relate these ideas to pattern recognition. Consider some unknown distribution p(x), and suppose that we have modelled it using an approximating distribution q(x). If we use q(x) to construct a coding scheme for the purpose of transmitting values of x to a receiver, then the average additional amount of information (in nats) required to specify the value of x (assuming we choose an efficient coding scheme) as a result of using q(x) instead of the true distribution p(x) is given by

$$\mathrm{KL}(p \| q) = -\int p(x) \ln q(x) \, dx - \left( -\int p(x) \ln p(x) \, dx \right) = -\int p(x) \ln \left( \frac{q(x)}{p(x)} \right) dx \quad (1.113)$$

This is known as the relative entropy or Kullback-Leibler divergence, or KL divergence (Kullback and Leibler, 1951), between the distributions p(x) and q(x). Note that it is not a symmetrical quantity: KL(p‖q) ≢ KL(q‖p).

KL is a divergence: it is non-negative, KL(p‖q) ≥ 0, with equality if and only if p(x) = q(x). This can be shown by introducing the concept of convex functions: a function f(x) is said to be convex if every chord lies on or above the function (Figure 1.31).
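KL divergence between two discrete distributions (values arbitrary), checking non-negativity and asymmetry.

import numpy as np

def kl(p, q):
    # KL(p||q) = sum_x p(x) ln(p(x)/q(x)), in nats; assumes q > 0 where p > 0.
    p, q = np.asarray(p, float), np.asarray(q, float)
    nz = p > 0
    return np.sum(p[nz] * np.log(p[nz] / q[nz]))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.4, 0.4, 0.2])
print(kl(p, q), kl(q, p))   # both non-negative, and generally not equal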
If we use a distribution that is different from the true one, then we must necessarily have a less efficient coding, and on average the additional information that must be transmitted is (at least) equal to the Kullback-Leibler divergence between the two distributions.

Minimising the KL divergence = maximum likelihood

Suppose that data is being generated from an unknown distribution p(x) that we wish to model. We can try to approximate this distribution using some parametric distribution q(x|θ), governed by a set of adjustable parameters θ, for example a multivariate Gaussian. One way to determine θ is to minimise the Kullback-Leibler divergence between p(x) and q(x|θ) with respect to θ. We cannot do this directly because we don't know p(x). Suppose, however, that we have observed a finite set of training points xn, for n = 1, ..., N, drawn from p(x). Then the expectation with respect to p(x) can be approximated by a finite sum over these points, so that

$$\mathrm{KL}(p \| q) \simeq \sum_{n=1}^{N} \left\{ -\ln q(x_n \mid \theta) + \ln p(x_n) \right\} \quad (1.119)$$

The second term on the right-hand side of (1.119) is independent of θ, and the first term is the negative log-likelihood function for θ under the distribution q(x|θ) evaluated using the training set. Thus maximising the likelihood is equivalent to minimising the KL divergence between our model and the data distribution.

Cross entropy:

$$H(p, q) = -\mathbb{E}_p[\log q] = H(p) + D_{KL}(p \| q)$$

Mutual information: now consider the joint distribution between two sets of variables x and y given by p(x, y). If the sets of variables are independent, then their joint distribution factorises into the product of their marginals, p(x, y) = p(x)p(y). If the variables are not independent, we can gain some idea of whether they are 'close' to being independent by considering the Kullback-Leibler divergence between the joint distribution and the product of the marginals.
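A numerical check of the cross-entropy identity H(p, q) = H(p) + D_KL(p‖q), with arbitrary discrete distributions.

import numpy as np

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.4, 0.4, 0.2])

H_p = -np.sum(p * np.log(p))        # entropy H(p), in nats
H_pq = -np.sum(p * np.log(q))       # cross entropy H(p, q) = -E_p[log q]
kl = np.sum(p * np.log(p / q))      # D_KL(p || q)
print(np.isclose(H_pq, H_p + kl))   # True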
