Evaluation:
• Two practical assignments based on Stanford's CS231N course
• Homework:
• A third practical assignment using recurrent networks
• A project of your own using one of the techniques learned
Wikipedia:
In computer science, machine learning is a branch of artificial intelligence whose
goal is to develop techniques that allow computers to learn.
Rosenblatt, 1958
Given a shape, is it made up of a single connected component?
Recognition in images:
• Large intra-class variability, deformations, illumination changes, etc.
• It is impossible to write rules that precisely describe an object
Learned programs:
• The learned program is very different from a hand-written program. For
example, it may contain an enormous quantity of numbers.
• The program must be able to produce good results on data different from the
data it was trained on.
• If the data of interest changes, the program can be adapted by retraining it
with new data representative of that change.
Supervised learning
• Predict an output given an input.
Unsupervised learning
• Find a good representation of the data.
Learning process:
• A model class F, such that f(x) ≈ y.
• A discrepancy measure, L(f(x), y), between predictions f(x) and targets y.
• A method for choosing f ∈ F so as to minimize

  \sum_{i=1}^{n} L(f(x_i), y_i)
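As a purely illustrative sketch of this setup, the snippet below evaluates the empirical risk of a hypothetical linear model under a squared-error loss; both choices are assumptions for the example, not part of the slides.

import numpy as np

def empirical_risk(f, loss, X, y):
    # Sum of discrepancies between predictions f(x_i) and targets y_i.
    return sum(loss(f(xi), yi) for xi, yi in zip(X, y))

# Hypothetical model class: linear functions; hypothetical loss: squared error.
w, b = np.array([2.0]), 0.5
f = lambda x: float(w @ x + b)
squared_loss = lambda pred, target: (pred - target) ** 2

X = np.array([[0.0], [1.0], [2.0]])
y = np.array([0.4, 2.6, 4.5])
print(empirical_risk(f, squared_loss, X, y))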
Interpolation problem:
• Given x, use the nearby training data {(x_i, y_i)}_i to predict f(x).
[Figure: the classical recognition pipeline — fixed low-level features (SIFT, HoG),
unsupervised mid-level features (K-means, sparse coding), and a supervised stage
(pooling + classifier).]
Feature visualization of convolutional net trained on ImageNet from [Zeiler & Fergus 2013]
• Images:
  • Pixels → edges → textures → parts → objects
• Text:
  • Characters → words → word groups → sentences → stories
• Audio:
  • Samples → spectral bands → sounds ... → phonemes → words
Main reasons:
(image by R. Pieters)
• Deep learning techniques hold the state of the art in nearly all computer
vision problems.
[Vinyals et al '14, Karpathy et al '14, Donahue et al '14, Kiros et al '14, MSR '14]
Gatys et al 2015
Gatys et al 2015
These slides are adapted (or most often directly taken) from presentations by
David Sontag and Andrew Zisserman.
Classification
Minimising the misclassification rate (Bishop, §1.5):
For two classes, the probability of a mistake is

  p(\mathrm{mistake}) = \int_{R_1} p(x, C_2)\,dx + \int_{R_2} p(x, C_1)\,dx.

We are free to choose the decision rule that assigns each point x to one of the classes.
Clearly, to minimize p(mistake) we should assign each x to the class with the smaller
value of the integrand; since p(x, C_k) = p(C_k|x)\,p(x) and p(x) is common to both
terms, this means assigning each x to the class with the largest posterior probability
p(C_k|x). For the more general case of K classes it is slightly easier to maximize the
probability of being correct, which leads to the same rule:

  f(x) = \arg\max_k \; p(C_k|x)

k-NN and decision theory:
The k-NN classifier approximates this rule by

  f(x) = \arg\max_k \; \frac{N_k}{N},

where N_k is the number of points of class k among the N nearest training samples.
Two approximations:
conditioning at a point is relaxed to conditioning on some region “close” to the
target point.
Learning machine
K Nearest Neighbour (K-NN) Classifier
Algorithm
• For each test point, x, to be classified, find the K nearest
samples in the training data
• Classify the point, x, according to the majority vote of their
class labels
e.g. K = 3
• applicable to the multi-class case
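A minimal NumPy sketch of this K-NN classification procedure (Euclidean distance is an assumed choice; the slides only require that some distance metric be specified).

import numpy as np

def knn_classify(x, X_train, y_train, K=3):
    # Distances from the test point to every training sample (Euclidean, by assumption).
    dists = np.linalg.norm(X_train - x, axis=1)
    # Indices of the K nearest training samples.
    nearest = np.argsort(dists)[:K]
    # Majority vote over their class labels.
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]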
[Figures: K-NN decision boundaries on a toy dataset for K = 1.]
• Choosing the values for the parameters that minimize the loss
function on the training data is not necessarily the best policy
Advantages:
• K-NN is a simple but effective classification procedure
• Applies to multi-class classification
• Decision surfaces are non-linear
• Quality of predictions automatically improves with more training
data
• Only a single parameter, K; easily tuned by cross-validation
Summary
Disadvantages:
• What does "nearest" mean? Need to specify a distance metric.
• Computational cost: must store and search through the entire training set at
test time. This can be alleviated by thinning the training set and by using
efficient data structures such as KD-trees.
Regression
K-NN Regression
Algorithm
• For each test point, x, find the K nearest samples x_i in the training data and
their values y_i
• Output is the mean of their values:

  f(x) = \frac{1}{K} \sum_{i=1}^{K} y_i

• Again, need to choose (learn) K
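A matching sketch for K-NN regression, again under the assumed Euclidean distance.

import numpy as np

def knn_regress(x, X_train, y_train, K=3):
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:K]
    # f(x) is the mean of the K nearest target values.
    return y_train[nearest].mean()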
Regression example: polynomial curve fitting
• The green curve is the true function (which is not a polynomial); figure from Bishop
[Figure: target values and the fitted polynomial regression, illustrating over-fitting.]
Over-fitting
• test data: a different sample from the same true function
Root-Mean-Square (RMS) Error:
• If the model has as many degrees of freedom as the data, it can fit the
training data perfectly
How to prevent over-fitting? "Ridge" regression
Polynomial coefficients
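A minimal sketch of ridge-regularised polynomial fitting; the degree and the penalty weight lam are illustrative values, not taken from the slides.

import numpy as np

def fit_ridge_poly(x, y, degree=9, lam=1e-3):
    # Design matrix of polynomial features: Phi[i, j] = x_i ** j.
    Phi = np.vander(x, degree + 1, increasing=True)
    # Closed-form ridge solution: w = (Phi^T Phi + lam I)^(-1) Phi^T y.
    A = Phi.T @ Phi + lam * np.eye(degree + 1)
    return np.linalg.solve(A, Phi.T @ y)

x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x) + 0.1 * np.random.randn(10)
w = fit_ridge_poly(x, y)  # polynomial coefficients, kept small by the penalty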
linearly separable
not linearly separable
Linear classifiers

  f(x) = w^\top x + b

[Figure: the hyperplane w^\top x + b = 0 in the (X1, X2) plane, with f(x) < 0 on one
side and f(x) > 0 on the other; w is the normal vector.]
Find w and b such that f(x_i) = w^\top x_i + b separates the categories for
i = 1, ..., N.
• how can we find this separating hyperplane?
NB: after convergence, w = \sum_{i=1}^{N} \alpha_i x_i
Perceptron
Figure 4.7 (Bishop): Illustration of the convergence of the perceptron learning algorithm, showing data points from two
classes (red and blue) in a two-dimensional feature space (φ1, φ2). The top left plot shows the initial parameter
vector w shown as a black arrow together with the corresponding decision boundary (black line), in which the
arrow points towards the decision region which is classified as belonging to the red class. The data point circled ...
Perceptron example
[Figure: perceptron iterations on a toy 2-D dataset.]
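A minimal sketch of the perceptron learning rule (labels in {-1, +1}; the learning rate and fixed epoch count are assumptions for the example).

import numpy as np

def perceptron_train(X, y, epochs=100, lr=1.0):
    # w and b start at zero; y must contain labels -1 or +1.
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            # Update only on misclassified points.
            if yi * (w @ xi + b) <= 0:
                w += lr * yi * xi
                b += lr * yi
    return w, b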
[Figure: linear SVM; the decision boundary w^\top x + b = 0 lies at distance b/||w||
from the origin, and the support vectors are the training points closest to it.]
The solution can be written in terms of the support vectors:

  f(x) = \sum_i \alpha_i y_i \,(x_i^\top x) + b
SVM – sketch derivation
[Figure: the margin hyperplanes w^\top x + b = 1 and w^\top x + b = -1 on either side of
the decision boundary w^\top x + b = 0, touching the support vectors.]
SVM – Optimization
• Or equivalently

  \min_{w} ||w||^2 \quad \text{subject to} \quad y_i\,(w^\top x_i + b) \ge 1 \quad \text{for } i = 1 \dots N

In general there is a trade-off between the margin and the number of
mistakes on the training data.
Introduce "slack" variables ξ_i ≥ 0; the margin is 2/||w||.
Misclassified point
• for 0 < ξ_i ≤ 1 the point is between the margin and the correct side of the
hyperplane, at distance ξ_i/||w|| from its margin hyperplane. This is a margin violation.
• for ξ_i > 1 the point is misclassified.
[Figure: soft-margin SVM; support vectors lie on the margin hyperplanes w^\top x + b = 1
and w^\top x + b = -1 around the decision boundary w^\top x + b = 0, and points on the
correct side of the margin have ξ = 0.]
“Soft” margin solution
The optimization problem becomes

  \min_{w \in \mathbb{R}^d,\; \xi_i \in \mathbb{R}^+} \; ||w||^2 + C \sum_{i}^{N} \xi_i
  \quad \text{subject to} \quad y_i\,(w^\top x_i + b) \ge 1 - \xi_i \quad \text{for } i = 1 \dots N

• C is a regularization parameter:
[Figure: training data and SVM decision boundary in the (feature x, feature y) plane.]
The constraint y_i (w^\top x_i + b) \ge 1 - \xi_i can be written more concisely as

  y_i f(x_i) \ge 1 - \xi_i

which, together with ξ_i ≥ 0, is equivalent to

  \xi_i = \max(0,\; 1 - y_i f(x_i))

• SVM uses the "hinge" loss max(0, 1 − y_i f(x_i))
• an approximation to the 0-1 loss
Optimization continued
  \min_{w \in \mathbb{R}^d} \; C \sum_{i}^{N} \max\left(0,\; 1 - y_i f(x_i)\right) + ||w||^2

[Figure: a non-convex function with both a local and a global minimum.]
If the cost function is convex, then a locally optimal point is globally optimal (provided
the optimization is over a convex set, which it is in our case)
Convex functions
Convex function examples
SVM
  \min_{w \in \mathbb{R}^d} \; C \sum_{i}^{N} \max\left(0,\; 1 - y_i f(x_i)\right) + ||w||^2 \qquad \text{(convex)}
Gradient (or steepest) descent algorithm for SVM
To minimize a cost function C(w) use the iterative update
For the hinge loss L = \max(0,\; 1 - y_i f(x_i)), the gradient with respect to w is

  \frac{\partial L}{\partial w} = -y_i x_i \quad \text{if } y_i f(x_i) < 1,
  \qquad \frac{\partial L}{\partial w} = 0 \quad \text{if } y_i f(x_i) > 1.
Sub-gradient descent algorithm for SVM
  C(w) = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{\lambda}{2} ||w||^2 + L(x_i, y_i; w) \right)
Then each iteration t involves cycling through the training data with the
updates:
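Since the update equations themselves appear only graphically on the slides, here is a hedged sketch of one standard form of this sub-gradient update (Pegasos-style, with a decreasing step size; the bias term is omitted for brevity).

import numpy as np

def svm_subgradient_train(X, y, lam=0.01, epochs=10):
    # y in {-1, +1}.
    w = np.zeros(X.shape[1])
    t = 0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            t += 1
            eta = 1.0 / (lam * t)          # decreasing step size
            if yi * (w @ xi) < 1:          # margin violated: hinge loss is active
                w = (1 - eta * lam) * w + eta * yi * xi
            else:                          # only the regularizer contributes
                w = (1 - eta * lam) * w
    return w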
[Figure: training energy vs. iteration (log scale) and the resulting decision boundary.]
How do we do multi-class classification?
Multi-class SVM
Simultaneously learn 3 sets of weights: w+, w−, wo
• How do we guarantee the correct labels?
• Need new constraints!
Decision theory
A decision rule divides the input space into regions R_k called decision regions, one
for each class, such that all points in R_k are assigned to class C_k. The boundaries
between decision regions are called decision boundaries or decision surfaces. Note that
each decision region need not be contiguous but could comprise some number of disjoint
regions. In order to find the optimal decision rule, consider first of all the case of
two classes, as in the cancer problem. A mistake occurs when an input vector belonging
to class C_1 is assigned to class C_2 or vice versa.
Minimising the misclassification rate: the probability of a mistake is

  p(\mathrm{mistake}) = p(x \in R_1, C_2) + p(x \in R_2, C_1)
                      = \int_{R_1} p(x, C_2)\,dx + \int_{R_2} p(x, C_1)\,dx.
We are free to choose the decision rule that assigns each point x to one of the two
classes. Clearly, to minimize p(mistake) we should arrange that each x is assigned to
whichever class has the smaller value of the integrand; since p(x, C_k) = p(C_k|x) p(x)
and the factor p(x) is common to both terms, the minimum probability of making a mistake
is obtained by assigning each x to the class for which the posterior probability
p(C_k|x) is largest.
Figure 1.24 (Bishop) illustrates this for a single input variable x: the joint
probabilities p(x, C_k) for each of two classes are plotted against x, together with a
decision boundary x = \hat{x}. Values x \ge \hat{x} are classified as C_2 and hence
belong to region R_2, whereas points x < \hat{x} are classified as C_1 and belong to
R_1. Errors arise from the blue, green and red regions: as the boundary \hat{x} is
varied, the combined area of the blue and green regions remains constant while the red
area varies. The optimal choice of \hat{x} is where the curves p(x, C_1) and p(x, C_2)
cross, since then the red region disappears and the rule coincides with the
minimum-misclassification rule above.
1.5.4 Inference and decision
We have broken the problem into two stages: inference and decision.
• Inference stage: we use training data to learn a model for p(C_k|x) (we estimate the
model parameters needed to predict the posteriors).
• Decision stage: we use these posteriors to make optimal class assignments.
Alternative approach:
Define a function that maps inputs to decisions, and solve the class assignments
directly (like the SVM).

In fact, we can identify three distinct approaches to solving decision problems, given
here in decreasing order of complexity:

(a) Generative models: first solve the inference problem of determining the
class-conditional densities p(x|C_k) for each class individually, and also separately
infer the prior class probabilities p(C_k). Then use Bayes' theorem in the form

  p(C_k|x) = \frac{p(x|C_k)\,p(C_k)}{p(x)}, \qquad p(x) = \sum_k p(x|C_k)\,p(C_k),

to find the posterior class probabilities p(C_k|x); equivalently, model the joint
distribution p(x, C_k) directly and normalize to obtain the posteriors. Approaches that
explicitly or implicitly model the distribution of inputs as well as outputs are known
as generative models, because by sampling from them it is possible to generate synthetic
data points in the input space.
• Full model of the problem: allows evaluating uncertainty. For example, with a reject
option we can introduce a threshold θ and reject those inputs x for which the largest
posterior probability p(C_k|x) is less than or equal to θ, classifying automatically
only the X-ray images for which there is little doubt as to the correct class while
leaving the more ambiguous cases to a human expert.
• Estimating densities in high dimensions is difficult.

(b) Discriminative models: first solve the inference problem of determining the
posterior class probabilities p(C_k|x), and then subsequently use decision theory to
assign each new x to one of the classes. Approaches that model the posterior
probabilities directly are called discriminative models.
• Model the posterior probabilities directly.
• Require significantly fewer parameters.

(c) Direct approaches: find a function f(x), called a discriminant function, which maps
each input x directly onto a class label.
• Simplest approach; can be very effective.
Advantages of estimating probabilities
Minimising risk: cases in which making one type of mistake has more "cost" than
another. If we assign an input to class C_j when the true class is C_k, we incur a loss
L_kj, the (k, j) element of a loss matrix. In the cancer example (Bishop, Figure 1.25):

                 assigned cancer   assigned normal
  true cancer          0               1000
  true normal          1                  0

so there is no loss for a correct decision, a loss of 1 if a healthy patient is
diagnosed as having cancer, and a loss of 1000 if a patient having cancer is diagnosed
as healthy. The matrix L measures the relative "cost" of the different types of mistake.
The optimal solution minimizes the expected loss

  E[L] = \sum_k \sum_j \int_{R_j} L_{kj}\, p(x, C_k)\, dx,

which is achieved by assigning each new x to the class j that minimizes

  \sum_k L_{kj}\, p(C_k|x).

This is clearly trivial to do once we know the posterior class probabilities p(C_k|x).
[Figure 1.26 (Bishop): illustration of the reject option; inputs x for which the larger
of the two posterior probabilities p(C_1|x), p(C_2|x) is below a threshold θ fall in the
reject region.]
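A minimal sketch of this decision rule using the loss matrix above; the posterior values are made-up numbers for illustration.

import numpy as np

# Rows: true class (cancer, normal); columns: assigned class (cancer, normal).
L = np.array([[0, 1000],
              [1,    0]])

def decide(posteriors, loss=L):
    # Expected loss of assigning class j: sum_k L[k, j] * p(C_k | x).
    expected_loss = loss.T @ posteriors
    return np.argmin(expected_loss)

# Even a small posterior probability of cancer triggers the "cancer" decision:
print(decide(np.array([0.05, 0.95])))  # -> 0 (cancer): 0.95*1 < 0.05*1000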
Advantages of estimating probabilities
Rejection option:
[Figure: reject region where neither posterior probability exceeds the threshold θ.]
Predicting all samples from the dominant class gives high accuracy.
We can train on a balanced set and then rescale using the correct priors.
Model combination:
For complex applications, break the problem into smaller subproblems each of
which can be tackled by a separate module.
Maximum likelihood estimation
Given a model parametrised by θ, we perform maximum likelihood estimation by solving
(Goodfellow et al., Machine Learning Basics): the maximum likelihood estimator can
readily be generalized to the case where our goal is to estimate a conditional
probability P(y | x; θ) in order to predict y given x. This is actually the most common
situation because it forms the basis for most supervised learning. If X represents all
our inputs and Y all our observed targets, then the conditional maximum likelihood
estimator is

  \theta_{ML} = \arg\max_{\theta} P(Y \mid X; \theta).

To obtain a more convenient but equivalent optimisation problem, note that if the
examples are assumed to be i.i.d., this can be decomposed into

  \theta_{ML} = \arg\max_{\theta} \sum_{i=1}^{m} \log P(y^{(i)} \mid x^{(i)}; \theta).

Linear regression may be justified as a maximum likelihood procedure: instead of
producing a single prediction ŷ chosen to minimize mean squared error (a criterion that
was introduced more or less arbitrarily), we think of the model as producing a
conditional distribution p(y | x).
Binary classification problem: given training data x = {x^1, ..., x^m},
y = {y^1, ..., y^m} with y_j ∈ {0, 1}, let a_j = P(y_j = 1 | x_j, θ) denote the model's
predicted probability. Then

  p(y \mid x, \theta) = \prod_j a_j^{\,y_j} (1 - a_j)^{1 - y_j}

Cross entropy loss: the conditional log-likelihood to maximise (whose negative is the
cross-entropy loss) is

  L(\theta) = \sum_j \left[ y_j \log(a_j) + (1 - y_j) \log(1 - a_j) \right]
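A minimal NumPy sketch of this log-likelihood / cross-entropy computation (the clipping constant eps is an added numerical-stability detail, not from the slides).

import numpy as np

def log_likelihood(a, y, eps=1e-12):
    # a: predicted probabilities P(y_j = 1 | x_j, theta); y: binary labels in {0, 1}.
    a = np.clip(a, eps, 1 - eps)  # avoid log(0)
    return np.sum(y * np.log(a) + (1 - y) * np.log(1 - a))

# The cross-entropy loss is simply the negative log-likelihood.
cross_entropy = lambda a, y: -log_likelihood(a, y)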
In this section we extend our treatment to the case where Y takes on any finite set of
discrete values.
Logistic Regression assumes a parametric form for the distribution P(Y|X), and directly
estimates its parameters from the training data.
• Sigmoid applied to a linear function of the data.
The parametric form assumed by Logistic Regression in the case where Y is boolean is:

  P(Y = 1 | X) = \frac{1}{1 + \exp(w_0 + \sum_{i=1}^{n} w_i X_i)}   (16)

  P(Y = 0 | X) = \frac{\exp(w_0 + \sum_{i=1}^{n} w_i X_i)}{1 + \exp(w_0 + \sum_{i=1}^{n} w_i X_i)}   (17)

Features can be discrete or continuous!
Note that equation (17) follows directly from equation (16), because the two
probabilities must sum to 1.
Logistic Function in n Dimensions
One highly convenient property of this form for P(Y|X) is that it leads to a simple
linear expression for classification. To classify any given X we generally want to
assign the value y_k that maximizes P(Y = y_k | X). Put another way, we assign the label
Y = 0 if the following condition holds:

  1 < \frac{P(Y = 0 \mid X)}{P(Y = 1 \mid X)}

Substituting from equations (16) and (17), this becomes

  1 < \exp\!\left(w_0 + \sum_{i=1}^{n} w_i X_i\right)

and taking the natural log of both sides gives a linear classification rule that assigns
the label Y = 0 if X satisfies

  0 < w_0 + \sum_{i=1}^{n} w_i X_i   (18)

and assigns Y = 1 otherwise. A Linear Classifier!
Interestingly, the parametric form of P(Y|X) used by Logistic Regression is precisely
the form implied by the assumptions of a Gaussian Naive Bayes classifier. Therefore, we
can view Logistic Regression as a closely related alternative to Gaussian Naive Bayes.
Maximizing Conditional Log Likelihood
Gradient:
• Gradient ascent is the simplest of optimization approaches
The conditional log likelihood of the training data is

  l(w) = \sum_j \left[ y^j \ln P(Y=1 \mid x^j, w) + (1 - y^j) \ln P(Y=0 \mid x^j, w) \right]
       = \sum_j \left[ y^j \Big(w_0 + \sum_i w_i x_i^j\Big) - \ln\Big(1 + \exp\big(w_0 + \sum_i w_i x_i^j\big)\Big) \right]

(the expanded form uses the convention P(Y=1 \mid x, w) = \exp(w_0 + \sum_i w_i x_i) / (1 + \exp(w_0 + \sum_i w_i x_i))).

Maximize Conditional Log Likelihood: Gradient ascent
Differentiating with respect to each weight gives

  \frac{\partial l(w)}{\partial w_i}
    = \sum_j x_i^j \left( y^j - P(Y=1 \mid x^j, w) \right)
    = \sum_j x_i^j \left( y^j - \frac{\exp(w_0 + \sum_i w_i x_i^j)}{1 + \exp(w_0 + \sum_i w_i x_i^j)} \right)
Gradient Ascent for LR
Gradient ascent algorithm (learning rate η > 0):
do:
  For i = 1 to n: (iterate over features)
    w_i ← w_i + η \sum_j x_i^j \left( y^j - P(Y=1 \mid x^j, w) \right)
until "change" < ε
Loop over training examples!
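A hedged batch-version sketch of this gradient ascent for logistic regression, using the convention P(Y=1|x,w) = 1/(1 + exp(-(w0 + Σ_i w_i x_i))) from the expanded log-likelihood above; the learning rate and iteration count are illustrative.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression_ga(X, y, eta=0.1, iters=1000):
    # X: (m, n) feature matrix, y: (m,) binary labels in {0, 1}.
    m, n = X.shape
    w, w0 = np.zeros(n), 0.0
    for _ in range(iters):
        p = sigmoid(w0 + X @ w)        # P(Y = 1 | x^j, w) for every example
        err = y - p                    # y^j - P(Y = 1 | x^j, w)
        w += eta * X.T @ err / m       # gradient ascent on the log likelihood
        w0 += eta * err.mean()
    return w, w0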
Large parameters…
[Figure: plots of the sigmoid 1/(1 + e^{-ax}) for increasing slopes a (e.g. a = 1, 2, 10);
larger weights give a sharper transition from 0 to 1.]
• One common approach is to define priors on w
  – Normal distribution, zero mean, identity covariance
  – "Pushes" parameters towards zero
• Regularization
  – Helps avoid very large weights and overfitting
• MAP estimate

MAP as Regularization
• Add log p(w) to the objective:

  \ln p(w) \propto -\frac{\lambda}{2} \sum_i w_i^2, \qquad
  \frac{\partial \ln p(w)}{\partial w_i} = -\lambda\, w_i

  – Quadratic penalty: drives weights towards zero
  – Adds a negative linear term to the gradients

MLE vs. MAP
• Maximum conditional likelihood estimate
• Maximum conditional a posteriori estimate
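A sketch of the corresponding MAP update: the only change to the previous gradient-ascent sketch is the added negative linear (weight-decay) term, with lam an illustrative value.

import numpy as np

def logistic_regression_map(X, y, eta=0.1, lam=0.01, iters=1000):
    # Same as the MLE version, but with a zero-mean Gaussian prior on w:
    # the gradient gains a negative linear term -lam * w (weight decay).
    m, n = X.shape
    w, w0 = np.zeros(n), 0.0
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(w0 + X @ w)))
        err = y - p
        w += eta * (X.T @ err / m - lam * w)
        w0 += eta * err.mean()
    return w, w0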
Training a single neuron
[Figures, shown over successive iterations: the decision boundary in the (z1, z2) plane,
and the training objective vs. iteration for the original and the regularised objective.]
Logistic regression for discrete classification
Logistic regression in the more general case, where the set of possible values of Y is
{y_1, ..., y_R}:
• Define a weight vector w_k for each y_k, k = 1, ..., R-1
• for k < R:

  P(Y = k \mid X) \propto \exp\!\left(w_{k0} + \sum_i w_{ki} X_i\right)

• for k = R (normalization, so no weights for this class):

  P(Y = R \mid X) = 1 - \sum_{j=1}^{R-1} P(Y = j \mid X)

Assign the label y_k whose P(Y = y_k | X) is biggest.
Features can be discrete or continuous!
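A minimal sketch of these class probabilities, with the last class carrying no weights as on the slides; the weight values below are illustrative.

import numpy as np

def class_probs(x, W, b):
    # W: (R-1, n) weight vectors, b: (R-1,) biases; class R has no weights.
    scores = np.exp(b + W @ x)              # unnormalized exp(w_k0 + sum_i w_ki x_i)
    Z = 1.0 + scores.sum()                  # class R contributes exp(0) = 1
    return np.append(scores, 1.0) / Z       # P(Y = k | X) for k = 1..R

x = np.array([0.5, -1.0])
W = np.array([[1.0, 0.2], [-0.3, 0.8]])     # R = 3 classes, 2 features (illustrative)
b = np.array([0.1, -0.1])
print(class_probs(x, W, b), class_probs(x, W, b).sum())  # probabilities sum to 1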
Information theory
Entropy: for a discrete random variable x,

  H[x] = -\sum_x p(x) \log_2 p(x).

This important quantity is called the entropy of the random variable x. Note that
lim_{p→0} p ln p = 0, so we take p(x) ln p(x) = 0 whenever p(x) = 0. Low-probability
events x correspond to high information content; the average information transmitted
when a sender communicates the value of x to a receiver is the expectation of
-log p(x), which is exactly H[x]. Using logarithms to base 2, the units are bits
("binary digits").
H[x] is the expected number of bits needed to encode a randomly drawn value of X (under
the most efficient code). For discrete distributions, the maximum-entropy configuration
corresponds to an equal distribution of probabilities across the possible states.
For a continuous variable (or a vector x) the analogous quantity is the differential
entropy

  H[x] = -\int p(x) \ln p(x)\, dx.

The discrete and continuous forms differ by a term ln ∆ that diverges in the limit
∆ → 0, reflecting the fact that specifying a continuous variable very precisely requires
a large number of bits.
High, Low Entropy
• "High Entropy"
  – Y is from a uniform-like distribution
  – Flat histogram
  – Values sampled from it are less predictable
• "Low Entropy"
  – Y is from a varied (peaks and valleys) distribution
  – Histogram has many lows and highs
  – Values sampled from it are more predictable
[Figure: entropy of a coin flip as a function of the probability of heads.]
More uncertainty, more entropy!
Information Theory interpretation: H(Y) is the expected number of bits needed to encode
a randomly drawn value of Y (under the most efficient code).
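A minimal sketch of the entropy computation, applied to the coin-flip example.

import numpy as np

def entropy(p):
    # H = -sum_x p(x) log2 p(x), with 0 log 0 treated as 0.
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return -np.sum(nz * np.log2(nz))

print(entropy([0.5, 0.5]))   # fair coin: 1 bit
print(entropy([0.9, 0.1]))   # biased coin: less uncertainty, lower entropy
print(entropy([1.0, 0.0]))   # deterministic: 0 bits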
Relative entropy (Kullback-Leibler divergence):
Consider some unknown distribution p(x), and suppose that we have modelled it using an
approximating distribution q(x). If we use q(x) to construct a coding scheme for
transmitting values of x to a receiver, then the average additional amount of
information (in nats) required to specify the value of x (assuming we choose an
efficient coding scheme), as a result of using q(x) instead of the true distribution
p(x), is

  \mathrm{KL}(p \,\|\, q) = -\int p(x) \ln q(x)\, dx - \left( -\int p(x) \ln p(x)\, dx \right)
                          = -\int p(x) \ln \frac{q(x)}{p(x)}\, dx.   (1.113)

This is known as the relative entropy or Kullback-Leibler divergence, or KL divergence
(Kullback and Leibler, 1951), between the distributions p(x) and q(x). Note that it is
not a symmetrical quantity: KL(p∥q) ≢ KL(q∥p). The KL divergence satisfies KL(p∥q) ≥ 0,
with equality if and only if p(x) = q(x); this can be shown using the concept of convex
functions (a function f(x) is convex if every chord lies on or above the function, as
shown in Bishop, Figure 1.31).
Conditional entropy: for jointly distributed x and y,

  H[x, y] = H[y|x] + H[x],   (1.112)

where H[x, y] is the differential entropy of p(x, y) and H[x] is the differential
entropy of the marginal distribution p(x). Thus the information needed to describe x
and y is given by the information needed to describe x alone plus the additional
information required to specify y given x.
If we use a code built from q rather than from p, we necessarily have a less efficient
coding, and on average the additional information that must be transmitted is (at least)
equal to the Kullback-Leibler divergence between the two distributions.
Suppose that data is being generated from an unknown distribution p(x) that we wish to
model. We can try to approximate this distribution using some parametric distribution
q(x|θ), governed by a set of adjustable parameters θ, for example a multivariate
Gaussian. One way to determine θ is to minimize the Kullback-Leibler divergence between
p(x) and q(x|θ) with respect to θ. We cannot do this directly because we don't know
p(x). Suppose, however, that we have observed a finite set of training points x_n, for
n = 1, ..., N, drawn from p(x). Then the expectation with respect to p(x) can be
approximated by a finite sum over these points, so that

  \mathrm{KL}(p \,\|\, q) \simeq \sum_{n=1}^{N} \left\{ -\ln q(x_n|\theta) + \ln p(x_n) \right\}.   (1.119)

Since the second term does not depend on θ, minimizing this with respect to θ is
equivalent to maximizing the likelihood of the training data under q.
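A minimal sketch of the discrete KL divergence; the two toy distributions are made-up values for illustration.

import numpy as np

def kl_divergence(p, q):
    # KL(p || q) = sum_x p(x) * ln(p(x) / q(x)), for discrete distributions.
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = np.array([0.5, 0.3, 0.2])   # "true" distribution (illustrative)
q = np.array([0.4, 0.4, 0.2])   # model distribution (illustrative)
print(kl_divergence(p, q))       # >= 0, and 0 only when p equals q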