Bayesian Decision Theory
Jason Corso
SUNY at Buffalo
In the fish example, the state of nature ω is the true class of the fish on the conveyor belt:

ω = ω_1 for sea bass    (1)
ω = ω_2 for salmon    (2)
Preliminaries
Prior Probability
The a priori or prior probability reflects our knowledge of how likely a given state of nature is before we can actually observe it.
In the fish example, it is the probability that the next fish on the conveyor belt is a salmon or a sea bass.
Note: the prior may vary depending on the situation.
If we get equal numbers of salmon and sea bass in a catch, then the priors are equal, or uniform.
Depending on the season, we may get more salmon than sea bass, for example.
Preliminaries
For simplicity, let's assume that our features are all continuous values.
Denote a scalar feature as x and a vector feature as x. For a d-dimensional feature space, x ∈ ℝ^d.
Preliminaries
Class-Conditional Density, or Likelihood
The class-conditional probability density function is the density over the feature x given that the state of nature is ω:

p(x|ω)    (4)

[Figure 2.1 (DHS): class-conditional probability densities p(x|ω_i) for the two fish categories as a function of the feature value x.]
Preliminaries
Posterior Probability
Bayes Formula:

P(ω_j|x) = p(x, ω_j) / p(x)    (5)
         = p(x|ω_j) P(ω_j) / p(x)    (6)

where the evidence is

p(x) = Σ_{j=1}^{c} p(x|ω_j) P(ω_j)    (7)
Preliminaries
Posterior Probability
Notice that the likelihood and the prior govern the posterior; the evidence term p(x) is a scale factor that normalizes the density.
For the case P(ω_1) = 2/3 and P(ω_2) = 1/3, the posterior is shown below.

[Figure 2.2 (DHS): posterior probabilities P(ω_i|x) for the priors P(ω_1) = 2/3 and P(ω_2) = 1/3 and the class-conditional densities of Fig. 2.1.]
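To make Bayes formula concrete, here is a minimal Python sketch. The priors 2/3 and 1/3 come from the slide; the Gaussian class-conditional densities are hypothetical stand-ins for the curves in Fig. 2.1.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical class-conditional densities standing in for Fig. 2.1.
likelihoods = [norm(11.0, 1.0).pdf,   # p(x|w1)
               norm(13.0, 1.5).pdf]   # p(x|w2)
priors = np.array([2/3, 1/3])         # P(w1), P(w2), as on the slide

def posterior(x):
    """Bayes formula: P(wi|x) = p(x|wi) P(wi) / p(x)."""
    joint = np.array([f(x) for f in likelihoods]) * priors   # p(x|wi) P(wi)
    return joint / joint.sum()                               # normalize by evidence p(x)

print(posterior(12.0))   # two posteriors that sum to 1
```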
Decision Theory
Probability of Error
For a given x, the decision rule that minimizes the probability of error is to

decide ω_1 if P(ω_1|x) > P(ω_2|x); otherwise decide ω_2.    (8)

The probability of error for this rule is

P(error|x) = { P(ω_1|x)   if we decide ω_2
             { P(ω_2|x)   if we decide ω_1    (9)
Decision Theory
Probability of Error
Under this rule, the error at each x is as small as possible:

P(error|x) = min[ P(ω_1|x), P(ω_2|x) ]    (10)

and the average probability of error is

P(error) = ∫ P(error|x) p(x) dx    (11)
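A sketch of Eq. (11) evaluated numerically, reusing the hypothetical densities from the earlier sketch; a simple Riemann sum stands in for the integral.

```python
import numpy as np
from scipy.stats import norm

f1, f2 = norm(11.0, 1.0).pdf, norm(13.0, 1.5).pdf   # hypothetical p(x|w1), p(x|w2)
P1, P2 = 2/3, 1/3

xs = np.linspace(0.0, 25.0, 20001)
dx = xs[1] - xs[0]

# P(error|x) p(x) = min[P(w1|x), P(w2|x)] p(x) = min[p(x|w1) P(w1), p(x|w2) P(w2)]
integrand = np.minimum(f1(xs) * P1, f2(xs) * P2)
print("Bayes error ~", integrand.sum() * dx)
```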
Decision Theory
Loss Functions
A loss function states exactly how costly each action is.
As earlier, we have c classes {ω_1, ..., ω_c}.
We also have a set of a possible actions {α_1, ..., α_a}.
The loss function λ(α_i|ω_j) is the loss incurred for taking action α_i when the class is ω_j.
The zero-one loss function is a particularly common one:

λ(α_i|ω_j) = { 0   if i = j
             { 1   if i ≠ j ,    i, j = 1, 2, ..., c    (13)
Decision Theory
Expected Loss, a.k.a. Conditional Risk
We can consider the loss that would be incurred from taking each possible action in our set.
The expected loss or conditional risk is, by definition,

R(α_i|x) = Σ_{j=1}^{c} λ(α_i|ω_j) P(ω_j|x)    (14)

For the zero-one loss this becomes

R(α_i|x) = Σ_{j≠i} P(ω_j|x)    (15)
         = 1 − P(ω_i|x)    (16)
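A sketch of Eq. (14) with an asymmetric, hypothetical loss matrix; note how the minimum-risk action can differ from the maximum-posterior class.

```python
import numpy as np

# Hypothetical loss matrix: lam[i, j] = lambda(a_i | w_j).
lam = np.array([[0.0, 10.0],    # taking a1 when truth is w2 is very costly
                [1.0,  0.0]])   # taking a2 when truth is w1 is mildly costly

def conditional_risk(post):
    """Eq. (14): R(a_i|x) = sum_j lambda(a_i|w_j) P(w_j|x)."""
    return lam @ post

post = np.array([0.7, 0.3])           # example posteriors P(w_j|x)
R = conditional_risk(post)
print(R)                              # [3.0, 0.7]
print("take action", R.argmin() + 1)  # a2, despite w1 being more probable
```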
Decision Theory
Overall Risk
Let δ(x) denote a decision rule, a mapping from the input feature space to an action, ℝ^d ↦ {α_1, ..., α_a}.
This is what we want to learn.
The overall risk is the expected loss associated with a given decision rule:

R = ∫ R(δ(x)|x) p(x) dx    (17)

Clearly, we want the rule δ(·) that minimizes R(δ(x)|x) for all x.
Decision Theory
Bayes Risk: The Minimum Overall Risk
The Bayes decision rule gives us a method for minimizing the overall risk: select the action that minimizes the conditional risk,

α* = arg min_i R(α_i|x)    (18)
   = arg min_i Σ_{j=1}^{c} λ(α_i|ω_j) P(ω_j|x)    (19)
Decision Theory
The Two-Category Case
With two classes and two actions, and writing λ_ij = λ(α_i|ω_j), the two conditional risks are

R(α_1|x) = λ_11 P(ω_1|x) + λ_12 P(ω_2|x)    (20)
R(α_2|x) = λ_21 P(ω_1|x) + λ_22 P(ω_2|x)    (21)

The minimum-risk rule is to decide ω_1 if R(α_1|x) < R(α_2|x), i.e., if

(λ_21 − λ_11) P(ω_1|x) > (λ_12 − λ_22) P(ω_2|x)    (22)

or, applying Bayes formula,

(λ_21 − λ_11) p(x|ω_1) P(ω_1) > (λ_12 − λ_22) p(x|ω_2) P(ω_2)    (23)
Decision Theory
The Two-Category Case
Assuming λ_21 > λ_11 (an error costs more than a correct decision), this is equivalent to

p(x|ω_1) / p(x|ω_2) > θ    (24)

with threshold

θ = (λ_12 − λ_22) P(ω_2) / [ (λ_21 − λ_11) P(ω_1) ]    (25)

Thus, we can say the Bayes decision rule says to decide ω_1 if the likelihood ratio exceeds a threshold that is independent of the observation x.
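A sketch of the likelihood-ratio test, reusing the hypothetical densities and loss values from the sketches above; the threshold θ is computed once and never depends on x.

```python
from scipy.stats import norm

f1, f2 = norm(11.0, 1.0).pdf, norm(13.0, 1.5).pdf   # hypothetical p(x|w1), p(x|w2)
P1, P2 = 2/3, 1/3
l11, l12, l21, l22 = 0.0, 10.0, 1.0, 0.0            # hypothetical lambda(a_i|w_j)

theta = (l12 - l22) * P2 / ((l21 - l11) * P1)       # Eq. (25): independent of x

def decide(x):
    return "w1" if f1(x) / f2(x) > theta else "w2"  # Eq. (24)

print(theta, decide(10.0), decide(14.0))
```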
Discriminants
A discriminant-based classifier assigns x to class ω_i if

g_i(x) > g_j(x) for all j ≠ i ,    (26)

or, equivalently,

i* = arg max_i g_i(x) ;  decide ω_{i*}.
Discriminants
Discriminants as a Network
We can view the discriminant classifier as a network (for c classes and a d-dimensional input vector).

[Figure 2.5 (DHS): the functional structure of a general statistical pattern classifier, with d inputs x_1, ..., x_d feeding c discriminant functions g_1(x), ..., g_c(x); a subsequent step determines which of the discriminant values is the maximum and categorizes the input pattern accordingly.]
Discriminants
Bayes Discriminants: Minimum Conditional Risk Discriminant
In the general case, take the discriminant to be the negated conditional risk:

g_i(x) = −R(α_i|x)    (27)
       = −Σ_{j=1}^{c} λ(α_i|ω_j) P(ω_j|x)    (28)
Discriminants
Minimum Error-Rate Discriminant
For the zero-one loss, the minimum-risk discriminant is simply the posterior:

g_i(x) = P(ω_i|x)    (29)
Discriminants
Uniqueness of Discriminants
Is the choice of discriminant functions unique?
No!
We can multiply them by a positive constant.
We can shift them by an additive constant.
More generally, for any monotonically increasing function f(·), we can replace each g_i(x) by f(g_i(x)) without affecting our classification accuracy.
Such transformations can help with ease of understanding or computability.
The following all yield exactly the same classification results for minimum-error-rate classification:

g_i(x) = P(ω_i|x) = p(x|ω_i) P(ω_i) / Σ_j p(x|ω_j) P(ω_j)    (30)
g_i(x) = p(x|ω_i) P(ω_i)    (31)
g_i(x) = ln p(x|ω_i) + ln P(ω_i)    (32)
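A quick numerical check that Eqs. (30)-(32) agree, using hypothetical Gaussian likelihoods; in practice the log form (32) is preferable, since products of small densities underflow.

```python
import numpy as np
from scipy.stats import norm

f = [norm(11.0, 1.0), norm(13.0, 1.5)]   # hypothetical p(x|wi)
P = np.array([2/3, 1/3])

x = 12.0
g31 = np.array([fi.pdf(x) for fi in f]) * P                # Eq. (31)
g30 = g31 / g31.sum()                                      # Eq. (30), the posterior
g32 = np.array([fi.logpdf(x) for fi in f]) + np.log(P)     # Eq. (32)

# Different values, same argmax -> identical classifications.
assert g30.argmax() == g31.argmax() == g32.argmax()
print(g30, g31, g32)
```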
Discriminants
Visualizing Discriminants: Decision Regions
The effect of any decision rule is to divide the feature space into decision regions.
Denote the decision region for ω_i by R_i. One region, not necessarily connected, is created for each category, and assignment is according to:

If g_i(x) > g_j(x) for all j ≠ i, then x is in R_i.    (33)

Decision boundaries separate the regions; they are ties among the discriminant functions.
Discriminants
Visualizing Discriminants: Decision Regions

[Figure 2.6 (DHS): in this two-dimensional, two-category classifier, the densities p(x|ω_1)P(ω_1) and p(x|ω_2)P(ω_2) induce decision regions R_1 and R_2 (R_2 is not connected), separated by a decision boundary.]
Discriminants
Two-Category Discriminants: Dichotomizers
In the two-category case we can use a single discriminant function

g(x) ≡ g_1(x) − g_2(x)    (34)

and decide ω_1 if g(x) > 0, otherwise decide ω_2.    (35)

Two convenient forms are

g(x) = P(ω_1|x) − P(ω_2|x)    (36)
g(x) = ln [ p(x|ω_1) / p(x|ω_2) ] + ln [ P(ω_1) / P(ω_2) ]    (37)
Expectation
Recall the expectation of a function f(x) of a discrete variable x with mass function P(x):

E[f(x)] = Σ_x f(x) P(x)    (39)
The univariate normal (Gaussian) density is

p(x) = 1 / (√(2π) σ) · exp[ −(x − μ)² / (2σ²) ]

The mean is the expected value of x:

μ = E[x] = ∫ x p(x) dx    (40)

and the variance is the expected squared deviation:

σ² = E[(x − μ)²]    (41)
   = ∫ (x − μ)² p(x) dx    (42)
Samples from the normal density tend to cluster around the mean and to spread out based on the variance.

[Figure 2.7 (DHS): a univariate normal distribution has roughly 95% of its area in the range |x − μ| ≤ 2σ; the peak of the distribution has value p(μ) = 1/(√(2π) σ).]

The normal density is completely specified by the mean and the variance; these two are its sufficient statistics.
We thus abbreviate the equation for the normal density as

p(x) ~ N(μ, σ²)    (43)
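A simulation sketch of the "roughly 95% within 2σ" statement in Fig. 2.7, for an arbitrary choice of μ and σ:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 5.0, 2.0                          # arbitrary N(mu, sigma^2)
x = rng.normal(mu, sigma, size=1_000_000)

print(np.mean(np.abs(x - mu) <= 2 * sigma))   # ~0.954
```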
Entropy
The entropy of a distribution measures its uncertainty:

H(p) = −∫ p(x) ln p(x) dx    (44)

Among all densities with a given mean and variance, the normal density has the maximum entropy.
The Multivariate Normal Density
In d dimensions,

p(x) = 1 / ( (2π)^{d/2} |Σ|^{1/2} ) · exp[ −½ (x − μ)^T Σ^{−1} (x − μ) ]    (45)

with mean vector and covariance matrix

μ = E[x]    (46)
Σ = E[(x − μ)(x − μ)^T]    (47)
The covariance matrix

Σ = E[(x − μ)(x − μ)^T] = ∫ (x − μ)(x − μ)^T p(x) dx

is
Symmetric.
Positive semi-definite (but DHS only considers positive definite Σ, so that the determinant is strictly positive).
The diagonal elements σ_ii are the variances of the respective coordinates x_i.
The off-diagonal elements σ_ij are the covariances of x_i and x_j.
What does σ_ij = 0 imply? That coordinates x_i and x_j are statistically independent.
What does p(x) reduce to if all off-diagonal elements are 0? The product of the d univariate densities.
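A sketch that estimates Σ from samples of a made-up two-dimensional Gaussian; the diagonal recovers the variances and the off-diagonal the covariance, per the properties above.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])             # symmetric, positive definite
x = rng.multivariate_normal(mu, Sigma, size=200_000)

d = x - x.mean(axis=0)
Sigma_hat = d.T @ d / len(x)               # sample version of E[(x - mu)(x - mu)^T]
print(np.round(Sigma_hat, 2))              # ~ Sigma
```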
Mahalanobis Distance
The squared Mahalanobis distance from x to μ,

r² = (x − μ)^T Σ^{−1} (x − μ),

is the natural notion of distance under a Gaussian: loci of constant density are hyperellipsoids of constant Mahalanobis distance.

[Figure (DHS): samples from a two-dimensional Gaussian in the (x_1, x_2) plane; the constant-density ellipses are ellipses of constant Mahalanobis distance to μ.]
With the covariance matrix, we can calculate the dispersion of the data in any direction or in any subspace.

[Figure 2.8 (DHS): the action of a linear transformation on the feature space converts an arbitrary normal distribution N(μ, Σ) into another normal distribution; a transformation A takes it to N(A^T μ, A^T Σ A), the whitening transform A_w yields the circularly symmetric N(A_w^T μ, I), and P projects the distribution onto a line.]
[Figure 2.10 (DHS): if the covariance matrices for two distributions are equal and proportional to the identity matrix, then the distributions are spherical in d dimensions, and the boundary is a generalized hyperplane of d − 1 dimensions, perpendicular to the line separating the means. In these one-, two-, and three-dimensional examples, p(x|ω_i) and the boundaries are shown for the case P(ω_1) = P(ω_2) = .5; in the three-dimensional case, the grid plane separates R_1 from R_2.]
Simple Case: Σ_i = σ² I
In this case the discriminant reduces to

g_i(x) = −||x − μ_i||² / (2σ²) + ln P(ω_i)    (51)

i.e., the distance of the sample to the mean vector (for each i), normalized by the variance and offset by the prior.
Simple Case: Σ_i = σ² I
But we don't need to actually compute the distances. Expanding the quadratic form (x − μ_i)^T (x − μ_i) yields

g_i(x) = −1/(2σ²) [ x^T x − 2 μ_i^T x + μ_i^T μ_i ] + ln P(ω_i) .    (52)

The quadratic term x^T x is the same for all i and can thus be ignored. This yields the equivalent linear discriminant functions

g_i(x) = w_i^T x + w_i0    (53)
w_i = μ_i / σ²    (54)
w_i0 = −μ_i^T μ_i / (2σ²) + ln P(ω_i)    (55)
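A sketch of Eqs. (53)-(55) for hypothetical means and priors; the resulting classifier is just a dot product plus a bias per class.

```python
import numpy as np

def linear_discriminants(mus, sigma2, priors):
    """Eqs. (53)-(55) for the Sigma_i = sigma^2 I case."""
    mus = np.asarray(mus, dtype=float)
    w = mus / sigma2                                               # Eq. (54)
    w0 = -(mus * mus).sum(axis=1) / (2 * sigma2) + np.log(priors)  # Eq. (55)
    return lambda x: w @ x + w0                                    # Eq. (53), all classes at once

g = linear_discriminants([[0.0, 0.0], [3.0, 3.0]], sigma2=1.0, priors=[0.5, 0.5])
x = np.array([1.0, 1.0])
print(g(x).argmax())   # 0: x is closer to the first mean
```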
Simple Case: Σ_i = σ² I
Decision Boundary Equation
The boundary between regions R_i and R_j satisfies

w^T (x − x_0) = 0    (56)
w = μ_i − μ_j    (57)
x_0 = ½ (μ_i + μ_j) − [ σ² / ||μ_i − μ_j||² ] ln [ P(ω_i) / P(ω_j) ] (μ_i − μ_j)    (58)

These equations define a hyperplane through the point x_0 with a normal vector w.
Simple Case: Σ_i = σ² I
Decision Boundary Equation

[Figure 2.11 (DHS): as the priors change (e.g., P(ω_1)/P(ω_2) of .5/.5, .7/.3, .8/.2, .9/.1, .99/.01), the decision boundary shifts away from the more likely mean; for sufficiently disparate priors the boundary no longer lies between the two means.]
General Case: Arbitrary Σ_i
The discriminant functions are quadratic (the only term we can drop is the (d/2) ln 2π term):

g_i(x) = x^T W_i x + w_i^T x + w_i0    (59)
W_i = −½ Σ_i^{−1}    (60)
w_i = Σ_i^{−1} μ_i    (61)
w_i0 = −½ μ_i^T Σ_i^{−1} μ_i − ½ ln |Σ_i| + ln P(ω_i)    (62)
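A sketch of Eqs. (59)-(62) for two hypothetical classes with unequal covariances, which is exactly the situation that produces curved (hyperquadric) boundaries.

```python
import numpy as np

def quadratic_discriminant(mu, Sigma, prior):
    """Eqs. (59)-(62) for one class with arbitrary Sigma_i."""
    Sinv = np.linalg.inv(Sigma)
    W = -0.5 * Sinv                                          # Eq. (60)
    w = Sinv @ mu                                            # Eq. (61)
    w0 = (-0.5 * mu @ Sinv @ mu
          - 0.5 * np.log(np.linalg.det(Sigma))
          + np.log(prior))                                   # Eq. (62)
    return lambda x: x @ W @ x + w @ x + w0                  # Eq. (59)

g1 = quadratic_discriminant(np.zeros(2), np.eye(2), 0.5)
g2 = quadratic_discriminant(np.array([2.0, 0.0]), np.diag([0.25, 4.0]), 0.5)
x = np.array([1.0, 1.0])
print("decide w1" if g1(x) > g2(x) else "decide w2")
```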
[Figure 2.14 (DHS): arbitrary Gaussian distributions lead to Bayes decision boundaries that are general hyperquadrics; conversely, given any hyperquadric, one can find two Gaussian distributions whose Bayes decision boundary is that hyperquadric.]
[Figure (DHS): the decision regions R_1, ..., R_4 for four normal distributions. Even with such a low dimension, quite a complicated decision surface!]
Signal Detection Theory
Consider detecting a signal by thresholding an internal variable x at x*: under noise alone x ~ N(μ_1, σ²), and with the external signal present x ~ N(μ_2, σ²). The discriminability summarizes how separable the two situations are, independent of the threshold x*:

d′ = |μ_2 − μ_1| / σ    (63)
A Hit is the probability that the internal signal is above x* given that the external signal is present:

P(x > x*|x ∈ ω_2)    (64)

A False Alarm is the probability that the internal signal is above x* although no external signal is present:

P(x > x*|x ∈ ω_1)    (65)

A Miss is the probability that the internal signal is below x* although the external signal is present:

P(x < x*|x ∈ ω_2)    (66)

A Correct Rejection is the probability that the internal signal is below x* and no external signal is present:

P(x < x*|x ∈ ω_1)    (67)
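A sketch relating the threshold x* to these rates and to d′, for a made-up pair of unit-variance signal models; the last line shows that d′ can be recovered from the measured rates alone.

```python
from scipy.stats import norm

mu1, mu2, sigma = 0.0, 2.0, 1.0        # hypothetical noise / signal models
x_star = 1.0                           # hypothetical threshold

hit         = 1 - norm(mu2, sigma).cdf(x_star)   # Eq. (64): P(x > x*|x in w2)
false_alarm = 1 - norm(mu1, sigma).cdf(x_star)   # Eq. (65): P(x > x*|x in w1)

d_prime = abs(mu2 - mu1) / sigma                 # Eq. (63)
print(d_prime, norm.ppf(hit) - norm.ppf(false_alarm))   # both 2.0
```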
We can experimentally determine the rates, in particular the hit rate and the false-alarm rate.

[Figure 2.20 (DHS): the ROC curve plots the hit rate P(x > x*|x ∈ ω_2) against the false-alarm rate P(x > x*|x ∈ ω_1), with one curve per discriminability d′ = 0, 1, 2, 3; from measured hit and false-alarm rates we can deduce d′.]
Missing Features
Suppose we have trained a classifier on full feature vectors x, but at test time some features are missing: partition x into good (observed) features x_g and bad (missing) features x_b. How should we classify?
Missing Features
We marginalize the bad features out of the posterior:

P(ω_i|x_g) = p(ω_i, x_g) / p(x_g)    (68)
           = ∫ p(ω_i, x_g, x_b) dx_b / p(x_g)    (69)
           = ∫ p(ω_i|x) p(x) dx_b / p(x_g)    (70)
           = ∫ g_i(x) p(x) dx_b / ∫ p(x) dx_b    (71)
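A numerical sketch of Eqs. (69)/(71) for a hypothetical two-feature model where x_b is unobserved: we sum the joint over a grid of x_b values (the grid spacing cancels when we normalize).

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical model over x = (x_g, x_b).
classes = [multivariate_normal([0.0, 0.0], [[1.0, 0.5], [0.5, 1.0]]),
           multivariate_normal([2.0, 1.0], [[1.0, 0.0], [0.0, 1.0]])]
priors = np.array([0.5, 0.5])

x_g = 1.0
x_b = np.linspace(-10.0, 12.0, 2001)                 # integration grid for x_b
pts = np.column_stack([np.full_like(x_b, x_g), x_b])

# Numerator of Eq. (69): integral of p(w_i, x_g, x_b) dx_b for each class.
num = np.array([c.pdf(pts).sum() for c in classes]) * priors
print(num / num.sum())                               # P(w_i | x_g)
```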
Statistical Independence
Two variables x_i and x_j are independent if

p(x_i, x_j) = p(x_i) p(x_j)    (72)

[Figure (DHS): a three-dimensional scatter of samples over (x_1, x_2, x_3) illustrating (in)dependence among the coordinates.]
[Example (after Russell & Norvig): Toothache and Catch are each caused by Cavity, and are conditionally independent of one another given Cavity.]
Naive Bayes
If the d features are conditionally independent given the class, the class-conditional density factors into univariate terms:

p(x|ω_k) = Π_{i=1}^{d} p(x_i|ω_k)    (73)
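A sketch of a naive Bayes classifier built directly from Eq. (73), with hypothetical per-feature Gaussians; the product becomes a sum in log space.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical per-class, per-feature Gaussians (d = 3 independent features).
means = np.array([[0.0, 1.0, -1.0],    # class w1
                  [2.0, 0.0,  1.0]])   # class w2
stds = np.ones_like(means)
priors = np.array([0.5, 0.5])

def decide(x):
    """Eq. (73) in log space: log p(x|w_k) = sum_i log p(x_i|w_k)."""
    log_lik = norm(means, stds).logpdf(x).sum(axis=1)
    return (log_lik + np.log(priors)).argmax()

print(decide(np.array([1.5, 0.5, 0.5])))   # index of the chosen class
```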
Bayesian Belief Networks

[Figure (DHS): a belief network over nodes A through G; each root node carries a prior table (P(a), P(b)) and each descendant a conditional table given its parents, e.g. P(c|a), P(d|b), P(e|c), P(f|e), P(g|e).]
Bayesian Belief Networks
The classic burglary-alarm network (after Russell & Norvig):

P(B) = .001        P(E) = .002

Alarm:
  B  E  | P(A)
  T  T  | .95
  T  F  | .94
  F  T  | .29
  F  F  | .001

JohnCalls:        MaryCalls:
  A | P(J)          A | P(M)
  T | .90           T | .70
  F | .05           F | .01

Key: given knowledge of the values of some nodes in the network, we can apply Bayesian inference to determine the maximum posterior values of the unknown variables!
In a belief network, the joint distribution factors so that each variable is conditioned only on its parents 𝒫(x_i):

P(x_1, ..., x_n) = Π_{i=1}^{n} P(x_i | 𝒫(x_i))    (74)
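A sketch of Eq. (74) applied to the alarm network above, multiplying each variable's table entry given its parents; the query reproduces the textbook value for P(j, m, a, ¬b, ¬e).

```python
# Conditional probability tables from the alarm network above.
P_B, P_E = 0.001, 0.002
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(A=T | B, E)
P_J = {True: 0.90, False: 0.05}                      # P(J=T | A)
P_M = {True: 0.70, False: 0.01}                      # P(M=T | A)

def joint(b, e, a, j, m):
    """Eq. (74): product over nodes of P(x_i | parents(x_i))."""
    p = (P_B if b else 1 - P_B) * (P_E if e else 1 - P_E)
    p *= P_A[(b, e)] if a else 1 - P_A[(b, e)]
    p *= P_J[a] if j else 1 - P_J[a]
    p *= P_M[a] if m else 1 - P_M[a]
    return p

print(joint(False, False, True, True, True))   # ~0.00063
```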
[Figure 2.25 (DHS): a portion of a belief network, consisting of a node X, its parents (A and B), and its children (C and D).]

The parents 𝒫(X) are the nodes that X is conditioned on; the children of X, the set C(X), are the nodes conditioned on X.
Applying the Bayes rule to the values (x_1, x_2, ...) of such a portion, or more generally given any evidence e:

P(C(x), x, 𝒫(x)|e) = P(C(x)|x, e) P(x|𝒫(x), e) P(𝒫(x)|e)    (77)