
L4: Bayesian Decision Theory

Likelihood ratio test
Probability of error
Bayes risk
Bayes, MAP and ML criteria
Multi-class problems
Discriminant functions


Likelihood ratio test (LRT)


Assume we are to classify an object based on the evidence provided by a feature vector x
Would the following decision rule be reasonable?
  "Choose the class that is most probable given observation x"
More formally: evaluate the posterior probability of each class P(ωᵢ|x) and choose the class with the largest P(ωᵢ|x)

Let's examine this rule for a 2-class problem


In this case the decision rule becomes
  if P(ω₁|x) > P(ω₂|x) choose ω₁, else choose ω₂
Or, in a more compact form
  P(ω₁|x) ≷ P(ω₂|x)
where ≷ means "choose ω₁ if the left-hand side is larger, ω₂ otherwise" (this convention is used throughout)
Applying Bayes rule
  P(x|ω₁) P(ω₁) / P(x) ≷ P(x|ω₂) P(ω₂) / P(x)


Since P(x) does not affect the decision rule, it can be eliminated*
Rearranging the previous expression
  Λ(x) = P(x|ω₁) / P(x|ω₂) ≷ P(ω₂) / P(ω₁)
The term Λ(x) is called the likelihood ratio, and the decision rule is known as the likelihood ratio test

*P(x) can be disregarded in the decision rule since it is constant regardless of class ωᵢ. However, P(x) will be needed if we want to estimate the posterior P(ωᵢ|x), which, unlike P(x|ωᵢ)P(ωᵢ), is a true probability value and, therefore, gives us an estimate of the "goodness" of our decision
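As a concrete illustration of this rule, here is a minimal sketch that evaluates the posteriors via Bayes rule and applies the LRT for two 1-D Gaussian classes; the means, variances and priors are arbitrary choices for illustration, not values taken from the slides:

```python
import numpy as np
from scipy.stats import norm

# Illustrative class-conditional densities P(x|w1), P(x|w2) and priors (arbitrary values)
likelihoods = [norm(loc=-1.0, scale=1.0), norm(loc=2.0, scale=1.5)]
priors = np.array([0.6, 0.4])

def posterior(x):
    """P(wi|x) via Bayes rule; P(x) is the normalizing constant (sum of the joints)."""
    joint = np.array([lk.pdf(x) * p for lk, p in zip(likelihoods, priors)])
    return joint / joint.sum()

def classify_lrt(x):
    """Likelihood ratio test: choose w1 if Lambda(x) > P(w2)/P(w1), else w2."""
    lam = likelihoods[0].pdf(x) / likelihoods[1].pdf(x)
    return 1 if lam > priors[1] / priors[0] else 2

x0 = 0.5
print(posterior(x0), classify_lrt(x0))   # the LRT decision matches the argmax of the posterior
```

Both rules give the same decision: comparing posteriors and thresholding the likelihood ratio at P(ω₂)/P(ω₁) are the same test once P(x) is cancelled.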

Likelihood ratio test: an example


Problem
  Given the likelihoods below, derive a decision rule based on the LRT (assume equal priors)
    P(x|ω₁) = N(4, 1);  P(x|ω₂) = N(10, 1)

Solution
  Substituting into the LRT expression
    Λ(x) = [ (1/√(2π)) e^(−(x−4)²/2) ] / [ (1/√(2π)) e^(−(x−10)²/2) ] ≷ 1
  Simplifying the LRT expression
    Λ(x) = e^(−(x−4)²/2 + (x−10)²/2) ≷ 1
  Taking logs and expanding the squares
    −(x−4)² + (x−10)² ≷ 0, i.e., 84 − 12x ≷ 0
  Which yields
    choose ω₁ if x < 7 and ω₂ if x > 7
  This LRT result is intuitive, since the likelihoods differ only in their mean
  How would the LRT decision rule change if the priors were such that P(ω₁) = 2P(ω₂)?

[Figure: the likelihoods P(x|ω₁) and P(x|ω₂) centered at x = 4 and x = 10; the boundary at x = 7 splits the axis into R₁ (say ω₁) and R₂ (say ω₂)]
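A quick numerical check of this example is sketched below (scipy is an assumed dependency; the grid search over x is only for illustration):

```python
import numpy as np
from scipy.stats import norm

p1, p2 = norm(4, 1), norm(10, 1)     # P(x|w1) = N(4,1), P(x|w2) = N(10,1)
prior1, prior2 = 0.5, 0.5            # equal priors

# The decision boundary is where Lambda(x) equals P(w2)/P(w1)
x = np.linspace(0, 14, 100001)
lam = p1.pdf(x) / p2.pdf(x)
boundary = x[np.argmin(np.abs(lam - prior2 / prior1))]
print(boundary)                      # ~7.0, matching the analytical result

# With P(w1) = 2 P(w2), the log-LRT condition -6x + 42 > log(P(w2)/P(w1))
# moves the boundary to 7 - log(1/2)/6 ~ 7.12, enlarging the region assigned to w1
```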

Probability of error
The performance of any decision rule can be measured by its probability of error P[error]
Making use of the Theorem of total probability (L2):
  P[error] = Σ_{i=1..C} P[error|ωᵢ] P[ωᵢ]
The class-conditional error can be expressed as
  P[error|ωᵢ] = P[choose ωⱼ|ωᵢ] = ∫_{Rⱼ} P(x|ωᵢ) dx   (j ≠ i)

So, for our 2-class problem, P[error] becomes
  P[error] = P(ω₁) ∫_{R₂} P(x|ω₁) dx + P(ω₂) ∫_{R₁} P(x|ω₂) dx = P(ω₁) ε₁ + P(ω₂) ε₂
where εᵢ is the integral of P(x|ωᵢ) over the region Rⱼ where we choose ωⱼ

[Figure: the example likelihoods P(x|ω₁) and P(x|ω₂) with decision regions R₁ (say ω₁) and R₂ (say ω₂); the tail areas ε₁ and ε₂ fall on the wrong side of the boundary]

For the previous example, since we assumed equal priors, P[error] = (ε₁ + ε₂)/2
How would you compute P[error] numerically?
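One way to answer that question is sketched below (scipy assumed), using both the Gaussian CDF and a Monte Carlo estimate:

```python
import numpy as np
from scipy.stats import norm

p1, p2 = norm(4, 1), norm(10, 1)   # the example likelihoods
threshold = 7.0                    # LRT boundary for equal priors

# eps1 = mass of P(x|w1) falling in R2 (x > 7); eps2 = mass of P(x|w2) in R1 (x < 7)
eps1 = p1.sf(threshold)            # survival function, 1 - CDF
eps2 = p2.cdf(threshold)
print(0.5 * (eps1 + eps2))         # ~0.00135, the 3-sigma tail on each side

# Monte Carlo check: sample a class from the priors, then x from that class density
rng = np.random.default_rng(0)
n = 1_000_000
labels = rng.integers(1, 3, n)
x = np.where(labels == 1, rng.normal(4, 1, n), rng.normal(10, 1, n))
pred = np.where(x < threshold, 1, 2)
print(np.mean(pred != labels))     # close to the analytical value
```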

How good is the LRT decision rule?


To answer this question, it is convenient to express P[error] in terms of the posterior P[error|x]
  P[error] = ∫ P[error|x] P(x) dx

The optimal decision rule will minimize P[error|x] at every value of x in feature space, so that the integral above is minimized


At each x′, P[error|x′] is equal to P(ωᵢ|x′) when we choose ωⱼ (i ≠ j)


This is illustrated in the figure below
[Figure: posteriors P(ω₁|x) and P(ω₂|x), with decision regions R₁ and R₂ for the LRT rule and for an alternative (ALT) rule; at a sample point x′, P[error|x′] for the ALT rule is larger than P[error|x′] for the LRT rule]
From the figure it becomes clear that, for any value of x′, the LRT will always have a lower P[error|x′]
Therefore, when we integrate over the real line, the LRT decision rule will yield a lower P[error]
For any given problem, the minimum probability of error is achieved by the LRT decision rule; this probability of error is called the Bayes Error Rate and is the best any classifier can do.
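This can also be seen numerically for the running example: sweeping the threshold of a simple "say ω₁ below t, ω₂ above t" rule, the error is smallest exactly at the LRT boundary (a sketch, scipy assumed):

```python
import numpy as np
from scipy.stats import norm

p1, p2 = norm(4, 1), norm(10, 1)   # example likelihoods, equal priors

def p_error(t):
    # error of the rule "choose w1 if x < t, else w2"
    return 0.5 * (p1.sf(t) + p2.cdf(t))

ts = np.linspace(2, 12, 2001)
errors = np.array([p_error(t) for t in ts])
print(ts[np.argmin(errors)], errors.min())   # minimum at t ~ 7: the Bayes error rate
```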

Bayes risk
So far we have assumed that the penalty of misclassifying ω₁ as ω₂ is the same as that of the reciprocal error
In general, this is not the case; for example, misclassifying a cancer sufferer as a healthy patient is a much more serious problem than the other way around
This concept can be formalized in terms of a cost function Cᵢⱼ
  Cᵢⱼ represents the cost of choosing class ωᵢ when ωⱼ is the true class

We define the Bayes Risk as the expected value of the cost


  ℛ = E[C] = Σ_{i=1..2} Σ_{j=1..2} Cᵢⱼ P[decide ωᵢ and x ∈ ωⱼ] = Σ_{i=1..2} Σ_{j=1..2} Cᵢⱼ P[x ∈ Rᵢ|ωⱼ] P[ωⱼ]


What is the decision rule that minimizes the Bayes Risk?


First notice that
  P[x ∈ Rᵢ|ωⱼ] = ∫_{Rᵢ} P(x|ωⱼ) dx

We can express the Bayes Risk as
  ℛ = ∫_{R₁} [C₁₁ P(ω₁) P(x|ω₁) + C₁₂ P(ω₂) P(x|ω₂)] dx
    + ∫_{R₂} [C₂₁ P(ω₁) P(x|ω₁) + C₂₂ P(ω₂) P(x|ω₂)] dx

Then we note that, for either likelihood, one can write
  ∫_{R₁} P(x|ωᵢ) dx + ∫_{R₂} P(x|ωᵢ) dx = ∫ P(x|ωᵢ) dx = 1


Merging the last equation into the Bayes Risk expression yields
  ℛ = C₁₁ P(ω₁) ∫_{R₁} P(x|ω₁) dx + C₁₂ P(ω₂) ∫_{R₁} P(x|ω₂) dx
    + C₂₁ P(ω₁) [1 − ∫_{R₁} P(x|ω₁) dx] + C₂₂ P(ω₂) [1 − ∫_{R₁} P(x|ω₂) dx]

Now all the integrals over R₂ cancel out and we can collect terms
  ℛ = C₂₁ P(ω₁) + C₂₂ P(ω₂) + ∫_{R₁} [(C₁₂ − C₂₂) P(ω₂) P(x|ω₂) − (C₂₁ − C₁₁) P(ω₁) P(x|ω₁)] dx
where (C₁₂ − C₂₂) > 0 and (C₂₁ − C₁₁) > 0

The first two terms are constant w.r.t. R₁, so they can be ignored
Thus, we seek a decision region R₁ that minimizes
  ∫_{R₁} [(C₁₂ − C₂₂) P(ω₂) P(x|ω₂) − (C₂₁ − C₁₁) P(ω₁) P(x|ω₁)] dx = ∫_{R₁} g(x) dx

Let's forget about the actual expression of g(x) to develop some intuition about the kind of decision region R₁ we are looking for
Intuitively, we will select for R₁ those regions that minimize ∫_{R₁} g(x) dx; in other words, those regions where g(x) < 0
[Figure: a sample g(x); R₁ is the union of the intervals where g(x) < 0, i.e., R₁ = R₁A ∪ R₁B ∪ R₁C]

So we will choose R₁ such that (C₂₁ − C₁₁) P(ω₁) P(x|ω₁) > (C₁₂ − C₂₂) P(ω₂) P(x|ω₂)
And rearranging
  Λ(x) = P(x|ω₁) / P(x|ω₂) ≷ (C₁₂ − C₂₂) P(ω₂) / [(C₂₁ − C₁₁) P(ω₁)]
Therefore, minimization of the Bayes Risk also leads to an LRT
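The resulting cost-sensitive LRT is easy to express in code; the densities, priors and cost matrix below are illustrative placeholders, not values from the slides:

```python
import numpy as np
from scipy.stats import norm

def bayes_risk_lrt(x, p_x_w1, p_x_w2, prior1, prior2, C):
    """Choose w1 where the likelihood ratio exceeds
    (C12 - C22) P(w2) / ((C21 - C11) P(w1)); C[i][j] = cost of choosing class i+1 when j+1 is true."""
    lam = p_x_w1(x) / p_x_w2(x)
    threshold = (C[0][1] - C[1][1]) * prior2 / ((C[1][0] - C[0][0]) * prior1)
    return np.where(lam > threshold, 1, 2)

# Costs that penalize "say w2 when the truth is w1" five times more than the reverse
C = [[0.0, 1.0],
     [5.0, 0.0]]
x = np.linspace(-4, 6, 11)
print(bayes_risk_lrt(x, norm(0, 1).pdf, norm(2, 1).pdf, 0.5, 0.5, C))
```

Raising C₂₁ lowers the threshold and enlarges R₁, exactly as the expression predicts.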

The Bayes risk: an example


Consider a problem with likelihoods P(x|ω₁) = N(0, √3) and P(x|ω₂) = N(2, 1)
  Sketch the two densities
  What is the likelihood ratio?
  Assume P(ω₁) = P(ω₂), C₁₁ = C₂₂ = 0, C₁₂ = 1 and C₂₁ = √3
  Determine a decision rule that minimizes the Bayes risk

[Figure: the two densities, P(x|ω₁) broad and centered at 0, P(x|ω₂) narrow and centered at 2]

Solution
  The likelihood ratio is
    Λ(x) = P(x|ω₁) / P(x|ω₂) = [ (1/(√3 √(2π))) e^(−x²/6) ] / [ (1/√(2π)) e^(−(x−2)²/2) ] = (1/√3) e^(−x²/6 + (x−2)²/2)
  The Bayes-risk LRT threshold is (C₁₂ − C₂₂) P(ω₂) / [(C₂₁ − C₁₁) P(ω₁)] = 1/√3, so
    (1/√3) e^(−x²/6 + (x−2)²/2) ≷ 1/√3
  Taking logs and simplifying
    −x²/6 + (x−2)²/2 ≷ 0, i.e., 2x² − 12x + 12 ≷ 0
  whose roots are x = 3 ± √3 = 4.73, 1.27
  Therefore we choose ω₂ for x ∈ (1.27, 4.73) and ω₁ elsewhere

[Figure: the two densities with the decision regions R₁ (x < 1.27), R₂ (1.27 < x < 4.73) and R₁ (x > 4.73)]
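As a check on this example, the sketch below (scipy assumed) locates the decision boundaries as the zero crossings of the log likelihood-ratio test:

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

p1, p2 = norm(0, np.sqrt(3)), norm(2, 1)     # P(x|w1) = N(0, sqrt(3)), P(x|w2) = N(2, 1)
prior1 = prior2 = 0.5
C11, C12, C21, C22 = 0.0, 1.0, np.sqrt(3), 0.0

def h(x):
    # log Lambda(x) - log threshold; positive where we should choose w1
    thr = (C12 - C22) * prior2 / ((C21 - C11) * prior1)
    return np.log(p1.pdf(x) / p2.pdf(x)) - np.log(thr)

roots = [brentq(h, 0, 3), brentq(h, 3, 8)]   # brackets chosen around the expected roots
print(roots)                                  # ~[1.27, 4.73], i.e. 3 -/+ sqrt(3)
```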

LRT variations
Bayes criterion
This is the LRT that minimizes the Bayes risk
  Λ_Bayes(x) = P(x|ω₁) / P(x|ω₂) ≷ (C₁₂ − C₂₂) P(ω₂) / [(C₂₁ − C₁₁) P(ω₁)]

Maximum A Posteriori criterion


Sometimes we may simply be interested in minimizing P[error]
  This is a special case of the Bayes criterion that uses a zero-one cost function: Cᵢⱼ = 0 if i = j, 1 if i ≠ j
  It is known as the MAP criterion, since it seeks to maximize the posterior P(ωᵢ|x)
  Λ_MAP(x) = P(x|ω₁) / P(x|ω₂) ≷ P(ω₂) / P(ω₁), which is equivalent to P(ω₁|x) ≷ P(ω₂|x)

Maximum Likelihood criterion


For equal priors P(ωᵢ) = 1/2 and a 0/1 loss function, the LRT is known as the ML criterion, since it seeks to maximize the likelihood P(x|ωᵢ)
  Λ_ML(x) = P(x|ω₁) / P(x|ω₂) ≷ 1
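The three criteria differ only in the threshold applied to Λ(x), which the sketch below makes explicit (the densities, priors and costs are illustrative):

```python
import numpy as np
from scipy.stats import norm

p1, p2 = norm(0, 1).pdf, norm(2, 1).pdf   # illustrative likelihoods
P1, P2 = 0.7, 0.3                         # illustrative priors
C11, C12, C21, C22 = 0.0, 1.0, 4.0, 0.0   # illustrative costs

def decide(x, criterion):
    lam = p1(x) / p2(x)
    thresholds = {
        "bayes": (C12 - C22) * P2 / ((C21 - C11) * P1),
        "map":   P2 / P1,   # zero-one costs
        "ml":    1.0,       # zero-one costs and equal priors
    }
    return np.where(lam > thresholds[criterion], 1, 2)

x = np.linspace(-2, 4, 7)
for c in ("bayes", "map", "ml"):
    print(c, decide(x, c))
```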

Two more decision rules are commonly cited in the literature


The Neyman-Pearson Criterion, used in Detection and Estimation Theory, also leads to an LRT; it fixes one class error probability, say ε₁ < α, and seeks to minimize the other
For instance, for the sea-bass/salmon classification problem of L1, there may be some kind of government regulation saying that we must not misclassify more than 1% of salmon as sea bass
The Neyman-Pearson Criterion is very attractive since it does not require knowledge of priors and cost function

The Minimax Criterion, used in Game Theory, is derived from the Bayes criterion, and seeks to minimize the maximum Bayes Risk
The Minimax Criterion does not require knowledge of the priors, but it needs a cost function

For more information on these methods, refer to Detection, Estimation and Modulation Theory, by H.L. van Trees


Minimum P[error] for multi-class problems


Minimizing P[error] generalizes well for multiple classes
For clarity in the derivation, we express P[error] in terms of the probability of making a correct assignment
  P[error] = 1 − P[correct]
The probability of making a correct assignment is
  P[correct] = Σ_{i=1..C} P(ωᵢ) ∫_{Rᵢ} P(x|ωᵢ) dx

Minimizing P[error] is equivalent to maximizing P[correct], so expressing the latter in terms of posteriors
  P[correct] = Σ_{i=1..C} ∫_{Rᵢ} P(ωᵢ|x) P(x) dx
[Figure: posteriors P(ω₁|x), P(ω₂|x) and P(ω₃|x); each region Rᵢ covers the values of x where P(ωᵢ|x) is the largest]

To maximize P[correct], we must maximize each integral over Rᵢ, which we achieve by choosing the class with the largest posterior
So each Rᵢ is the region where P(ωᵢ|x) is maximum, and the decision rule that minimizes P[error] is the MAP criterion
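A minimal multi-class MAP sketch follows (three made-up 1-D Gaussian classes and illustrative priors):

```python
import numpy as np
from scipy.stats import norm

likelihoods = [norm(-2, 1), norm(0, 1), norm(3, 1.5)]   # P(x|wi), illustrative
priors = np.array([0.3, 0.5, 0.2])

def map_decision(x):
    """Choose the class with the largest posterior P(wi|x); P(x) cancels in the argmax."""
    x = np.atleast_1d(x)
    joint = np.stack([lk.pdf(x) * p for lk, p in zip(likelihoods, priors)])
    return np.argmax(joint, axis=0) + 1    # classes numbered 1..C

print(map_decision([-3.0, 0.5, 4.0]))      # each point is assigned its most probable class
```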

Minimum Bayes risk for multi-class problems


Minimizing the Bayes risk also generalizes well
As before, we use a slightly different formulation
  We denote by αᵢ the decision to choose class ωᵢ
  We denote by α(x) the overall decision rule that maps feature vectors x into decisions, α(x) ∈ {α₁, α₂, …, α_C}

The (conditional) risk of assigning x to class ωᵢ is
  ℛ(αᵢ|x) = Σ_{j=1..C} Cᵢⱼ P(ωⱼ|x)
And the Bayes Risk associated with decision rule α(x) is
  ℛ = ∫ ℛ(α(x)|x) P(x) dx
To minimize this expression, we must minimize the conditional risk at each x, which is equivalent to choosing the αᵢ for which ℛ(αᵢ|x) is minimum
[Figure: conditional risks ℛ(α₁|x), ℛ(α₂|x) and ℛ(α₃|x); each region Rᵢ covers the values of x where ℛ(αᵢ|x) is the smallest]
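The corresponding minimum-risk rule is sketched below, reusing the same illustrative densities and priors with a made-up cost matrix:

```python
import numpy as np
from scipy.stats import norm

likelihoods = [norm(-2, 1), norm(0, 1), norm(3, 1.5)]   # P(x|wj), illustrative
priors = np.array([0.3, 0.5, 0.2])
C = np.array([[0, 1, 2],                                # C[i, j]: cost of deciding class i+1
              [1, 0, 1],                                # when class j+1 is the true one
              [3, 1, 0]], dtype=float)

def min_risk_decision(x):
    """Choose the class whose conditional risk R(ai|x) = sum_j Cij P(wj|x) is smallest."""
    x = np.atleast_1d(x)
    joint = np.stack([lk.pdf(x) * p for lk, p in zip(likelihoods, priors)])   # shape (C, N)
    post = joint / joint.sum(axis=0)                                          # P(wj|x)
    risk = C @ post                                                           # R(ai|x)
    return np.argmin(risk, axis=0) + 1

print(min_risk_decision([-3.0, 0.5, 4.0]))
```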


Discriminant functions
All the decision rules shown in L4 have the same structure
At each point x in feature space, choose the class ωᵢ that maximizes (or minimizes) some measure gᵢ(x)
This structure can be formalized with a set of discriminant functions gᵢ(x), i = 1..C, and the decision rule "assign x to class ωᵢ if gᵢ(x) > gⱼ(x) for all j ≠ i"
Therefore, we can visualize the decision rule as a network that computes C discriminant functions and selects the class with the highest discriminant
The three decision rules can then be summarized with the following discriminant functions
  Criterion   Discriminant function
  Bayes       gᵢ(x) = −ℛ(αᵢ|x)
  MAP         gᵢ(x) = P(ωᵢ|x)
  ML          gᵢ(x) = P(x|ωᵢ)

[Figure: a discriminant-function network; the features x₁ … x_d feed the discriminant functions g₁(x) … g_C(x) (which may incorporate the costs), and the class with the maximum gᵢ(x) is selected]
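One way to read the table above in code is sketched here, reusing the illustrative densities, priors and costs from the multi-class sketches:

```python
import numpy as np
from scipy.stats import norm

likelihoods = [norm(-2, 1), norm(0, 1), norm(3, 1.5)]          # P(x|wi), illustrative
priors = np.array([0.3, 0.5, 0.2])
C = np.array([[0, 1, 2], [1, 0, 1], [3, 1, 0]], dtype=float)   # illustrative costs

def discriminants(x, criterion):
    """Return g_i(x), i = 1..C, for the Bayes, MAP or ML criterion."""
    x = np.atleast_1d(x)
    lik = np.stack([lk.pdf(x) for lk in likelihoods])    # P(x|wi)
    post = lik * priors[:, None]
    post = post / post.sum(axis=0)                       # P(wi|x)
    if criterion == "ml":
        return lik                                       # g_i(x) = P(x|wi)
    if criterion == "map":
        return post                                      # g_i(x) = P(wi|x)
    return -(C @ post)                                   # g_i(x) = -R(ai|x)

x = np.array([-3.0, 0.5, 4.0])
for c in ("bayes", "map", "ml"):
    print(c, np.argmax(discriminants(x, c), axis=0) + 1)  # select the class with the max g_i(x)
```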

