
L4: Bayesian Decision Theory

Likelihood ratio test
Probability of error
Bayes risk
Bayes, MAP and ML criteria
Multi-class problems
Discriminant functions


Likelihood ratio test (LRT)


Assume we are to classify an object based on the evidence provided by a feature vector x
Would the following decision rule be reasonable?
  "Choose the class that is most probable given observation x"
More formally: evaluate the posterior probability of each class P(ωᵢ|x) and choose the class with the largest P(ωᵢ|x)

Let's examine this rule for a 2-class problem


In this case the decision rule becomes
  if P(ω₁|x) > P(ω₂|x) choose ω₁, else choose ω₂
Or, in a more compact form
  P(ω₁|x) ≷ P(ω₂|x)
where ≷ means "choose ω₁ if the left-hand side is larger, ω₂ otherwise" (this convention is used throughout)
Applying Bayes rule
  P(x|ω₁) P(ω₁) / P(x) ≷ P(x|ω₂) P(ω₂) / P(x)


Since P(x) does not affect the decision rule, it can be eliminated*
Rearranging the previous expression
  Λ(x) = P(x|ω₁) / P(x|ω₂) ≷ P(ω₂) / P(ω₁)
The term Λ(x) is called the likelihood ratio, and the decision rule is known as the likelihood ratio test

*P(x) can be disregarded in the decision rule since it is constant regardless of class ωᵢ. However, P(x) will be needed if we want to estimate the posterior P(ωᵢ|x), which, unlike P(x|ωᵢ)P(ωᵢ), is a true probability value and, therefore, gives us an estimate of the "goodness" of our decision
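As a concrete illustration of this rule, here is a minimal sketch that evaluates the posteriors via Bayes rule and applies the LRT for two 1-D Gaussian classes; the means, variances and priors are arbitrary choices for illustration, not values taken from the slides:

```python
import numpy as np
from scipy.stats import norm

# Illustrative class-conditional densities P(x|w1), P(x|w2) and priors (arbitrary values)
likelihoods = [norm(loc=-1.0, scale=1.0), norm(loc=2.0, scale=1.5)]
priors = np.array([0.6, 0.4])

def posterior(x):
    """P(wi|x) via Bayes rule; P(x) is the normalizing constant (sum of the joints)."""
    joint = np.array([lk.pdf(x) * p for lk, p in zip(likelihoods, priors)])
    return joint / joint.sum()

def classify_lrt(x):
    """Likelihood ratio test: choose w1 if Lambda(x) > P(w2)/P(w1), else w2."""
    lam = likelihoods[0].pdf(x) / likelihoods[1].pdf(x)
    return 1 if lam > priors[1] / priors[0] else 2

x0 = 0.5
print(posterior(x0), classify_lrt(x0))   # the LRT decision matches the argmax of the posterior
```

Both rules give the same decision: comparing posteriors and thresholding the likelihood ratio at P(ω₂)/P(ω₁) are the same test once P(x) is cancelled.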

Likelihood ratio test: an example


Problem
  Given the likelihoods below, derive a decision rule based on the LRT (assume equal priors)
    P(x|ω₁) = N(4, 1);  P(x|ω₂) = N(10, 1)

Solution
  Substituting into the LRT expression
    Λ(x) = [ (1/√(2π)) e^(−(x−4)²/2) ] / [ (1/√(2π)) e^(−(x−10)²/2) ] ≷ 1
  Simplifying the LRT expression
    Λ(x) = e^(−(x−4)²/2 + (x−10)²/2) ≷ 1
  Taking logs and expanding the squares
    −(x−4)² + (x−10)² ≷ 0, i.e., 84 − 12x ≷ 0
  Which yields
    choose ω₁ if x < 7 and ω₂ if x > 7
  This LRT result is intuitive, since the likelihoods differ only in their mean
  How would the LRT decision rule change if the priors were such that P(ω₁) = 2P(ω₂)?

[Figure: the likelihoods P(x|ω₁) and P(x|ω₂) centered at x = 4 and x = 10; the boundary at x = 7 splits the axis into R₁ (say ω₁) and R₂ (say ω₂)]
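A quick numerical check of this example is sketched below (scipy is an assumed dependency; the grid search over x is only for illustration):

```python
import numpy as np
from scipy.stats import norm

p1, p2 = norm(4, 1), norm(10, 1)     # P(x|w1) = N(4,1), P(x|w2) = N(10,1)
prior1, prior2 = 0.5, 0.5            # equal priors

# The decision boundary is where Lambda(x) equals P(w2)/P(w1)
x = np.linspace(0, 14, 100001)
lam = p1.pdf(x) / p2.pdf(x)
boundary = x[np.argmin(np.abs(lam - prior2 / prior1))]
print(boundary)                      # ~7.0, matching the analytical result

# With P(w1) = 2 P(w2), the log-LRT condition -6x + 42 > log(P(w2)/P(w1))
# moves the boundary to 7 - log(1/2)/6 ~ 7.12, enlarging the region assigned to w1
```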

Probability of error
The performance of any decision rule can be measured by its probability of error P[error]
Making use of the Theorem of total probability (L2):
  P[error] = Σ_{i=1..C} P[error|ωᵢ] P[ωᵢ]
The class-conditional error can be expressed as
  P[error|ωᵢ] = P[choose ωⱼ|ωᵢ] = ∫_{Rⱼ} P(x|ωᵢ) dx   (j ≠ i)

So, for our 2-class problem, P[error] becomes
  P[error] = P(ω₁) ∫_{R₂} P(x|ω₁) dx + P(ω₂) ∫_{R₁} P(x|ω₂) dx = P(ω₁) ε₁ + P(ω₂) ε₂
where εᵢ is the integral of P(x|ωᵢ) over the region Rⱼ where we choose ωⱼ

[Figure: the example likelihoods P(x|ω₁) and P(x|ω₂) with decision regions R₁ (say ω₁) and R₂ (say ω₂); the tail areas ε₁ and ε₂ fall on the wrong side of the boundary]

For the previous example, since we assumed equal priors, P[error] = (ε₁ + ε₂)/2
How would you compute P[error] numerically?
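One way to answer that question is sketched below (scipy assumed), using both the Gaussian CDF and a Monte Carlo estimate:

```python
import numpy as np
from scipy.stats import norm

p1, p2 = norm(4, 1), norm(10, 1)   # the example likelihoods
threshold = 7.0                    # LRT boundary for equal priors

# eps1 = mass of P(x|w1) falling in R2 (x > 7); eps2 = mass of P(x|w2) in R1 (x < 7)
eps1 = p1.sf(threshold)            # survival function, 1 - CDF
eps2 = p2.cdf(threshold)
print(0.5 * (eps1 + eps2))         # ~0.00135, the 3-sigma tail on each side

# Monte Carlo check: sample a class from the priors, then x from that class density
rng = np.random.default_rng(0)
n = 1_000_000
labels = rng.integers(1, 3, n)
x = np.where(labels == 1, rng.normal(4, 1, n), rng.normal(10, 1, n))
pred = np.where(x < threshold, 1, 2)
print(np.mean(pred != labels))     # close to the analytical value
```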

How good is the LRT decision rule?


To answer this question, it is convenient to express P[error] in terms of the posterior P[error|x]
  P[error] = ∫ P[error|x] P(x) dx

The optimal decision rule will minimize P[error|x] at every value of x in feature space, so that the integral above is minimized


At each x′, P[error|x′] is equal to P(ωᵢ|x′) when we choose ωⱼ (i ≠ j)


This is illustrated in the figure below
[Figure: posteriors P(ω₁|x) and P(ω₂|x), with decision regions R₁ and R₂ for the LRT rule and for an alternative (ALT) rule; at a sample point x′, P[error|x′] for the ALT rule is larger than P[error|x′] for the LRT rule]
From the figure it becomes clear that, for any value of x′, the LRT will always have a lower P[error|x′]
Therefore, when we integrate over the real line, the LRT decision rule will yield a lower P[error]
For any given problem, the minimum probability of error is achieved by the LRT decision rule; this probability of error is called the Bayes Error Rate and is the best any classifier can do.
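This can also be seen numerically for the running example: sweeping the threshold of a simple "say ω₁ below t, ω₂ above t" rule, the error is smallest exactly at the LRT boundary (a sketch, scipy assumed):

```python
import numpy as np
from scipy.stats import norm

p1, p2 = norm(4, 1), norm(10, 1)   # example likelihoods, equal priors

def p_error(t):
    # error of the rule "choose w1 if x < t, else w2"
    return 0.5 * (p1.sf(t) + p2.cdf(t))

ts = np.linspace(2, 12, 2001)
errors = np.array([p_error(t) for t in ts])
print(ts[np.argmin(errors)], errors.min())   # minimum at t ~ 7: the Bayes error rate
```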

Bayes risk
So far we have assumed that the penalty of misclassifying ω₁ as ω₂ is the same as that of the reciprocal error
In general, this is not the case; for example, misclassifying a cancer sufferer as a healthy patient is a much more serious problem than the other way around
This concept can be formalized in terms of a cost function Cᵢⱼ
  Cᵢⱼ represents the cost of choosing class ωᵢ when ωⱼ is the true class

We define the Bayes Risk as the expected value of the cost


  ℛ = E[C] = Σ_{i=1..2} Σ_{j=1..2} Cᵢⱼ P[decide ωᵢ and x ∈ ωⱼ] = Σ_{i=1..2} Σ_{j=1..2} Cᵢⱼ P[x ∈ Rᵢ|ωⱼ] P[ωⱼ]


What is the decision rule that minimizes the Bayes Risk?


First notice that
  P[x ∈ Rᵢ|ωⱼ] = ∫_{Rᵢ} P(x|ωⱼ) dx

We can express the Bayes Risk as
  ℛ = ∫_{R₁} [C₁₁ P(ω₁) P(x|ω₁) + C₁₂ P(ω₂) P(x|ω₂)] dx
    + ∫_{R₂} [C₂₁ P(ω₁) P(x|ω₁) + C₂₂ P(ω₂) P(x|ω₂)] dx

Then we note that, for either likelihood, one can write
  ∫_{R₁} P(x|ωᵢ) dx + ∫_{R₂} P(x|ωᵢ) dx = ∫ P(x|ωᵢ) dx = 1


Merging the last equation into the Bayes Risk expression yields
  ℛ = C₁₁ P(ω₁) ∫_{R₁} P(x|ω₁) dx + C₁₂ P(ω₂) ∫_{R₁} P(x|ω₂) dx
    + C₂₁ P(ω₁) [1 − ∫_{R₁} P(x|ω₁) dx] + C₂₂ P(ω₂) [1 − ∫_{R₁} P(x|ω₂) dx]

Now all the integrals over R₂ cancel out and we can collect terms
  ℛ = C₂₁ P(ω₁) + C₂₂ P(ω₂) + ∫_{R₁} [(C₁₂ − C₂₂) P(ω₂) P(x|ω₂) − (C₂₁ − C₁₁) P(ω₁) P(x|ω₁)] dx
where (C₁₂ − C₂₂) > 0 and (C₂₁ − C₁₁) > 0

The first two terms are constant w.r.t. R₁, so they can be ignored
Thus, we seek a decision region R₁ that minimizes
  ∫_{R₁} [(C₁₂ − C₂₂) P(ω₂) P(x|ω₂) − (C₂₁ − C₁₁) P(ω₁) P(x|ω₁)] dx = ∫_{R₁} g(x) dx

Let's forget about the actual expression of g(x) to develop some intuition about the kind of decision region R₁ we are looking for
Intuitively, we will select for R₁ those regions that minimize ∫_{R₁} g(x) dx; in other words, those regions where g(x) < 0
[Figure: a sample g(x); R₁ is the union of the intervals where g(x) < 0, i.e., R₁ = R₁A ∪ R₁B ∪ R₁C]

So we will choose R₁ such that (C₂₁ − C₁₁) P(ω₁) P(x|ω₁) > (C₁₂ − C₂₂) P(ω₂) P(x|ω₂)
And rearranging
  Λ(x) = P(x|ω₁) / P(x|ω₂) ≷ (C₁₂ − C₂₂) P(ω₂) / [(C₂₁ − C₁₁) P(ω₁)]
Therefore, minimization of the Bayes Risk also leads to an LRT
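The resulting cost-sensitive LRT is easy to express in code; the densities, priors and cost matrix below are illustrative placeholders, not values from the slides:

```python
import numpy as np
from scipy.stats import norm

def bayes_risk_lrt(x, p_x_w1, p_x_w2, prior1, prior2, C):
    """Choose w1 where the likelihood ratio exceeds
    (C12 - C22) P(w2) / ((C21 - C11) P(w1)); C[i][j] = cost of choosing class i+1 when j+1 is true."""
    lam = p_x_w1(x) / p_x_w2(x)
    threshold = (C[0][1] - C[1][1]) * prior2 / ((C[1][0] - C[0][0]) * prior1)
    return np.where(lam > threshold, 1, 2)

# Costs that penalize "say w2 when the truth is w1" five times more than the reverse
C = [[0.0, 1.0],
     [5.0, 0.0]]
x = np.linspace(-4, 6, 11)
print(bayes_risk_lrt(x, norm(0, 1).pdf, norm(2, 1).pdf, 0.5, 0.5, C))
```

Raising C₂₁ lowers the threshold and enlarges R₁, exactly as the expression predicts.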

The Bayes risk: an example


Consider a problem with likelihoods P(x|ω₁) = N(0, √3) and P(x|ω₂) = N(2, 1)
  Sketch the two densities
  What is the likelihood ratio?
  Assume P(ω₁) = P(ω₂), C₁₁ = C₂₂ = 0, C₁₂ = 1 and C₂₁ = √3
  Determine a decision rule that minimizes the Bayes risk

[Figure: the two densities, P(x|ω₁) broad and centered at 0, P(x|ω₂) narrow and centered at 2]

Solution
  The likelihood ratio is
    Λ(x) = P(x|ω₁) / P(x|ω₂) = [ (1/(√3 √(2π))) e^(−x²/6) ] / [ (1/√(2π)) e^(−(x−2)²/2) ] = (1/√3) e^(−x²/6 + (x−2)²/2)
  The Bayes-risk LRT threshold is (C₁₂ − C₂₂) P(ω₂) / [(C₂₁ − C₁₁) P(ω₁)] = 1/√3, so
    (1/√3) e^(−x²/6 + (x−2)²/2) ≷ 1/√3
  Taking logs and simplifying
    −x²/6 + (x−2)²/2 ≷ 0, i.e., 2x² − 12x + 12 ≷ 0
  whose roots are x = 3 ± √3 = 4.73, 1.27
  Therefore we choose ω₂ for x ∈ (1.27, 4.73) and ω₁ elsewhere

[Figure: the two densities with the decision regions R₁ (x < 1.27), R₂ (1.27 < x < 4.73) and R₁ (x > 4.73)]
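As a check on this example, the sketch below (scipy assumed) locates the decision boundaries as the zero crossings of the log likelihood-ratio test:

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

p1, p2 = norm(0, np.sqrt(3)), norm(2, 1)     # P(x|w1) = N(0, sqrt(3)), P(x|w2) = N(2, 1)
prior1 = prior2 = 0.5
C11, C12, C21, C22 = 0.0, 1.0, np.sqrt(3), 0.0

def h(x):
    # log Lambda(x) - log threshold; positive where we should choose w1
    thr = (C12 - C22) * prior2 / ((C21 - C11) * prior1)
    return np.log(p1.pdf(x) / p2.pdf(x)) - np.log(thr)

roots = [brentq(h, 0, 3), brentq(h, 3, 8)]   # brackets chosen around the expected roots
print(roots)                                  # ~[1.27, 4.73], i.e. 3 -/+ sqrt(3)
```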

LRT variations
Bayes criterion
This is the LRT that minimizes the Bayes risk
  Λ_Bayes(x) = P(x|ω₁) / P(x|ω₂) ≷ (C₁₂ − C₂₂) P(ω₂) / [(C₂₁ − C₁₁) P(ω₁)]

Maximum A Posteriori criterion


Sometimes we may simply be interested in minimizing P[error]
  This is a special case of the Bayes criterion that uses a zero-one cost function: Cᵢⱼ = 0 if i = j, 1 if i ≠ j
  It is known as the MAP criterion, since it seeks to maximize the posterior P(ωᵢ|x)
  Λ_MAP(x) = P(x|ω₁) / P(x|ω₂) ≷ P(ω₂) / P(ω₁), which is equivalent to P(ω₁|x) ≷ P(ω₂|x)

Maximum Likelihood criterion


For equal priors P(ωᵢ) = 1/2 and a 0/1 loss function, the LRT is known as the ML criterion, since it seeks to maximize the likelihood P(x|ωᵢ)
  Λ_ML(x) = P(x|ω₁) / P(x|ω₂) ≷ 1
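The three criteria differ only in the threshold applied to Λ(x), which the sketch below makes explicit (the densities, priors and costs are illustrative):

```python
import numpy as np
from scipy.stats import norm

p1, p2 = norm(0, 1).pdf, norm(2, 1).pdf   # illustrative likelihoods
P1, P2 = 0.7, 0.3                         # illustrative priors
C11, C12, C21, C22 = 0.0, 1.0, 4.0, 0.0   # illustrative costs

def decide(x, criterion):
    lam = p1(x) / p2(x)
    thresholds = {
        "bayes": (C12 - C22) * P2 / ((C21 - C11) * P1),
        "map":   P2 / P1,   # zero-one costs
        "ml":    1.0,       # zero-one costs and equal priors
    }
    return np.where(lam > thresholds[criterion], 1, 2)

x = np.linspace(-2, 4, 7)
for c in ("bayes", "map", "ml"):
    print(c, decide(x, c))
```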

Two more decision rules are commonly cited in the literature


The Neyman-Pearson Criterion, used in Detection and Estimation Theory, also leads to an LRT; it fixes one class error probability, say ε₁ < α, and seeks to minimize the other
For instance, for the sea-bass/salmon classification problem of L1, there may be some kind of government regulation saying that we must not misclassify more than 1% of salmon as sea bass
The Neyman-Pearson Criterion is very attractive since it does not require knowledge of priors and cost function

The Minimax Criterion, used in Game Theory, is derived from the Bayes criterion, and seeks to minimize the maximum Bayes Risk
The Minimax Criterion does not require knowledge of the priors, but it needs a cost function

For more information on these methods, refer to Detection, Estimation and Modulation Theory, by H.L. van Trees


Minimum P[error] for multi-class problems


Minimizing P[error] generalizes well for multiple classes
For clarity in the derivation, we express P[error] in terms of the probability of making a correct assignment
  P[error] = 1 − P[correct]
The probability of making a correct assignment is
  P[correct] = Σ_{i=1..C} P(ωᵢ) ∫_{Rᵢ} P(x|ωᵢ) dx

Minimizing P[error] is equivalent to maximizing P[correct], so expressing the latter in terms of posteriors
  P[correct] = Σ_{i=1..C} ∫_{Rᵢ} P(ωᵢ|x) P(x) dx
[Figure: posteriors P(ω₁|x), P(ω₂|x) and P(ω₃|x); each region Rᵢ covers the values of x where P(ωᵢ|x) is the largest]

To maximize P[correct], we must maximize each integral over Rᵢ, which we achieve by choosing the class with the largest posterior
So each Rᵢ is the region where P(ωᵢ|x) is maximum, and the decision rule that minimizes P[error] is the MAP criterion
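A minimal multi-class MAP sketch follows (three made-up 1-D Gaussian classes and illustrative priors):

```python
import numpy as np
from scipy.stats import norm

likelihoods = [norm(-2, 1), norm(0, 1), norm(3, 1.5)]   # P(x|wi), illustrative
priors = np.array([0.3, 0.5, 0.2])

def map_decision(x):
    """Choose the class with the largest posterior P(wi|x); P(x) cancels in the argmax."""
    x = np.atleast_1d(x)
    joint = np.stack([lk.pdf(x) * p for lk, p in zip(likelihoods, priors)])
    return np.argmax(joint, axis=0) + 1    # classes numbered 1..C

print(map_decision([-3.0, 0.5, 4.0]))      # each point is assigned its most probable class
```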

Minimum Bayes risk for multi-class problems


Minimizing the Bayes risk also generalizes well
As before, we use a slightly different formulation
  We denote by αᵢ the decision to choose class ωᵢ
  We denote by α(x) the overall decision rule that maps feature vectors x into decisions, α(x) ∈ {α₁, α₂, …, α_C}

The (conditional) risk of assigning x to class ωᵢ is
  ℛ(αᵢ|x) = Σ_{j=1..C} Cᵢⱼ P(ωⱼ|x)
And the Bayes Risk associated with decision rule α(x) is
  ℛ = ∫ ℛ(α(x)|x) P(x) dx
To minimize this expression, we must minimize the conditional risk at each x, which is equivalent to choosing the αᵢ for which ℛ(αᵢ|x) is minimum
[Figure: conditional risks ℛ(α₁|x), ℛ(α₂|x) and ℛ(α₃|x); each region Rᵢ covers the values of x where ℛ(αᵢ|x) is the smallest]
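The corresponding minimum-risk rule is sketched below, reusing the same illustrative densities and priors with a made-up cost matrix:

```python
import numpy as np
from scipy.stats import norm

likelihoods = [norm(-2, 1), norm(0, 1), norm(3, 1.5)]   # P(x|wj), illustrative
priors = np.array([0.3, 0.5, 0.2])
C = np.array([[0, 1, 2],                                # C[i, j]: cost of deciding class i+1
              [1, 0, 1],                                # when class j+1 is the true one
              [3, 1, 0]], dtype=float)

def min_risk_decision(x):
    """Choose the class whose conditional risk R(ai|x) = sum_j Cij P(wj|x) is smallest."""
    x = np.atleast_1d(x)
    joint = np.stack([lk.pdf(x) * p for lk, p in zip(likelihoods, priors)])   # shape (C, N)
    post = joint / joint.sum(axis=0)                                          # P(wj|x)
    risk = C @ post                                                           # R(ai|x)
    return np.argmin(risk, axis=0) + 1

print(min_risk_decision([-3.0, 0.5, 4.0]))
```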


Discriminant functions
All the decision rules shown in L4 have the same structure
At each point x in feature space, choose the class ωᵢ that maximizes (or minimizes) some measure gᵢ(x)
This structure can be formalized with a set of discriminant functions gᵢ(x), i = 1..C, and the decision rule "assign x to class ωᵢ if gᵢ(x) > gⱼ(x) for all j ≠ i"
Therefore, we can visualize the decision rule as a network that computes C discriminant functions and selects the class with the highest discriminant
The three decision rules can then be summarized with the following discriminant functions
  Criterion   Discriminant function
  Bayes       gᵢ(x) = −ℛ(αᵢ|x)
  MAP         gᵢ(x) = P(ωᵢ|x)
  ML          gᵢ(x) = P(x|ωᵢ)

[Figure: a discriminant-function network; the features x₁ … x_d feed the discriminant functions g₁(x) … g_C(x) (which may incorporate the costs), and the class with the maximum gᵢ(x) is selected]
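One way to read the table above in code is sketched here, reusing the illustrative densities, priors and costs from the multi-class sketches:

```python
import numpy as np
from scipy.stats import norm

likelihoods = [norm(-2, 1), norm(0, 1), norm(3, 1.5)]          # P(x|wi), illustrative
priors = np.array([0.3, 0.5, 0.2])
C = np.array([[0, 1, 2], [1, 0, 1], [3, 1, 0]], dtype=float)   # illustrative costs

def discriminants(x, criterion):
    """Return g_i(x), i = 1..C, for the Bayes, MAP or ML criterion."""
    x = np.atleast_1d(x)
    lik = np.stack([lk.pdf(x) for lk in likelihoods])    # P(x|wi)
    post = lik * priors[:, None]
    post = post / post.sum(axis=0)                       # P(wi|x)
    if criterion == "ml":
        return lik                                       # g_i(x) = P(x|wi)
    if criterion == "map":
        return post                                      # g_i(x) = P(wi|x)
    return -(C @ post)                                   # g_i(x) = -R(ai|x)

x = np.array([-3.0, 0.5, 4.0])
for c in ("bayes", "map", "ml"):
    print(c, np.argmax(discriminants(x, c), axis=0) + 1)  # select the class with the max g_i(x)
```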

