Likelihood ratio test
Probability of error
Bayes risk
Bayes, MAP and ML criteria
Multi-class problems
Discriminant functions
To minimize the probability of error, we decide $\omega_1$ if $P(\omega_1|x) > P(\omega_2|x)$. Applying Bayes rule, this becomes

$$\frac{P(x|\omega_1)P(\omega_1)}{P(x)} \underset{\omega_2}{\overset{\omega_1}{\gtrless}} \frac{P(x|\omega_2)P(\omega_2)}{P(x)}$$

Since $P(x)$ does not affect the decision rule, it can be eliminated*. Rearranging the previous expression,

$$\Lambda(x) = \frac{P(x|\omega_1)}{P(x|\omega_2)} \underset{\omega_2}{\overset{\omega_1}{\gtrless}} \frac{P(\omega_2)}{P(\omega_1)}$$

The term $\Lambda(x)$ is called the likelihood ratio, and the decision rule is known as the likelihood ratio test (LRT)
*$P(x)$ can be disregarded in the decision rule since it is constant regardless of class $\omega_i$. However, $P(x)$ will be needed if we want to estimate the posterior $P(\omega_i|x)$, which, unlike $P(x|\omega_1)P(\omega_1)$, is a true probability value and therefore gives us an estimate of the goodness of our decision
CSCE 666 Pattern Analysis | Ricardo Gutierrez-Osuna | CSE@TAMU 3
Solution

Substituting the unit-variance Gaussian likelihoods, centered at $x=4$ and $x=10$, and equal priors into the LRT expression:

$$\Lambda(x) = \frac{e^{-\frac{1}{2}(x-4)^2}}{e^{-\frac{1}{2}(x-10)^2}} \underset{\omega_2}{\overset{\omega_1}{\gtrless}} 1$$

Changing signs and taking logs:

$$(x-4)^2 - (x-10)^2 \underset{\omega_1}{\overset{\omega_2}{\gtrless}} 0$$

Which yields:

$$x \underset{\omega_1}{\overset{\omega_2}{\gtrless}} 7$$

This LRT result is intuitive since the likelihoods differ only in their mean: the decision threshold falls at the midpoint of the two means
How would the LRT decision rule change if the priors were such that $P(\omega_1) = 2P(\omega_2)$?
[Figure: class-conditional densities $P(x|\omega_1)$ and $P(x|\omega_2)$, centered at $x=4$ and $x=10$; decision regions $R_1$ ("say $\omega_1$", $x<7$) and $R_2$ ("say $\omega_2$", $x>7$)]
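The worked example can be checked numerically. The sketch below assumes, as in the example, unit-variance Gaussian likelihoods centered at 4 and 10; the function names are illustrative:

```python
import math

def gauss_pdf(x, mu, sigma=1.0):
    """Univariate Gaussian density N(mu, sigma^2)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def lrt_decide(x, prior1=0.5, prior2=0.5):
    """Likelihood ratio test: choose omega_1 iff Lambda(x) > P(w2)/P(w1)."""
    lam = gauss_pdf(x, 4.0) / gauss_pdf(x, 10.0)   # likelihood ratio Lambda(x)
    return 1 if lam > prior2 / prior1 else 2

# With equal priors the boundary sits at the midpoint x = 7
print(lrt_decide(6.9))   # 1
print(lrt_decide(7.1))   # 2
```

This also answers the slide's question: with $P(\omega_1) = 2P(\omega_2)$ the threshold $P(\omega_2)/P(\omega_1) = 1/2$ shifts the boundary to $7 + \ln 2 / 6 \approx 7.12$, so `lrt_decide(7.1, prior1=2/3, prior2=1/3)` now returns class 1.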
Probability of error
The performance of any decision rule can be measured by $P[\text{error}]$
Making use of the Theorem of total probability:

$$P[\text{error}] = \sum_{i=1}^{C} P[\text{error}|\omega_i]\,P(\omega_i)$$

The class-conditional error $P[\text{error}|\omega_i]$ is the probability that $x$ falls in the region assigned to the other class:

$$P[\text{error}] = P(\omega_1)\int_{R_2} P(x|\omega_1)\,dx + P(\omega_2)\int_{R_1} P(x|\omega_2)\,dx = \epsilon_1 P(\omega_1) + \epsilon_2 P(\omega_2)$$

[Figure: the two error areas, $\epsilon_1$ (tail of $P(x|\omega_1)$ inside $R_2$, "say $\omega_2$") and $\epsilon_2$ (tail of $P(x|\omega_2)$ inside $R_1$, "say $\omega_1$")]

For the previous example, since we assumed equal priors, $P[\text{error}] = (\epsilon_1 + \epsilon_2)/2$
How would you compute $\epsilon_i$ numerically?
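One way to compute the $\epsilon_i$ numerically is through the standard normal CDF, available via `math.erf`. A minimal sketch for the running example (unit-variance Gaussians at 4 and 10, threshold at 7, equal priors):

```python
import math

def norm_cdf(z):
    """Standard normal CDF, expressed through the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Decision threshold from the LRT example: R1 = {x < 7}, R2 = {x > 7}
t = 7.0
eps1 = 1.0 - norm_cdf(t - 4.0)   # P(x in R2 | w1): tail of N(4,1) above 7
eps2 = norm_cdf(t - 10.0)        # P(x in R1 | w2): tail of N(10,1) below 7
p_error = 0.5 * eps1 + 0.5 * eps2   # equal priors: (eps1 + eps2) / 2
print(f"{p_error:.6f}")   # about 0.00135, i.e. a 3-sigma tail
```

By symmetry of the two densities about the threshold, $\epsilon_1 = \epsilon_2$ here.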
The optimal decision rule will minimize $P[\text{error}|x]$ at every value of $x$ in feature space, so that the integral $P[\text{error}] = \int P[\text{error}|x]\,P(x)\,dx$ is minimized
From the figure it becomes clear that, for any value of $x$, the LRT will always have a lower $P[\text{error}|x]$ than any other decision rule
Therefore, when we integrate over the real line, the LRT decision rule will yield a lower $P[\text{error}]$
For any given problem, the minimum probability of error is achieved by the LRT decision rule; this probability of error is called the Bayes Error Rate and is the best any classifier can do
Bayes risk
So far we have assumed that the penalty of misclassifying $x$ as $\omega_2$ when $\omega_1$ is the true class is the same as the penalty for the converse error
In general, this is not the case. For example, misclassifying a cancer sufferer as a healthy patient is a much more serious problem than the other way around
This concept can be formalized in terms of a cost function $C_{ij}$, which represents the cost of choosing class $\omega_i$ when $\omega_j$ is the true class
The Bayes Risk is the expected value of the cost over both regions and both classes:

$$\mathcal{R} = C_{11} P(\omega_1)\!\int_{R_1}\! P(x|\omega_1)dx + C_{12} P(\omega_2)\!\int_{R_1}\! P(x|\omega_2)dx + C_{21} P(\omega_1)\!\int_{R_2}\! P(x|\omega_1)dx + C_{22} P(\omega_2)\!\int_{R_2}\! P(x|\omega_2)dx$$

Since $R_1$ and $R_2$ partition the feature space, $\int_{R_2} P(x|\omega_i)\,dx = 1 - \int_{R_1} P(x|\omega_i)\,dx$. Merging the last equation into the Bayes Risk expression yields

$$\mathcal{R} = C_{21}P(\omega_1) + C_{22}P(\omega_2) + \int_{R_1} \left[ (C_{12}-C_{22})P(\omega_2)P(x|\omega_2) - (C_{21}-C_{11})P(\omega_1)P(x|\omega_1) \right] dx$$

where $(C_{21}-C_{11}) > 0$ and $(C_{12}-C_{22}) > 0$, since an error is assumed to cost more than a correct decision
The first two terms are constant w.r.t. $R_1$, so they can be ignored. Thus, we seek a decision region $R_1$ that minimizes

$$\int_{R_1} g(x)\,dx \quad \text{with} \quad g(x) = (C_{12}-C_{22})P(\omega_2)P(x|\omega_2) - (C_{21}-C_{11})P(\omega_1)P(x|\omega_1)$$
Let's forget about the actual expression of $g(x)$ to develop some intuition for what kind of decision region $R_1$ we are looking for
Intuitively, we will select for $R_1$ those regions that minimize the integral; in other words, those regions where $g(x) < 0$
[Figure: $g(x)$ along $x$; the intervals where $g(x) < 0$ are assigned to $R_1$]
So we will choose $R_1$ such that $(C_{21}-C_{11})P(\omega_1)P(x|\omega_1) > (C_{12}-C_{22})P(\omega_2)P(x|\omega_2)$. Rearranging,

$$\Lambda(x) = \frac{P(x|\omega_1)}{P(x|\omega_2)} \underset{\omega_2}{\overset{\omega_1}{\gtrless}} \frac{(C_{12}-C_{22})P(\omega_2)}{(C_{21}-C_{11})P(\omega_1)}$$

Therefore, minimization of the Bayes Risk also leads to an LRT
[Figure: class-conditional densities over $x$ and the resulting minimum-risk decision regions, alternating $R_1$, $R_2$, $R_1$ along the axis]
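For the running Gaussian example, the minimum-risk LRT reduces to a shifted threshold, since $\log\Lambda(x) = -6x + 42$. A minimal sketch (the cost values passed in are illustrative):

```python
import math

def risk_threshold(c12, c21, p1, p2, c11=0.0, c22=0.0):
    """Decision threshold of the minimum-risk LRT for N(4,1) vs N(10,1).

    Choose omega_1 iff Lambda(x) > theta, where
    theta = (C12 - C22) P(w2) / ((C21 - C11) P(w1)).
    Since log Lambda(x) = -6x + 42 for these Gaussians,
    the rule becomes: choose omega_1 iff x < 7 - ln(theta)/6.
    """
    theta = (c12 - c22) * p2 / ((c21 - c11) * p1)
    return 7.0 - math.log(theta) / 6.0

# Zero-one costs and equal priors recover the boundary x = 7
print(risk_threshold(c12=1, c21=1, p1=0.5, p2=0.5))   # 7.0
# Making omega_1 -> omega_2 errors 10x costlier (C21 = 10) enlarges R1
print(risk_threshold(c12=1, c21=10, p1=0.5, p2=0.5))
```

Raising $C_{21}$ pushes the threshold to the right, enlarging the region where we say $\omega_1$, exactly as the cancer example suggests.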
LRT variations
Bayes criterion
This is the LRT that minimizes the Bayes risk:

$$\Lambda_{\text{Bayes}}(x) = \frac{P(x|\omega_1)}{P(x|\omega_2)} \underset{\omega_2}{\overset{\omega_1}{\gtrless}} \frac{(C_{12}-C_{22})P(\omega_2)}{(C_{21}-C_{11})P(\omega_1)}$$
MAP criterion
A special case of the Bayes criterion that uses a zero-one cost, $C_{ij} = 0$ for $i = j$ and $C_{ij} = 1$ for $i \neq j$. Known as the MAP criterion, since it seeks to maximize the posterior:

$$\Lambda_{\text{MAP}}(x) = \frac{P(x|\omega_1)}{P(x|\omega_2)} \underset{\omega_2}{\overset{\omega_1}{\gtrless}} \frac{P(\omega_2)}{P(\omega_1)} \quad \Leftrightarrow \quad P(\omega_1|x) \underset{\omega_2}{\overset{\omega_1}{\gtrless}} P(\omega_2|x)$$

ML criterion
With equal priors, the MAP criterion reduces to comparing the likelihoods alone:

$$\Lambda_{\text{ML}}(x) = \frac{P(x|\omega_1)}{P(x|\omega_2)} \underset{\omega_2}{\overset{\omega_1}{\gtrless}} 1$$
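The difference between the MAP and ML criteria shows up when the priors are unequal. A small sketch with hypothetical likelihood values (function names are illustrative):

```python
def map_decide(lik1, lik2, p1, p2):
    """MAP: compare P(x|wi)P(wi); the evidence P(x) cancels out."""
    return 1 if lik1 * p1 > lik2 * p2 else 2

def ml_decide(lik1, lik2):
    """ML: compare likelihoods only (implicitly assumes equal priors)."""
    return 1 if lik1 > lik2 else 2

# Hypothetical likelihood values at some point x
lik1, lik2 = 0.30, 0.40
print(ml_decide(lik1, lik2))             # 2: omega_2 is more likely
print(map_decide(lik1, lik2, 0.8, 0.2))  # 1: the strong prior on omega_1 wins
```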
The Minimax Criterion, used in Game Theory, is derived from the Bayes criterion, and seeks to minimize the maximum Bayes Risk
The Minimax Criterion does not require knowledge of the priors, but it needs a cost function
For more information on these methods, refer to Detection, Estimation and Modulation Theory, by H.L. van Trees
Multi-class problems

For multi-class problems, $P[\text{error}]$ is minimized by choosing, at each $x$, the class with the largest posterior $P(\omega_i|x)$

[Figure: posteriors $P(\omega_1|x)$, $P(\omega_2|x)$, $P(\omega_3|x)$ and the resulting decision regions $R_2, R_1, R_3, R_2, R_1$ along the axis]

So each $R_i$ is the region where $P(\omega_i|x)$ is maximum, and the decision rule that minimizes $P[\text{error}]$ is the MAP criterion
The (conditional) risk of assigning $x$ to class $\omega_i$ is

$$\mathcal{R}(\omega_i|x) = \sum_{j=1}^{C} C_{ij} P(\omega_j|x)$$

And the Bayes Risk associated with decision rule $\alpha(x)$ is

$$\mathcal{R} = \int \mathcal{R}(\alpha(x)|x)\,P(x)\,dx$$

To minimize this expression, we must minimize the conditional risk at each $x$, which is equivalent to choosing the $\omega_i$ for which $\mathcal{R}(\omega_i|x)$ is minimum

[Figure: conditional risks $\mathcal{R}(\omega_i|x)$ and the resulting decision regions $R_2, R_1, R_2, R_3, R_2$ along the axis]
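The conditional-risk rule is a few lines of code. A minimal sketch (the cost matrix and posterior values are illustrative):

```python
def conditional_risk(cost, posteriors):
    """R(omega_i | x) = sum_j C[i][j] * P(w_j | x), one value per action i."""
    return [sum(c_ij * p_j for c_ij, p_j in zip(row, posteriors))
            for row in cost]

def min_risk_class(cost, posteriors):
    """Choose the class whose conditional risk is minimum (1-indexed)."""
    risks = conditional_risk(cost, posteriors)
    return min(range(len(risks)), key=risks.__getitem__) + 1

# Zero-one costs: risk_i = 1 - P(w_i|x), so minimizing risk is exactly MAP
zero_one = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
print(min_risk_class(zero_one, [0.2, 0.5, 0.3]))  # 2
```

With a non-uniform cost matrix the same code implements the full Bayes criterion.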
Discriminant functions
All the decision rules presented so far have the same structure: at each point $x$ in feature space, choose the class $\omega_i$ that maximizes (or minimizes) some measure $g_i(x)$
This structure can be formalized with a set of discriminant functions $g_i(x)$, $i = 1 \ldots C$, and the decision rule: assign $x$ to class $\omega_i$ if $g_i(x) > g_j(x)$ for all $j \neq i$
Therefore, we can visualize the decision rule as a network that computes $C$ discriminant functions and selects the class with the highest discriminant
The three decision rules can be summarized in terms of discriminant functions as:
Criterion   Discriminant function
Bayes       $g_i(x) = -\mathcal{R}(\omega_i|x)$
MAP         $g_i(x) = P(\omega_i|x)$
ML          $g_i(x) = P(x|\omega_i)$
[Figure: network view of the classifier; features $x_1, x_2, x_3, \ldots, x_d$ feed the discriminant functions $g_1(x) \ldots g_C(x)$, and a max selector produces the class assignment]
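The network view maps directly to code: evaluate every discriminant, then take the max. A minimal sketch using MAP discriminants $g_i(x) = P(x|\omega_i)P(\omega_i)$ for a hypothetical 3-class, 1-D Gaussian problem (the means and priors are made up for illustration):

```python
import math

def gauss_pdf(x, mu, sigma=1.0):
    """Univariate Gaussian density N(mu, sigma^2)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Hypothetical classes; dropping the shared evidence P(x) leaves the argmax unchanged
CLASSES = [
    {"mu": 0.0, "prior": 0.3},
    {"mu": 4.0, "prior": 0.4},
    {"mu": 8.0, "prior": 0.3},
]

def classify(x):
    """The 'network': evaluate all discriminants g_i(x), select the max (1-indexed)."""
    g = [gauss_pdf(x, c["mu"]) * c["prior"] for c in CLASSES]
    return max(range(len(g)), key=g.__getitem__) + 1

print(classify(3.9))  # 2
```

Swapping in $g_i(x) = -\mathcal{R}(\omega_i|x)$ or $g_i(x) = P(x|\omega_i)$ changes only the line that builds `g`, which is exactly the point of the discriminant-function formulation.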