
DISCRIMINATION RULES IN PRACTICE

The ML rule is used if the distribution of the data is known up to parameters. Suppose for example that the data come from multivariate normal distributions $N_p(\mu_j, \Sigma)$. If we have $J$ groups with $n_j$ observations in each group, we use $\bar{x}_j$ to estimate $\mu_j$, and $S_j$ to estimate $\Sigma$. The common covariance may be estimated by
$$S_u = \sum_{j=1}^{J} n_j \frac{S_j}{n - J}, \qquad (12.9)$$
with $n = \sum_{j=1}^{J} n_j$. Thus the empirical version of the ML rule of Theorem 12.2 is to allocate a new observation $x$ to $\Pi_j$ such that $j$ minimizes
$$(x - \bar{x}_i)^\top S_u^{-1} (x - \bar{x}_i) \quad \text{for } i \in \{1, \ldots, J\}.$$

Estimation of the probabilities of misclassification

Misclassification probabilities are given by (12.7) and can be estimated by replacing the unknown parameters by their corresponding estimators. For the ML rule for two normal populations we obtain
$$\hat{p}_{12} = \hat{p}_{21} = \Phi\!\left(-\frac{\hat{\delta}}{2}\right),$$
where $\hat{\delta}^2 = (\bar{x}_1 - \bar{x}_2)^\top S_u^{-1} (\bar{x}_1 - \bar{x}_2)$ is the estimator for $\delta^2$.

The probabilities of misclassification may also be estimated by the re-substitution method. We reclassify each original observation $x_i$, $i = 1, \ldots, n$, into $\Pi_1, \ldots, \Pi_J$ according to the chosen rule. Then, denoting the number of individuals coming from $\Pi_j$ which have been classified into $\Pi_i$ by $n_{ij}$, we have $\hat{p}_{ij} = n_{ij}/n_j$, an estimator of $p_{ij}$. Clearly, this method leads to overly optimistic estimators of $p_{ij}$, but it provides a rough measure of the quality of the discriminant rule. The matrix $(\hat{p}_{ij})$ is called the confusion matrix in Johnson and Wichern (1998).
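To make the empirical ML rule and the re-substitution estimate concrete, here is a minimal numpy sketch (the function names and the simulated two-group data are illustrative, not from the text), assuming each $S_j$ is the within-group covariance with divisor $n_j$ as used in (12.9):

import numpy as np

def pooled_covariance(groups):
    """Pooled covariance S_u as in (12.9): sum over groups of n_j * S_j / (n - J),
    where S_j is the within-group covariance with divisor n_j."""
    n = sum(len(X) for X in groups)
    J = len(groups)
    p = groups[0].shape[1]
    Su = np.zeros((p, p))
    for X in groups:
        Sj = np.cov(X, rowvar=False, bias=True)   # divides by n_j
        Su += len(X) * Sj / (n - J)
    return Su

def ml_classify(x, means, Su_inv):
    """Allocate x to the group whose mean minimizes (x - xbar_i)' Su^{-1} (x - xbar_i)."""
    dists = [float((x - m) @ Su_inv @ (x - m)) for m in means]
    return int(np.argmin(dists))

def resubstitution_confusion(groups):
    """Reclassify every original observation; entry [j, i] estimates
    p_ij = P(classified into group i | observation comes from group j)."""
    means = [X.mean(axis=0) for X in groups]
    Su_inv = np.linalg.inv(pooled_covariance(groups))
    J = len(groups)
    P = np.zeros((J, J))
    for j, X in enumerate(groups):
        for x in X:
            P[j, ml_classify(x, means, Su_inv)] += 1
        P[j] /= len(X)
    return P

# Usage with simulated data (two groups, p = 2)
rng = np.random.default_rng(0)
X1 = rng.multivariate_normal([0, 0], np.eye(2), size=40)
X2 = rng.multivariate_normal([2, 1], np.eye(2), size=60)
print(resubstitution_confusion([X1, X2]))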

Fisher's linear discrimination function

Another approach stems from R. A. Fisher. His idea was to base the discriminant rule on a projection $a^\top x$ such that a good separation was achieved. This LDA projection method is called Fisher's linear discrimination function. If
$$Y = X a$$
denotes a linear combination of observations, then the total sum of squares of $y$, $\sum_{i=1}^{n} (y_i - \bar{y})^2$, is equal to
$$Y^\top H Y = a^\top X^\top H X a = a^\top T a \qquad (12.11)$$
with the centering matrix $H = I_n - n^{-1} 1_n 1_n^\top$ and $T = X^\top H X$.

Suppose we have samples $X_j$, $j = 1, \ldots, J$, from $J$ populations. Fisher's suggestion was to find the linear combination $a^\top x$ which maximizes the ratio of the between-group-sum of squares to the within-group-sum of squares. The within-group-sum of squares is given by
$$\sum_{j=1}^{J} Y_j^\top H_j Y_j = \sum_{j=1}^{J} a^\top X_j^\top H_j X_j a = a^\top W a, \qquad (12.12)$$
where $Y_j$ denotes the $j$-th submatrix of $Y$ corresponding to observations of group $j$ and $H_j$ denotes the $(n_j \times n_j)$ centering matrix. The within-group-sum of squares measures the sum of variations within each group.

The between-group-sum of squares is
$$\sum_{j=1}^{J} n_j (\bar{y}_j - \bar{y})^2 = \sum_{j=1}^{J} n_j \{a^\top (\bar{x}_j - \bar{x})\}^2 = a^\top B a, \qquad (12.13)$$
where $\bar{y}_j$ and $\bar{x}_j$ denote the means of $Y_j$ and $X_j$, and $\bar{y}$ and $\bar{x}$ denote the sample means of $Y$ and $X$. The between-group-sum of squares measures the variation of the means across groups.

The total sum of squares (12.11) is the sum of the within-group-sum of squares and the between-group-sum of squares, i.e.,
$$a^\top T a = a^\top W a + a^\top B a.$$
Fisher's idea was to select a projection vector $a$ that maximizes the ratio
$$\frac{a^\top B a}{a^\top W a}. \qquad (12.14)$$
The solution is found by applying Theorem 2.5.

THEOREM 12.4 The vector $a$ that maximizes (12.14) is the eigenvector of $W^{-1} B$ that corresponds to the largest eigenvalue.

Now a discrimination rule is easy to obtain: classify $x$ into group $j$ where $a^\top \bar{x}_j$ is closest to $a^\top x$, i.e.,
$$x \to \Pi_j \quad \text{where} \quad j = \arg\min_i |a^\top (x - \bar{x}_i)|.$$
When $J = 2$ groups, the discriminant rule is easy to compute. Suppose that group 1 has $n_1$ elements and group 2 has $n_2$ elements. In this case
$$B = \left(\frac{n_1 n_2}{n}\right) d d^\top, \quad \text{where} \quad d = (\bar{x}_1 - \bar{x}_2).$$
$W^{-1} B$ has only one eigenvalue, which equals
$$\operatorname{tr}(W^{-1} B) = \left(\frac{n_1 n_2}{n}\right) d^\top W^{-1} d,$$
and the corresponding eigenvector is $a = W^{-1} d$. The corresponding discriminant rule is
$$x \to \Pi_1 \ \text{ if } \ a^\top \{x - \tfrac{1}{2}(\bar{x}_1 + \bar{x}_2)\} > 0, \qquad x \to \Pi_2 \ \text{ if } \ a^\top \{x - \tfrac{1}{2}(\bar{x}_1 + \bar{x}_2)\} \le 0. \qquad (12.15)$$
The Fisher LDA is closely related to projection pursuit (Chapter 18) since the statistical technique is based on a one-dimensional index $a^\top x$.
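For the two-group case, rule (12.15) translates almost line by line into numpy. The sketch below is illustrative (the function name and the simulated data are ours, not from the text): it builds $W$ from the centered group matrices, sets $a = W^{-1} d$, and classifies by the sign of $a^\top\{x - \tfrac{1}{2}(\bar{x}_1 + \bar{x}_2)\}$.

import numpy as np

def fisher_lda_two_groups(X1, X2):
    """Two-group Fisher rule (12.15): a = W^{-1} d with d = xbar_1 - xbar_2,
    where W is the within-group sum-of-squares matrix."""
    xbar1, xbar2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-group sums of squares: sum over groups of centered cross-products
    W = (X1 - xbar1).T @ (X1 - xbar1) + (X2 - xbar2).T @ (X2 - xbar2)
    d = xbar1 - xbar2
    a = np.linalg.solve(W, d)            # a = W^{-1} d
    midpoint = 0.5 * (xbar1 + xbar2)
    # Classify into group 1 if a'(x - midpoint) > 0, otherwise group 2
    return lambda x: 1 if a @ (x - midpoint) > 0 else 2

# Usage with simulated data
rng = np.random.default_rng(1)
X1 = rng.multivariate_normal([0, 0], np.eye(2), size=30)
X2 = rng.multivariate_normal([3, 1], np.eye(2), size=30)
rule = fisher_lda_two_groups(X1, X2)
print(rule(np.array([0.2, -0.1])), rule(np.array([2.8, 1.1])))   # expect 1, 2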

c. Classification Rules

To develop a classification rule for classifying an observation $y$ into one or the other population in the two-group case requires some new notation. First, we let $f_1(y)$ and $f_2(y)$ represent the probability density functions (pdfs) associated with the random vector $Y$ for populations $\pi_1$ and $\pi_2$, respectively. We let $p_1$ and $p_2$ be the prior probabilities that $y$ is a member of $\pi_1$ and $\pi_2$, respectively, where $p_1 + p_2 = 1$. And we let $c_1 = C(2 \mid 1)$ and $c_2 = C(1 \mid 2)$ represent the misclassification cost of assigning an observation from $\pi_1$ to $\pi_2$, and from $\pi_2$ to $\pi_1$, respectively. Then, assuming the pdfs $f_1(y)$ and $f_2(y)$ are known, the total probability of misclassification (TPM) is equal to $p_1$ times the probability of assigning an observation to $\pi_2$ given that it is from $\pi_1$, $P(2 \mid 1)$, plus $p_2$ times the probability that an observation is classified into $\pi_1$ given that it is from $\pi_2$, $P(1 \mid 2)$. Hence,
$$TPM = p_1 P(2 \mid 1) + p_2 P(1 \mid 2). \qquad (7.2.14)$$
The optimal error rate (OER) is the error rate that minimizes the TPM. Taking costs into account, the average or expected cost of misclassification is defined as
$$ECM = p_1 P(2 \mid 1) C(2 \mid 1) + p_2 P(1 \mid 2) C(1 \mid 2). \qquad (7.2.15)$$
A reasonable classification rule is to make the ECM as small as possible. In practice, costs of misclassification are usually unknown.

To assign an observation $y$ to $\pi_1$ or $\pi_2$, Fisher (1936) employed his LDF. To apply the rule, he assumed that $\Sigma_1 = \Sigma_2 = \Sigma$, and because he did not assume any pdf, Fisher's rule does not require normality. He also assumed that $p_1 = p_2$ and that $C(1 \mid 2) = C(2 \mid 1)$. Using (7.2.3), we see that $D^2 > 0$, so that $\bar{L}_1 - \bar{L}_2 > 0$ and $\bar{L}_1 > \bar{L}_2$. Hence, if
$$L = a_s^\top y = (\bar{y}_1 - \bar{y}_2)^\top S^{-1} y > \frac{\bar{L}_1 + \bar{L}_2}{2}, \qquad (7.2.16)$$
the observation $y$ is assigned to $\pi_1$; otherwise it is assigned to $\pi_2$.
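As a small illustration of (7.2.14)-(7.2.16), the sketch below evaluates TPM and ECM for made-up priors, error rates, and costs (all of the numbers and names are assumptions for illustration, not taken from the text), and then applies the midpoint rule from sample quantities:

import numpy as np

# --- TPM and ECM for illustrative priors, error rates and costs ---
p1, p2 = 0.5, 0.5                        # prior probabilities, p1 + p2 = 1
P2_given_1, P1_given_2 = 0.10, 0.15      # misclassification probabilities (assumed)
C2_given_1, C1_given_2 = 1.0, 4.0        # misclassification costs (assumed)

TPM = p1 * P2_given_1 + p2 * P1_given_2                              # (7.2.14)
ECM = p1 * P2_given_1 * C2_given_1 + p2 * P1_given_2 * C1_given_2    # (7.2.15)
print(TPM, ECM)

# --- Fisher's LDF midpoint rule (7.2.16) from sample quantities ---
def ldf_rule(y, ybar1, ybar2, S_pooled):
    a = np.linalg.solve(S_pooled, ybar1 - ybar2)   # a_s = S^{-1}(ybar1 - ybar2)
    L = a @ y
    Lbar1, Lbar2 = a @ ybar1, a @ ybar2
    return 1 if L > 0.5 * (Lbar1 + Lbar2) else 2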

Factor Analysis

When there are many variables in a research design, it is often helpful to reduce them to a smaller set of factors. This is an interdependence technique, in which there is no dependent variable. Rather, the researcher is looking for the underlying structure of the data matrix. Ideally, the variables are normal and continuous, with at least 3 to 5 variables loading onto each factor. The sample size should be over 50 observations, with over 5 observations per variable.

Multicollinearity is generally desirable among the variables, as the correlations are key to data reduction. Kaiser's Measure of Sampling Adequacy (MSA) is a measure of the degree to which every variable can be predicted by all other variables.
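A rough sketch of how the overall MSA is usually computed follows; this is the standard Kaiser-Meyer-Olkin calculation based on anti-image partial correlations, stated here as an assumption rather than taken from the text above.

import numpy as np

def kmo_msa(X):
    """Overall Kaiser-Meyer-Olkin MSA: sum of squared off-diagonal correlations
    divided by that sum plus the sum of squared partial (anti-image) correlations."""
    R = np.corrcoef(X, rowvar=False)
    R_inv = np.linalg.inv(R)
    # Partial (anti-image) correlations: -r^{ij} / sqrt(r^{ii} r^{jj})
    d = np.sqrt(np.outer(np.diag(R_inv), np.diag(R_inv)))
    A = -R_inv / d
    mask = ~np.eye(R.shape[0], dtype=bool)        # off-diagonal entries only
    r2, a2 = np.sum(R[mask] ** 2), np.sum(A[mask] ** 2)
    return r2 / (r2 + a2)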

An overall MSA of .80 or higher is very good, while a measure of under .50 is deemed poor. There are two main factor analysis methods: common factor analysis, which extracts factors based on the variance shared by the variables, and principal component analysis, which extracts factors based on the total variance of the variables. Common factor analysis is used to look for the latent (underlying) factors, whereas principal component analysis is used to find the fewest number of components that explain the most variance.

The first factor extracted explains the most variance. Typically, factors are extracted as long as the eigenvalues are greater than 1.0, or the scree test is used to indicate visually how many factors to extract. The factor loadings are the correlations between the factor and the variables; typically a factor loading of .4 or higher is required to attribute a specific variable to a factor. An orthogonal rotation assumes no correlation between the factors, whereas an oblique rotation is used when some relationship is believed to exist.
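As a rough illustration of the extraction step described above, the sketch below (a principal-component-style computation; function and parameter names are ours) applies the eigenvalue-greater-than-1.0 retention rule to the correlation matrix and forms unrotated loadings by scaling eigenvectors by the square roots of their eigenvalues:

import numpy as np

def extract_factors(X, min_eigenvalue=1.0, loading_cutoff=0.4):
    """Principal-component-style extraction: retain factors whose correlation-matrix
    eigenvalues exceed min_eigenvalue; loadings are eigenvectors scaled by
    sqrt(eigenvalue), i.e. correlations between the variables and the factors."""
    R = np.corrcoef(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]               # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    keep = eigvals > min_eigenvalue                 # "eigenvalue greater than 1.0" rule
    loadings = eigvecs[:, keep] * np.sqrt(eigvals[keep])
    assigned = np.abs(loadings) >= loading_cutoff   # variables attributed to each factor
    return eigvals, loadings, assigned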
