
Classification

What is Classification?
Assigning an object to a certain class based on its similarity to
previous examples of other objects
Can be done with reference to original data or based on a model of
that data
E.g., Me: It's round, green, delicious and crunchy. You: It's an apple!

Examples
Classifying transactions as genuine or fraudulent, e.g. credit card usage,
insurance claims, cell phone calls
Classifying prospects as good or bad customers
Classifying engine faults by their symptoms
Classifying healthy and sick people based on the symptoms
Classifying tumor and normal cell lines based on DNA mutation and
gene expression

(Un)Certainty
As with most data mining solutions, a classification usually comes
with a degree of certainty.
It might be the probability of the object belonging to the class or it
might be some other measure of how closely the object resembles
other examples from that class

Techniques
Non-parametric, e.g. K nearest neighbour
Mathematical models, e.g. LDA, logistic regression, neural networks
Rule based models, e.g. decision trees
Support vector machines
Etc.

Classification vs. Prediction


Classification:
predicts categorical class labels
classifies data (constructs a model) based on the training set and the
values (class labels) in a classifying attribute and uses it in classifying
new data

Prediction:
models continuous-valued functions, i.e., predicts unknown or
missing values

Typical Applications

credit approval
target marketing
medical diagnosis
treatment effectiveness analysis

Classification: A Two-Step Process


Model construction: describing a set of predetermined classes
Each tuple/sample is assumed to belong to a predefined class, as determined by
the class label attribute
The set of tuples used for model construction: training set
The model is represented as classification rules, decision trees, or mathematical
formulae

Model usage: for classifying future or unknown objects


Estimate accuracy of the model
The known label of test sample is compared with the classified result from the
model
Accuracy rate is the percentage of test set samples that are correctly classified
by the model
Test set is independent of training set, otherwise over-fitting will occur

Classification Process (1): Model Construction

Training Data --> Classification Algorithms --> Classifier (Model)

NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

Classifier
(Model)

IF rank = professor
OR years > 6
THEN tenured = yes
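The induced rule can be applied directly as code; a minimal sketch (the function name is illustrative, not from the slides):

```python
def predict_tenured(rank, years):
    """Rule-form classifier learned from the training data:
    tenured = yes if rank is professor or years > 6."""
    return rank == "professor" or years > 6

print(predict_tenured("professor", 4))       # True: predicted tenured
print(predict_tenured("assistant prof", 3))  # False: predicted not tenured
```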

Classification Process (2): Use the Model in Prediction

Testing Data --> Classifier --> Tenured?

Unseen data: (Jeff, Professor, 4)

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes
Tenured?

Supervised vs. Unsupervised Learning


Supervised learning (classification)
Supervision: The training data (observations, measurements, etc.) are
accompanied by labels indicating the class of the observations

New data is classified based on the training set

Unsupervised learning (clustering)


The class labels of training data are unknown
Given a set of measurements, observations, etc. with the aim of
establishing the existence of classes or clusters in the data

Issues (1): Data Preparation


Data cleaning
Preprocess data in order to reduce noise and handle missing values

Relevance analysis (feature selection)


Remove the irrelevant or redundant attributes

Data transformation
Generalize and/or normalize data

Issues (2): Evaluating Classification Methods


Predictive accuracy
Speed and scalability

time to construct the model


time to use the model

Robustness

handling noise and missing values

Scalability

efficiency in disk-resident databases

Interpretability:

understanding and insight provided by the model

Goodness of rules

decision tree size


compactness of classification rules

Classification Algorithms

The k-Nearest Neighbor Algorithm


Performed on raw data
Count the number of other examples that are close
The winner is the most common class
k-NN is a "lazy learner": no explicit model is built in advance

The k-Nearest Neighbor Algorithm


Compute the distance (similarity) between training records and the
new object
Identify the k nearest objects by ordering the training objects based
on the distance
Assign the label which is most frequent among the k training records
nearest to that object.
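The three steps above can be sketched in a few lines of Python (toy data and Euclidean distance; all names are illustrative):

```python
import math
from collections import Counter

def knn_classify(train, new_obj, k=3):
    """Label new_obj by majority vote among its k nearest training records.
    train is a list of (features, label) pairs; distance is Euclidean."""
    nearest = sorted(train, key=lambda rec: math.dist(rec[0], new_obj))[:k]
    labels = [label for _, label in nearest]
    return Counter(labels).most_common(1)[0][0]

# Toy data: two well-separated classes
train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
         ((8, 8), "B"), ((8, 9), "B"), ((9, 8), "B")]
print(knn_classify(train, (2, 2), k=3))  # prints A
```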

The k-Nearest Neighbor Algorithm


All instances correspond to points in the n-D space.
The nearest neighbors are defined in terms of Euclidean distance.
The target function could be discrete- or real-valued.
For discrete-valued targets, k-NN returns the most common value
among the k training examples nearest to xq.
Voronoi diagram: the decision surface induced by 1-NN for a
typical set of training examples.

[Figure: decision regions induced by 1-NN (Voronoi diagram) around a query point xq, with positive (+) and negative (-) training examples]
Find the best K nearest


Perform P repetitions of K-fold cross-validation, for different numbers of neighbours k (e.g. k = 1, ..., 10)
Calculate the classification error of each CV run: error = sum(ni) / m, where ni = number of misclassified objects and m = total number of objects
Choose the k with the smallest classification error
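The selection step can be sketched as follows; the misclassification counts n_i are hypothetical, standing in for the results of P = 5 CV runs:

```python
# Hypothetical misclassification counts n_i from 5 CV runs
# for candidate neighbourhood sizes k = 1, 3, 5; m = 100 objects per run
misclassified = {1: [4, 5, 3, 6, 4], 3: [3, 2, 4, 3, 3], 5: [5, 4, 5, 6, 5]}
m = 100

# Mean classification error across runs, for each candidate k
error = {k: sum(ni) / (len(ni) * m) for k, ni in misclassified.items()}
best_k = min(error, key=error.get)
print(best_k)  # prints 3: the k with the smallest error
```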

Mathematical Model Approaches

Logistic Regression
In logistic regression the outcome variable is a binary variable
The purpose is to assess the effects of multiple explanatory variables,
which can be numeric and/or categorical, on the outcome variable.
Used because having a categorical outcome variable violates the
assumption of linearity in normal regression

The model is built on the odds P / (1 - P)

Measuring the Probability of Outcome


The probability of the outcome is measured by the odds of occurrence
of an event.
If P is the probability of an event, then (1-P) is the probability of it not
occurring.
Odds of success = P / (1 - P)

The Logistic Regression


The joint effects of all explanatory variables put together on the odds:
Odds = P / (1 - P) = e^(alpha + beta1*X1 + beta2*X2 + ... + betap*Xp)
Taking the logarithm of both sides:
log{P / (1 - P)} = alpha + beta1*X1 + beta2*X2 + ... + betap*Xp
Logit P = alpha + beta1*X1 + beta2*X2 + ... + betap*Xp
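Converting a fitted logit back to a probability is a one-liner; a sketch with made-up coefficients:

```python
import math

def logistic_prob(alpha, betas, xs):
    """P(event) from Logit P = alpha + sum(beta_i * x_i)."""
    logit = alpha + sum(b * x for b, x in zip(betas, xs))
    return 1 / (1 + math.exp(-logit))

# Hypothetical coefficients: logit = -2.0 + 0.8*1.0 + 1.5*0.5 = -0.45
p = logistic_prob(-2.0, [0.8, 1.5], [1.0, 0.5])
print(round(p, 3))  # prints 0.389
```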

Logistic Regression
Logistic regression analysis requires that the dependent variable be
dichotomous.

Logistic regression analysis requires that the independent variables be
metric or dichotomous.

Logistic Regression
Response - Presence/Absence of characteristic
Predictor - Numeric variable observed for each case
Model - p(x) = probability of event (P) at level x:
p(x) = e^(a + b*x) / (1 + e^(a + b*x))
b = 0: P(Presence) is the same at each level of x
b > 0: P(Presence) increases as x increases
b < 0: P(Presence) decreases as x increases

Odds Ratio
Interpretation of Regression Coefficient (b):
In linear regression, the slope coefficient is the change in the mean
response as x increases by 1 unit
In logistic regression, we can show that:
odds(x + 1) / odds(x) = e^b, where odds(x) = p(x) / (1 - p(x))
Thus e^b represents the multiplicative change in the odds of the outcome
when x increases by 1 unit
If b = 0, the odds and probability are the same at all x levels (e^b = 1)
If b > 0, the odds and probability increase as x increases (e^b > 1)
If b < 0, the odds and probability decrease as x increases (e^b < 1)
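A quick numerical check of this identity, with hypothetical intercept a and slope b:

```python
import math

a, b = -1.0, 0.5  # hypothetical intercept and slope

def odds(x):
    p = 1 / (1 + math.exp(-(a + b * x)))
    return p / (1 - p)

# Raising x by 1 multiplies the odds by e^b, whatever the starting x
print(round(odds(3) / odds(2), 4), round(math.exp(b), 4))
```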

Multiple Logistic Regression


Extension to more than one predictor variable (either numeric or dummy
variables).
With k predictors, the model is written:
p = e^(b0 + b1*x1 + ... + bk*xk) / (1 + e^(b0 + b1*x1 + ... + bk*xk))
Adjusted odds ratio for raising xi by 1 unit, holding all other predictors
constant:
ORi = e^(bi)
Many models have nominal/ordinal predictors, and widely make use of
dummy variables

Assessing the Model


log-likelihood = sum over i = 1..N of [ Yi * ln P(Yi) + (1 - Yi) * ln(1 - P(Yi)) ]

The Log-likelihood statistic


Analogous to the residual sum of squares in multiple regression
It is an indicator of how much unexplained information there is after
the model has been fitted.
Large values indicate poorly fitting statistical models.

Hosmer-Lemeshow Statistic
Measure of lack of fit
Null hypothesis: there is no difference between observed and model-predicted values.
If the p-value of the H-L goodness-of-fit test is greater than 0.05:
the model's estimates fit the data at an acceptable level,
indicating model predictions that are not significantly different from the
observed values.

Assessing Changes in Models


It is possible to calculate a log-likelihood for different models and to
compare these models by looking at the difference between their log-likelihoods:
chi-square = 2 * [LL(New) - LL(Baseline)]
df = k_new - k_baseline

If the significance of the chi-square statistic is less than .05, then the model
is a significant fit of the data
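A sketch of the comparison for one extra predictor (df = 1), with hypothetical log-likelihoods; for df = 1 the chi-square survival function is erfc(sqrt(x/2)), which avoids external libraries:

```python
import math

ll_baseline, ll_new = -240.0, -228.0   # hypothetical log-likelihoods
chi_sq = 2 * (ll_new - ll_baseline)    # 24.0
df = 1                                 # one extra predictor

# Chi-square p-value for df = 1
p_value = math.erfc(math.sqrt(chi_sq / 2))
print(chi_sq, p_value < 0.05)  # 24.0 True: the new model is a significant improvement
```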

Assessing Predictors: The Wald Statistic


Wald = b / SE_b
Similar to the t-statistic in regression.
Tests the null hypothesis that b = 0.

Assessing Predictors: The Odds Ratio or Exp(b)


Exp(b) = (odds after a unit change in the predictor) /
(odds before a unit change in the predictor)
Indicates the change in odds resulting from a unit change in the predictor.
OR > 1: as the predictor increases, the probability of the outcome occurring increases.
OR < 1: as the predictor increases, the probability of the outcome occurring decreases.

Example
A trial (based on 2,000 patients) with probability of dying at 30 days as response
(pi) and age, sex (F=1, M=0), and treatment (C=0, Tr=1) as regressors.
Estimated multiple logistic regression model:
logit(pi) = -7.65 + 0.073*age + 0.69*sex + 0.17*treatment
            (age: P<0.0001)  (sex: P=0.007)  (treatment: P=0.45)

Interpretation:
Treatments have no significant different effect (taking into account age & sex)
The older the patient the higher the probability of dying before (or at) 30 days
(taking into account sex and treatment)
Women have a significantly higher 30 day mortality rate (taking into account
age and treatment)

Example
Further interpretation:
e^0.17 = 1.19: odds ratio for rt-PA treatment (but NS)
e^0.69 = 1.99: odds ratio for sex = female
e^0.073 = 1.08: odds ratio for an increase of age by 1 yr
e^0.73 = 2.08: odds ratio for an increase of age by 10 yrs
All odds ratios are controlled for the other factors

The logistic regression model predicts the


outcome
The logistic regression model gives the probability that y=1 (30-day
mortality). The same applies when there is more than one regressor.
When p > 0.50, we predict the subject as y=1 (died at 30 days); when
p < 0.50 the subject is predicted as y=0 (survived at 30 days).
A classification table results, from which we can determine the sensitivity
and specificity.
By varying the threshold (from 0 to 1) different sensitivities and
specificities are obtained, tracing out a ROC (Receiver Operating
Characteristic) curve.

Simple Measures of Classification Accuracy

A false positive is defined as a positive test result when the disease or
condition being tested for is not actually present.
A false negative is defined as a negative test result when the disease
or condition being tested for is actually present.

Simple Measures Of Diagnostic Accuracy

Sensitivity is defined as the ability of a test to detect the disease


status or condition when it is truly present, i.e., it is the probability of
a positive test result given that the patient has the disease or
condition of interest

Simple Measures Of Diagnostic Accuracy

Specificity is the ability of a test to exclude the condition or disease in


patients who do not have the condition or the disease i.e., it is the
probability of a negative test result given that the patient does not
have the disease or condition of interest.

Clinical Sensitivity is how often the test is positive in diseased patients


Clinical Specificity is how often the test is negative in non-diseased
patients.

If the comparative procedure is imperfect, then sensitivity and


specificity estimates are almost always statistically biased
(systematically too high or too low).

Choosing a Comparative Procedure


FDA's four general recommendations regarding choosing a comparative procedure
to evaluate a new diagnostic test and reporting the results:
If a perfect standard is available, use it. Calculate estimated sensitivity and
specificity.
If a perfect standard is available but impractical, use it to the extent possible.
Calculate adjusted estimates of sensitivity and specificity.
If a perfect standard is not available, consider constructing one. Calculate
estimated sensitivity and specificity under the constructed standard.
If a perfect standard is not available and cannot be constructed, then an
appropriate approach may be reporting a measure of agreement.
(Statistical Guidance on reporting result from Studies Evaluating Diagnostic Tests, FDA)

Calculating Estimates of Sensitivity and Specificity

Another example
The following table gives the results of a diagnostic test, such as an x-ray or
computed tomography (CT) scan, where the true disease or condition of
the patient is known (Altman and Bland, 1994a). The counts below are
reconstructed from the four proportions reported for this example:

                Disease present  Disease absent
Test positive        231               32
Test negative         27               54
Total                258               86

SENS = 231/258 = 0.90
SPES = 54/86 = 0.63
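The same arithmetic in code, using the counts implied by the proportions above:

```python
# Counts implied by the CT example (Altman and Bland, 1994a)
tp, fn = 231, 27   # disease present: positive / negative test
fp, tn = 32, 54    # disease absent:  positive / negative test

sens = tp / (tp + fn)   # probability of a positive test given disease
spes = tn / (tn + fp)   # probability of a negative test given no disease
print(round(sens, 2), round(spes, 2))  # prints 0.9 0.63
```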

Predictive Value positive (PV+)


The predictive value positive (PV+) of a screening test (or symptom)
is the probability that a person has a disease given that the screening
test is positive (or has the symptom).
Pr(disease|test+)

Predictive Value negative (PV-)


The predictive value negative (PV-) of a screening test (or symptom)
is the probability that a person does not have a disease given that the
screening test is negative (or does not have the symptom).
Pr(no disease | test-)

From the CT example


The following table is the results from a diagnostic test like x-ray or
computer tomographic (CT) scan and the true disease or condition of
the patient is known (Altman and Bland, 1994a).

PPV (PV+) = 231/263 = 0.88


NPV (PV-) = 54/81 = 0.67

The CT case, but different Prevalence

Calculate the specificity, Sensitivity, PV+ and PV-.

The CT case, but different Prevalence

SENS = 231/258 = 0.90
SPES = 54/86 = 0.63
PPV = 77/173 = 0.45
NPV = 162/171 = 0.95

The PPV and the NPV are dependent on the prevalence of the disease
in the patient population being studied (Altman and Bland, 1994b).

Prevalence
The probability of currently having the disease regardless of the
duration of time one has had the disease.
Obtained by dividing the number of people who currently have the
disease by the number of people in the study population.

Cumulative incidence
The probability that a person with no prior disease will develop a new
case of the disease over some specified time period.

General Formula PV+ and PV-

PV+ = (SENS x Prevalence) / [SENS x Prevalence + (1 - SPES) x (1 - Prevalence)]
PV- = [SPES x (1 - Prevalence)] / [SPES x (1 - Prevalence) + (1 - SENS) x Prevalence]

Example: Prevalence = 30%, Sensitivity = 50%, Specificity = 90%

If the prevalence is 4% (Sn 50%, Sp 90%), the PV+ falls sharply.
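A sketch of the general formula, showing how PV+ collapses when prevalence drops from 30% to 4% at the same Sn/Sp:

```python
def predictive_values(prev, sens, spec):
    """PV+ and PV- from prevalence, sensitivity and specificity (Bayes' theorem)."""
    pv_pos = sens * prev / (sens * prev + (1 - spec) * (1 - prev))
    pv_neg = spec * (1 - prev) / (spec * (1 - prev) + (1 - sens) * prev)
    return pv_pos, pv_neg

hi = predictive_values(0.30, 0.50, 0.90)
lo = predictive_values(0.04, 0.50, 0.90)
print(round(hi[0], 2), round(hi[1], 2))  # prints 0.68 0.81
print(round(lo[0], 2), round(lo[1], 2))  # prints 0.17 0.98: PV+ collapses, PV- rises
```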

Another Example
Suppose 84% of hypertensives and 23% of normotensives are
classified as hypertensive by an automated blood pressure
machine. What are the PV+ and PV- of the machine,
assuming 20% of the adult population is hypertensive?

Sensitivity = .84
Specificity = 1 - .23 = .77. Thus,
PV+ = (.84)(.2) / [(.84)(.2) + (.23)(.8)] = .48
PV- = (.77)(.8) / [(.77)(.8) + (.16)(.2)] = .95

Thus a negative result from the machine is reasonably predictive


because we are 95% sure a person with a negative result from the
machine is normotensive.
However, a positive result is not very predictive because we are only
48% sure a person with a positive result from the machine is
hypertensive.

When can we use PPV and NPV?

Both SENS and SPES can be applied to other populations that have
different prevalence rates.
It is not appropriate to apply universally the PPV and the NPV
obtained from one study without information on prevalence.
The rarer the disease, the more sure one can be that a negative test
result indeed means that there is no disease, and the less sure that a
positive test result indicates the presence of the disease.
The lower the prevalence, the greater the number of people who will
be diagnosed as FP, even if the SENS and the SPES are high.
Likelihood Ratio (LR)


Another simple measure of diagnostic accuracy, given by the ratio of the probability
of the test result among patients who truly had the disease / condition to the
probability of the same test among patients who do not have the disease/condition.
LR is the ratio of SENS / (1-SPES).
For the previous example the LR is 2.4.
The magnitude of the LR informs about the certainty of a positive diagnosis.
A general guideline:
LR = 1 indicates that the test result is equally likely in patients with and without
the disease/condition,
LR > 1 indicates that the test result is more likely in patients with the disease/
condition,
LR < 1 indicates that the test result is more likely in patients without the disease/
condition (Zhou et al., 2002).
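The LR quoted for the CT example follows directly from its sensitivity and specificity:

```python
sens, spes = 231 / 258, 54 / 86   # from the CT example
lr = sens / (1 - spes)            # LR = SENS / (1 - SPES)
print(round(lr, 1))  # prints 2.4
```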

Receiver Operating Characteristic (ROC) Curve


Both SENS and SPES require a cutpoint in order to classify the test results
as positive or negative.
The SENS and SPES for a diagnostic test are therefore tied to the diagnostic
threshold or cutpoint selected for the test.
Many times the results from a diagnostic test may be on an ordinal or
numerical scale rather than just a binary outcome of positive or negative.
In these situations, the SENS and SPES are based on just one cutpoint when
in reality multiple cutpoints or thresholds are possible.
An ROC curve overcomes this limitation by including all the decision
thresholds possible for the results from a diagnostic test

Receiver Operating Characteristic (ROC) Curve


Is a graphical plot that illustrates the performance of a binary
classifier system as its discrimination threshold is varied. The curve is
created by plotting the true positive rate(TPR) against the false
positive rate (FPR) at various threshold settings.

An ROC curve is a plot of the SENS versus (1-SPES) of a diagnostic test,


where the different points on the curve correspond to different cutpoints used to determine if the test results are positive.

ROC curve Illustration


Below are the ratings of CT images from 109 subjects by a radiologist,
given by Table 3 (Hanley and McNeil, 1982).

Multiple cutpoints are possible for classifying a patient as normal or


abnormal based on the CT scan.

ROC curve Illustration


The designation of a cutpoint to classify the test results as positive or
negative is relatively arbitrary.
Suppose that the ratings of 4 or above indicate, for instance, that the
test is positive, then the SENS and SPES would be 0.86 and 0.78.
If the ratings of 3 or above are considered as positive, then the SENS
and SPES are 0.90 and 0.67.
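Sweeping the cutpoint over ratings traces out the ROC points; the data below are illustrative, not the Hanley-McNeil table:

```python
# Hypothetical ordinal ratings (1-5) with true status: 1 = abnormal, 0 = normal
cases = [(5, 1), (5, 1), (4, 1), (4, 0), (3, 1),
         (3, 0), (2, 1), (2, 0), (1, 0), (1, 0)]
pos = sum(y for _, y in cases)
neg = len(cases) - pos

def sens_spes(cutpoint):
    """SENS and SPES when the test is called positive for rating >= cutpoint."""
    tp = sum(1 for r, y in cases if r >= cutpoint and y == 1)
    tn = sum(1 for r, y in cases if r < cutpoint and y == 0)
    return tp / pos, tn / neg

for cut in (4, 3, 2):           # lowering the cutpoint raises SENS, lowers SPES
    print(cut, sens_spes(cut))
```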

ROC Curve

[Figure: ROC curve for the CT rating data]

Area Under Curve (AUC)


The area under the ROC curve is an effective way to summarize the
overall diagnostic accuracy of the test.
It takes values from 0 to 1, where a value of 0 indicates a perfectly
inaccurate test and a value of 1 reflects a perfectly accurate test.
The closer the ROC curve of a diagnostic test is to the (0, 1)
coordinate, the better is the test.

ROC
For the example the area under the ROC curve is 0.89
This means that the radiologist reading the CT scan has an 89%
chance of correctly distinguishing a normal from an abnormal patient
based on the ordering of the CT ratings.
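This ordering interpretation of the AUC can be checked directly on illustrative ratings (ties count one half):

```python
# AUC = probability that a randomly chosen abnormal case is rated
# higher than a randomly chosen normal one (ties count 1/2)
abnormal = [5, 5, 4, 3, 2]   # hypothetical ratings
normal = [4, 3, 2, 1, 1]

pairs = [(a, n) for a in abnormal for n in normal]
auc = sum(1.0 if a > n else 0.5 if a == n else 0.0 for a, n in pairs) / len(pairs)
print(auc)  # prints 0.82
```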

ROC curves are invariant to the prevalence of a disease, but


dependent on the patient characteristics and the disease spectrum.
An ROC curve does not depend on the scale of the test results, and
can be used to provide a visual comparison of two or more test
results on a common scale.
ROC curves are useful for comparing the diagnostic ability of two or
more screening tests for the same disease.

We can fit those models using logistic regression.


Calculate the AUC for each model M1: 0.82, M2: 0.79, and M3:0.77
The test of equality of three ROC curve areas: A chi-square statistic of 15.48 suggests
that at least 2 models differ significantly (p=0.0004).
3 pair-wise comparisons (0.0103, <.0001, 0.0183) indicate that the 3 models are
different from each other.
Model 1 with Gleason score, PSA, and digital rectal exam results is considered to have
the best ability to discriminate between the subjects
However, the final decision should also be based on the clinical meaningfulness of
such differences identified by statistical analysis.

In case they are similar, and if the Gleason score is both easier to measure and cost
effective relative to other two factors, then one can go with a parsimonious model of
using just the Gleason score as a predictor.

Penalized Logistic Regression


Logistic regression is a supervised method for binary
or multi-class classification.
In high-dimensional data (e.g., microarray) there are more
variables than observations, so classical logistic
regression does not work.
Other problems: variables are correlated
(multicollinearity) and overfitting.
Solution: introduce a penalty for complexity in the
model.

Penalized Logistic Regression

Logistic model:
P(yi = 1 | xi) = e^(b0 + xi'b) / (1 + e^(b0 + xi'b))

Maximize the log-likelihood:
l(b) = sum over i of [ yi * ln pi + (1 - yi) * ln(1 - pi) ]

L1-penalization (Lasso): maximize
l(b) - lambda * sum over j of |bj|

L1 Penalized Logistic Regression

Shrinks all regression coefficients (b) toward zero
and sets some of them exactly to zero.
Performs parameter estimation and variable
selection at the same time.
The choice of lambda is crucial and is made via a k-fold
cross-validation procedure.
The procedure is implemented in an R package
called penalized.
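A minimal sketch of the penalized objective (here written as a negative log-likelihood to be minimized); the data and coefficients are toy values, and the intercept is left unpenalized as is conventional:

```python
import math

def penalized_nll(beta0, beta, X, y, lam):
    """Negative log-likelihood plus L1 (lasso) penalty lam * sum|beta_j|."""
    nll = 0.0
    for xi, yi in zip(X, y):
        eta = beta0 + sum(b * x for b, x in zip(beta, xi))
        p = 1 / (1 + math.exp(-eta))
        nll -= yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return nll + lam * sum(abs(b) for b in beta)

X, y = [(1.0,), (2.0,)], [0, 1]   # toy data
# With |beta| = 1, moving lam from 0 to 1 raises the objective by exactly lam
diff = penalized_nll(0.0, [1.0], X, y, 1.0) - penalized_nll(0.0, [1.0], X, y, 0.0)
print(round(diff, 6))  # prints 1.0
```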

Classification of Severe Malaria Anemia vs. Uncomplicated Malaria group

AUC: 0.86

Next: Still classification


Decision Tree
Neural Network
Support Vector Machine
Ensemble Methods (optional)