
# Classification

What is Classification?
Assigning an object to a certain class based on its similarity to
previous examples of other objects.
Can be done with reference to the original data or based on a model of
that data.
E.g.: Me: "It's round, green, delicious and crunchy." You: "It's an
apple!"

Examples
Classifying transactions as genuine or fraudulent, e.g., credit card usage,
insurance claims, cell phone calls
Classifying prospects as good or bad customers
Classifying engine faults by their symptoms
Classifying people as healthy or sick based on their symptoms
Classifying cell lines as tumor or normal based on DNA mutations and
gene expression

(Un)Certainty
As with most data mining solutions, a classification usually comes
with a degree of certainty.
It might be the probability of the object belonging to the class, or it
might be some other measure of how closely the object resembles
other examples from that class.

Techniques
Non-parametric, e.g., k-nearest neighbour
Mathematical models, e.g., LDA, logistic regression, neural networks
Rule-based models, e.g., decision trees
Support vector machines
etc.

## Classification vs. Prediction

Classification:
predicts categorical class labels
classifies data (constructs a model) based on the training set and the
values (class labels) in a classifying attribute, and uses it to classify
new data

Prediction:
models continuous-valued functions, i.e., predicts unknown or
missing values

Typical Applications

credit approval
target marketing
medical diagnosis
treatment effectiveness analysis

## Classification: A Two-Step Process

Model construction: describing a set of predetermined classes
Each tuple/sample is assumed to belong to a predefined class, as determined by
the class label attribute
The set of tuples used for model construction: training set
The model is represented as classification rules, decision trees, or mathematical
formulae

## Model usage: for classifying future or unknown objects

Estimate accuracy of the model
The known label of test sample is compared with the classified result from the
model
Accuracy rate is the percentage of test set samples that are correctly classified
by the model
The test set is independent of the training set; otherwise over-fitting will occur

## Classification Process (1): Model Construction

Classification algorithms are applied to the training data to construct the model.

| NAME | RANK           | YEARS | TENURED |
|------|----------------|-------|---------|
| Mike | Assistant Prof | 3     | no      |
| Mary | Assistant Prof | 7     | yes     |
| Bill | Professor      | 2     | yes     |
| Jim  | Associate Prof | 7     | yes     |
| Dave | Assistant Prof | 6     | no      |
| Anne | Associate Prof | 3     | no      |


Classifier (Model):

IF rank = professor
OR years > 6
THEN tenured = yes

## Classification Process (2): Use the Model in Prediction

The classifier is applied first to the testing data, then to unseen data,
e.g. (Jeff, Professor, 4): Tenured?

| NAME    | RANK           | YEARS | TENURED |
|---------|----------------|-------|---------|
| Tom     | Assistant Prof | 2     | no      |
| Merlisa | Associate Prof | 7     | no      |
| George  | Professor      | 5     | yes     |
| Joseph  | Assistant Prof | 7     | yes     |

## Supervised vs. Unsupervised Learning

Supervised learning (classification)
Supervision: The training data (observations, measurements, etc.) are
accompanied by labels indicating the class of the observations

## Unsupervised learning (clustering)

The class labels of the training data are unknown
Given a set of measurements, observations, etc. with the aim of
establishing the existence of classes or clusters in the data

## Issues (1): Data Preparation

Data cleaning
Preprocess data in order to reduce noise and handle missing values

## Relevance analysis (feature selection)

Remove the irrelevant or redundant attributes

Data transformation
Generalize and/or normalize data

## Issues (2): Evaluating Classification Methods

Predictive accuracy
Speed and scalability
time to construct the model
time to use the model
Robustness
Scalability
efficiency in disk-resident databases
Interpretability
understanding and insight provided by the model
Goodness of rules
decision tree size
compactness of classification rules

## Classification Algorithms

## The k-Nearest Neighbor Algorithm

Performed on the raw data
Count the classes of the other examples that are close
The winner is the most common class
kNN methods are "lazy learners"

## The k-Nearest Neighbor Algorithm

Compute the distance (similarity) between the training records and the
new object
Identify the k nearest objects by ordering the training objects by
distance
Assign the label that is most frequent among the k training records
nearest to that object.
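The three steps above can be sketched in a few lines of Python (the fruit data and its feature scales are hypothetical, purely for illustration):

```python
from collections import Counter
from math import dist

def knn_classify(train, new_obj, k=3):
    """Classify new_obj by majority vote among its k nearest training records.
    `train` is a list of (feature_vector, label) pairs."""
    # 1. Compute the distance between each training record and the new object
    by_distance = sorted(train, key=lambda rec: dist(rec[0], new_obj))
    # 2. Identify the k nearest objects
    nearest = by_distance[:k]
    # 3. Assign the most frequent label among those k records
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Hypothetical fruit data: (roundness, crunchiness) -> class
train = [((0.9, 0.8), "apple"), ((0.8, 0.9), "apple"),
         ((0.3, 0.2), "banana"), ((0.2, 0.1), "banana")]
print(knn_classify(train, (0.85, 0.75), k=3))  # apple
```

Because all the work happens at query time (sorting the training set for every new object), this is the "lazy learner" behaviour mentioned above.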

## The k-Nearest Neighbor Algorithm

All instances correspond to points in n-dimensional space.
The nearest neighbors are defined in terms of Euclidean distance.
The target function can be discrete- or real-valued.
For discrete values, k-NN returns the most common value
among the k training examples nearest to xq.
Voronoi diagram: the decision surface induced by 1-NN for a
typical set of training examples.

[Figure: 1-NN decision regions around a query point xq, with + and - training examples]
## Find the best k

Perform P repetitions of K-fold cross-validation (with different values of k, e.g., k = 1, ..., 10)
Calculate the classification error of each CV run:
error = (sum of n_i) / m, where n_i = number of misclassified objects in fold i and m = total number of objects
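A minimal sketch of this selection procedure in Python, assuming a small kNN classifier and toy two-class data (both hypothetical):

```python
import random
from collections import Counter
from math import dist

def knn(train, x, k):
    nearest = sorted(train, key=lambda rec: dist(rec[0], x))[:k]
    return Counter(lbl for _, lbl in nearest).most_common(1)[0][0]

def cv_error(data, k, n_folds=5, repeats=3, seed=0):
    """Misclassification rate (sum of fold counts n_i) / m,
    averaged over `repeats` shuffled runs of n_folds-fold CV."""
    rng, m, rates = random.Random(seed), len(data), []
    for _ in range(repeats):
        idx = list(range(m))
        rng.shuffle(idx)
        n_miss = 0
        for f in range(n_folds):
            test = set(idx[f::n_folds])
            train = [data[i] for i in idx if i not in test]
            n_miss += sum(knn(train, data[i][0], k) != data[i][1] for i in test)
        rates.append(n_miss / m)  # classification error of this CV run
    return sum(rates) / repeats

# Two well-separated toy classes; pick the k with the lowest CV error (k = 1..10)
data = ([((0.0 + i, 0.0), "a") for i in range(5)] +
        [((10.0 + i, 10.0), "b") for i in range(5)])
best_k = min(range(1, 11), key=lambda k: cv_error(data, k))
print(best_k, cv_error(data, best_k))
```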

## Mathematical Model Approaches

Logistic Regression
In logistic regression the outcome variable is binary.
The purpose is to assess the effects of multiple explanatory variables,
which can be numeric and/or categorical, on the outcome variable.
It is used because a categorical outcome variable violates the
linearity assumption of normal (linear) regression.

Odds = P / (1 - P)

## Measuring the Probability of Outcome

The probability of the outcome is measured by the odds of occurrence
of an event.
If P is the probability of an event, then (1-P) is the probability of it not
occurring.
Odds of success = P / (1 - P)

## The Logistic Regression

The joint effect of all explanatory variables put together on the odds is

Odds = P / (1 - P) = e^(α + β1X1 + β2X2 + ... + βpXp)

Taking the logarithm of both sides:

log{P / (1 - P)} = α + β1X1 + β2X2 + ... + βpXp
Logit(P) = α + β1X1 + β2X2 + ... + βpXp
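The probability/odds/logit relationships above can be checked numerically (a small Python sketch; the value p = 0.8 is just an example):

```python
from math import exp, log

def odds(p):
    return p / (1 - p)            # odds of success = P / (1 - P)

def logit(p):
    return log(odds(p))           # logit(P) = log{P / (1 - P)}

def inv_logit(z):
    return exp(z) / (1 + exp(z))  # back-transform a logit to a probability

p = 0.8
print(round(odds(p), 9))              # 4.0: the event is 4x as likely as not
print(round(inv_logit(logit(p)), 9))  # 0.8: the two transforms are inverses
```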

Logistic Regression
Logistic regression analysis requires that the dependent variable be
dichotomous.
Logistic regression analysis requires that the independent variables be
metric or dichotomous.

Logistic Regression
Response - presence/absence of a characteristic
Predictor - numeric variable observed for each case
Model - p(x) = probability of the event at level x:

p(x) = e^(α + bx) / (1 + e^(α + bx))

b = 0: P(presence) is the same at each level of x
b > 0: P(presence) increases as x increases
b < 0: P(presence) decreases as x increases

Odds Ratio
Interpretation of the regression coefficient (b):
In linear regression, the slope coefficient is the change in the mean
response as x increases by 1 unit
In logistic regression, we can show that:

odds(x + 1) / odds(x) = e^b,  where  odds(x) = p(x) / (1 - p(x))

Thus e^b represents the multiplicative change in the odds of the outcome
when x increases by 1 unit
If b = 0, the odds and probability are the same at all x levels (e^b = 1)
If b > 0, the odds and probability increase as x increases (e^b > 1)
If b < 0, the odds and probability decrease as x increases (e^b < 1)
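A quick numerical check of this interpretation, using hypothetical coefficients a and b: the ratio odds(x+1)/odds(x) equals e^b no matter where x starts.

```python
from math import exp

# Hypothetical coefficients for a single-predictor model: logit(P) = a + b*x
a, b = -2.0, 0.5

def odds_at(x):
    # under the logistic model, odds(x) = p(x) / (1 - p(x)) = e^(a + b*x)
    return exp(a + b * x)

# odds(x + 1) / odds(x) equals e^b at every starting point x
for x in (0.0, 3.0, 10.0):
    print(round(odds_at(x + 1) / odds_at(x), 6))  # 1.648721 = e^0.5 each time
```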

## Multiple Logistic Regression

Extension to more than one predictor variable (either numeric or dummy
variables).
With k predictors, the model is written:

p = e^(α + β1x1 + ... + βkxk) / (1 + e^(α + β1x1 + ... + βkxk))

Adjusted odds ratio for raising xi by 1 unit, holding all other predictors
constant:

OR_i = e^(βi)

Many models have nominal/ordinal predictors, and make wide use of
dummy variables

## Assessing the Model

Log-likelihood:

LL = Σ_{i=1}^{N} [ Y_i ln P(Y_i) + (1 - Y_i) ln(1 - P(Y_i)) ]

## The Log-likelihood statistic

Analogous to the residual sum of squares in multiple regression
It is an indicator of how much unexplained information there is after
the model has been fitted.
Large values of -2LL indicate poorly fitting statistical models.

Hosmer-Lemeshow Statistic
A measure of lack of fit.
Null hypothesis: there is no difference between observed and model-predicted values.
If the p-value of the H-L goodness-of-fit test is greater than 0.05, the
model's estimates fit the data at an acceptable level,
indicating model predictions that are not significantly different from the observed
values.

## Assessing Changes in Models

It's possible to calculate a log-likelihood for different models and to
compare the models by looking at the difference between their log-likelihoods:

χ² = 2[LL(New) - LL(Baseline)],   df = k_new - k_baseline

If the significance of the chi-square statistic is less than .05, then the new model
is a significantly better fit to the data.

Wald Statistic

Similar to the t-statistic in regression.
Tests the null hypothesis that b = 0:

Wald = b / SE_b

Exp(b)

Exp(b) = (odds after a unit change in the predictor) /
         (odds before a unit change in the predictor)

It indicates the change in odds resulting from a unit change in
the predictor.
OR > 1: as the predictor increases, the probability of the outcome occurring increases.
OR < 1: as the predictor increases, the probability of the outcome occurring decreases.

Example
A trial (based on 2,000 patients) with the probability of dying at 30 days as the response
(π) and age, sex (F=1, M=0), and treatment (C=0, Tr=1) as regressors.
Estimated multiple logistic regression model:

logit(π) = -7.65 + 0.073 age + 0.69 sex + 0.17 treatment
                   (P<0.0001)  (P=0.007)  (P=0.45)

Interpretation:
The treatments have no significantly different effect (taking into account age and sex).
The older the patient, the higher the probability of dying before (or at) 30 days
(taking into account sex and treatment).
Women have a significantly higher 30-day mortality rate (taking into account
age and treatment).

Example
Further interpretation:
e^0.17 = 1.19: odds ratio for rt-PA (but NS)
e^0.69 = 1.99: odds ratio for sex = female
e^0.073 = 1.08: odds ratio for an increase of age by 1 yr
e^0.73 = 2.08: odds ratio for an increase of age by 10 yrs
All odds ratios are controlled for the other factors.

## The logistic regression model predicts the outcome

The logistic regression model gives the probability that y = 1 (30-day
mortality) will happen. The same applies when there is more than one
regressor.
When p > 0.50 we predict the subject as y = 1 (died at 30 days); when
p < 0.50 the subject is predicted as y = 0 (survived at 30 days).
A classification table results, from which we can determine the sensitivity
and specificity.
By varying the threshold (from 0 through 0.50 to 1), different sensitivities
and specificities are obtained, and from these a ROC (Receiver Operating
Characteristic) curve is constructed.
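The classification-table and threshold-varying idea can be sketched as follows (the outcomes and fitted probabilities are hypothetical):

```python
def sens_spec(y_true, p_hat, threshold):
    """Sensitivity and specificity from the classification table at one cutpoint."""
    pred = [int(p > threshold) for p in p_hat]
    tp = sum(p == 1 and y == 1 for p, y in zip(pred, y_true))
    tn = sum(p == 0 and y == 0 for p, y in zip(pred, y_true))
    fp = sum(p == 1 and y == 0 for p, y in zip(pred, y_true))
    fn = sum(p == 0 and y == 1 for p, y in zip(pred, y_true))
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical true outcomes (1 = died at 30 days) and fitted probabilities
y = [1, 1, 1, 0, 1, 0, 0, 0]
phat = [0.9, 0.8, 0.6, 0.55, 0.4, 0.3, 0.2, 0.1]

# Varying the threshold traces out (sensitivity, 1 - specificity): the ROC curve
for t in (0.25, 0.50, 0.75):
    print(t, sens_spec(y, phat, t))
```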

A false positive is defined as a positive test result when the disease or
condition being tested for is not actually present.
A false negative is defined as a negative test result when the disease
or condition being tested for is actually present.

Sensitivity is defined as the ability of a test to detect the disease,
status or condition when it is truly present, i.e., it is the probability of
a positive test result given that the patient has the disease or
condition of interest.
Specificity is the ability of a test to exclude the condition or disease in
patients who do not have the condition or the disease, i.e., it is the
probability of a negative test result given that the patient does not
have the disease or condition of interest.
Clinical sensitivity is how often the test is positive in diseased patients.
Clinical specificity is how often the test is negative in non-diseased
patients.

If the comparative procedure is imperfect, then sensitivity and
specificity estimates are almost always statistically biased
(systematically too high or too low).

## Choosing a Comparative Procedure

The FDA's four general recommendations regarding choosing a comparative procedure
to evaluate a new diagnostic test and reporting the results:
If a perfect standard is available, use it. Calculate estimated sensitivity and
specificity.
If a perfect standard is available but impractical, use it to the extent possible.
Calculate adjusted estimates of sensitivity and specificity.
If a perfect standard is not available, consider constructing one. Calculate
estimated sensitivity and specificity under the constructed standard.
If a perfect standard is not available and cannot be constructed, then an
appropriate approach may be reporting a measure of agreement.
(Statistical Guidance on Reporting Results from Studies Evaluating Diagnostic Tests, FDA)

## Calculating Estimates of Sensitivity and Specificity

Another example
The following table shows the results from a diagnostic test such as an x-ray or
computed tomographic (CT) scan, where the true disease or condition of
the patient is known (Altman and Bland, 1994a).

|               | Disease present | Disease absent | Total |
|---------------|-----------------|----------------|-------|
| Test positive | 231             | 32             | 263   |
| Test negative | 27              | 54             | 81    |
| Total         | 258             | 86             | 344   |

SENS = 231/258 = 0.90
SPES = 54/86 = 0.63

## Predictive Value positive (PV+)

The predictive value positive (PV+) of a screening test (or symptom)
is the probability that a person has a disease given that the screening
test is positive (or has the symptom).
Pr(disease|test+)

## Predictive Value negative (PV-)

The predictive value negative (PV-) of a screening test (or symptom)
is the probability that a person does not have a disease given that the
screening test is negative (or the person does not have the symptom).
Pr(no disease | test-)

## From the CT example

PPV (PV+) = 231/263 = 0.88
NPV (PV-) = 54/81 = 0.67

With the prevalence reversed (86 diseased and 258 non-diseased patients), the
same SENS (0.90) and SPES (0.63) give very different predictive values:
PPV = 77/173 = 0.45
NPV = 162/171 = 0.95

The PPV and the NPV are dependent on the prevalence of the disease
in the patient population being studied (Altman and Bland, 1994b).
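Under the assumption that the CT example's 2x2 counts are as implied by the reported ratios (TP = 231, FN = 27, FP = 32, TN = 54), all four measures follow directly:

```python
def diagnostic_summary(tp, fn, fp, tn):
    """Sensitivity, specificity, PPV and NPV from a 2x2 classification table."""
    return {
        "SENS": tp / (tp + fn),  # positive test | disease present
        "SPES": tn / (tn + fp),  # negative test | disease absent
        "PPV":  tp / (tp + fp),  # disease present | positive test
        "NPV":  tn / (tn + fn),  # disease absent | negative test
    }

# Counts implied by the reported ratios: 231/258, 54/86, 231/263, 54/81
m = diagnostic_summary(tp=231, fn=27, fp=32, tn=54)
print({k: round(v, 2) for k, v in m.items()})
```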

Prevalence
The probability of currently having the disease regardless of the
duration of time one has had the disease.
Obtained by dividing the number of people who currently have the
disease by the number of people in the study population.

Cumulative incidence
The probability that a person with no prior disease will develop a new
case of the disease over some specified time period.

If the prevalence is 4%, Sn = 50% and Sp = 90%, then
PV+ = (0.50)(0.04) / [(0.50)(0.04) + (0.10)(0.96)] = 0.17:
even a fairly specific test has a low PV+ at low prevalence.

Another Example
Suppose 84% of hypertensives and 23% of normotensives are
classified as hypertensive by an automated blood pressure
machine. What are the PV+ and PV- of the machine,
assuming 20% of the adult population is hypertensive?

Sensitivity = .84
Specificity = 1 - .23 = .77. Thus,
PV+ = (.84)(.2) / [(.84)(.2) + (.23)(.8)] = .48
PV- = (.77)(.8) / [(.77)(.8) + (.16)(.2)] = .95

Thus a negative result from the machine is reasonably predictive,
because we are 95% sure that a person with a negative result from the
machine is normotensive.
However, a positive result is not very predictive, because we are only
48% sure that a person with a positive result from the machine is
hypertensive.
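The two calculations above are just Bayes' rule, which can be sketched as:

```python
def pv_pos(sens, spec, prev):
    # Pr(disease | test+) via Bayes' rule
    return sens * prev / (sens * prev + (1 - spec) * (1 - prev))

def pv_neg(sens, spec, prev):
    # Pr(no disease | test-) via Bayes' rule
    return spec * (1 - prev) / (spec * (1 - prev) + (1 - sens) * prev)

# Blood-pressure machine: sens .84, spec .77 (= 1 - .23), prevalence .20
print(round(pv_pos(0.84, 0.77, 0.20), 2))  # 0.48
print(round(pv_neg(0.84, 0.77, 0.20), 2))  # 0.95
```

Rerunning with a smaller prevalence shows the PV+ dropping even when sens and spec are unchanged, which is the point of the next slide.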

## When can we use PPV and NPV?

Both SENS and SPES can be applied to other populations that have
different prevalence rates.
It is not appropriate to apply universally the PPV and the NPV
obtained from one study without information on prevalence.
The rarer the disease, the more sure one can be
that a negative test result indeed means that there is no disease, and
the less sure that a positive test result indicates the presence of the disease.
The lower the prevalence, the greater the number of people who will
be diagnosed as FP, even if the SENS and the SPES are high.

## Likelihood Ratio (LR)

Another simple measure of diagnostic accuracy, given by the ratio of the probability
of the test result among patients who truly have the disease/condition to the
probability of the same result among patients who do not have the disease/condition.
The (positive) LR is the ratio SENS / (1 - SPES).
For the previous example the LR is 2.4.
The magnitude of the LR informs about the certainty of a positive diagnosis.
A general guideline:
LR = 1 indicates that the test result is equally likely in patients with and without
the disease/condition,
LR > 1 indicates that the test result is more likely in patients with the disease/
condition,
LR < 1 indicates that the test result is more likely in patients without the disease/
condition (Zhou et al., 2002).
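A one-line check of the LR for the CT example (SENS = 231/258, SPES = 54/86):

```python
def likelihood_ratio(sens, spec):
    # LR+ : how much more likely a positive result is in diseased patients
    return sens / (1 - spec)

# CT example: SENS = 231/258, SPES = 54/86
print(round(likelihood_ratio(231 / 258, 54 / 86), 1))  # 2.4
```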

## Receiver Operating Characteristic (ROC) Curve

Both SENS and SPES require a cutpoint in order to classify the test results
as positive or negative.
The SENS and SPES for a diagnostic test are therefore tied to the diagnostic
threshold or cutpoint selected for the test.
Many times the results from a diagnostic test may be on an ordinal or
numerical scale rather than just a binary outcome of positive or negative.
In these situations, the SENS and SPES are based on just one cutpoint when
in reality multiple cutpoints or thresholds are possible.
An ROC curve overcomes this limitation by including all the decision
thresholds possible for the results from a diagnostic test

## Receiver Operating Characteristic (ROC) Curve

Is a graphical plot that illustrates the performance of a binary
classifier system as its discrimination threshold is varied. The curve is
created by plotting the true positive rate(TPR) against the false
positive rate (FPR) at various threshold settings.

An ROC curve is a plot of the SENS versus (1 - SPES) of a diagnostic test,
where the different points on the curve correspond to different cutpoints
used to determine whether the test results are positive.

## ROC curve Illustration

Below are the ratings of CT images from 109 subjects by a radiologist,
given by Table 3 (Hanley and McNeil, 1982).

Multiple cutpoints are possible for classifying a patient as normal or
abnormal based on the CT scan.

## ROC curve Illustration

The designation of a cutpoint to classify the test results as positive or
negative is relatively arbitrary.
Suppose that the ratings of 4 or above indicate, for instance, that the
test is positive, then the SENS and SPES would be 0.86 and 0.78.
If the ratings of 3 or above are considered as positive, then the SENS
and SPES are 0.90 and 0.67.

[Figure: ROC curve for the CT rating data]

## Area Under Curve (AUC)

The area under the ROC curve is an effective way to summarize the
overall diagnostic accuracy of the test.
It takes values from 0 to 1, where a value of 0 indicates a perfectly
inaccurate test and a value of 1 reflects a perfectly accurate test.
The closer the ROC curve of a diagnostic test is to the (0, 1)
coordinate, the better the test.

ROC
For the example the area under the ROC curve is 0.89
This means that the radiologist reading the CT scan has an 89%
chance of correctly distinguishing a normal from an abnormal patient
based on the ordering of the CT ratings.

ROC curves are invariant to the prevalence of a disease, but
dependent on the patient characteristics and the disease spectrum.
An ROC curve does not depend on the scale of the test results, and
can be used to provide a visual comparison of two or more test
results on a common scale.
ROC curves are useful for comparing the diagnostic ability of two or
more screening tests for the same disease.
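The "probability of correctly ordering a random abnormal/normal pair" interpretation of AUC gives a direct way to compute it (the ratings below are hypothetical, not Table 3):

```python
from itertools import product

def auc(scores_abnormal, scores_normal):
    """AUC as the probability that a randomly chosen abnormal case is rated
    higher than a randomly chosen normal case (ties count one half)."""
    pairs = [(a, n) for a, n in product(scores_abnormal, scores_normal)]
    wins = sum((a > n) + 0.5 * (a == n) for a, n in pairs)
    return wins / len(pairs)

# Hypothetical ordinal ratings (1 = definitely normal ... 5 = definitely abnormal)
abnormal = [5, 5, 4, 4, 3]
normal = [1, 1, 2, 2, 3]
print(auc(abnormal, normal))  # 0.98
```

Because only the ordering of the ratings matters, this works for ordinal CT scores just as well as for fitted probabilities.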

We can fit candidate models using logistic regression.
Calculate the AUC for each model: M1: 0.82, M2: 0.79, and M3: 0.77.
The test of equality of the three ROC curve areas: a chi-square statistic of 15.48 suggests
that at least 2 models differ significantly (p=0.0004).
3 pairwise comparisons (0.0103, <.0001, 0.0183) indicate that the 3 models are
different from each other.
Model 1, with Gleason score, PSA, and digital rectal exam results, is considered to have
the best ability to discriminate between the subjects.
However, the final decision should also be based on the clinical meaningfulness of
such differences identified by statistical analysis.

In case they are similar, and if the Gleason score is both easier to measure and more
cost-effective than the other two factors, then one can go with a parsimonious model
using just the Gleason score as a predictor.

## Penalized Logistic Regression

Logistic regression is a supervised method for binary
or multi-class classification.
In high-dimensional data (e.g., microarrays) there are more
variables than observations, and classical logistic
regression does not work.
Other problems: variables are correlated
(multicollinearity) and overfitting.
Solution: introduce a penalty for complexity in the
model.

Penalized Logistic Regression

Logistic model:

logit(P) = β0 + β1x1 + ... + βpxp

L1-penalization (lasso): maximize the log-likelihood minus a penalty on the
size of the coefficients,

LL(β) - λ Σ_{j=1}^{p} |βj|

Shrinks all regression coefficients (β) toward zero
and sets some of them to exactly zero.
Performs parameter estimation and variable
selection at the same time.
The choice of λ is crucial and is made via a k-fold
cross-validation procedure.
The procedure is implemented in an R package
called penalized.
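A minimal numpy sketch of L1-penalized logistic regression via proximal gradient descent (an illustration of the idea, not the R `penalized` package; the data, λ, step size, and iteration count are all hypothetical choices):

```python
import numpy as np

def soft_threshold(v, t):
    # proximal operator of the L1 penalty: shrink toward zero, exactly zero if small
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_logistic(X, y, lam, lr=0.5, n_iter=2000):
    """Minimise mean logistic loss + lam * ||b||_1 by proximal gradient descent.
    (Intercept omitted for brevity; lam would normally be chosen by k-fold CV.)"""
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ b))  # fitted probabilities
        grad = X.T @ (p - y) / len(y)     # gradient of the mean log-loss
        b = soft_threshold(b - lr * grad, lr * lam)
    return b

# Toy data: only the first of 5 variables actually drives the outcome
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (rng.uniform(size=200) < 1.0 / (1.0 + np.exp(-2.0 * X[:, 0]))).astype(float)

b = lasso_logistic(X, y, lam=0.05)
print(np.round(b, 2))  # the first coefficient should dominate; others shrink toward 0
```

The soft-thresholding step is what sets small coefficients exactly to zero, giving the simultaneous estimation and variable selection described above.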

## Classification of Severe Malaria Anemia vs. Uncomplicated Malaria Group

AUC: 0.86

## Next: Still Classification

Decision Trees
Neural Networks
Support Vector Machines
Ensemble Methods (optional)