Classification Theory


What is Classification?

Assigning an object to a certain class based on its similarity to previous examples of other objects.

Can be done with reference to the original data or based on a model of that data.

E.g.: Me: "It's round, green, delicious and crunchy." You: "It's an apple!"

Examples

Classifying transactions as genuine or fraudulent, e.g. credit card usage, insurance claims, cell phone calls
Classifying prospects as good or bad customers
Classifying engine faults by their symptoms
Classifying healthy and sick people based on their symptoms
Classifying tumor and normal cell lines based on DNA mutation and gene expression

(Un)Certainty

As with most data mining solutions, a classification usually comes with a degree of certainty. It might be the probability of the object belonging to the class, or it might be some other measure of how closely the object resembles other examples from that class.

Techniques

Non-parametric, e.g. k nearest neighbour
Mathematical models, e.g. LDA, logistic regression, neural networks
Rule-based models, e.g. decision trees
Support vector machines
Etc.

Classification:

Predicts categorical class labels.
Classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute, and uses it in classifying new data.

Prediction:

Models continuous-valued functions, i.e., predicts unknown or missing values.

Typical Applications

credit approval

target marketing

medical diagnosis

treatment effectiveness analysis

Model construction: describing a set of predetermined classes

Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute.
The set of tuples used for model construction is the training set.
The model is represented as classification rules, decision trees, or mathematical formulae.

Estimating the accuracy of the model

The known label of each test sample is compared with the classified result from the model.
The accuracy rate is the percentage of test set samples that are correctly classified by the model.
The test set must be independent of the training set, otherwise over-fitting will occur.

Model Construction

The training data are fed to a classification algorithm, which constructs the classifier (model):

NAME   RANK            YEARS   TENURED
Mike   Assistant Prof  3       no
Mary   Assistant Prof  7       yes
Bill   Professor       2       yes
Jim    Associate Prof  7       yes
Dave   Assistant Prof  6       no
Anne   Associate Prof  3       no

Classifier (Model):

IF rank = 'professor' OR years > 6
THEN tenured = 'yes'
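The induced rule can be sketched as a small function. This is a minimal illustration: the rule and table come from the slide, while the function name is ours.

```python
# Rule induced from the training data:
# IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
def predict_tenured(rank, years):
    """Classify a faculty record with the induced rule (hypothetical helper name)."""
    return "yes" if rank == "Professor" or years > 6 else "no"

# Training set from the slide: (name, rank, years, tenured)
training = [
    ("Mike", "Assistant Prof", 3, "no"),
    ("Mary", "Assistant Prof", 7, "yes"),
    ("Bill", "Professor", 2, "yes"),
    ("Jim", "Associate Prof", 7, "yes"),
    ("Dave", "Assistant Prof", 6, "no"),
    ("Anne", "Associate Prof", 3, "no"),
]

# The rule reproduces every training label
accuracy = sum(predict_tenured(r, y) == t for _, r, y, t in training) / len(training)
```

Applied to the unseen record (Jeff, Professor, 4), the rule returns "yes", matching the slide's prediction.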

Prediction (Using the Model)

The classifier is applied to testing data with known labels, and then to unseen data, e.g. (Jeff, Professor, 4): Tenured?

NAME     RANK            YEARS   TENURED
Tom      Assistant Prof  2       no
Merlisa  Associate Prof  7       no
George   Professor       5       yes
Joseph   Assistant Prof  7       yes

Supervised learning (classification)

Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations.

Unsupervised learning (clustering)

The class labels of the training data are unknown. Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data.

Data cleaning

Preprocess the data in order to reduce noise and handle missing values. Remove irrelevant or redundant attributes.

Data transformation

Generalize and/or normalize the data.

Evaluating classification methods:

Predictive accuracy
Speed and scalability: time to construct the model, time to use the model
Robustness: handling noise and missing values
Scalability
Interpretability: understanding and insight provided by the model
Goodness of rules: compactness of classification rules

Classification Algorithms: K Nearest Neighbour

Performed on the raw data; kNN methods are "lazy learners" (no model is built in advance).
Count the number of other examples that are close; the winner is the most common class.
Compute the distance (similarity) between the training records and the new object.
Identify the k nearest objects by ordering the training objects by distance.
Assign the label which is most frequent among the k training records nearest to that object.

All instances correspond to points in the n-dimensional space. The nearest neighbours are defined in terms of Euclidean distance. The target function can be discrete- or real-valued. For a discrete-valued target, k-NN returns the most common value among the k training examples nearest to xq.

Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples.

(Figure: 1-NN decision regions around a query point xq, with + and - training examples.)

Choosing k: perform P times K-fold cross-validation (with different k, e.g. k = 1:10) and calculate the classification error of each CV: error = Σ ni / m, where ni is the number of misclassified objects in fold i and m is the number of objects.
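A minimal NumPy sketch of kNN with K-fold cross-validation, as described above. The synthetic two-cluster data, the fold count, and the function names are illustrative assumptions, not from the slides.

```python
import numpy as np

def knn_predict(X_train, y_train, x, k):
    """Label x by majority vote among its k nearest training points (Euclidean)."""
    d = np.linalg.norm(X_train - x, axis=1)
    nearest = y_train[np.argsort(d)[:k]]
    return np.bincount(nearest).argmax()

def cv_error(X, y, k_neighbours, n_folds=5, seed=0):
    """K-fold CV error: misclassified objects (ni) summed over folds / objects (m)."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    wrong = 0
    for f in folds:
        mask = np.ones(len(X), bool)
        mask[f] = False  # hold this fold out of the training set
        for i in f:
            if knn_predict(X[mask], y[mask], X[i], k_neighbours) != y[i]:
                wrong += 1
    return wrong / len(X)

# Two well-separated clusters: CV error should be near zero for small k
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(3, 0.5, (20, 2))])
y = np.array([0] * 20 + [1] * 20)
errors = {k: cv_error(X, y, k) for k in (1, 3, 5)}
```

In practice one would pick the k with the smallest cross-validated error.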

Logistic Regression

In logistic regression the outcome variable is a binary variable. The purpose is to assess the effects of multiple explanatory variables, which can be numeric and/or categorical, on the outcome variable. It is used because having a categorical outcome variable violates the assumption of linearity in normal regression.

The probability of the outcome is measured by the odds of occurrence of an event. If P is the probability of an event, then (1 - P) is the probability of it not occurring.

Odds of success = P / (1 - P)

The joint effect of all explanatory variables put together on the odds is

Odds = P / (1 - P) = e^(α + β1X1 + β2X2 + … + βpXp)

Taking the logarithm of both sides:

log{P / (1 - P)} = α + β1X1 + β2X2 + … + βpXp

Logit(P) = α + β1X1 + β2X2 + … + βpXp
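The logit and its inverse can be checked numerically. This is a small sketch; the function names are ours.

```python
import math

def logit(p):
    """Logit(P) = log(P / (1 - P)), the log-odds."""
    return math.log(p / (1 - p))

def inv_logit(z):
    """Inverse transform: P = e^z / (1 + e^z)."""
    return math.exp(z) / (1 + math.exp(z))

# When the linear predictor is 0 the odds are e^0 = 1, i.e. P = 0.5
half = inv_logit(0.0)

# logit and inv_logit are inverses of each other
roundtrip = inv_logit(logit(0.8))
```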

Logistic Regression

Logistic regression analysis requires that the dependent variable be dichotomous; the explanatory variables may be metric or dichotomous.

Logistic Regression with One Predictor

Response: presence/absence of a characteristic.
Predictor: numeric variable observed for each case.
Model: p(x) = probability of the event (P) at level x:

p(x) = e^(a + bx) / (1 + e^(a + bx))

b = 0: P(Presence) is the same at each level of x.
b > 0: P(Presence) increases as x increases.
b < 0: P(Presence) decreases as x increases.

Odds Ratio

Interpretation of the regression coefficient (b):
In linear regression, the slope coefficient is the change in the mean response as x increases by 1 unit.
In logistic regression, we can show that

odds(x + 1) / odds(x) = e^b, where odds(x) = p(x) / (1 - p(x))

That is, e^b is the factor by which the odds are multiplied by increasing x by 1 unit.
If b = 0, the odds and probability are the same at all x levels (e^b = 1).
If b > 0, the odds and probability increase as x increases (e^b > 1).
If b < 0, the odds and probability decrease as x increases (e^b < 1).

Extension to more than one predictor variable (either numeric or dummy variables). With k predictors, the model is written

p = e^(a + b1x1 + … + bkxk) / (1 + e^(a + b1x1 + … + bkxk))

Adjusted odds ratio for raising xi by 1 unit, holding all other predictors constant:

ORi = e^(bi)

Many models have nominal/ordinal predictors, and make wide use of dummy variables.

Log-likelihood

LL = Σ(i=1..N) [ Yi ln(P(Yi)) + (1 - Yi) ln(1 - P(Yi)) ]

Analogous to the residual sum of squares in multiple regression, it is an indicator of how much unexplained information there is after the model has been fitted. Large values indicate poorly fitting statistical models.
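The log-likelihood formula above can be sketched directly. The example labels and fitted probabilities are made up for illustration.

```python
import math

def log_likelihood(y, p):
    """LL = sum_i [ y_i ln(p_i) + (1 - y_i) ln(1 - p_i) ]."""
    return sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
               for yi, pi in zip(y, p))

y = [1, 0, 1, 1, 0]
good = [0.9, 0.1, 0.8, 0.9, 0.2]    # predictions close to the outcomes
poor = [0.5, 0.5, 0.5, 0.5, 0.5]    # uninformative predictions

ll_good = log_likelihood(y, good)
ll_poor = log_likelihood(y, poor)   # more negative: more unexplained information
```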

Hosmer-Lemeshow Statistic

A measure of lack of fit. Null hypothesis: there is no difference between observed and model-predicted values. If the p-value of the H-L goodness-of-fit test is greater than 0.05, the model's estimates fit the data at an acceptable level, indicating model predictions that are not significantly different from the observed values.

It is possible to calculate a log-likelihood for different models and to compare the models by looking at the difference between their log-likelihoods:

χ² = 2[LL(New) - LL(Baseline)],  df = k_new - k_baseline

If the significance of the chi-square statistic is less than .05, then the new model is a significantly better fit of the data.

Wald Statistic

Tests the null hypothesis that b = 0:

Wald = b / SE(b)

Exp(b)

The ratio of the odds after a unit change in the predictor to the odds before the unit change in the predictor.
OR > 1: as the predictor increases, the probability of the outcome occurring increases.
OR < 1: as the predictor increases, the probability of the outcome occurring decreases.

Example

A trial (based on 2,000 patients) with the probability of dying at 30 days as response (π) and age, sex (F=1, M=0), and treatment (C=0, Tr=1) as regressors.

Estimated multiple logistic regression model:

logit(π) = -7.65 + 0.073 age + 0.69 sex + 0.17 treatment
                   (P<0.0001)  (P=0.007)  (P=0.45)

Interpretation:
The treatments have no significantly different effect (taking into account age and sex).
The older the patient, the higher the probability of dying before (or at) 30 days (taking into account sex and treatment).
Women have a significantly higher 30-day mortality rate (taking into account age and treatment).

Example (continued)

Further interpretation:
e^0.17 = 1.19: odds ratio for rt-PA treatment (but not significant)
e^0.69 = 1.99: odds ratio for sex = female
e^0.073 = 1.08: odds ratio for an increase of age by 1 year
e^0.73 = 2.08: odds ratio for an increase of age by 10 years
All odds ratios are controlled for the other factors.
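The odds ratios above follow directly from the fitted coefficients; here is a quick numeric check of the slide's numbers.

```python
import math

# Coefficients from the estimated model:
# logit = -7.65 + 0.073*age + 0.69*sex + 0.17*treatment
coef = {"age": 0.073, "sex": 0.69, "treatment": 0.17}

def odds_ratio(b, units=1):
    """Odds ratio for a `units`-unit increase in a predictor, other factors held constant."""
    return math.exp(b * units)

or_treatment = odds_ratio(coef["treatment"])  # e^0.17  ~ 1.19
or_sex = odds_ratio(coef["sex"])              # e^0.69  ~ 1.99
or_age_1 = odds_ratio(coef["age"])            # e^0.073 ~ 1.08
or_age_10 = odds_ratio(coef["age"], 10)       # e^0.73  ~ 2.08
```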

Outcome

The logistic regression model gives the probability that y = 1 (30-day mortality) will happen. The same applies when there is more than one regressor.
When p > 0.50 we predict the subject as y = 1 (died at 30 days); when p < 0.50 the subject is predicted as y = 0 (survived at 30 days).
A classification table results, from which we can determine the sensitivity and specificity.
By varying the threshold (from 0 through 0.50 to 1), different sensitivities and specificities are obtained, and thereby an ROC (Receiver Operating Characteristic) curve appears.
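Turning predicted probabilities into a classification table can be sketched as follows. The outcome and probability vectors are illustrative, not trial data.

```python
def sens_spec(y_true, p_hat, threshold=0.5):
    """Sensitivity and specificity from the classification table at a given threshold."""
    tp = sum(1 for y, p in zip(y_true, p_hat) if y == 1 and p > threshold)
    fn = sum(1 for y, p in zip(y_true, p_hat) if y == 1 and p <= threshold)
    tn = sum(1 for y, p in zip(y_true, p_hat) if y == 0 and p <= threshold)
    fp = sum(1 for y, p in zip(y_true, p_hat) if y == 0 and p > threshold)
    return tp / (tp + fn), tn / (tn + fp)

y = [1, 1, 1, 0, 0, 0]               # observed outcomes
p = [0.9, 0.7, 0.4, 0.6, 0.3, 0.1]   # model probabilities
sens, spec = sens_spec(y, p)         # at the 0.50 cutpoint
```

Sweeping `threshold` over (0, 1) generates the (sensitivity, specificity) pairs that trace the ROC curve.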

A false positive is a positive test result when the disease or condition being tested for is not actually present.
A false negative is a negative test result when the disease or condition being tested for is actually present.
Clinical sensitivity is how often the test detects the disease status or condition when it is truly present, i.e., it is the probability of a positive test result given that the patient has the disease or condition of interest.
Clinical specificity is how often the test is negative in patients who do not have the condition or the disease, i.e., it is the probability of a negative test result given that the patient does not have the disease or condition of interest.
In practice, sensitivity and specificity estimates are almost always statistically biased (systematically too high or too low).

FDA's four general recommendations regarding choosing a comparative procedure to evaluate a new diagnostic test and reporting the results:

If a perfect standard is available, use it. Calculate estimated sensitivity and specificity.
If a perfect standard is available but impractical, use it to the extent possible. Calculate adjusted estimates of sensitivity and specificity.
If a perfect standard is not available, consider constructing one. Calculate estimated sensitivity and specificity under the constructed standard.
If a perfect standard is not available and cannot be constructed, then an appropriate approach may be reporting a measure of agreement.

(Statistical Guidance on Reporting Results from Studies Evaluating Diagnostic Tests, FDA)

Specificity: Another Example

The following table shows the results from a diagnostic test, such as an X-ray or computed tomographic (CT) scan, where the true disease or condition of the patient is known (Altman and Bland, 1994a).

SPES = 54/86 = 0.63

The predictive value positive (PV+) of a screening test (or symptom) is the probability that a person has the disease given that the screening test is positive (or the symptom is present): Pr(disease | test+).

The predictive value negative (PV-) of a screening test (or symptom) is the probability that a person does not have the disease given that the screening test is negative (or the symptom is absent): Pr(no disease | test-).

For the diagnostic test example above (Altman and Bland, 1994a):

NPV (PV-) = 54/81 = 0.67
SPES = 54/86 = 0.63
PPV = 77/173 = 0.45
NPV = 162/171 = 0.95

The PPV and the NPV are dependent on the prevalence of the disease in the patient population being studied (Altman and Bland, 1994b).

Prevalence

The probability of currently having the disease, regardless of the duration of time one has had the disease. Obtained by dividing the number of people who currently have the disease by the number of people in the study population.

Cumulative incidence

The probability that a person with no prior disease will develop a new case of the disease over some specified time period.

Another Example

Suppose 84% of hypertensives and 23% of normotensives are classified as hypertensive by an automated blood pressure machine. What are the PV+ and PV- of the machine, assuming 20% of the adult population is hypertensive?

Sensitivity = .84; Specificity = 1 - .23 = .77. Thus,

PV+ = (.84)(.2) / [(.84)(.2) + (.23)(.8)] = .48
PV- = (.77)(.8) / [(.77)(.8) + (.16)(.2)] = .95

A negative result is highly predictive, because we are 95% sure a person with a negative result from the machine is normotensive. However, a positive result is not very predictive, because we are only 48% sure a person with a positive result from the machine is hypertensive.
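The PV+ and PV- computations above are Bayes' theorem; they can be reproduced as a short sketch.

```python
def pv_pos(sens, spec, prev):
    """PV+ = Pr(disease | test+), by Bayes' theorem."""
    return sens * prev / (sens * prev + (1 - spec) * (1 - prev))

def pv_neg(sens, spec, prev):
    """PV- = Pr(no disease | test-), by Bayes' theorem."""
    return spec * (1 - prev) / (spec * (1 - prev) + (1 - sens) * prev)

# Hypertension example: sens = .84, spec = .77, prevalence = .20
ppv = pv_pos(0.84, 0.77, 0.20)   # ~ .48
npv = pv_neg(0.84, 0.77, 0.20)   # ~ .95
```

Re-running with a different `prev` shows how strongly both predictive values depend on prevalence.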

Both SENS and SPES can be applied to other populations that have different prevalence rates, but it is not appropriate to apply universally the PPV and the NPV obtained from one study without information on prevalence.
The rarer the prevalence of the disease, the more sure one can be that a negative test result indeed means that there is no disease, and the less sure that a positive test result indicates the presence of the disease.
The lower the prevalence, the greater the number of people who will be diagnosed as false positives, even if the SENS and the SPES are high.

Likelihood Ratio (LR)

Another simple measure of diagnostic accuracy, given by the ratio of the probability of the test result among patients who truly have the disease/condition to the probability of the same test result among patients who do not have the disease/condition.

The (positive) LR is the ratio SENS / (1 - SPES). For the previous example the LR is 2.4.

The magnitude of the LR informs about the certainty of a positive diagnosis. A general guideline:
LR = 1 indicates that the test result is equally likely in patients with and without the disease/condition;
LR > 1 indicates that the test result is more likely in patients with the disease/condition;
LR < 1 indicates that the test result is more likely in patients without the disease/condition (Zhou et al., 2002).

Both SENS and SPES require a cutpoint in order to classify the test results as positive or negative; the SENS and SPES of a diagnostic test are therefore tied to the diagnostic threshold or cutpoint selected for the test.
Many times the results from a diagnostic test may be on an ordinal or numerical scale rather than just a binary outcome of positive or negative. In these situations, the SENS and SPES are based on just one cutpoint when in reality multiple cutpoints or thresholds are possible.
An ROC curve overcomes this limitation by including all the possible decision thresholds for the results of a diagnostic test.

An ROC curve is a graphical plot that illustrates the performance of a binary classifier system as its discrimination threshold is varied. The curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings, where the different points on the curve correspond to different cutpoints used to determine if the test results are positive.

Below are the ratings of CT images from 109 subjects by a radiologist, given by Table 3 (Hanley and McNeil, 1982), where higher ratings indicate greater confidence that the image is abnormal based on the CT scan.
The designation of a cutpoint to classify the test results as positive or negative is relatively arbitrary. Suppose that ratings of 4 or above indicate, for instance, that the test is positive; then the SENS and SPES would be 0.86 and 0.78. If ratings of 3 or above are considered positive, then the SENS and SPES are 0.90 and 0.67.
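The way multiple cutpoints generate ROC points can be sketched with made-up ordinal ratings (these are NOT the Hanley and McNeil data, which are not reproduced here).

```python
def roc_points(ratings_pos, ratings_neg, cutpoints):
    """(FPR, TPR) at each cutpoint; 'positive' means rating >= cutpoint."""
    pts = []
    for c in cutpoints:
        tpr = sum(r >= c for r in ratings_pos) / len(ratings_pos)  # sensitivity
        fpr = sum(r >= c for r in ratings_neg) / len(ratings_neg)  # 1 - specificity
        pts.append((fpr, tpr))
    return pts

# Hypothetical 1-5 confidence ratings for illustration only
abnormal = [5, 5, 4, 4, 3, 2]   # truly diseased subjects
normal = [1, 1, 2, 2, 3, 4]     # truly non-diseased subjects
points = roc_points(abnormal, normal, cutpoints=[5, 4, 3, 2, 1])
```

Lowering the cutpoint moves along the curve from (0, 0) toward (1, 1), trading specificity for sensitivity.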

ROC Curve

The area under the ROC curve is an effective way to summarize the overall diagnostic accuracy of the test. It takes values from 0 to 1, where a value of 0 indicates a perfectly inaccurate test and a value of 1 reflects a perfectly accurate test. The closer the ROC curve of a diagnostic test is to the (0, 1) coordinate, the better the test.

ROC

For the example, the area under the ROC curve is 0.89. This means that the radiologist reading the CT scan has an 89% chance of correctly distinguishing a normal from an abnormal patient based on the ordering of the CT ratings.
The estimated accuracy is dependent on the patient characteristics and the disease spectrum.
An ROC curve does not depend on the scale of the test results, and can be used to provide a visual comparison of two or more test results on a common scale. ROC curves are useful for comparing the diagnostic ability of two or more screening tests for the same disease.

Calculate the AUC for each model: M1: 0.82, M2: 0.79, and M3: 0.77.
The test of equality of the three ROC curve areas: a chi-square statistic of 15.48 suggests that at least 2 models differ significantly (p=0.0004). The 3 pair-wise comparisons (p = 0.0103, <.0001, 0.0183) indicate that the 3 models are all different from each other.
Model 1, with Gleason score, PSA, and digital rectal exam results, is considered to have the best ability to discriminate between the subjects.
However, the final decision should also be based on the clinical meaningfulness of the differences identified by statistical analysis. In case the models are clinically similar, and if the Gleason score is both easier to measure and cost-effective relative to the other two factors, then one can go with a parsimonious model using just the Gleason score as a predictor.

Logistic regression is a supervised method for binary or multi-class classification.
In high-dimensional data (e.g., microarray data) there are more variables than observations, and classical logistic regression does not work.
Other problems: variables are correlated (multicollinearity), and over-fitting.
Solution: introduce a penalty for complexity in the model.

Penalized Logistic Regression

Logistic model: logit(P) = α + β1X1 + … + βpXp

L1-penalization (Lasso): penalize the log-likelihood with a term -λ Σ |βj|, which shrinks the coefficients and sets some of them to zero.

It performs parameter estimation and variable selection at the same time. The choice of λ is crucial and is chosen via a k-fold cross-validation procedure. The procedure is implemented in an R package called penalized.
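The slides use the R package `penalized`; as a language-neutral illustration, here is a minimal NumPy sketch of L1-penalized logistic regression via proximal gradient descent (soft-thresholding), on made-up data where only the first variable matters. This is not the R implementation.

```python
import numpy as np

def lasso_logistic(X, y, lam, lr=0.1, iters=2000):
    """L1-penalized logistic regression by proximal gradient descent.
    Minimizes -loglik/n + lam * ||b||_1 (intercept left unpenalized)."""
    n, p = X.shape
    b0, b = 0.0, np.zeros(p)
    for _ in range(iters):
        prob = 1 / (1 + np.exp(-(b0 + X @ b)))
        b0 -= lr * (prob - y).mean()           # gradient step, intercept
        b = b - lr * (X.T @ (prob - y) / n)    # gradient step, coefficients
        b = np.sign(b) * np.maximum(np.abs(b) - lr * lam, 0.0)  # soft-threshold
    return b0, b

# Synthetic data: the outcome depends on column 0 only; column 1 is pure noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (1 / (1 + np.exp(-2 * X[:, 0])) > rng.uniform(size=200)).astype(float)

b0, b = lasso_logistic(X, y, lam=0.1)
# The penalty shrinks the noise coefficient b[1] toward (typically exactly) zero
```

The penalty weight `lam` plays the role of λ and would in practice be chosen by the k-fold cross-validation described above.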

Application: Uncomplicated Malaria group, AUC: 0.86.

Decision Tree

Neural Network

Support Vector Machine

Ensemble Methods (optional)
