
Data Mining: Overview

What is Data Mining?

Recently* coined term for the confluence of ideas from statistics and computer science (machine learning and database methods) applied to large databases in science, engineering and business.
In a state of flux: many definitions and much debate about what it is and what it is not. Terminology is not standard, e.g. bias, classification, prediction; feature = independent variable, target = dependent variable, case = exemplar = row.
* The first International Workshop on Knowledge Discovery and Data Mining was in 1995.

Broad and Narrow Definitions

The broad definition includes traditional statistical methods; the narrow definition emphasizes automated and heuristic methods.
Related terms: data mining, data dredging, fishing expeditions, Knowledge Discovery in Databases (KDD).

My Favorite
"Statistics at scale and speed" (Darryl Pregibon)
My extension: ". . . and simplicity"

Gartner Group
Data mining is the process of discovering
meaningful new correlations, patterns and
trends by sifting through large amounts of
data stored in repositories, using pattern
recognition technologies as well as
statistical and mathematical techniques.

Drivers
Market: from focus on product/service to focus on customer
IT: from focus on up-to-date balances to focus on patterns in transactions - Data Warehouses, OLAP
Dramatic drop in storage costs: huge databases, e.g. Walmart: 20 million transactions/day, 10 terabyte database; Blockbuster: 36 million households
Automatic data capture of transactions: e.g. bar codes, POS devices, mouse clicks, location data (GPS, cell phones)
Internet: personalized interactions, longitudinal data

Core Disciplines
Statistics (adapted for 21st century data sizes and speed requirements). Examples:
Descriptive: Visualization
Models (DMD): Regression, Cluster Analysis
Machine Learning: e.g. Neural Nets
Database Retrieval: e.g. Association Rules
Parallel developments: e.g. Tree methods, k-Nearest Neighbors, OLAP-EDA

Process
1. Develop understanding of application, goals
2. Create dataset for study (often from a Data Warehouse)
3. Data Cleaning and Preprocessing
4. Data Reduction and Projection
5. Choose Data Mining task
6. Choose Data Mining algorithms
7. Use algorithms to perform task
8. Interpret and iterate through 1-7 if necessary
9. Deploy: integrate into operational systems.

SEMMA Methodology (SAS)

Sample from data sets; partition into Training, Validation and Test datasets (a minimal partitioning sketch follows below)
Explore the data set statistically and graphically
Modify: transform variables, impute missing values
Model: fit models, e.g. regression, classification tree, neural net
Assess: compare models using the Validation and Test datasets
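As an illustration of the Sample step, here is a minimal Python sketch of a random partition into training, validation and test sets. The 60/30/10 proportions are arbitrary choices for the example, not part of SEMMA.

    import random

    def partition(cases, train_frac=0.6, valid_frac=0.3, seed=1):
        """Randomly split a list of cases into training, validation and test sets."""
        rng = random.Random(seed)
        shuffled = cases[:]
        rng.shuffle(shuffled)
        n_train = int(train_frac * len(shuffled))
        n_valid = int(valid_frac * len(shuffled))
        return (shuffled[:n_train],
                shuffled[n_train:n_train + n_valid],
                shuffled[n_train + n_valid:])

    # Example: partition 506 row indices (the size of the Boston Housing data)
    train, valid, test = partition(list(range(506)))
    print(len(train), len(valid), len(test))   # 303 151 52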

Illustrative Applications
Customer Relationship Management
Finance
E-commerce and Internet

Customer Relationship Management

Target Marketing
Attrition Prediction/Churn Analysis
Fraud Detection
Credit Scoring

Target marketing
Business problem: Use list of prospects for
direct mailing campaign
Solution: Use Data Mining to identify most
promising respondents combining
demographic and geographic data with data
on past purchase behavior
Benefit: Better response rate, savings in
campaign cost

Example: Fleet Financial Group

Redesign of customer service infrastructure, including a $38 million investment in a data warehouse and marketing automation
Used logistic regression to predict response probabilities to a home-equity product for a sample of 20,000 customer profiles from a 15 million customer base
Used CART to predict profitable customers and customers who would be unprofitable even if they respond

Churn Analysis: Telcos


Business Problem: Prevent loss of customers,
avoid adding churn-prone customers
Solution: Use neural nets, time series analysis to
identify typical patterns of telephone usage of
likely-to-defect and likely-to-churn customers
Benefit: Retention of customers, more effective
promotions

Example: France Telecom


CHURN/Customer Profiling System implemented as part of a major custom data warehouse solution
Preventive CPS: based on customer characteristics and known cases of churning and non-churning customers, it identifies significant characteristics for churn
Early-detection CPS: based on matching usage patterns against known cases of churned customers.

Fraud Detection
Business problem: Fraud increases costs or
reduces revenue
Solution: Use logistic regression, neural
nets to identify characteristics of fraudulent
cases to prevent in future or prosecute more
vigorously
Benefit: Increased profits by reducing
undesirable customers

Example: Automobile Insurance Bureau of Massachusetts

Past reports by claims adjustors scrutinized by experts to identify cases of fraud
Several characteristics (over 60) of the claimant, type of accident, and type of injury/treatment coded into a database
Dimension reduction methods used to obtain weighted variables; multiple regression and step-wise subset selection methods used to identify characteristics strongly correlated with fraud

Risk Analysis
Business problem: Reduce risk of loans to
delinquent customers
Solution: Use credit scoring models using
discriminant analysis to create score
functions that separate out risky customers
Benefit: Decrease in cost of bad debts

Finance
Business problem: Pricing of corporate
bonds depends on several factors, risk
profile of company , seniority of debt,
dividends, prior history, etc.
Solution Approach: Through DM, develop
more accurate models of predicting prices.


E-commerce and Internet


Collaborative Filtering
From Clicks to Customers

Recommendation systems
Business opportunity: Users rate items
(Amazon.com, CDNOW.com, MovieFinder.com)
on the web. How to use information from other
users to infer ratings for a particular user?
Solution: Use of a technique known as
collaborative filtering
Benefit: Increase revenues by cross selling, up
selling


Clicks to Customers
Business problem: 50% of Dell's clients order their computer through the web. However, the retention rate is only 0.5%, i.e. only 0.5% of visitors to Dell's web page become customers.
Solution approach: Through the sequence of their clicks, cluster customers and design the website and interventions to maximize the number of visitors who eventually buy.
Benefit: Increased revenues

Emerging Major Data Mining Applications

Spam
Bioinformatics/Genomics
Medical History Data - Insurance Claims
Personalization of services in e-commerce
RF Tags: Gillette
Security:
Container Shipments
Network Intrusion Detection


Core Concepts
Types of Data:
Numeric
Continuous: ratio and interval
Discrete: need for binning
Categorical: ordered and unordered
Binary

Overfitting and Generalization


Regularization: Penalty for model complexity
Distance
Curse of Dimensionality
Random and stratified sampling, resampling
Loss Functions


Typical characteristics of mining data
Standard format is a spreadsheet: row = observation unit, column = variable
Many rows, many columns
Many rows, moderate number of columns (e.g. telephone calls)
Many columns, moderate number of rows (e.g. genomics)
Opportunistic (often a by-product of transactions), not from designed experiments
Often has outliers, missing data


Course Topics
Supervised Techniques
Classification:
k-Nearest Neighbors, Naïve Bayes, Classification Trees
Discriminant Analysis, Logistic Regression, Neural Nets

Prediction (Estimation):
Regression, Regression Trees, k-Nearest Neighbors

Unsupervised Techniques
Cluster Analysis, Principal Components
Association Rules, Collaborative Filtering


15.062 Data Mining Spring 2003

Comparison of Data Mining Techniques for Large Data Sets - Guidelines (and only guidelines)
H: high, M: medium, L: low.
[Table: rates Neural Nets, Trees, k-Nearest Neighbors, Logistic Regression, Discriminant Analysis, Naïve Bayes and Multiple Linear Regression as H, M or L (with intermediate ratings HM and ML) on: accuracy; interpretability; speed of training; speed of deployment; effort in the choice and transformation of independent variables; effort to tune performance parameters; robustness to outliers in independent variables; robustness to irrelevant variables; ease of handling missing values; and natural handling of both categorical and continuous variables.]

Lecture 2

Judging the Performance of Classifiers

In this note we will examine the question of how to judge the usefulness of a classifier and how to
compare different classifiers. Not only do we have a wide choice of different types of classifiers
to choose from but within each type of classifier we have many options such as how many nearest
neighbors to use in a k-nearest neighbors classifier, the minimum number of cases we should
require in a leaf node in a tree classifier, which subsets of predictors to use in a logistic regression
model, and how many hidden layer neurons to use in a neural net.
A Two-class Classifier
Let us first look at a single classifier for two classes with options set at certain values. The two-class situation is certainly the most common and occurs very frequently in practice. We will
extend our analysis to more than two classes later.
A natural criterion for judging the performance of a classifier is the probability that it makes a
misclassification. A classifier that makes no errors would be perfect but we do not expect to be
able to construct such classifiers in the real world due to noise and to not having all the
information needed to precisely classify cases. Is there a minimum probability of
misclassification we should require of a classifier?
Suppose that the two classes are denoted by C0 and C1. Let p(C0) and p(C1) be the apriori
probabilities that a case belongs to C0 and C1 respectively. The apriori probability is the
probability that a case belongs to a class without any more knowledge about it than that it belongs
to a population where the proportion of C0s is p(C0) and the proportion of C1s is p(C1) . In this
situation we will minimize the chance of a misclassification error by assigning class C1 to the
case if p(C1 ) > p(C0 ) and to C0 otherwise. The probability of making a misclassification would
be the minimum of p(C0) and p(C1). If we are using misclassification rate as our criterion any
classifier that uses predictor variables must have an error rate better than this.
What is the best performance we can expect from a classifier? Clearly the more training data
available to a classifier the more accurate it will be. Suppose we had a huge amount of training
data, would we then be able to build a classifier that makes no errors? The answer is no. The
accuracy of a classifier depends critically on how separated the classes are with respect to the
predictor variables that the classifier uses. We can use the well-known Bayes formula from
probability theory to derive the best performance we can expect from a classifier for a given set
of predictor variables if we had a very large amount of training data. Bayes' formula uses the
distributions of the decision variables in the two classes to give us a classifier that will have the
minimum error amongst all classifiers that use the same predictor variables. This classifier
follows the Minimum Error Bayes Rule.
Bayes Rule for Minimum Error
Let us take a simple situation where we have just one continuous predictor variable for

classification, say X. X is a random variable, since its value depends on the individual case we

sample from the population consisting of all possible cases of the class to which the case belongs.

Suppose that we have a very large training data set. Then the relative frequency histogram of the

variable X in each class would be almost identical to the probability density function (p.d.f.) of X

for that class. Let us assume that we have a huge amount of training data and so we know the

p.d.f.s accurately. These p.d.f.s are denoted f0(x) and f1(x) for classes C0 and C1 in Fig. 1 below.

Figure 1: The class-conditional probability density functions f0(x) and f1(x); the two curves cross at x = a.

Now suppose we wish to classify an object for which the value of X is x0. Let us use Bayes

formula to predict the probability that the object belongs to class 1 conditional on the fact that it

has an X value of x0. Applying Bayes formula, the probability, denoted by p(C1|X= x0), is given

by:

p(C1 | X = x0) = p(X = x0 | C1) p(C1) / [ p(X = x0 | C0) p(C0) + p(X = x0 | C1) p(C1) ]

Writing this in terms of the density functions, we get

p(C1 | X = x0) = f1(x0) p(C1) / [ f0(x0) p(C0) + f1(x0) p(C1) ]

Notice that to calculate p(C1|X= x0) we need to know the apriori probabilities p(C0) and p(C1).

Since there are only two possible classes, if we know p(C1) we can always compute p(C0) because

p(C0) = 1 - p(C1). The apriori probability p(C1) is the probability that an object belongs to C1

without any knowledge of the value of X associated with it. Bayes formula enables us to update

this apriori probability to the aposteriori probability, the probability of the object belonging to C1

after knowing that its X value is x0.

When p(C1) = p(C0) = 0.5, the formula shows that p(C1 | X = x0) > p(C0 | X = x0) if f1(x0) > f0(x0). This means that if x0 is greater than a, and we classify the object as belonging to C1, we will make a smaller misclassification error than if we were to classify it as belonging to C0. Similarly, if x0 is less than a, and we classify the object as belonging to C0, we will make a smaller misclassification error than if we were to classify it as belonging to C1. If x0 is exactly equal to a we have a 50% chance of making an error for either classification.

Figure 2: The densities f1(x) and 2 f0(x); with unequal priors the classification boundary moves from a to b.

What if the prior class probabilities were not the same? Suppose C0 is twice as likely apriori as C1. Then the formula says that p(C1 | X = x0) > p(C0 | X = x0) if f1(x0) > 2 f0(x0). The new boundary value, b, for classification will be to the right of a, as shown in Fig. 2. This is intuitively what we would expect. If a class is more likely we would expect the cut-off to move in a direction that would increase the range over which it is preferred.

In general we will minimize the misclassification error rate if we classify a case as belonging to C1 if p(C1) f1(x0) > p(C0) f0(x0), and to C0 otherwise. This rule holds even when X is a vector consisting of several components, each of which is a random variable. In the remainder of this note we shall assume that X is a vector.
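To make the rule concrete, here is a minimal Python sketch of the Minimum Error Bayes Rule for a single predictor. The normal densities, their parameters and the priors are illustrative assumptions; in practice f0 and f1 are unknown.

    from math import exp, pi, sqrt

    def normal_pdf(x, mu, sigma):
        """A stand-in for a known class-conditional density f(x)."""
        return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2 * pi))

    def classify_min_error(x, p0, p1, f0, f1):
        """Assign C1 if p(C1)*f1(x) > p(C0)*f0(x); also return p(C1 | X = x)."""
        score0, score1 = p0 * f0(x), p1 * f1(x)
        posterior1 = score1 / (score0 + score1)
        return (1 if score1 > score0 else 0), posterior1

    # Illustrative densities: C0 centered at 2, C1 centered at 5
    f0 = lambda x: normal_pdf(x, 2.0, 1.0)
    f1 = lambda x: normal_pdf(x, 5.0, 1.0)
    print(classify_min_error(3.0, 0.5, 0.5, f0, f1))   # (0, about 0.18)
    print(classify_min_error(3.0, 1/3, 2/3, f0, f1))   # (0, about 0.31): a larger prior raises the posterior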

An important advantage of Bayes Rule is that, as a by-product of classifying a case, we can

compute the conditional probability that the case belongs to each class. This has two advantages.

First, we can use this probability as a score for each case that we are classifying. The score

enables us to rank cases that we have predicted as belonging to a class in order of confidence that

we have made a correct classification. This capability is important in developing a lift curve

(explained later) that is important for many practical data mining applications.

Second, it enables us to compute the expected profit or loss for a given case. This gives us a

better decision criterion than misclassification error when the loss due to error is different for the

two classes.

Practical assessment of a classifier using misclassification error as the criterion

In practice, we can estimate p(C1 ) and p(C0 ) from the data we are using to build the classifier

by simply computing the proportion of cases that belong to each class. Of course, these are

estimates and they can be incorrect, but if we have a large enough data set and neither class is

very rare our estimates will be reliable. Sometimes, we may be able to use public data such as

census data to estimate these proportions. However, in most practical business settings we will

not know f1(x) and f0(x). If we want to apply Bayes Rule we will need to estimate these density functions in some way. Many classification methods can be interpreted as being methods for estimating such density functions.¹ In practice X will almost always be a vector. This makes the task difficult and subject to the curse of dimensionality we referred to when discussing the k-Nearest Neighbors technique.

To obtain an honest estimate of classification error, let us suppose that we have partitioned a data

set into training and validation data sets by random selection of cases. Let us assume that we have

constructed a classifier using the training data. When we apply it to the validation data, we will

classify each case into C0 or C1. The resulting misclassification errors can be displayed in what is

known as a confusion table, with rows and columns corresponding to the true and predicted

classes respectively. We can summarize our results in a confusion table for training data in a

similar fashion. The resulting confusion table will not give us an honest estimate of the

misclassification rate due to over-fitting. However such a table will be useful to signal over-

fitting when it has substantially lower misclassification rates than the confusion table for

validation data.

Confusion Table (Validation Cases)

                                 Predicted Class
True Class    C0                                          C1
C0            True Negatives (number of correctly         False Positives (number of cases incorrectly
              classified cases that belong to C0)         classified as C1 that belong to C0)
C1            False Negatives (number of cases            True Positives (number of correctly
              incorrectly classified as C0 that           classified cases that belong to C1)
              belong to C1)

If we denote the number in the cell at row i and column j by Nij, the estimated misclassification rate is Err = (N01 + N10) / Nval, where Nval = N00 + N01 + N10 + N11 is the total number of cases in the validation data set. If Nval is reasonably large, our estimate of the misclassification rate is probably quite accurate. We can compute a confidence interval for Err using the standard formula for estimating a population proportion from a random sample.

¹ There are classifiers that focus on simply finding the boundary between the regions that predict each class without being concerned with estimating the density of cases within each region. For example, Support Vector Machine classifiers have this characteristic.
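For example, a small Python sketch of this estimate; the four counts are placeholders rather than numbers from the notes.

    def misclassification_rate(n00, n01, n10, n11):
        """Err = (N01 + N10) / Nval for a confusion table whose rows are the true
        classes and whose columns are the predicted classes."""
        return (n01 + n10) / (n00 + n01 + n10 + n11)

    # Placeholder counts: 160 true negatives, 10 false positives,
    # 12 false negatives and 18 true positives.
    print(misclassification_rate(160, 10, 12, 18))   # 0.11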
The table below gives an idea of how the accuracy of the estimate varies with Nval . The column
headings are values of the misclassification rate and the rows give the desired accuracy in
estimating the misclassification rate as measured by the half-width of the confidence interval at
the 99% confidence level. For example, if we think that the true misclassification rate is likely to
be around 0.05 and we want to be 99% confident that Err is within 0.01 of the true
misclassification rate, we need to have a validation data set with 3,152 cases.

Misclassification rate:     0.01     0.05     0.10     0.15     0.20     0.30     0.40     0.50
Half-width 0.025             250      504      956    1,354    1,699    2,230    2,548    2,654
Half-width 0.010             657    3,152    5,972    8,461   10,617   13,935   15,926   16,589
Half-width 0.005           2,628   12,608   23,889   33,842   42,469   55,741   63,703   66,358

Note that we are assuming that the cost (or benefit) of making correct classifications is zero. At
first glance, this may seem incomplete. After all, the benefit (negative cost) of correctly
classifying a buyer as a buyer would seem substantial. And, in other circumstances (e.g.
calculating the expected profit from having a new mailing list ), it will be appropriate to consider
the actual net dollar impact of classifying each case on the list. Here, however, we are attempting
to assess the value of a classifier in terms of misclassifications, so it greatly simplifies matters if
we can capture all cost/benefit information in the misclassification cells. So, instead of recording
the benefit of correctly classifying a buyer, we record the cost of failing to classify him as a
buyer. It amounts to the same thing. In fact the costs we are using are the opportunity costs.

Asymmetric misclassification costs and Bayes Risk


Up to this point we have been using the misclassification error rate as the criterion for judging the
efficacy of a classifier. However, there are circumstances when this measure is not appropriate.
Sometimes the error of misclassifying a case belonging to one class is more serious than for the
other class. For example, misclassifying a household as unlikely to respond to a sales offer when
it belongs to the class that would respond incurs a greater opportunity cost than the converse
error. In such a scenario using misclassification error as a criterion can be misleading. Consider
the situation where the sales offer is accepted by 1% of the households on a list. If a classifier
simply classifies every household as a non-responder it will have an error rate of only 1% but will
be useless in practice. A classifier that misclassifies 30% of buying households as non-buyers and
2% of the non-buyers as buyers would have a higher error rate but would be better if the profit
from a sale is substantially higher than the cost of sending out an offer. In these situations, if we
have estimates of the cost of both types of misclassification, we can use the confusion table to
compute the expected cost of misclassification for each case in the validation data. This enables
us to compare different classifiers using opportunity cost as the criterion. This may suffice for
some situations, but a better method would be to change the classification rules (and hence the
misclassification rates) to reflect the asymmetric costs. In fact, there is a Bayes classifier for this
situation which gives rules that are optimal for minimizing the expected opportunity loss from
misclassification. This classifier is known as the Bayes Risk Classifier and the corresponding
minimum expected opportunity cost of misclassification is known as the Bayes Risk. The Bayes
Risk Classifier employs the following classification rule:
Classify a case as belonging to C1 if p(C1) f1(x0) C(0|1) > p(C0) f0(x0) C(1|0), and to C0 otherwise. Here C(0|1) is the opportunity cost of misclassifying a C1 case as belonging to C0 and C(1|0) is the opportunity cost of misclassifying a C0 case as belonging to C1. Note that the opportunity cost of correct classification for either class is zero. Notice also that this rule reduces to the Minimum Error Bayes Rule when C(0|1) = C(1|0).
Again, as we rarely know f1(x) and f0(x), we cannot construct this classifier in practice.
Nonetheless, it provides us with an ideal that the various classifiers we construct for minimizing
expected opportunity cost attempt to emulate. The method used most often in practice is to use
stratified sampling instead of random sampling so as to change the ratio of cases in the training
set to reflect the relative costs of making the two types of misclassification errors.
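A minimal Python sketch of this classification rule; the priors, density values and costs below are invented purely for illustration.

    def classify_bayes_risk(p0, f0_x, p1, f1_x, cost01, cost10):
        """Classify as C1 if p(C1)*f1(x0)*C(0|1) > p(C0)*f0(x0)*C(1|0), where
        C(0|1) is the cost of calling a C1 case C0 and C(1|0) the converse."""
        return 1 if p1 * f1_x * cost01 > p0 * f0_x * cost10 else 0

    # Responders (C1) are rare, but missing one is assumed 10 (or 50) times as costly.
    p1, p0 = 0.01, 0.99
    f1_x, f0_x = 0.30, 0.05                     # assumed density values at x0
    print(classify_bayes_risk(p0, f0_x, p1, f1_x, cost01=10.0, cost10=1.0))   # 0
    print(classify_bayes_risk(p0, f0_x, p1, f1_x, cost01=50.0, cost10=1.0))   # 1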

Stratified sampling to make the classifier sensitive to asymmetric costs


The basic idea in using stratified sampling is to oversample the cases from a class to increase the
weight given to errors made in classifying cases in that class. If we feel that the opportunity cost
of misclassifying a class C1 case as a C0 case is ten times that of misclassifying a class C0 case as
a C1 case, we randomly sample ten times as many C1 cases as we randomly sample C0 cases. By
virtue of this oversampling, the training data will automatically tune the classifier to be more
accurate in classifying C1 cases than C0 cases. Most of the time the class that has a higher
misclassification cost will be the less frequently occurring class (for example fraudulent cases). In
this situation rather than reduce the number of cases that are not fraudulent to a small fraction of
the fraud cases and thus have a drastic reduction in the training data size, a good rule of thumb often
used in practice is to sample an equal number of cases from each class. This is a very commonly
used option as it tends to produce rules that are quite efficient relative to the best over a wide
range of misclassification cost ratios.
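A minimal Python sketch of equal-count stratified sampling; the class sizes are placeholders.

    import random

    def stratified_equal_sample(cases, labels, n_per_class, seed=1):
        """Draw the same number of cases from each class, oversampling the rare
        class relative to its frequency in the full data."""
        rng = random.Random(seed)
        sample = []
        for cls in sorted(set(labels)):
            members = [c for c, lab in zip(cases, labels) if lab == cls]
            sample.extend(rng.sample(members, min(n_per_class, len(members))))
        return sample

    # Placeholder data: 1,000 legitimate (0) cases and 50 fraudulent (1) cases.
    cases = list(range(1050))
    labels = [0] * 1000 + [1] * 50
    print(len(stratified_equal_sample(cases, labels, n_per_class=50)))   # 100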
Generalization to more than two classes
All the comments made above about two-class classifiers extend readily to classification into
more than two classes. Let us suppose we have k classes C0, C1, C2, ..., Ck-1. Then Bayes formula gives us:

p(Cj | X = x0) = fj(x0) p(Cj) / [ Σ i=0,...,k-1 fi(x0) p(Ci) ]

The Bayes Rule for Minimum Error is to classify a case as belonging to Cj if

p(Cj) fj(x0) = Max i=0,1,...,k-1 { p(Ci) fi(x0) }.

The confusion table has k rows and k columns. The opportunity cost associated with the diagonal cells is always zero. If the costs are asymmetric the Bayes Risk Classifier follows the rule: classify a case as belonging to Cj if

p(Cj) fj(x0) C(~j|j) ≥ Max i≠j { p(Ci) fi(x0) C(~i|i) },

where C(~j|j) is the cost of misclassifying a case that belongs to Cj to any other class Ci, i ≠ j.

Lift Charts for two-class classifiers


Often in practice, opportunity costs are not known accurately and decision makers would like to
examine a range of possible opportunity costs. In such cases, when the classifier gives a
probability of belonging to each class and not just a binary (or hard) classification to C1 or C0,
we can use a very useful device known as the lift curve. The lift curve is a popular technique in
direct marketing. The input required to construct a lift curve is a validation data set that has been
scored by appending the probability predicted by a classifier to each case. In fact we can use
classifiers that do not predict probabilities but give scores that enable us to rank cases in order of
how likely the cases are to belong to one of the classes.
Example: Boston Housing (Two classes)
Let us fit a logistic regression model to the Boston Housing data. We fit a logistic regression
model to the training data (304 randomly selected cases) with all the 13 variables available in the
data set as predictor variables and with the binary variable HICLASS (high valued property
neighborhood) as the dependent variable. The model coefficients are applied to the validation
data (the remaining 202 cases in the data set). The first three columns of XLMiner output for the
first 30 cases in the validation data are shown below.
Case   Predicted Log-odds of Success   Predicted Prob. of Success   Actual Value of HICLASS
 1         3.5993                          0.9734                       1
 2        -6.5073                          0.0015                       0
 3         0.4061                          0.6002                       0
 4       -14.2910                          0.0000                       0
 5         4.5273                          0.9893                       1
 6        -1.2916                          0.2156                       0
 7       -37.6119                          0.0000                       0
 8        -1.1157                          0.2468                       0
 9        -4.3290                          0.0130                       0
10       -24.5364                          0.0000                       0
11       -21.6854                          0.0000                       0
12       -19.8654                          0.0000                       0
13       -13.1040                          0.0000                       0
14         4.4472                          0.9884                       1
15         3.5294                          0.9715                       1
16         3.6381                          0.9744                       1
17        -2.6806                          0.0641                       0
18        -0.0402                          0.4900                       0
19       -10.0750                          0.0000                       0
20       -10.2859                          0.0000                       0
21       -14.6084                          0.0000                       0
22         8.9016                          0.9999                       1
23         0.0874                          0.5218                       0
24        -6.0590                          0.0023                       1
25        -1.9183                          0.1281                       1
26       -13.2349                          0.0000                       0
27        -9.6509                          0.0001                       0
28       -13.4562                          0.0000                       0
29       -13.9340                          0.0000                       0
30         1.7257                          0.8489                       1

The same 30 cases are shown below sorted in descending order of the predicted probability of being a HICLASS=1 case.
Case   Predicted Log-odds of Success   Predicted Prob. of Success   Actual Value of HICLASS
22         8.9016                          0.9999                       1
 5         4.5273                          0.9893                       1
14         4.4472                          0.9884                       1
16         3.6381                          0.9744                       1
 1         3.5993                          0.9734                       1
15         3.5294                          0.9715                       1
30         1.7257                          0.8489                       1
 3         0.4061                          0.6002                       0
23         0.0874                          0.5218                       0
18        -0.0402                          0.4900                       0
 8        -1.1157                          0.2468                       0
 6        -1.2916                          0.2156                       0
25        -1.9183                          0.1281                       1
17        -2.6806                          0.0641                       0
 9        -4.3290                          0.0130                       0
24        -6.0590                          0.0023                       1
 2        -6.5073                          0.0015                       0
27        -9.6509                          0.0001                       0
19       -10.0750                          0.0000                       0
20       -10.2859                          0.0000                       0
13       -13.1040                          0.0000                       0
26       -13.2349                          0.0000                       0
28       -13.4562                          0.0000                       0
29       -13.9340                          0.0000                       0
 4       -14.2910                          0.0000                       0
21       -14.6084                          0.0000                       0
12       -19.8654                          0.0000                       0
11       -21.6854                          0.0000                       0
10       -24.5364                          0.0000                       0
 7       -37.6119                          0.0000                       0

First, we need to set a cutoff probability value, above which we will consider a case to be a
positive or "1," and below which we will consider a case to be a negative or "0." For any given
cutoff level, we can use the sorted table to compute a confusion table for a given cut-off
probability. For example, if we use a cut-off probability level of 0.400, we will predict 10
positives (7 true positives and 3 false positives); we will also predict 20 negatives (18 true
negatives and 2 false negatives). For each cut-off level, we can calculate the appropriate
confusion table. Instead of looking at a large number of confusion tables, it is much more
convenient to look at the cumulative lift curve (sometimes called a gains chart), which summarizes all the information in these multiple confusion tables into a graph. The graph is constructed with the cumulative number of cases (in descending order of probability) on the x axis and the cumulative number of true positives on the y axis, as shown below.

Probability Rank   Predicted Prob. of Success   Actual Value of HICLASS   Cumulative Actual Value
 1                     0.9999                        1                        1
 2                     0.9893                        1                        2
 3                     0.9884                        1                        3
 4                     0.9744                        1                        4
 5                     0.9734                        1                        5
 6                     0.9715                        1                        6
 7                     0.8489                        1                        7
 8                     0.6002                        0                        7
 9                     0.5218                        0                        7
10                     0.4900                        0                        7
11                     0.2468                        0                        7
12                     0.2156                        0                        7
13                     0.1281                        1                        8
14                     0.0641                        0                        8
15                     0.0130                        0                        8
16                     0.0023                        1                        9
17                     0.0015                        0                        9
18                     0.0001                        0                        9
19                     0.0000                        0                        9
20                     0.0000                        0                        9
21                     0.0000                        0                        9
22                     0.0000                        0                        9
23                     0.0000                        0                        9
24                     0.0000                        0                        9
25                     0.0000                        0                        9
26                     0.0000                        0                        9
27                     0.0000                        0                        9
28                     0.0000                        0                        9
29                     0.0000                        0                        9
30                     0.0000                        0                        9
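The cumulative column can be computed directly from the scored validation cases. A minimal Python sketch (the short list of scored cases is a placeholder, not the table above):

    def cumulative_positives(scored_cases):
        """scored_cases: (predicted probability, actual class) pairs. Returns the
        cumulative count of actual positives after sorting by descending
        probability, i.e. the y-values of the cumulative lift curve."""
        ranked = sorted(scored_cases, key=lambda pair: pair[0], reverse=True)
        cumulative, total = [], 0
        for _, actual in ranked:
            total += actual
            cumulative.append(total)
        return cumulative

    scored = [(0.97, 1), (0.02, 0), (0.60, 0), (0.99, 1), (0.85, 1), (0.13, 0)]
    print(cumulative_positives(scored))   # [1, 2, 3, 3, 3, 3]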

The cumulative lift chart is shown below.

[Chart: cumulative lift curve - the cumulative number of true positives (0 to 9) on the y axis plotted against the score rank (0 to 30) on the x axis.]

The line joining the points (0,0) to (30,9) is a reference line. It represents the expected number of
positives we would predict if we did not have a model but simply selected cases at random. It
provides a benchmark against which we can see performance of the model. If we had to choose
10 neighborhoods as HICLASS=1 neighborhoods and used our model to pick the ones most
likely to be "1's,", the lift curve tells us that we would be right about 7 of them. If we simply
select 10 cases at random we expect to be right for 10 × 9/30 = 3 cases. The model gives us a
"lift" in predicting HICLASS of 7/3 = 2.33. The lift will vary with the number of cases we choose
to act on. A good classifier will give us a high lift when we act on only a few cases (i.e. use the
prediction for the ones at the top). As we include more cases the lift will decrease. The lift curve
for the best possible classifier is shown as a broken line.
XLMiner automatically creates lift charts from probabilities predicted by logistic regression for
both training and validation data. The charts created for the full Boston Housing data are shown
below.

[Chart: Lift chart (validation dataset) - cumulative HIGHCLASS when cases are sorted using predicted values versus cumulative HIGHCLASS using the average, plotted against the number of cases.]

[Chart: Lift chart (training dataset) - the same two curves computed on the training cases.]

It is worth mentioning that a curve that captures the same information as the lift curve in a
slightly different manner is also popular in data mining applications. This is the ROC (short for
Receiver Operating Characteristic) curve. It uses the same variable on the y axis as the lift curve
(but expressed as a percentage of the maximum) and on the x axis it shows the false positives
(also expressed as a percentage of the maximum) for differing cut-off levels.
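A minimal Python sketch of how the ROC points can be computed from the same kind of scored validation cases (the data below are placeholders):

    def roc_points(scored_cases):
        """For a cut-off placed after each case in descending score order, return
        (false positive fraction, true positive fraction) of the class totals."""
        pos = sum(actual for _, actual in scored_cases)
        neg = len(scored_cases) - pos
        ranked = sorted(scored_cases, key=lambda pair: pair[0], reverse=True)
        tp = fp = 0
        points = [(0.0, 0.0)]
        for _, actual in ranked:
            if actual == 1:
                tp += 1
            else:
                fp += 1
            points.append((fp / neg, tp / pos))
        return points

    scored = [(0.97, 1), (0.02, 0), (0.60, 0), (0.99, 1), (0.85, 1), (0.13, 0)]
    print(roc_points(scored))   # [(0.0, 0.0), (0.0, 0.33...), ..., (1.0, 1.0)]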


The ROC curve for our 30 cases example above is shown below.

[Chart: ROC curve for the 30-case example - true positive cases (%) on the y axis against false positive cases (%) on the x axis, both from 0% to 100%.]

Classification using a Triage strategy


In some cases it is useful to have a 'can't say' option for the classifier. In a two-class situation
this means that for a case we can make one of three predictions. The case belongs to C0, or the
case belongs to C1, or we cannot make a prediction because there is not enough information to
confidently pick C0 or C1. Cases that the classifier cannot classify are subjected to closer scrutiny
either by using expert judgment or by enriching the set of predictor variables by gathering
additional information that is perhaps more difficult or expensive to obtain. This is analogous to
the strategy of triage that is often employed during retreat in battle. The wounded are classified
into those who are well enough to retreat, those who are too ill to retreat even if medically treated
under the prevailing conditions, and those who are likely to become well enough to retreat if
given medical attention. An example is in processing credit card transactions where a classifier
may be used to identify clearly legitimate cases and the obviously fraudulent ones while referring
the remaining cases to a human decision-maker who may look up a database to form a judgment.
Since the vast majority of transactions are legitimate, such a classifier would substantially reduce
the burden on human experts.
To gain some insight into forming such a strategy let us revisit the simple two-class, one predictor
variable, classifier that we examined at the beginning of this chapter.

[Figure: the two densities f0(x) and f1(x), with the grey area of doubtful classification around their crossing point.]

Clearly the grey area of greatest doubt in classification is the area around a. At a the ratio of the conditional probabilities of belonging to the two classes is one. A sensible way to define the grey area is the set of x values such that:

t > p(C1) f1(x0) / [ p(C0) f0(x0) ] > 1/t

where t is a threshold for the ratio. A typical value of t might be in the range 1.05 to 1.2.
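A minimal Python sketch of this triage rule; the density values below are placeholders.

    def triage_classify(p0, f0_x, p1, f1_x, t=1.1):
        """Return 1, 0 or 'cannot say' depending on where the ratio
        r = p(C1)*f1(x0) / (p(C0)*f0(x0)) falls relative to t and 1/t."""
        r = (p1 * f1_x) / (p0 * f0_x)
        if r > t:
            return 1
        if r < 1.0 / t:
            return 0
        return "cannot say"

    print(triage_classify(0.5, 0.20, 0.5, 0.21))   # ratio 1.05 -> 'cannot say'
    print(triage_classify(0.5, 0.20, 0.5, 0.30))   # ratio 1.50 -> 1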


Lecture 3

Classification Trees

Classification and Regression Trees


If one had to choose a classification technique that performs well across a wide range of situations without requiring much effort from the application developer while being readily understandable by the end-user, a strong contender would be the tree methodology developed by Breiman, Friedman, Olshen and Stone (1984). We will discuss this classification procedure first; then in later sections we will show how the procedure can be extended to prediction of a continuous dependent variable. The program that Breiman et al. created to implement these procedures was called CART, for Classification And Regression Trees.
Classification Trees
There are two key ideas underlying classification trees. The first is the idea of recursive
partitioning of the space of the independent variables. The second is of pruning using validation
data. In the next few sections we describe recursive partitioning, subsequent sections explain the
pruning methodology.
Recursive Partitioning
Let us denote the dependent (categorical) variable by y and the independent variables by x1, x2, x3, ..., xp. Recursive partitioning divides up the p-dimensional space of the x variables into non-overlapping rectangles. This division is accomplished recursively. First one of the variables is selected, say xi, and a value of xi, say si, is chosen to split the p-dimensional space into two parts: one part is the p-dimensional hyper-rectangle that contains all the points with xi ≤ si and the other part is the hyper-rectangle with all the points with xi > si. Then one of these two parts is divided in a similar manner by choosing a variable again (it could be xi or another variable) and a split value for the variable. This results in three rectangular regions. (From here onwards we refer to hyper-rectangles simply as rectangles.) This process is continued so that we get smaller and smaller rectangles. The idea is to divide the entire x-space up into rectangles such that each rectangle is as homogeneous or "pure" as possible. By "pure" we mean containing points that belong to just one class. (Of course, this is not always possible, as there may be points that belong to different classes but have exactly the same values for every one of the independent variables.)

Let us illustrate recursive partitioning with an example.

Example 1 (Johnson and Wichern)

A riding-mower manufacturer would like to find a way of classifying families in a city into those

that are likely to purchase a riding mower and those who are not likely to buy one. A pilot

random sample of 12 owners and 12 non-owners in the city is undertaken. The data are shown in

Table I and plotted in Figure 1 below. The independent variables here are Income (x1) and Lot

Size (x2). The categorical y variable has two classes: owners and non-owners.

Table 1
Observation   Income ($ 000's)   Lot Size (000's sq. ft.)   Owners=1, Non-owners=2
 1                 60                  18.4                     1
 2                 85.5                16.8                     1
 3                 64.8                21.6                     1
 4                 61.5                20.8                     1
 5                 87                  23.6                     1
 6                110.1                19.2                     1
 7                108                  17.6                     1
 8                 82.8                22.4                     1
 9                 69                  20                       1
10                 93                  20.8                     1
11                 51                  22                       1
12                 81                  20                       1
13                 75                  19.6                     2
14                 52.8                20.8                     2
15                 64.8                17.2                     2
16                 43.2                20.4                     2
17                 84                  17.6                     2
18                 49.2                17.6                     2
19                 59.4                16                       2
20                 66                  18.4                     2
21                 47.4                16.4                     2
22                 33                  18.8                     2
23                 51                  14                       2
24                 63                  14.8                     2

Figure 1: Scatter plot of Lot Size (000's sq. ft.) versus Income ($ 000's) for the 12 owners and 12 non-owners.

If we apply CART to this data it will choose x2 for the first split with a splitting value of 19. The (x1, x2) space is now divided into two rectangles, one with the Lot Size variable x2 ≤ 19 and the other with x2 > 19. See Figure 2.

Figure 2: The scatter plot of Figure 1 with the first split, a horizontal line at Lot Size = 19.

Notice how the split into two rectangles has created two rectangles, each of which is much more homogeneous than the rectangle before the split. The upper rectangle contains points that are mostly owners (9 owners and 3 non-owners) while the lower rectangle contains mostly non-owners (9 non-owners and 3 owners).
How did CART decide on this particular split? It examined each variable and all possible split values for each variable to find the best split. What are the possible split values for a variable? They are simply the mid-points between pairs of consecutive values for the variable. The possible split points for x1 are {38.1, 45.3, 50.1, ..., 109.5} and those for x2 are {14.4, 15.4, 16.2, ..., 23}. These split points are ranked according to how much they reduce impurity (heterogeneity of composition). The reduction in impurity is defined as the impurity of the rectangle before the split minus the sum of the impurities for the two rectangles that result from a split. There are a number of ways we could measure impurity. We will describe the most popular measure of impurity: the Gini index. If we denote the classes by k, k = 1, 2, ..., C, where C is the total number of classes for the y variable, the Gini impurity index for a rectangle A is defined by
I(A) = 1 - Σ k=1..C (pk)²

where pk is the fraction of observations in rectangle A that belong to class k. Notice that I(A) = 0 if all the observations belong to a single class, and I(A) is maximized when all classes appear in equal proportions in rectangle A; its maximum value is (C-1)/C.
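A minimal Python sketch of the Gini index and of the split search described above, applied to a small subset of the Lot Size values from Table 1 (the six points below are chosen only to keep the example short):

    def gini(labels):
        """Gini impurity I(A) = 1 - sum over classes k of p_k squared."""
        n = len(labels)
        if n == 0:
            return 0.0
        return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

    def best_split(values, labels):
        """Try the mid-point between each pair of consecutive sorted values and
        return the split giving the largest reduction in size-weighted impurity."""
        pairs = sorted(zip(values, labels))
        n = len(pairs)
        parent = gini([lab for _, lab in pairs])
        best = None
        for i in range(1, n):
            if pairs[i][0] == pairs[i - 1][0]:
                continue
            split = (pairs[i][0] + pairs[i - 1][0]) / 2
            left = [lab for val, lab in pairs if val <= split]
            right = [lab for val, lab in pairs if val > split]
            reduction = parent - (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or reduction > best[1]:
                best = (split, reduction)
        return best

    lot_size = [23.6, 21.6, 20.8, 17.2, 16.0, 14.8]   # six Lot Size values from Table 1
    owner    = [1, 1, 1, 2, 2, 2]                     # 1 = owner, 2 = non-owner
    print(gini([1, 1, 2, 2]))            # 0.5, the two-class maximum
    print(best_split(lot_size, owner))   # (19.0, 0.5): a split at 19 makes both halves pure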
The next split is on the Income variable, x1, at the value 84.75. Figure 3 shows that once again the CART procedure has astutely chosen to split a rectangle to increase the purity of the resulting rectangles. The left lower rectangle, which contains data points with x1 ≤ 84.75 and x2 ≤ 19, has all but one point belonging to non-owners, while the right lower rectangle, which contains data points with x1 > 84.75 and x2 ≤ 19, consists exclusively of owners.
Figure 3: The partition after the second split, a vertical line at Income = 84.75 within the lower (Lot Size ≤ 19) rectangle.

The next split is shown below:


Figure 4: The partition after the third split, a vertical line at Income = 57.15 within the upper (Lot Size > 19) rectangle.

We can see how the recursive partitioning is refining the set of constituent rectangles to become
purer as the algorithm proceeds. The final stage of the recursive partitioning is shown in Figure 5.

Figure 5: The final stage of the recursive partitioning; every rectangle is pure.

Notice that now each rectangle is pure: it contains data points from just one of the two classes.
The reason the method is called a classification tree algorithm is that each split can be depicted as
a split of a node into two successor nodes. The first split is shown as a branching of the root node
of a tree in Figure 6.
Figure 6: The first split shown as a branching of the root node of a tree: the node splits on Lot Size at 19, with 12 cases on the x2 ≤ 19 branch and 12 cases on the x2 > 19 branch.

The tree representing the first three splits is shown in Figure 7 below.
Figure 7: The tree representing the first three splits: the root split on Lot Size at 19, followed by splits on Income at 84.75 and at 57.15.

The full tree is shown in Figure 8 below. We have represented the nodes that have successors by
circles. The numbers inside the circle are the splitting values and the name of the variable chosen
for splitting at that node is shown below the node. The numbers on the left fork at a decision node
shows the number of points in the decision node that had values less than or equal to the splitting
value while the number on the right fork shows the number that had a greater value. These are
called decision nodes because if we were to use a tree to classify a new observation for which we
knew only the values of the independent variables we would drop the observation down the tree
in such a way that at each decision node the appropriate branch is taken until we get to a node that
has no successors. Such terminal nodes are called the leaves of the tree. Each leaf node
corresponds to one of the final rectangles into which the x-space is partitioned and is depicted as
a rectangle-shaped node. When the observation has dropped down all the way to a leaf we can predict a class for it by simply taking a "vote" of all the training data that belonged to the leaf
when the tree was grown. The class with the highest vote is the class that we would predict for the
new observation. The number below the leaf node is the class with the most votes in the
rectangle. The % value in a leaf node shows the percentage of the total number of training
observations that belonged to that node. It is useful to note that the type of trees grown by CART
(called binary trees) have the property that the number of leaf nodes is exactly one more than the
number of decision nodes.
Figure 8: The full-grown classification tree for the riding-mower data. Decision nodes (circles) show the splitting value, with the splitting variable named below the node and the number of cases going down each fork; leaf nodes (rectangles) show the predicted class and the percentage of the training observations that belong to them.

Pruning
The second key idea in the CART procedure, that of using independent validation data to prune back the tree that is grown from the training data, was the real innovation. Previously, methods had been developed that were based on the idea of recursive
partitioning but they had used rules to prevent the tree from growing excessively and over-fitting
the training data. For example, CHAID (Chi-Squared Automatic Interaction Detection) is a
recursive partitioning method that predates CART by several years and is widely used in database
marketing applications to this day. It uses a well-known statistical test (the chi-square test for
independence) to assess if splitting a node improves the purity by a statistically significant
amount. If the test does not show a significant improvement the split is not carried out. By
contrast, CART uses validation data to prune back the tree that has been deliberately overgrown
using the training data.
The idea behind pruning is to recognize that a very large tree is likely to be over-fitting the
training data. In our example, the last few splits resulted in rectangles with very few points
(indeed four rectangles in the full tree have just one point). We can see intuitively that these last
splits are likely to be simply capturing noise in the training set rather than reflecting patterns that
would occur in future data such as the validation data. Pruning consists of successively selecting a decision node and re-designating it as a leaf node, thereby lopping off the branches extending beyond that decision node (its subtree) and reducing the size of the tree. The pruning process trades off misclassification error in the validation data set against the number of decision
nodes in the pruned tree to arrive at a tree that captures the patterns but not the noise in the
training data. It uses a criterion called the cost complexity of a tree to generate a sequence of
trees which are successively smaller to the point of having a tree with just the root node. (What is
the classification rule for a tree with just one node?). We then pick as our best tree the one tree in
the sequence that gives the smallest misclassification error in the validation data.
The cost complexity criterion that CART uses is simply the misclassification error of a tree
(based on the validation data) plus a penalty factor for the size of the tree. The penalty factor is
based on a parameter, let us call it a, that is the per node penalty. The cost complexity criterion
for a tree is thus Err(T) + a |L(T)| where Err(T) is the fraction of validation data observations that
are misclassified by tree T, L(T) is the number of leaves in tree T and a is the per node penalty
cost: a number that we will vary upwards from zero. When a = 0 there is no penalty for having
too many nodes in a tree and the best tree using the cost complexity criterion is the full grown
unpruned tree. When we increase a to a very large value the penalty cost component swamps the
misclassification error component of the cost complexity criterion function and the best tree is
simply the tree with the fewest leaves, namely the tree with simply one node. As we increase the
value of a from zero at some value we will first encounter a situation where for some tree T1
formed by cutting off the subtree at a decision node we just balance the extra cost of increased
misclassification error (due to fewer leaves) against the penalty cost saved from having fewer
leaves. We prune the full tree at this decision node by cutting off its subtree and redesignating this
decision node as a leaf node. Let's call this tree T1. We now repeat the logic that we had applied
previously to the full tree, with the new tree T1 by further increasing the value of a. Continuing in
this manner we generate a succession of trees with diminishing number of nodes all the way to
the trivial tree consisting of just one node.
From this sequence of trees it seems natural to pick the one that gave the minimum
misclassification error on the validation data set. We call this the Minimum Error Tree.
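A minimal Python sketch of the cost complexity criterion; the (error, leaves) pairs below are invented placeholders for a pruning sequence.

    def cost_complexity(err, n_leaves, a):
        """Cost complexity of a tree T: Err(T) + a * |L(T)|."""
        return err + a * n_leaves

    # Placeholder pruning sequence: (misclassification error, number of leaves).
    sequence = [(0.01, 31), (0.02, 11), (0.03, 6), (0.05, 4), (0.16, 2)]

    for a in (0.0, 0.005, 0.2):
        best = min(sequence, key=lambda tree: cost_complexity(tree[0], tree[1], a))
        print(a, best)
    # a = 0 keeps the largest tree; increasing a selects successively smaller trees.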
Let us use the Boston Housing data to illustrate. Shown below is the output that XLMiner
generates when it is using the training data in the tree growing phase of the algorithm

Training Log
Growing the Tree

#Nodes   Error     #Nodes   Error     #Nodes   Error
0        36.18     11       0.49      22       0.09
1        15.64     12       0.42      23       0.08
2         5.75     13       0.35      24       0.05
3         3.29     14       0.34      25       0.03
4         2.94     15       0.32      26       0.03
5         1.88     16       0.25      27       0.02
6         1.42     17       0.22      28       0.01
7         1.26     18       0.21      29       0
8         1.2      19       0.15      30       0
9         0.63     20       0.09
10        0.59     21       0.09

Training Misclassification Summary

Classification Confusion Matrix
                    Predicted Class
Actual Class       1       2       3
     1            59       0       0
     2             0     194       0
     3             0       0      51

Error Report
Class      # Cases    # Errors    % Error
1              59          0        0.00
2             194          0        0.00
3              51          0        0.00
Overall       304          0        0.00
(These are cases in the training data.)

The top table logs the tree-growing phase by showing in each row the number of decision nodes
in the tree at each stage and the corresponding (percentage) misclassification error for the training
data applying the voting rule at the leaves. We see that the error steadily decreases as the number
of decision nodes increases from zero (where the tree consists of just the root node) to thirty. The
error drops steeply in the beginning, going from 36% to 3% with just an increase of decision
nodes from 0 to 3. Thereafter the improvement is slower as we increase the size of the tree.
Finally we stop at a full tree of 30 decision nodes (equivalently, 31 leaves) with no error in the
training data, as is also shown in the confusion table and the error report by class.

The output generated by XLMiner during the pruning phase is shown below.
# Decision Nodes   Training Error   Validation Error
30                     0.00%            15.84%
29                     0.00%            15.84%
28                     0.01%            15.84%
27                     0.02%            15.84%
26                     0.03%            15.84%
25                     0.03%            15.84%
24                     0.05%            15.84%
23                     0.08%            15.84%
22                     0.09%            15.84%
21                     0.09%            16.34%
20                     0.09%            15.84%
19                     0.15%            15.84%
18                     0.21%            15.84%
17                     0.22%            15.84%
16                     0.25%            15.84%
15                     0.32%            15.84%
14                     0.34%            15.35%
13                     0.35%            14.85%
12                     0.42%            14.85%
11                     0.49%            15.35%
10                     0.59%            14.85%   <-- Minimum Error Prune
9                      0.63%            15.84%
8                      1.20%            15.84%
7                      1.26%            16.83%
6                      1.42%            16.83%
5                      1.88%            15.84%   <-- Best Prune
4                      2.94%            21.78%
3                      3.29%            21.78%
2                      5.75%            30.20%
1                     15.64%            33.66%

Std. Err. 0.02501957

Notice now that as the number of decision nodes decreases the error in the validation data has a
slow decreasing trend (with some fluctuation) up to a 14.85% error rate for the tree with 10
nodes. This is more readily visible from the graph below. Thereafter the error increases, going up sharply when the tree is quite small. The Minimum Error Tree is selected to be the one with 10 decision nodes (why not the one with 13 decision nodes?)
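A minimal Python sketch of how the Minimum Error and Best Pruned trees can be read off such a pruning table; only a few of the rows above are included, and the 202 validation cases come from this example.

    from math import sqrt

    # (number of decision nodes, validation error) for a few rows of the table above
    rows = [(30, 0.1584), (13, 0.1485), (10, 0.1485), (9, 0.1584),
            (5, 0.1584), (3, 0.2178), (1, 0.3366)]
    n_val = 202

    # Minimum Error Tree: the smallest validation error (ties go to fewer nodes)
    min_err_nodes, e_min = min(rows, key=lambda r: (r[1], r[0]))
    std_err = sqrt(e_min * (1 - e_min) / n_val)

    # Best Pruned Tree: the smallest tree within one standard error of the minimum
    best_pruned = min(nodes for nodes, err in rows if err <= e_min + std_err)
    print(min_err_nodes, round(std_err, 3), best_pruned)   # 10 0.025 5

This reproduces the choices highlighted in the output: the Minimum Error Tree with 10 decision nodes and the Best Pruned Tree with 5.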

Effect of tree size

[Chart: training error and validation error (%) plotted against the number of decision nodes, from 0 to 30.]

This Minimum Error tree is shown in Figure 9.


Figure 9: The Minimum Error Tree (10 decision nodes). The root splits on LSTAT at 15.145; further splits use RM, CRIM, LSTAT, RAD, PTRATIO and AGE.

You will notice that the XLMiner output from the pruning phase highlights another tree besides
the Minimum Error Tree. This is the Best Pruned Tree, the tree with 5 decision nodes. The reason
this tree is important is that it is the smallest tree in the pruning sequence that has an error that is
within one standard error of the Minimum Error Tree. The estimate of error that we get from the
validation data is just that: it is an estimate. If we had had another set of validation data the
minimum error would have been different. The minimum error rate we have computed can be
viewed as an observed value of a random variable with standard error (estimated standard deviation) equal to sqrt( Emin (1 - Emin) / Nval ), where Emin is the error rate (as a fraction) for the minimum error tree and Nval is the number of observations in the validation data set. For our example Emin = 0.1485 and Nval = 202, so that the standard error is 0.025. The Best Pruned Tree is shown in Figure 10.
Figure 10: The Best Pruned Tree, with 5 decision nodes splitting on LSTAT, RM and CRIM (the root split is on LSTAT at 15.145).

We show the confusion table and summary of classification errors for the Best Pruned Tree
below.

Validation Misclassification Summary

Classification Confusion Matrix
                    Predicted Class
Actual Class       1       2       3
     1            25      10       0
     2             5     120       9
     3             0       8      25

Error Report
Class      # Cases    # Errors    % Error
1              35         10       28.57
2             134         14       10.45
3              33          8       24.24
Overall       202         32       15.84

It is important to note that since we have used the validation data in building the classifier, strictly speaking it is not fair to compare the above error rates directly with those of other classification procedures that use only the training data to construct classification rules. A fair comparison would be to further partition the training data used by the other procedures (TDother) into training (TDtree) and validation (VDtree) sets for the classification tree. The error rate for the classification tree created by using TDtree to grow the tree and VDtree to prune it can then be compared with the other classifiers on the validation data used by those classifiers (VDother), treated as new hold-out data.

Classification Rules from Trees


One of the reasons tree classifiers are very popular is that they provide easily understandable
classification rules (at least if the trees are not too large). Each leaf is equivalent to a classification
rule. For example, the upper left leaf in the Best Pruned Tree above, gives us the rule:
IF (LSTAT ≤ 15.145) AND (RM ≤ 6.5545) THEN CLASS = 2
Such rules are easily explained to managers and operating staff compared to outputs of other
classifiers such as discriminant functions. Their logic is certainly far more transparent than that of
weights in neural networks!
The tree method is a good off-the-shelf choice of classifier
We said at the beginning of this chapter that classification trees require relatively little effort from
developers. Let us give our reasons for this statement. Trees need no tuning parameters. There is
no need for transformation of variables (any monotone transformation of the variables will give
the same trees). Variable subset selection is automatic since it is part of the split selection; in our
example notice that the Best Pruned Tree has automatically selected just three variables (LSTAT,
RM and CRIM) out of the set of thirteen variables available. Trees are also intrinsically robust to
outliers as the choice of a split depends on the ordering of observation values and not on the
absolute magnitudes of these values.
Finally the CART procedure can be readily modified to handle missing data without having to
impute values or delete observations with missing values. The method can also be extended to incorporate an importance ranking for the variables in terms of their impact on the quality of the classification.
Notes:
1. We have not described how categorical independent variables are handled in CART. In principle there is no difficulty. The split choices for a categorical variable are all ways in which the set of categorical values can be divided into two subsets. For example a categorical variable with 4 categories, say {1,2,3,4}, can be split in 7 ways into two subsets: {1} and {2,3,4}; {2} and {1,3,4}; {3} and {1,2,4}; {4} and {1,2,3}; {1,2} and {3,4}; {1,3} and {2,4}; {1,4} and {2,3}. When the number of categories is large the number of splits becomes very large. XLMiner supports only binary categorical variables (coded as numbers). If you have a categorical independent variable that takes more than two values, you will need to replace the variable with several dummy variables, each of which is binary, in a manner that is identical to the use of dummy variables in regression (a small sketch after these notes shows one way to create them).
2. There are several variations on the basic recursive partitioning scheme described above. A common variation is to permit splitting of the x variable space using straight lines (planes for p = 3 and hyperplanes for p > 3) that are not perpendicular to the co-ordinate axes. This can result in a full tree that is pure with far fewer nodes, particularly when the classes lend themselves to linear separation functions. However this improvement comes at a price. First, the simplicity of interpretation of classification trees is lost because we now split on the basis of weighted sums of independent variables where the weights may not be easy to interpret. Second, the important property remarked on earlier of the trees being invariant to monotone transformations of independent variables is no longer true.
3. Besides CHAID, another popular tree classification method is ID3 (and its successor C4.5). This method was developed by Quinlan, a leading researcher in machine learning, and is popular with developers of classifiers who come from a background in machine learning.
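Referring to note 1, a minimal Python sketch of replacing a multi-valued categorical variable with binary dummy variables (the category codes below are placeholders):

    def to_dummies(values):
        """Replace a categorical column with one 0/1 dummy column per category."""
        return {cat: [1 if v == cat else 0 for v in values]
                for cat in sorted(set(values))}

    region = [1, 3, 2, 4, 2, 1]   # a placeholder column with four categories
    print(to_dummies(region))
    # {1: [1, 0, 0, 0, 0, 1], 2: [0, 0, 1, 0, 1, 0], 3: [0, 1, 0, 0, 0, 0], 4: [0, 0, 0, 1, 0, 0]}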

As usual, for regression we denote the dependent variable by y and the independent variables by x1, x2, x3, ..., xp. Both key ideas underlying classification trees carry over with small modifications for regression trees.


Lecture 4

Discriminant Analysis

Discriminant analysis uses continuous variable measurements on different groups of items to highlight aspects that distinguish the groups and uses these measurements to
classify new items. Common uses of the method have been in biological classification
into species and sub-species, classifying applications for loans, credit cards and insurance
into low risk and high risk categories, classifying customers of new products into early
adopters, early majority, late majority and laggards, classification of bonds into bond
rating categories, research studies involving disputed authorship, college admissions,
medical studies involving alcoholics and non-alcoholics, anthropological studies such as
classifying skulls of human fossils and methods to identify human fingerprints.
Example 1 (Johnson and Wichern)
A riding-mower manufacturer would like to find a way of classifying families in a city
into those that are likely to purchase a riding mower and those who are not likely to buy
one. A pilot random sample of 12 owners and 12 non-owners in the city is undertaken.
The data are shown in Table I and plotted in Figure 1 below:
Table 1
Observation   Income ($ 000's)   Lot Size (000's sq. ft.)   Owners=1, Non-owners=2
 1                 60                  18.4                     1
 2                 85.5                16.8                     1
 3                 64.8                21.6                     1
 4                 61.5                20.8                     1
 5                 87                  23.6                     1
 6                110.1                19.2                     1
 7                108                  17.6                     1
 8                 82.8                22.4                     1
 9                 69                  20                       1
10                 93                  20.8                     1
11                 51                  22                       1
12                 81                  20                       1
13                 75                  19.6                     2
14                 52.8                20.8                     2
15                 64.8                17.2                     2
16                 43.2                20.4                     2
17                 84                  17.6                     2
18                 49.2                17.6                     2
19                 59.4                16                       2
20                 66                  18.4                     2
21                 47.4                16.4                     2
22                 33                  18.8                     2
23                 51                  14                       2
24                 63                  14.8                     2

Figure 1: Lot Size (000's sq. ft.) versus Income ($000's) for owners and non-owners, with a straight line separating most owners from most non-owners.

We can think of a linear classification rule as a line that separates the x1-x2 region into
two parts, where most of the owners are in one half-plane and most of the non-owners are in the
complementary half-plane. A good classification rule would separate out the data so that
the fewest points are misclassified: the line shown in Fig. 1 seems to do a good job of
discriminating between the two groups, as it makes 4 misclassifications out of 24 points.
Can we do better?
We can obtain the linear classification functions that were suggested by Fisher using
statistical software. You can use XLMiner to find Fisher's linear classification functions.
Output 1 shows the results of invoking the discriminant routine.

Output 1

Prior class probabilities (according to relative occurrences in training data)
Class 1: 0.5     Class 2: 0.5

Classification Functions
Variables                   Function 1       Function 2
Constant                   -73.160202      -51.4214439
Income ($000's)              0.42958561      0.32935533
Lot Size (000's sq. ft.)     5.46674967      4.68156528

Canonical Variate Loadings
Variables                    Variate 1
Income ($000's)              0.01032889
Lot Size (000's sq. ft.)     0.08091455

Training Misclassification Summary

Classification Confusion Matrix
                    Predicted Class 1   Predicted Class 2
Actual Class 1             11                   1
Actual Class 2              2                  10

Error Report
Class      # Cases   # Errors   % Error
1             12         1        8.33
2             12         2       16.67
Overall       24         3       12.50

We note that it is possible to have a lower misclassification rate (3 out of 24) by using
the classification functions specified in the output. These functions are specified in a way
that can be easily generalized to more than two classes. A family is classified into Class 1
of owners if Function 1 is higher than Function 2, and into Class 2 if the reverse is the
case. The values given for the functions are simply the weights to be associated with each
variable in the linear function, in a manner analogous to multiple linear regression. For
example, the value of the classification function for Class 1 for the first observation is 53.20.
This is calculated using the coefficients of classification function 1 shown in Output 1 above as
-73.1602 + 0.4296 x 60 + 5.4667 x 18.4. XLMiner computes these functions for the observations in
our dataset. The results are shown in Table 3 below.
Table 3
Obs   Actual   Predicted   Value for   Value for   Income     Lot Size
      Class    Class       Class 1     Class 2     ($000's)   (000's sq. ft.)
  1     1         2         53.2031     54.4807      60.0        18.4
  2     1         1         55.4108     55.3887      85.5        16.8
  3     1         1         72.7587     71.0426      64.8        21.6
  4     1         1         66.9677     66.2105      61.5        20.8
  5     1         1         93.2290     87.7174      87.0        23.6
  6     1         1         79.0988     74.7266     110.1        19.2
  7     1         1         69.4498     66.5445     108.0        17.6
  8     1         1         84.8647     80.7162      82.8        22.4
  9     1         1         65.8162     64.9354      69.0        20.0
 10     1         1         80.4996     76.5852      93.0        20.8
 11     1         1         69.0172     68.3701      51.0        22.0
 12     1         1         70.9712     68.8876      81.0        20.0
 13     2         1         66.2070     65.0389      75.0        19.6
 14     2         2         63.2303     63.3451      52.8        20.8
 15     2         2         48.7050     50.4437      64.8        17.2
 16     2         2         56.9196     58.3106      43.2        20.4
 17     2         1         59.1398     58.6400      84.0        17.6
 18     2         2         44.1902     47.1784      49.2        17.6
 19     2         2         39.8252     43.0473      59.4        16.0
 20     2         2         55.7806     56.4568      66.0        18.4
 21     2         2         36.8568     40.9677      47.4        16.4
 22     2         2         43.7910     47.4607      33.0        18.8
 23     2         2         25.2832     30.9176      51.0        14.0
 24     2         2         34.8116     38.6151      63.0        14.8

Notice that observations 1, 13 and 17 are misclassified, as we would expect from the
error report in Output 1.
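These scores are easy to reproduce outside XLMiner. The minimal numpy sketch below applies the classification function coefficients from Output 1 to a few of the observations in Table 1; because the coefficients are rounded, the scores agree with Table 3 only to about two decimal places.

```python
import numpy as np

# Coefficients taken from Output 1 (constant, Income, Lot Size).
f1 = np.array([-73.1602, 0.42959, 5.46675])   # classification function for class 1 (owners)
f2 = np.array([-51.4214, 0.32936, 4.68157])   # classification function for class 2 (non-owners)

# A few observations from Table 1: [Income in $000's, Lot Size in 000's sq. ft.]
X = np.array([[60.0, 18.4],    # obs 1
              [85.5, 16.8],    # obs 2
              [75.0, 19.6],    # obs 13
              [84.0, 17.6]])   # obs 17

Xa = np.hstack([np.ones((X.shape[0], 1)), X])        # prepend 1s for the constant term
scores = np.column_stack([Xa @ f1, Xa @ f2])
pred = np.where(scores[:, 0] > scores[:, 1], 1, 2)   # classify to the larger function value
print(np.round(scores, 2))    # obs 1 gives roughly (53.20, 54.48), hence class 2
print(pred)
```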
Let us describe the reasoning behind Fisher's linear classification rules. Figure 2 depicts
the logic.

Figure 2: The (Income, Lot Size) scatter plot for owners and non-owners, with two candidate projection directions D1 and D2; P1 and P2 mark the projections of the two group means onto direction D1.

Consider various directions such as directions D1 and D2 shown in Figure 2. One way to
identify a good linear discriminant function is to choose amongst all possible directions
the one that has the property that when we project (drop a perpendicular line from) the
means of the two groups onto a line in the chosen direction the projections of the group
means (feet of the perpendiculars, e.g. P1 and P2 in direction D1) are separated by the
maximum possible distance. The means of the two groups are:

                      Income   Lot Size
Mean 1 (owners)        79.5      20.3
Mean 2 (non-owners)    57.4      17.6

We still need to decide how to measure the distance. We could simply use Euclidean
distance. This has two drawbacks. First, the distance would depend on the units we
choose to measure the variables: we would get different answers if we decided to measure
lot size in, say, square yards instead of thousands of square feet. Second, we would not be
taking any account of the correlation structure. This is often a very important
consideration, especially when we are using many variables to separate groups. In that
case there will often be variables which by themselves are useful discriminators between
groups but which, in the presence of other variables, are practically redundant because they
capture the same effects as the other variables.
Fisher's method gets over these objections by using a measure of distance that is a
generalization of Euclidean distance known as Mahalanobis distance. This distance is
defined with respect to a positive definite matrix S. The squared Mahalanobis distance
between two p-dimensional (column) vectors y1 and y2 is (y1 - y2)' S^(-1) (y1 - y2), where S
is a symmetric positive definite matrix of dimension p x p. Notice that if S is the
identity matrix the Mahalanobis distance is the same as Euclidean distance. In linear
discriminant analysis we use the pooled sample variance matrix of the different groups for S. If
X1 and X2 are the n1 x p and n2 x p matrices of observations for groups 1 and 2, and the
respective sample variance matrices are S1 and S2, the pooled matrix S is equal to
{(n1 - 1) S1 + (n2 - 1) S2} / (n1 + n2 - 2). The matrix S defines the optimum direction
(actually the eigenvector associated with its largest eigenvalue) that we referred to when
we discussed the logic behind Figure 2. This choice of Mahalanobis distance can also be
shown to be optimal* in the sense of minimizing the expected misclassification error
when the variable values of the populations in the two groups (from which we have
drawn our samples) follow a multivariate normal distribution with a common covariance
matrix. In fact it is optimal for the larger family of elliptical distributions with equal
variance-covariance matrices. In practice the robustness of the method is quite
remarkable: even for situations that are only roughly normal it performs quite well.
If we had a prospective customer list with data on income and area, we could use the
classification functions in Output 1 to identify the sub-list of families that are classified as
group 1. This sub-list would consist of owners (within the classification accuracy of our
functions) and therefore prospective purchasers of the product.
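To make the construction concrete, here is a minimal numpy sketch (not XLMiner's implementation) that computes the pooled sample variance matrix, the squared Mahalanobis distance between the group means, and linear classification functions of the form used above. The constants below include a log-prior term, so they may differ from a particular package's output by a constant shift that does not affect the classification.

```python
import numpy as np

def pooled_variance(X1, X2):
    """Pooled sample variance matrix {(n1-1)S1 + (n2-1)S2} / (n1 + n2 - 2)."""
    n1, n2 = len(X1), len(X2)
    S1 = np.cov(X1, rowvar=False)
    S2 = np.cov(X2, rowvar=False)
    return ((n1 - 1) * S1 + (n2 - 1) * S2) / (n1 + n2 - 2)

def mahalanobis_sq(m1, m2, S):
    """Squared Mahalanobis distance (m1 - m2)' S^-1 (m1 - m2)."""
    d = m1 - m2
    return d @ np.linalg.inv(S) @ d

def classification_functions(X1, X2, priors=(0.5, 0.5)):
    """Return (constants, weights) of the two linear classification functions."""
    S = pooled_variance(X1, X2)
    Sinv = np.linalg.inv(S)
    consts, weights = [], []
    for X, p in ((X1, priors[0]), (X2, priors[1])):
        m = X.mean(axis=0)
        weights.append(Sinv @ m)                    # linear weights for this group
        consts.append(-0.5 * m @ Sinv @ m + np.log(p))   # constant term
    return np.array(consts), np.array(weights)

# Usage (sketch): pass the 12 x 2 arrays of owner and non-owner rows from Table 1,
# e.g. classification_functions(X_owners, X_nonowners).
```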

Classification Error
What is the accuracy we should expect from our classification functions? We have a
training data error rate (often called the re-substitution error rate) of 12.5% in our
example. However, this is a biased, overly optimistic estimate, because we have used the
same data for fitting the classification parameters as for estimating the error. In data
mining applications we would randomly partition our data into training and validation
subsets. We would use the training part to estimate the classification functions and hold
out the validation part to get a more reliable, unbiased estimate of the classification error.
So far we have assumed that our objective is to minimize the classification error and that
the chances of encountering an item from either group requiring classification are the
same. If the probability of encountering an item for classification in the future is not
equal for the two groups, we should modify our functions to reduce our expected (long run
average) error rate. Also, we may not want to minimize the misclassification rate in certain
situations. If the cost of mistakenly classifying a group 1 item as group 2 is very different
from the cost of classifying a group 2 item as group 1, we may want to minimize the
expected cost of misclassification rather than the error rate, which takes no account
of unequal misclassification costs. It is simple to incorporate these situations into our
framework for two classes. All we need to provide are estimates of the ratio of the
chances of encountering an item in class 1 as compared to class 2 in future classifications,
and the ratio of the costs of making the two kinds of classification error. These ratios
alter the constant terms in the linear classification functions so as to minimize the expected
cost of misclassification. The intercept term for function 1 is increased by ln(C(2|1)) +
ln(P(C1)) and that for function 2 is increased by ln(C(1|2)) + ln(P(C2)), where C(i|j) is the
cost of misclassifying a group j item as group i and P(Cj) is the a priori probability of an
item belonging to group j.

* This is true asymptotically, i.e. for large training samples. Large training samples are
required for S, the pooled sample variance matrix, to be a good estimate of the population
variance matrix.
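As a small illustration of the adjustment just described, the snippet below shifts the intercepts from Output 1 using assumed (made-up) costs and priors:

```python
import numpy as np

c1, c2 = -73.1602, -51.4214      # intercepts of classification functions 1 and 2 (Output 1)
cost_2_given_1 = 1.0             # C(2|1): cost of calling a true owner a non-owner (assumed)
cost_1_given_2 = 5.0             # C(1|2): cost of calling a true non-owner an owner (assumed)
p1, p2 = 0.5, 0.5                # a priori class probabilities (assumed)

c1_adj = c1 + np.log(cost_2_given_1) + np.log(p1)
c2_adj = c2 + np.log(cost_1_given_2) + np.log(p2)
```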
Extension to more than two classes
The above analysis for two classes is readily extended to more than two classes. Example
2 illustrates this setting.

Example 2: Fisher's Iris Data. This is a classic example used by Fisher to illustrate his
method for computing classification functions. The data consist of four
measurements on different varieties of iris flowers. Fifty different flowers were measured
for each species of iris. A sample of the data is given in Table 4 below:
Table 4
OBS#   SPECIES           CLASSCODE   SEPLEN   SEPW   PETLEN   PETW
  1    Iris-setosa           1         5.1     3.5     1.4     0.2
  2    Iris-setosa           1         4.9     3.0     1.4     0.2
  3    Iris-setosa           1         4.7     3.2     1.3     0.2
  4    Iris-setosa           1         4.6     3.1     1.5     0.2
  5    Iris-setosa           1         5.0     3.6     1.4     0.2
  6    Iris-setosa           1         5.4     3.9     1.7     0.4
  7    Iris-setosa           1         4.6     3.4     1.4     0.3
  8    Iris-setosa           1         5.0     3.4     1.5     0.2
  9    Iris-setosa           1         4.4     2.9     1.4     0.2
 10    Iris-setosa           1         4.9     3.1     1.5     0.1
...
 51    Iris-versicolor       2         7.0     3.2     4.7     1.4
 52    Iris-versicolor       2         6.4     3.2     4.5     1.5
 53    Iris-versicolor       2         6.9     3.1     4.9     1.5
 54    Iris-versicolor       2         5.5     2.3     4.0     1.3
 55    Iris-versicolor       2         6.5     2.8     4.6     1.5
 56    Iris-versicolor       2         5.7     2.8     4.5     1.3
 57    Iris-versicolor       2         6.3     3.3     4.7     1.6
 58    Iris-versicolor       2         4.9     2.4     3.3     1.0
 59    Iris-versicolor       2         6.6     2.9     4.6     1.3
 60    Iris-versicolor       2         5.2     2.7     3.9     1.4
...
101    Iris-virginica        3         6.3     3.3     6.0     2.5
102    Iris-virginica        3         5.8     2.7     5.1     1.9
103    Iris-virginica        3         7.1     3.0     5.9     2.1
104    Iris-virginica        3         6.3     2.9     5.6     1.8
105    Iris-virginica        3         6.5     3.0     5.8     2.2
106    Iris-virginica        3         7.6     3.0     6.6     2.1
107    Iris-virginica        3         4.9     2.5     4.5     1.7
108    Iris-virginica        3         7.3     2.9     6.3     1.8
109    Iris-virginica        3         6.7     2.5     5.8     1.8
110    Iris-virginica        3         7.2     3.6     6.1     2.5

The results from applying the discriminant analysis procedure of XLMiner are shown in
Output 2:
Output 2

Classification Functions
Variables    Function 1      Function 2      Function 3
Constant     -86.3084793     -72.8526154    -104.368332
SEPLEN        23.5441742      15.6982136      12.4458504
SEPW          23.5878677       7.07251072      3.68528175
PETLEN       -16.4306431       5.21144867     12.7665491
PETW         -17.398407        6.43422985     21.0791111

Canonical Variate Loadings
Variables     Variate 1       Variate 2
SEPLEN        0.06840593      0.00198865
SEPW          0.12656119      0.17852645
PETLEN       -0.18155289     -0.0768638
PETW         -0.23180288      0.23417209

Training Misclassification Summary

Classification Confusion Matrix
                   Predicted 1   Predicted 2   Predicted 3
Actual Class 1          50             0             0
Actual Class 2           0            48             2
Actual Class 3           0             1            49

Error Report
Class      # Cases   # Errors   % Error
1             50         0        0.00
2             50         2        4.00
3             50         1        2.00
Overall      150         3        2.00
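For readers who want to reproduce an analysis of this kind outside XLMiner, a minimal sketch using scikit-learn's linear discriminant analysis is shown below. Its internal parameterization differs from the classification functions above, but the fitted classifier should give a very similar training confusion matrix.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import confusion_matrix

X, y = load_iris(return_X_y=True)            # 150 flowers, 4 measurements, 3 species
lda = LinearDiscriminantAnalysis().fit(X, y)
pred = lda.predict(X)                        # re-substitution (training) predictions
print(confusion_matrix(y, pred))             # expect only a handful of training errors
print((pred != y).mean())                    # training error rate
```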

For illustration the computations of the classification function values for observations 40
to 55 and 125 to 135 are shown in Table 5.
Table 5
Obs   Actual   Predicted   Value for   Value for   Value for   SEPLEN   SEPW   PETLEN   PETW
      Class    Class       Class 1     Class 2     Class 3
 40     1         1          85.84       40.36       -5.00       5.1     3.4     1.5     0.2
 41     1         1          87.39       39.10       -6.32       5.0     3.5     1.3     0.3
 42     1         1          47.31       22.76      -16.97       4.5     2.3     1.3     0.3
 43     1         1          67.93       26.91      -17.00       4.4     3.2     1.3     0.2
 44     1         1          77.24       42.59        3.83       5.0     3.5     1.6     0.6
 45     1         1          85.22       46.56        5.80       5.1     3.8     1.9     0.4
 46     1         1          69.24       32.94       -9.38       4.8     3.0     1.4     0.3
 47     1         1          93.63       43.71       -2.25       5.1     3.8     1.6     0.2
 48     1         1          70.99       30.57      -13.24       4.6     3.2     1.4     0.2
 49     1         1          97.63       45.62       -1.40       5.3     3.7     1.5     0.2
 50     1         1          82.77       37.56       -7.89       5.0     3.3     1.4     0.2
 51     2         2          52.40       93.17       84.06       7.0     3.2     4.7     1.4
 52     2         2          39.82       83.35       76.15       6.4     3.2     4.5     1.5
 53     2         2          42.66       92.58       87.11       6.9     3.1     4.9     1.5
 54     2         2           9.10       58.96       51.03       5.5     2.3     4.0     1.3
 55     2         2          31.10       82.61       77.19       6.5     2.8     4.6     1.5
125     3         3          19.09       98.88      108.22       6.7     3.3     5.7     2.1
126     3         3          28.79      105.66      111.58       7.2     3.2     6.0     1.8
127     3         3          15.53       80.88       82.34       6.2     2.8     4.8     1.8
128     3         3          16.25       81.24       83.11       6.1     3.0     4.9     1.8
129     3         3           1.87       90.11      101.36       6.4     2.8     5.6     2.1
130     3         3          30.84      101.91      104.07       7.2     3.0     5.8     1.6
131     3         3          20.68      107.13      115.98       7.4     2.8     6.1     1.9
132     3         3          49.37      124.26      131.82       7.9     3.8     6.4     2.0
133     3         3           0.13       90.76      103.47       6.4     2.8     5.6     2.2
134     3         2          18.17       82.08       81.09       6.3     2.8     5.1     1.5
135     3         3           2.27       79.49       82.14       6.1     2.6     5.6     1.4

Each observation is assigned to the class whose classification function value is largest; observation 134 is the only misclassification in this subset.


Canonical Variate Loadings


The canonical variate loadings are useful for graphical representation of the discriminant
analysis results. These loadings are used to map the observations to lower dimensions
while minimizing loss of separability information between the groups.
Fig. 3 shows the canonical values for Example 1. The number of canonical variates is the
minimum of one less than the number of classes and the number of variables in the data.
In this example this is Min( 2-1 , 2 ) = 1. So the 24 observations are mapped into 24
points in one dimension ( a line). We have condensed the separability information into 1
dimension from the 2 dimensions in the original data. Notice where the mapped values of
the misclassified points fall relative to the separation between the two groups.
Canonical scores for the 24 observations of Example 1 (shown in Fig. 3):

Obs   Actual Class   Predicted Class   Canonical Score 1
  1        1               2                2.109
  2        1               1                2.242
  3        1               1                2.417
  4        1               1                2.318
  5        1               1                2.808
  6        1               1                2.691
  7        1               1                2.540
  8        1               1                2.668
  9        1               1                2.331
 10        1               1                2.644
 11        1               1                2.307
 12        1               1                2.455
 13        2               1                2.361
 14        2               2                2.228
 15        2               2                2.061
 16        2               2                2.097
 17        2               1                2.292
 18        2               2                1.932
 19        2               2                1.908
 20        2               2                2.171
 21        2               2                1.817
 22        2               2                1.862
 23        2               2                1.660
 24        2               2                1.848

(The misclassified observations 1, 13 and 17 fall near the boundary between the two groups on this one-dimensional scale.)

In the case of the iris data we would condense the separability information into 2 dimensions.
If we had c classes and p variables, and Min(c-1, p) > 2, we can plot only the first two
canonical values for each observation. In such datasets we often still get insight into
the separation of the observations by plotting them in these two co-ordinates.
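A quick way to produce such a plot is sketched below; scikit-learn's transform() returns canonical variate scores up to its own scaling conventions, so the picture rather than the exact numbers is what should match.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)
scores = LinearDiscriminantAnalysis(n_components=2).fit(X, y).transform(X)
plt.scatter(scores[:, 0], scores[:, 1], c=y)     # the three species separate clearly
plt.xlabel("Canonical variate 1")
plt.ylabel("Canonical variate 2")
plt.show()
```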


Extension to unequal covariance structures


When the classification variables follow a multivariate normal distribution with variance
matrices that differ substantially between different groups, the linear classification rule is
no longer optimal. In that case the optimal classification function is quadratic in the
classification variables. However, in practice this has not been found to be useful except
when the difference in the variance matrices is large and the number of observations
available for training and testing is large. The reason is that the quadratic model requires
many more parameters, all of which are subject to estimation error. If there are c classes
and p variables, the number of parameters to be estimated for the different variance
matrices is cp(p + 1)/2. This is an example of the importance of regularization in practice.
Logistic discrimination for categorical and non-normal situations
We often encounter situations in which the classification variables are discrete, even
binary. In these situations, and where we have reason to believe that the classification
variables are not approximately multivariate normal, we can use a more generally
applicable classification technique based on logistic regression.


LOGISTIC REGRESSION
Nitin R Patel
Logistic regression extends the ideas of multiple linear regression to the situation
where the dependent variable, y, is binary (for convenience we often code these values as
0 and 1). As with multiple linear regression the independent variables x1, x2, ..., xk may
be categorical or continuous variables or a mixture of these two types.
Let us take some examples to illustrate [1]:
Example 1: Market Research
The data in Table 1 were obtained in a survey conducted by AT & T in the US
from a national sample of co-operating households. Interest was centered on the adoption
of a new telecommunications service as it related to education, residential stability and
income.
Table 1: Adoption of New Telephone Service

                        High School or below                      Some College or above
                  No change in        Change in            No change in        Change in
                  residence during    residence during     residence during    residence during
                  last five years     last five years      last five years     last five years
Low Income        153/2160 = 0.071    226/1137 = 0.199      61/886  = 0.069    233/1091 = 0.214
High Income       147/1363 = 0.108    139/547  = 0.254     287/1925 = 0.149    382/1415 = 0.270

(For the fractions in the cells above, the numerator is the number of adopters out of the
number in the denominator.)
Note that the overall probability of adoption in the sample is 1628/10524 = 0.155.
However, the adoption probability varies depending on the categorical independent
variables: education, residential stability and income. The lowest value is 0.069 for
low-income, no-residence-change households with some college education, while the highest
is 0.270 for high-income residence changers with some college education.


The standard multiple linear regression model is inappropriate to model this data for
the following reasons:
1. The model's predicted probabilities could fall outside the range 0 to 1.
2. The dependent variable is not normally distributed. In fact a binomial model would
be more appropriate. For example, if a cell total is 11 then this variable can take
on only 12 distinct values: 0, 1, 2, ..., 11. Think of the response of the households in
a cell as being determined by independent flips of a coin with, say, heads representing
adoption, with the probability of heads varying between cells.
3. If we consider the normal distribution as an approximation for the binomial model,
the variance of the dependent variable is not constant across all cells: it will be
higher for cells where the probability of adoption, p, is near 0.5 than where it is
near 0 or 1. It will also increase with the total number of households, n, falling in
the cell. The variance equals np(1 - p).
The logistic regression model was developed to account for all these difficulties. It
has become very popular in describing choice behavior in econometrics and in modeling
risk factors in epidemiology. In the context of choice behavior it can be shown to follow
from the random utility theory developed by Manski [2] as an extension of the standard
economic theory of consumer behavior.
In essence the consumer theory states that when faced with a set of choices a consumer
makes the choice which has the highest utility (a numeric measure of worth with arbitrary
zero and scale). It assumes that the consumer has a preference order on the list of choices
that satisfies reasonable criteria such as transitivity. The preference order can depend on
the individual (e.g. socioeconomic characteristics as in Example 1 above) as well as

attributes of the choice. The random utility model considers the utility of a choice to
incorporate a random element. When we model the random element as coming from a
reasonable distribution, we can logically derive the logistic model for predicting choice
behavior.
If we let y = 1 represent choosing an option versus y = 0 for not choosing it, the
logistic regression model stipulates:

    Probability(Y = 1 | x1, x2, ..., xk) = exp(b0 + b1 x1 + ... + bk xk) / [1 + exp(b0 + b1 x1 + ... + bk xk)]

where b0, b1, b2, ..., bk are unknown constants analogous to those of the multiple linear regression
model.
The independent variables for our model would be:
x1 (Education): High School or below = 0, Some College or above = 1
x2 (Residential Stability): No change over past five years = 0, Change over past five years = 1
x3 (Income): Low = 0, High = 1
The data in Table 1 are shown below in the format typically required by regression
programs.
x1   x2   x3   # in sample   # adopters   # non-adopters   Fraction adopters
 0    0    0       2160          153           2007              0.071
 0    0    1       1363          147           1216              0.108
 0    1    0       1137          226            911              0.199
 0    1    1        547          139            408              0.254
 1    0    0        886           61            825              0.069
 1    1    0       1091          233            858              0.214
 1    0    1       1925          287           1638              0.149
 1    1    1       1415          382           1033              0.270
Total             10524         1628           8896              0.155
The logistic model for this example is:

    Prob(Y = 1 | x1, x2, x3) = exp(b0 + b1 x1 + b2 x2 + b3 x3) / [1 + exp(b0 + b1 x1 + b2 x2 + b3 x3)].

We obtain a useful interpretation for the coefficients by noting that:

    exp(b0) = Prob(Y = 1 | x1 = x2 = x3 = 0) / Prob(Y = 0 | x1 = x2 = x3 = 0)
            = Odds of adopting in the base case (x1 = x2 = x3 = 0)

    exp(b1) = Odds of adopting when x1 = 1, x2 = x3 = 0  /  Odds of adopting in the base case

    exp(b2) = Odds of adopting when x2 = 1, x1 = x3 = 0  /  Odds of adopting in the base case

    exp(b3) = Odds of adopting when x3 = 1, x1 = x2 = 0  /  Odds of adopting in the base case

The logistic model is multiplicative in odds in the following sense:

    Odds of adopting for given x1, x2, x3
        = exp(b0) x exp(b1 x1) x exp(b2 x2) x exp(b3 x3)
        = (Odds for base case) x (Factor due to x1) x (Factor due to x2) x (Factor due to x3)

If x1 = 1 the odds of adoption get multiplied by the same factor for any given level of
x2 and x3. Similarly the multiplicative factors for x2 and x3 do not vary with the levels
of the remaining variables. The factor for a variable gives us the impact of the presence of
that factor on the odds of adopting.
If bi = 0, the presence of the corresponding factor has no effect (multiplication by
one). If bi < 0, presence of the factor reduces the odds (and the probability) of adoption,
whereas if bi > 0, presence of the factor increases the probability of adoption.
The computations required to produce these maximum likelihood estimates require
iterations using a computer program. The output of a typical program is shown below:

                                                      95% Conf. Intvl. for odds
Variable    Coeff.    Std. Error   p-Value   Odds    Lower Limit   Upper Limit
Constant    -2.500      0.058       0.000    0.082      0.071         0.095
x1           0.161      0.058       0.006    1.175      1.048         1.316
x2           0.992      0.056       0.000    2.698      2.416         3.013
x3           0.444      0.058       0.000    1.560      1.393         1.746
From the estimated values of the coefficients, we see that the estimated probability of
adoption for a household with values x1, x2 and x3 for the independent variables is:

    Prob(Y = 1 | x1, x2, x3) = exp(-2.500 + 0.161 x1 + 0.992 x2 + 0.444 x3) / [1 + exp(-2.500 + 0.161 x1 + 0.992 x2 + 0.444 x3)].
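This equation is easy to evaluate directly. The short sketch below reproduces the estimated probabilities and expected numbers of adopters for the eight cells of Table 1 (the results should match the table that follows, up to rounding):

```python
import numpy as np

# Coefficients of the fitted model above (constant, x1, x2, x3).
beta = np.array([-2.500, 0.161, 0.992, 0.444])

# The eight cells: [x1, x2, x3] and the number of households in each cell.
cells = np.array([[0,0,0],[0,0,1],[0,1,0],[0,1,1],[1,0,0],[1,1,0],[1,0,1],[1,1,1]])
n = np.array([2160, 1363, 1137, 547, 886, 1091, 1925, 1415])

z = beta[0] + cells @ beta[1:]
p = np.exp(z) / (1 + np.exp(z))      # estimated adoption probability for each cell
expected_adopters = n * p            # estimated number of adopters per cell
print(np.round(p, 3))
print(np.round(expected_adopters, 0))
```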

The estimated number of adopters from this model will be the total number of households
with values x1 , x2 and x3 for the independent variables multiplied by the above probability.
The table below shows the estimated number of adopters for the various combinations
of the independent variables.

x1   x2   x3   # in sample   # adopters   Estimated # adopters   Fraction adopters   Estimated Prob(Y=1|x1,x2,x3)
 0    0    0       2160          153              164                 0.071                 0.076
 0    0    1       1363          147              155                 0.108                 0.113
 0    1    0       1137          226              206                 0.199                 0.181
 0    1    1        547          139              140                 0.254                 0.257
 1    0    0        886           61               78                 0.069                 0.088
 1    1    0       1091          233              225                 0.214                 0.206
 1    0    1       1925          287              252                 0.149                 0.131
 1    1    1       1415          382              408                 0.270                 0.289

In data mining applications we will have validation data, that is, a hold-out sample not
used in fitting the model.
Let us suppose we have the following validation data consisting of 598 households:

x1   x2   x3   # in validation   # adopters in       Estimated     Error               Absolute Value
               sample            validation sample   # adopters    (Estimate - Actual)   of Error
 0    0    0        29                  3               2.200          -0.800               0.800
 0    0    1        23                  7               2.610          -4.390               4.390
 0    1    0       112                 25              20.302          -4.698               4.698
 0    1    1       143                 27              36.705           9.705               9.705
 1    0    0        27                  2               2.374           0.374               0.374
 1    1    0        54                 12              11.145          -0.855               0.855
 1    0    1       125                 13              16.338           3.338               3.338
 1    1    1        85                 30              24.528          -5.472               5.472
Totals             598                119             116.202

The total error is -2.8 adopters, or a percentage error in estimating adopters of -2.8/119 = -2.3%.
The average percentage absolute error is
    (0.800 + 4.390 + 4.698 + 9.705 + 0.374 + 0.855 + 3.338 + 5.472) / 119 = 0.249,
i.e. 24.9% of the actual number of adopters.
The confusion matrix for households in the validation data set is given below:

                            Observed
Predicted            Adopters   Non-adopters   Total
Adopters                103          13          116
Non-adopters             16         466          482
Total                   119         479          598

As with multiple linear regression we can build more complex models that reflect
interactions between independent variables by including factors that are calculated from
the interacting factors. For example, if we felt that there is an interactive effect between
x1 and x2 we would add an interaction term x4 = x1 x2.

Example 2: Financial Conditions of Banks [2]

Table 2 gives data on a sample of banks. The second column records the judgment of
an expert on the financial condition of each bank. The last two columns give the values
of two ratios commonly used in financial analysis of banks.
Table 2: Financial Conditions of Banks

Obs   Financial Condition (y)   Total Loans & Leases / Total Assets (x1)   Total Expenses / Total Assets (x2)
  1             1                              0.64                                     0.13
  2             1                              1.04                                     0.10
  3             1                              0.66                                     0.11
  4             1                              0.80                                     0.09
  5             1                              0.69                                     0.11
  6             1                              0.74                                     0.14
  7             1                              0.63                                     0.12
  8             1                              0.75                                     0.12
  9             1                              0.56                                     0.16
 10             1                              0.65                                     0.12
 11             0                              0.55                                     0.10
 12             0                              0.46                                     0.08
 13             0                              0.72                                     0.08
 14             0                              0.43                                     0.08
 15             0                              0.52                                     0.07
 16             0                              0.54                                     0.08
 17             0                              0.30                                     0.09
 18             0                              0.67                                     0.07
 19             0                              0.51                                     0.09
 20             0                              0.79                                     0.13

Financial Condition = 1 for financially weak banks; = 0 for financially strong banks.
Let us first consider a simple logistic regression model with just one independent
variable. This is analogous to the simple linear regression model in which we fit a straight
line to relate the dependent variable, y, to a single independent variable, x.

Let us construct a simple logistic regression model for classification of banks using
the Total Loans & Leases to Total Assets ratio as the independent variable in our model.
This model would have the following variables:

Dependent variable:
    Y = 1, if financially distressed,
      = 0, otherwise.

Independent (or explanatory) variable:
    x1 = Total Loans & Leases / Total Assets ratio

The equation relating the dependent variable to the explanatory variable is:

    Prob(Y = 1 | x1) = exp(b0 + b1 x1) / [1 + exp(b0 + b1 x1)]

or, equivalently,

    Odds(Y = 1 versus Y = 0) = exp(b0 + b1 x1).

The maximum likelihood estimates of the coefficients for the model are: b0 = -6.926,
b1 = 10.989, so that the fitted model is:

    Prob(Y = 1 | x1) = exp(-6.926 + 10.989 x1) / [1 + exp(-6.926 + 10.989 x1)].

Figure 1 displays the data points and the fitted logistic regression model.

We can think of the model as a multiplicative model of odds ratios, as we did for
Example 1. The odds that a bank with a Loans & Leases/Assets ratio of zero will
be in financial distress are exp(-6.926) = 0.001. These are the base case odds. The
odds of distress for a bank with a ratio of 0.6 increase by a multiplicative factor of
exp(10.989 x 0.6) = 730 over the base case, so the odds that such a bank will be in financial
distress are about 0.001 x 730 = 0.730.
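These odds calculations take only a few lines to check (the small difference from 0.730 below comes from the rounding of the base case odds to 0.001 in the text):

```python
import math

b0, b1 = -6.926, 10.989                      # fitted coefficients reported above
base_odds = math.exp(b0)                     # about 0.001: odds of distress at ratio = 0
odds_at_06 = base_odds * math.exp(b1 * 0.6)  # about 0.72
p_at_06 = odds_at_06 / (1 + odds_at_06)      # corresponding probability, about 0.42
print(round(base_odds, 4), round(odds_at_06, 3), round(p_at_06, 3))
```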
Notice that there is a small difference in the interpretation of the multiplicative factors for
this example compared to Example 1. While the interpretation of the sign of bi remains
as before, its magnitude gives the amount by which the odds of Y = 1 against Y = 0
change for a unit change in xi. If we construct a simple logistic regression model for
classification of banks using the Total Expenses/Total Assets ratio as the independent
variable we would have the following variables:
Dependent variable:
    Y = 1, if financially distressed,
      = 0, otherwise.

Independent (or explanatory) variable:
    x2 = Total Expenses / Total Assets ratio

The equation relating the dependent variable to the explanatory variable is:

    Prob(Y = 1 | x2) = exp(b0 + b2 x2) / [1 + exp(b0 + b2 x2)]

or, equivalently,

    Odds(Y = 1 versus Y = 0) = exp(b0 + b2 x2).

The maximum likelihood estimates of the coefficients for the model are: b0 = -9.587,
b2 = 94.345.

Figure 2 displays the data points and the fitted logistic regression model.

Computation of Estimates
As illustrated in Examples 1 and 2, estimation of coefficients is usually carried out
based on the principle of maximum likelihood, which ensures good asymptotic (large
sample) properties for the estimates. Under very general conditions maximum likelihood
estimators are:
Consistent: the probability of the estimator differing from the true value approaches
zero with increasing sample size;
Asymptotically efficient: the variance is the smallest possible among consistent
estimators;
Asymptotically normally distributed: this allows us to compute confidence intervals
and perform statistical tests in a manner analogous to the analysis of multiple linear
regression models, provided the sample size is large.
Algorithms to compute the coefficient estimates and confidence intervals are iterative
and less robust than algorithms for linear regression. Computed estimates are generally
reliable for well-behaved datasets where the numbers of observations with dependent
variable values of both 0 and 1 are large, their ratio is not too close to either zero or
one, and the number of coefficients in the logistic regression model is small relative
to the sample size (say, no more than 10%). As with linear regression, collinearity (strong
correlation amongst the independent variables) can lead to computational difficulties.
Computationally intensive algorithms have been developed recently that circumvent some
of these difficulties [3].


Appendix A
Computing Maximum Likelihood Estimates and Confidence Intervals for Regression Coefficients

We denote the coefficients by the p x 1 column vector b with the i-th element equal to bi.
The n observed values of the dependent variable are denoted by the n x 1 column vector y
with the j-th element equal to yj, and the corresponding values of independent variable i
by xij, for i = 1, ..., p and j = 1, ..., n.

Data: yj, x1j, x2j, ..., xpj, for j = 1, 2, ..., n.

Likelihood function: The likelihood function, L, is the probability of the observed
data viewed as a function of the parameters (the bi in a logistic regression):

    L = Prod(j=1..n) exp[yj(b1 x1j + b2 x2j + ... + bp xpj)] / [1 + exp(b1 x1j + b2 x2j + ... + bp xpj)]

(with the constant term, if any, folded in as the coefficient of an extra variable that always
equals one). Writing Sum(i) bi xij for the linear combination, this simplifies to

    L = Prod(j=1..n) exp(yj Sum(i) bi xij) / [1 + exp(Sum(i) bi xij)]
      = exp(Sum(i) bi ti) / Prod(j=1..n) [1 + exp(Sum(i) bi xij)]

where ti = Sum(j) yj xij. The ti are the sufficient statistics for the logistic regression model,
analogous to the corresponding summary statistics in linear regression.

Loglikelihood function: This is the logarithm of the likelihood function,

    l = Sum(i) bi ti - Sum(j) log[1 + exp(Sum(i) bi xij)].

We find the maximum likelihood estimates of bi by maximizing the loglikelihood
function for the observed values of yj and xij in our data. Since maximizing the log of
a function is equivalent to maximizing the function, we work with the loglikelihood
because it is generally less cumbersome for mathematical operations such as
differentiation. Since the loglikelihood function can be shown to be concave, we find the
global maximum of the function (if it exists) by equating the partial derivatives of the
loglikelihood to zero and solving the resulting nonlinear equations for bi:

    dl/dbi = ti - Sum(j) xij exp(Sum(i) bi xij) / [1 + exp(Sum(i) bi xij)] = ti - Sum(j) xij pj = 0,   i = 1, 2, ..., p

or

    Sum(j) xij pj = ti,   where pj = exp(Sum(i) bi xij) / [1 + exp(Sum(i) bi xij)] = E(Yj).

An intuitive way to understand these equations is to note that

    Sum(j) xij E(Yj) = Sum(j) xij yj.

In words, the maximum likelihood estimates are such that the expected values of the
sufficient statistics equal their observed values.

Note: If the model includes a constant term, then xij = 1 for all j for that term, so
Sum(j) E(Yj) = Sum(j) yj, i.e. the expected number of successes (responses of one) using the MLE
estimates of bi equals the observed number of successes. The estimates are consistent,
asymptotically efficient and follow a multivariate normal distribution (subject to mild
regularity conditions).

Algorithm: A popular algorithm for computing the estimates uses the Newton-Raphson
method for maximizing twice differentiable functions of several variables (see Appendix B).

The Newton-Raphson method involves computing the following successive approximations
to find the maximizer of the loglikelihood:

    b(t+1) = b(t) + [I(b(t))]^(-1) grad l(b(t))

where I is the information matrix with elements

    Iij = - d2l / (dbi dbj)

and grad l is the vector of first partial derivatives, both evaluated at b(t).
On convergence, the diagonal elements of I(b(t))^(-1) give the squared standard errors
(approximate variances) of the estimates. Confidence intervals and hypothesis tests are
based on the asymptotic normal distribution of the estimates.

The loglikelihood function is always negative and does not have a maximum when it
can be made arbitrarily close to zero. In that case the likelihood function can be made
arbitrarily close to one: the predicted probabilities for observations with yj = 0 can be made
arbitrarily close to 0 and those for yj = 1 arbitrarily close to 1 by choosing suitably large
absolute values of some bi. This is the situation when we have a perfect model (at least in
terms of the training data set)! This phenomenon is more likely to occur when the number
of parameters is a large fraction (say > 20%) of the number of observations.

Appendix B
The Newton-Raphson Method

This method finds the values of theta that maximize a twice differentiable concave
function, g(theta). If the function is not concave, it finds a local maximum. The
method uses successive quadratic approximations to g based on Taylor series. It
converges rapidly if the starting value, theta(0), is reasonably close to the maximizing
value, theta-hat, of theta.

The gradient vector grad g and the Hessian matrix H, as defined below, are used to
update an estimate theta(t) to theta(t+1):

    grad g(theta(t)) = [ ... dg/d(theta_i) ... ]'        H(theta(t)) = [ ... d2g/(d theta_i d theta_k) ... ]

both evaluated at theta(t). The Taylor series expansion around theta(t) gives us:

    g(theta) ~ g(theta(t)) + grad g(theta(t))' (theta - theta(t)) + (1/2)(theta - theta(t))' H(theta(t)) (theta - theta(t)).

Provided H(theta(t)) is negative definite (so that -H is positive definite), the maximum of this
approximation occurs where its derivative is zero:

    grad g(theta(t)) + H(theta(t)) (theta - theta(t)) = 0,   or   theta = theta(t) - [H(theta(t))]^(-1) grad g(theta(t)).

This gives us a way to compute theta(t+1), the next value in our iterations:

    theta(t+1) = theta(t) - [H(theta(t))]^(-1) grad g(theta(t)).

To use this equation H should be non-singular. This is generally not a problem,
although sometimes numerical difficulties can arise due to collinearity.

Near the maximum the rate of convergence is quadratic: it can be shown that
|theta_i(t+1) - theta_i-hat| <= c |theta_i(t) - theta_i-hat|^2 for some c >= 0 when theta_i(t) is near theta_i-hat for all i.
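A compact implementation of the Newton-Raphson iterations of Appendices A and B for logistic regression might look as follows. This is a sketch, not the code used by any particular package, and the usage comment at the end refers to hypothetical arrays.

```python
import numpy as np

def logistic_mle(X, y, iters=25):
    """Newton-Raphson maximum likelihood estimation for logistic regression.

    X : (n, p) matrix of independent variables (include a column of 1s for the constant).
    y : (n,) vector of 0/1 responses.
    Returns the estimated coefficients and their approximate standard errors.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        pi = 1.0 / (1.0 + np.exp(-X @ beta))     # E(Y_j) under the current coefficients
        grad = X.T @ (y - pi)                    # t_i - sum_j x_ij * pi_j
        W = pi * (1.0 - pi)                      # binomial variance terms
        info = X.T @ (W[:, None] * X)            # information matrix I = X'WX
        beta = beta + np.linalg.solve(info, grad)
    se = np.sqrt(np.diag(np.linalg.inv(info)))   # approximate standard errors
    return beta, se

# Usage sketch on the bank data of Table 2 (x1_values, y_values are hypothetical arrays):
# X = np.column_stack([np.ones(20), x1_values]); beta_hat, se = logistic_mle(X, y_values)
```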


Sales of Handloom Saris

An Application of Logistic Regression

Objectives
Illustrate the importance of interpretation, and of domain insights from managers, for
interpretation and implementation
Relevance to situations where there are too many products (or services), but where more
stable underlying characteristics of the products (or services) can be defined
Logistic regression as a tool that parallels multiple linear regression in practice; powerful
analysis in a spreadsheet

Handloom Industry in India


Decentralized, traditional, rural, co-ops
Direct employment of 10 million persons
Accounts for 30% of total textile production

Co-optex (Tamilnadu State)


Large: 700 outlets; $30million; 400,000 looms
Strengths:
Design variety, short run lengths
Majority sales through co-op shops

Weaknesses:
Competing with mills difficult
Large inventories, high discount sales

Study Question
Improve feedback of market to designs
through improved product codes
Assess economic impact of proposed code
Pilot restricted to saris
Most difficult
Most valuable

A Consumer-oriented Code for Saris
Developed with National Institute of Design

Sari components: Body, Border, Pallav

Sari Code
Body:   Warp Color & Shade (WRPC, WRPS); Weft Color & Shade (WFTC, WFTS); Body Design (BODD)
Border: Color, Shade, Design, Size (BRDC, BRDS, BRDD, BRDZ)
Pallav: Color, Shade, Design, Size (PLVC, PLVS, PLVD, PLVZ)

Code Levels
Color (Warp, weft, border, pallav)
10 levels:0=red, 1=blue, 2=green, etc.

Shade (Warp, weft, border, pallav)


4 levels: 0=light, 1=medium, 2=dark, 3=shiny;

Design (Body, border, pallav)


23 levels: 0=plain, 1=star buttas, 2=chakra buttas, etc.

Size (Border, pallav)


3 levels: 0= broad, 1=medium, 2=narrow

Assessing Impact
Major Marketing Experiment
14 day high season period selected
18 largest retail shops selected
20,000 saris coded, sales during period
recorded
Logistic Regression models developed for
Pr(sale of sari during period) as function of
coded values.

Example data (Plain saris)


Sari#  WrpCl  BrdClr  WftClr  PlvClr  WrpSh  BrdSh  WftSh  PlvSh  BrdDs  PlvDs  BrdSz  PlvSz  Response
  1      2      2       2       2       2      3      2      3      0      1      0      2       1
  2      0      2       0       2       2      3      2      3      0      1      0      0       1
  3      0      2       0       2       2      3      2      3      0      1      1      2       1
  4      1      2       1       2       0      3      0      3      0      1      1      2       1
  5      1      2       1       8       1      3      1      3      0      1      0      1       1
  6      4      2       4       8       2      3      2      3      0      1      0      1       1
  7      0      1       3       2       0      2      2      3      0      1      0      1       0
  8      1      2       1       2       2      3      2      3      0      1      0      1       1
  9      1      2       1       2       0      3      0      3      1      1      2      2       1
 10      4      2       2       2       1      3      1      3      1      1      2      2       1
 11      1      1       1       2       0      2      0      3      0      1      0      2       1

Logistic Regression Model

Odds(Sale) = exp(b0 + b1 WRPC_1 + b2 WRPC_2 + b3 WRPC_3 + b4 WRPC_4
                 + b5 PLVD_1 + b6 BRDZ_1 + b7 BRDZ_2)

Coefficient Estimates
Variable    Coeff    Odds
Constant   -0.698      -
WrpCl_1     0.195    1.215
WrpCl_2    -2.220    0.109
WrpCl_3    -2.424    0.089
WrpCl_4    -0.072    0.931
PlvDs_1     1.866    6.462
BrdSz_1    -0.778    0.459
BrdSz_2    -0.384    0.681
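To see how the fitted model is used, the sketch below computes the odds and probability of sale for a hypothetical sari profile. The dummy variables are assumed to be the usual 0/1 indicators for the listed code levels; that coding convention is an assumption, not something stated on this slide.

```python
import math

# Coefficients from the table above.
coef = {"const": -0.698, "WRPC_1": 0.195, "WRPC_2": -2.220, "WRPC_3": -2.424,
        "WRPC_4": -0.072, "PLVD_1": 1.866, "BRDZ_1": -0.778, "BRDZ_2": -0.384}

# Hypothetical sari: warp color 2, pallav design 1, broad border (BRDZ_1 = BRDZ_2 = 0).
x = {"WRPC_1": 0, "WRPC_2": 1, "WRPC_3": 0, "WRPC_4": 0,
     "PLVD_1": 1, "BRDZ_1": 0, "BRDZ_2": 0}

log_odds = coef["const"] + sum(coef[k] * v for k, v in x.items())
odds = math.exp(log_odds)
prob_sale = odds / (1 + odds)
print(round(odds, 3), round(prob_sale, 3))
```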

Confusion Table (cut-off probability = 0.5)

                        Actual
Predicted        Sale    No Sale    Total
Sale              15        5         20
No Sale            5       32         37
Total             20       37         57

Impact
Producing only saris that have predicted probability > 0.5 will reduce slow-moving
stock substantially. In the example, slow-moving stock would go down from 65% of
production to 25% of production
Even a cut-off probability of 0.2 reduces slow stock to 49% of production

Insights
Certain colors and combinations sold much worse than average but were routinely
produced (e.g. green; border width-body color interaction)
Converse of the above (e.g. plain designs, light shade body)
The above adjustments are possible within weavers' skill and equipment constraints
Huge potential for cost savings in silk saris
Need for streamlining the code, and for training to code

Reasons for versatility of Logistic Regression Models in Applications
Derivable from random utility theory of discrete choice
Intuitive model for choice-based samples and case-control studies
Derivable from latent continuous variable model
Logistic distribution indistinguishable from Normal within a 2 standard deviations range
Derivable from Normal population models of discrimination (pooled covariance matrix)
Fast algorithms
Extends to multiple choices (polytomous regression)
Small sample exact analysis useful for rare events (e.g. fraud, accidents, lack of relevant
data, small segment of data)

Lecture 6

Artificial Neural Networks

In this note we provide an overview of the key concepts that have led to
the emergence of Artificial Neural Networks as a major paradigm for Data
Mining applications. Neural nets have gone through two major development
periods - the early 60s and the mid 80s. They were a key development in
the field of machine learning. Artificial Neural Networks were inspired by
biological findings relating to the behavior of the brain as a network of units
called neurons. The human brain is estimated to have around 10 billion
neurons, each connected on average to 10,000 other neurons. Each neuron
receives signals through synapses that control the effects of the signal on
the neuron. These synaptic connections are believed to play a key role in
the behavior of the brain. The fundamental building block in an Artificial
Neural Network is the mathematical model of a neuron as shown in Figure
1. The three basic components of the (artificial) neuron are:
1. The synapses or connecting links that provide weights, wj, to the
input values, xj, for j = 1, ..., m;
2. An adder that sums the weighted input values to compute the
input to the activation function, v = w0 + Sum(j=1..m) wj xj, where w0 is called the
bias (not to be confused with statistical bias in prediction or estimation),
a numerical value associated with the neuron. It is convenient to think of
the bias as the weight for an input x0 whose value is always equal to one,
so that v = Sum(j=0..m) wj xj;
3. An activation function g (also called a squashing function) that
maps v to g(v), the output value of the neuron. This function is a monotone
function.

Figure 1

While there are numerous different (artificial) neural network architectures that have
been studied by researchers, the most successful applications in data mining of neural
networks have been multilayer feedforward networks. These are networks in which there
is an input layer consisting of nodes that simply accept the input values, and successive
layers of nodes that are neurons as depicted in Figure 1. The outputs of neurons in a layer
are inputs to neurons in the next layer. The last layer is called the output
layer. Layers between the input and output layers are known as hidden
layers. Figure 2 is a diagram for this architecture.

Figure 2

In a supervised setting where a neural net is used to predict a numerical
quantity there is one neuron in the output layer and its output is the prediction. When
the network is used for classification, the output layer typically
has as many nodes as the number of classes, and the output layer node with
the largest output value gives the network's estimate of the class for a given
input. In the special case of two classes it is common to have just one node
in the output layer, the classification between the two classes being made
by applying a cut-off to the output value at the node.

1.1 Single layer networks

Let us begin by examining neural networks with just one layer of neurons
(output layer only, no hidden layers). The simplest network consists of just
one neuron with the function g chosen to be the identity function, g(v) = v
for all v. In this case notice that the output of the network is Sum(j=0..m) wj xj, a
linear function of the input vector x with components xj. If we are modeling
the dependent variable y using multiple linear regression, we can interpret
the neural network as a structure that predicts a value y-hat for a given input
vector x, with the weights being the coefficients. If we choose these weights
to minimize the mean square error using observations in a training set, these
weights would simply be the least squares estimates of the coefficients. The
weights in neural nets are also often designed to minimize mean square error
in a training data set. There is, however, a different orientation in the case
of neural nets: the weights are "learned". The network is presented with
cases from the training data one at a time and the weights are revised after
each case in an attempt to minimize the mean square error. This process of
incremental adjustment of weights based on the error made on training
cases is known as "training" the neural net. The almost universally used
dynamic updating algorithm for the neural net version of linear regression is
known as the Widrow-Hoff rule or the least-mean-squares (LMS) algorithm.
It is simply stated. Let x(i) denote the input vector x for the i-th case used to
train the network, and denote the weights before this case is presented to the net by
the vector w(i). The updating rule is

    w(i+1) = w(i) + eta (y(i) - y-hat(i)) x(i),   with w(0) = 0,

where y-hat(i) is the network's output for case i and eta is a small learning-rate constant.
It can be shown that if the network is trained in this manner by
repeatedly presenting the training observations one at a time, then for suitably
small (absolute) values of eta the network will learn (converge to) the optimal
values of w. Note that the training data may have to be presented several
times for w(i) to be close to the optimal w. The advantage of dynamic
updating is that the network tracks moderate time trends in the underlying
linear model quite effectively.
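A minimal implementation of the LMS updating rule is sketched below; the learning rate and the number of passes are made-up defaults for illustration.

```python
import numpy as np

def lms_train(X, y, eta=0.01, epochs=50):
    """Widrow-Hoff / LMS training of a single linear neuron.

    X : (n, m) array of inputs (a bias input of 1 is prepended here).
    y : (n,) array of target values.
    eta : learning-rate constant; should be small for stable convergence.
    """
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])   # x0 = 1 carries the bias weight
    w = np.zeros(Xb.shape[1])                       # w(0) = 0
    for _ in range(epochs):
        for xi, yi in zip(Xb, y):                   # present cases one at a time
            y_hat = w @ xi                          # network output (identity activation)
            w = w + eta * (yi - y_hat) * xi         # incremental weight update
    return w
```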
If we consider using the single layer neural net for classification into c
classes, we would use c nodes in the output layer. If we think of classical
discriminant analysis in neural network terms, the coefficients in Fisher's
classification functions give us weights for the network that are optimal
if the input vectors come from multivariate normal distributions with a
common covariance matrix.
For classification into two classes, the linear optimization approach that
we examined in class can be viewed as choosing optimal weights in a single
layer neural network using the appropriate objective function.
Maximum likelihood coefficients for logistic regression can also be considered as weights
in a neural network chosen to minimize a function of the residuals
called the deviance. In this case the logistic function g(v) = e^v / (1 + e^v) is the
activation function for the output node.

1.2 Multilayer Neural Networks

Multilayer neural networks are undoubtedly the most popular networks used
in applications. While it is possible to consider many activation functions, in
practice it has been found that the logistic (also called the sigmoid) function
g(v) = e^v / (1 + e^v) as the activation function (or minor variants such as the
tanh function) works best. In fact the revival of interest in neural nets was
sparked by successes in training neural networks using this function in place
of the historical (biologically inspired) step function (the perceptron).
Notice that using a linear activation function achieves nothing in multilayer
networks beyond what can be done with single layer networks with
linear activation functions. The practical value of the logistic function arises
from the fact that it is almost linear in the range where g is between 0.1 and
0.9 but has a squashing effect on very small or very large values of v.
In theory it is sufficient to consider networks with two layers of neurons
(one hidden and one output layer) and this is certainly the case for most
applications. There are, however, a number of situations where three and
sometimes four and five layers have been more effective. For prediction, the
output node is often given a linear activation function to provide forecasts
that are not limited to the zero to one range. An alternative is to scale the
output to the linear part (0.1 to 0.9) of the logistic function.
Unfortunately there is no clear theory to guide us on choosing the number
of nodes in each hidden layer or indeed the number of layers. The common
practice is to use trial and error, although there are schemes for combining
optimization methods such as genetic algorithms with network training for
these parameters.
Since trial and error is a necessary part of neural net applications, it is
important to have an understanding of the standard method used to train
a multilayered network: backpropagation. It is no exaggeration to say that
the speed of the backprop algorithm made neural nets a practical tool in
the manner that the simplex method made linear optimization a practical
tool. The revival of strong interest in neural nets in the mid 80s was in large
measure due to the efficiency of the backprop algorithm.

1.3 Example 1: Fisher's Iris data

Let us look at the iris data that Fisher analyzed using discriminant analysis.
Recall that the data consist of four measurements on three types of iris
flowers. There are 50 observations for each class of iris. A part of the data
is reproduced below.

OBS#   SPECIES           CLASSCODE   SEPLEN   SEPW   PETLEN   PETW
  1    Iris-setosa           1         5.1     3.5     1.4     0.2
  2    Iris-setosa           1         4.9     3.0     1.4     0.2
  3    Iris-setosa           1         4.7     3.2     1.3     0.2
  4    Iris-setosa           1         4.6     3.1     1.5     0.2
  5    Iris-setosa           1         5.0     3.6     1.4     0.2
  6    Iris-setosa           1         5.4     3.9     1.7     0.4
  7    Iris-setosa           1         4.6     3.4     1.4     0.3
  8    Iris-setosa           1         5.0     3.4     1.5     0.2
  9    Iris-setosa           1         4.4     2.9     1.4     0.2
 10    Iris-setosa           1         4.9     3.1     1.5     0.1
...
 51    Iris-versicolor       2         7.0     3.2     4.7     1.4
 52    Iris-versicolor       2         6.4     3.2     4.5     1.5
 53    Iris-versicolor       2         6.9     3.1     4.9     1.5
 54    Iris-versicolor       2         5.5     2.3     4.0     1.3
 55    Iris-versicolor       2         6.5     2.8     4.6     1.5
 56    Iris-versicolor       2         5.7     2.8     4.5     1.3
 57    Iris-versicolor       2         6.3     3.3     4.7     1.6
 58    Iris-versicolor       2         4.9     2.4     3.3     1.0
 59    Iris-versicolor       2         6.6     2.9     4.6     1.3
 60    Iris-versicolor       2         5.2     2.7     3.9     1.4
...
101    Iris-virginica        3         6.3     3.3     6.0     2.5
102    Iris-virginica        3         5.8     2.7     5.1     1.9
103    Iris-virginica        3         7.1     3.0     5.9     2.1
104    Iris-virginica        3         6.3     2.9     5.6     1.8
105    Iris-virginica        3         6.5     3.0     5.8     2.2
106    Iris-virginica        3         7.6     3.0     6.6     2.1
107    Iris-virginica        3         4.9     2.5     4.5     1.7
108    Iris-virginica        3         7.3     2.9     6.3     1.8
109    Iris-virginica        3         6.7     2.5     5.8     1.8
110    Iris-virginica        3         7.2     3.6     6.1     2.5

If we use a neural net architecture for this classification problem we will
need 4 nodes (not counting the bias node), one for each of the 4 independent
variables, in the input layer and 3 neurons (one for each class) in the output
layer. Let us select one hidden layer with 25 neurons. Notice that there will
be a total of 25 connections from each node in the input layer to nodes in
the hidden layer. This makes a total of 4 x 25 = 100 connections between
the input layer and the hidden layer. In addition there will be a total of 3
connections from each node in the hidden layer to nodes in the output layer.
This makes a total of 25 x 3 = 75 connections between the hidden layer and
the output layer. Using the standard logistic activation functions, the network
was trained with a run consisting of 60,000 iterations. Each iteration
consists of presentation to the input layer of the independent variables in a
case, followed by successive computations of the outputs of the neurons of
the hidden layer and the output layer using the appropriate weights. The
output values of neurons in the output layer are used to compute the error.
This error is used to adjust the weights of all the connections in the
network using backward propagation (backprop) to complete the iteration. Since
the training data has 150 cases, each case was presented to
the network 400 times. Another way of stating this is to say the network
was trained for 400 epochs, where an epoch consists of one sweep through
the entire training data. The results for the last epoch of training the neural
net on this data are shown below:
Iris Output 1

Classification Confusion Matrix
Desired Class     Computed 1   Computed 2   Computed 3   Total
1                     50            0            0         50
2                      0           49            1         50
3                      0            1           49         50
Total                 50           50           50        150

Error Report
Class     Patterns   # Errors   % Errors    StdDev
1             50         0         0.00     (0.00)
2             50         1         2.00     (1.98)
3             50         1         2.00     (1.98)
Overall      150         2         1.3      (0.92)

The classification error of 1.3% is better than the error using discriminant
analysis, which was 2% (see the lecture note on Discriminant Analysis). Notice
that had we stopped after only one pass of the data (150 iterations) the
error would be much worse (75%), as shown below:


Iris Output 2

Classification Confusion Matrix
Desired Class     Computed 1   Computed 2   Computed 3   Total
1                     10            7            2         19
2                     13            1            6         20
3                     12            5            4         21
Total                 35           13           12         60

The classification error rate of 1.3% was obtained by careful choice of
key control parameters for the training run by trial and error. If we set
the control parameters to poor values we can get terrible results. To
understand the parameters involved we need to understand how the backward
propagation algorithm works.

1.4 The Backward Propagation Algorithm

We will discuss the backprop algorithm for classification problems. There is
a minor adjustment for prediction problems where we are trying to predict
a continuous numerical value. In that situation we change the activation
function for output layer neurons to the identity function that has output
value = input value. (An alternative is to rescale and recenter the logistic
function to permit the outputs to be approximately linear in the range of
dependent variable values.)
The backprop algorithm cycles through two distinct passes, a forward
pass followed by a backward pass through the layers of the network. The
algorithm alternates between these passes several times as it scans the training
data. Typically, the training data has to be scanned several times before
the network learns to make good classifications.
Forward pass: computation of the outputs of all the neurons in
the network. The algorithm starts with the first hidden layer, using as
input values the independent variables of a case (often called an exemplar
in the machine learning community) from the training data set. The neuron
outputs are computed for all neurons in the first hidden layer by performing

the relevant sum and activation function evaluations. These outputs are
the inputs for neurons in the second hidden layer. Again the relevant sum
and activation function calculations are performed to compute the outputs
of second layer neurons. This continues layer by layer until we reach the
output layer and compute the outputs for this layer. These output values
constitute the neural net's guess at the value of the dependent variable. If
we are using the neural net for classification, and we have c classes, we will
have c neuron outputs from the activation functions and we use the largest
value to determine the net's classification. (If c = 2, we can use just one
output node with a cut-off value to map a numerical output value to one
of the two classes.)
Let us denote by wij the weight of the connection from node i to node j.
The values of wij are initialized to small (generally random) numbers around
0.00 (say within 0.05 of zero). These weights are adjusted to new values in the
backward pass as described below.
Backward pass: propagation of error and adjustment of weights.
This phase begins with the computation of error at each neuron in the output
layer. A popular error function is the squared difference between ok, the
output of node k, and yk, the target value for that node. The target value
is just 1 for the output node corresponding to the class of the exemplar
and zero for the other output nodes. (In practice it has been found better to use
values of 0.9 and 0.1 respectively.) For each output layer node compute its
error term as

    delta_k = ok (1 - ok)(yk - ok).

These errors are used to adjust the weights of the connections between the
last-but-one layer of the network and the output layer. The adjustment is similar
to the simple Widrow-Hoff rule that we saw earlier in this note. The new value of
the weight wjk of the connection from node j to node k is given by

    wjk(new) = wjk(old) + eta oj delta_k.

Here eta is an important tuning parameter that is chosen by trial and error by repeated
runs on the training data. Typical values for eta are in the range 0.1 to 0.9.
Low values give slow but steady learning; high values give erratic learning
and may lead to an unstable network.
The process is repeated for the connections between nodes in the last
hidden layer and the last-but-one hidden layer. The weight for the connection
between nodes i and j is given by

    wij(new) = wij(old) + eta oi delta_j,   where delta_j = oj (1 - oj) Sum(k) wjk delta_k,

for each node j in the last hidden layer.
The backward propagation of weight adjustments along these lines continues
until we reach the input layer. At this time we have a new set of
weights on which we can make a new forward pass when presented with a
training data observation.
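The forward and backward passes described above can be written compactly with numpy. The sketch below assumes one hidden layer, sigmoid activations throughout, and a one-hot (or 0.1/0.9) target vector; it follows the update formulas given above and is meant as an illustration rather than production code.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def backprop_one_case(x, target, W1, b1, W2, b2, eta=0.5):
    """One forward/backward pass for a single training case.

    x : input vector; target : vector of target values for the output nodes.
    W1, b1 : weights and biases from the input layer to the hidden layer.
    W2, b2 : weights and biases from the hidden layer to the output layer.
    Returns the updated weights and biases.
    """
    # Forward pass
    o_hidden = sigmoid(W1 @ x + b1)              # hidden layer outputs
    o_out = sigmoid(W2 @ o_hidden + b2)          # output layer outputs

    # Backward pass
    delta_out = o_out * (1 - o_out) * (target - o_out)            # delta_k
    delta_hid = o_hidden * (1 - o_hidden) * (W2.T @ delta_out)    # delta_j

    W2 = W2 + eta * np.outer(delta_out, o_hidden)   # w_jk += eta * o_j * delta_k
    b2 = b2 + eta * delta_out
    W1 = W1 + eta * np.outer(delta_hid, x)          # w_ij += eta * o_i * delta_j
    b1 = b1 + eta * delta_hid
    return W1, b1, W2, b2
```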

1.4.1

Multiple Local Optima and Epochs

The backprop algorithm is a version of the steepest descent optimization


method applied to the problem of nding the weights that minimize the
error function of the network output. Due to the complexity of the function
and the large numbers of weights that are being trained as the network
learns, there is no assurance that the backprop algorithm (and indeed any
practical algorithm) will nd the optimum weights that minimize error. the
procedure can get stuck at a local minimum. It has been found useful to
randomize the order of presentation of the cases in a training set between
dierent scans. It is possible to speed up the algorithm by batching, that is
updating the weights for several exemplars in a pass. However, at least the
extreme case of using the entire training data set on each update has been
found to get stuck frequently at poor local minima.
A single scan of all cases in the training data is called an epoch. Most
applications of feedforward networks and backprop require several epochs
before errors are reasonably small. A number of modifications have been
proposed to reduce the number of epochs needed to train a neural net. One commonly
employed idea is to incorporate a momentum term that injects some inertia
into the weight adjustment on the backward pass, as sketched below. This is done by adding to the
expression for the weight adjustment for a connection a term that is a fraction
of the previous weight adjustment for that connection. This fraction is
called the momentum control parameter. High values of the momentum parameter will force successive weight adjustments to be in similar directions.
Another idea is to vary the adjustment parameter η so that it decreases as
the number of epochs increases. Intuitively this is useful because it avoids
overfitting, which is more likely to occur at later epochs than at earlier ones.
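Building on the earlier weight-update sketch (and therefore sharing its assumed variable names eta, delta_out, o_hidden and W2), a momentum term can be added by remembering the previous adjustment; alpha below stands for the momentum control parameter and is an assumption of this sketch, not a value from the note.

    # Momentum: blend the current adjustment with a fraction of the previous one.
    alpha = 0.8                                            # momentum control parameter (assumed)
    adjustment = eta * np.outer(delta_out, o_hidden) + alpha * velocity
    W2 = W2 + adjustment
    velocity = adjustment                                  # remembered for the next pass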
1.4.2  Overfitting and the choice of training epochs
A weakness of the neural network is that it can easily be overfitted, causing
the error rate on validation data to be much larger than the error rate on the
training data. It is therefore important not to overtrain the data. A good
method for choosing the number of training epochs is to use the validation
data set periodically to compute its error rate while the network is
being trained. The validation error decreases in the early epochs of backprop,
but after a while it begins to increase. The point of minimum validation
error is a good indicator of the best number of epochs for training, and the
weights at that stage are likely to provide the best error rate on new data.
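A minimal outline of this early-stopping rule is given below. It is a sketch only: train_one_epoch, validation_error, weights, max_epochs and the two data sets are hypothetical names standing in for whatever training code is actually used, not anything defined in this note.

    import copy

    best_error = float("inf")
    best_weights, best_epoch = None, 0

    for epoch in range(1, max_epochs + 1):
        weights = train_one_epoch(weights, training_data)        # one full scan (epoch)
        val_error = validation_error(weights, validation_data)   # error on the hold-out data
        if val_error < best_error:                               # validation error still falling
            best_error, best_weights, best_epoch = val_error, copy.deepcopy(weights), epoch

    # Use the weights from the epoch with minimum validation error.
    weights = best_weights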

1.5  Adaptive Selection of Architecture
One of the time-consuming and complex aspects of using backprop is that
we need to decide on an architecture before we can use backprop. The
usual procedure is to make intelligent guesses using past experience and to
do several trial-and-error runs on different architectures. Algorithms exist
that grow the number of nodes selectively during training or trim them in a
manner analogous to what we have seen with CART. Research continues on
such methods. However, as of now there seems to be no automatic method
that is clearly superior to the trial-and-error approach.

1.6  Successful Applications
There have been a number of very successful applications of neural nets in
engineering. One of the well-known ones is ALVINN, an autonomous
vehicle-driving application for normal speeds on highways. The
neural net uses a 30x32 grid of pixel intensities from a fixed camera on the
vehicle as input; the output is the direction of steering. It uses 30 output
units representing classes such as "sharp left", "straight ahead", and "bear
right". It has 960 input units and a single layer of 4 hidden neurons. The
backprop algorithm is used to train ALVINN.
A number of successful applications have been reported in financial applications (see reference 2) such as bankruptcy prediction, currency market
trading, stock picking and commodity trading. Credit card and CRM applications have also been reported.

References
1. Bishop, Christopher: Neural Networks for Pattern Recognition, Oxford, 1995.
2. Trippi, Robert and Turban, Efraim (editors): Neural Networks in Finance and Investing, McGraw-Hill, 1996.


Multiple Linear Regression Review

Outline
Simple Linear Regression
Multiple Regression
Understanding the Regression Output
Coefficient of Determination R²
Validating the Regression Model

Appleglo data

Region            First-Year Advertising        First-Year Sales
                  Expenditures ($ millions), x  ($ millions), y
Maine                      1.8                       104
New Hampshire              1.2                        68
Vermont                    0.4                        39
Massachusetts              0.5                        43
Connecticut                2.5                       127
Rhode Island               2.5                       134
New York                   1.5                        87
New Jersey                 1.2                        77
Pennsylvania               1.6                       102
Delaware                   1.0                        65
Maryland                   1.5                       101
West Virginia              0.7                        46
Virginia                   1.0                        52
Ohio                       0.8                        33

Linear Regression: An Example

[Figure: scatter plot of First-Year Sales ($ millions) against Advertising Expenditures ($ millions) for the Appleglo data.]

Questions: a) How do we relate advertising expenditure to sales?
b) What are the expected first-year sales if advertising expenditure is $2.2 million?
c) How confident is your estimate? How good is the fit?

The Basic Model: Simple Linear Regression

Data: (x1, y1), (x2, y2), . . . , (xn, yn)
Model of the population: Yi = β0 + β1 xi + εi
ε1, ε2, . . . , εn are i.i.d. random variables, N(0, σ).
This is the true relation between Y and x, but we do not know β0 and β1 and have to estimate them based on the data.
Comments:
E(Yi | xi) = β0 + β1 xi
SD(Yi | xi) = σ
The relationship is linear, described by a line:
β0 = baseline value of Y (i.e., value of Y if x is 0)
β1 = slope of the line (average change in Y per unit change in x)

How do we choose the line that best fits the data?


[Figure: First-Year Sales ($M) vs. Advertising Expenditures ($M) with the fitted line. Each observed point (xi, yi) has a fitted point (xi, ŷi); the vertical distance ei between them is the residual. Best choices: intercept b0 = 13.82, slope b1 = 48.60.]

Regression coefficients: b0 and b1 are estimates of β0 and β1.
Regression estimate for Y at xi: ŷi = b0 + b1 xi (prediction).
Residual (error): ei = yi - ŷi.
The best regression line is the one that chooses b0 and b1 to minimize the total error (residual sum of squares):

    SSR = Σ_{i=1}^{n} ei² = Σ_{i=1}^{n} (yi - ŷi)²
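The least-squares choice of b0 and b1 can be reproduced in a few lines. The sketch below is illustrative only (it is not part of the slides), using numpy on the Appleglo data tabulated above; it should give values close to the slide's b0 ≈ 13.82 and b1 ≈ 48.60.

    import numpy as np

    # Appleglo data: advertising expenditures x and first-year sales y ($ millions).
    x = np.array([1.8, 1.2, 0.4, 0.5, 2.5, 2.5, 1.5, 1.2, 1.6, 1.0, 1.5, 0.7, 1.0, 0.8])
    y = np.array([104, 68, 39, 43, 127, 134, 87, 77, 102, 65, 101, 46, 52, 33], dtype=float)

    # Closed-form least-squares estimates: b1 = Sxy / Sxx, b0 = ybar - b1 * xbar.
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    print(b0, b1)                      # roughly 13.8 and 48.6

    # Residual sum of squares for the fitted line.
    residuals = y - (b0 + b1 * x)
    ssr = np.sum(residuals ** 2)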

Example: Sales of Nature-Bar ($ million)

Region         Sales   Advertising   Promotions   Competitors' Sales
Selkirk        101.8      1.3           0.2            20.40
Susquehanna     44.4      0.7           0.2            30.50
Kittery        108.3      1.4           0.3            24.60
Acton           85.1      0.5           0.4            19.60
Finger Lakes    77.1      0.5           0.6            25.50
Berkshire      158.7      1.9           0.4            21.70
Central        180.4      1.2           1.0             6.80
Providence      64.2      0.4           0.4            12.60
Nashua          74.6      0.6           0.5            31.30
Dunster        143.4      1.3           0.6            18.60
Endicott       120.6      1.6           0.8            19.90
Five-Towns      69.7      1.0           0.3            25.60
Waldeboro       67.8      0.8           0.2            27.40
Jackson        106.7      0.6           0.5            24.30
Stowe          119.6      1.1           0.3            13.70

Multiple Regression

In general, there are many factors in addition to advertising expenditures that affect sales.
Multiple regression allows more than one x variable.

Independent variables: x1, x2, . . . , xk   (k of them)
Data: (y1, x11, x21, . . . , xk1), . . . , (yn, x1n, x2n, . . . , xkn)
Population model: Yi = β0 + β1 x1i + . . . + βk xki + εi
ε1, ε2, . . . , εn are i.i.d. random variables, ~ N(0, σ).

Regression coefficients: b0, b1, . . . , bk are estimates of β0, β1, . . . , βk.
Regression estimate of yi: ŷi = b0 + b1 x1i + . . . + bk xki
Goal: choose b0, b1, . . . , bk to minimize the residual sum of squares, i.e., minimize:

    SSR = Σ_{i=1}^{n} ei² = Σ_{i=1}^{n} (yi - ŷi)²
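As a hedged illustration (not part of the original slides), the least-squares coefficients for the Nature-Bar data can be obtained with numpy by minimizing SSR over a design matrix; the results should be close to the Excel output shown next (intercept ≈ 65.7, advertising ≈ 49.0, promotions ≈ 59.7, competitors' sales ≈ -1.8).

    import numpy as np

    # Nature-Bar data: sales (y) and the three predictors, one entry per region.
    y = np.array([101.8, 44.4, 108.3, 85.1, 77.1, 158.7, 180.4, 64.2,
                  74.6, 143.4, 120.6, 69.7, 67.8, 106.7, 119.6])
    advertising = np.array([1.3, 0.7, 1.4, 0.5, 0.5, 1.9, 1.2, 0.4,
                            0.6, 1.3, 1.6, 1.0, 0.8, 0.6, 1.1])
    promotions = np.array([0.2, 0.2, 0.3, 0.4, 0.6, 0.4, 1.0, 0.4,
                           0.3, 0.6, 0.8, 0.3, 0.2, 0.5, 0.3])
    competitors = np.array([20.4, 30.5, 24.6, 19.6, 25.5, 21.7, 6.8, 12.6,
                            31.3, 18.6, 19.9, 25.6, 27.4, 24.3, 13.7])

    # Design matrix with a leading column of ones for the intercept b0.
    X = np.column_stack([np.ones(len(y)), advertising, promotions, competitors])

    # b = (b0, b1, b2, b3) minimizing SSR = ||y - X b||^2.
    b, ssr, rank, _ = np.linalg.lstsq(X, y, rcond=None)
    print(b)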

Regression Output (from Excel)

Regression Statistics
Multiple R            0.913
R Square              0.833
Adjusted R Square     0.787
Standard Error        17.600
Observations          15

Analysis of Variance
              df    Sum of Squares   Mean Square      F       Significance F
Regression     3      16997.537        5665.85       18.290       0.000
Residual      11       3407.473         309.77
Total         14      20405.009

                    Coefficients   Standard Error   t Statistic   P-value   Lower 95%   Upper 95%
Intercept              65.71           27.73            2.37       0.033       4.67       126.74
Advertising            48.98           10.66            4.60       0.000      25.52        72.44
Promotions             59.65           23.63            2.53       0.024       7.66       111.65
Competitors' Sales     -1.84            0.81           -2.26       0.040      -3.63       -0.047

Understanding the Regression Output

1) Regression coefficients: b0, b1, . . . , bk are estimates of β0, β1, . . . , βk based on sample data. Fact: E[bj] = βj.
Example:
b0 = 65.705 (its interpretation is context dependent)
b1 = 48.979 (an additional $1 million in advertising is expected to result in an additional $49 million in sales)
b2 = 59.654 (an additional $1 million in promotions is expected to result in an additional $60 million in sales)
b3 = -1.838 (an increase of $1 million in competitors' sales is expected to decrease sales by $1.8 million)

Understanding the Regression Output, Continued

2) Standard error: an estimate of σ, the SD of each εi. It is a measure of the amount of noise in the model.
Example: s = 17.60
3) Degrees of freedom: #cases - #parameters; relates to the over-fitting phenomenon.
4) Standard errors of the coefficients: s_b0, s_b1, . . . , s_bk.
They are just the standard deviations of the estimates b0, b1, . . . , bk.
They are useful in assessing the quality of the coefficient estimates and validating the model.

R² takes values between 0 and 1 (it is a percentage). R² = 0.833 in our Appleglo example.

[Figure: scatter plot of First-Year Sales ($ millions) vs. Advertising Expenditures ($ millions) with the fitted line, together with two illustrative plots: one with R² = 1, where the x values account for all of the variation in the Y values, and one with R² = 0, where the x values account for none of the variation in the Y values.]

Understanding the Regression Output, Continued

5) Coefficient of determination: R²
It is a measure of the overall quality of the regression. Specifically, it is the percentage of total variation exhibited in the yi data that is accounted for by the sample regression line.

The sample mean of Y: ȳ = (y1 + y2 + . . . + yn) / n
Total variation in Y = Σ_{i=1}^{n} (yi - ȳ)²
Residual (unaccounted) variation in Y = Σ_{i=1}^{n} ei² = Σ_{i=1}^{n} (yi - ŷi)²

    R² = (variation accounted for by x variables) / (total variation)
       = 1 - (variation not accounted for by x variables) / (total variation)
       = 1 - Σ_{i=1}^{n} (yi - ŷi)² / Σ_{i=1}^{n} (yi - ȳ)²
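The definition above translates directly into code. The following fragment is an illustrative sketch (not from the slides) that computes R² for any set of fitted values, continuing with the numpy arrays used in the earlier sketches.

    import numpy as np

    def r_squared(y, y_hat):
        """R^2 = 1 - residual variation / total variation."""
        total_variation = np.sum((y - y.mean()) ** 2)        # sum of (y_i - ybar)^2
        residual_variation = np.sum((y - y_hat) ** 2)        # sum of (y_i - yhat_i)^2
        return 1 - residual_variation / total_variation

    # Example with the Nature-Bar design matrix X and coefficients b from before:
    # r_squared(y, X @ b) should be close to the reported R Square of 0.833.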

Coefficient of Determination: R²

A high R² means that most of the variation we observe in the yi data can be attributed to their corresponding x values, a desired property.
In simple regression, R² is higher if the data points are better aligned along a line. But beware of outliers (recall the Anscombe example).
How high an R² is good enough depends on the situation (for example, the intended use of the regression and the complexity of the problem).
Users of regression tend to be fixated on R², but it is not the whole story. It is important that the regression model is valid.

Coefficient of Determination: R²

One should not include x variables unrelated to Y in the model just to make R² fictitiously high. (With more x variables there will be more freedom in choosing the bi's to make the residual variation closer to 0.)
Multiple R is just the square root of R².

Validating the Regression Model

Assumptions about the population:
Yi = β0 + β1 x1i + . . . + βk xki + εi   (i = 1, . . . , n)
ε1, ε2, . . . , εn are i.i.d. random variables, ~ N(0, σ)

1) Linearity
If k = 1 (simple regression), one can check visually from the scatter plot.
Sanity check: the signs of the coefficients; is there a reason to expect non-linearity?

2) Normality of εi
Plot a histogram of the residuals (ei = yi - ŷi).
Usually, results are fairly robust with respect to this assumption.

3) Heteroscedasticity
Do the error terms have constant standard deviation? (i.e., is SD(εi) = σ for all i?)
Check the scatter plot of residuals vs. Y and the x variables.
[Figure: two plots of residuals against Advertising Expenditures. Left: no evidence of heteroscedasticity. Right: evidence of heteroscedasticity.]

Heteroscedasticity may be fixed by introducing a transformation, or by introducing or eliminating some independent variables.

4) Autocorrelation: are the error terms independent?
Plot the residuals in order and check for patterns.

[Figure: two time plots of residuals against observation order. Left: no evidence of autocorrelation. Right: evidence of autocorrelation.]

Autocorrelation may be present if observations have a natural sequential order (for example, time).
May be fixed by introducing a variable or transforming a variable.

Pitfalls and Issues

1) Overspecification
Including too many x variables to make R² fictitiously high.
Rule of thumb: we should maintain that n >= 5(k + 2).

2) Extrapolating beyond the range of the data

[Figure: scatter plot illustrating extrapolation of the fitted line beyond the range of the observed Advertising values.]

Validating the Regression Model

3) Multicollinearity
Occurs when two of the x variables are strongly correlated.
Can give very wrong estimates for the βi's.
Tell-tale signs:
- Regression coefficients (bi's) have the wrong sign.
- Addition/deletion of an independent variable results in large changes of regression coefficients.
- Regression coefficients (bi's) are not significantly different from 0.
May be fixed by deleting one or more independent variables.

Example

Student Number   Graduate GPA   College GPA   GMAT
      1              4.0            3.9        640
      2              4.0            3.9        644
      3              3.1            3.1        557
      4              3.1            3.2        550
      5              3.0            3.0        547
      6              3.5            3.5        589
      7              3.1            3.0        533
      8              3.5            3.5        600
      9              3.1            3.2        630
     10              3.2            3.2        548
     11              3.8            3.7        600
     12              4.1            3.9        633
     13              2.9            3.0        546
     14              3.7            3.7        602
     15              3.8            3.8        614
     16              3.9            3.9        644
     17              3.6            3.7        634
     18              3.1            3.0        572
     19              3.3            3.2        570
     20              4.0            3.9        656
     21              3.1            3.1        574
     22              3.7            3.7        636
     23              3.7            3.7        635
     24              3.9            4.0        654
     25              3.8            3.8        633


Regression Output

R Square           0.96
Standard Error     0.08
Observations       25

               Coefficients   Standard Error
Intercept         0.09540        0.28451
College GPA       1.12870        0.10233
GMAT             -0.00088        0.00092

What happened?

Correlation matrix:
             Graduate   College   GMAT
Graduate        1
College         0.98       1
GMAT            0.86       0.90      1

College GPA and GMAT are highly correlated! Eliminate GMAT.

R Square           0.958
Standard Error     0.08
Observations       25

               Coefficients   Standard Error
Intercept        -0.1287         0.1604
College GPA       1.0413         0.0455

Regression Models

In linear regression, we choose the best coefficients b0, b1, . . . , bk as the estimates for β0, β1, . . . , βk.
We know that on average each bj hits the right target βj.
However, we also want to know how confident we are about our estimates.

Back to the Regression Output

Regression Statistics
Multiple R            0.913
R Square              0.833
Adjusted R Square     0.787
Standard Error        17.600
Observations          15

Analysis of Variance
              df    Sum of Squares   Mean Square
Regression     3      16997.537        5665.85
Residual      11       3407.473         309.77
Total         14      20405.009

                    Coefficients   Standard Error   t Statistic   Lower 95%   Upper 95%
Intercept              65.71           27.73            2.37         4.67       126.74
Advertising            48.98           10.66            4.60        25.52        72.44
Promotions             59.65           23.63            2.53         7.66       111.65
Competitors' Sales     -1.84            0.81           -2.26        -3.63       -0.047

Regression Output Analysis

1) Degrees of freedom (dof)
Residual dof = n - (k+1). (We used up (k+1) degrees of freedom in forming the (k+1) sample estimates b0, b1, . . . , bk.)

2) Standard errors of the coefficients: s_b0, s_b1, . . . , s_bk
They are just the SDs of the estimates b0, b1, . . . , bk.
Fact: before we observe bj and s_bj,

    (bj - βj) / s_bj

obeys a t-distribution with dof = (n - k - 1), the same dof as the residual.
We will use this fact to assess the quality of our estimates bj.
What is a 95% confidence interval for βj?
Does the interval contain 0? Why do we care about this?


3) t-Statistic:

    tj = bj / s_bj

A measure of the statistical significance of each individual xj in accounting for the variability in Y.
Let c be the number for which
P(-c < T < c) = 95%,
where T obeys a t-distribution with dof = (n - k - 1).
If |tj| > c, then the 95% C.I. for βj does not contain zero.
In this case, we are 95% confident that βj is different from zero.
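As an illustrative sketch (not part of the slides), the cutoff c and a 95% confidence interval for a coefficient can be computed with scipy; b_j and s_bj below stand for a coefficient estimate and its standard error taken from the regression output above.

    from scipy import stats

    n, k = 15, 3                      # observations and number of x variables
    b_j, s_bj = 48.98, 10.66          # e.g., the Advertising coefficient and its SE

    c = stats.t.ppf(0.975, df=n - k - 1)      # P(-c < T < c) = 95% for the t-distribution
    ci = (b_j - c * s_bj, b_j + c * s_bj)     # 95% confidence interval for beta_j
    t_stat = b_j / s_bj                       # t-statistic reported in the output
    print(c, ci, t_stat)                      # ci should be close to (25.52, 72.44)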

Example: Executive Compensation

Number   Pay ($1,000)   Years in position   Change in Stock Price (%)   Change in Sales (%)   MBA?
  1         1,530              7                      48                        89              YES
  2         1,117              6                      35                        19              YES
  3           602              3                       9                        24              NO
  4         1,170              6                      37                         8              YES
  5         1,086              6                      34                        28              NO
  6         2,536              9                      81                       -16              YES
  7           300              2                     -17                       -17              NO
  8           670              2                     -15                       -67              YES
  9           250              0                     -52                        49              NO
 10         2,413             10                     109                       -27              YES
 11         2,707              7                      44                        26              YES
 12           341              1                      28                        -7              NO
 13           734              4                      10                        -7              NO
 14         2,368              8                      16                        -4              NO


Dummy variables:
Often, some of the explanatory variables in a regression are categorical rather than numeric.
If we think that whether an executive has an MBA or not affects his/her pay, we create a dummy variable and let it be 1 if the executive has an MBA and 0 otherwise.
If we think the season of the year is an important factor in determining sales, how do we create dummy variables? How many?
What is the problem with creating 4 dummy variables?
In general, if there are m categories an x variable can belong to, then we need to create m-1 dummy variables for it.
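The m-1 rule is what pandas applies when asked to drop one level. The snippet below is a hedged illustration using a hypothetical data frame: the Pay and MBA values are taken from the table above, while the Season column is invented purely to show a four-category variable.

    import pandas as pd

    df = pd.DataFrame({
        "Pay": [1530, 1117, 602, 1170],
        "MBA": ["YES", "YES", "NO", "YES"],             # two categories -> one dummy
        "Season": ["Winter", "Spring", "Summer", "Fall"] # four categories -> three dummies
    })

    # drop_first=True keeps m-1 dummies per categorical variable, avoiding the
    # redundant (perfectly collinear) m-th column.
    coded = pd.get_dummies(df, columns=["MBA", "Season"], drop_first=True)
    print(coded.columns.tolist())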

OILPLUS data

       Month              heating oil   temperature
 1     August, 1989          24.83          73
 2     September, 1989       24.69          67
 3     October, 1989         19.31          57
 4     November, 1989        59.71          43
 5     December, 1989        99.67          26
 6     January, 1990         49.33          41
 7     February, 1990        59.38          38
 8     March, 1990           55.17          46
 9     April, 1990           55.52          54
10     May, 1990             25.94          60
11     June, 1990            20.69          71
12     July, 1990            24.33          75
13     August, 1990          22.76          74
14     September, 1990       24.69          66
15     October, 1990         22.76          61
16     November, 1990        50.59          49
17     December, 1990        79.00          41


[Figure: Heating Oil Consumption (1,000 gallons) plotted against Average Temperature (degrees Fahrenheit), and against inverse temperature.]

heating oil   temperature   inverse temperature
   24.83           73             0.0137
   24.69           67             0.0149
   19.31           57             0.0175
   59.71           43             0.0233
   99.67           26             0.0385
   49.33           41             0.0244


The Practice of Regression

Choose which independent variables to include in the model, based on common sense and context-specific knowledge.
Collect data (create dummy variables if necessary).
Run the regression (the easy part).
Analyze the output and make changes in the model (this is where the action is).
Test the regression result on out-of-sample data.

The Post-Regression Checklist

1) Statistics checklist:
Calculate the correlation between pairs of x variables: watch for evidence of multicollinearity.
Check the signs of the coefficients: do they make sense?
Check the 95% C.I. (use t-statistics as a quick scan): are the coefficients significantly different from zero?
R²: overall quality of the regression, but not the only measure.
2) Residual checklist:
Normality: look at a histogram of the residuals.
Heteroscedasticity: plot residuals against each x variable.
Autocorrelation: if the data has a natural order, plot residuals in order and check for a pattern.


The Grand Checklist

Linearity: scatter plot, common sense, and knowing your problem; transform, including interactions, if useful.
t-statistics: are the coefficients significantly different from zero? Look at the width of the confidence intervals; F-tests for subsets, equality of coefficients.
R²: is it reasonably high in the context?
Influential observations, outliers in predictor space, dependent variable space.
Normality: plot a histogram of the residuals; Studentized residuals.
Heteroscedasticity: plot residuals against each x variable; transform if necessary; Box-Cox transformations.
Autocorrelation: time series plot.
Multicollinearity: compute correlations of the x variables; do the signs of the coefficients agree with intuition? Principal components.
Missing values.


Multiple Linear Regression in Data Mining

Contents
2.1. A Review of Multiple Linear Regression
2.2. Illustration of the Regression Process
2.3. Subset Selection in Linear Regression

Perhaps the most popular mathematical model for making predictions is
the multiple linear regression model. You have already studied multiple regression models in the Data, Models, and Decisions course. In this note we
will build on this knowledge to examine the use of multiple linear regression
models in data mining applications. Multiple linear regression is applicable to
numerous data mining situations. Examples are: predicting customer activity
on credit cards from demographics and historical activity patterns, predicting
the time to failure of equipment based on utilization and environment conditions, predicting expenditures on vacation travel based on historical frequent
flier data, predicting staffing requirements at help desks based on historical
data and product and sales information, predicting sales from cross-selling of
products from historical information, and predicting the impact of discounts on
sales in retail outlets.
In this note, we review the process of multiple linear regression. In this
context we emphasize (a) the need to split the data into two categories, the
training data set and the validation data set, to be able to validate the multiple
linear regression model, and (b) the need to relax the assumption that errors
follow a Normal distribution. After this review, we introduce methods for
identifying subsets of the independent variables to improve predictions.

2.1  A Review of Multiple Linear Regression

In this section, we review briefly the multiple regression model that you encountered in the DMD course. There is a continuous random variable called the
dependent variable, Y, and a number of independent variables, x1, x2, . . . , xp.
Our purpose is to predict the value of the dependent variable (also referred to
as the response variable) using a linear function of the independent variables.
The values of the independent variables (also referred to as predictor variables,
regressors or covariates) are known quantities for purposes of prediction. The
model is:

    Y = β0 + β1 x1 + β2 x2 + . . . + βp xp + ε,     (2.1)

where ε, the noise variable, is a Normally distributed random variable with
mean equal to zero and standard deviation σ whose value we do not know. We
also do not know the values of the coefficients β0, β1, β2, . . . , βp. We estimate
all these (p + 2) unknown values from the available data.
The data consist of n rows of observations, also called cases, which give
us values yi, xi1, xi2, . . . , xip; i = 1, 2, . . . , n. The estimates for the coefficients
are computed so as to minimize the sum of squares of the differences between the
fitted (predicted) values and the observed values in the data. The sum of squared
differences is given by

    Σ_{i=1}^{n} (yi - β0 - β1 xi1 - β2 xi2 - . . . - βp xip)²

Let us denote the values of the coefficients that minimize this expression by
β̂0, β̂1, β̂2, . . . , β̂p. These are our estimates for the unknown values and are called
OLS (ordinary least squares) estimates in the literature. Once we have computed the estimates β̂0, β̂1, β̂2, . . . , β̂p we can calculate an unbiased estimate σ̂²
for σ² using the formula:

    σ̂² = [1 / (n - p - 1)] Σ_{i=1}^{n} (yi - β̂0 - β̂1 xi1 - β̂2 xi2 - . . . - β̂p xip)²
       = Sum of the squared residuals / (#observations - #coefficients).

We plug in the values of β̂0, β̂1, β̂2, . . . , β̂p in the linear regression model
(2.1) to predict the value of the dependent variable from known values of the independent variables, x1, x2, . . . , xp. The predicted value, Ŷ, is computed from the
equation

    Ŷ = β̂0 + β̂1 x1 + β̂2 x2 + . . . + β̂p xp.

Predictions based on this equation are the best predictions possible in the sense
that they will be unbiased (equal to the true values on the average) and will
have the smallest expected squared error compared to any unbiased estimates
if we make the following assumptions:
1. Linearity. The expected value of the dependent variable is a linear function of the independent variables, i.e.,
E(Y | x1, x2, . . . , xp) = β0 + β1 x1 + β2 x2 + . . . + βp xp.
2. Independence. The noise random variables εi are independent between
all the rows. Here εi is the noise random variable in observation i, for
i = 1, . . . , n.
3. Unbiasedness. The noise random variable εi has zero mean, i.e., E(εi) = 0
for i = 1, 2, . . . , n.
4. Homoskedasticity. The standard deviation of εi equals the same (unknown) value, σ, for i = 1, 2, . . . , n.

5. Normality. The noise random variables, εi, are Normally distributed.


An important and interesting fact for our purposes is that even if we
drop the assumption of normality (Assumption 5) and allow the noise variables
to follow arbitrary distributions, these estimates are very good for prediction.
We can show that predictions based on these estimates are the best linear
predictions in that they minimize the expected squared error. In other words,
amongst all linear models, as defined by equation (2.1) above, the model using
the least squares estimates, β̂0, β̂1, β̂2, . . . , β̂p, will give the smallest value of
squared error on the average. We elaborate on this idea in the next section.
The Normal distribution assumption was required to derive confidence intervals for predictions. In data mining applications we have two distinct sets of
data: the training data set and the validation data set, both representative of
the relationship between the dependent and independent variables. The
training data is used to estimate the regression coefficients β̂0, β̂1, β̂2, . . . , β̂p.
The validation data set constitutes a hold-out sample and is not used in
computing the coefficient estimates. This enables us to estimate the error in
our predictions without having to assume that the noise variables follow the
Normal distribution. We use the training data to fit the model and to estimate
the coefficients. These coefficient estimates are used to make predictions for
each case in the validation data. The prediction for each case is then compared
to the value of the dependent variable that was actually observed in the validation
data. The average of the square of this error enables us to compare different
models and to assess the accuracy of the model in making predictions.
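The training/validation procedure described in this paragraph can be sketched in a few lines. This is an illustrative outline, not the note's own code, and it assumes the data are already held in numpy arrays X (with a leading column of ones) and y.

    import numpy as np

    rng = np.random.default_rng(0)

    # Randomly partition the cases into a training set and a hold-out validation set.
    n = len(y)
    idx = rng.permutation(n)
    train, valid = idx[: int(0.6 * n)], idx[int(0.6 * n):]

    # Estimate the coefficients on the training data only.
    beta_hat, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)

    # Predict the validation cases and average the squared prediction errors.
    errors = y[valid] - X[valid] @ beta_hat
    avg_squared_error = np.mean(errors ** 2)
    print(avg_squared_error)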

2.2  Illustration of the Regression Process
We illustrate the process of multiple linear regression using an example adapted
from Chatterjee, Hadi and Price on estimating the performance of supervisors in a large financial organization.
The data shown in Table 2.1 are from a survey of clerical employees in a
sample of departments in a large financial organization. The dependent variable
is a performance measure of effectiveness for supervisors heading departments
in the organization. Both the dependent and the independent variables are
totals of ratings on different aspects of the supervisor's job on a scale of 1 to
5 by 25 clerks reporting to the supervisor. As a result, the minimum value for

each variable is 25 and the maximum value is 125. These ratings are answers
to survey questions given to a sample of 25 clerks in each of 30 departments.
The purpose of the analysis was to explore the feasibility of using a questionnaire for predicting effectiveness of departments, thus saving the considerable
effort required to directly measure effectiveness. The variables are answers to
questions on the survey and are described below.
Y   Measure of effectiveness of supervisor.
X1  Handles employee complaints.
X2  Does not allow special privileges.
X3  Opportunity to learn new things.
X4  Raises based on performance.
X5  Too critical of poor performance.
X6  Rate of advancing to better jobs.
The multiple linear regression estimates as computed by the StatCalc add-in to Excel are reported in Table 2.2. The equation to predict performance is

    Y = 13.182 + 0.583 X1 - 0.044 X2 + 0.329 X3 - 0.057 X4 + 0.112 X5 - 0.197 X6.

In Table 2.3 we use ten more cases as the validation data. Applying the previous
equation to the validation data gives the predictions and errors shown in Table
2.3. The last column, entitled error, is simply the difference of the predicted
rating minus the actual rating. For example, for Case 21 the error is equal to 44.46 - 50 = -5.54.
We note that the average error in the predictions is small (-0.52) and so
the predictions are unbiased. Further, the errors are roughly Normal, so this
model gives prediction errors that are approximately 95% of the time within
±14.34 (two standard deviations) of the true value.

2.3  Subset Selection in Linear Regression
A frequent problem in data mining is that of using a regression equation to
predict the value of a dependent variable when we have a number of variables
available to choose as independent variables in our model. Given the high speed
of modern algorithms for multiple linear regression calculations, it is tempting

Case   Y    X1   X2   X3   X4   X5   X6
  1   43    51   30   39   61   92   45
  2   63    64   51   54   63   73   47
  3   71    70   68   69   76   86   48
  4   61    63   45   47   54   84   35
  5   81    78   56   66   71   83   47
  6   43    55   49   44   54   49   34
  7   58    67   42   56   66   68   35
  8   71    75   50   55   70   66   41
  9   72    82   72   67   71   83   31
 10   67    61   45   47   62   80   41
 11   64    53   53   58   58   67   34
 12   67    60   47   39   59   74   41
 13   69    62   57   42   55   63   25
 14   68    83   83   45   59   77   35
 15   77    77   54   72   79   77   46
 16   81    90   50   72   60   54   36
 17   74    85   64   69   79   79   63
 18   65    60   65   75   55   80   60
 19   65    70   46   57   75   85   46
 20   50    58   68   54   64   78   52

Table 2.1: Training Data (20 departments).

in such a situation to take a kitchen-sink approach: why bother to select a
subset, just use all the variables in the model. There are several reasons why
this could be undesirable.
It may be expensive to collect the full complement of variables for future predictions.
We may be able to measure fewer variables more accurately (for example in surveys).
Parsimony is an important property of good models. We obtain more insight into the influence of regressors in models with a few parameters.

Multiple R-squared     0.656
Residual SS          738.900
Std. Dev. Estimate     7.539

            Coefficient   StdError   t-statistic   p-value
Constant       13.182      16.746       0.787       0.445
X1              0.583       0.232       2.513       0.026
X2             -0.044       0.167      -0.263       0.797
X3              0.329       0.219       1.501       0.157
X4             -0.057       0.317      -0.180       0.860
X5              0.112       0.196       0.570       0.578
X6             -0.197       0.247      -0.798       0.439

Table 2.2: Output of StatCalc.

Estimates of regression coefficients are likely to be unstable due to multicollinearity in models with many variables. We get better insights into the
influence of regressors from models with fewer variables, as the coefficients
are more stable for parsimonious models.
It can be shown that using independent variables that are uncorrelated
with the dependent variable will increase the variance of predictions.
It can be shown that dropping independent variables that have small
(non-zero) coefficients can reduce the average error of predictions.
Let us illustrate the last two points using the simple case of two independent variables. The reasoning remains valid in the general situation of more
than two independent variables.

2.3.1  Dropping Irrelevant Variables

Suppose that the true equation for Y, the dependent variable, is:

    Y = β1 X1 + ε     (2.2)

Case    Y    X1   X2   X3   X4   X5   X6   Prediction   Error
 21    50    40   33   34   43   64   33     44.46      -5.54
 22    64    61   52   62   66   80   41     63.98      -0.02
 23    53    66   52   50   63   80   37     63.91      10.91
 24    40    37   42   58   50   57   49     45.87       5.87
 25    63    54   42   48   66   75   33     56.75      -6.25
 26    66    77   66   63   88   76   72     65.22      -0.78
 27    78    75   58   74   80   78   49     73.23      -4.77
 28    48    57   44   45   51   83   38     58.19      10.19
 29    85    85   71   71   77   74   55     76.05      -8.95
 30    82    82   39   59   64   78   39     76.10      -5.90
Averages:                                    62.38      -0.52
Std Devs:                                    11.30       7.17

Table 2.3: Predictions on the validation data.

and suppose that we estimate Y (using an additional variable X2 that is actually
irrelevant) with the equation:

    Y = β1 X1 + β2 X2 + ε.     (2.3)

We use data yi, xi1, xi2, i = 1, 2, . . . , n. We can show that in this situation
the least squares estimates β̂1 and β̂2 will have the following expected values
and variances:

    E(β̂1) = β1,   Var(β̂1) = σ² / [(1 - R²12) Σ_{i=1}^{n} x²i1],

    E(β̂2) = 0,    Var(β̂2) = σ² / [(1 - R²12) Σ_{i=1}^{n} x²i2],

where R12 is the correlation coefficient between X1 and X2.


We notice that 1 is an unbiased estimator of 1 and 2 is an unbiased
estimator of 2 , since it has an expected value of zero. If we use Model (2) we
obtain that
2
V ar(1 ) = n
E(1 ) = 1 ,
2.
i=1 x1
Note that in this case the variance of 1 is lower.

Sec. 2.3

Subset Selection in Linear Regression

The variance is the expected value of the squared error for an unbiased
estimator. So we are worse o using the irrelevant estimator in making predic2 = 0 and the
tions. Even if X2 happens to be uncorrelated with X1 so that R12
variance of 1 is the same in both models, we can show that the variance of a
prediction based on Model (3) will be worse than a prediction based on Model
(2) due to the added variability introduced by estimation of 2 .
Although our analysis has been based on one useful independent variable
and one irrelevant independent variable, the result holds true in general. It is
always better to make predictions with models that do not include
irrelevant variables.

2.3.2  Dropping independent variables with small coefficient values

Suppose that the situation is the reverse of what we have discussed above,
namely that Model (2.3) is the correct equation, but we use Model (2.2) for our
estimates and predictions, ignoring variable X2 in our model. To keep our results
simple, let us suppose that we have scaled the values of X1, X2, and Y so that
their variances are equal to 1. In this case the least squares estimate β̂1 has the
following expected value and variance:

    E(β̂1) = β1 + R12 β2,   Var(β̂1) = σ².

Notice that β̂1 is a biased estimator of β1, with bias equal to R12 β2, and its
Mean Square Error is given by:

    MSE(β̂1) = E[(β̂1 - β1)²]
             = E[{β̂1 - E(β̂1) + E(β̂1) - β1}²]
             = [Bias(β̂1)]² + Var(β̂1)
             = (R12 β2)² + σ².

If we use Model (2.3) the least squares estimates have the following expected
values and variances:

    E(β̂1) = β1,   Var(β̂1) = σ² / (1 - R²12),

    E(β̂2) = β2,   Var(β̂2) = σ² / (1 - R²12).

Now let us compare the Mean Square Errors for predicting Y at X1 = u1, X2 = u2.

For Model (2.2), the Mean Square Error is:

    MSE2(Ŷ) = E[(Ŷ - Y)²]
            = E[(u1 β̂1 - u1 β1)²] + σ²
            = u1² MSE2(β̂1) + σ²
            = u1² (R12 β2)² + u1² σ² + σ².

For Model (2.3), the Mean Square Error is:

    MSE3(Ŷ) = E[(Ŷ - Y)²]
            = E[(u1 β̂1 + u2 β̂2 - u1 β1 - u2 β2)²] + σ²
            = Var(u1 β̂1 + u2 β̂2) + σ²,   because now Ŷ is unbiased
            = u1² Var(β̂1) + u2² Var(β̂2) + 2 u1 u2 Covar(β̂1, β̂2) + σ²
            = (u1² + u2² - 2 u1 u2 R12) σ² / (1 - R²12) + σ².

Model (2.2) can lead to lower mean squared error for many combinations
of values of u1, u2, R12, and (β2/σ)². For example, if u1 = 1, u2 = 0, then
MSE2(Ŷ) < MSE3(Ŷ) when

    (R12 β2)² + σ² < σ² / (1 - R²12),

i.e., when

    |β2| / σ < 1 / √(1 - R²12).

If |β2|/σ < 1, this will be true for all values of R²12; if, however, say R²12 > 0.9,
then this will be true for |β2|/σ < 2.
In general, accepting some bias can reduce MSE. This Bias-Variance trade-off generalizes to models with several independent variables and is particularly
important for large values of the number p of independent variables, since in
that case it is very likely that there are variables in the model that have small
coefficients relative to the standard deviation of the noise term and also exhibit
at least moderate correlation with other variables. Dropping such variables will
improve the predictions, as it will reduce the MSE.
This type of Bias-Variance trade-off is a basic aspect of most data mining
procedures for prediction and classification.

2.3.3  Algorithms for Subset Selection

Selecting subsets to improve MSE is a difficult computational problem for a large
number p of independent variables. The most common procedure for p greater
than about 20 is to use heuristics to select good subsets rather than to look for
the best subset for a given criterion. The heuristics most often used and available in statistics software are step-wise procedures. There are three common
procedures: forward selection, backward elimination and step-wise regression.

Forward Selection
Here we keep adding variables one at a time to construct what we hope is a
reasonably good subset. The steps are as follows:
1. Start with the constant term only in subset S.
2. Compute the reduction in the sum of squares of the residuals (SSR) obtained by including each variable that is not presently in S. We denote
by SSR(S) the sum of squared residuals given that the model consists of
the set S of variables, and we let σ̂²(S) be an unbiased estimate of σ² for the
model consisting of the set S of variables. For the variable, say i, that
gives the largest reduction in SSR, compute

    Fi = max_{i not in S} [SSR(S) - SSR(S ∪ {i})] / σ̂²(S ∪ {i}).

If Fi > Fin, where Fin is a threshold (typically between 2 and 4), add i to S (a code sketch of this procedure follows these steps).
3. Repeat step 2 until no variables can be added.
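The following is a hedged sketch of forward selection, not the note's own code. It uses numpy least squares to compute SSR(S) for candidate subsets; X is assumed to hold one column per variable, y the dependent variable, and F_in plays the role of the threshold Fin.

    import numpy as np

    def ssr(X_subset, y):
        """Sum of squared residuals for an OLS fit on the given columns (plus a constant)."""
        A = np.column_stack([np.ones(len(y)), X_subset]) if X_subset is not None else np.ones((len(y), 1))
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        return float(resid @ resid)

    def forward_selection(X, y, F_in=3.84):
        n, p = X.shape
        S = []                                   # start with the constant term only
        while True:
            candidates = [i for i in range(p) if i not in S]
            if not candidates:
                break
            current = ssr(X[:, S] if S else None, y)
            # Candidate giving the largest reduction in SSR.
            best_i = min(candidates, key=lambda i: ssr(X[:, S + [i]], y))
            new_ssr = ssr(X[:, S + [best_i]], y)
            sigma2 = new_ssr / (n - len(S) - 2)  # unbiased estimate of sigma^2 for S + {i}
            F = (current - new_ssr) / sigma2
            if F > F_in:
                S.append(best_i)                 # add the variable and continue
            else:
                break
        return S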
Backward Elimination
1. Start with all variables in S.
2. Compute the increase in the sum of squares of the residuals (SSR) obtained by excluding each variable that is presently in S. For the variable,
say i, that gives the smallest increase in SSR, compute

    Fi = min_{i in S} [SSR(S - {i}) - SSR(S)] / σ̂²(S).

If Fi < Fout, where Fout is a threshold (typically between 2 and 4), then drop i from S.
3. Repeat step 2 until no variable can be dropped.


Backward Elimination has the advantage that all variables are included in
S at some stage. This addresses a problem of forward selection, which will never
select a variable that is better than a previously selected variable that is strongly
correlated with it. The disadvantage is that the full model with all variables is
required at the start, and this can be time-consuming and numerically unstable.

Step-wise Regression
This procedure is like Forward Selection except that at each step we consider
dropping variables as in Backward Elimination.
Convergence is guaranteed if the thresholds Fout and Fin satisfy Fout < Fin. It is possible, however, for a variable to enter S and then leave S at a
subsequent step, and even rejoin S at a yet later step.
As stated above, these methods pick one best subset. There are straightforward variations of the methods that do identify several close-to-best choices
for different sizes of independent variable subsets.
None of the above methods guarantees that they yield the best subset
for any criterion such as adjusted R² (defined later in this note). They are
reasonable methods for situations with large numbers of independent variables,
but for moderate numbers of independent variables the method discussed next
is preferable.
All Subsets Regression
The idea here is to evaluate all subsets. Efficient implementations use branch-and-bound algorithms of the type you have seen in DMD for integer programming to avoid explicitly enumerating all subsets. (In fact the subset selection
problem can be set up as a quadratic integer program.) We compute a criterion
such as R²adj, the adjusted R², for all subsets to choose the best one. (This is
only feasible if p is less than about 20.)

2.3.4  Identifying subsets of variables to improve predictions

The All Subsets Regression (as well as modifications of the heuristic algorithms)
will produce a number of subsets. Since the number of subsets for even moderate
values of p is very large, we need some way to examine the most promising
subsets and to select from them. An intuitive metric to compare subsets is R².
However, since R² = 1 - SSR/SST, where SST, the Total Sum of Squares, is the
sum of squared residuals for the model with just the constant term, if we use
it as a criterion we will always pick the full model with all p variables. One
approach is therefore to select the subset with the largest R² for each possible

size k, k = 2, . . . , p + 1. The size is the number of coefficients in the model and
is therefore one more than the number of variables in the subset, to account
for the constant term. We then examine the increase in R² as a function of k
amongst these subsets and choose a subset such that subsets that are larger in
size give only insignificant increases in R².
Another, more automatic, approach is to choose the subset that maximizes R²adj, a modification of R² that makes an adjustment to account for size.
The formula for R²adj is

    R²adj = 1 - [(n - 1)/(n - k - 1)] (1 - R²).

It can be shown that using R²adj to choose a subset is equivalent to picking
the subset that minimizes σ̂².
Table 2.4 gives the results of the subset selection procedures applied to
the training data in the example on supervisor data in Section 2.2.
Notice that the step-wise method fails to find the best subset for sizes of 4,
5, and 6 variables. The Forward and Backward methods do find the best subsets
of all sizes and so give identical results to the All Subsets algorithm. The best
subset of size 3, consisting of {X1, X3}, maximizes R²adj for all the algorithms.
This suggests that we may be better off in terms of MSE of predictions if we
use this subset rather than the full model of size 7 with all six variables in the
model. Using this model on the validation data gives a slightly higher standard
deviation of error (7.3) than the full model (7.1), but this may be a small price to
pay if the cost of the survey can be reduced substantially by having 2 questions
instead of 6. This example also underscores the fact that we are basing our
analysis on small (tiny by data mining standards!) training and validation data
sets. Small data sets make our estimates of R² unreliable.
A criterion that is often used for subset selection is known as Mallows' Cp.
This criterion assumes that the full model is unbiased, although it may have
variables that, if dropped, would improve the MSE. With this assumption
we can show that if a subset model is unbiased, E(Cp) equals k, the size of the
subset. Thus a reasonable approach to identifying subset models with small bias
is to examine those with values of Cp that are near k. Cp is also an estimate
of the sum of MSE (standardized by dividing by σ²) for predictions (the fitted
values) at the x-values observed in the training set. Thus good models are those
that have values of Cp near k and that have small k (i.e., are of small size). Cp
is computed from the formula

    Cp = SSR / σ̂²Full + 2k - n,

SST = 2149.000     Fin = 3.840     Fout = 2.710

Forward, backward, and all subsets selections
Size     SSR       RSq    RSq(adj)    Cp      Model
 2      874.467   0.593    0.570    -0.615    Constant X1
 3      786.601   0.634    0.591    -0.161    Constant X1 X3
 4      759.413   0.647    0.580     1.361    Constant X1 X3 X6
 5      743.617   0.654    0.562     3.083    Constant X1 X3 X5 X6
 6      740.746   0.655    0.532     5.032    Constant X1 X2 X3 X5 X6
 7      738.900   0.656    0.497     7.000    Constant X1 X2 X3 X4 X5 X6

Stepwise Selection
Size     SSR       RSq    RSq(adj)    Cp      Model
 2      874.467   0.593    0.570    -0.615    Constant X1
 3      786.601   0.634    0.591    -0.161    Constant X1 X3
 4      783.970   0.635    0.567     1.793    Constant X1 X2 X3
 5      781.089   0.637    0.540     3.742    Constant X1 X2 X3 X4
 6      775.094   0.639    0.511     5.637    Constant X1 X2 X3 X4 X5
 7      738.900   0.656    0.497     7.000    Constant X1 X2 X3 X4 X5 X6

Table 2.4: Subset Selection for the example in Section 2.2

where σ̂²Full is the estimated value of σ² in the full model that includes all the
variables. It is important to remember that the usefulness of this approach
depends heavily on the reliability of the estimate of σ² for the full model. This
requires that the training set contains a large number of observations relative to
the number of variables. We note that for our example only the subsets of size
6 and 7 seem to be unbiased, as for the other models Cp differs substantially
from k. This is a consequence of having too few observations to estimate σ²
accurately in the full model.
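As an illustrative aside (not part of the note), the two criteria defined above are one-liners once SSR, SST, n and the subset size are known. The argument conventions below were chosen so the results reproduce Table 2.4: the adjusted-R² formula is applied with the number of x variables in the subset, and Cp with the size (number of coefficients).

    def adjusted_r2(ssr, sst, n, num_vars):
        """Adjusted R^2 = 1 - (n-1)/(n - num_vars - 1) * (1 - R^2)."""
        r2 = 1 - ssr / sst
        return 1 - (n - 1) / (n - num_vars - 1) * (1 - r2)

    def mallows_cp(ssr, sigma2_full, n, size):
        """C_p = SSR / sigma^2_Full + 2k - n, with k the size (number of coefficients)."""
        return ssr / sigma2_full + 2 * size - n

    # The size-3 subset {X1, X3} from Table 2.4 (n = 20 training cases, two x variables).
    sigma2_full = 738.900 / (20 - 7)                # full-model estimate of sigma^2
    print(adjusted_r2(786.601, 2149.0, 20, 2))      # roughly 0.591, as in Table 2.4
    print(mallows_cp(786.601, sigma2_full, 20, 3))  # roughly -0.161, as in Table 2.4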

Lecture 1

k-Nearest Neighbor Algorithms for Classification and Prediction

k-Nearest Neighbor Classification

The idea behind the k-Nearest Neighbor algorithm is to build a classification
method using no assumptions about the form of the function, y = f(x1, x2, . . . , xp),
that relates the dependent (or response) variable, y, to the independent (or
predictor) variables x1, x2, . . . , xp. The only assumption we make is that f is a
smooth function. This is a non-parametric method because it does not involve
estimation of parameters in an assumed functional form, such as the linear form
that we encountered in linear regression.
We have training data in which each observation has a y value, which is
just the class to which the observation belongs. For example, if we have two
classes, y is a binary variable. The idea in k-Nearest Neighbor methods is to
dynamically identify k observations in the training data set that are similar to
a new observation, say (u1, u2, . . . , up), that we wish to classify, and to use these
observations to classify the observation into a class, v. If we knew the function f,
we would simply compute v = f(u1, u2, . . . , up). If all we are prepared to assume
is that f is a smooth function, a reasonable idea is to look for observations in
our training data that are near it (in terms of the independent variables) and
then to compute v from the values of y for these observations. This is similar
in spirit to the interpolation in a table of values that we are accustomed to
doing when using a table of the Normal distribution. When we talk about neighbors we are implying that there is a distance or dissimilarity measure that we
can compute between observations based on the independent variables. For the
moment we will confine ourselves to the most popular measure of distance: Euclidean distance. The Euclidean distance between the points (x1, x2, . . . , xp) and
(u1, u2, . . . , up) is √[(x1 - u1)² + (x2 - u2)² + . . . + (xp - up)²]. We will examine
other ways to define distance between points in the space of predictor variables
when we discuss clustering methods.
The simplest case is k = 1, where we find the observation that is closest (the
nearest neighbor) and set v = y, where y is the class of the nearest neighbor.
It is a remarkable fact that this simple, intuitive idea of using a single nearest
neighbor to classify observations can be very powerful when we have a large
number of observations in our training set. It is possible to prove that the
1-NN scheme has a misclassification probability
that is no worse than twice that of the situation where we know the precise
probability density functions for each class. In other words, if we have a large
amount of data and used an arbitrarily sophisticated classification rule, we would
be able to reduce the misclassification error at best to half that of the simple
1-NN rule.
For k-NN we extend the idea of 1-NN as follows. Find the nearest k neighbors and then use a majority decision rule to classify a new observation; a sketch
of this rule in code follows this paragraph. The
advantage is that higher values of k provide smoothing that reduces the risk
of overfitting due to noise in the training data. In typical applications k is in
units or tens rather than in hundreds or thousands. Notice that if k = n, the
number of observations in the training data set, we are merely predicting the
class that has the majority in the training data for all observations, irrespective
of the values of (u1, u2, . . . , up). This is clearly a case of oversmoothing unless
there is no information at all in the independent variables about the dependent
variable.
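The following is a minimal, illustrative k-NN classifier using Euclidean distance and a majority vote. It is a sketch written for this exposition rather than production code, and it assumes the training data are numpy arrays X_train (rows of predictor values) and y_train (class labels).

    import numpy as np
    from collections import Counter

    def knn_classify(u, X_train, y_train, k=1):
        """Classify a new observation u by majority vote among its k nearest neighbors."""
        # Euclidean distances from u to every training observation.
        distances = np.sqrt(((X_train - u) ** 2).sum(axis=1))
        # Indices of the k closest training observations.
        nearest = np.argsort(distances)[:k]
        # Majority class among those neighbors (ties broken by first encountered).
        votes = Counter(y_train[nearest])
        return votes.most_common(1)[0][0]

In practice the predictors are usually standardized first so that one variable's units (for example income in thousands of dollars) do not dominate the distance computation.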
Example 1
A riding-mower manufacturer would like to find a way of classifying families
in a city into those that are likely to purchase a riding mower and those that are
not likely to buy one. A pilot random sample of 12 owners and 12 non-owners
in the city is undertaken. The data are shown in Table 1 and Figure 1 below:
Table 1
Observation   Income ($000s)   Lot Size (000s sq. ft.)   Owners=1, Non-owners=2
     1             60.0               18.4                        1
     2             85.5               16.8                        1
     3             64.8               21.6                        1
     4             61.5               20.8                        1
     5             87.0               23.6                        1
     6            110.1               19.2                        1
     7            108.0               17.6                        1
     8             82.8               22.4                        1
     9             69.0               20.0                        1
    10             93.0               20.8                        1
    11             51.0               22.0                        1
    12             81.0               20.0                        1
    13             75.0               19.6                        2
    14             52.8               20.8                        2
    15             64.8               17.2                        2
    16             43.2               20.4                        2
    17             84.0               17.6                        2
    18             49.2               17.6                        2
    19             59.4               16.0                        2
    20             66.0               18.4                        2
    21             47.4               16.4                        2
    22             33.0               18.8                        2
    23             51.0               14.0                        2
    24             63.0               14.8                        2

How do we choose k? In data mining we use the training data to classify the
cases in the validation data to compute error rates for various choices of k. For
our example we have randomly divided the data into a training set with 18 cases
and a validation set of 6 cases. Of course, in a real data mining situation we
would have sets of much larger sizes. The validation set consists of observations
6, 7, 12, 14, 19, 20 of Table 1. The remaining 18 observations constitute the
training data. Figure 1 displays the observations in both the training and validation
data sets.

[Figure 1: scatter plot of Lot Size (000's sq. ft.) against Income ($000s), with training owners (TrnOwn), training non-owners (TrnNonOwn), validation owners (VldOwn) and validation non-owners (VldNonOwn) plotted separately.]

Notice that if we choose k=1 we will classify in a way that is very
sensitive to the local characteristics of our data. On the other hand if we choose
a large value of k we average over a large number of data points and average
out the variability due to the noise associated with individual data points. If
we choose k=18 we would simply predict the most frequent class in the data
set in all cases. This is a very stable prediction but it completely ignores the
information in the independent variables.

Table 2 shows the misclassification error rate for observations in the validation data for different choices of k.

Table 2
k                            1    3    5    7    9   11   13   18
Misclassification Error %   33   33   33   33   33   17   17   50

We would choose k=11 (or possibly 13) in this case. This choice optimally
trades off the variability associated with a low value of k against the oversmoothing associated with a high value of k. It is worth remarking that a useful way
to think of k is through the concept of the effective number of parameters. The
effective number of parameters corresponding to k is n/k, where n is the number
of observations in the training data set. Thus a choice of k=11 has an effective number of parameters of about 2 and is roughly similar in the extent of
smoothing to a linear regression fit with two coefficients.
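The search over k summarized in Table 2 can be automated. The sketch below is illustrative only; it uses scikit-learn's k-NN classifier (an assumption of this sketch, not the lecture's own tool) on hypothetical arrays X_train, y_train, X_valid, y_valid holding the 18 training and 6 validation cases.

    from sklearn.neighbors import KNeighborsClassifier

    errors = {}
    for k in [1, 3, 5, 7, 9, 11, 13, 18]:
        model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
        # Misclassification error on the validation cases, as a percentage.
        errors[k] = 100 * (1 - model.score(X_valid, y_valid))

    best_k = min(errors, key=errors.get)   # k with the lowest validation error
    print(errors, best_k)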

k-Nearest Neighbor Prediction

The idea of k-NN can be readily extended to predicting a continuous value


(as is our aim with multiple linear regression models), by simply predicting
the average value of the dependent variable for the k nearest neighbors. Often
this average is a weighted average with the weight decreasing with increasing
distance from the point at which the prediction is required.

Shortcomings of k-NN algorithms

There are two difficulties with the practical exploitation of the power of the
k-NN approach. First, while there is no time required to estimate parameters
from the training data (as would be the case for parametric models such as
regression), the time to find the nearest neighbors in a large training set can
be prohibitive. A number of ideas have been implemented to overcome this
difficulty. The main ideas are:
1. Reduce the time taken to compute distances by working in a reduced
dimension, using dimension reduction techniques such as principal components.
2. Use sophisticated data structures such as search trees to speed up identification of the nearest neighbor. This approach often settles for an "almost
nearest" neighbor to improve speed.
3. Edit the training data to remove redundant or almost redundant points
in the training set to speed up the search for the nearest neighbor. An
example is to remove observations in the training data set that have no
effect on the classification because they are surrounded by observations
that all belong to the same class.
Second, the number of observations required in the training data set to
qualify as large increases exponentially with the number of dimensions p. This
is because the expected distance to the nearest neighbor goes up dramatically
with p unless the size of the training data set increases exponentially with p.
An illustration of this phenomenon, known as "the curse of dimensionality", is
the fact that if the independent variables in the training data are distributed
uniformly in a hypercube of dimension p, the probability that a point is within
a distance of 0.5 units from the center is

    π^(p/2) / (2^(p-1) p Γ(p/2)).

The table below shows the expected number of training points within a distance
of 0.5 units from the center (n times this probability), to illustrate how rapidly it
drops to near zero for different combinations of p and n, the size of the training data set.

                                              p
      n            2         3         4         5        10        20          30          40
   10,000        7854      5236      3084      1645       25      0.0002     2x10^-10    3x10^-17
  100,000       78540     52360     30843     16449      249      0.0025     2x10^-9     3x10^-16
 1,000,000     785398    523600    308425    164493     2490      0.0246     2x10^-8     3x10^-15
10,000,000    7853982   5236000   3084251   1644934    24904      0.2461     2x10^-7     3x10^-14

The curse of dimensionality is a fundamental issue pertinent to all classification, prediction and clustering techniques. This is why we often seek to reduce
the dimensionality of the space of predictor variables through methods such as
selecting subsets of the predictor variables for our model or by combining them
using methods such as principal components, singular value decomposition and
factor analysis. In the artificial intelligence literature dimension reduction is
often referred to as factor selection.

Lecture 16

Discovering Association Rules in Transaction Databases

What are Association Rules?


The availability of detailed information on customer transactions has led to
the development of techniques that automatically look for associations
between items that are stored in the database. An example is data collected
using bar-code scanners in supermarkets. Such market basket databases
consist of a large number of transaction records. Each record lists all items
bought by a customer on a single purchase transaction. Managers would be
interested to know if certain groups of items are consistently purchased
together. They could use this data for store layouts to place items optimally
with respect to each other; they could also use such information for cross-selling,
for promotions, for catalog design, and to identify customer segments based
on buying patterns. Association rules provide information of this type in the
form of if-then statements. These rules are computed from the data and,
unlike the if-then rules of logic, association rules are probabilistic in nature.
In addition to the antecedent (the if part) and the consequent (the then
part) an association rule has two numbers that express the degree of
uncertainty about the rule. In association analysis the antecedent and
consequent are sets of items (called itemsets) that are disjoint (do not have
any items in common).
The first number is called the support for the rule. The support is simply the
number of transactions that include all items in the antecedent and
consequent parts of the rule. (The support is sometimes expressed as a
percentage of the total number of records in the database.)
The other number is known as the confidence of the rule. Confidence is the
ratio of the number of transactions that include all items in the consequent as
well as the antecedent (namely, the support) to the number of transactions
that include all items in the antecedent. For example if a supermarket
database has 100,000 point-of-sale transactions, out of which 2,000 include
both items A and B and 800 of these include item C, the association rule If
A and B are purchased then C is purchased on the same trip has a support
of 800 transactions (alternatively 0.8% = 800/100,000) and a confidence of
40% (=800/2,000).
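The supermarket numbers above translate directly into code. The following fragment is an illustrative sketch of the arithmetic only, with the counts taken from the example in the text.

    total_transactions = 100_000
    n_antecedent = 2_000        # transactions containing both A and B
    n_rule = 800                # transactions containing A, B and C

    support_count = n_rule                                   # 800 transactions
    support_pct = 100 * n_rule / total_transactions          # 0.8%
    confidence = 100 * n_rule / n_antecedent                 # 40%
    print(support_count, support_pct, confidence)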
One way to think of support is that it is the probability that a randomly
selected transaction from the database will contain all items in the
antecedent and the consequent, whereas the confidence is the conditional
probability that a randomly selected transaction will include all the items in
the consequent given that the transaction includes all the items in the
antecedent.
Example 1 (Han and Kamber)
The manager of the AllElectronics retail store would like to know what
items sell together. He has a database of transactions as shown below:

Transaction ID   Item Codes
      1          1, 2, 5
      2          2, 4
      3          2, 3
      4          1, 2, 4
      5          1, 3
      6          2, 3
      7          1, 3
      8          1, 2, 3, 5
      9          1, 2, 3

There are 9 transactions. Each transaction is a record of the items bought


together in that transaction. Transaction 1 is a point-of-sale purchase of
items 1, 2, and 5. Transaction 2 is a joint purchase of items 2 and 4, etc.
Suppose that we want association rules between items for this database that
have a support count of at least 2 (equivalent to a percentage support of
2/9=22%). By enumeration we can see that only the following itemsets have
a count of at least 2:
{1} with support count of 6;
{2} with support count of 7;
{3} with support count of 6;
{4} with support count of 2;
{5} with support count of 2;
{1, 2} with support count of 4;
{1, 3} with support count of 4;
{1, 5} with support count of 2;
{2, 3} with support count of 4;
{2, 4} with support count of 2;
{2, 5} with support count of 2;
{1, 2, 3} with support count of 2;
{1, 2, 5} with support count of 2.

Notice that once we have created a list of all itemsets that have the required
support, we can deduce the rules that meet the desired confidence ratio by
examining all subsets of each itemset in the list. Since any subset of a set
must occur at least as frequently as the set, each subset will also be in the
list. It is then straightforward to compute the confidence as the ratio of the
support for the itemset to the support for each subset of the itemset. We
retain the corresponding association rule only if it exceeds the desired cutoff value for confidence. For example, from the itemset {1,2,5} we get the
following association rules:
{1, 2} => {5} with confidence = support count of {1, 2, 5} / support count of {1, 2} = 2/4 = 50%;
{1, 5} => {2} with confidence = support count of {1, 2, 5} / support count of {1, 5} = 2/2 = 100%;
{2, 5} => {1} with confidence = support count of {1, 2, 5} / support count of {2, 5} = 2/2 = 100%;
{1} => {2, 5} with confidence = support count of {1, 2, 5} / support count of {1} = 2/6 = 33%;
{2} => {1, 5} with confidence = support count of {1, 2, 5} / support count of {2} = 2/7 = 29%;
{5} => {1, 2} with confidence = support count of {1, 2, 5} / support count of {5} = 2/2 = 100%.
If the desired confidence cut-off were 70%, we would report only the second,
third, and last rules.
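The arithmetic for these six rules is easy to check. The following snippet is our own, with the support counts copied from the itemset list above; it recomputes each confidence.

support = {(1, 2, 5): 2, (1, 2): 4, (1, 5): 2, (2, 5): 2, (1,): 6, (2,): 7, (5,): 2}
for antecedent in [(1, 2), (1, 5), (2, 5), (1,), (2,), (5,)]:
    consequent = tuple(i for i in (1, 2, 5) if i not in antecedent)
    conf = support[(1, 2, 5)] / support[antecedent]   # support of itemset / support of antecedent
    print(antecedent, "=>", consequent, f"confidence {conf:.0%}")

Running it gives 50%, 100%, 100%, 33%, 29% and 100%, matching the rules above; with a 70% cut-off only the three 100% rules survive.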
We can see from the above that the problem of generating all association
rules that meet stipulated support and confidence requirements can be
decomposed into two stages. First we find all itemsets with the requisite
support (these are called frequent or large itemsets); and then we generate,
from each itemset so identified, association rules that meet the confidence
requirement. For most association analysis data, the computational challenge
is the first stage.
The Apriori Algorithm

Although several algorithms have been proposed for
generating association rules, the classic algorithm is the Apriori algorithm of
Agrawal and Srikant. The key idea of the algorithm is to begin by generating
frequent itemsets with just one item (1-itemsets) and to recursively generate
frequent itemsets with 2 items, then frequent 3-itemsets and so on until we
have generated frequent itemsets of all sizes. Without loss of generality we
will denote items by unique, consecutive (positive) integers and assume that
the items in each itemset are listed in increasing order of this item number.
The example above illustrates this notation. When we refer to an item in a
computation we actually mean this item number.
It is easy to generate frequent 1-itemsets. All we need to do is to count, for
each item, how many transactions in the database include the item. These
transaction counts are the supports for the 1-itemsets. We drop 1-itemsets
that have support below the desired cut-off value to create a list of the
frequent 1-itemsets.
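This first pass is just a frequency count. A minimal sketch (our own, reusing the Example 1 data) with Python's Counter:

from collections import Counter

transactions = [{1, 2, 5}, {2, 4}, {2, 3}, {1, 2, 4}, {1, 3},
                {2, 3}, {1, 3}, {1, 2, 3, 5}, {1, 2, 3}]
min_support = 2

counts = Counter(item for t in transactions for item in t)   # item -> support count
L1 = {item: n for item, n in counts.items() if n >= min_support}
print(L1)   # here every item survives: 1:6, 2:7, 3:6, 4:2, 5:2 (order may vary)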
The general procedure to obtain k-itemsets from (k-1)-itemsets for k = 2, 3,
..., is as follows. Create a candidate list of k-itemsets by performing a join
operation on pairs of (k-1)-itemsets in the list. The join is over the first (k-2)
items, i.e. a pair is combined if the first (k-2) items are the same in both
members of the pair. If this condition is met, the join of the pair is a k-itemset
that contains the common first (k-2) items and the two items that are not in
common, one from each member of the pair. All frequent k-itemsets must be
in this candidate list since every subset of size (k-1) of a frequent k-itemset
must be a frequent (k-1) itemset. However, some k-itemsets in the candidate
list may not be frequent k-itemsets. We need to delete these to create the list
of frequent k-itemsets. To identify the k-itemsets that are not frequent we
examine all subsets of size (k-1) of each candidate k-itemset. Notice that we
need examine only (k-1)-itemsets that contain the last two items of the
candidate k-itemset (Why?). If any one of these subsets of size (k-1) is not
present in the frequent (k-1)-itemset list, we know that the candidate
k-itemset cannot be a frequent itemset. We delete such k-itemsets from the
candidate list. Proceeding in this manner with every itemset in the candidate
list we are assured that at the end of our scan the k-itemset candidate list will
have been pruned to become the list of frequent k-itemsets. We repeat the
procedure recursively by incrementing k. We stop only when the candidate
list is empty.
A critical aspect for efficiency in this algorithm is the data structure of the
candidate and frequent itemset lists. Hash trees were used in the original
version but there have been several proposals to improve on this structure.
There are also other algorithms that can be faster than the Apriori algorithm
in practice.
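To make the join-and-prune step concrete, here is a compact sketch in Python. It is our own illustration of the procedure described above, not the original Agrawal-Srikant code; it uses a plain set of tuples in place of the hash tree mentioned in the notes, and for simplicity it checks all (k-1)-subsets in the prune step rather than only those containing the last two items of the candidate.

from itertools import combinations

def apriori_gen(frequent_prev):
    """Candidate k-itemsets from frequent (k-1)-itemsets given as sorted tuples."""
    prev = set(frequent_prev)
    candidates = []
    for a in frequent_prev:
        for b in frequent_prev:
            if a[:-1] == b[:-1] and a[-1] < b[-1]:          # join step
                cand = a + (b[-1],)
                # prune step: every (k-1)-subset must itself be frequent
                if all(sub in prev for sub in combinations(cand, len(cand) - 1)):
                    candidates.append(cand)
    return candidates

# Frequent 2-itemsets from Example 1:
L2 = [(1, 2), (1, 3), (1, 5), (2, 3), (2, 4), (2, 5)]
print(apriori_gen(L2))   # -> [(1, 2, 3), (1, 2, 5)]

The candidates returned would then be counted against the database and any with support below the cut-off dropped; here both candidates turn out to be frequent, matching the two 3-itemsets found in Example 1.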
Let us examine the output from an application of this algorithm to a small
randomly generated database of 50 records, shown in Example 2.
Example 2

[Table of the 50 randomly generated transactions, with columns Tr# and Items; the item lists are not recoverable from this extract.]

Association Rules Output

Input Data: $A$5:$E$54
Min. Support: 2 (= 4%)
Min. Conf. %: 70

Rule #  Conf. %  Antecedent (a)  Consequent (c)  Supp.(a)  Supp.(c)  Supp.(a U c)  Conf. % if Pr(c|a)=Pr(c)  Lift Ratio (conf/prev. col.)
1       80       2               9               5         27        4             54                        1.5
2       100      5, 7            9               3         27        3             54                        1.9
3       100      6, 7            8               3         29        3             58                        1.7
4       100      1, 5            8               2         29        2             58                        1.7
5       100      2, 7            9               2         27        2             54                        1.9
6       100      3, 8            4               2         11        2             22                        4.5
7       100      3, 4            8               2         29        2             58                        1.7
8       100      3, 7            9               2         27        2             54                        1.9
9       100      4, 5            9               2         27        2             54                        1.9

A high value of confidence suggests a strong association rule. However, this
can be deceptive because if the antecedent and/or the consequent have high
support, we can have a high value for confidence even when they are
independent! A better measure of the strength of an association rule is to
compare the confidence of the rule with a benchmark value computed under the
assumption that the occurrence of the consequent itemset in a transaction is
independent of the occurrence of the antecedent. We can compute this
benchmark from the frequency counts of the frequent itemsets. The benchmark
confidence value for a rule is the support for the consequent divided by the
number of transactions in the database. This enables us to compute the lift
ratio of a rule. The lift ratio is the confidence of the rule divided by this
benchmark confidence. A lift ratio greater than 1.0 suggests that there is
some usefulness to the rule. The larger the lift ratio, the greater is the
strength of the association. (What does a ratio less than 1.00 mean? Can it
be useful to know such rules?)
In our example the lift ratios highlight Rule 6 as the most interesting:
purchase of item 4 is about 4.5 times as likely in transactions that contain
items 3 and 8 as it is in the database at large.
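Rule 6's lift can be checked directly from the numbers in the output table above (our own arithmetic):

n = 50                                         # transactions in the database
support_a, support_c, support_ac = 2, 11, 2    # supports of {3,8}, {4}, {3,4,8}

confidence = support_ac / support_a     # 1.00, i.e. 100%
benchmark = support_c / n               # 0.22, i.e. 22%
lift = confidence / benchmark
print(round(lift, 2))                   # -> 4.55, reported as 4.5 in the output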
Shortcomings
Association rules have not been as useful in practice as one would have
hoped. One major shortcoming is that the support-confidence framework often
generates too many rules. Another is that most of the rules it finds are
obvious. Insights such as the celebrated "diapers and beer are bought
together on Friday evenings" story are not as common as might be expected.
Association analysis therefore requires skill, and it seems likely, as some
researchers have argued, that a more rigorous statistical discipline to cope
with rule proliferation would be beneficial.
Extensions
The general approach of association analysis utilizing support and
confidence concepts has been extended to sequences where one is looking
for patterns that evolve over time. The computational problems are even more
formidable, but there have been several successful applications.
