My Favorite
Statistics at scale and speed
Darryl Pregibon
My extension:
. . . And simplicity
Gartner Group definition: "Data mining is the process of discovering meaningful new correlations, patterns and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques."
Drivers
Market: From focus on product/service to focus on
customer
IT: From focus on up-to-date balances to focus on patterns in transactions - Data Warehouses, OLAP
Dramatic drop in storage costs: huge databases,
e.g. Walmart: 20 million transactions/day, 10 terabyte database; Blockbuster: 36 million households
Core Disciplines
Statistics (adapted for 21st century data sizes and
speed requirements). Examples:
Descriptive: Visualization
Models (DMD): Regression, Cluster Analysis
Process
1. Develop understanding of application, goals
2. Create dataset for study (often from Data
Warehouse)
3. Data Cleaning and Preprocessing
4. Data Reduction and projection
5. Choose Data Mining task
6. Choose Data Mining algorithms
7. Use algorithms to perform task
8. Interpret and iterate through steps 1-7 if necessary
9. Deploy: integrate into operational systems.
Illustrative Applications
Customer Relationship Management
Finance
E-commerce and Internet
Customer Relationship
Management
Target Marketing
Attrition Prediction/Churn Analysis
Fraud Detection
Credit Scoring
Target marketing
Business problem: Use list of prospects for
direct mailing campaign
Solution: Use Data Mining to identify most
promising respondents combining
demographic and geographic data with data
on past purchase behavior
Benefit: Better response rate, savings in
campaign cost
Fraud Detection
Business problem: Fraud increases costs or
reduces revenue
Solution: Use logistic regression, neural
nets to identify characteristics of fraudulent
cases to prevent in future or prosecute more
vigorously
Benefit: Increased profits by reducing
undesirable customers
Risk Analysis
Business problem: Reduce risk of loans to
delinquent customers
Solution: Use credit scoring models using
discriminant analysis to create score
functions that separate out risky customers
Benefit: Decrease in cost of bad debts
Finance
Business problem: Pricing of corporate bonds depends on several factors: the risk profile of the company, seniority of the debt, dividends, prior history, etc.
Solution Approach: Through DM, develop more accurate models for predicting prices.
Recommendation systems
Business opportunity: Users rate items
(Amazon.com, CDNOW.com, MovieFinder.com)
on the web. How to use information from other
users to infer ratings for a particular user?
Solution: Use of a technique known as
collaborative filtering
Benefit: Increase revenues by cross-selling and up-selling
Clicks to Customers
Business problem: 50% of Dell's clients order their computer through the web. However, the retention rate is 0.5%, i.e. only 0.5% of visitors to Dell's web page become customers.
Solution Approach: Through the sequence of their clicks, cluster customers and design website interventions to maximize the number of customers who eventually buy.
Benefit: Increase revenues
Spam
Bioinformatics/Genomics
Medical History Data, Insurance Claims
Personalization of services in e-commerce
RF Tags: Gillette
Security:
Container Shipments
Network Intrusion Detection
Core Concepts
Types of Data:
Numeric
Continuous ratio and interval
Discrete
Need for Binning
Course Topics
Supervised Techniques
Classification:
k-Nearest Neighbors, Naïve Bayes, Classification Trees
Discriminant Analysis, Logistic Regression, Neural Nets
Prediction (Estimation):
Regression, Regression Trees, k-Nearest Neighbors
Unsupervised Techniques
Cluster Analysis, Principal Components
Association Rules, Collaborative Filtering
Comparison of Data Mining Techniques for Large Data Sets - Guidelines (and only guidelines)
Ratings: H = high, M = medium, L = low (HM and ML are intermediate ratings).
Techniques compared: Neural Nets; Classification Trees; k-Nearest Neighbors; Logistic Regression; Discriminant Analysis; Naïve Bayes; Multiple Linear Regression.
Criteria rated for each technique: Accuracy; Interpretability; Speed of Training; Speed of Deployment; Effort in choice and transformation of independent variables; Effort to tune performance parameters; Robustness to outliers in independent variables; Robustness to irrelevant variables; Ease of handling missing values; Natural handling of both categorical and continuous variables.
Lecture 2
Classifiers
In this note we will examine the question of how to judge the usefulness of a classifier and how to
compare different classifiers. Not only do we have a wide choice of different types of classifiers
to choose from but within each type of classifier we have many options such as how many nearest
neighbors to use in a k-nearest neighbors classifier, the minimum number of cases we should
require in a leaf node in a tree classifier, which subsets of predictors to use in a logistic regression
model, and how many hidden layer neurons to use in a neural net.
A Two-class Classifier
Let us first look at a single classifier for two classes with options set at certain values. The two-class situation is certainly the most common and occurs very frequently in practice. We will extend our analysis to more than two classes later.
A natural criterion for judging the performance of a classifier is the probability that it makes a
misclassification. A classifier that makes no errors would be perfect but we do not expect to be
able to construct such classifiers in the real world due to noise and to not having all the
information needed to precisely classify cases. Is there a minimum probability of
misclassification we should require of a classifier?
Suppose that the two classes are denoted by C0 and C1. Let p(C0) and p(C1) be the apriori
probabilities that a case belongs to C0 and C1 respectively. The apriori probability is the
probability that a case belongs to a class without any more knowledge about it than that it belongs
to a population where the proportion of C0s is p(C0) and the proportion of C1s is p(C1) . In this
situation we will minimize the chance of a misclassification error by assigning class C1 to the
case if p(C1 ) > p(C0 ) and to C0 otherwise. The probability of making a misclassification would
be the minimum of p(C0) and p(C1). If we are using misclassification rate as our criterion any
classifier that uses predictor variables must have an error rate better than this.
What is the best performance we can expect from a classifier? Clearly the more training data
available to a classifier the more accurate it will be. Suppose we had a huge amount of training
data, would we then be able to build a classifier that makes no errors? The answer is no. The
accuracy of a classifier depends critically on how separated the classes are with respect to the
predictor variables that the classifier uses. We can use the well-known Bayes formula from
probability theory to derive the best performance we can expect from a classifier for a given set
of predictor variables if we had a very large amount of training data. Bayes' formula uses the
distributions of the decision variables in the two classes to give us a classifier that will have the
minimum error amongst all classifiers that use the same predictor variables. This classifier
follows the Minimum Error Bayes Rule.
Bayes Rule for Minimum Error
Let us take a simple situation where we have just one continuous predictor variable for
classification, say X. X is a random variable, since its value depends on the individual case we
sample from the population consisting of all possible cases of the class to which the case belongs.
Suppose that we have a very large training data set. Then the relative frequency histogram of the
variable X in each class would be almost identical to the probability density function (p.d.f.) of X
for that class. Let us assume that we have a huge amount of training data and so we know the
p.d.f.s accurately. These p.d.f.s are denoted f0(x) and f1(x) for classes C0 and C1 in Fig. 1 below.
[Figure 1: The probability density functions f0(x) and f1(x) of X for classes C0 and C1; the two curves cross at x = a.]
Now suppose we wish to classify an object for which the value of X is x0. Let us use Bayes
formula to predict the probability that the object belongs to class 1 conditional on the fact that it
has an X value of x0. Applying Bayes formula, the probability, denoted by p(C1|X = x0), is given by:

$$p(C_1 \mid X = x_0) = \frac{p(X = x_0 \mid C_1)\, p(C_1)}{p(X = x_0 \mid C_0)\, p(C_0) + p(X = x_0 \mid C_1)\, p(C_1)} = \frac{f_1(x_0)\, p(C_1)}{f_0(x_0)\, p(C_0) + f_1(x_0)\, p(C_1)}.$$
Notice that to calculate p(C1|X= x0) we need to know the apriori probabilities p(C0) and p(C1).
Since there are only two possible classes, if we know p(C1) we can always compute p(C0) because
p(C0) = 1 - p(C1). The apriori probability p(C1) is the probability that an object belongs to C1
without any knowledge of the value of X associated with it. Bayes formula enables us to update
this apriori probability to the aposteriori probability: the probability that the object belongs to C1 given that its X value is x0. When the two classes are equally likely apriori, the formula implies that p(C1|X = x0) > p(C0|X = x0) if f1(x0) > f0(x0). This means that if x0 is greater than a, and we classify the object as belonging to C1, we will make a smaller misclassification error than if we classified it as belonging to C0.
[Figure 2: The curves f1(x) and 2·f0(x); when C0 is twice as likely apriori as C1 the classification boundary moves from a to a point b to the right of a.]
What if the prior class probabilities were not the same? Suppose C0 is twice as likely apriori as
C1. Then the formula says that p(C1 | X = x0 ) > p(C0 | X = x0 ) if f 1 (x0 ) > 2 f 0 ( x0 ) . The new
boundary value, b for classification will be to the right of a as shown in Fig.2. This is intuitively
what we would expect: if a class is more likely apriori, we would expect the cut-off to move in the direction that enlarges the region assigned to that class.
In general we will minimize the misclassification error rate if we classify a case as belonging to C1 if p(C1) f1(x0) > p(C0) f0(x0), and to C0 otherwise. This rule holds even when X is a vector consisting of several components, each of which is a random variable. In the remainder of this note X will, in general, be such a vector. Besides giving us a classification, Bayes Rule also lets us compute the conditional probability that the case belongs to each class. This has two advantages.
First, we can use this probability as a score for each case that we are classifying. The score
enables us to rank cases that we have predicted as belonging to a class in order of confidence that
we have made a correct classification. This capability is important in developing a lift curve
(explained later) that is important for many practical data mining applications.
Second, it enables us to compute the expected profit or loss for a given case. This gives us a
better decision criterion than misclassification error when the loss due to error is different for the
two classes.
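To make the rule concrete, here is a minimal sketch in Python of the minimum-error Bayes rule just described. The normal class-conditional densities and the prior of 0.3 are made-up illustrations; in practice f0, f1 and the priors would be estimated from training data.

    from scipy.stats import norm

    # Illustrative assumption: class-conditional densities are normal with known parameters
    f0 = norm(loc=0.0, scale=1.0).pdf   # density of X in class C0
    f1 = norm(loc=2.0, scale=1.0).pdf   # density of X in class C1

    p_C1 = 0.3          # assumed apriori probability of class C1
    p_C0 = 1.0 - p_C1

    def posterior_C1(x0):
        """Aposteriori probability p(C1 | X = x0) from Bayes formula."""
        num = f1(x0) * p_C1
        den = f0(x0) * p_C0 + f1(x0) * p_C1
        return num / den

    def classify(x0):
        """Minimum-error Bayes rule: choose C1 if p(C1) f1(x0) > p(C0) f0(x0)."""
        return 1 if p_C1 * f1(x0) > p_C0 * f0(x0) else 0

    print(posterior_C1(1.2), classify(1.2))

The posterior returned by posterior_C1 is exactly the score that can be used to rank cases, as described above.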
In practice, we can estimate p(C1 ) and p(C0 ) from the data we are using to build the classifier
by simply computing the proportion of cases that belong to each class. Of course, these are
estimates and they can be incorrect, but if we have a large enough data set and neither class is
very rare our estimates will be reliable. Sometimes, we may be able to use public data such as
census data to estimate these proportions. However, in most practical business settings we will
not know f 1 (x) and f 0 ( x) . If we want to apply Bayes Rule we will need to estimate these
density functions in some way. Many classification methods can be interpreted as being methods
for estimating such density functions1. In practice X will almost always be a vector. This makes
the task difficult, and subject to the curse of dimensionality that we referred to when discussing the k-nearest neighbors technique.
To obtain an honest estimate of classification error, let us suppose that we have partitioned a data
set into training and validation data sets by random selection of cases. Let us assume that we have
constructed a classifier using the training data. When we apply it to the validation data, we will
classify each case into C0 or C1. The resulting misclassification errors can be displayed in what is
known as a confusion table, with rows and columns corresponding to the true and predicted
classes respectively. We can summarize our results in a confusion table for training data in a
similar fashion. The resulting confusion table will not give us an honest estimate of the
misclassification rate due to over-fitting. However such a table will be useful to signal over-
fitting when it has substantially lower misclassification rates than the confusion table for
validation data.
Confusion Table (Validation Cases)

                      Predicted Class
True Class           C0          C1
C0                   N00         N01
C1                   N10         N11
If we denote the number in the cell at row i and column j by Nij, the estimated misclassification rate is Err = (N01 + N10) / Nval, where Nval = N00 + N01 + N10 + N11 is the total number of cases in the validation data set. If Nval is reasonably large, our estimate of the misclassification rate is probably quite accurate. We can compute a confidence interval for Err using the standard formula for estimating a population proportion from a random sample.

(Footnote 1: There are classifiers that focus on simply finding the boundary between the regions to predict each class without being concerned with estimating the density of cases within each region, for example Support Vector Machines.)
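A small sketch of the Err calculation and its 99% confidence interval; the four cell counts below are hypothetical and serve only to show the arithmetic.

    import math

    # Hypothetical confusion-table counts for the validation data
    N00, N01, N10, N11 = 920, 40, 55, 985

    N_val = N00 + N01 + N10 + N11
    err = (N01 + N10) / N_val                    # estimated misclassification rate
    se = math.sqrt(err * (1 - err) / N_val)      # standard error of a proportion
    z = 2.576                                    # multiplier for a 99% confidence interval
    print(f"Err = {err:.4f}, 99% CI = ({err - z*se:.4f}, {err + z*se:.4f})")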
The table below gives an idea of how the accuracy of the estimate varies with Nval . The column
headings are values of the misclassification rate and the rows give the desired accuracy in
estimating the misclassification rate as measured by the half-width of the confidence interval at
the 99% confidence level. For example, if we think that the true misclassification rate is likely to
be around 0.05 and we want to be 99% confident that Err is within 0.01 of the true
misclassification rate, we need to have a validation data set with 3,152 cases.
                       True misclassification rate
Half-width     0.01    0.05    0.10    0.15    0.20    0.30    0.40    0.50
0.025           250     504     956   1,354   1,699   2,230   2,548   2,654
0.010           657   3,152   5,972   8,461  10,617  13,935  15,926  16,589
0.005         2,628  12,608  23,889  33,842  42,469  55,741  63,703  66,358
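A rough sketch of how entries like these follow from the normal-approximation formula for a proportion, n = z^2 p(1-p)/d^2, with z about 2.576 for 99% confidence:

    import math

    def validation_size(p, half_width, z=2.576):
        """Approximate validation-set size so that the 99% confidence interval
        for a misclassification rate p has the requested half-width."""
        return math.ceil(z**2 * p * (1 - p) / half_width**2)

    print(validation_size(0.05, 0.01))   # roughly 3,152, matching the table above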
Note that we are assuming that the cost (or benefit) of making correct classifications is zero. At
first glance, this may seem incomplete. After all, the benefit (negative cost) of correctly
classifying a buyer as a buyer would seem substantial. And, in other circumstances (e.g.
calculating the expected profit from having a new mailing list), it will be appropriate to consider
the actual net dollar impact of classifying each case on the list. Here, however, we are attempting
to assess the value of a classifier in terms of misclassifications, so it greatly simplifies matters if
we can capture all cost/benefit information in the misclassification cells. So, instead of recording
the benefit of correctly classifying a buyer, we record the cost of failing to classify him as a
buyer. It amounts to the same thing. In fact the costs we are using are the opportunity costs.
When there are more than two classes, say C1, C2, ..., Ck, Bayes formula gives the aposteriori probability of class Cj as

$$p(C_j \mid X = x_0) = \frac{f_j(x_0)\, p(C_j)}{\sum_{i=1}^{k} f_i(x_0)\, p(C_i)}.$$

The confusion table now has k rows and k columns. The opportunity cost associated with the diagonal cells is always zero. If the costs are asymmetric the Bayes Risk Classifier follows the rule: classify a case into the class with the smallest expected misclassification cost; for two classes this means classifying a case as C1 if p(C1) f1(x0) C(~1|1) > p(C0) f0(x0) C(~0|0), where C(~j|j) is the cost of misclassifying a case that belongs to Cj to any other class Ci, i ≠ j.
The table below lists, for 30 cases, the predicted log-odds of success, the corresponding predicted probability of success, and the actual value of HICLASS, the binary dependent variable (the predicted probabilities here come from a logistic regression model).

Observation   Predicted Log-odds   Predicted Prob.   Actual Value
              of Success           of Success        of HICLASS
 1               3.5993              0.9734            1
 2              -6.5073              0.0015            0
 3               0.4061              0.6002            0
 4             -14.2910              0.0000            0
 5               4.5273              0.9893            1
 6              -1.2916              0.2156            0
 7             -37.6119              0.0000            0
 8              -1.1157              0.2468            0
 9              -4.3290              0.0130            0
10             -24.5364              0.0000            0
11             -21.6854              0.0000            0
12             -19.8654              0.0000            0
13             -13.1040              0.0000            0
14               4.4472              0.9884            1
15               3.5294              0.9715            1
16               3.6381              0.9744            1
17              -2.6806              0.0641            0
18              -0.0402              0.4900            0
19             -10.0750              0.0000            0
20             -10.2859              0.0000            0
21             -14.6084              0.0000            0
22               8.9016              0.9999            1
23               0.0874              0.5218            0
24              -6.0590              0.0023            1
25              -1.9183              0.1281            1
26             -13.2349              0.0000            0
27              -9.6509              0.0001            0
28             -13.4562              0.0000            0
29             -13.9340              0.0000            0
30               1.7257              0.8489            1
The same 30 cases are shown below sorted in descending order of the predicted probability of being a HICLASS=1 case.

Observation   Predicted Log-odds   Predicted Prob.   Actual Value
              of Success           of Success        of HICLASS
22               8.9016              0.9999            1
 5               4.5273              0.9893            1
14               4.4472              0.9884            1
16               3.6381              0.9744            1
 1               3.5993              0.9734            1
15               3.5294              0.9715            1
30               1.7257              0.8489            1
 3               0.4061              0.6002            0
23               0.0874              0.5218            0
18              -0.0402              0.4900            0
 8              -1.1157              0.2468            0
 6              -1.2916              0.2156            0
25              -1.9183              0.1281            1
17              -2.6806              0.0641            0
 9              -4.3290              0.0130            0
24              -6.0590              0.0023            1
 2              -6.5073              0.0015            0
27              -9.6509              0.0001            0
19             -10.0750              0.0000            0
20             -10.2859              0.0000            0
13             -13.1040              0.0000            0
26             -13.2349              0.0000            0
28             -13.4562              0.0000            0
29             -13.9340              0.0000            0
 4             -14.2910              0.0000            0
21             -14.6084              0.0000            0
12             -19.8654              0.0000            0
11             -21.6854              0.0000            0
10             -24.5364              0.0000            0
 7             -37.6119              0.0000            0
First, we need to set a cutoff probability value, above which we will consider a case to be a positive ("1") and below which we will consider it to be a negative ("0"). For any given cutoff level we can then compute the corresponding confusion table from the sorted cases. For example, if we use a cutoff probability of 0.400, we will predict 10 positives (7 true positives and 3 false positives); we will also predict 20 negatives (18 true negatives and 2 false negatives). Instead of looking at a large number of confusion tables, it is much more convenient to look at the cumulative lift curve (sometimes called a gains chart), which
summarizes all the information in these multiple confusion tables into a graph. The graph is
constructed with the cumulative number of cases (in descending order of probability) on the x
axis and the cumulative number of true positives on the y axis as shown below.
Probability   Predicted Prob.   Actual Value   Cumulative Actual
Rank          of Success        of HICLASS     Value of HICLASS
 1              0.9999             1               1
 2              0.9893             1               2
 3              0.9884             1               3
 4              0.9744             1               4
 5              0.9734             1               5
 6              0.9715             1               6
 7              0.8489             1               7
 8              0.6002             0               7
 9              0.5218             0               7
10              0.4900             0               7
11              0.2468             0               7
12              0.2156             0               7
13              0.1281             1               8
14              0.0641             0               8
15              0.0130             0               8
16              0.0023             1               9
17              0.0015             0               9
18              0.0001             0               9
19              0.0000             0               9
20              0.0000             0               9
21              0.0000             0               9
22              0.0000             0               9
23              0.0000             0               9
24              0.0000             0               9
25              0.0000             0               9
26              0.0000             0               9
27              0.0000             0               9
28              0.0000             0               9
29              0.0000             0               9
30              0.0000             0               9
11
Cumulative positives
Cumulative Lift
10
9
8
7
6
5
4
3
2
1
0
0
10
15
20
25
30
35
Score rank
The line joining the points (0,0) to (30,9) is a reference line. It represents the expected number of positives we would predict if we did not have a model but simply selected cases at random. It provides a benchmark against which we can assess the performance of the model. If we had to choose 10 neighborhoods as HICLASS=1 neighborhoods and used our model to pick the ones most likely to be "1"s, the lift curve tells us that we would be right for about 7 of them. If we simply selected 10 cases at random we would expect to be right for 10 × 9/30 = 3 cases. The model therefore gives us a "lift" in predicting HICLASS of 7/3 = 2.33. The lift will vary with the number of cases we choose to act on. A good classifier will give us a high lift when we act on only a few cases (i.e. use the predictions for the cases at the top). As we include more cases the lift will decrease. The lift curve for the best possible classifier is shown as a broken line.
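As a sketch, the cumulative lift curve can be computed directly from predicted probabilities and actual class labels; the short arrays below are made-up placeholders standing in for the 30 sorted cases discussed above.

    import numpy as np

    # Placeholder inputs: predicted probabilities and actual 0/1 labels
    prob = np.array([0.9999, 0.9893, 0.9884, 0.60, 0.52, 0.13, 0.01])
    actual = np.array([1, 1, 1, 0, 0, 1, 0])

    order = np.argsort(-prob)                    # sort cases by descending probability
    cum_positives = np.cumsum(actual[order])     # y axis of the lift curve
    n_cases = np.arange(1, len(prob) + 1)        # x axis: number of cases acted on
    baseline = n_cases * actual.mean()           # reference line: random selection

    lift = cum_positives / baseline              # lift at each depth
    print(cum_positives, np.round(lift, 2))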
XLMiner automatically creates lift charts from probabilities predicted by logistic regression for
both training and validation data. The charts created for the full Boston Housing data are shown
below.
[Charts: Cumulative lift charts for the Boston Housing data, for the training and validation partitions; each plots cumulative HICLASS=1 cases against the number of cases, together with the "Cumulative HICLASS using average" reference line.]
It is worth mentioning that a curve that captures the same information as the lift curve in a
slightly different manner is also popular in data mining applications. This is the ROC (short for
Receiver Operating Characteristic) curve. It uses the same variable on the y axis as the lift curve
(but expressed as a percentage of the maximum) and on the x axis it shows the false positives
(also expressed as a percentage of the maximum) for differing cut-off levels.
The ROC curve for our 30 cases example above is shown below.
[Chart: ROC curve for the 30-case example; true positives (as a percentage of the maximum) on the y axis against false positives (as a percentage of the maximum) on the x axis, both running from 0% to 100%.]
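A minimal sketch of computing the points of an ROC curve by sweeping the cutoff over the sorted cases (using the same made-up arrays as in the lift-curve sketch above):

    import numpy as np

    prob = np.array([0.9999, 0.9893, 0.9884, 0.60, 0.52, 0.13, 0.01])
    actual = np.array([1, 1, 1, 0, 0, 1, 0])

    order = np.argsort(-prob)
    y = actual[order]
    tp = np.cumsum(y)                 # true positives at each cutoff depth
    fp = np.cumsum(1 - y)             # false positives at each cutoff depth
    tpr = tp / y.sum()                # true positives as a fraction of the maximum
    fpr = fp / (len(y) - y.sum())     # false positives as a fraction of the maximum
    print(np.round(fpr, 2), np.round(tpr, 2))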
[Figure: The densities f0(x) and f1(x), with the grey area of greatest classification doubt around their crossing point a.]

Clearly the grey area of greatest doubt in classification is the area around a. At a the ratio of the conditional probabilities of belonging to the two classes is one. A sensible way to define the grey area is as the set of x values such that

$$t > \frac{p(C_1)\, f_1(x_0)}{p(C_0)\, f_0(x_0)} > 1/t,$$

where t is a threshold for the ratio. A typical value of t may be in the range 1.05 to 1.2.
Lecture 3
Classification Trees
We denote the dependent (categorical) variable by y and the independent variables by x1, x2, x3, ..., xp. Recursive partitioning divides up the p-dimensional space of the x variables into non-overlapping rectangles. This division is accomplished recursively. First one of the variables is selected, say xi, and a value of xi, say si, is chosen to split the p-dimensional space into two parts: one part is the p-dimensional hyper-rectangle that contains all the points with xi ≤ si, and the other part is the hyper-rectangle with all the points with xi > si. Then one of these two parts is divided in a similar manner by choosing a variable again (it could be xi or another variable) and a split value for the variable. This results in three rectangular regions. (From here onwards we refer to hyper-rectangles simply as rectangles.) This process is continued so that we get smaller and smaller rectangles. The idea is to divide the entire x-space up into rectangles such that each rectangle contains points that belong to just one class. (Of course, this is not always possible, as there may be points that belong to different classes but have exactly the same values for every one of the independent variables.)
A riding-mower manufacturer would like to find a way of classifying families in a city into those
that are likely to purchase a riding mower and those who are not likely to buy one. A pilot
random sample of 12 owners and 12 non-owners in the city is undertaken. The data are shown in
Table I and plotted in Figure 1 below. The independent variables here are Income (x1) and Lot
Size (x2). The categorical y variable has two classes: owners and non-owners.
Table 1

Observation   Income      Lot Size          Owners=1,
              ($ 000's)   (000's sq. ft.)   Non-owners=2
 1              60          18.4              1
 2              85.5        16.8              1
 3              64.8        21.6              1
 4              61.5        20.8              1
 5              87          23.6              1
 6             110.1        19.2              1
 7             108          17.6              1
 8              82.8        22.4              1
 9              69          20                1
10              93          20.8              1
11              51          22                1
12              81          20                1
13              75          19.6              2
14              52.8        20.8              2
15              64.8        17.2              2
16              43.2        20.4              2
17              84          17.6              2
18              49.2        17.6              2
19              59.4        16                2
20              66          18.4              2
21              47.4        16.4              2
22              33          18.8              2
23              51          14                2
24              63          14.8              2
[Figure 1: Scatter plot of Lot Size (000's sq. ft.) against Income ($ 000's) for the 12 owners and 12 non-owners.]
If we apply CART to this data it will choose x2 for the first split, with a splitting value of 19. The (x1, x2) space is now divided into two rectangles, one with the Lot Size variable x2 ≤ 19 and the other with x2 > 19. See Figure 2.
[Figure 2: The first split; the (Income, Lot Size) space is divided into two rectangles at Lot Size = 19.]
Notice how the split has created two rectangles, each of which is much more homogeneous than the rectangle before the split. The upper rectangle contains points that are mostly owners (9 owners and 3 non-owners) while the lower rectangle contains mostly non-owners (9 non-owners and 3 owners).
How did CART decide on this particular split? It examined each variable and all possible split values for each variable to find the best split. What are the possible split values for a variable? They are simply the mid-points between pairs of consecutive values for the variable. The possible split points for x1 are {38.1, 45.3, 50.1, ..., 109.5} and those for x2 are {14.4, 15.4, 16.2, ..., 23}. These split points are ranked according to how much they reduce impurity (heterogeneity of composition). The reduction in impurity is defined as the impurity of the rectangle before the split minus the sum of the impurities for the two rectangles that result from the split. There are a number of ways we could measure impurity. We will describe the most popular measure of impurity: the Gini index. If we denote the classes by k, k = 1, 2, ..., C, where C is the total number of classes for the y variable, the Gini impurity index for a rectangle A is defined by

$$I(A) = 1 - \sum_{k=1}^{C} p_k^2,$$

where pk is the proportion of cases in rectangle A that belong to class k.
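A small sketch (in Python) of the impurity computation CART performs when ranking candidate splits. The data are the 24 mower observations from Table 1; for brevity only the Lot Size variable is searched here.

    import numpy as np

    # Lot Size (x2) and class (1 = owner, 2 = non-owner) for the 24 observations in Table 1
    lot_size = np.array([18.4, 16.8, 21.6, 20.8, 23.6, 19.2, 17.6, 22.4, 20.0, 20.8, 22.0, 20.0,
                         19.6, 20.8, 17.2, 20.4, 17.6, 17.6, 16.0, 18.4, 16.4, 18.8, 14.0, 14.8])
    owner = np.array([1]*12 + [2]*12)

    def gini(labels):
        """Gini impurity I(A) = 1 - sum_k p_k^2 for the cases in a rectangle."""
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return 1.0 - np.sum(p**2)

    # Candidate split points: mid-points between consecutive distinct values
    values = np.unique(lot_size)
    splits = (values[:-1] + values[1:]) / 2

    best = None
    for s in splits:
        left, right = owner[lot_size <= s], owner[lot_size > s]
        # total impurity after the split, weighted by rectangle size
        after = (len(left) * gini(left) + len(right) * gini(right)) / len(owner)
        reduction = gini(owner) - after
        if best is None or reduction > best[1]:
            best = (s, reduction)

    print("best Lot Size split:", best)   # should be at 19, as in Figure 2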
[Figure 3: The rectangles after the next split (Lot Size versus Income, owners and non-owners).]
[Figure 4: The rectangles after further splits (Lot Size versus Income, owners and non-owners).]
We can see how the recursive partitioning is refining the set of constituent rectangles to become
purer as the algorithm proceeds. The final stage of the recursive partitioning is shown in Figure 5.
[Figure 5: The final stage of the recursive partitioning; each rectangle now contains points of only one class.]
Notice that now each rectangle is pure: it contains data points from just one of the two classes.
The reason the method is called a classification tree algorithm is that each split can be depicted as
a split of a node into two successor nodes. The first split is shown as a branching of the root node
of a tree in Figure 6.
[Figure 6: The first split shown as the branching of the root node: the root node splits on Lot Size at the value 19, sending 12 observations down each branch.]
The tree representing the first three splits is shown in Figure 7 below.
[Figure 7: The tree after the first three splits: the root node splits on Lot Size at 19, and each of its two successor nodes then splits on Income (at the values 57.15 and 84.75).]
The full tree is shown in Figure 8 below. We have represented the nodes that have successors by
circles. The numbers inside the circle are the splitting values and the name of the variable chosen
for splitting at that node is shown below the node. The numbers on the left fork at a decision node
shows the number of points in the decision node that had values less than or equal to the splitting
value while the number on the right fork shows the number that had a greater value. These are
called decision nodes because if we were to use a tree to classify a new observation for which we
knew only the values of the independent variables we would drop the observation down the tree
in such a way that at each decision node the appropriate branch is taken until we get to a node that
has no successors. Such terminal nodes are called the leaves of the tree. Each leaf node
corresponds to one of the final rectangles into which the x-space is partitioned and is depicted as
a rectangle-shaped node. When the observation has dropped down all the way to a leaf we can
predict a class for it by simply taking a vote of all the training data that belonged to the leaf
when the tree was grown. The class with the highest vote is the class that we would predict for the
new observation. The number below the leaf node is the class with the most votes in the
rectangle. The % value in a leaf node shows the percentage of the total number of training
observations that belonged to that node. It is useful to note that the type of trees grown by CART
(called binary trees) have the property that the number of leaf nodes is exactly one more than the
number of decision nodes.
[Figure 8: The full tree. Circles are decision nodes: the number inside a circle is the splitting value and the splitting variable (Lot Size or Income) is named below the node, with the numbers on the left and right forks giving the number of cases with values less than or equal to, and greater than, the splitting value. Rectangles are leaf nodes; each shows the winning class and the percentage of training observations in that leaf.]
Pruning
The second key idea in the CART procedure, that of using independent validation data to prune back the tree grown from the training data, was the real innovation. Previously, methods had been developed that were based on the idea of recursive partitioning, but they used rules to prevent the tree from growing excessively and over-fitting the training data. For example, CHAID (Chi-Squared Automatic Interaction Detection) is a recursive partitioning method that predates CART by several years and is widely used in database marketing applications to this day. It uses a well-known statistical test (the chi-square test for independence) to assess whether splitting a node improves the purity by a statistically significant amount. If the test does not show a significant improvement, the split is not carried out. By contrast, CART uses validation data to prune back the tree that has been deliberately overgrown using the training data.
The idea behind pruning is to recognize that a very large tree is likely to be over-fitting the
training data. In our example, the last few splits resulted in rectangles with very few points
(indeed four rectangles in the full tree have just one point). We can see intuitively that these last
splits are likely to be simply capturing noise in the training set rather than reflecting patterns that
would occur in future data such as the validation data. Pruning consists of successively selecting a decision node and re-designating it as a leaf node, thereby lopping off the branches extending beyond that decision node (its subtree) and reducing the size of the tree. The pruning
process trades off misclassification error in the validation data set against the number of decision
nodes in the pruned tree to arrive at a tree that captures the patterns but not the noise in the
training data. It uses a criterion called the cost complexity of a tree to generate a sequence of
trees which are successively smaller to the point of having a tree with just the root node. (What is
the classification rule for a tree with just one node?). We then pick as our best tree the one tree in
the sequence that gives the smallest misclassification error in the validation data.
The cost complexity criterion that CART uses is simply the misclassification error of a tree
(based on the validation data) plus a penalty factor for the size of the tree. The penalty factor is
based on a parameter, let us call it a, that is the per node penalty. The cost complexity criterion
for a tree is thus Err(T) + a |L(T)| where Err(T) is the fraction of validation data observations that
are misclassified by tree T, L(T) is the number of leaves in tree T and a is the per node penalty
cost: a number that we will vary upwards from zero. When a = 0 there is no penalty for having
too many nodes in a tree and the best tree using the cost complexity criterion is the full grown
unpruned tree. When we increase a to a very large value the penalty cost component swamps the
misclassification error component of the cost complexity criterion function and the best tree is
simply the tree with the fewest leaves, namely the tree consisting of just one node. As we increase the value of a from zero, at some value we will first encounter a situation where, for some tree T1 formed by cutting off the subtree at a decision node, we just balance the extra cost of increased misclassification error (due to fewer leaves) against the penalty cost saved from having fewer leaves. We prune the full tree at this decision node by cutting off its subtree and re-designating this decision node as a leaf node. Let's call this tree T1. We now repeat the logic that we had applied
previously to the full tree, with the new tree T1 by further increasing the value of a. Continuing in
this manner we generate a succession of trees with diminishing number of nodes all the way to
the trivial tree consisting of just one node.
From this sequence of trees it seems natural to pick the one that gave the minimum
misclassification error on the validation data set. We call this the Minimum Error Tree.
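As a sketch of the same grow-then-prune idea with widely available tools (the note uses XLMiner; scikit-learn and its built-in iris data are used here purely as stand-ins), one can grow a large tree, generate the cost-complexity pruning sequence, and pick the subtree with the smallest validation error:

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.datasets import load_iris   # stand-in data set for illustration

    X, y = load_iris(return_X_y=True)
    X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.4, random_state=1)

    # Grow a deliberately overgrown tree, then get the sequence of cost-complexity penalties
    path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(X_train, y_train)

    best_tree, best_err = None, 1.0
    for a in path.ccp_alphas:
        tree = DecisionTreeClassifier(random_state=1, ccp_alpha=a).fit(X_train, y_train)
        err = 1.0 - tree.score(X_valid, y_valid)      # validation misclassification rate
        if err < best_err:
            best_tree, best_err = tree, err

    print("validation error of selected tree:", round(best_err, 4),
          "leaves:", best_tree.get_n_leaves())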
Let us use the Boston Housing data to illustrate. Shown below is the output that XLMiner generates when it uses the training data in the tree-growing phase of the algorithm.
Training Log
Growing the Tree

#Nodes   % Error      #Nodes   % Error
 0        36.18        16        0.25
 1        15.64        17        0.22
 2         5.75        18        0.21
 3         3.29        19        0.15
 4         2.94        20        0.09
 5         1.88        21        0.09
 6         1.42        22        0.09
 7         1.26        23        0.08
 8         1.20        24        0.05
 9         0.63        25        0.03
10         0.59        26        0.03
11         0.49        27        0.02
12         0.42        28        0.01
13         0.35        29        0.00
14         0.34        30        0.00
15         0.32

Confusion Table (Training Cases)
                   Predicted Class
True Class        1        2        3
1                59        0        0
2                 0      194        0
3                 0        0       51

Error Report (these are the cases in the training data)
Class      # Cases   # Errors   % Error
1            59          0        0.00
2           194          0        0.00
3            51          0        0.00
Overall     304          0        0.00
The top table logs the tree-growing phase by showing in each row the number of decision nodes
in the tree at each stage and the corresponding (percentage) misclassification error for the training
data applying the voting rule at the leaves. We see that the error steadily decreases as the number
of decision nodes increases from zero (where the tree consists of just the root node) to thirty. The
error drops steeply in the beginning, going from 36% to 3% with just an increase of decision
nodes from 0 to 3. Thereafter the improvement is slower as we increase the size of the tree.
Finally we stop at a full tree of 30 decision nodes (equivalently, 31 leaves) with no error in the
training data, as is also shown in the confusion table and the error report by class.
The output generated by XLMiner during the pruning phase is shown below.
# Decision   Training   Validation
Nodes        Error      Error
30            0.00%      15.84%
29            0.00%      15.84%
28            0.01%      15.84%
27            0.02%      15.84%
26            0.03%      15.84%
25            0.03%      15.84%
24            0.05%      15.84%
23            0.08%      15.84%
22            0.09%      15.84%
21            0.09%      16.34%
20            0.09%      15.84%
19            0.15%      15.84%
18            0.21%      15.84%
17            0.22%      15.84%
16            0.25%      15.84%
15            0.32%      15.84%
14            0.34%      15.35%
13            0.35%      14.85%
12            0.42%      14.85%
11            0.49%      15.35%
10            0.59%      14.85%
 9            0.63%
 8            1.20%      15.84%
 7            1.26%      16.83%
 6            1.42%      16.83%
 5            1.88%      15.84%   <-- Best Prune
 4            2.94%      21.78%
 3            3.29%      21.78%
 2            5.75%      30.20%
 1           15.64%      33.66%

Std. Err.   0.02501957
Notice now that as the number of decision nodes decreases, the error in the validation data has a slow decreasing trend (with some fluctuation), down to a 14.85% error rate for the tree with 10 decision nodes. This is more readily visible from the graph below. Thereafter the error increases, going up sharply when the tree is quite small. The Minimum Error Tree is selected to be the one with 10 decision nodes (why not the one with 13 decision nodes?).
[Chart: Training and validation % error plotted against the number of decision nodes (0 to 30); the validation error is lowest around 10 decision nodes while the training error falls steadily to zero.]
[Figure 9: The Minimum Error Tree (10 decision nodes) for the Boston Housing data; the splits involve the variables LSTAT, RM, CRIM, RAD, PTRATIO and AGE.]
You will notice that the XLMiner output from the pruning phase highlights another tree besides the Minimum Error Tree. This is the Best Pruned Tree, the tree with 5 decision nodes. The reason this tree is important is that it is the smallest tree in the pruning sequence whose error is within one standard error of that of the Minimum Error Tree. The estimate of error that we get from the validation data is just that: an estimate. If we had had another set of validation data, the minimum error would have been different. The minimum error rate we have computed can be viewed as an observed value of a random variable with standard error (estimated standard deviation) equal to

$$\sqrt{\frac{E_{min}(1 - E_{min})}{N_{val}}},$$

where Emin is the error rate (as a fraction) for the minimum error tree and Nval is the number of observations in the validation data set. For our example Emin = 0.1485 and Nval = 202, so that the standard error is 0.025. The Best Pruned Tree is shown in Figure 10.
[Figure 10: The Best Pruned Tree (5 decision nodes); the splits involve LSTAT, RM and CRIM.]
We show the confusion table and summary of classification errors for the Best Pruned Tree
below.
[Confusion table for the Best Pruned Tree on the validation cases: the correctly classified (diagonal) counts are 25 for class 1, 120 for class 2 and 25 for class 3.]

Error Report (Validation Cases)
Class      # Cases   # Errors   % Error
1            35         10       28.57
2           134         14       10.45
3            33          8       24.24
Overall     202         32       15.84
It is important to note that since we have used the validation data in building the classifier (to prune the tree), strictly speaking it is not fair to compare the above error rates directly with those of other classification procedures that use only the training data to construct classification rules. A fair comparison would be to further partition the training data used for the other procedures (TDother) into training (TDtree) and validation (VDtree) partitions for the classification tree. The error rate of the classification tree created by using TDtree to grow the tree and VDtree to prune it can then be compared with the other classifiers using the validation data of the other procedures (VDother) as new hold-out data.
Many implementations of classification trees also incorporate an importance ranking for the variables in terms of their impact on the quality of the classification.
Notes:
1. We have not described how categorical independent variables are handled in CART. In principle there is no difficulty. The split choices for a categorical variable are all the ways in which the set of categorical values can be divided into two subsets. For example a categorical variable with 4 categories, say {1,2,3,4}, can be split in 7 ways into two subsets: {1} and {2,3,4}; {2} and {1,3,4}; {3} and {1,2,4}; {4} and {1,2,3}; {1,2} and {3,4}; {1,3} and {2,4}; {1,4} and {2,3} (a sketch of enumerating such splits follows these notes). When the number of categories is large the number of splits becomes very large. XLMiner supports only binary categorical variables (coded as numbers). If you have a categorical independent variable that takes more than two values, you will need to replace the variable with several dummy variables, each of which is binary, in a manner that is identical to the use of dummy variables in regression.
2. There are several variations on the basic recursive partitioning scheme described above.
A common variation is to permit splitting of the x-variable space using straight lines (planes for p = 3 and hyperplanes for p > 3) that are not perpendicular to the coordinate axes. This can result in a full tree that is pure with far fewer nodes, particularly when the classes lend themselves to linear separation functions. However this improvement comes at a price. First, the simplicity of interpretation of classification trees is lost because we now split on the basis of weighted sums of independent variables, where the weights may not be easy to interpret. Second, the important property remarked on earlier, that trees are invariant to monotone transformations of the independent variables, is no longer true.
3. Besides CHAID, another popular tree classification method is ID3 (and its successor C4.5). This method was developed by Quinlan, a leading researcher in machine learning, and is popular with developers of classifiers who come from a background in machine learning.
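A minimal sketch of the enumeration mentioned in note 1: for C categories there are 2^(C-1) - 1 ways to split them into two non-empty subsets.

    from itertools import combinations

    def binary_splits(categories):
        """All ways of dividing a set of category values into two non-empty subsets."""
        cats = list(categories)
        splits = []
        # Fix the first category on the left side to avoid counting mirror-image splits twice
        for r in range(0, len(cats)):
            for combo in combinations(cats[1:], r):
                left = {cats[0], *combo}
                right = set(cats) - left
                if right:
                    splits.append((left, right))
        return splits

    print(binary_splits([1, 2, 3, 4]))           # 7 splits, matching the list in note 1
    print(len(binary_splits(list(range(10)))))   # 2**9 - 1 = 511 splits for 10 categories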
As usual, for regression trees we denote the dependent variable by y and the independent variables by x1, x2, x3, ..., xp. Both key ideas underlying classification trees carry over with small modifications to regression trees.
Lecture 4
Discriminant Analysis
Income      Lot Size          Owners=1,
($ 000's)   (000's sq. ft.)   Non-owners=2
 60           18.4              1
 85.5         16.8              1
 64.8         21.6              1
 61.5         20.8              1
 87           23.6              1
110.1         19.2              1
108           17.6              1
 82.8         22.4              1
 69           20                1
 93           20.8              1
 51           22                1
 81           20                1
 75           19.6              2
 52.8         20.8              2
 64.8         17.2              2
 43.2         20.4              2
 84           17.6              2
 49.2         17.6              2
 59.4         16                2
 66           18.4              2
 47.4         16.4              2
 33           18.8              2
 51           14                2
 63           14.8              2
[Figure 1: Lot Size (000's sq. ft.) versus Income ($ 000's) for owners and non-owners, with a straight line that separates most owners from most non-owners.]
We can think of a linear classification rule as a line that separates the x1-x2 region into two parts, where most of the owners are in one half-plane and most of the non-owners are in the complementary half-plane. A good classification rule would separate out the data so that the fewest points are misclassified: the line shown in Fig. 1 seems to do a good job in discriminating between the two groups, as it makes only 4 misclassifications out of 24 points. Can we do better?
We can obtain the linear classification functions suggested by Fisher using standard statistical software. You can use XLMiner to find Fisher's linear classification functions; Output 1 shows the results of invoking the discriminant routine.
Output 1

Prior Class Probabilities
Class   Probability
1       0.5
2       0.5

Classification Functions
Variables                   Function 1       Function 2
Constant                   -73.160202       -51.4214439
Income ($ 000's)             0.42958561       0.32935533
Lot Size (000's sq. ft.)     5.46674967       4.68156528

Canonical Variate
Variables                   Variate 1
Income ($ 000's)             0.01032889
Lot Size (000's sq. ft.)     0.08091455

[Confusion matrix: 11 of the 12 owners and 10 of the 12 non-owners are correctly classified.]

Error Report
Class      # Cases   # Errors   % Error
1            12          1        8.33
2            12          2       16.67
Overall      24          3       12.50
We note that it is possible to have a misclassification rate that is lower (3 in 24) by using the classification functions specified in the output. These functions are specified in a way that can easily be generalized to more than two classes. A family is classified into Class 1 of owners if Function 1 is higher than Function 2, and into Class 2 if the reverse is the case. The values given for the functions are simply the weights to be associated with each variable in the linear function, in a manner analogous to multiple linear regression. For example, the value of the classification function for Class 1 for the first observation is 53.20. This is calculated using the coefficients of classification function 1 shown in Output 1 above as -73.1602 + 0.4296 × 60 + 5.4667 × 18.4. XLMiner computes these functions for the observations in our dataset. The results are shown in Table 3 below.
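A small sketch of evaluating the two classification functions from Output 1 for observation 1 (Income 60, Lot Size 18.4):

    import numpy as np

    # Coefficients from Output 1: [constant, Income, Lot Size]
    func1 = np.array([-73.160202, 0.42958561, 5.46674967])    # class 1 (owners)
    func2 = np.array([-51.4214439, 0.32935533, 4.68156528])   # class 2 (non-owners)

    x = np.array([1.0, 60.0, 18.4])    # observation 1, with a leading 1 for the constant

    value1, value2 = func1 @ x, func2 @ x
    predicted_class = 1 if value1 > value2 else 2
    print(round(value1, 2), round(value2, 2), predicted_class)  # about 53.20, 54.48 -> class 2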
[Table 3: For each of the 24 observations, the values of classification function 1 and classification function 2, the larger of the two values, the resulting predicted class, and the input variables Income and Lot Size. For example, observation 1 (Income 60, Lot Size 18.4) has function values 53.20 and 54.48 and is therefore classified into Class 2.]
Notice that observations 1, 13 and 17 are misclassified, as we would expect from the output shown in Table 2.
Let us describe the reasoning behind Fisher's linear classification rules. Figure 2 depicts the logic.
[Figure 2: Lot Size versus Income for owners and non-owners, showing two candidate directions D1 and D2 and the projections P1 and P2 of the two group means onto direction D1.]
Consider various directions such as directions D1 and D2 shown in Figure 2. One way to
identify a good linear discriminant function is to choose amongst all possible directions
the one that has the property that when we project (drop a perpendicular line from) the
means of the two groups onto a line in the chosen direction the projections of the group
means (feet of the perpendiculars, e.g. P1 and P2 in direction D1) are separated by the
maximum possible distance. The means of the two groups are:
           Income    Area
Mean1       79.5     20.3
Mean2       57.4     17.6
We still need to decide how to measure the distance. We could simply use Euclidean
distance. This has two drawbacks. First, the distance would depend on the units we
choose to measure the variables. We will get different answers if we decided to measure
area in, say, square yards instead of thousands of square feet. Second, we would not be
taking any account of the correlation structure. This is often a very important
consideration especially when we are using many variables to separate groups. In this
case often there will be variables which by themselves are useful discriminators between
groups but in the presence of other variables are practically redundant as they capture the
same effects as the other variables.
Fisher's method gets over these objections by using a measure of distance that is a generalization of Euclidean distance known as Mahalanobis distance. This distance is defined with respect to a positive definite matrix Σ. The squared Mahalanobis distance between two p-dimensional (column) vectors y1 and y2 is (y1 − y2)′ Σ⁻¹ (y1 − y2), where Σ is a symmetric positive definite square matrix of dimension p. Notice that if Σ is the identity matrix the Mahalanobis distance is the same as the Euclidean distance. In linear discriminant analysis we use the pooled sample variance matrix of the different groups for Σ. If X1 and X2 are the n1 × p and n2 × p matrices of observations for groups 1 and 2, and the respective sample variance matrices are S1 and S2, the pooled matrix S is equal to {(n1 − 1) S1 + (n2 − 1) S2}/(n1 + n2 − 2). The matrix S defines the optimum direction (actually the eigenvector associated with its largest eigenvalue) that we referred to when we discussed the logic behind Figure 2. This choice of Mahalanobis distance can also be shown to be optimal* in the sense of minimizing the expected misclassification error when the variable values of the populations in the two groups (from which we have drawn our samples) follow a multivariate normal distribution with a common covariance matrix. In fact it is optimal for the larger family of elliptical distributions with equal variance-covariance matrices. In practice the robustness of the method is quite remarkable: even for situations that are only roughly normal it performs quite well.
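A sketch, using the mower data, of the quantities just described: the pooled sample variance matrix S and the squared Mahalanobis distance between the two group means.

    import numpy as np

    # Income and Lot Size for the 12 owners (group 1) and 12 non-owners (group 2)
    X1 = np.array([[60, 18.4], [85.5, 16.8], [64.8, 21.6], [61.5, 20.8], [87, 23.6], [110.1, 19.2],
                   [108, 17.6], [82.8, 22.4], [69, 20], [93, 20.8], [51, 22], [81, 20]])
    X2 = np.array([[75, 19.6], [52.8, 20.8], [64.8, 17.2], [43.2, 20.4], [84, 17.6], [49.2, 17.6],
                   [59.4, 16], [66, 18.4], [47.4, 16.4], [33, 18.8], [51, 14], [63, 14.8]])

    n1, n2 = len(X1), len(X2)
    S1, S2 = np.cov(X1, rowvar=False), np.cov(X2, rowvar=False)    # sample variance matrices
    S = ((n1 - 1) * S1 + (n2 - 1) * S2) / (n1 + n2 - 2)            # pooled variance matrix

    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)    # group means (about 79.5/20.3 and 57.4/17.6)
    d = m1 - m2
    mahalanobis_sq = d @ np.linalg.inv(S) @ d    # squared Mahalanobis distance between means
    print(np.round(m1, 1), np.round(m2, 1), round(float(mahalanobis_sq), 3))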
If we had a prospective customer list with data on income and area, we could use the
classification functions in Output 1 to identify the sub-list of families that are classified as
group 1. This sub-list would consist of owners (within the classification accuracy of our
functions) and therefore prospective purchasers of the product.
Classification Error
What accuracy should we expect from our classification functions? We have a training data error rate (often called the re-substitution error rate) of 12.5% in our example. However this is a biased, overly optimistic estimate, because we have used the same data for fitting the classification parameters as for estimating the error. In data mining applications we would randomly partition our data into training and validation subsets. We would use the training part to estimate the classification functions and hold out the validation part to get a more reliable, unbiased estimate of the classification error.
So far we have assumed that our objective is to minimize the classification error and that the chances of encountering an item from either group requiring classification are the same. If the probability of encountering an item for classification in the future is not equal for both groups, we should modify our functions to reduce our expected (long-run average) error rate. Also, we may not want to minimize the misclassification rate in certain situations. If the cost of mistakenly classifying a group 1 item as group 2 is very different from the cost of classifying a group 2 item as a group 1 item, we may want to minimize the expected cost of misclassification rather than an error rate that takes no account of unequal misclassification costs. It is simple to incorporate these situations into our framework for two classes. All we need to provide are estimates of the ratio of the prior probabilities of the two groups and of the ratio of the two misclassification costs.

* This is true asymptotically, i.e. for large training samples. Large training samples are required for S, the pooled sample variance matrix, to be a good estimate of the population variance matrix.
Example 2: Fisher's Iris Data. This is a classic example used by Fisher to illustrate his method for computing classification functions. The data consist of four measurements (sepal length and width, petal length and width) made on fifty flowers from each of three species of iris. A sample of the data is given in Table 4 below:
Table 4

OBS#   SPECIES           CLASSCODE   SEPLEN   SEPW   PETLEN   PETW
  1    Iris-setosa           1         5.1     3.5     1.4     0.2
  2    Iris-setosa           1         4.9     3       1.4     0.2
  3    Iris-setosa           1         4.7     3.2     1.3     0.2
  4    Iris-setosa           1         4.6     3.1     1.5     0.2
  5    Iris-setosa           1         5       3.6     1.4     0.2
  6    Iris-setosa           1         5.4     3.9     1.7     0.4
  7    Iris-setosa           1         4.6     3.4     1.4     0.3
  8    Iris-setosa           1         5       3.4     1.5     0.2
  9    Iris-setosa           1         4.4     2.9     1.4     0.2
 10    Iris-setosa           1         4.9     3.1     1.5     0.1
...
 51    Iris-versicolor       2         7       3.2     4.7     1.4
 52    Iris-versicolor       2         6.4     3.2     4.5     1.5
 53    Iris-versicolor       2         6.9     3.1     4.9     1.5
 54    Iris-versicolor       2         5.5     2.3     4       1.3
 55    Iris-versicolor       2         6.5     2.8     4.6     1.5
 56    Iris-versicolor       2         5.7     2.8     4.5     1.3
 57    Iris-versicolor       2         6.3     3.3     4.7     1.6
 58    Iris-versicolor       2         4.9     2.4     3.3     1
 59    Iris-versicolor       2         6.6     2.9     4.6     1.3
 60    Iris-versicolor       2         5.2     2.7     3.9     1.4
...
101    Iris-virginica        3         6.3     3.3     6       2.5
102    Iris-virginica        3         5.8     2.7     5.1     1.9
103    Iris-virginica        3         7.1     3       5.9     2.1
104    Iris-virginica        3         6.3     2.9     5.6     1.8
105    Iris-virginica        3         6.5     3       5.8     2.2
106    Iris-virginica        3         7.6     3       6.6     2.1
107    Iris-virginica        3         4.9     2.5     4.5     1.7
108    Iris-virginica        3         7.3     2.9     6.3     1.8
109    Iris-virginica        3         6.7     2.5     5.8     1.8
110    Iris-virginica        3         7.2     3.6     6.1     2.5
The results from applying the discriminant analysis procedure of XLMiner are shown in Output 2:
Output 2

Classification Functions
Variables    Function 1      Function 2      Function 3
Constant     -86.3084793     -72.8526154     -104.368332
SEPLEN        23.5441742      15.6982136       12.4458504
SEPW          23.5878677       7.07251072       3.68528175
PETLEN       -16.4306431       5.21144867      12.7665491
PETW         -17.398407        6.43422985      21.0791111

Canonical Variates
Variables    Variate 1        Variate 2
SEPLEN        0.06840593       0.00198865
SEPW          0.12656119       0.17852645
PETLEN       -0.18155289      -0.0768638
PETW         -0.23180288       0.23417209

[Confusion matrix: 50 of the 50 class 1 flowers, 48 of the 50 class 2 flowers and 49 of the 50 class 3 flowers are correctly classified.]

Error Report
Class      # Cases   # Errors   % Error
1            50          0        0.00
2            50          2        4.00
3            50          1        2.00
Overall     150          3        2.00
For illustration the computations of the classification function values for observations 40
to 55 and 125 to 135 are shown in Table 5.
[Table 5: The values of the three classification functions, the maximum value, the predicted class, and the input measurements for observations 40 to 55 and 125 to 135.]
[Table: The first canonical score for each of the 24 observations in the riding-mower example (12 owners followed by 12 non-owners), with the misclassified observations 1, 13 and 17 marked.]
In the case of the iris data we would condense the separability information into two dimensions. If we had c classes and p variables, and Min(c − 1, p) > 2, we can only plot the first two canonical values for each observation. In such datasets we can still often get insight into the separation of the observations by plotting the observations in these two coordinates.
LOGISTIC REGRESSION
Nitin R Patel
Logistic regression extends the ideas of multiple linear regression to the situation where the dependent variable, y, is binary (for convenience we often code these values as 0 and 1). As with multiple linear regression, the independent variables x1, x2, ..., xk may be categorical or continuous variables or a mixture of these two types.
Let us take some examples to illustrate [1]:
Example 1: Market Research
The data in Table 1 were obtained in a survey conducted by AT & T in the US
from a national sample of co-operating households. Interest was centered on the adoption
of a new telecommunications service as it related to education, residential stability and
income.
Table 1: Adoption of New Telephone Service
[The table gives the fraction of adopters for each combination of education, residential stability and income (Low Income versus High Income); for each cell the numerator is the number of adopters out of the number of households in the denominator, e.g. 147/1363 = 0.108, 287/1925 = 0.149 and 382/1415 = 0.270.]
Note that the overall probability of adoption in the sample is 1628/10524 = 0.155.
However, the adoption probability varies depending on the categorical independent variables education, residential stability and income. The lowest value is 0.069 for low- income
no-residence-change households with some college education while the highest is 0.270 for
1
The random utility model considers the utility of a choice to incorporate a random element in addition to the attributes of the choice. When we model the random element as coming from a reasonable distribution, we can logically derive the logistic model for predicting choice behavior.
If we let y = 1 represent choosing an option versus y = 0 for not choosing it, the logistic regression model stipulates:

$$\text{Probability}(Y = 1 \mid x_1, x_2, \ldots, x_k) = \frac{\exp(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}{1 + \exp(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}.$$

For the adoption example with three binary independent variables this becomes

$$\text{Prob}(Y = 1 \mid x_1, x_2, x_3) = \frac{\exp(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3)}{1 + \exp(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3)}.$$
The odds of adopting in the base case (x1 = x2 = x3 = 0) are

$$\frac{\text{Prob}(Y = 1 \mid x_1 = x_2 = x_3 = 0)}{\text{Prob}(Y = 0 \mid x_1 = x_2 = x_3 = 0)} = \exp(\beta_0),$$

and exp(β1) is the multiplicative factor applied to the odds when x1 = 1. In general the model is multiplicative in the odds:

Odds = (Odds for base case) × (Factor due to x1) × (Factor due to x2) × (Factor due to x3),

where the factor due to xi is exp(βi xi).
If x1 = 1 the odds of adoption get multiplied by the same factor for any given level of x2 and x3. Similarly the multiplicative factors for x2 and x3 do not vary with the levels of the remaining factors. The factor for a variable gives us the impact of the presence of that factor on the odds of adopting.
If βi = 0, the presence of the corresponding factor has no effect (multiplication by one). If βi < 0, presence of the factor reduces the odds (and the probability) of adoption, whereas if βi > 0, presence of the factor increases the probability of adoption.
The computations required to produce these maximum likelihood estimates require iterations using a computer program; the output of such a program gives the estimated coefficients and hence the fitted probability Prob(Y = 1 | x1, x2, x3). The estimated number of adopters from this model will be the total number of households with values x1, x2 and x3 for the independent variables multiplied by this probability.
The table below shows the estimated number of adopters for the various combinations
of the independent variables.
x1   x2   x3   # in sample   # adopters   Estimated # adopters   Fraction adopters   Estimated Prob(Y = 1 | x1, x2, x3)
0    0    0       2160          153              164                  0.071                0.076
0    0    1       1363          147              155                  0.108                0.113
0    1    0       1137          226              206                  0.199                0.181
0    1    1        547          139              140                  0.254                0.257
1    0    0        886           61               78                  0.069                0.088
1    1    0       1091          233              225                  0.214                0.206
1    0    1       1925          287              252                  0.149                0.131
1    1    1       1415          382              408                  0.270                0.289
In data mining applications we will have validation data, a hold-out sample not used in fitting the model.
Let us suppose we have the following validation data consisting of 598 households:
x1   x2   x3   # in validation   # adopters in       Estimated     Error                Absolute Value
               sample            validation sample   # adopters    (Estimate - Actual)  of Error
0    0    0        29                  3                2.200         -0.800                0.800
0    0    1        23                  7                2.610         -4.390                4.390
0    1    0       112                 25               20.302         -4.698                4.698
0    1    1       143                 27               36.705          9.705                9.705
1    0    0        27                  2                2.374          0.374                0.374
1    1    0        54                 12               11.145         -0.855                0.855
1    0    1       125                 13               16.338          3.338                3.338
1    1    1        85                 30               24.528         -5.472                5.472
Totals            598                119              116.202
The total error is -2.8 adopters, or a percentage error in estimating adopters of -2.8/119 = -2.3%.
The average percentage absolute error is
(0.800 + 4.390 + 4.698 + 9.705 + 0.374 + 0.855 + 3.338 + 5.472)/119 = 0.249 = 24.9%.
The confusion matrix for the households in the validation data set is given below:

                         Observed
Predicted        Adopters   Non-adopters   Total
Adopters            103          13         116
Non-adopters         16         466         482
Total               119         479         598
As with multiple linear regression we can build more complex models that reflect interactions between independent variables by including factors that are calculated from the interacting factors. For example, if we felt that there is an interactive effect between x1 and x2 we would add an interaction term x4 = x1 × x2.
Let us construct a simple logistic regression model for classification of banks using the Total Loans & Leases to Total Assets ratio as the independent variable, x1, in our model. This model would have the following variables:

Dependent variable: Y = 1 if financially distressed, and Y = 0 otherwise.
The equation relating the dependent variable to the explanatory variable is:

$$\text{Prob}(Y = 1 \mid x_1) = \frac{\exp(\beta_0 + \beta_1 x_1)}{1 + \exp(\beta_0 + \beta_1 x_1)}$$

or, equivalently,

$$\text{Odds}(Y = 1 \text{ versus } Y = 0) = \exp(\beta_0 + \beta_1 x_1).$$

The maximum likelihood estimates of the coefficients for the model are β0 = -6.926 and β1 = 10.989, so that the fitted model is:

$$\text{Prob}(Y = 1 \mid x_1) = \frac{\exp(-6.926 + 10.989\, x_1)}{1 + \exp(-6.926 + 10.989\, x_1)}.$$
Figure 1 displays the data points and the fitted logistic regression model.
We can think of the model as a multiplicative model of odds ratios, as we did for Example 1. The odds that a bank with a Loans & Leases/Assets ratio of zero will be in financial distress are exp(-6.926) = 0.001. These are the base case odds. The odds of distress for a bank with a ratio of 0.6 increase by a multiplicative factor of exp(10.989 × 0.6) = 730 over the base case, so the odds that such a bank will be in financial distress are 0.001 × 730 = 0.730.
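A small sketch of evaluating the fitted model and its odds interpretation for a bank with a Loans & Leases/Assets ratio of 0.6:

    import math

    b0, b1 = -6.926, 10.989            # maximum likelihood estimates from the text

    def prob_distress(x1):
        """Fitted logistic model: Prob(Y = 1 | x1)."""
        z = b0 + b1 * x1
        return math.exp(z) / (1 + math.exp(z))

    base_odds = math.exp(b0)           # odds at x1 = 0, about 0.001
    factor = math.exp(b1 * 0.6)        # multiplicative factor for a ratio of 0.6, about 730
    odds = base_odds * factor          # about 0.73
    print(round(prob_distress(0.6), 3), round(odds, 3), round(odds / (1 + odds), 3))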
Notice that there is a small difference in the interpretation of the multiplicative factors for this example compared to Example 1. While the interpretation of the sign of βi remains as before, its magnitude gives the amount by which the odds of Y = 1 against Y = 0 are changed for a unit change in xi. If we construct a simple logistic regression model for classification of banks using the Total Expenses/Total Assets ratio as the independent variable, x2, we would have the following variables:

Dependent variable: Y = 1 if financially distressed, and Y = 0 otherwise.
The equation relating the dependent variable to the explanatory variable is:

$$\text{Prob}(Y = 1 \mid x_2) = \frac{\exp(\beta_0 + \beta_2 x_2)}{1 + \exp(\beta_0 + \beta_2 x_2)}$$

or, equivalently,

$$\text{Odds}(Y = 1 \text{ versus } Y = 0) = \exp(\beta_0 + \beta_2 x_2).$$

The maximum likelihood estimates of the coefficients for the model are β0 = -9.587 and β2 = 94.345.
Figure 2 displays the data points and the fitted logistic regression model.
Computation of Estimates
As illustrated in Examples 1 and 2, estimation of coefficients is usually carried out based on the principle of maximum likelihood, which ensures good asymptotic (large sample) properties for the estimates. Under very general conditions maximum likelihood estimators are:
Consistent: the probability of the estimator differing from the true value approaches zero with increasing sample size;
Asymptotically Efficient: the variance is the smallest possible among consistent estimators;
Asymptotically Normally Distributed: this allows us to compute confidence intervals and perform statistical tests in a manner analogous to the analysis of linear multiple regression models, provided the sample size is large.
Algorithms to compute the coefficient estimates and confidence intervals are iterative and less robust than algorithms for linear regression. Computed estimates are generally reliable for well-behaved datasets where the number of observations with dependent variable values of both 0 and 1 is large; their ratio is not too close to either zero or one; and when the number of coefficients in the logistic regression model is small relative to the sample size (say, no more than 10%). As with linear regression, collinearity (strong correlation amongst the independent variables) can lead to computational difficulties. Computationally intensive algorithms have been developed recently that circumvent some of these difficulties [3].
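In practice these maximum likelihood computations are delegated to statistical software. As one hedged illustration (not the add-in used in the examples above; the file and column names are assumptions), the bank-distress model could be fit with the statsmodels package:

# Minimal sketch: fit Prob(Y=1|x1) = exp(b0 + b1*x1) / (1 + exp(b0 + b1*x1))
# by maximum likelihood from a hypothetical file of bank data.
import pandas as pd
import statsmodels.api as sm

banks = pd.read_csv("banks.csv")                 # assumed columns: distressed, loans_to_assets
y = banks["distressed"]                          # 1 = financially distressed, 0 = otherwise
X = sm.add_constant(banks[["loans_to_assets"]])  # add the constant term
fit = sm.Logit(y, X).fit()
print(fit.params)                                # maximum likelihood estimates b0, b1
print(fit.conf_int())                            # asymptotic 95% confidence intervals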
Appendix A
Computing Maximum Likelihood Estimates and Confidence Intervals for Regression Coefficients
We denote the coefficients by the p × 1 column vector β with the row element i equal to βi. The n observed values of the dependent variable will be denoted by the n × 1 column vector y with the row element j equal to yj; and the corresponding values of the independent variable i by xij for i = 1, ..., p; j = 1, ..., n.
Data: yj, x1j, x2j, ..., xpj,   j = 1, 2, ..., n.
Likelihood Function: The likelihood function, L, is the probability of the observed data viewed as a function of the parameters:

L = Π(j=1..n) exp(yj Σi βi xij) / [1 + exp(Σi βi xij)]
  = exp(Σi βi (Σj yj xij)) / Π(j=1..n) [1 + exp(Σi βi xij)]
  = exp(Σi βi ti) / Π(j=1..n) [1 + exp(Σi βi xij)],

where ti = Σj yj xij. These are the sufficient statistics for a logistic regression model, analogous to ȳ and S in linear regression.
Loglikelihood Function: This is the logarithm of the likelihood function,

l = Σi βi ti − Σj log[1 + exp(Σi βi xij)].
Differentiating with respect to βi gives

∂l/∂βi = ti − Σj xij exp(Σi βi xij) / [1 + exp(Σi βi xij)] = ti − Σj xij πj.

Setting these partial derivatives to zero gives the maximum likelihood equations

Σj xij πj = ti,   i = 1, 2, ..., p,

where πj = exp(Σi βi xij) / [1 + exp(Σi βi xij)] = E(Yj).
The Newton-Raphson method involves computing the following successive approximations to find β̂, the maximizer of the likelihood function:

β(t+1) = β(t) + [I(β(t))]^(-1) ∇l(β(t)),

where I(β) is the information matrix with elements Iij = −∂²l/∂βi∂βj and ∇l is the gradient of the loglikelihood. On convergence, the diagonal elements of [I(β̂)]^(-1) give squared standard errors (approximate variances) for the β̂i. Confidence intervals and hypothesis tests are based on the asymptotic normal distribution of the β̂i.
The loglikelihood function is always negative and does not have a maximum when it can be made arbitrarily close to zero. In that case the likelihood function can be made arbitrarily close to one and the first term of the loglikelihood function given above approaches infinity. In this situation the predicted probabilities for observations with yj = 0 can be made arbitrarily close to 0 and those for yj = 1 can be made arbitrarily close to 1 by choosing suitably large absolute values of some βi. This is the situation when we have a perfect model (at least in terms of the training data set)! This phenomenon is more likely to occur when the number of parameters is a large fraction (say > 20%) of the number of observations.
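To make the iteration above concrete, here is a small NumPy sketch of the Newton-Raphson updates for the logistic regression loglikelihood. It illustrates the formulas of this appendix under simplifying assumptions (no convergence test, well-behaved data); it is not the algorithm of any particular package.

import numpy as np

def logistic_mle(X, y, iterations=20):
    # X: n x p matrix of independent variables (include a column of 1s for the constant)
    # y: n-vector of 0/1 values
    beta = np.zeros(X.shape[1])
    for _ in range(iterations):
        pi = 1.0 / (1.0 + np.exp(-X @ beta))   # pi_j = E(Y_j)
        gradient = X.T @ (y - pi)              # t_i - sum_j x_ij * pi_j
        W = pi * (1 - pi)
        info = X.T @ (X * W[:, None])          # information matrix I(beta)
        beta = beta + np.linalg.solve(info, gradient)
    std_err = np.sqrt(np.diag(np.linalg.inv(info)))   # approximate standard errors
    return beta, std_err

# Toy usage with made-up data:
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
y = (rng.random(200) < 1 / (1 + np.exp(-(0.5 + 2 * X[:, 1])))).astype(float)
print(logistic_mle(X, y))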
Appendix B
The Newton-Raphson Method
This method finds the values of βi that maximize a twice differentiable concave function, g(β). If the function is not concave, it finds a local maximum. The method uses successive quadratic approximations to g based on Taylor series. It converges rapidly if the starting value, β(0), is reasonably close to the maximizing value, β̂, of β.
The gradient vector ∇g and the Hessian matrix, H, as defined below, are used to update an estimate β(t) to β(t+1):

∇g(β(t)) = the column vector whose i-th element is ∂g/∂βi, evaluated at β(t);
H(β(t)) = the matrix whose (i, k) element is ∂²g/∂βi∂βk, evaluated at β(t);
β(t+1) = β(t) − [H(β(t))]^(-1) ∇g(β(t)).

Near the maximum the rate of convergence is quadratic, as it can be shown that |βi(t+1) − β̂i| ≤ c |βi(t) − β̂i|² for some c ≥ 0 when βi(t) is near β̂i for all i.
Objectives
Illustrate importance of interpretation, domain
insights from managers for interpretation and
implementation
Relevance to situations where too many products
(or services) but can define more stable underlying
characteristics of products (or services)
Logistic Regression as a tool that parallels
multiple linear regression in practice. Powerful
analysis in a spreadsheet
Weaknesses:
Competing with mills difficult
Large inventories, high discount sales
Study Question
Improve feedback of market to designs
through improved product codes
Assess economic impact of proposed code
Pilot restricted to saris
Most difficult
Most valuable
Sari components
Body
Border
Pallav
Sari Code
Body: Warp Color & Shade (WRPC, WRPS)
      Weft Color & Shade (WFTC, WFTS)
      Body Design (BODD)
Border: Color, Shade, Design, Size (BRDC, BRDS, BRDD, BRDZ)
Pallav: Color, Shade, Design, Size (PLVC, PLVS, PLVD, PLVZ)
Code Levels
Color (Warp, weft, border, pallav)
10 levels: 0=red, 1=blue, 2=green, etc.
Assessing Impact
Major Marketing Experiment
14 day high season period selected
18 largest retail shops selected
20,000 saris coded, sales during period
recorded
Logistic Regression models developed for
Pr(sale of sari during period) as function of
coded values.
Coefficient Estimates

Variable    Coeff     Odds
Constant    -0.698
WrpCI_1      0.195    1.215
WrpCI_2     -2.220    0.109
WrpCI_3     -2.424    0.089
WrpCI_4     -0.072    0.931
PlvDs_1      1.866    6.462
BrdSz_1     -0.778    0.459
BrdSz_2     -0.384    0.681
Confusion Table
(Cut-off probability = 0.5)

                        Actual
Predicted       Sale    No Sale    Total
Sale             15        5         20
No Sale           5       32         37
Total            20       37         57
Impact
Producing only saris that have predicted probability > 0.5 will reduce slow-moving stock substantially. In the example, slow-moving stock will go down from 65% of production to 25% of production
Even cut-off probability of 0.2 reduces slow
stock to 49% of production
Insights
Certain colors and combinations sold much worse
than average but were routinely produced (e.g.
green, border widths-body color interaction)
Converse of above (e.g. plain designs, light shade
body)
Above adjustments possible within weavers' skill and equipment constraints
Huge potential for cost savings in silk saris
Need for streamlining code, training to code.
Lecture 6
In this note we provide an overview of the key concepts that have led to the emergence of Artificial Neural Networks as a major paradigm for Data Mining applications. Neural nets have gone through two major development periods: the early 60s and the mid 80s. They were a key development in the field of machine learning. Artificial Neural Networks were inspired by biological findings relating to the behavior of the brain as a network of units called neurons. The human brain is estimated to have around 10 billion neurons, each connected on average to 10,000 other neurons. Each neuron receives signals through synapses that control the effects of the signal on the neuron. These synaptic connections are believed to play a key role in the behavior of the brain. The fundamental building block in an Artificial Neural Network is the mathematical model of a neuron as shown in Figure 1. The three basic components of the (artificial) neuron are:
1. The synapses or connecting links that provide weights, wj, to the input values, xj, for j = 1, ..., m;
2. An adder that sums the weighted input values to compute the input to the activation function, v = w0 + Σ(j=1..m) wj xj (equivalently, v = Σ(j=0..m) wj xj if the bias w0 is treated as the weight of an extra input x0 that is always equal to 1);
3. An activation function g that maps v to the output value of the neuron.
Figure 1
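A single artificial neuron of this kind takes only a few lines of code. The sketch below is purely illustrative (the weights are made up); it computes the weighted sum v and passes it through a logistic activation.

import math

def neuron_output(x, w, w0):
    # x: list of m input values; w: list of m weights; w0: bias weight
    v = w0 + sum(wj * xj for wj, xj in zip(w, x))   # v = w0 + sum_j wj * xj
    return 1.0 / (1.0 + math.exp(-v))               # logistic (sigmoid) activation g(v)

# Toy usage with made-up inputs and weights:
print(neuron_output([0.5, -1.2, 3.0], [0.1, 0.4, -0.2], w0=0.05))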
While there are numerous different (artificial) neural network architectures that have been studied by researchers, the most successful applications in data mining of neural networks have been multilayer feedforward networks. These are networks in which there is an input layer consisting of nodes that simply accept the input values, and successive layers of nodes that are neurons as depicted in Figure 1. The outputs of neurons in a layer are inputs to neurons in the next layer. The last layer is called the output layer. Layers between the input and output layers are known as hidden layers. Figure 2 is a diagram for this architecture.
Figure 2
For classification problems the output layer typically has one node per class, and the node with the largest output value gives the network's estimate of the class for a given input. In the special case of two classes it is common to have just one node in the output layer, the classification between the two classes being made by applying a cut-off to the output value at the node.
1.1
Let us begin by examining neural networks with just one layer of neurons (output layer only, no hidden layers). The simplest network consists of just one neuron with the function g chosen to be the identity function, g(v) = v for all v. In this case notice that the output of the network is Σ(j=0..m) wj xj, a linear function of the inputs.
1.2
Multilayer neural networks are undoubtedly the most popular networks used in applications. While it is possible to consider many activation functions, in practice it has been found that the logistic (also called the sigmoid) function g(v) = exp(v) / (1 + exp(v)) as the activation function (or minor variants such as the tanh function) works best. In fact the revival of interest in neural nets was sparked by successes in training neural networks using this function in place of the historically (biologically inspired) step function (the perceptron). Notice that using a linear function does not achieve anything in multilayer networks that is beyond what can be done with single layer networks with linear activation functions. The practical value of the logistic function arises from the fact that it is almost linear in the range where g is between 0.1 and 0.9 but has a squashing effect on very small or very large values of v.
In theory it is sufficient to consider networks with two layers of neurons (one hidden and one output layer) and this is certainly the case for most applications. There are, however, a number of situations where three and sometimes four and five layers have been more effective. For prediction the output node is often given a linear activation function to provide forecasts that are not limited to the zero to one range. An alternative is to scale the output to the linear part (0.1 to 0.9) of the logistic function.
Unfortunately there is no clear theory to guide us on choosing the number of nodes in each hidden layer or indeed the number of layers. The common practice is to use trial and error, although there are schemes for combining
1.3
Let us look at the Iris data that Fisher analyzed using Discriminant Analysis. Recall that the data consisted of four measurements on three types of iris flowers. There are 50 observations for each class of iris. A part of the data is reproduced below.
OBS#   SPECIES           CLASSCODE   SEPLEN   SEPW   PETLEN   PETW
1      Iris-setosa           1         5.1     3.5     1.4     0.2
2      Iris-setosa           1         4.9     3.0     1.4     0.2
3      Iris-setosa           1         4.7     3.2     1.3     0.2
4      Iris-setosa           1         4.6     3.1     1.5     0.2
5      Iris-setosa           1         5.0     3.6     1.4     0.2
6      Iris-setosa           1         5.4     3.9     1.7     0.4
7      Iris-setosa           1         4.6     3.4     1.4     0.3
8      Iris-setosa           1         5.0     3.4     1.5     0.2
9      Iris-setosa           1         4.4     2.9     1.4     0.2
10     Iris-setosa           1         4.9     3.1     1.5     0.1
...
51     Iris-versicolor       2         7.0     3.2     4.7     1.4
52     Iris-versicolor       2         6.4     3.2     4.5     1.5
53     Iris-versicolor       2         6.9     3.1     4.9     1.5
54     Iris-versicolor       2         5.5     2.3     4.0     1.3
55     Iris-versicolor       2         6.5     2.8     4.6     1.5
56     Iris-versicolor       2         5.7     2.8     4.5     1.3
57     Iris-versicolor       2         6.3     3.3     4.7     1.6
58     Iris-versicolor       2         4.9     2.4     3.3     1.0
59     Iris-versicolor       2         6.6     2.9     4.6     1.3
60     Iris-versicolor       2         5.2     2.7     3.9     1.4
...
101    Iris-virginica        3         6.3     3.3     6.0     2.5
102    Iris-virginica        3         5.8     2.7     5.1     1.9
103    Iris-virginica        3         7.1     3.0     5.9     2.1
104    Iris-virginica        3         6.3     2.9     5.6     1.8
105    Iris-virginica        3         6.5     3.0     5.8     2.2
106    Iris-virginica        3         7.6     3.0     6.6     2.1
107    Iris-virginica        3         4.9     2.5     4.5     1.7
108    Iris-virginica        3         7.3     2.9     6.3     1.8
109    Iris-virginica        3         6.7     2.5     5.8     1.8
110    Iris-virginica        3         7.2     3.6     6.1     2.5
The network used has 4 input nodes (one for each measurement), a single hidden layer of 25 neurons, and 3 output nodes (one for each class), with connections from every input node to every node in the hidden layer. In addition there will be a total of 3 connections from each node in the hidden layer to nodes in the output layer. This makes a total of 25 x 3 = 75 connections between the hidden layer and the output layer. Using the standard logistic activation functions, the network was trained with a run consisting of 60,000 iterations. Each iteration consists of presentation to the input layer of the independent variables in a case, followed by successive computations of the outputs of the neurons of the hidden layer and the output layer using the appropriate weights. The output values of neurons in the output layer are used to compute the error. This error is used to adjust the weights of all the connections in the network using backward propagation (backprop) to complete the iteration. Since the training data has 150 cases, each case was presented to the network 400 times. Another way of stating this is to say the network was trained for 400 epochs, where an epoch consists of one sweep through the entire training data. The results for the last epoch of training the neural net on this data are shown below:
Iris Output 1
Classification Confusion Matrix

Desired      Computed Class
Class          1      2      3    Total
1             50      0      0      50
2              0     49      1      50
3              0      1     49      50
Total         50     50     50     150

Error Report

Class      Patterns   # Errors   % Errors   StdDev
1              50         0        0.00     (0.00)
2              50         1        2.00     (1.98)
3              50         1        2.00     (1.98)
Overall       150         2        1.33     (0.92)
The classification error of 1.3% is better than the error using discriminant analysis, which was 2% (see the lecture note on Discriminant Analysis). Notice that had we stopped after only one pass of the data (150 iterations) the results would have been much poorer.

              Computed Class
Desired         1      2      3    Total
1              10      7      2      19
2              13      1      6      20
3              12      5      4      21
Total          35     13     12      60
1.4
Forward pass: computation of outputs. Each neuron in the first hidden layer computes its output from the input values using the relevant sum and activation function evaluations. These outputs are the inputs for neurons in the second hidden layer. Again the relevant sum and activation function calculations are performed to compute the outputs of second layer neurons. This continues layer by layer until we reach the output layer and compute the outputs for this layer. These output values constitute the neural net's guess at the value of the dependent variable. If we are using the neural net for classification, and we have c classes, we will have c neuron outputs from the activation functions and we use the largest value to determine the net's classification. (If c = 2, we can use just one output node with a cut-off value to map a numerical output value to one of the two classes.)
Let us denote by wij the weight of the connection from node i to node j. The values of wij are initialized to small (generally random) numbers in the range 0.00 ± 0.05. These weights are adjusted to new values in the backward pass as described below.
Backward pass: Propagation of error and adjustment of weights
This phase begins with the computation of error at each neuron in the output layer. A popular error function is the squared difference between ok, the output of node k, and yk, the target value for that node. The target value is just 1 for the output node corresponding to the class of the exemplar and zero for other output nodes. (In practice it has been found better to use values of 0.9 and 0.1 respectively.) For each output layer node compute its error term as δk = ok(1 − ok)(yk − ok). These errors are used to adjust the weights of the connections between the last-but-one layer of the network and the output layer. The adjustment is similar to the simple Widrow-Hoff rule that we saw earlier in this note. The new value of the weight wjk of the connection from node j to node k is given by: wjk(new) = wjk(old) + η oj δk. Here η is an important tuning parameter that is chosen by trial and error by repeated runs on the training data. Typical values for η are in the range 0.1 to 0.9. Low values give slow but steady learning, high values give erratic learning and may lead to an unstable network.
The process is repeated for the connections between nodes in the last hidden layer and the last-but-one hidden layer. The weight for the connection between nodes i and j is given by: wij(new) = wij(old) + η oi δj, where δj = oj(1 − oj) Σk wjk δk for each node j in the last hidden layer.
The backward propagation of weight adjustments along these lines continues until we reach the input layer. At this time we have a new set of weights on which we can make a new forward pass when presented with a training data observation.
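The update rules above can be written out directly. The sketch below applies one forward pass and one backward pass for a network with a single hidden layer, using the delta formulas of this section; the network sizes, the learning rate eta, and the toy inputs are illustrative assumptions.

import numpy as np

def backprop_step(x, target, W_hidden, W_output, eta=0.5):
    # Forward pass with logistic activations; bias handled by an extra input fixed at 1.
    x1 = np.append(x, 1.0)
    hidden = 1 / (1 + np.exp(-(W_hidden @ x1)))      # outputs o_j of hidden nodes
    h1 = np.append(hidden, 1.0)
    out = 1 / (1 + np.exp(-(W_output @ h1)))         # outputs o_k of output nodes

    # Backward pass: delta_k = o_k (1 - o_k)(y_k - o_k) for output nodes,
    # delta_j = o_j (1 - o_j) sum_k w_jk delta_k for hidden nodes.
    delta_out = out * (1 - out) * (target - out)
    delta_hidden = hidden * (1 - hidden) * (W_output[:, :-1].T @ delta_out)

    # w_new = w_old + eta * o * delta for every connection.
    W_output = W_output + eta * np.outer(delta_out, h1)
    W_hidden = W_hidden + eta * np.outer(delta_hidden, x1)
    return W_hidden, W_output

# Toy usage: 4 inputs, 25 hidden nodes, 3 output classes (the sizes used in the Iris example).
rng = np.random.default_rng(1)
Wh, Wo = rng.uniform(0, 0.05, (25, 5)), rng.uniform(0, 0.05, (3, 26))
Wh, Wo = backprop_step(np.array([5.1, 3.5, 1.4, 0.2]), np.array([0.9, 0.1, 0.1]), Wh, Wo)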
1.4.1
1.5
One of the time-consuming and complex aspects of using backprop is that we need to decide on an architecture before we can use backprop. The usual procedure is to make intelligent guesses using past experience and to do several trial and error runs on different architectures. Algorithms exist that grow the number of nodes selectively during training or trim them in a manner analogous to what we have seen with CART. Research continues on such methods. However, as of now there seems to be no automatic method that is clearly superior to the trial and error approach.
1.6
Successful Applications
References
1. Bishop, Christopher: Neural Networks for Pattern Recognition, Oxford, 1995.
2. Trippi, Robert and Turban, Efraim (editors): Neural Networks in Finance and Investing, McGraw Hill 1996.
Outline
Simple Linear Regression
Multiple Regression
Understanding the Regression Output
Coefficient of Determination R2
Validating the Regression Model
Appleglo

Region           First-Year Advertising      First-Year Sales
                 Expenditures ($ millions)   ($ millions)
                            x                      y
Maine                      1.8                    104
New Hampshire              1.2                     68
Vermont                    0.4                     39
Massachusetts              0.5                     43
Connecticut                2.5                    127
Rhode Island               2.5                    134
New York                   1.5                     87
New Jersey                 1.2                     77
Pennsylvania               1.6                    102
Delaware                   1.0                     65
Maryland                   1.5                    101
West Virginia              0.7                     46
Virginia                   1.0                     52
Ohio                       0.8                     33
Linear Regression: An Example
[Scatter plot of first-year sales (y) versus first-year advertising expenditures (x)]
The Basic Model: Simple Linear Regression
Data: (x1, y1), (x2, y2), . . . , (xn, yn)
Model of the population: Yi = β0 + β1 xi + εi
ε1, ε2, . . . , εn are i.i.d. random variables, ~ N(0, σ)
This is the true relation between Y and x, but we do not know β0 and β1 and have to estimate them based on the data.
Comments:
E(Yi | xi) = β0 + β1 xi
SD(Yi | xi) = σ
Relationship is linear, described by a line
β0 = baseline value of Y (i.e., value of Y if x is 0)
β1 = slope of line (average change in Y per unit change in x)
[Scatter plot with fitted regression line: best choices b0 = 13.82, b1 = 48.60; the residual ei is the vertical distance between the observed point (xi, yi) and the fitted point (xi, ŷi)]
Regression coefficients: b0 and b1 are estimates of β0 and β1
Regression estimate for Y at xi: ŷi = b0 + b1 xi (prediction)
Residual (error): ei = yi − ŷi
The best regression line is the one that chooses b0 and b1 to minimize the total error (residual sum of squares):
SSR = Σ(i=1..n) ei² = Σ(i=1..n) (yi − b0 − b1 xi)²
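These least-squares choices are easy to verify numerically. A small sketch (NumPy rather than the spreadsheet used in the slides) fits b0 and b1 to the Appleglo data above:

import numpy as np

x = np.array([1.8, 1.2, 0.4, 0.5, 2.5, 2.5, 1.5, 1.2, 1.6, 1.0, 1.5, 0.7, 1.0, 0.8])
y = np.array([104, 68, 39, 43, 127, 134, 87, 77, 102, 65, 101, 46, 52, 33], dtype=float)

b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean())**2).sum()
b0 = y.mean() - b1 * x.mean()
residuals = y - (b0 + b1 * x)
print(b0, b1, (residuals**2).sum())   # b0 about 13.82, b1 about 48.60, and the SSR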
Example: Sales of Nature-Bar ($ million)
(data table by region)
Multiple Regression
In general, there are many factors in addition to advertising expenditures that affect sales.
Multiple regression allows more than one x variable.
Independent variables: x1, x2, . . . , xk (k of them)
Data: (x1i, x2i, . . . , xki, yi), i = 1, . . . , n
Population Model: Yi = β0 + β1 x1i + . . . + βk xki + εi
Regression estimate of Yi: ŷi = b0 + b1 x1i + . . . + bk xki
SSR = Σ(i=1..n) ei² = Σ(i=1..n) (yi − ŷi)²
Regression Output (from Excel)

Regression Statistics
Multiple R            0.913
R Square              0.833
Adjusted R Square     0.787
Standard Error        17.600
Observations          15

Analysis of Variance
              df    Sum of Squares   Mean Square      F       Significance F
Regression     3       16997.537       5665.85      18.290         0.000
Residual      11        3407.473        309.77
Total         14       20405.009

                     Coefficients   Standard Error   t Statistic   P-value   Lower 95%   Upper 95%
Intercept               65.71            27.73            2.37      0.033       4.67       126.74
Advertising             48.98            10.66            4.60      0.000      25.52        72.44
Promotions              59.65            23.63            2.53      0.024       7.66       111.65
Competitors Sales       -1.84             0.81           -2.26      0.040      -3.63       -0.047
Understanding Regression Output
1) Regression coefficients: b0, b1, . . . , bk are estimates of β0, β1, . . . , βk based on sample data. Fact: E[bj] = βj.
Example:
b0 = 65.705 (its interpretation is context dependent)
b1 = 48.979 (an additional $1 million in advertising is expected to result in an additional $49 million in sales)
b2 = 59.654 (an additional $1 million in promotions is expected to result in an additional $60 million in sales)
b3 = -1.838 (an increase of $1 million in competitor sales is expected to decrease sales by $1.8 million)
Understanding Regression Output, Continued
2) Standard errors: an estimate of σ, the SD of each εi. It is a measure of the amount of noise in the model.
Example: s = 17.60
3) Degrees of freedom: #cases − #parameters; relates to the over-fitting phenomenon.
4) Standard errors of the coefficients: sb0, sb1, . . . , sbk. They are just the standard deviations of the estimates b0, b1, . . . , bk. They are useful in assessing the quality of the coefficient estimates and validating the model.
[Two scatter plots illustrating the extremes of R²]
R² = 1: the x values account for all variation in the Y values.
R² = 0: the x values account for none of the variation in the Y values.
Understanding Regression Output, Continued
5) Coefficient of determination: R²
It is a measure of the overall quality of the regression. Specifically, it is the percentage of total variation exhibited in the yi data that is accounted for by the sample regression line.
The sample mean of Y: ȳ = (y1 + y2 + . . . + yn) / n
Total variation in Y = Σ(i=1..n) (yi − ȳ)²
Residual (unaccounted) variation in Y = Σ(i=1..n) ei² = Σ(i=1..n) (yi − ŷi)²
R² = (variation accounted for by the regression) / (total variation)
   = 1 − (residual variation) / (total variation)
   = 1 − Σ(i=1..n) ei² / Σ(i=1..n) (yi − ȳ)²
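R² can be computed from the residuals exactly as defined here; the short sketch below (illustrative values, not the Excel output) shows the calculation.

import numpy as np

def r_squared(y, y_hat):
    ss_residual = ((y - y_hat)**2).sum()    # unaccounted variation
    ss_total = ((y - y.mean())**2).sum()    # total variation about the mean
    return 1 - ss_residual / ss_total

# Example with made-up fitted values:
y = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.8, 5.2, 7.1, 8.9])
print(r_squared(y, y_hat))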
Validating the Regression Model
Assumptions about the population:
Yi = β0 + β1 x1i + . . . + βk xki + εi   (i = 1, . . . , n)
ε1, ε2, . . . , εn are iid random variables, ~ N(0, σ)
1) Linearity
If k = 1 (simple regression), one can check visually from the scatter plot.
Sanity check: the sign of the coefficients, reason for non-linearity?
2) Normality of εi
Plot a histogram of the residuals (ei = yi − ŷi).
3) Heteroscedasticity
Do error terms have constant Std. Dev.? (i.e., SD(εi) = σ for all i?)
Check scatter plot of residuals vs. Y and x variables.
[Plots of residuals vs. advertising expenditures: left panel, no evidence of heteroscedasticity; right panel, evidence of heteroscedasticity]
Time Plot
[Time plots of residuals: left panel, no evidence of autocorrelation; right panel, evidence of autocorrelation]
Pitfalls and Issues
1) Overspecification
Including too many x variables to make R² fictitiously high.
Validating the Regression Model
3) Multicollinearity
Occurs when two of the x variables are strongly correlated.
Can give very wrong estimates for the βi's.
Tell-tale signs:
- Regression coefficients (bi's) have the wrong sign.
- Addition/deletion of an independent variable results in large changes of regression coefficients.
- Regression coefficients (bi's) not significantly different from 0.
May be fixed by deleting one or more independent variables.
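A quick screen for this is to look at the correlations among the x variables before fitting. The sketch below uses pandas; the file name is an assumption, and the column names simply mirror the example that follows.

import pandas as pd

df = pd.read_csv("mba_admissions.csv")    # assumed file containing the columns below
predictors = df[["College GPA", "GMAT"]]
print(predictors.corr())                  # pairwise correlations near 1 signal multicollinearity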
Example

Student   Graduate   College   GMAT
Number      GPA        GPA
1           4.0        3.9      640
2           4.0        3.9      644
3           3.1        3.1      557
4           3.1        3.2      550
5           3.0        3.0      547
6           3.5        3.5      589
7           3.1        3.0      533
8           3.5        3.5      600
9           3.1        3.2      630
10          3.2        3.2      548
11          3.8        3.7      600
12          4.1        3.9      633
13          2.9        3.0      546
14          3.7        3.7      602
15          3.8        3.8      614
16          3.9        3.9      644
17          3.6        3.7      634
18          3.1        3.0      572
19          3.3        3.2      570
20          4.0        3.9      656
21          3.1        3.1      574
22          3.7        3.7      636
23          3.7        3.7      635
24          3.9        4.0      654
25          3.8        3.8      633
Regression Output

R Square           0.96
Standard Error     0.08
Observations       25
Coefficients: Intercept, College GPA, GMAT

What happened?

Correlations:
            Graduate   College   GMAT
Graduate      1
College       0.98        1
GMAT          0.86        0.90      1

Eliminate GMAT

R Square           0.958
Standard Error     0.08
Observations       25
Coefficients: Intercept, College GPA
Regression Models
Back to Regression Output

Regression Statistics
Multiple R            0.913
R Square              0.833
Adjusted R Square     0.787
Standard Error        17.600
Observations          15

Analysis of Variance
              df    Sum of Squares   Mean Square
Regression     3       16997.537       5665.85
Residual      11        3407.473        309.77
Total         14       20405.009

                  Coefficients   Standard Error   t Statistic   Lower 95%   Upper 95%
Intercept            65.71            27.73            2.37        4.67       126.74
Advertising          48.98            10.66            4.60       25.52        72.44
Promotions           59.65            23.63            2.53        7.66       111.65
Compet. Sales        -1.84             0.81           -2.26       -3.63       -0.047
Regression Output Analysis
1) Degrees of freedom (dof)
Residual dof = n − (k+1)
(bj − βj) / sbj obeys a t-distribution with n − (k+1) degrees of freedom
3) t-Statistic: tj = bj / sbj
A measure of the statistical significance of each individual xj
Example: Executive Compensation

Number   Pay        Years in   Change in         Change in   MBA?
         ($1,000)   position   Stock Price (%)   Sales (%)
1         1,530        7            48               89       YES
2         1,117        6            35               19       YES
3           602        3             9               24       NO
4         1,170        6            37                8       YES
5         1,086        6            34               28       NO
6         2,536        9            81              -16       YES
7           300        2           -17              -17       NO
8           670        2           -15              -67       YES
9           250        0           -52               49       NO
10        2,413       10           109              -27       YES
11        2,707        7            44               26       YES
12          341        1            28               -7       NO
13          734        4            10               -7       NO
14        2,368        8            16               -4       NO
Dummy variables:
Often, some of the explanatory variables in a regression are
categorical rather than numeric.
If we think whether an executive has an MBA or not affects his/her pay, we create a dummy variable and let it be 1 if the executive has an MBA and 0 otherwise.
If we think season of the year is an important factor to determine
sales, how do we create dummy variables? How many?
What is the problem with creating 4 dummy variables?
In general, if there are m categories an x variable can belong to,
then we need to create m-1 dummy variables for it.
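With pandas this can be done in one call; drop_first=True produces the m-1 dummies discussed above. The data frame below is a made-up illustration, not the executive compensation data set.

import pandas as pd

df = pd.DataFrame({"Pay": [1530, 602, 300, 734],
                   "MBA": ["YES", "NO", "NO", "NO"],
                   "Season": ["Winter", "Spring", "Summer", "Fall"]})

# m categories -> m-1 dummy variables (one category becomes the baseline)
dummies = pd.get_dummies(df[["MBA", "Season"]], drop_first=True)
print(pd.concat([df["Pay"], dummies], axis=1))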
OILPLUS data

      Month              heating oil   temperature
1     August, 1989          24.83          73
2     September, 1989       24.69          67
3     October, 1989         19.31          57
4     November, 1989        59.71          43
5     December, 1989        99.67          26
6     January, 1990         49.33          41
7     February, 1990        59.38          38
8     March, 1990           55.17          46
9     April, 1990           55.52          54
10    May, 1990             25.94          60
11    June, 1990            20.69          71
12    July, 1990            24.33          75
13    August, 1990          22.76          74
14    September, 1990       24.69          66
15    October, 1990         22.76          61
16    November, 1990        50.59          49
17    December, 1990        79.00          41
[Scatter plots of oil consumption versus temperature and versus inverse temperature (1/temperature); the accompanying inverse temperature column begins 0.0137, 0.0149, 0.0175, 0.0233, 0.0385, 0.0244, . . .]
The Practice of Regression
The Post-Regression Checklist
1) Statistics checklist:
Calculate the correlation between pairs of x variables: watch for evidence of multicollinearity.
Check signs of coefficients: do they make sense?
Check 95% C.I. (use t-statistics as a quick scan): are coefficients significantly different from zero?
R²: overall quality of the regression, but not the only measure.
2) Residual checklist:
Normality: look at a histogram of the residuals.
Heteroscedasticity: plot residuals with each x variable.
Autocorrelation: if data has a natural order, plot residuals in order and check for a pattern.
The Grand Checklist
Linearity: scatter plot, common sense, and knowing your problem; transform, including interactions, if useful.
Heteroscedasticity: plot residuals with each x variable; transform if necessary (Box-Cox transformations).
Missing Values
Contents
2.1. A Review of Multiple Linear Regression
2.2. Illustration of the Regression Process
2.3. Subset Selection in Linear Regression
2.1 A Review of Multiple Linear Regression
In this section, we review briefly the multiple regression model that you encountered in the DMD course. There is a continuous random variable called the dependent variable, Y, and a number of independent variables, x1, x2, . . . , xp. Our purpose is to predict the value of the dependent variable (also referred to as the response variable) using a linear function of the independent variables. The values of the independent variables (also referred to as predictor variables, regressors or covariates) are known quantities for purposes of prediction. The model is:

Y = β0 + β1 x1 + β2 x2 + . . . + βp xp + ε,        (2.1)
The coefficients are estimated by minimizing the sum of squared differences between the observed values of the dependent variable and the fitted (predicted) values at the observed values in the data. The sum of squared differences is given by

Σ(i=1..n) (yi − β0 − β1 xi1 − β2 xi2 − . . . − βp xip)².

Let us denote the values of the coefficients that minimize this expression by β̂0, β̂1, β̂2, . . . , β̂p. These are our estimates for the unknown values and are called OLS (ordinary least squares) estimates in the literature. Once we have computed the estimates β̂0, β̂1, β̂2, . . . , β̂p we can calculate an unbiased estimate σ̂² for σ² using the formula:

σ̂² = [1 / (n − p − 1)] Σ(i=1..n) (yi − β̂0 − β̂1 xi1 − β̂2 xi2 − . . . − β̂p xip)².
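In matrix form the OLS estimates and the unbiased estimate of σ² take only a few lines. The sketch below is a generic NumPy illustration (not the StatCalc add-in used later in this chapter):

import numpy as np

def ols(X, y):
    # X: n x p matrix of independent variables (without the constant); y: n-vector
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])              # prepend the constant term
    beta_hat, _, _, _ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ beta_hat
    sigma2_hat = (residuals**2).sum() / (n - p - 1)   # unbiased estimate of sigma^2
    return beta_hat, sigma2_hat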
2.2 Illustration of the Regression Process
The minimum value of each variable is 25 and the maximum value is 125. These ratings are answers to survey questions given to a sample of 25 clerks in each of 30 departments. The purpose of the analysis was to explore the feasibility of using a questionnaire for predicting effectiveness of departments, thus saving the considerable effort required to directly measure effectiveness. The variables are answers to questions on the survey and are described below.
Y   Measure of effectiveness of supervisor.
X1  Handles employee complaints.
X2  Does not allow special privileges.
X3  Opportunity to learn new things.
X4  Raises based on performance.
X5  Too critical of poor performance.
X6  Rate of advancing to better jobs.
The multiple linear regression estimates as computed by the StatCalc add-in to Excel are reported in Table 2.2. The equation to predict performance is
Y = 13.182 + 0.583 X1 − 0.044 X2 + 0.329 X3 − 0.057 X4 + 0.112 X5 − 0.197 X6.
In Table 2.3 we use ten more cases as the validation data. Applying the previous equation to the validation data gives the predictions and errors shown in Table 2.3. The last column, entitled error, is simply the difference of the predicted minus the actual rating. For example, for Case 21 the error is equal to 44.46 − 50 = −5.54.
We note that the average error in the predictions is small (−0.52) and so the predictions are unbiased. Further, the errors are roughly Normal, so that this model gives prediction errors that are approximately 95% of the time within ±14.34 (two standard deviations) of the true value.
2.3 Subset Selection in Linear Regression
Training data (Cases 1-20)

Case    Y    X1   X2   X3   X4   X5   X6
1      43    51   30   39   61   92   45
2      63    64   51   54   63   73   47
3      71    70   68   69   76   86   48
4      61    63   45   47   54   84   35
5      81    78   56   66   71   83   47
6      43    55   49   44   54   49   34
7      58    67   42   56   66   68   35
8      71    75   50   55   70   66   41
9      72    82   72   67   71   83   31
10     67    61   45   47   62   80   41
11     64    53   53   58   58   67   34
12     67    60   47   39   59   74   41
13     69    62   57   42   55   63   25
14     68    83   83   45   59   77   35
15     77    77   54   72   79   77   46
16     81    90   50   72   60   54   36
17     74    85   64   69   79   79   63
18     65    60   65   75   55   80   60
19     65    70   46   57   75   85   46
20     50    58   68   54   64   78   52
Table 2.2: Multiple linear regression output

Multiple R-squared    0.656
Residual SS           738.900
Std. Dev. Estimate    7.539

            Coefficient   StdError   t-statistic   p-value
Constant       13.182      16.746       0.787       0.445
X1              0.583       0.232       2.513       0.026
X2             -0.044       0.167      -0.263       0.797
X3              0.329       0.219       1.501       0.157
X4             -0.057       0.317      -0.180       0.860
X5              0.112       0.196       0.570       0.578
X6             -0.197       0.247      -0.798       0.439
Estimates of regression coefficients are likely to be unstable due to multicollinearity in models with many variables. We get better insights into the influence of regressors from models with fewer variables, as the coefficients are more stable for parsimonious models.
It can be shown that using independent variables that are uncorrelated with the dependent variable will increase the variance of predictions.
It can be shown that dropping independent variables that have small (non-zero) coefficients can reduce the average error of predictions.
Let us illustrate the last two points using the simple case of two independent variables. The reasoning remains valid in the general situation of more than two independent variables.
2.3.1
Suppose that the true equation for Y, the dependent variable, is:

Y = β1 X1 + ε        (2.2)
Table 2.3: Validation data (Cases 21-30) and predictions

Case    Y    X1   X2   X3   X4   X5   X6   Prediction   Error
21     50    40   33   34   43   64   33      44.46     -5.54
22     64    61   52   62   66   80   41      63.98     -0.02
23     53    66   52   50   63   80   37      63.91     10.91
24     40    37   42   58   50   57   49      45.87      5.87
25     63    54   42   48   66   75   33      56.75     -6.25
26     66    77   66   63   88   76   72      65.22     -0.78
27     78    75   58   74   80   78   49      73.23     -4.77
28     48    57   44   45   51   83   38      58.19     10.19
29     85    85   71   71   77   74   55      76.05     -8.95
30     82    82   39   59   64   78   39      76.10     -5.90
Averages:                                     62.38     -0.52
Std Devs:                                     11.30      7.17
Now suppose that, instead of (2.2), we fit a model that also includes a second, irrelevant independent variable X2:

Y = β1 X1 + β2 X2 + ε.        (2.3)

The least squares estimates from Model (2.3) then have

Var(β̂1) = σ² / [(1 − R²12) Σ(i=1..n) x²i1],
E(β̂2) = 0,    Var(β̂2) = σ² / [(1 − R²12) Σ(i=1..n) x²i2],

where R12 denotes the correlation between X1 and X2.
The variance is the expected value of the squared error for an unbiased estimator. So we are worse off using the irrelevant estimator in making predictions. Even if X2 happens to be uncorrelated with X1, so that R²12 = 0 and the variance of β̂1 is the same in both models, we can show that the variance of a prediction based on Model (3) will be worse than a prediction based on Model (2) due to the added variability introduced by estimation of β2.
Although our analysis has been based on one useful independent variable and one irrelevant independent variable, the result holds true in general. It is always better to make predictions with models that do not include irrelevant variables.
2.3.2
Suppose that the situation is the reverse of what we have discussed above, namely that Model (3) is the correct equation, but we use Model (2) for our estimates and predictions, ignoring variable X2 in our model. To keep our results simple, let us suppose that we have scaled the values of X1, X2, and Y so that their variances are equal to 1. In this case the least squares estimate β̂1 has the following expected value and variance:

E(β̂1) = β1 + R12 β2,    Var(β̂1) = σ².

Notice that β̂1 is a biased estimator of β1 with bias equal to R12 β2, and its Mean Square Error is given by:

MSE(β̂1) = E[(β̂1 − β1)²]
         = E[{β̂1 − E(β̂1) + E(β̂1) − β1}²]
         = [Bias(β̂1)]² + Var(β̂1)
         = (R12 β2)² + σ².

If we use Model (3) the least squares estimates have the following expected values and variances:

E(β̂1) = β1,    Var(β̂1) = σ² / (1 − R²12),
E(β̂2) = β2,    Var(β̂2) = σ² / (1 − R²12).
For a prediction Ŷ made at a point with independent variable values (u1, u2), Model (3) gives

MSE3(Ŷ) = Var(u1 β̂1 + u2 β̂2) + σ², because now Ŷ is unbiased, and
Var(u1 β̂1 + u2 β̂2) = u1² Var(β̂1) + u2² Var(β̂2) + 2 u1 u2 Covar(β̂1, β̂2).

Model (2) can lead to lower mean squared error for many combinations of values for u1, u2, R12, and (β2/σ)². For example, if u1 = 1, u2 = 0, then MSE2(Ŷ) < MSE3(Ŷ) when

(R12 β2)² + σ² < σ² / (1 − R²12),

i.e., when

|β2| / σ < 1 / sqrt(1 − R²12).
2.3.3
Backward Elimination
1. Start with S equal to the set of all variables and fit the full model.
2. For each variable i in S compute a statistic Fi that compares σ̂²(S − {i}), the estimated error variance of the model without variable i, with σ̂²(S). If Fi < Fout, where Fout is a threshold (typically between 2 and 4), then drop i from S.
3. Repeat 2 until no variable can be dropped.
Backward Elimination has the advantage that all variables are included in S at some stage. This addresses a problem of forward selection, which will never select a variable that is better than a previously selected variable strongly correlated with it. The disadvantage is that the full model with all variables is required at the start, and this can be time-consuming and numerically unstable.
Step-wise Regression
This procedure is like Forward Selection except that at each step we consider dropping variables as in Backward Elimination.
Convergence is guaranteed if the thresholds Fout and Fin satisfy Fout < Fin. It is possible, however, for a variable to enter S and then leave S at a subsequent step, and even rejoin S at a yet later step.
As stated above these methods pick one best subset. There are straightforward variations of the methods that do identify several close-to-best choices for different sizes of independent variable subsets.
None of the above methods guarantees that they yield the best subset for any criterion such as adjusted R² (defined later in this note). They are reasonable methods for situations with large numbers of independent variables, but for moderate numbers of independent variables the method discussed next is preferable.
All Subsets Regression
The idea here is to evaluate all subsets. Efficient implementations use branch and bound algorithms of the type you have seen in DMD for integer programming to avoid explicitly enumerating all subsets. (In fact the subset selection problem can be set up as a quadratic integer program.) We compute a criterion such as R²adj, the adjusted R², for all subsets to choose the best one. (This is only feasible if p is less than about 20.)
2.3.4
The All Subsets Regression (as well as modifications of the heuristic algorithms) will produce a number of subsets. Since the number of subsets for even moderate values of p is very large, we need some way to examine the most promising subsets and to select from them. An intuitive metric to compare subsets is R². However, since R² = 1 − SSR/SST, where SST, the Total Sum of Squares, is the Sum of Squared Residuals for the model with just the constant term, if we use it as a criterion we will always pick the full model with all p variables. One approach is therefore to select the subset with the largest R² for each possible number of coefficients, k, and then compare subsets of different sizes using criteria that penalize subset size. The adjusted R² is

R²adj = 1 − [(n − 1) / (n − k)] (1 − R²),

where k is the number of coefficients (including the constant) in the subset model and n is the number of cases, and Mallows Cp is

Cp = SSR / σ̂²Full + 2k − n,
SST = 2149.000     Fin = 3.840     Fout = 2.710

Size   SSR       RSq     RSq (adj)   Cp
2      874.467   0.593     0.570     -0.615
3      786.601   0.634     0.591     -0.161
4      783.970   0.635     0.567      1.793
5      781.089   0.637     0.540      3.742
6      775.094   0.639     0.511      5.637
7      738.900   0.656     0.497      7.000

Models
1: Constant X1;  Constant X1 X3;  Constant X1 X2 X3;  Constant X1 X2 X3 X4;  Constant X1 X2 X3 X4 X5;  Constant X1 X2 X3 X4 X5 X6
2: X1;  X1 X3;  X1 X3 X6;  X1 X3 X5 X6;  X1 X2 X3 X5 X6;  X1 X2 X3 X4 X5 X6
where σ̂²Full is the estimated value of σ² in the full model that includes all the variables. It is important to remember that the usefulness of this approach depends heavily on the reliability of the estimate of σ² for the full model. This requires that the training set contains a large number of observations relative to the number of variables. We note that for our example only the subsets of size 6 and 7 seem to be unbiased, as for the other models Cp differs substantially from k. This is a consequence of having too few observations to estimate σ² accurately in the full model.
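Both criteria are simple functions of a subset's SSR, so they are easy to tabulate once candidate subsets have been fitted. The sketch below is an illustration only, with k counted as in the table above (i.e., including the constant) and the example's values plugged in.

def adjusted_r_squared(ssr, sst, n, k):
    # k = number of coefficients in the subset model, including the constant
    return 1 - (n - 1) / (n - k) * (ssr / sst)

def mallows_cp(ssr, sigma2_full, n, k):
    return ssr / sigma2_full + 2 * k - n

# Values from the example: n = 20 cases, SST = 2149.0, full-model sigma^2 = 7.539**2
for k, ssr in [(2, 874.467), (3, 786.601), (7, 738.900)]:
    print(k, round(adjusted_r_squared(ssr, 2149.0, 20, k), 3),
          round(mallows_cp(ssr, 7.539**2, 20, k), 3))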
Lecture 1
Table 1

Income ($000s)   Lot Size (000s sq. ft.)
 60.0                18.4
 85.5                16.8
 64.8                21.6
 61.5                20.8
 87.0                23.6
110.1                19.2
108.0                17.6
 82.8                22.4
 69.0                20.0
 93.0                20.8
 51.0                22.0
 81.0                20.0
 75.0                19.6
 52.8                20.8
 64.8                17.2
 43.2                20.4
 84.0                17.6
 49.2                17.6
 59.4                16.0
 66.0                18.4
 47.4                16.4
 33.0                18.8
 51.0                14.0
 63.0                14.8
How do we choose k? In data mining we use the training data to classify the
cases in the validation data to compute error rates for various choices of k. For
our example we have randomly divided the data into a training set with 18 cases
and a validation set of 6 cases. Of course, in a real data mining situation we
would have sets of much larger sizes. The validation set consists of observations
6, 7, 12, 14, 19, 20 of Table 1. The remaining 18 observations constitute the
training data. Figure 1 displays the observations in both training and validation
data sets.
[Figure 1: Training and validation observations (TrnOwn, TrnNonOwn, VldOwn, VldNonOwn) plotted against Income ($000s)]
Notice that if we choose k=1 we will classify in a way that is very
sensitive to the local characteristics of our data. On the other hand if we choose
a large value of k we average over a large number of data points and average
out the variability due to the noise associated with individual data points. If
we choose k=18 we would simply predict the most frequent class in the data
set in all cases. This is a very stable prediction but it completely ignores the
information in the independent variables.
Table 2 shows the misclassification error rate for observations in the validation data for different choices of k.
Table 2

k                            1    3    5    7    9   11   13   18
Misclassification Error %   33   33   33   33   33   17   17   50
We would choose k=11 (or possibly 13) in this case. This choice optimally trades off the variability associated with a low value of k against the oversmoothing associated with a high value of k. It is worth remarking that a useful way to think of k is through the concept of the effective number of parameters. The effective number of parameters corresponding to k is n/k, where n is the number of observations in the training data set. Thus a choice of k=11 has an effective number of parameters of about 2 and is roughly similar in the extent of smoothing to a linear regression fit with two coefficients.
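The same search over k can be scripted. The sketch below uses scikit-learn's k-NN classifier; the arrays Xtrain, ytrain, Xvalid, yvalid are assumed to hold the training and validation cases and their classes.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def validation_errors(Xtrain, ytrain, Xvalid, yvalid, k_values):
    errors = {}
    for k in k_values:
        knn = KNeighborsClassifier(n_neighbors=k).fit(Xtrain, ytrain)
        errors[k] = np.mean(knn.predict(Xvalid) != yvalid)   # misclassification rate
    return errors

# e.g. validation_errors(Xtrain, ytrain, Xvalid, yvalid, [1, 3, 5, 7, 9, 11, 13, 18])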
There are two difficulties with the practical exploitation of the power of the k-NN approach. First, while there is no time required to estimate parameters from the training data (as would be the case for parametric models such as regression), the time to find the nearest neighbors in a large training set can be prohibitive. A number of ideas have been implemented to overcome this difficulty. The main ideas are:
1. Reduce the time taken to compute distances by working in a reduced dimension using dimension reduction techniques such as principal components;
2. Use sophisticated data structures such as search trees to speed up identification of the nearest neighbor. This approach often settles for an almost nearest neighbor to improve speed.
3. Edit the training data to remove redundant or almost redundant points in the training set to speed up the search for the nearest neighbor. An example is to remove observations in the training data set that have no effect on the classification because they are surrounded by observations that all belong to the same class.
Second, the number of observations required in the training data set to qualify as large increases exponentially with the number of dimensions p. This is because the expected distance to the nearest neighbor goes up dramatically with p unless the size of the training data set increases exponentially with p. An illustration of this phenomenon, known as the curse of dimensionality, is the fact that if the independent variables in the training data are distributed uniformly in a hypercube of dimension p, the probability that a point is within a distance of 0.5 units from the center is

π^(p/2) / [2^(p-1) p Γ(p/2)].
The table below is designed to show how rapidly this drops to near zero for different combinations of p and n, the size of the training data set. Each entry is the expected number of the n training points that lie within a distance of 0.5 from the center.

                                           p
n                2         3         4         5        10        20        30         40
10,000           7854      5236      3084      1645       25     0.0002   2x10^-10   3x10^-17
100,000         78540     52360     30843     16449      249     0.0025   2x10^-9    3x10^-16
1,000,000      785398    523600    308425    164493     2490     0.0246   2x10^-8    3x10^-15
10,000,000    7853982   5236000   3084251   1644934    24904     0.2461   2x10^-7    3x10^-14
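The entries in this table come straight from the formula above, so a short check with Python's standard library reproduces them (a simple sketch, not part of the original note):

import math

def prob_within_half_unit(p):
    # Probability that a uniform point in the unit hypercube of dimension p
    # lies within distance 0.5 of the center: pi^(p/2) / (2^(p-1) * p * Gamma(p/2))
    return math.pi**(p / 2) / (2**(p - 1) * p * math.gamma(p / 2))

for p in [2, 3, 4, 5, 10, 20, 30, 40]:
    print(p, 10_000 * prob_within_half_unit(p))   # expected count out of n = 10,000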
Lecture 16
Association Rules in Transaction Databases
The confidence of a rule is the probability that a randomly selected transaction will include all the items in the consequent given that the transaction includes all the items in the antecedent.
Example 1 (Han and Kamber)
The manager of the AllElectronics retail store would like to know what
items sell together. He has a database of transactions as shown below:
Notice that once we have created a list of all itemsets that have the required support, we can deduce the rules that meet the desired confidence ratio by examining all subsets of each itemset in the list. Since any subset of a set must occur at least as frequently as the set, each subset will also be in the list. It is then straightforward to compute the confidence as the ratio of the support for the itemset to the support for each subset of the itemset. We retain the corresponding association rule only if it exceeds the desired cut-off value for confidence. For example, from the itemset {1,2,5} we get the following association rules:
{1, 2} => {5} with confidence = support count of {1, 2, 5} divided by support count of {1, 2}.
[Transactions table: Tr# 10 through 50, listing the items purchased in each transaction]
Data: $A$5:$E$54     Min. Support: 2 (= 4%)     Min. Conf. %: 70

Rule #   Conf. %   Antecedent (a)   Consequent (c)   Support (a)   Support (c)   Support (a U c)   Conf. % if pr(c|a)=pr(c)   Lift Ratio (conf./prev. col.)
1           80          2                 9                5            27              4                    54                     1.5
2          100         5, 7               9                3            27              3                    54                     1.9
3          100         6, 7               8                3            29              3                    58                     1.7
4          100         1, 5               8                2            29              2                    58                     1.7
5          100         2, 7               9                2            27              2                    54                     1.9
6          100         3, 8               4                2            11              2                    22                     4.5
7          100         3, 4               8                2            29              2                    58                     1.7
8          100         3, 7               9                2            27              2                    54                     1.9
9          100         4, 5               9                2            27              2                    54                     1.9
The lift ratio is the confidence of the rule divided by this benchmark confidence pr(c), the confidence we would have in the consequent if it were unrelated to the rule. The larger the lift ratio, the greater is the strength of the association. (What does a ratio less than 1.00 mean? Can it be useful to know such rules?)
In our example the lift ratios highlight Rule 6 as most interesting in that it suggests purchase of item 4 is almost 5 times as likely when items 3 and 8 are purchased than if item 4 was not associated with the itemset {3,8}.
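Support, confidence, and lift are straightforward to compute from a list of transactions. The sketch below uses a small made-up transaction list (not the AllElectronics database) and evaluates one rule:

transactions = [{1, 2, 5}, {2, 4}, {2, 3}, {1, 2, 4}, {1, 3},
                {2, 3}, {1, 3}, {1, 2, 3, 5}, {1, 2, 3}]

def support_count(itemset):
    # number of transactions containing every item in the itemset
    return sum(1 for t in transactions if itemset <= t)

antecedent, consequent = {1, 2}, {5}
n = len(transactions)
confidence = support_count(antecedent | consequent) / support_count(antecedent)
benchmark = support_count(consequent) / n     # pr(c): confidence if c were independent of a
print(confidence, confidence / benchmark)     # confidence and lift ratio of {1,2} => {5}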
Shortcomings
Association rules have not been as useful in practice as one would have hoped. One major shortcoming is that the support-confidence framework often generates too many rules. Another is that often most of them are obvious. Insights such as the celebrated "on Friday evenings diapers and beer are bought together" story are not as common as might be expected. There is a need for skill in association analysis, and it seems likely, as some researchers have argued, that a more rigorous statistical discipline to cope with rule proliferation would be beneficial.
Extensions
The general approach of association analysis utilizing support and
confidence concepts has been extended to sequences where one is looking
for patterns that evolve in time. The computation problems are even more
formidable, but there have been several successful applications.