
Types of Attributes

Nominal
Examples: ID numbers, eye color, zip codes
Ordinal
Examples: rankings (e.g., taste of potato chips on a scale
from 1-10), grades, height in {tall, medium, short}
Interval
Example: calendar dates
Ratio
Examples: temperature in Kelvin, length, time, counts
Q2. Classify the following attributes as binary, discrete, or continuous.
Also classify them as qualitative (nominal or ordinal) or quantitative
(interval or ratio). Some cases may have more than one
interpretation, so briefly indicate your reasoning if you think there
may be some ambiguity.
Example: Age in years. Answer: Discrete, quantitative, ratio
(a) Time in terms of AM or PM.
Binary, qualitative, ordinal
(b) Brightness as measured by a light meter.
Continuous, quantitative, ratio
(c) Brightness as measured by people's judgments.
Discrete, qualitative, ordinal
(d) Angles as measured in degrees between 0 and 360.
Continuous, quantitative, ratio
(e) Bronze, Silver, and Gold medals as awarded at the Olympics.
Discrete, qualitative, ordinal
(f) Height above sea level.
Continuous, quantitative, interval/ratio (depends on whether sea
level is regarded as an arbitrary origin)
(g) Number of patients in a hospital.
Discrete, quantitative, ratio
(h) ISBN numbers for books. (Look up the format on the Web.)
Discrete, qualitative, nominal (ISBN numbers do have order
information, though)
(i) Ability to pass light in terms of the following values: opaque,
translucent, transparent.
Discrete, qualitative, ordinal
(j) Military rank.
Discrete, qualitative, ordinal
(k) Distance from the center of campus.
Continuous, quantitative, interval/ratio
(l) Density of a substance in grams per cubic centimeter.
Continuous, quantitative, ratio
(m) Coat check number. (When you attend an event, you can often
give your coat to someone who, in turn, gives you a number that
you can use to claim your coat when you leave.)
Discrete, qualitative, nominal



18. This exercise compares and contrasts some similarity and
distance measures.
(a) For binary data, the L1 distance corresponds to the
Hamming distance; that is, the number of bits that are
different between two binary vectors. The Jaccard similarity
is a measure of the similarity between two binary vectors.
Compute the Hamming distance and the Jaccard similarity
between the following two binary vectors.
x = 0101010001
y = 0100011000
Hamming distance = number of different bits = 3
Jaccard similarity = number of 1-1 matches / (number of bits - number of 0-0 matches) = 2 / 5 = 0.4
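To make the computation concrete, here is a minimal Python sketch (assuming the vectors are represented as plain 0/1 lists) that reproduces both values:

```python
# Minimal sketch: Hamming distance and Jaccard similarity for binary vectors.
x = [0, 1, 0, 1, 0, 1, 0, 0, 0, 1]
y = [0, 1, 0, 0, 0, 1, 1, 0, 0, 0]

# Hamming distance: number of positions where the bits differ.
hamming = sum(xi != yi for xi, yi in zip(x, y))

# Jaccard similarity: 1-1 matches / (total bits - 0-0 matches).
m11 = sum(xi == 1 and yi == 1 for xi, yi in zip(x, y))
m00 = sum(xi == 0 and yi == 0 for xi, yi in zip(x, y))
jaccard = m11 / (len(x) - m00)

print(hamming)  # 3
print(jaccard)  # 0.4
```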
(b) Which approach, Jaccard or Hamming distance, is
more similar to the Simple Matching Coefficient,
and which approach is more similar to the cosine
measure? Explain
The Hamming distance is similar to the SMC. In fact,
SMC = 1 - (Hamming distance / number of bits).
The Jaccard measure is similar to the cosine measure
because both ignore 0-0 matches.

Cosine Similarity
Example:
x = 3 , 2 , 0 , 5 , 0 , 0 , 0 , 2 , 0, 0
y = 1 , 0 , 0 , 0 , 0 , 0 , 0 , 1 , 0, 2
x · y = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
||x|| = sqrt(3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0) = 6.481
||y|| = sqrt(1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 1*1 + 0*0 + 2*2) = 2.449
cos(x, y) = 5 / (6.481 * 2.449) = 0.3150
If cosine similarity = 1, the angle between x and y is 0°.
If cosine similarity = 0, the angle between x and y is 90°.
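A minimal Python sketch of this cosine computation, using only the standard library:

```python
import math

# Minimal sketch of the cosine similarity computation above.
x = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
y = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]

dot = sum(a * b for a, b in zip(x, y))          # 5
norm_x = math.sqrt(sum(a * a for a in x))       # ~6.481
norm_y = math.sqrt(sum(b * b for b in y))       # ~2.449
print(dot / (norm_x * norm_y))                  # ~0.3150
```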

Correlation
Measures the linear relationship between attributes of
objects
To compute correlation, we standardize the data objects x and y and then take their dot product:

corr(x, y) = s_xy / (s_x s_y)

covariance(x, y) = s_xy = (1/(n-1)) Σ_k (x_k - mean(x)) (y_k - mean(y))

standard deviation of x: s_x = sqrt( (1/(n-1)) Σ_k (x_k - mean(x))^2 )

standard deviation of y: s_y = sqrt( (1/(n-1)) Σ_k (y_k - mean(y))^2 )
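A minimal Python sketch of this correlation computation (assuming the sample covariance and standard deviations with n - 1 denominators, as above); the final line checks the value quoted for exercise 19(a) below:

```python
import math

# Minimal sketch: Pearson correlation via covariance and standard deviations.
def corr(x, y):
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    s_xy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)   # covariance
    s_x = math.sqrt(sum((a - mx) ** 2 for a in x) / (n - 1))          # std dev of x
    s_y = math.sqrt(sum((b - my) ** 2 for b in y) / (n - 1))          # std dev of y
    return s_xy / (s_x * s_y)

print(corr([1, 1, 0, 1, 0, 1], [1, 1, 1, 0, 0, 1]))  # 0.25 (exercise 19a)
```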

19. For the following vectors, x and y, calculate the
indicated similarity or distance measures.
(a) x = (1, 1, 0, 1, 0, 1), y = (1, 1, 1, 0, 0, 1) cosine,
correlation, Jaccard
cos(x, y) = 0.75, corr(x, y) = 0.25, Jaccard(x, y) = 0.6
(b) x = (0, 1, 0, 1), y = (1, 0, 1, 0) cosine, correlation,
Euclidean, Jaccard
cos(x, y) = 0, corr(x, y) = -1, Euclidean(x, y) = 2,
Jaccard(x, y) = 0
(c) x = (0, -1, 0, 1), y = (1, 0, -1, 0) cosine, correlation,
Euclidean
cos(x, y) = 0, corr(x, y)=0, Euclidean(x, y) = 2
(d) x = (1, 1, 1, 1), y = (2, 2, 2, 2) cosine, correlation,
Euclidean
cos(x, y) = 1, corr(x, y) = 0/0 (undefined),
Euclidean(x, y) = 2
(e) x = (2, -1, 0, 2, 0, -3), y = (-1, 1, -1, 0, 0, -1) cosine,
correlation
cos(x, y) = 0, corr(x, y) = 0
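As a cross-check of part (b), a short sketch using numpy (assumed available; np.corrcoef returns the Pearson correlation matrix):

```python
import numpy as np

# Verify the measures for exercise 19(b).
x = np.array([0, 1, 0, 1])
y = np.array([1, 0, 1, 0])

cos = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))  # 0.0
euclid = np.linalg.norm(x - y)                         # 2.0
corr = np.corrcoef(x, y)[0, 1]                         # -1.0
m11 = np.sum((x == 1) & (y == 1))
m00 = np.sum((x == 0) & (y == 0))
jaccard = m11 / (len(x) - m00)                         # 0.0
print(cos, euclid, corr, jaccard)
```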
Nominal Attributes
Multi-way split: Use as many partitions as
distinct values.


Binary split: Divides values into two subsets.
Need to find optimal partitioning.
Decision tree algorithms like CART produce binary splits; an attribute with k distinct values admits 2^(k-1) - 1 possible binary partitions.
Example (CarType with values {Family, Sports, Luxury}): the multi-way split creates one partition per value, while a binary split groups the values into two subsets, e.g. {Family, Luxury} vs. {Sports} OR {Sports, Luxury} vs. {Family} (see the sketch below).
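A small Python sketch (illustrative only) that enumerates the binary partitions of a 3-valued nominal attribute and confirms the 2^(k-1) - 1 count:

```python
from itertools import combinations

# Enumerate binary partitions of a 3-valued nominal attribute.
values = ["Family", "Sports", "Luxury"]
k = len(values)

splits = []
for r in range(1, k):
    for left in combinations(values, r):
        right = tuple(v for v in values if v not in left)
        # Avoid counting {A}|{B,C} and {B,C}|{A} twice.
        if (right, left) not in splits:
            splits.append((left, right))

print(len(splits), 2 ** (k - 1) - 1)  # 3 3
for left, right in splits:
    print(set(left), "vs", set(right))
```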
Measures for selecting the Best Split
Let p(i | t) denote the fraction of records belonging to class i at a given node t.
Measures for selecting the best split are based on the degree of impurity of the child nodes.
The smaller the degree of impurity, the more skewed the class distribution.
A node with class distribution (0, 1) has impurity = 0; a node with the uniform distribution (0.5, 0.5) has the highest impurity.

Measures of Node Impurity
Gini Index:
GINI(t) = 1 - Σ_{i=1..c} [p(i | t)]^2
Entropy:
Entropy(t) = -Σ_{i=1..c} p(i | t) log2 p(i | t)
Misclassification error:
Classification error(t) = 1 - max_i [p(i | t)]
(NOTE: p(i | t) is the relative frequency of class i at node t, c is the number of classes, and 0 log2 0 = 0 in entropy calculations.)

Measure of Impurity: GINI
Gini Index for a given node t:
GINI(t) = 1 - Σ_i [p(i | t)]^2
(NOTE: p(i | t) is the relative frequency of class i at node t.)
Maximum (1 - 1/n_c) when records are equally distributed among all classes, implying least interesting information.
Minimum (0.0) when all records belong to one class, implying most interesting information.

Examples for computing GINI:
C1 = 0, C2 = 6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
Gini = 1 - P(C1)^2 - P(C2)^2 = 1 - 0 - 1 = 0.000
C1 = 1, C2 = 5: P(C1) = 1/6, P(C2) = 5/6
Gini = 1 - (1/6)^2 - (5/6)^2 = 0.278
C1 = 2, C2 = 4: P(C1) = 2/6, P(C2) = 4/6
Gini = 1 - (2/6)^2 - (4/6)^2 = 0.444
C1 = 3, C2 = 3: P(C1) = 3/6, P(C2) = 3/6
Gini = 1 - (3/6)^2 - (3/6)^2 = 0.500
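A minimal Python sketch of the GINI computation for a node given raw class counts (the gini helper name is ours, not from the slides):

```python
# Gini impurity of a node from its class counts.
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

for counts in [(0, 6), (1, 5), (2, 4), (3, 3)]:
    print(counts, round(gini(counts), 3))
# (0, 6) 0.0
# (1, 5) 0.278
# (2, 4) 0.444
# (3, 3) 0.5
```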
Alternative Splitting Criteria based on INFO
Entropy at a given node t:
Entropy(t) = -Σ_i p(i | t) log2 p(i | t)
(NOTE: p(i | t) is the relative frequency of class i at node t.)
Measures the homogeneity of a node.
Maximum (log2 n_c) when records are equally distributed among all classes, implying least information.
Minimum (0.0) when all records belong to one class, implying most information.
Entropy-based computations are similar to the GINI index computations.

Examples for computing Entropy:
C1 = 0, C2 = 6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
Entropy = -0 log2 0 - 1 log2 1 = 0 - 0 = 0
C1 = 1, C2 = 5: P(C1) = 1/6, P(C2) = 5/6
Entropy = -(1/6) log2(1/6) - (5/6) log2(5/6) = 0.65
C1 = 2, C2 = 4: P(C1) = 2/6, P(C2) = 4/6
Entropy = -(2/6) log2(2/6) - (4/6) log2(4/6) = 0.92
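A matching sketch for entropy, with the 0 log2 0 = 0 convention handled explicitly by skipping zero counts:

```python
import math

# Entropy of a node from its class counts (0 * log2(0) treated as 0).
def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

for counts in [(0, 6), (1, 5), (2, 4)]:
    print(counts, round(entropy(counts), 2))
# (0, 6) 0.0
# (1, 5) 0.65
# (2, 4) 0.92
```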
Splitting Criteria based on Classification Error
Classification error at a node t:
Error(t) = 1 - max_i [p(i | t)]
Measures the misclassification error made by a node.
Maximum (1 - 1/n_c) when records are equally distributed among all classes, implying least interesting information.
Minimum (0.0) when all records belong to one class, implying most interesting information.

Examples for computing Error:
C1 = 0, C2 = 6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
Error = 1 - max(0, 1) = 1 - 1 = 0
C1 = 1, C2 = 5: P(C1) = 1/6, P(C2) = 5/6
Error = 1 - max(1/6, 5/6) = 1 - 5/6 = 1/6
C1 = 2, C2 = 4: P(C1) = 2/6, P(C2) = 4/6
Error = 1 - max(2/6, 4/6) = 1 - 4/6 = 1/3
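And the corresponding sketch for classification error:

```python
# Classification error of a node from its class counts.
def classification_error(counts):
    n = sum(counts)
    return 1.0 - max(c / n for c in counts)

for counts in [(0, 6), (1, 5), (2, 4)]:
    print(counts, round(classification_error(counts), 4))
# (0, 6) 0.0
# (1, 5) 0.1667
# (2, 4) 0.3333
```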
Information Gain to find the best split
Before splitting, the parent node has class counts (C0, C1) and impurity M0. Splitting on attribute A (Yes/No) produces nodes N1 and N2 with impurities M1 and M2, whose weighted combination is M12; splitting on attribute B produces nodes N3 and N4 with impurities M3 and M4, whose weighted combination is M34.
The difference in impurity (entropy) gives the information gain:
Gain = M0 - M12 vs. M0 - M34
Splitting Based on INFO...
Information Gain:
GAIN_split = Entropy(p) - Σ_{i=1..k} (n_i / n) Entropy(i)
Parent node p is split into k partitions; n_i is the number of records in partition i.
Measures the reduction in entropy achieved because of the split. Choose the split that achieves the most reduction (maximizes GAIN).
Used in ID3 and C4.5.
Disadvantage: tends to prefer splits that result in a large number of partitions, each being small but pure.
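A minimal sketch of the GAIN_split formula; the parent and child class counts below are hypothetical, purely to illustrate the weighting by n_i / n:

```python
import math

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

# GAIN_split = Entropy(parent) - sum_i (n_i / n) * Entropy(child_i)
def gain_split(parent_counts, child_counts_list):
    n = sum(parent_counts)
    weighted = sum(sum(child) / n * entropy(child) for child in child_counts_list)
    return entropy(parent_counts) - weighted

# Hypothetical split: parent (6, 6) into children (4, 1) and (2, 5).
print(gain_split((6, 6), [(4, 1), (2, 5)]))  # ~0.196
```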
(a) Compute the Gini index for the overall collection of training
examples.
Answer: Gini = 1 - 2 × 0.5^2 = 0.5.
(b) Compute the Gini index for the Customer ID attribute.
Answer: The gini for each Customer ID value is 0. Therefore, the
overall gini for Customer ID is 0.
(c) Compute the Gini index for the Gender attribute.
Answer:
The gini for Male is 1 - 2 × 0.5^2 = 0.5. The gini for Female is also 0.5.
Therefore, the overall gini for Gender is 0.5 × 0.5 + 0.5 × 0.5 = 0.5.
(d) Compute the Gini index for the Car Type attribute using multiway
split.
Answer: The gini for Family car is 0.375, Sports car is 0, and Luxury
car is 0.2188. The overall gini is 0.1625.



(e) Compute the Gini index for the Shirt Size attribute using
multiway split.
Answer: The gini for Small shirt size is 0.48, Medium shirt size is
0.4898, Large shirt size is 0.5, and Extra Large shirt size is 0.5.
The overall gini for Shirt Size attribute is 0.4914.
(f) Which attribute is better, Gender, Car Type, or Shirt Size?
Answer:
Car Type because it has the lowest gini among the three
attributes.
(g) Explain why Customer ID should not be used as the attribute
test condition even though it has the lowest Gini.
Answer:
The attribute has no predictive power since new customers are
assigned to new Customer IDs.
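A sketch reproducing the overall Gini for the Car Type multiway split in (d). The per-value class counts used here are an assumption: they are chosen only because they reproduce the per-value Gini values quoted above (0.375, 0, 0.2188), not taken from the original table:

```python
# Weighted Gini of a multiway split from assumed per-value class counts.
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

# Assumed (C0, C1) counts per Car Type value, consistent with the quoted ginis.
car_type_counts = {"Family": (1, 3), "Sports": (8, 0), "Luxury": (1, 7)}

total = sum(sum(c) for c in car_type_counts.values())
overall = sum(sum(c) / total * gini(c) for c in car_type_counts.values())
print(round(overall, 4))  # 0.1625
```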

3. Consider the training examples shown in Table 4.2 for a binary
classification problem.
(a) What is the entropy of this collection of training examples
with respect to the positive class?
Answer: There are four positive examples and five negative
examples. Thus, P(+) = 4/9 and P(-) = 5/9. The entropy of the
training examples is -(4/9) log2(4/9) - (5/9) log2(5/9) = 0.9911.
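A one-line sketch checking the entropy value in (a):

```python
import math

# Entropy of the training set: 4 positive and 5 negative examples.
p_pos, p_neg = 4 / 9, 5 / 9
print(round(-p_pos * math.log2(p_pos) - p_neg * math.log2(p_neg), 4))  # 0.9911
```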
(b) What are the information gains of a1 and a2 relative to
these training examples?
Answer: For attribute a1, the corresponding counts and
probabilities are:
