It helps me to think (visually) of a confusion matrix when thinking about precision and recall.
Let's use the example of cancer diagnosis, since (as other answerers have pointed out), it's a good
example of a case where precision/recall measures are more useful than simple "accuracy" (i.e., a
"needle in the haystack" problem).
In this confusion matrix, the "correct" cells are:
TN: the number of true negatives, i.e., patients who did not have cancer whom we correctly diagnosed as not having cancer.
TP: the number of true positives, i.e., patients who did have cancer whom we correctly diagnosed as having cancer.
and the "error" cells are:
FN: the number of false negatives, i.e., patients who did have cancer whom we incorrectly diagnosed as not having cancer.
FP: the number of false positives, i.e., patients who did not have cancer whom we incorrectly diagnosed as having cancer.
Precision is
(TP)/(TP+FP)
which tells us what proportion of patients we diagnosed as having cancer actually had cancer. In
other words, proportion of TP in the set of positive cancer diagnoses. This is given by
the rightmost column in the confusion matrix.
Recall is
(TP)/(TP+FN)
which tells us what proportion of patients that actually had cancer were diagnosed by us as
having cancer. In other words, proportion of TP in the set of true cancer states. This is given by
the bottom row in the confusion matrix.
Like so:
Accuracy = (number of correct classifications) / (total number of classifications) = (TP + TN) / (TP + TN + FP + FN)
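The two formulas (plus overall accuracy) can be computed directly from the four confusion-matrix cells. In the sketch below the counts are invented for illustration; only the formulas come from the text:

```python
# Hypothetical confusion-matrix counts for the cancer-diagnosis example
# (the numbers are invented; only the formulas come from the text).
tp, fp = 80, 20   # diagnosed with cancer: 80 truly had it, 20 did not
fn, tn = 10, 890  # diagnosed as healthy: 10 actually had cancer, 890 did not

precision = tp / (tp + fp)                  # TP / (TP + FP)
recall = tp / (tp + fn)                     # TP / (TP + FN)
accuracy = (tp + tn) / (tp + tn + fp + fn)  # correct / total

print(f"precision={precision:.3f} recall={recall:.3f} accuracy={accuracy:.3f}")
# precision=0.800 recall=0.889 accuracy=0.970
```

Note how accuracy is high (0.970) even though one in nine cancer patients was missed, which is exactly why precision and recall matter in "needle in the haystack" problems.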
Prediction Methods
Use some variables to predict unknown or future values of other variables.
Description Methods
Find human-interpretable patterns that describe the data.
Classification [Predictive]
Clustering [Descriptive]
Association Rule Discovery [Descriptive]
Classification: Definition
o Given a collection of records (training set)
Each record contains a set of attributes; one of the attributes is the class.
o Find a model for class attribute as a function of the values of other attributes.
o Goal: previously unseen records should be assigned a class as accurately as possible.
A test set is used to determine the accuracy of the model. Usually, the given data set is
divided into training and test sets, with the training set used to build the model and the test set
used to validate it.
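The usual split can be sketched in a few lines; the 70/30 ratio and the fixed seed below are illustrative choices, not from the text:

```python
import random

def train_test_split(records, test_fraction=0.3, seed=42):
    """Shuffle the records and divide them into training and test sets."""
    rng = random.Random(seed)
    shuffled = records[:]  # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

train, test = train_test_split(list(range(10)))
print(len(train), len(test))  # 7 3
```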
Predicting a class label using naïve Bayesian classification: we wish to predict the class label of a
tuple from the training data in the table above. The data
tuples are described by the attributes age, income, student and credit rating. The class label
attribute, buys_computer, has two distinct values (namely, {yes, no}). Let C1 correspond to the
class buys_computer=yes and C2 correspond to buys_computer=no. The tuple we wish to classify
is
X = (age=youth, income=medium, student=yes, credit_rating=fair)
We need to maximize P(X|Ci)P(Ci), for i=1, 2. P(Ci), the prior probability of each class,
can be computed based on the training tuples:
P(buys_computer=yes) = 9/14=0.643
P(buys_computer=no) = 5/14=0.357
To compute P(X|Ci), for i=1, 2, we compute the following conditional probabilities:
P(age=youth|buys_computer=yes) =2/9=0.222
P(age=youth|buys_computer=no) =3/5=0.600
P(income=medium|buys_computer=yes) =4/9=0.444
P(income=medium|buys_computer=no) =2/5=0.400
P(student=yes|buys_computer=yes) =6/9=0.667
P(student=yes|buys_computer=no) =1/5=0.200
P(credit_rating=fair|buys_computer=yes) =6/9=0.667
P(credit_rating=fair|buys_computer=no) =2/5=0.400
P(X|buys_computer=yes) = P(age=youth|buys_computer=yes) *
P(income=medium|buys_computer=yes) *
P(student=yes|buys_computer=yes) *
P(credit_rating=fair|buys_computer=yes)
= 0.222 * 0.444 * 0.667 * 0.667 = 0.044
Similarly,
P(X|buys_computer=no) = 0.600 * 0.400 * 0.200 * 0.400 = 0.019.
P(X|buys_computer=yes) * P(buys_computer=yes)=
0.044 * 0.643 = 0.028
P(X|buys_computer=no) * P(buys_computer=no) =
0.019 * 0.357 = 0.007
Therefore, the naïve Bayesian classifier predicts buys_computer = yes for tuple X.
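The hand computation above can be reproduced in a short script. The probabilities below are the counts quoted in the text, hard-coded rather than derived from the (not shown) training table:

```python
# Class priors and conditional probabilities, taken from the worked example.
priors = {"yes": 9 / 14, "no": 5 / 14}
conditionals = {
    "yes": {"age=youth": 2 / 9, "income=medium": 4 / 9,
            "student=yes": 6 / 9, "credit_rating=fair": 6 / 9},
    "no":  {"age=youth": 3 / 5, "income=medium": 2 / 5,
            "student=yes": 1 / 5, "credit_rating=fair": 2 / 5},
}
X = ["age=youth", "income=medium", "student=yes", "credit_rating=fair"]

# P(X|Ci) P(Ci): multiply the prior by each attribute's conditional probability.
scores = {}
for c, prior in priors.items():
    score = prior
    for attribute_value in X:
        score *= conditionals[c][attribute_value]
    scores[c] = score

prediction = max(scores, key=scores.get)
print(prediction)  # yes  (about 0.028 vs 0.007, matching the text)
```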
Assignment (1)
Answer
Assignment (2)
Clustering
Clustering can be considered the most important unsupervised learning problem; so, as every
other problem of this kind, it deals with finding a structure in a collection of unlabeled data.
A loose definition of clustering could be the process of organizing objects into groups whose
members are similar in some way.
A cluster is therefore a collection of objects which are similar to one another and
dissimilar to the objects belonging to other clusters.
The Goals of Clustering
So, the goal of clustering is to determine the intrinsic grouping in a set of unlabeled data. But
how do we decide what constitutes a good clustering? It can be shown that there is no absolute best
criterion that is independent of the final aim of the clustering. Consequently, it is the
user who must supply this criterion, in such a way that the result of the clustering suits their
needs.
For instance, we could be interested in finding representatives for homogeneous groups (data
reduction), in finding natural clusters and describing their unknown properties (natural data
types), in finding useful and suitable groupings (useful data classes), or in finding unusual data
objects (outlier detection).
K-Means Clustering
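A minimal sketch of the k-means procedure: pick k initial centroids, assign each point to its nearest centroid, move each centroid to the mean of its cluster, and repeat. The 1-D version below and the helper name kmeans_1d are illustrative, not from the original slides:

```python
import random

def kmeans_1d(points, k, iterations=20, seed=0):
    """Plain k-means on 1-D data."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centroids: k random points
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: abs(p - centroids[j]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster
        # (an empty cluster keeps its old centroid).
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Two obvious clumps around 1.0 and 9.0; k-means should find both centers.
centroids, clusters = kmeans_1d([1.0, 1.1, 0.9, 9.0, 9.1, 8.9], k=2)
print(sorted(round(c, 3) for c in centroids))  # [1.0, 9.0]
```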
Confidence (c): measures how often items in Y appear in transactions that contain X.
Support count: The support count of an itemset X, denoted by X.count, in a data set T is
the number of transactions in T that contain X. Assume T has n transactions.
support = (X ∪ Y).count / n

confidence = (X ∪ Y).count / X.count
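The two formulas can be sketched on a small transaction set; the transactions and itemsets below are invented for illustration:

```python
# A hypothetical transaction data set T (n = 5 transactions).
T = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def count(itemset):
    """X.count: the number of transactions in T that contain itemset X."""
    return sum(1 for t in T if itemset <= t)

def support(X, Y):
    return count(X | Y) / len(T)    # (X ∪ Y).count / n

def confidence(X, Y):
    return count(X | Y) / count(X)  # (X ∪ Y).count / X.count

print(support({"milk"}, {"diapers"}))     # 3/5 = 0.6
print(confidence({"milk"}, {"diapers"}))  # 3/4 = 0.75
```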
Apriori Algorithm
Let k=1
Generate frequent itemsets of length 1
Repeat until no new frequent itemsets are identified
o Generate length (k+1) candidate itemsets from length k frequent itemsets
o Prune candidate itemsets containing subsets of length k that are infrequent
o Count the support of each candidate by scanning the DB
o Eliminate candidates that are infrequent, leaving only those that are frequent
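The steps above can be sketched as follows; the function name and the toy transactions are illustrative, not from the text:

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Level-wise Apriori: generate candidates, prune by infrequent
    subsets, count support, eliminate infrequent candidates."""
    def count(itemset):
        return sum(1 for t in transactions if itemset <= t)

    items = sorted({i for t in transactions for i in t})
    # k = 1: frequent single items
    freq = [frozenset([i]) for i in items if count(frozenset([i])) >= minsup]
    all_frequent = list(freq)
    k = 1
    while freq:
        # Generate length-(k+1) candidates by joining length-k frequent sets.
        candidates = {a | b for a in freq for b in freq if len(a | b) == k + 1}
        # Prune candidates that have an infrequent length-k subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in freq for s in combinations(c, k))}
        # Count support and eliminate infrequent candidates.
        freq = [c for c in candidates if count(c) >= minsup]
        all_frequent.extend(freq)
        k += 1
    return all_frequent

# Toy data: with minsup=3, {a}, {b}, {c}, {a,b}, {a,c}, {b,c} are frequent,
# while {a,b,c} (support 2) is eliminated.
T = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c", "d"}]
result = apriori(T, 3)
print(result)
```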
Example:
Dataset T, minsup = 2
Finding frequent itemsets