
Data Mining

What Is Data Mining?


Data mining refers to extracting or mining knowledge from large amounts of data.
Remember that the mining of gold from rocks or sand is referred to as gold mining rather than
rock or sand mining.
Data mining is defined as the process of discovering patterns in data. The process must be
automatic or (more usually) semiautomatic. The patterns discovered must be meaningful in that
they lead to some advantage, usually an economic one. The data is invariably present in substantial
quantities.
Data mining is a process that uses a variety of data analysis methods to discover the unknown,
unexpected, interesting and relevant patterns and relationships in data that may be used to make
valid and accurate predictions.
Data mining is also referred to as Knowledge Discovery from Data, or KDD.
The steps involved in data mining when viewed as a process of knowledge discovery are as
follows:
1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved from the database)
4. Data transformation (where data are transformed or consolidated into forms appropriate for
mining by performing summary or aggregation operations, for instance)
5. Data mining (an essential process where intelligent methods are applied in order to extract data
patterns)
6. Pattern evaluation (to identify the truly interesting patterns representing knowledge based on
some interestingness measures)
7. Knowledge presentation (where visualization and knowledge representation techniques are used
to present the mined knowledge to the user)
Steps 1 to 4 are different forms of data preprocessing, where the data are prepared for mining. The
data mining step may interact with the user or a knowledge base.
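To make steps 1 to 4 concrete, here is a minimal sketch in Python using pandas; the file names and column names are hypothetical placeholders, not data from this handout.

import pandas as pd

# Hypothetical input files standing in for multiple data sources
customers = pd.read_csv("customers.csv")
orders = pd.read_csv("orders.csv")

# 1. Data cleaning: remove duplicates and records with missing key values
customers = customers.drop_duplicates().dropna(subset=["customer_id", "age"])

# 2. Data integration: combine the two data sources on a common key
data = customers.merge(orders, on="customer_id", how="inner")

# 3. Data selection: keep only the attributes relevant to the analysis task
data = data[["customer_id", "age", "income", "order_total"]]

# 4. Data transformation: consolidate by summary/aggregation operations
mining_view = data.groupby("customer_id").agg(
    age=("age", "first"),
    income=("income", "first"),
    total_spent=("order_total", "sum"),
)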

Architecture of a typical data mining system

Precision and recall measures

It helps me to think (visually) of a confusion matrix when thinking about precision and recall.

Let's use the example of cancer diagnosis, since it is a good example of a case where precision/recall measures are more useful than simple "accuracy" (i.e., a "needle in the haystack" problem).
In this confusion matrix, the "correct" cells are:

TN: the number of true negatives, i.e., patients who did not have cancer whom we correctly
diagnosed as not having cancer.

TP: the number of true positives, i.e., patients who did have cancer whom we correctly
diagnosed as having cancer

and the "error" cells are:

FN: the number of false negatives, i.e., patients who did have cancer whom we incorrectly
diagnosed as not having cancer

FP: the number of false positives, i.e., patients who did not have cancer whom we
incorrectly diagnosed as having cancer

Precision is
(TP)/(TP+FP)
which tells us what proportion of the patients we diagnosed as having cancer actually had cancer. In other words, it is the proportion of TP in the set of positive cancer diagnoses. This is given by the rightmost column in the confusion matrix.
Recall is
(TP)/(TP+FN)
which tells us what proportion of the patients that actually had cancer were diagnosed by us as having cancer. In other words, it is the proportion of TP in the set of true cancer states. This is given by the bottom row in the confusion matrix.

In this representation, it is clearer that recall gives us information about a classifier's performance with respect to false negatives (how many did we miss), while precision gives us information about its performance with respect to false positives.
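As a small illustration, the two measures can be computed directly from the confusion-matrix counts; the numbers below are hypothetical, chosen only to show the calculation.

def precision(tp, fp):
    # Proportion of positive diagnoses that are actually positive: TP / (TP + FP)
    return tp / (tp + fp)

def recall(tp, fn):
    # Proportion of actual positives that were diagnosed as positive: TP / (TP + FN)
    return tp / (tp + fn)

# Hypothetical counts for the cancer-diagnosis example
tp, fp, fn = 80, 20, 40
print(precision(tp, fp))   # 0.8    -> how trustworthy a positive diagnosis is
print(recall(tp, fn))      # ~0.667 -> how many of the true cancer cases were caught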

Machine learning and our focus


Like human learning from past experiences.
A computer does not have experiences.
A computer system learns from data, which represent some past experiences of an
application domain.
Our focus: learn a target function that can be used to predict the values of a discrete class
attribute, e.g., approved or not approved, and high-risk or low-risk.
The task is commonly called: Supervised learning, classification, or inductive learning.
Goal: To learn a classification model from the data that can be used to predict the classes of
new (future, or test) cases/instances.
Supervised learning: classification is seen as supervised learning from examples.
Supervision: The data (observations, measurements, etc.) are labeled with predefined classes. It is as if a teacher provides the classes (supervision).
Test data are classified into these classes too.
Unsupervised learning (clustering)
Class labels of the data are unknown
Given a set of data, the task is to establish the existence of classes or clusters in
the data

Supervised learning process: two steps


Learning (training): Learn a model using the training data.
Testing: Test the model using unseen test data to assess the model accuracy.
Accuracy = (Number of correct classifications) / (Total number of test cases)
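A minimal sketch of this evaluation step in Python; the labels and predictions below are hypothetical.

def accuracy(y_true, y_pred):
    # Accuracy = number of correct classifications / total number of test cases
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Hypothetical test labels and model predictions
y_true = ["approved", "not approved", "approved", "approved", "not approved"]
y_pred = ["approved", "not approved", "not approved", "approved", "not approved"]
print(accuracy(y_true, y_pred))   # 0.8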

What do we mean by learning?


Given
a data set D,
a task T, and
a performance measure M,
A computer system is said to learn from D to perform the task T if, after learning, the system's
performance on T improves as measured by M.
In other words, the learned model helps the system to perform T better as compared to no
learning.
An example
Data: Loan application data
Task: Predict whether a loan should be approved or not.
Performance measure: accuracy
Fundamental assumption of learning
Assumption: The distribution of training examples is identical to the distribution of test examples
(including future unseen examples).

Data Mining Tasks

Prediction Methods
Use some variables to predict unknown or future values of other variables.
Description Methods
Find human-interpretable patterns that describe the data.

Classification [Predictive]
Clustering [Descriptive]
Association Rule Discovery [Descriptive]
Classification: Definition
o Given a collection of records (training set)
Each record contains a set of attributes; one of the attributes is the class.
o Find a model for class attribute as a function of the values of other attributes.
o Goal: previously unseen records should be assigned a class as accurately as possible.
A test set is used to determine the accuracy of the model. Usually, the given data set is
divided into training and test sets, with the training set used to build the model and the test set
used to validate it.
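This split can be sketched as follows, assuming scikit-learn is available; the records, attribute values, and class labels are hypothetical placeholders.

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Hypothetical records: [age, income]; class attribute = loan approved or not
X = [[25, 40000], [47, 90000], [35, 60000], [52, 110000], [23, 32000], [41, 75000]]
y = ["no", "yes", "yes", "yes", "no", "yes"]

# Divide the given data set into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

# Build the model on the training set, then validate it on the unseen test set
model = DecisionTreeClassifier().fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))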

Naive Bayesian classification


Naive Bayesian classification is called naive because it assumes class conditional
independence. That is, the effect of an attribute value on a given class is independent of the
values of the other attributes. This assumption is made to reduce computational costs, and
hence is considered naive.
The major idea behind naive Bayesian classification is to classify data by maximizing
P(X|Ci)P(Ci)
(where i is an index over the classes), using Bayes' theorem of posterior probability. In
general:
P(Ci|X) = P(X|Ci)P(Ci) / P(X)

Naive Bayesian Classifiers with an Example


The following example is a simple demonstration of applying the Naive Bayes classifier.
Table 4.1: Class-labeled training tuples from the AllElectronics customer database

Predicting a class label using naive Bayesian classification: we wish to predict the class label of a
tuple from the training data in the table above. The data tuples are described by the attributes age,
income, student, and credit_rating. The class label attribute, buys_computer, has two distinct
values (namely, {yes, no}). Let C1 correspond to the class buys_computer = yes and C2 correspond
to buys_computer = no. The tuple we wish to classify is
X = (age=youth, income=medium, student=yes, credit_rating=fair)
We need to maximize P(X|Ci)P(Ci), for i=1, 2. P(Ci), the prior probability of each class,
can be computed based on the training tuples:

P(buys_computer=yes) = 9/14=0.643
P(buys_computer=no) = 5/14=0.357
To compute P(X|Ci), for i=1, 2, we compute the following conditional probabilities:

P(age=youth|buys_computer=yes) =2/9=0.222
P(age=youth|buys_computer=no) =3/5=0.600
P(income=medium|buys_computer=yes) =4/9=0.444
P(income=medium|buys_computer=no) =2/5=0.400
P(student=yes|buys_computer=yes) =6/9=0.667
P(student=yes|buys_computer=no) =1/5=0.200
P(credit_rating=fair|buys_computer=yes) =6/9=0.667
P(credit_rating=fair|buys_computer=no) =2/5=0.400

Using the above probabilities, we obtain

P(X|buys_computer=yes) = P(age=youth|buys_computer=yes) *
P(income=medium|buys_computer=yes) *
P(student=yes|buys_computer=yes) *
P(credit_rating=fair|buys_computer=yes)
=0.222 * 0.444 * 0.667 *0.667=0.044

Similarly,
P(X|buys_computer=no) = 0.600 * 0.400 * 0.200 * 0.400 = 0.019.

To find the class, Ci, that maximizes P(X|Ci)P(Ci), we compute

P(X|buys_computer=yes) * P(buys_computer=yes)=
0.044 * 0.643 = 0.028
P(X|buys_computer=no) * P(buys_computer=no) =
0.019 * 0.357 = 0.007

Therefore, the naive Bayesian classifier predicts buys_computer = yes for tuple X.
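The same calculation can be reproduced programmatically; the prior and conditional probabilities below are taken directly from the worked example above.

# Priors and conditional probabilities from the worked example above
priors = {"yes": 9 / 14, "no": 5 / 14}
cond = {
    "yes": [2 / 9, 4 / 9, 6 / 9, 6 / 9],   # age=youth, income=medium, student=yes, credit_rating=fair
    "no":  [3 / 5, 2 / 5, 1 / 5, 2 / 5],
}

scores = {}
for c in priors:
    score = priors[c]
    for p in cond[c]:          # naive assumption: multiply the class-conditional probabilities
        score *= p
    scores[c] = score

print(scores)                          # roughly {'yes': 0.028, 'no': 0.007}
print(max(scores, key=scores.get))     # 'yes' -> buys_computer = yes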


Assignment (1)

x=(Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong)

Answer


Assignment (2)


Clustering
Clustering can be considered the most important unsupervised learning problem; like every
other problem of this kind, it deals with finding structure in a collection of unlabeled data.
A loose definition of clustering could be the process of organizing objects into groups whose
members are similar in some way.
A cluster is therefore a collection of objects which are similar to one another and
dissimilar to the objects belonging to other clusters.
The Goals of Clustering
So, the goal of clustering is to determine the intrinsic grouping in a set of unlabeled data. But
how do we decide what constitutes a good clustering? It can be shown that there is no absolute best
criterion that is independent of the final aim of the clustering. Consequently, it is the
user who must supply this criterion, in such a way that the result of the clustering suits their
needs.
For instance, we could be interested in finding representatives for homogeneous groups (data
reduction), in finding natural clusters and describing their unknown properties (natural data
types), in finding useful and suitable groupings (useful data classes), or in finding unusual data
objects (outlier detection).

K-Means Clustering


How does the K-Means clustering algorithm work?


If the number of data points is less than the number of clusters, we assign each data point as the
centroid of a cluster; each centroid is given a cluster number. If the number of data points is larger
than the number of clusters, then for each data point we calculate the distance to every centroid and
take the minimum; the data point is said to belong to the cluster whose centroid is closest.
Since we are not sure about the locations of the centroids, we adjust each centroid's location
based on the data points currently assigned to it, and then reassign all the data points to the new
centroids. This process is repeated until no data point moves to another cluster. Mathematically,
this loop can be proved to converge.
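A minimal one-dimensional sketch of this loop in Python follows; the data points and the choice of k are hypothetical, and a real application would typically use a library implementation.

def kmeans(points, k, iterations=100):
    centroids = points[:k]                                  # initial centroids: first k points
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:                                    # assign each point to its nearest centroid
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        new_centroids = [sum(c) / len(c) if c else centroids[i]   # recompute centroid locations
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:                      # stop when the centroids no longer move
            break
        centroids = new_centroids
    return centroids, clusters

centroids, clusters = kmeans([1.0, 2.0, 1.5, 8.0, 9.0, 8.5], k=2)
print(centroids)   # [1.5, 8.5]
print(clusters)    # [[1.0, 2.0, 1.5], [8.0, 9.0, 8.5]]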
What are the applications of K-means clustering?
There are many applications of K-means clustering, ranging from unsupervised learning for
neural networks, pattern recognition, classification analysis, and artificial intelligence to image
processing and machine vision. In principle, whenever you have several objects, each described by
several attributes, and you want to group the objects based on those attributes, you can apply
this algorithm.


Association Rule Mining


Given a set of transactions, find rules that will predict the occurrence of an item based on the
occurrences of other items in the transaction

Example of Association Rules


{Diaper} → {Beer},
{Milk, Bread} → {Eggs, Coke},
{Beer, Bread} → {Milk},
Implication means co-occurrence, not causality!
Itemset
A collection of one or more items
Example: {Milk, Bread, Diaper}
k-itemset
An item set that contains k items
Support count (σ)
Frequency of occurrence of an itemset
E.g. σ({Milk, Bread, Diaper}) = 2
Support
Fraction of transactions that contain an itemset
E.g. s({Milk, Bread, Diaper}) = 2/5
Frequent Itemset
An itemset whose support is greater than or equal to a minsup threshold
Association Rule
An implication expression of the form X → Y, where X and Y are itemsets
Example: {Milk, Diaper} → {Beer}
Rule Evaluation Metrics
Support (s)

Fraction of transactions that contain both X and Y

Confidence (c)
Measures how often items in Y appear in transactions that contain X


Support count: The support count of an itemset X, denoted by X.count, in a data set T is
the number of transactions in T that contain X. Assume T has n transactions. Then, for a rule X → Y:

support = (X ∪ Y).count / n

confidence = (X ∪ Y).count / X.count
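The two formulas can be computed over a list of transactions as in the sketch below; the five transactions are a hypothetical reconstruction consistent with the counts quoted earlier (e.g., σ({Milk, Bread, Diaper}) = 2).

# Hypothetical transaction data set T (n = 5 transactions)
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def count(itemset):
    # Support count: number of transactions that contain the itemset
    return sum(1 for t in transactions if itemset <= t)

X, Y = {"Milk", "Diaper"}, {"Beer"}
n = len(transactions)
support = count(X | Y) / n             # (X ∪ Y).count / n
confidence = count(X | Y) / count(X)   # (X ∪ Y).count / X.count
print(support, confidence)             # 0.4 and ~0.667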

Association Rule Mining Task


Given a set of transactions T, the goal of association rule mining is to find all rules having
support ≥ minsup threshold
confidence ≥ minconf threshold


Apriori Algorithm
Let k=1
Generate frequent itemsets of length 1
Repeat until no new frequent itemsets are identified
o Generate length (k+1) candidate itemsets from length k frequent itemsets
o Prune candidate itemsets containing subsets of length k that are infrequent
o Count the support of each candidate by scanning the DB
o Eliminate candidates that are infrequent, leaving only those that are frequent
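A compact Python sketch of this loop follows; it reuses the hypothetical transaction list from the support/confidence example above and is meant only to illustrate the candidate-generation, pruning, and counting steps.

from itertools import combinations

def apriori(transactions, minsup):
    def support(itemset):
        # Count the support of an itemset by scanning the transactions
        return sum(1 for t in transactions if itemset <= t)

    items = {frozenset([i]) for t in transactions for i in t}
    k_sets = {s for s in items if support(s) >= minsup}     # frequent itemsets of length 1
    frequent = []
    while k_sets:
        frequent.extend(k_sets)
        # Generate length (k+1) candidates from length k frequent itemsets
        candidates = {a | b for a in k_sets for b in k_sets if len(a | b) == len(a) + 1}
        # Prune candidates containing a length-k subset that is infrequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in k_sets for s in combinations(c, len(c) - 1))}
        # Count support and eliminate infrequent candidates
        k_sets = {c for c in candidates if support(c) >= minsup}
    return frequent

transactions = [{"Bread", "Milk"}, {"Bread", "Diaper", "Beer", "Eggs"},
                {"Milk", "Diaper", "Beer", "Coke"}, {"Bread", "Milk", "Diaper", "Beer"},
                {"Bread", "Milk", "Diaper", "Coke"}]
for itemset in apriori(transactions, minsup=2):
    print(set(itemset))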


Example:
Dataset T, minsup = 2
Finding frequent itemsets

