
Data Mining

An Introduction

Setia Pramana

About Me
[Career timeline figure, 1999-2018:]
• BSc in Statistics, Brawijaya Univ.
• M.Sc in Applied Statistics
• Started working @ STIS
• M.Sc in Biostatistics, Hasselt Univ., Belgium
• PhD in Mathematics (Statistical Bioinformatics), Hasselt Univ., Belgium
• Postdoc @ MEB, Karolinska Institutet
• Research Assistant, STIS
• Head of PPPM, Politeknik Statistika STIS
• Associate Professor
Board Member
• UN Global Working Group Big Data for Official Statistics
• Asosiasi Ilmuan Data Indonesia
• Ikatan Statistisi Indonesia
• Forum Pendidikan Tinggi Statistika
• Masyarakat Biodiversiti dan Bioinformatika Indonesia
• Asosiasi Artificial Intelegent Indonesia
Data, data and data everywhere…

(Image source: http://mycervello.com)
How Much Data Do We Have?
• Google processes 20 PB a day (2008)
• Facebook has 60 TB of daily logs
• eBay has 6.5 PB of user data + 50 TB/day (5/2009)
• 1000 Genomes Project: 200 TB
• Cost of 1 TB of disk: $35
• Time to read a 1 TB disk: 3 hrs (at 100 MB/s)
Types of Data
• Relational data (tables/transactions/legacy data)
• Text data (Web)
• Semi-structured data (XML)
• Graph data
  • Social networks, Semantic Web (RDF), …
• Streaming data
  • You can only afford to scan the data once
What to Do?
• Aggregation and statistics
  • Data warehousing and OLAP
• Indexing, searching, and querying
  • Keyword-based search
  • Pattern matching (XML/RDF)
• Knowledge discovery
  • Data mining
  • Statistical modeling
Analytics Approaches
• Descriptive: What happened, or what is happening now?
• Diagnostic: Why did it happen, or why is it happening now?
• Predictive: What will happen next? What will happen under various conditions?
• Prescriptive: What are the options to create the most optimal/highest-value outcome?
Data Science

"Applying advanced statistical tools to existing data to solve problems, generate new insights, and improve products/services"

"Everything that has something to do with data: collecting, analyzing, modeling… yet the most important part is its applications --- all sorts of applications"
What is Data Science?
• Theories and techniques from many fields and disciplines are used to investigate and analyze large amounts of data to help decision makers in many industries, such as science, engineering, economics, politics, finance, and education
• Computer Science
  • Pattern recognition, visualization, data warehousing, high-performance computing, databases, AI
• Mathematics
  • Mathematical modeling
• Statistics
  • Statistical and stochastic modeling, probability
Data Science
• A mashed-up discipline
• A multi-disciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data
Data Science
• A new discipline
• Very few books cover the discipline as a whole
• An interdisciplinary field, like business analysis, that incorporates computer science, modeling, statistics, analytics, and mathematics
Source: Monica Rogati, https://hackernoon.com/the-ai-hierarchy-of-needs-18f111fcc007
Data Engineers vs. Data Scientists
Source: DataCamp
Current Data Scientists Profile
Methods
• Sentiment analysis
• Time series analysis
• Data mining
• Multilevel modeling
• Missing data imputation
• Classification and clustering
• Survival analysis
• Pattern recognition
• Principal component and factor analysis
• A/B testing
• Machine learning
• Forecasting
• Propensity score matching
• Logistic, multinomial, and multiple linear regression techniques
• Network analysis
Source: https://medium.com/@StepUpAnalytics/ai-vs-machine-learning-vs-deep-learning-vs-data-science-572b34452c3
Source: https://www.datasciencecentral.com
Data Mining, AI, and Machine Learning
• Data Mining: extracts information from existing data to highlight patterns, and serves as a foundation for AI and machine learning
• Artificial Intelligence: creating machines that perform functions that require intelligence when performed by people
• Machine Learning: offers the data necessary for a machine to learn and adapt; the machine must automatically learn the parameters of models from the data, using self-learning algorithms to improve its performance on a task with experience over time
AI in Sci-Fi Movies
• Terminator
• Iron Man (Marvel): J.A.R.V.I.S, "Just A Rather Very Intelligent System"
(Image source: http://starwars.com/)


AI in Life
• AI assistants
• Vacuum-cleaning robots


AI in Life
• Kiva warehouse robots


What is Artificial Intelligence?
• "The art of creating machines that perform functions that require intelligence when performed by people" (Kurzweil, 1990)
• "The study of how to make computers do things at which, at the moment, people are better" (Rich and Knight, 1991)
• AI: acting humanly


Machine Learning
• Learning from experience
• Learning from data
• Following instructions
Machine Learning
• Machine learning aims to optimize a certain task using example data or past experience
• The extraction of knowledge from data
• Machine learning is the preferred approach to:
  • Business intelligence
  • Speech recognition, natural language processing
  • Computer vision
  • Robot control
  • Computational biology
  • Crime prediction
  • Etc.
Types of Learning
• Supervised (inductive) learning
  • Training data includes the desired outputs
• Unsupervised learning
  • Training data does not include the desired outputs
• Reinforcement learning
  • Rewards from a sequence of actions
Methods
• Supervised learning
  • Decision tree induction
  • Rule induction
  • Naïve Bayes
  • Neural networks
  • Support vector machines
  • Model ensembles
  • Etc.
• Unsupervised learning
  • Clustering
  • Dimensionality reduction
• Reinforcement learning
  • Decision making (robots, chess machines)
Classification
What is Classification?
• Assigning an object to a certain class based on its similarity to previous examples of other objects
• Can be done with reference to the original data or based on a model of that data
• E.g.: Me: "It's round, green, delicious and crunchy." You: "It's an apple!"
Examples
• Classifying transactions as genuine or fraudulent, e.g. credit card usage, insurance claims, cell phone calls
• Classifying prospects as good or bad customers
• Classifying engine faults by their symptoms
• Classifying healthy and sick people based on their symptoms
• Classifying tumor and normal cell lines based on DNA mutations or gene expression
(Un)Certainty
• As with most data mining solutions, a classification usually comes with a degree of certainty
• It might be the probability of the object belonging to the class, or some other measure of how closely the object resembles other examples from that class
Techniques
• Non-parametric, e.g. k-nearest neighbour (see the sketch below)
• Mathematical models, e.g. LDA, logistic regression, neural networks
• Rule-based models, e.g. decision trees
• Support vector machines
• Etc.
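As a hedged illustration of the first family, a minimal k-nearest-neighbour sketch in Python (scikit-learn is assumed to be available; the bundled iris data is just a stand-in):

# k-NN: classify by majority vote among the k closest training examples.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)   # vote among the 5 nearest neighbours
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))            # accuracy on held-out data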
Classification vs. Prediction
• Classification:
  • Predicts categorical class labels
  • Classifies data (constructs a model) based on the training set and the values (class labels) of a classifying attribute, and uses it to classify new data
• Prediction:
  • Models continuous-valued functions, i.e., predicts unknown or missing values
• Typical applications:
  • Credit approval
  • Target marketing
  • Medical diagnosis
  • Treatment effectiveness analysis
Classification—A Two-Step Process
• Model construction: describing a set of predetermined classes
  • Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
  • The set of tuples used for model construction is the training set
  • The model is represented as classification rules, decision trees, or mathematical formulae
• Model usage: classifying future or unknown objects
  • Estimate the accuracy of the model
    • The known label of each test sample is compared with the classified result from the model
    • The accuracy rate is the percentage of test set samples that are correctly classified by the model
    • The test set must be independent of the training set, otherwise over-fitting will occur
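A minimal sketch of the two steps (scikit-learn assumed; the bundled data set and the choice of a decision tree are stand-ins, not part of the slides):

# Step 1: model construction on a training set; Step 2: accuracy on a test set.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)

# Keep the test set independent of the training set to detect over-fitting.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = DecisionTreeClassifier().fit(X_train, y_train)   # model construction
y_pred = model.predict(X_test)                           # model usage
print("Accuracy rate:", accuracy_score(y_test, y_pred))  # % correctly classified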
Classification Process (1): Model Construction

Training Data:

NAME     RANK            YEARS   TENURED
Mike     Assistant Prof  3       no
Mary     Assistant Prof  7       yes
Bill     Professor       2       yes
Jim      Associate Prof  7       yes
Dave     Assistant Prof  6       no
Anne     Associate Prof  3       no

The training data is fed into a classification algorithm to construct the model.
Classification Process (1), continued: the algorithm outputs a classifier (model), e.g.:

IF rank = 'professor'
OR years > 6
THEN tenured = 'yes'
Classification Process (2): Use the Model in Prediction

Testing Data:

NAME     RANK            YEARS   TENURED
Tom      Assistant Prof  2       no
Merlisa  Associate Prof  7       no
George   Professor       5       yes
Joseph   Assistant Prof  7       yes

Unseen data: (Jeff, Professor, 4) → Tenured?
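The learned rule can be applied directly to the unseen record; a tiny sketch (the function and argument names are ours, not from the slides):

# The rule learned in step (1), written as a function (illustrative only).
def tenured(rank: str, years: int) -> str:
    return "yes" if rank == "professor" or years > 6 else "no"

print(tenured("professor", 4))  # Jeff -> 'yes': the rule fires on rank alone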
Decision Tree Learning
• Uses a decision tree as a predictive model
• A decision tree uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility
Two Main Types of Decision Trees
• Classification tree analysis: the predicted outcome is the class to which the data belongs
• Regression tree analysis: the predicted outcome can be considered a real number (e.g. the price of a house, or a patient's length of stay in a hospital)
• The term Classification And Regression Tree (CART) analysis is an umbrella term used to refer to both of the above procedures
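A short sketch of both variants with scikit-learn (assumed available; the bundled data sets are stand-ins):

# Classification tree: predicts a class; regression tree: predicts a number.
from sklearn.datasets import load_iris, load_diabetes
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

Xc, yc = load_iris(return_X_y=True)        # categorical target (flower species)
clf = DecisionTreeClassifier(max_depth=3).fit(Xc, yc)

Xr, yr = load_diabetes(return_X_y=True)    # continuous target (disease progression)
reg = DecisionTreeRegressor(max_depth=3).fit(Xr, yr)

print(clf.predict(Xc[:1]), reg.predict(Xr[:1]))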
Decision Tree

Training Data:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model: decision tree (splitting attributes):

Refund?
  Yes → NO
  No → MarSt?
         Married → NO
         Single, Divorced → TaxInc?
                              < 80K → NO
                              > 80K → YES
Decision Tree

Another model for the same training data:

MarSt?
  Married → NO
  Single, Divorced → Refund?
                       Yes → NO
                       No → TaxInc?
                              < 80K → NO
                              > 80K → YES

There could be more than one tree that fits the same data!
Decision Tree Classification Task

Training Set:

Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Induction: a tree induction algorithm learns a model from the training set.
Apply Decision Tree Model

Test Set:

Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

Deduction: the learned model is applied to the test set to predict the missing class labels.
Apply Model to Test Data

Test record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?

Start from the root of the tree:
1. Refund? The record has Refund = No, so follow the No branch to MarSt
2. MarSt? The record has Marital Status = Married, so follow the Married branch
3. That branch ends in a leaf labeled NO, so assign Cheat = 'No'
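The same walk can be written as nested conditionals; a sketch (the function and argument names are ours):

# The example tree from the slides as nested conditionals (illustrative only).
def cheat(refund: str, marital_status: str, taxable_income: float) -> str:
    if refund == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    return "Yes" if taxable_income > 80_000 else "No"

print(cheat("No", "Married", 80_000))  # -> 'No', matching the walkthrough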
Random Forest
• Leo Breiman, "Random Forests", Machine Learning, 45, 5-32, 2001
• Motivation: reduce the error correlation between classifiers
• Main idea: build a large number of un-pruned decision trees
• Key: use a random selection of features to split on at each node
How Random Forests Work
• Each tree is grown on a bootstrap sample of the training set of N cases
• A number m is specified, much smaller than the total number of variables M (e.g. m = sqrt(M))
• At each node, m variables are selected at random out of the M
• The split used is the best split on these m variables
• Final classification is done by majority vote across trees

Source: Breiman and Cutler
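A minimal scikit-learn sketch of these ingredients (the library and data set are assumptions, not part of the slides):

# n_estimators = number of bootstrapped trees; max_features='sqrt' selects
# m = sqrt(M) random variables at each node; prediction is a majority vote.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
rf = RandomForestClassifier(n_estimators=500, max_features="sqrt",
                            bootstrap=True, random_state=0)
print(cross_val_score(rf, X, y, cv=5).mean())  # estimated accuracy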
Advantages of Random Forests
• More robust with respect to noise
• More efficient on large data
• Provides an estimate of the importance of each feature in determining the classification
• More info at: http://stat-www.berkeley.edu/users/breiman/RandomForests/cc_home.htm
Unsupervised Learning
Setia Pramana
In-House Training Data Science, OJK
Clustering vs. Class Prediction
• Class prediction:
  • A learning set of objects with known classes
  • Goal: put new objects into existing classes
  • Also called: supervised learning, or classification
• Clustering:
  • No learning set, no given classes
  • Goal: discover the "best" classes or groupings
  • Also called: unsupervised learning, or class discovery
Clustering
• Clustering: the process of grouping a set of objects into classes of similar objects
• The most common form of unsupervised learning
• Unsupervised learning = learning from raw data, as opposed to supervised learning, where a classification of examples is given
Issues in Clustering
• Used to explore and visualize data, with few preconceptions
• Many subjective choices must be made, so a clustering output tends to be subjective
• It is difficult to get truly statistically "significant" conclusions
• Algorithms will always produce clusters, whether any exist in the data or not
Clustering Techniques
• Hierarchical
  • Single link
  • Complete link
  • Average link
  • Ward's method
• Partitional
  • Square error: K-means
  • Mixture maximization: Expectation Maximization
Technique Characteristics
• Agglomerative vs. divisive
  • Agglomerative: each instance starts as its own cluster, and the algorithm merges clusters
  • Divisive: begins with all instances in one cluster and divides it up
• Hard vs. fuzzy
  • Hard clustering assigns each instance to exactly one cluster, whereas fuzzy clustering assigns degrees of membership
More Characteristics
• Monothetic vs. polythetic
  • Polythetic: all attributes are used simultaneously, e.g., to calculate distance (most algorithms)
  • Monothetic: attributes are considered one at a time
• Incremental vs. non-incremental
  • With large data sets it may be necessary to consider only part of the data at a time (data mining)
  • Incremental methods work instance by instance
Partitional Clustering
• Outputs a single partition of the data into clusters
• Good for large data sets
• Determining the number of clusters is a major challenge
K-means
Works with numeric data only:
1) Pick a number (K) of cluster centers (at random)
2) Assign every item to its nearest cluster center (e.g. using Euclidean distance)
3) Move each cluster center to the mean of its assigned items
4) Repeat steps 2-3 until convergence (change in cluster assignments less than a threshold)
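A from-scratch sketch of these four steps in Python (NumPy assumed; empty clusters are not handled, and the toy data is invented for illustration):

import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick K cluster centers at random from the data points.
    centers = X[rng.choice(len(X), size=K, replace=False)]
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        # Step 2: assign every item to its nearest center (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 4: stop once cluster assignments no longer change.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 3: move each center to the mean of its assigned items.
        centers = np.array([X[labels == k].mean(axis=0) for k in range(K)])
    return labels, centers

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=1.0, size=(50, 2))
               for c in ([0, 0], [5, 5], [0, 5])])
labels, centers = kmeans(X, K=3)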
K-Means
• Clusters are based on the centroids (aka the center of gravity, or mean) of the points in a cluster c:

  μ(c) = (1/|c|) · Σ_{x ∈ c} x

• Reassignment of instances to clusters is based on distance to the current cluster centroids
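In code, the centroid formula is one line per cluster; a NumPy sketch (the array names are ours):

# Centroid of each cluster: mu(c) = (1/|c|) * sum of its points.
import numpy as np

X = np.array([[1.0, 2.0], [2.0, 1.0], [8.0, 9.0], [9.0, 8.0]])  # toy points
labels = np.array([0, 0, 1, 1])                                 # cluster of each point

centroids = np.array([X[labels == k].mean(axis=0) for k in range(2)])
print(centroids)  # [[1.5 1.5], [8.5 8.5]]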
K-means Example (walkthrough of the animation)
• Step 1: pick 3 initial cluster centers k1, k2, k3 (randomly)
• Step 2: assign each point to the closest cluster center
• Step 3: move each cluster center to the mean of its assigned points
• Step 4: reassign the points that are now closest to a different center (Q: which points are reassigned? A: three points near the cluster boundaries)
• Step 4b: re-compute the cluster means
• Step 5: move the cluster centers to the new cluster means, and repeat until the assignments stop changing
Hierarchical Clustering
• Build a tree-based hierarchical taxonomy (dendrogram) from a set of documents, e.g.:

animal
  vertebrate: fish, reptile, amphib., mammal
  invertebrate: worm, insect, crustacean

How could you do this with k-means?
Hierarchical Clustering Algorithms
• Agglomerative (bottom-up):
  • Start with each document being a single cluster
  • Eventually all documents belong to the same cluster
• Divisive (top-down):
  • Start with all documents belonging to the same cluster
  • Eventually each node forms a cluster on its own
  • Could be a recursive application of k-means-like algorithms
• Does not require the number of clusters k in advance
• Needs a termination/readout condition
Hierarchical Agglomerative Clustering (HAC)
• Assumes a similarity function for determining the similarity of two instances
• Starts with each instance in a separate cluster, then repeatedly joins the two clusters that are most similar until there is only one cluster
• The history of merging forms a binary tree or hierarchy
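A minimal sketch with SciPy (assumed available; the toy points stand in for documents):

# Agglomerative clustering: repeatedly merge the two most similar clusters.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [10, 0]])  # toy "documents"
Z = linkage(X, method="average")  # average-link merging; Z encodes the binary tree
dendrogram(Z, labels=["A", "B", "C", "D", "E"])  # visualize the merge history
plt.show()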
Hierarchical Clustering
[Dendrogram over items A-G; the vertical axis shows the similarity level at which clusters merge]
What is a Good Clustering?
• Internal criterion: a good clustering will produce high-quality clusters in which:
  • the intra-class (that is, intra-cluster) similarity is high
  • the inter-class similarity is low
• The measured quality of a clustering depends on both the document representation and the similarity measure used
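One widely used internal criterion combining both properties is the silhouette score; a hedged sketch (scikit-learn assumed, not mentioned in the slides):

# High intra-cluster similarity and low inter-cluster similarity push the
# silhouette toward +1; values near 0 or below suggest poor clusters.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(silhouette_score(X, labels))  # close to 1 for well-separated blobs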
Thank you