
Data Mining

An Introduction

Setia Pramana

About Me
[Career timeline figure, 1999-2018:]
• BSc in Statistics, Brawijaya Univ.
• M.Sc in Applied Statistics
• Started working @ STIS
• M.Sc in Biostatistics, Hasselt Univ., Belgium
• PhD in Mathematics (Statistical Bioinformatics), Hasselt Univ., Belgium
• Postdoc @ MEB, Karolinska Institutet
• Research Assistant, STIS
• Head of PPPM, Politeknik Statistika STIS
• Associate Professor
Board Member
• UN Global Working Group Big Data for Official Statistics
• Asosiasi Ilmuan Data Indonesia
• Ikatan Statistisi Indonesia
• Forum Pendidikan Tinggi Statistika
• Masyarakat Biodiversiti dan Bioinformatika Indonesia
• Asosiasi Artificial Intelegent Indonesia
Data, data and data everywhere…

(Image source: http://mycervello.com)
How Much Data Do We Have?
• Google processes 20 PB a day (2008)
• Facebook has 60 TB of daily logs
• eBay has 6.5 PB of user data + 50 TB/day (5/2009)
• 1000 Genomes Project: 200 TB
• Cost of 1 TB of disk: $35
• Time to read a 1 TB disk: 3 hrs (at 100 MB/s)
Types of Data
• Relational data (tables/transactions/legacy data)
• Text data (Web)
• Semi-structured data (XML)
• Graph data
  • Social networks, Semantic Web (RDF), …
• Streaming data
  • You can only afford to scan the data once
What to Do?
• Aggregation and statistics
  • Data warehousing and OLAP
• Indexing, searching, and querying
  • Keyword-based search
  • Pattern matching (XML/RDF)
• Knowledge discovery
  • Data mining
  • Statistical modeling
Analytics Approaches
• Descriptive: What happened, or what is happening now?
• Diagnostic: Why did it happen, or why is it happening now?
• Predictive: What will happen next? What will happen under various conditions?
• Prescriptive: What are the options to create the most optimal/highest-value outcome?
Data Science

"Applying advanced statistical tools to existing data to solve problems, generate new insights, and improve products/services"

"Everything that has something to do with data: collecting, analyzing, modeling… yet the most important part is its applications --- all sorts of applications"
What is Data Science?
• Theories and techniques from many fields and disciplines are used to investigate and analyze large amounts of data to help decision makers in many industries, such as science, engineering, economics, politics, finance, and education
• Computer Science
  • Pattern recognition, visualization, data warehousing, high-performance computing, databases, AI
• Mathematics
  • Mathematical modeling
• Statistics
  • Statistical and stochastic modeling, probability
Data Science
• A mashed-up discipline
• A multi-disciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data
Data Science
• A new discipline
• Very few books cover the discipline as a whole
• An interdisciplinary field, like business analysis, that incorporates computer science, modeling, statistics, analytics, and mathematics
Source: Monica Rogati, https://hackernoon.com/the-ai-hierarchy-of-needs-18f111fcc007
Data Engineers vs. Data Scientists
Source: DataCamp
Current Data Scientists Profile
Methods
• Sentiment analysis
• Time series analysis
• Data mining
• Multilevel modeling
• Missing data imputation
• Classification and clustering
• Survival analysis
• Pattern recognition
• Principal component and factor analysis
• A/B testing
• Machine learning
• Forecasting
• Propensity score matching
• Logistic, multinomial, and multiple linear regression techniques
• Network analysis
Source: https://medium.com/@StepUpAnalytics/ai-vs-machine-learning-vs-deep-learning-vs-data-science-572b34452c3
Source: https://www.datasciencecentral.com
Data Mining, AI, and Machine Learning
• Data Mining: extracts information from existing data to highlight patterns, and serves as a foundation for AI and machine learning
• Artificial Intelligence: creating machines that perform functions that require intelligence when performed by people
• Machine Learning: offers the data necessary for a machine to learn and adapt; the machine must automatically learn the parameters of models from the data, using self-learning algorithms to improve its performance on a task with experience over time
AI in Sci-Fi Movies
• Terminator
• Iron Man (Marvel): J.A.R.V.I.S, "Just A Rather Very Intelligent System"
(Image source: http://starwars.com/)


AI in Life
• AI assistants
• Vacuum-cleaning robots


AI in Life
• Kiva warehouse robots


What is Artificial Intelligence?
• "The art of creating machines that perform functions that require intelligence when performed by people" (Kurzweil, 1990)
• "The study of how to make computers do things at which, at the moment, people are better" (Rich and Knight, 1991)
• AI: acting humanly


Machine Learning
• Learning from experience
• Learning from data
• Following instructions
Machine Learning
• Machine learning aims to optimize a certain task using example data or past experience
• The extraction of knowledge from data
• Machine learning is the preferred approach to:
  • Business intelligence
  • Speech recognition, natural language processing
  • Computer vision
  • Robot control
  • Computational biology
  • Crime prediction
  • Etc.
Types of Learning
• Supervised (inductive) learning
  • Training data includes the desired outputs
• Unsupervised learning
  • Training data does not include the desired outputs
• Reinforcement learning
  • Rewards from a sequence of actions
Methods
• Supervised learning
  • Decision tree induction
  • Rule induction
  • Naïve Bayes
  • Neural networks
  • Support vector machines
  • Model ensembles
  • Etc.
• Unsupervised learning
  • Clustering
  • Dimensionality reduction
• Reinforcement learning
  • Decision making (robots, chess machines)
Classification
What is Classification?
• Assigning an object to a certain class based on its similarity to previous examples of other objects
• Can be done with reference to the original data or based on a model of that data
• E.g.: Me: "It's round, green, delicious and crunchy." You: "It's an apple!"
Examples
• Classifying transactions as genuine or fraudulent, e.g. credit card usage, insurance claims, cell phone calls
• Classifying prospects as good or bad customers
• Classifying engine faults by their symptoms
• Classifying healthy and sick people based on their symptoms
• Classifying tumor and normal cell lines based on DNA mutations or gene expression
(Un)Certainty
• As with most data mining solutions, a classification usually comes with a degree of certainty
• It might be the probability of the object belonging to the class, or some other measure of how closely the object resembles other examples from that class
Techniques
• Non-parametric, e.g. k-nearest neighbour (see the sketch below)
• Mathematical models, e.g. LDA, logistic regression, neural networks
• Rule-based models, e.g. decision trees
• Support vector machines
• Etc.
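As a hedged illustration of the first family, a minimal k-nearest-neighbour sketch in Python (scikit-learn is assumed to be available; the bundled iris data is just a stand-in):

# k-NN: classify by majority vote among the k closest training examples.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)   # vote among the 5 nearest neighbours
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))            # accuracy on held-out data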
Classification vs. Prediction
• Classification:
  • Predicts categorical class labels
  • Classifies data (constructs a model) based on the training set and the values (class labels) of a classifying attribute, and uses it to classify new data
• Prediction:
  • Models continuous-valued functions, i.e., predicts unknown or missing values
• Typical applications:
  • Credit approval
  • Target marketing
  • Medical diagnosis
  • Treatment effectiveness analysis
Classification—A Two-Step Process
• Model construction: describing a set of predetermined classes
  • Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
  • The set of tuples used for model construction is the training set
  • The model is represented as classification rules, decision trees, or mathematical formulae
• Model usage: classifying future or unknown objects
  • Estimate the accuracy of the model
    • The known label of each test sample is compared with the classified result from the model
    • The accuracy rate is the percentage of test set samples that are correctly classified by the model
    • The test set must be independent of the training set, otherwise over-fitting will occur
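A minimal sketch of the two steps (scikit-learn assumed; the bundled data set and the choice of a decision tree are stand-ins, not part of the slides):

# Step 1: model construction on a training set; Step 2: accuracy on a test set.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)

# Keep the test set independent of the training set to detect over-fitting.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = DecisionTreeClassifier().fit(X_train, y_train)   # model construction
y_pred = model.predict(X_test)                           # model usage
print("Accuracy rate:", accuracy_score(y_test, y_pred))  # % correctly classified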
Classification Process (1): Model Construction

Training Data:

NAME     RANK            YEARS   TENURED
Mike     Assistant Prof  3       no
Mary     Assistant Prof  7       yes
Bill     Professor       2       yes
Jim      Associate Prof  7       yes
Dave     Assistant Prof  6       no
Anne     Associate Prof  3       no

The training data is fed into a classification algorithm to construct the model.
Classification Process (1), continued: the algorithm outputs a classifier (model), e.g.:

IF rank = 'professor'
OR years > 6
THEN tenured = 'yes'
Classification Process (2): Use the Model in Prediction

Testing Data:

NAME     RANK            YEARS   TENURED
Tom      Assistant Prof  2       no
Merlisa  Associate Prof  7       no
George   Professor       5       yes
Joseph   Assistant Prof  7       yes

Unseen data: (Jeff, Professor, 4) → Tenured?
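The learned rule can be applied directly to the unseen record; a tiny sketch (the function and argument names are ours, not from the slides):

# The rule learned in step (1), written as a function (illustrative only).
def tenured(rank: str, years: int) -> str:
    return "yes" if rank == "professor" or years > 6 else "no"

print(tenured("professor", 4))  # Jeff -> 'yes': the rule fires on rank alone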
Decision Tree Learning
• Uses a decision tree as a predictive model
• A decision tree uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility
Two Main Types of Decision Trees
• Classification tree analysis: the predicted outcome is the class to which the data belongs
• Regression tree analysis: the predicted outcome can be considered a real number (e.g. the price of a house, or a patient's length of stay in a hospital)
• The term Classification And Regression Tree (CART) analysis is an umbrella term used to refer to both of the above procedures
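A short sketch of both variants with scikit-learn (assumed available; the bundled data sets are stand-ins):

# Classification tree: predicts a class; regression tree: predicts a number.
from sklearn.datasets import load_iris, load_diabetes
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

Xc, yc = load_iris(return_X_y=True)        # categorical target (flower species)
clf = DecisionTreeClassifier(max_depth=3).fit(Xc, yc)

Xr, yr = load_diabetes(return_X_y=True)    # continuous target (disease progression)
reg = DecisionTreeRegressor(max_depth=3).fit(Xr, yr)

print(clf.predict(Xc[:1]), reg.predict(Xr[:1]))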
Decision Tree

Training Data:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model: decision tree (splitting attributes):

Refund?
  Yes → NO
  No → MarSt?
         Married → NO
         Single, Divorced → TaxInc?
                              < 80K → NO
                              > 80K → YES
Decision Tree

Another model for the same training data:

MarSt?
  Married → NO
  Single, Divorced → Refund?
                       Yes → NO
                       No → TaxInc?
                              < 80K → NO
                              > 80K → YES

There could be more than one tree that fits the same data!
Decision Tree Classification Task

Training Set:

Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Induction: a tree induction algorithm learns a model from the training set.
Apply Decision Tree Model

Test Set:

Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

Deduction: the learned model is applied to the test set to predict the missing class labels.
Apply Model to Test Data

Test record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?

Start from the root of the tree:
1. Refund? The record has Refund = No, so follow the No branch to MarSt
2. MarSt? The record has Marital Status = Married, so follow the Married branch
3. That branch ends in a leaf labeled NO, so assign Cheat = 'No'
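The same walk can be written as nested conditionals; a sketch (the function and argument names are ours):

# The example tree from the slides as nested conditionals (illustrative only).
def cheat(refund: str, marital_status: str, taxable_income: float) -> str:
    if refund == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    return "Yes" if taxable_income > 80_000 else "No"

print(cheat("No", "Married", 80_000))  # -> 'No', matching the walkthrough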
Random Forest
• Leo Breiman, "Random Forests", Machine Learning, 45, 5-32, 2001
• Motivation: reduce the error correlation between classifiers
• Main idea: build a large number of un-pruned decision trees
• Key: use a random selection of features to split on at each node
How Random Forests Work
• Each tree is grown on a bootstrap sample of the training set of N cases
• A number m is specified, much smaller than the total number of variables M (e.g. m = sqrt(M))
• At each node, m variables are selected at random out of the M
• The split used is the best split on these m variables
• Final classification is done by majority vote across trees

Source: Breiman and Cutler
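A minimal scikit-learn sketch of these ingredients (the library and data set are assumptions, not part of the slides):

# n_estimators = number of bootstrapped trees; max_features='sqrt' selects
# m = sqrt(M) random variables at each node; prediction is a majority vote.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
rf = RandomForestClassifier(n_estimators=500, max_features="sqrt",
                            bootstrap=True, random_state=0)
print(cross_val_score(rf, X, y, cv=5).mean())  # estimated accuracy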
Advantages of Random Forests
• More robust with respect to noise
• More efficient on large data
• Provides an estimate of the importance of each feature in determining the classification
• More info at: http://stat-www.berkeley.edu/users/breiman/RandomForests/cc_home.htm
Unsupervised Learning
Setia Pramana
In-House Training Data Science, OJK
Clustering vs. Class Prediction
• Class prediction:
  • A learning set of objects with known classes
  • Goal: put new objects into existing classes
  • Also called: supervised learning, or classification
• Clustering:
  • No learning set, no given classes
  • Goal: discover the "best" classes or groupings
  • Also called: unsupervised learning, or class discovery
Clustering
• Clustering: the process of grouping a set of objects into classes of similar objects
• The most common form of unsupervised learning
• Unsupervised learning = learning from raw data, as opposed to supervised learning, where a classification of examples is given
Issues in Clustering
• Used to explore and visualize data, with few preconceptions
• Many subjective choices must be made, so a clustering output tends to be subjective
• It is difficult to get truly statistically "significant" conclusions
• Algorithms will always produce clusters, whether any exist in the data or not
Clustering Techniques
• Hierarchical
  • Single link
  • Complete link
  • Average link
  • Ward's method
• Partitional
  • Square error: K-means
  • Mixture maximization: Expectation Maximization
Technique Characteristics
• Agglomerative vs. divisive
  • Agglomerative: each instance starts as its own cluster, and the algorithm merges clusters
  • Divisive: begins with all instances in one cluster and divides it up
• Hard vs. fuzzy
  • Hard clustering assigns each instance to exactly one cluster, whereas fuzzy clustering assigns degrees of membership
More Characteristics
• Monothetic vs. polythetic
  • Polythetic: all attributes are used simultaneously, e.g., to calculate distance (most algorithms)
  • Monothetic: attributes are considered one at a time
• Incremental vs. non-incremental
  • With large data sets it may be necessary to consider only part of the data at a time (data mining)
  • Incremental methods work instance by instance
Partitional Clustering
• Outputs a single partition of the data into clusters
• Good for large data sets
• Determining the number of clusters is a major challenge
K-means
Works with numeric data only:
1) Pick a number (K) of cluster centers (at random)
2) Assign every item to its nearest cluster center (e.g. using Euclidean distance)
3) Move each cluster center to the mean of its assigned items
4) Repeat steps 2-3 until convergence (change in cluster assignments less than a threshold)
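A from-scratch sketch of these four steps in Python (NumPy assumed; empty clusters are not handled, and the toy data is invented for illustration):

import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick K cluster centers at random from the data points.
    centers = X[rng.choice(len(X), size=K, replace=False)]
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        # Step 2: assign every item to its nearest center (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 4: stop once cluster assignments no longer change.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 3: move each center to the mean of its assigned items.
        centers = np.array([X[labels == k].mean(axis=0) for k in range(K)])
    return labels, centers

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=1.0, size=(50, 2))
               for c in ([0, 0], [5, 5], [0, 5])])
labels, centers = kmeans(X, K=3)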
K-Means
• Clusters are based on the centroids (aka the center of gravity, or mean) of the points in a cluster c:

  μ(c) = (1/|c|) · Σ_{x ∈ c} x

• Reassignment of instances to clusters is based on distance to the current cluster centroids
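In code, the centroid formula is one line per cluster; a NumPy sketch (the array names are ours):

# Centroid of each cluster: mu(c) = (1/|c|) * sum of its points.
import numpy as np

X = np.array([[1.0, 2.0], [2.0, 1.0], [8.0, 9.0], [9.0, 8.0]])  # toy points
labels = np.array([0, 0, 1, 1])                                 # cluster of each point

centroids = np.array([X[labels == k].mean(axis=0) for k in range(2)])
print(centroids)  # [[1.5 1.5], [8.5 8.5]]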
K-means Example (walkthrough of the animation)
• Step 1: pick 3 initial cluster centers k1, k2, k3 (randomly)
• Step 2: assign each point to the closest cluster center
• Step 3: move each cluster center to the mean of its assigned points
• Step 4: reassign the points that are now closest to a different center (Q: which points are reassigned? A: three points near the cluster boundaries)
• Step 4b: re-compute the cluster means
• Step 5: move the cluster centers to the new cluster means, and repeat until the assignments stop changing
Hierarchical Clustering
• Build a tree-based hierarchical taxonomy (dendrogram) from a set of documents, e.g.:

animal
  vertebrate: fish, reptile, amphib., mammal
  invertebrate: worm, insect, crustacean

How could you do this with k-means?
Hierarchical Clustering Algorithms
• Agglomerative (bottom-up):
  • Start with each document being a single cluster
  • Eventually all documents belong to the same cluster
• Divisive (top-down):
  • Start with all documents belonging to the same cluster
  • Eventually each node forms a cluster on its own
  • Could be a recursive application of k-means-like algorithms
• Does not require the number of clusters k in advance
• Needs a termination/readout condition
Hierarchical Agglomerative Clustering (HAC)
• Assumes a similarity function for determining the similarity of two instances
• Starts with each instance in a separate cluster, then repeatedly joins the two clusters that are most similar until there is only one cluster
• The history of merging forms a binary tree or hierarchy
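A minimal sketch with SciPy (assumed available; the toy points stand in for documents):

# Agglomerative clustering: repeatedly merge the two most similar clusters.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [10, 0]])  # toy "documents"
Z = linkage(X, method="average")  # average-link merging; Z encodes the binary tree
dendrogram(Z, labels=["A", "B", "C", "D", "E"])  # visualize the merge history
plt.show()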
Hierarchical Clustering
[Dendrogram over items A-G; the vertical axis shows the similarity level at which clusters merge]
What is a Good Clustering?
• Internal criterion: a good clustering will produce high-quality clusters in which:
  • the intra-class (that is, intra-cluster) similarity is high
  • the inter-class similarity is low
• The measured quality of a clustering depends on both the document representation and the similarity measure used
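One widely used internal criterion combining both properties is the silhouette score; a hedged sketch (scikit-learn assumed, not mentioned in the slides):

# High intra-cluster similarity and low inter-cluster similarity push the
# silhouette toward +1; values near 0 or below suggest poor clusters.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(silhouette_score(X, labels))  # close to 1 for well-separated blobs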
Thank you