An Introduction
Setia Pramana
About Me
• M.Sc. in Biostatistics, Hasselt Univ., Belgium
• Head of PPPM, Politeknik Statistika
• Research Assistant, STIS
• Postdoc @ MEB, Karolinska Institutet
Board Member
• UN Global Working Group Big Data for Official Statistics
• Asosiasi Ilmuan Data Indonesia
• Ikatan Statistisi Indonesia
• Forum Pendidikan Tinggi Statistika
• Masyarakat Biodiversiti dan Bioinformatika Indonesia
• Asosiasi Artificial Intelegent Indonesia
Data, data, and data everywhere…
Source: http://mycervello.com
How Much Data Do We Have?
• Google processes 20 PB a day (2008)
• Facebook has 60 TB of daily logs
• eBay has 6.5 PB of user data + 50 TB/day (5/2009)
• 1000 Genomes Project: 200 TB
Types of Data
• Relational Data (Tables/Transactions/Legacy Data)
• Text Data (Web)
• Semi-structured Data (XML)
• Graph Data
  • Social Networks, Semantic Web (RDF), …
• Streaming Data
  • You can afford to scan the data only once
What to Do?
• Aggregation and Statistics
• Data Warehousing and OLAP
• Indexing, Searching, and Querying
  • Keyword-based search
  • Pattern matching (XML/RDF)
• Knowledge Discovery
  • Data Mining
  • Statistical Modeling
Analytics Approaches
Data Science
What is Data Science?
• Theories and techniques from many fields and disciplines are used to investigate and analyze large amounts of data to help decision makers in many areas, such as science, engineering, economics, politics, finance, and education:
  • Computer Science: pattern recognition, visualization, data warehousing, high-performance computing, databases, AI
  • Mathematics: mathematical modeling
  • Statistics: statistical and stochastic modeling, probability
Data Science
• A mashed-up discipline
• A multi-disciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data
Data Science
• A new discipline: very few books cover the discipline as a whole
• An interdisciplinary field, like business analysis, that incorporates computer science, modeling, statistics, analytics, and mathematics
Source: Monica Rogati, https://hackernoon.com/the-ai-hierarchy-of-needs-18f111fcc007
Data Engineers vs. Data Scientists
Source: DataCamp
Current Data Scientists' Profile
Methods
• Sentiment analysis
• Time series analysis
• Data mining
• Multilevel modeling
• Missing data imputation
• Classification and clustering
• Survival analysis
• Pattern recognition
• Principal component and factor analysis
• A/B testing
• Machine learning
• Forecasting
• Propensity score matching
• Logistic, multinomial, and multiple linear regression techniques
• Network analysis
Source: https://medium.com/@StepUpAnalytics/ai-vs-machine-learning-vs-deep-learning-vs-data-science-572b34452c3
Source: https://www.datasciencecentral.com
Data Mining, AI and Machine Learning
• Data Mining: extracting information from existing data to highlight patterns; it serves as a foundation for AI and machine learning.
• Artificial Intelligence: creating machines that perform functions that would require intelligence if performed by people.
• Machine Learning: offers the means for a machine to learn and adapt. The machine must automatically learn the parameters of models from the data. It uses self-learning algorithms to improve its performance at a task with experience over time.
AI in Sci-Fi Movies
• Terminator
• Iron Man (Marvel)
Machine Learning
• Machine learning aims to optimize a certain task using example data or past experience
• The extraction of knowledge from data
• Machine learning is the preferred approach for:
  • Business intelligence
  • Speech recognition, natural language processing
  • Computer vision
  • Robot control
  • Computational biology
  • Crime prediction
  • Etc.
Types of Learning
• Supervised (inductive) learning
• Training data includes desired outputs
• Unsupervised learning
• Training data does not include desired outputs
• Reinforcement learning
• Rewards from sequence of actions
Methods
• Supervised learning
  • Decision tree induction
  • Rule induction
  • Naïve Bayes
  • Neural networks
  • Support vector machines
  • Model ensembles
  • Etc.
• Unsupervised learning
  • Clustering
  • Dimensionality reduction
• Reinforcement learning
  • Decision making (robot, chess machine)
Classification
What is Classification?
• Assigning an object to a certain class based on its similarity to previous examples of other objects
• Can be done with reference to the original data or based on a model of that data
• E.g.: Me: "It's round, green, delicious, and crunchy." You: "It's an apple!"
Examples
• Classifying transactions as genuine or fraudulent, e.g. credit card usage, insurance claims, cell phone calls
• Classifying prospects as good or bad customers
• Classifying engine faults by their symptoms
• Classifying healthy and sick people based on their symptoms
• Classifying tumor and normal cell lines based on DNA mutations, gene expression
(Un)Certainty
• As with most data mining solutions, a classification usually comes with a degree of certainty.
• It might be the probability of the object belonging to the class, or it might be some other measure of how closely the object resembles other examples from that class.
Techniques
• Non-parametric, e.g. k-nearest neighbour
• Mathematical models, e.g. LDA, logistic regression, neural networks
• Rule-based models, e.g. decision trees
• Support vector machines
• Etc.
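As a concrete illustration of the first technique above, here is a minimal k-nearest-neighbour sketch in plain Python; the toy fruit features (roundness, greenness) and labels are made up for illustration, not taken from the slides.

```python
from collections import Counter
import math

def knn_classify(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (features, label) pairs; distance is Euclidean.
    """
    nearest = sorted(train, key=lambda p: math.dist(p[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data: (roundness, greenness) -> fruit label (made up for illustration)
train = [((0.9, 0.8), "apple"), ((0.8, 0.9), "apple"),
         ((0.2, 0.1), "banana"), ((0.3, 0.2), "banana")]
print(knn_classify(train, (0.85, 0.75)))
```

A query that is round and green lands near the "apple" examples and is classified accordingly.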
Classification vs. Prediction
• Classification:
  • Predicts categorical class labels
  • Constructs a model based on the training set and the values (class labels) in a classifying attribute, and uses it to classify new data
• Prediction:
  • Models continuous-valued functions, i.e., predicts unknown or missing values
• Typical applications:
  • Credit approval
  • Target marketing
  • Medical diagnosis
  • Treatment effectiveness analysis
Classification—A Two-Step Process
• Step 1 (induction): learn a classifier from the training data.
• Step 2 (deduction): apply the classifier to testing data and unseen data, e.g. (Jeff, Professor, 4).

Example training data (Tid, Refund, Marital Status, Taxable Income, Cheat):

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

One tree that fits these data (splitting attributes: Refund, MarSt, TaxInc):
• Refund = Yes → NO
• Refund = No → MarSt:
  • Married → NO
  • Single, Divorced → TaxInc:
    • < 80K → NO
    • > 80K → YES

There could be more than one tree that fits the same data!
Decision Tree Classification Task
• The training set (Tid, Attrib1, Attrib2, Attrib3, Class) is fed to a tree induction algorithm, which learns a decision tree model:

Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
6    No       Medium   60K      No

• The learned model is then applied to the test set, whose class labels are unknown:

Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
15   No       Large    67K      ?
Apply Model to Test Data
• Test record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
• Start from the root of the tree and follow the matching branches:
  • Refund? → No → go to MarSt
  • MarSt? → Married → leaf NO
• Assign Cheat = "No".
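The tree walked through above is just a set of nested rules; a sketch of it as plain Python (income in thousands; the `< 80K` / `> 80K` boundary is assumed to split at 80):

```python
def classify(refund, marital_status, taxable_income):
    """Apply the slides' decision tree: Refund -> MarSt -> TaxInc."""
    if refund == "Yes":
        return "No"                 # Refund = Yes -> leaf NO
    if marital_status == "Married":
        return "No"                 # MarSt = Married -> leaf NO
    # Single or Divorced: split on taxable income at 80K
    return "No" if taxable_income < 80 else "Yes"

# The test record from the slide: Refund = No, Married, 80K
print(classify("No", "Married", 80))  # Cheat = "No"
```

The same function reproduces the training labels, e.g. record 5 (No, Divorced, 95K) comes out "Yes".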
Random Forest
• Leo Breiman, "Random Forests", Machine Learning, 45, 5–32, 2001
• Motivation: reduce the error correlation between classifiers
• Main idea: build a large number of un-pruned decision trees
• Key: use a random selection of features to split on at each node
How Random Forest Works
• Each tree is grown on a bootstrap sample of the training set of N cases.
• A number m is specified, much smaller than the total number of variables M (e.g. m = sqrt(M)).
• At each node, m variables are selected at random out of the M.
• The split used is the best split on these m variables.
• Final classification is done by majority vote across trees.
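The steps above can be sketched in plain Python. This is a deliberately minimal illustration, not Breiman's full algorithm: each "tree" is only a depth-1 stump, but the three key ingredients are all present (bootstrap sampling, a random subset of m = sqrt(M) features per tree, and majority vote). The toy data are made up.

```python
import math, random
from collections import Counter

def stump(data, features):
    """Best single-feature threshold split (a depth-1 tree) over `features`."""
    best = None  # (error, feature, threshold, left_label, right_label)
    for f in features:
        for x, _ in data:
            t = x[f]
            left  = [y for xx, y in data if xx[f] <= t]
            right = [y for xx, y in data if xx[f] > t]
            ll = Counter(left).most_common(1)[0][0]
            rl = Counter(right).most_common(1)[0][0] if right else ll
            err = sum(y != ll for y in left) + sum(y != rl for y in right)
            if best is None or err < best[0]:
                best = (err, f, t, ll, rl)
    return best[1:]

def random_forest(data, n_trees=25, seed=0):
    """Grow stumps on bootstrap samples, each on m = sqrt(M) random features."""
    rng = random.Random(seed)
    M = len(data[0][0])
    m = max(1, round(math.sqrt(M)))
    trees = []
    for _ in range(n_trees):
        boot = [rng.choice(data) for _ in data]   # bootstrap sample of N cases
        feats = rng.sample(range(M), m)           # random feature subset
        trees.append(stump(boot, feats))
    return trees

def predict(trees, x):
    """Final classification: majority vote across trees."""
    votes = Counter((ll if x[f] <= t else rl) for f, t, ll, rl in trees)
    return votes.most_common(1)[0][0]

# Toy, well-separated two-class data (made up for illustration)
data = [((0, 0), "a"), ((1, 0), "a"), ((0, 1), "a"),
        ((5, 5), "b"), ((6, 5), "b"), ((5, 6), "b")]
trees = random_forest(data)
print(predict(trees, (0.5, 0.5)), predict(trees, (5.5, 5.5)))
```

Even though any single stump may be grown on a skewed bootstrap sample, the vote across trees is what makes the ensemble robust.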
In House Training Data Science OJK
Clustering vs. Class Prediction
• Class prediction:
  • A learning set of objects with known classes
  • Goal: put new objects into existing classes
  • Also called: supervised learning, or classification
• Clustering:
  • No learning set, no given classes
  • Goal: discover the "best" classes or groupings
  • Also called: unsupervised learning, or class discovery
Clustering
• Clustering: the process of grouping a set of objects into classes of similar objects
• The most common form of unsupervised learning
• Unsupervised learning = learning from raw data, as opposed to supervised learning, where a classification of examples is given
Issues in Clustering
• Used to explore and visualize data, with few preconceptions
• Many subjective choices must be made, so a clustering output tends to be subjective
• It is difficult to get truly statistically "significant" conclusions
• Algorithms will always produce clusters, whether any exist in the data or not
Clustering Techniques
• Hierarchical
• Partitional
Technique Characteristics
• Agglomerative vs. divisive
  • Agglomerative: each instance starts as its own cluster and the algorithm merges clusters
  • Divisive: begins with all instances in one cluster and divides it up
• Hard vs. fuzzy
  • Hard clustering assigns each instance to exactly one cluster, whereas fuzzy clustering assigns a degree of membership
More Characteristics
• Monothetic vs. polythetic
  • Polythetic: all attributes are used simultaneously, e.g., to calculate distance (most algorithms)
  • Monothetic: attributes are considered one at a time
• Incremental vs. non-incremental
  • With large data sets it may be necessary to consider only part of the data at a time (data mining)
  • Incremental works instance by instance
Partitional Clustering
• Output a single partition of the data into clusters
• Good for large data sets
• Determining the number of clusters is a major challenge
K-means
Works with numeric data only:
1) Pick a number (K) of cluster centers (at random)
2) Assign every item to its nearest cluster center (e.g. using Euclidean distance)
3) Move each cluster center to the mean of its assigned items
4) Repeat steps 2–3 until convergence (change in cluster assignments less than a threshold)
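The four steps above can be sketched in a few lines of plain Python; the toy points and initial centers are made up for illustration (K = 2 here rather than the slides' K = 3).

```python
import math

def kmeans(points, centers, max_iter=100):
    """Plain k-means: assign points to nearest center, move centers to means, repeat."""
    clusters = [[] for _ in centers]
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest center (Euclidean distance)
        clusters = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[i].append(p)
        # Step 3: move each center to the mean of its assigned points
        new_centers = [
            tuple(sum(c) / len(c) for c in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:   # Step 4: stop when nothing changes
            break
        centers = new_centers
    return centers, clusters

pts = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centers, clusters = kmeans(pts, centers=[(0, 0), (10, 10)])
print(centers)
```

On these well-separated points the algorithm converges in two passes, with each center landing on the mean of its three points.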
K-Means
• Clusters are based on the centroids (aka the center of gravity, or mean) of the points in a cluster c:

  μ(c) = (1/|c|) · Σ_{x ∈ c} x
K-means Example (figures)
• Step 1: pick 3 initial cluster centers k1, k2, k3 (randomly).
• Step 2: assign each point to the closest cluster center.
• Step 3: move each cluster center to the mean of its cluster.
• Step 4: reassign the points that are now closest to a different cluster center (here, three points change cluster).
• Step 4b: re-compute the cluster means.
• Step 5: move the cluster centers to the new cluster means; repeat until assignments no longer change.
Hierarchical Clustering
• Build a tree-based hierarchical taxonomy (dendrogram) from a set of documents, e.g.:
  animal → vertebrate, invertebrate
Hierarchical Clustering Algorithms
• Agglomerative (bottom-up):
  • Start with each document as a single cluster.
  • Eventually all documents belong to the same cluster.
• Divisive (top-down):
  • Start with all documents in the same cluster.
  • Eventually each node forms a cluster on its own.
  • Could be a recursive application of k-means-like algorithms.
• Does not require the number of clusters k in advance
• Needs a termination/readout condition
Hierarchical Agglomerative Clustering (HAC)
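The bottom-up procedure can be sketched in plain Python: start with singleton clusters and repeatedly merge the closest pair. Single linkage (minimum inter-point distance) is assumed here, and the termination condition is simply a target number of clusters k; the toy points are made up.

```python
import math

def hac(points, k):
    """Agglomerative clustering: merge the closest pair of clusters
    (single linkage) until only k clusters remain."""
    clusters = [[p] for p in points]       # start: each point is its own cluster
    while len(clusters) > k:
        # find the pair of clusters with the smallest minimum inter-point distance
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: min(math.dist(a, b)
                               for a in clusters[ij[0]] for b in clusters[ij[1]]),
        )
        clusters[i].extend(clusters.pop(j))  # merge cluster j into cluster i
    return clusters

pts = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
print(hac(pts, k=3))
```

Recording the distance at which each merge happens would give exactly the dendrogram heights shown on the next slide.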
Hierarchical Clustering
• Dendrogram (figure): leaves A–G are merged pairwise from the bottom up (e.g. D with E, then C with DE; F with G; …); the vertical axis shows the similarity level at which clusters merge.
What is a Good Clustering?
• Internal criterion: A good clustering will produce high quality
clusters in which:
• the intra-class (that is, intra-cluster) similarity is high
• the inter-class similarity is low
• The measured quality of a clustering depends on both the
document representation and the similarity measure used
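The internal criterion above can be made concrete by comparing average within-cluster and between-cluster distances (using distance as the inverse of similarity): a good clustering has small intra-cluster and large inter-cluster distances. A minimal sketch, with made-up points:

```python
import math
from itertools import combinations

def avg_intra_inter(clusters):
    """Average within-cluster and between-cluster point distances."""
    intra = [math.dist(a, b) for cl in clusters for a, b in combinations(cl, 2)]
    inter = [math.dist(a, b)
             for c1, c2 in combinations(clusters, 2) for a in c1 for b in c2]
    return sum(intra) / len(intra), sum(inter) / len(inter)

good = [[(0, 0), (0, 1)], [(9, 9), (9, 10)]]   # tight, well-separated groups
bad  = [[(0, 0), (9, 9)], [(0, 1), (9, 10)]]   # same points, mixed-up grouping
print(avg_intra_inter(good))
print(avg_intra_inter(bad))
```

The "good" grouping scores a small intra-cluster distance and a large inter-cluster distance; the "bad" one does not, even though both use the same points.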
Thank you