Data Mining
Clustering Analysis Basics
Outline
• Introduction
• Distance Measures
• Summary
Introduction
• Cluster: A collection/group of data objects/points
• similar (or related) to one another within the same group
• dissimilar (or unrelated) to the objects in other groups
• Cluster analysis
• finding similarities between data according to the characteristics underlying the data, and grouping similar data objects into clusters
• Clustering Analysis: Unsupervised learning
• no predefined classes for a training data set
• Two general tasks: identify the “natural” number of clusters and properly group objects into “sensible” clusters
• Typical applications
• as a stand-alone tool to gain insight into data distribution
• as a preprocessing step of other algorithms in intelligent systems
Introduction
• Illustrative Example 1: how many clusters?
Introduction
• Illustrative Example 2: are they in the same cluster?
Introduction
• Real Applications: Google News
Introduction
• Real Applications: Genetics Analysis
Introduction
• Real Applications: Emerging Applications
Introduction
• A technique demanded by many real-world tasks
• Bank/Internet Security: fraud/spam pattern discovery
• Biology: taxonomy of living things such as kingdom, phylum, class, order, family,
genus and species
• City-planning: Identifying groups of houses according to their house type, value, and
geographical location
• Climate change: understanding the earth’s climate by finding patterns in atmospheric and ocean data
• Finance: stock clustering analysis to uncover correlation underlying shares
• Image Compression/segmentation: coherent pixels grouped
• Information retrieval/organisation: Google search, topic-based news
• Land use: Identification of areas of similar land use in an earth observation database
• Marketing: Help marketers discover distinct groups in their customer bases, and then
use this knowledge to develop targeted marketing programs
• Social network mining: special interest group automatic discovery
Clustering as a Preprocessing Tool (Utility)
• Summarization:
• Preprocessing for regression, PCA, classification, and association analysis
• Compression:
• Image processing: vector quantization
• Finding K-nearest Neighbors
• Localizing search to one or a small number of clusters
• Outlier detection
• Outliers are often viewed as those “far away” from any cluster
Quality: What Is Good Clustering?
• A good clustering method produces high-quality clusters: high intra-cluster similarity (cohesive within clusters) and low inter-cluster similarity (distinctive between clusters)
Measure the Quality of Clustering
• Dissimilarity/Similarity metric
• Similarity is expressed in terms of a distance function, typically metric: d(i, j)
• The definitions of distance functions are usually rather different for interval-scaled, boolean, categorical, ordinal, ratio, and vector variables
• Weights should be associated with different variables based on applications and
data semantics
• Quality of clustering:
• There is usually a separate “quality” function that measures the “goodness” of a
cluster.
• It is hard to define “similar enough” or “good enough”
• The answer is typically highly subjective
Considerations for Cluster Analysis
• Partitioning criteria
• Single level vs. hierarchical partitioning (often, multi-level hierarchical partitioning is
desirable)
• Separation of clusters
• Exclusive (e.g., one customer belongs to only one region) vs. non-exclusive (e.g., one
document may belong to more than one class)
• Similarity measure
• Distance-based (e.g., Euclidean, road network, vector) vs. connectivity-based (e.g., density or contiguity)
• Clustering space
• Full space (often when low dimensional) vs. subspaces (often in high-dimensional clustering)
Requirements and Challenges
• Scalability
• Clustering all the data instead of only samples
• Ability to deal with different types of attributes
• Numerical, binary, categorical, ordinal, linked, and mixture of these
• Constraint-based clustering
• User may give inputs on constraints
• Use domain knowledge to determine input parameters
• Interpretability and usability
• Others
• Discovery of clusters with arbitrary shape
• Ability to deal with noisy data
• Incremental clustering and insensitivity to input order
• High dimensionality
Major Clustering Approaches (I)
• Partitioning approach:
• Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors
• Typical methods: k-means, k-medoids, CLARANS
• Hierarchical approach:
• Create a hierarchical decomposition of the set of data (or objects) using some criterion
• Typical methods: DIANA, AGNES, BIRCH, CHAMELEON
• Density-based approach:
• Based on connectivity and density functions
• Typical methods: DBSCAN, OPTICS, DenClue
• Grid-based approach:
• based on a multiple-level granularity structure
• Typical methods: STING, WaveCluster, CLIQUE
Quiz
Data Types and Representations
• Discrete vs. Continuous
• Discrete Feature
• Has only a finite set of values
e.g., zip codes, rank, or the set of words in a collection of documents
• Sometimes represented as an integer variable
• Continuous Feature
• Has real numbers as feature values
e.g., temperature, height, or weight
• Practically, real values can only be measured and represented using a
finite number of digits
• Continuous features are typically represented as floating-point variables
Data Types and Representations
• Data representations
• Data matrix (object-by-feature structure)
n data points (objects) with p dimensions (features); two modes: rows and columns represent different entities

$$\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}$$

• Distance/dissimilarity matrix (object-by-object structure)
n data points, but registers only the distance between each pair; a symmetric/triangular matrix; single mode: rows and columns index the same entities

$$\begin{bmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & \ddots & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}$$
Data Types and Representations
• Examples
(Scatter plot of the four points omitted.)

Data Matrix:

point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

Distance Matrix (Euclidean):

      p1     p2     p3     p4
p1    0      2.828  3.162  5.099
p2    2.828  0      1.414  3.162
p3    3.162  1.414  0      2
p4    5.099  3.162  2      0
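As a quick check, the distance matrix above can be reproduced from the data matrix. A minimal sketch using NumPy and SciPy (assumed available):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Data matrix: the four 2-D points from the example
X = np.array([[0, 2],   # p1
              [2, 0],   # p2
              [3, 1],   # p3
              [5, 1]])  # p4

# Pairwise Euclidean distances in condensed form, expanded to a full matrix
D = squareform(pdist(X, metric='euclidean'))
print(np.round(D, 3))
# [[0.    2.828 3.162 5.099]
#  [2.828 0.    1.414 3.162]
#  [3.162 1.414 0.    2.   ]
#  [5.099 3.162 2.    0.   ]]
```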
Distance Measures
• Minkowski Distance (http://en.wikipedia.org/wiki/Minkowski_distance)
For $x = (x_1, x_2, \ldots, x_n)$ and $y = (y_1, y_2, \ldots, y_n)$:

$$d(x, y) = \sqrt[p]{|x_1 - y_1|^p + |x_2 - y_2|^p + \cdots + |x_n - y_n|^p}, \quad p > 0$$

• p = 2: Euclidean distance

$$d(x, y) = \sqrt{|x_1 - y_1|^2 + |x_2 - y_2|^2 + \cdots + |x_n - y_n|^2}$$

Data Matrix:

point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

Distance Matrix for Euclidean Distance:

      p1     p2     p3     p4
p1    0      2.828  3.162  5.099
p2    2.828  0      1.414  3.162
p3    3.162  1.414  0      2
p4    5.099  3.162  2      0
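A minimal sketch of the Minkowski distance as a plain Python function (the function name is ours); p = 1 gives the Manhattan distance and p = 2 the Euclidean distance:

```python
def minkowski(x, y, p):
    """Minkowski distance between two equal-length vectors."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

# p1 = (0, 2) and p2 = (2, 0) from the data matrix above
print(minkowski((0, 2), (2, 0), p=2))  # 2.828... (Euclidean)
print(minkowski((0, 2), (2, 0), p=1))  # 4.0     (Manhattan)
```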
Distance Measures
• Cosine Measure (Similarity vs. Distance)
For $x = (x_1, x_2, \ldots, x_n)$ and $y = (y_1, y_2, \ldots, y_n)$:

$$\cos(x, y) = \frac{x \cdot y}{\|x\|\,\|y\|} = \frac{x_1 y_1 + \cdots + x_n y_n}{\sqrt{x_1^2 + \cdots + x_n^2}\,\sqrt{y_1^2 + \cdots + y_n^2}}$$

$$d(x, y) = 1 - \cos(x, y) = 1 - \frac{x \cdot y}{\|x\|\,\|y\|}$$

• Property: $0 \le d(x, y) \le 2$
• Nonmetric vector objects: keywords in documents, gene features in micro-arrays, …
• Applications: information retrieval, biologic taxonomy, ...
Distance Measures
• Example: Cosine measure
$$x_1 = (3, 2, 0, 5, 2, 0, 0), \quad x_2 = (1, 0, 0, 0, 1, 0, 2)$$

$$x_1 \cdot x_2 = 3 \times 1 + 2 \times 0 + 0 \times 0 + 5 \times 0 + 2 \times 1 + 0 \times 0 + 0 \times 2 = 5$$

$$\|x_1\| = \sqrt{3^2 + 2^2 + 0^2 + 5^2 + 2^2 + 0^2 + 0^2} = \sqrt{42} \approx 6.48$$

$$\|x_2\| = \sqrt{1^2 + 0^2 + 0^2 + 0^2 + 1^2 + 0^2 + 2^2} = \sqrt{6} \approx 2.45$$

$$\cos(x_1, x_2) = \frac{5}{6.48 \times 2.45} \approx 0.32$$

$$d(x_1, x_2) = 1 - \cos(x_1, x_2) = 1 - 0.32 = 0.68$$
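The worked example can be checked with a few lines of Python (the function name is ours); note that the slide’s 0.68 comes from rounding the cosine to 0.32 before subtracting:

```python
import math

def cosine_distance(x, y):
    """d(x, y) = 1 - cos(x, y) for two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1 - dot / (norm_x * norm_y)

x1 = (3, 2, 0, 5, 2, 0, 0)
x2 = (1, 0, 0, 0, 1, 0, 2)
print(cosine_distance(x1, x2))  # 0.685..., i.e. the slide's 0.68 after rounding
```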
Distance Measures
• Distance for Binary Features
• For binary features, their values can be converted into 1 or 0.
• Contingency table for binary feature vectors x and y:

         y = 1   y = 0
x = 1    a       b
x = 0    c       d

(a = number of features equal to 1 in both x and y; b = 1 in x but 0 in y; c = 0 in x but 1 in y; d = 0 in both)
Distance Measures
• Distance for Binary Features
• Distance for symmetric binary features
Both of their states are equally valuable and carry the same weight; i.e., there is no preference on which outcome should be coded as 1 or 0 (e.g., gender):

$$d(x, y) = \frac{b + c}{a + b + c + d}$$

• Distance for asymmetric binary features
The two states are not equally important (the rarer, typically positive, outcome is coded as 1); the number of negative matches d is treated as unimportant and ignored:

$$d(x, y) = \frac{b + c}{a + b + c}$$
Distance Measures
• Example: Distance for binary features
“Y”: yes; “P”: positive; “N”: negative (Y and P are coded as 1, N as 0)

Name  Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
Jack  M       Y (1)  N (0)  P (1)   N (0)   N (0)   N (0)
Mary  F       Y (1)  N (0)  P (1)   N (0)   P (1)   N (0)
Jim   M       Y (1)  P (1)  N (0)   N (0)   N (0)   N (0)
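A sketch of both binary distances computed from the contingency counts (the function name is ours). Following the usual treatment of this example, gender is left out and the remaining features are coded Y/P = 1, N = 0; the asymmetric formula then gives d(Jack, Mary) = 1/3 and d(Jack, Jim) = 2/3:

```python
def binary_distance(x, y, symmetric=True):
    """Distance between two 0/1 vectors via contingency counts:
    a: both 1; b: x=1, y=0; c: x=0, y=1; d: both 0."""
    a = sum(u == 1 and v == 1 for u, v in zip(x, y))
    b = sum(u == 1 and v == 0 for u, v in zip(x, y))
    c = sum(u == 0 and v == 1 for u, v in zip(x, y))
    d = sum(u == 0 and v == 0 for u, v in zip(x, y))
    return (b + c) / (a + b + c + d) if symmetric else (b + c) / (a + b + c)

# Fever, Cough, Test-1..Test-4 for Jack, Mary, Jim (gender omitted)
jack = (1, 0, 1, 0, 0, 0)
mary = (1, 0, 1, 0, 1, 0)
jim  = (1, 1, 0, 0, 0, 0)
print(binary_distance(jack, mary, symmetric=False))  # 0.333...
print(binary_distance(jack, jim,  symmetric=False))  # 0.666...
```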
Distance Measures
• Distance for nominal features (cont.)
• Example: Play tennis
       Outlook           Temperature   Humidity      Wind
D1     Overcast (010)    High (100)    High (10)     Strong (10)
D2     Sunny (100)       High (100)    Normal (01)   Strong (10)

• Simple mis-matching (see the sketch after this list): two of the four features differ, so

$$d(D_1, D_2) = \frac{2}{4} = 0.5$$
• Creating new binary features
• Using the same number of bits as those features can take
Outlook = {Sunny, Overcast, Rain} (100, 010, 001)
Temperature = {High, Mild, Cool} (100, 010, 001)
Humidity = {High, Normal} (10, 01)
Wind = {Strong, Weak} (10, 01)
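A minimal sketch of simple mis-matching over the play-tennis example (the function name is ours):

```python
def simple_matching_distance(x, y):
    """Fraction of nominal features on which x and y disagree."""
    return sum(a != b for a, b in zip(x, y)) / len(x)

d1 = ('Overcast', 'High', 'High',   'Strong')
d2 = ('Sunny',    'High', 'Normal', 'Strong')
print(simple_matching_distance(d1, d2))  # 2/4 = 0.5
```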
Major Clustering Approaches
• Hierarchical approach
• Create a hierarchical decomposition of the set of data (or objects) using some
criterion
• Typical methods: Agglomerative, DIANA, AGNES, BIRCH, ROCK, ……
Major Clustering Approaches
• Partitioning approach
• Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared distances
• Typical methods: k-means, k-medoids, CLARANS, ……
Major Clustering Approaches
• Density-based approach
• Based on connectivity and density functions
• Typical methods: DBSCAN, OPTICS, DenClue, ……
Major Clustering Approaches
• Model-based approach
• A generative model is hypothesized for each of the clusters, and the algorithm tries to find the model that best fits each cluster
• Typical methods: Gaussian Mixture Model (GMM), COBWEB, ……
Major Clustering Approaches
• Spectral clustering approach
• Convert the data set into a weighted graph (vertices, edges), then cut the graph into sub-graphs corresponding to clusters via spectral analysis
• Typical methods: Normalised-Cuts ……
Major Clustering Approaches
• Clustering ensemble approach
• Combine multiple clustering results (different partitions)
• Typical methods: Evidence-accumulation based, graph-based combination, ……
Summary
• Clustering analysis groups objects based on their
(dis)similarity and has a broad range of applications.
• Measure of distance (or similarity) plays a critical role in
clustering analysis and distance-based learning.
• Clustering algorithms can be categorized into partitioning,
hierarchical, density-based, model-based, spectral
clustering as well as ensemble approaches.
• There are still lots of research issues on cluster analysis;
• finding the number of “natural” clusters with arbitrary shapes
• dealing with mixed types of features
• handling massive amount of data – Big Data
• coping with data of high dimensionality
• performance evaluation (especially when no ground-truth available)
Hierarchical Clustering
Outline
• Introduction
• Agglomerative Algorithm
• Relevant Issues
• Summary
(Figure: dendrogram over objects a–e contrasting agglomerative and divisive clustering. Agglomerative, from Step 0 to Step 4: a and b merge into ab; d and e merge into de; c joins de to form cde; finally ab and cde merge into abcde. Divisive runs the same hierarchy in reverse, from Step 4 to Step 0. Two design choices are annotated: the cluster distance and the termination condition.)
1. Calculate the distance matrix.
2. Calculate the three cluster distances between C1 = {a, b} and C2 = {c, d, e}.

Distance matrix:

   a  b  c  d  e
a  0  1  3  4  5
b  1  0  2  3  4
c  3  2  0  1  2
d  4  3  1  0  1
e  5  4  2  1  0

Single link:
$$dist(C_1, C_2) = \min\{d(a,c), d(a,d), d(a,e), d(b,c), d(b,d), d(b,e)\} = \min\{3, 4, 5, 2, 3, 4\} = 2$$

Complete link:
$$dist(C_1, C_2) = \max\{d(a,c), d(a,d), d(a,e), d(b,c), d(b,d), d(b,e)\} = \max\{3, 4, 5, 2, 3, 4\} = 5$$

Average link:
$$dist(C_1, C_2) = \frac{d(a,c) + d(a,d) + d(a,e) + d(b,c) + d(b,d) + d(b,e)}{6} = \frac{3 + 4 + 5 + 2 + 3 + 4}{6} = \frac{21}{6} = 3.5$$
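The three cluster distances above can be verified with a short sketch (the helper names are ours):

```python
# Pairwise distances from the example's distance matrix
dist = {('a','b'): 1, ('a','c'): 3, ('a','d'): 4, ('a','e'): 5,
        ('b','c'): 2, ('b','d'): 3, ('b','e'): 4,
        ('c','d'): 1, ('c','e'): 2, ('d','e'): 1}

def d(i, j):
    return dist.get((i, j), dist.get((j, i)))

C1, C2 = ['a', 'b'], ['c', 'd', 'e']
pairs = [d(i, j) for i in C1 for j in C2]
print(min(pairs))               # 2   -> single link
print(max(pairs))               # 5   -> complete link
print(sum(pairs) / len(pairs))  # 3.5 -> average link
```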
Agglomerative Algorithm
• The Agglomerative algorithm is carried out in three steps:
1. Convert the object features into a distance matrix (e.g., compute the Euclidean distance between every pair of objects in the data matrix).
2. Set each object as its own cluster, so N objects yield N initial clusters.
3. Repeat merging the two closest clusters and updating the distance matrix until only a single cluster remains.
Example
• Merge two closest clusters (iteration 1)
Apply the agglomerative algorithm with single-link, complete-link, and average-link cluster distance measures to produce three dendrogram trees, respectively.
a b c d e
a 0 1 3 4 5
b 1 0 2 3 4
c 3 2 0 1 2
d 4 3 1 0 1
e 5 4 2 1 0
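The three dendrograms the exercise asks for can be produced with SciPy; a sketch, assuming SciPy and Matplotlib are installed:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

# Distance matrix from the example
D = np.array([[0, 1, 3, 4, 5],
              [1, 0, 2, 3, 4],
              [3, 2, 0, 1, 2],
              [4, 3, 1, 0, 1],
              [5, 4, 2, 1, 0]], dtype=float)

condensed = squareform(D)  # linkage() expects the condensed form
for method in ('single', 'complete', 'average'):
    Z = linkage(condensed, method=method)
    plt.figure()
    dendrogram(Z, labels=['a', 'b', 'c', 'd', 'e'])
    plt.title(f'{method}-link dendrogram')
plt.show()
```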
K-means Clustering
Outline
• Introduction
• K-means Algorithm
• Example
• How K-means partitions?
• K-means Demo
• Relevant Issues
• Application: Cell Nuclei Detection
• Summary
Introduction
• Partitioning Clustering Approach
• a typical clustering analysis approach that iteratively partitions a training data set to learn a partition of the given data space
• learning a partition on a data set to produce several non-empty
clusters (usually, the number of clusters given in advance)
• in principle, the optimal partition is achieved via minimising the sum of squared distances from each data point to the “representative object” of its cluster:
$$E = \sum_{k=1}^{K} \sum_{x \in C_k} d^2(x, m_k)$$

e.g., squared Euclidean distance $d^2(x, m_k) = \sum_{n=1}^{N} (x_n - m_{kn})^2$
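A minimal NumPy sketch of this iterative partitioning (the function name and details are ours, not the slide’s; empty clusters are not handled):

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Minimal K-means: assign each point to its nearest centroid,
    then move each centroid to the mean of the points it owns."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    for _ in range(n_iters):
        # Squared Euclidean distance of every point to every centroid
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        if np.allclose(new_centroids, centroids):  # converged
            break
        centroids = new_centroids
    return labels, centroids
```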
Introduction
(Scatter plot of the four medicines in the weight/pH plane omitted.)

Medicine  Weight  pH-Index
A         1       1
B         2       1
C         4       3
D         5       4
Example
• Step 1: Use initial seed points for partitioning
$$c_1 = A = (1, 1), \quad c_2 = B = (2, 1)$$

Euclidean distance:

$$d(D, c_1) = \sqrt{(5-1)^2 + (4-1)^2} = 5$$

$$d(D, c_2) = \sqrt{(5-2)^2 + (4-1)^2} = \sqrt{18} \approx 4.24$$
Example
• Step 2: Compute new centroids of the current partition
$$c_1 = (1, 1)$$

$$c_2 = \left(\frac{2+4+5}{3}, \frac{1+3+4}{3}\right) = \left(\frac{11}{3}, \frac{8}{3}\right)$$
Example
• Step 2: Renew membership based on new centroids
Example
• Step 3: Repeat the first two steps until its convergence
$$c_1 = \left(\frac{1+2}{2}, \frac{1+1}{2}\right) = (1.5, 1)$$

$$c_2 = \left(\frac{4+5}{2}, \frac{3+4}{2}\right) = (4.5, 3.5)$$
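A short NumPy sketch replaying this worked example with seeds c1 = A and c2 = B; it converges to the memberships and centroids shown above:

```python
import numpy as np

X = np.array([[1, 1],   # A
              [2, 1],   # B
              [4, 3],   # C
              [5, 4]])  # D

centroids = X[[0, 1]].astype(float)  # initial seeds: c1 = A, c2 = B
while True:
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    new_c = np.array([X[labels == k].mean(axis=0) for k in range(2)])
    if np.allclose(new_c, centroids):  # converged
        break
    centroids = new_c

print(labels)     # [0 0 1 1] -> clusters {A, B} and {C, D}
print(centroids)  # [[1.5 1. ] [4.5 3.5]]
```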
Exercise
For the medicine data set, use K-means with the Manhattan distance
metric for clustering analysis by setting K=2 and initialising seeds as
C1 = A and C2 = C. Answer three questions as follows:
1. How many steps are required for convergence?
2. What are memberships of two clusters after convergence?
3. What are centroids of two clusters after convergence?
Medicine  Weight  pH-Index
A         1       1
B         2       1
C         4       3
D         5       4

(Scatter plot omitted.)
How K-means partitions?
A K-means partition amounts to a Voronoi diagram: each data point belongs to the cell of its nearest cluster centre.
K-means Demo
1. The user sets the number of clusters they’d like (e.g., K = 5).
2. Randomly guess K cluster centre locations.
3. Each data point finds out which centre it is closest to (thus each centre “owns” a set of data points).
4. Each centre finds the centroid of the points it owns...
5. ...and jumps there.
6. ...Repeat until terminated!
K-means Demo Weka
Relevant Issues
• Efficient in computation
• O(tKn), where n is number of objects, K is number of clusters,
and t is number of iterations. Normally, K, t << n.
• Local optimum
• sensitive to initial seed points
• converge to a local optimum: maybe an unwanted solution
• Other problems
• Need to specify K, the number of clusters, in advance
• Unable to handle noisy data and outliers (K-Medoids algorithm)
• Not suitable for discovering clusters with non-convex shapes
• Applicable only when a mean is defined; what about categorical data? (K-Modes algorithm)
• How to evaluate K-means performance?
Application
• Colour-Based Image Segmentation Using K-Means
Step 1: Read Image
Step 2: Convert Image from RGB Colour Space to L*a*b* Colour
Space
Step 3: Classify the Colours in 'a*b*' Space Using K-means
Clustering
Step 4: Label Every Pixel in the Image Using the Results from
K-means Clustering (KMEANS)
Step 5: Create Images that Segment the H&E Image by Colour
Step 6: Segment the Nuclei into a Separate Image
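A hedged Python sketch of this pipeline using scikit-image and scikit-learn (the file name 'hestain.png' and the choice of three colour clusters are illustrative assumptions, not part of the slide):

```python
import numpy as np
from skimage import io, color
from sklearn.cluster import KMeans

img = io.imread('hestain.png')[:, :, :3]   # Step 1: read an H&E image (hypothetical file)
lab = color.rgb2lab(img)                   # Step 2: RGB -> L*a*b* colour space
ab = lab[:, :, 1:3].reshape(-1, 2)         # Step 3: cluster on the a*, b* channels

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(ab)
labels = kmeans.labels_.reshape(img.shape[:2])  # Step 4: label every pixel

# Step 5: one segmented image (binary mask) per colour cluster
masks = [labels == k for k in range(kmeans.n_clusters)]
```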
Summary
• The K-means algorithm is a simple yet popular method for clustering analysis
• Its performance is determined by the initialisation and the choice of an appropriate distance measure
• There are several variants of K-means to overcome its
weaknesses
• K-Medoids: resistance to noise and/or outliers
• K-Modes: extension to categorical data clustering analysis
• CLARA: extension to deal with large data sets
• Mixture models (EM algorithm): handling uncertainty of clusters