
Clustering Algorithm

A fundamental operation in data mining


Targeted at large databases
For numeric data
Clustering
Similar items fall into the same cluster,
while dissimilar ones fall into separate clusters,
according to some defined criteria
Unsupervised classification
Statistical vs. Conceptual
Numeric vs. Categorical
Definition of Clustering
( Statistical )
Given the desired number of clusters k,
a dataset of n points, and a distance-based
measurement function, we are asked to
find a partition of the dataset that
minimizes the value of the measurement
function.
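As a concrete instance of such a measurement function, the total within-cluster sum of squared distances is a common choice; a minimal sketch (the function name and the centroid formulation are illustrative, not prescribed by the definition above):

```python
import numpy as np

def within_cluster_ss(points, labels, centroids):
    """One common measurement function: total squared Euclidean
    distance of every point to the centroid of its cluster."""
    residuals = points - centroids[labels]   # (n, d) point-to-centroid offsets
    return float(np.sum(residuals ** 2))
```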
Data
Metric vs. Nonmetric
Numeric vs. Categorical
A data matrix X serves as the input to the
clustering algorithm
The resemblance coefficient measures the
overall resemblance ( the degree of
similarity ) between each pair of objects
Data ( continued )
Similarity vs. Dissimilarity
Scales of measurement for attributes
nominal scale and ordinal scale : Qualitative Attributes
interval scale and ratio scale : Quantitative Attributes
Clustering Algorithm
Partitioning / Optimization techniques
Hierarchical techniques
Divisive
Agglomerative
Density Search techniques
Others
Input to the algorithms ?
Global vs. semi-global ?
FUZZY
Partitioning Cluster Algorithms
A partitioning method classifies the data
into k groups, which together satisfy the
requirements of a partition :
each group must contain at least one object
each object must belong to exactly one group
This implies k <= n
k is given by the user ( a parameter that is
hard to determine at the outset )
Partitioning Cluster Algorithms
Key points :
Techniques used for initiating clusters
Clustering criteria
Example :
k-means ( MacQueen, 1967 )
convergent clustering method using the k-
means process
k-medoids : PAM, CLARA
MacQueen's k-means Method
1. Take the first k data units in the data set as clusters
of one member each.
2. Assign each of the remaining n-k data units to the
cluster with the nearest centroid. After each
assignment, re-compute the centroid of the gaining
cluster.
3. After all data units have been assigned in step 2,
take the existing cluster centroids as fixed seed
points and make one more pass through the data
set assigning each data unit to the nearest seed
point.
Convergent k-means process
1. Use steps 1 and 2 of the ordinary MacQueen k-
means method.
2. Take each data unit in sequence and compute the
distances to all cluster centroids; if the nearest
centroid is not that of the data unit's parent cluster,
then reassign the data unit and update the
centroids of the losing and gaining clusters.
3. Repeat step 2 until convergence is achieved.
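A minimal NumPy sketch of both procedures above, assuming Euclidean distance; the function names are mine, and ties and emptied clusters are only minimally guarded:

```python
import numpy as np

def macqueen_kmeans(X, k):
    """Steps 1-3 of MacQueen's k-means: seed with the first k data
    units, assign the rest with running centroid updates, then make
    one more pass with the centroids held fixed."""
    centroids = X[:k].astype(float)              # step 1: first k data units
    counts = np.ones(k)
    for x in X[k:]:                              # step 2: assign, update gainer
        j = np.argmin(np.linalg.norm(centroids - x, axis=1))
        counts[j] += 1
        centroids[j] += (x - centroids[j]) / counts[j]   # running mean
    labels = np.array([np.argmin(np.linalg.norm(centroids - x, axis=1))
                       for x in X])              # step 3: fixed-seed pass
    return labels, centroids

def convergent_kmeans(X, k):
    """Convergent variant: after the MacQueen pass, keep moving single
    data units to strictly nearer centroids, updating the losing and
    gaining centroids, until no reassignment occurs."""
    labels, centroids = macqueen_kmeans(X, k)
    counts = np.bincount(labels, minlength=k).astype(float)
    changed = True
    while changed:                               # repeat until convergence
        changed = False
        for i, x in enumerate(X):
            dists = np.linalg.norm(centroids - x, axis=1)
            j, old = int(np.argmin(dists)), labels[i]
            if dists[j] < dists[old] and counts[old] > 1:
                centroids[old] = (centroids[old] * counts[old] - x) / (counts[old] - 1)
                centroids[j] = (centroids[j] * counts[j] + x) / (counts[j] + 1)
                counts[old] -= 1
                counts[j] += 1
                labels[i] = j
                changed = True
    return labels, centroids
```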
PAM :
Partitioning Around Medoids
(also called k-medoid method)
Medoid : representative object
Two phases :
BUILD : an initial clustering is obtained by the
successive selection of representative objects
until k objects have been found.
SWAP : attempts to improve the set of
representative objects, and thereby also the
clustering yielded by this set.
PAM
Two-dimensional example with 10 objects ( figures ) :
first with objects 1 and 5 as the selected representative objects,
then with objects 4 and 8 as the selected representative objects
PAM Algorithm ( BUILD )
1. The first object selected is the one for which the sum
of the dissimilarities to all other objects is
as small as possible.
2. Consider an object i which has not yet been
selected :
For each nonselected object j, calculate Dj ( its
dissimilarity to the closest previously selected object )
and d( j, i )
Cji = max ( Dj - d( j, i ), 0 )
Calculate the total gain of selecting i : the sum of Cji over all j
Select the object i that maximizes this gain ;
repeat until k objects have been selected
PAM ( BUILD )
Figure : for a nonselected object j, Dj is compared with
d( j, i ) ; when Dj > d( j, i ), the contribution of j to
selecting i is Cji = Dj - d( j, i )
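A minimal sketch of the BUILD phase over a precomputed dissimilarity matrix (function and variable names are illustrative):

```python
import numpy as np

def pam_build(D, k):
    """BUILD phase of PAM on an (n, n) dissimilarity matrix D.

    Selects k medoids: first the object with the smallest total
    dissimilarity, then repeatedly the object i with the largest
    total gain, sum over j of Cji = max(Dj - d(j, i), 0)."""
    n = D.shape[0]
    medoids = [int(np.argmin(D.sum(axis=1)))]    # step 1
    while len(medoids) < k:                      # step 2
        Dj = D[:, medoids].min(axis=1)           # distance to nearest medoid
        best_i, best_gain = None, -1.0
        for i in range(n):
            if i in medoids:
                continue
            gain = np.maximum(Dj - D[:, i], 0).sum()   # sum of Cji over all j
            if gain > best_gain:
                best_gain, best_i = gain, i
        medoids.append(best_i)
    return medoids
```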
PAM Algorithm ( SWAP )
Consider all pairs of objects ( i, h ) where
object i has been selected and object h has not
For each nonselected object j,
calculate its contribution Cjih :
if j is farther from i and from h than from its closest
representative object : Cjih = 0
if j belongs to the cluster of i, i.e. d( j, i ) = Dj :
if d( j, h ) < Ej then Cjih = d( j, h ) - Dj
if d( j, h ) >= Ej then Cjih = Ej - Dj
( Ej : dissimilarity of j to its second closest representative object )
if j is closer to h than to its closest representative
object, i.e. d( j, h ) < Dj : Cjih = d( j, h ) - Dj
Tih = sum of Cjih over all j
Find the minimum of Tih over all pairs ( i, h )
If min( Tih ) < 0, carry out the swap and repeat the SWAP phase
If min( Tih ) >= 0, STOP !!
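A matching sketch of the SWAP phase; it assumes k >= 2 so that Ej is defined:

```python
import numpy as np

def pam_swap(D, medoids):
    """SWAP phase of PAM on an (n, n) dissimilarity matrix D.

    Evaluates every pair (selected i, nonselected h), accumulating
    Tih = sum of Cjih, and carries out the most negative swap until
    min(Tih) >= 0."""
    n = D.shape[0]
    medoids = list(medoids)
    while True:
        others = [j for j in range(n) if j not in medoids]
        order = np.sort(D[:, medoids], axis=1)
        Dj, Ej = order[:, 0], order[:, 1]     # closest / second closest medoid
        best_T, best_i, best_h = 0.0, None, None
        for i in medoids:
            for h in others:
                T = 0.0
                for j in others:
                    if D[j, i] > Dj[j]:       # j does not belong to i's cluster
                        T += min(D[j, h] - Dj[j], 0.0)
                    else:                     # j belongs to i's cluster
                        T += min(D[j, h], Ej[j]) - Dj[j]
                if T < best_T:
                    best_T, best_i, best_h = T, i, h
        if best_i is None:                    # min(Tih) >= 0 : stop
            return medoids
        medoids[medoids.index(best_i)] = best_h   # carry out the swap
```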
CLARA :
Clustering Large Applications
Considers time and space complexity
Also based on the k-medoid approach
Draws five ( or more ) random samples of objects ;
the size of each sample depends on the number
of clusters : 40 + 2k
CLARA Algorithm
Draw a sample of 40 + 2k objects randomly
from the entire data set
Use PAM to find k medoids of the sample
Assign each object of the entire data set to the
nearest representative object ( medoid )
The average distance obtained for this
assignment is used as a measure of the
quality of the clustering ; the best set of
medoids over all samples is retained
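A sketch of CLARA that reuses the pam_build and pam_swap sketches above; Euclidean distance and NumPy arrays are assumed:

```python
import numpy as np

def clara(X, k, n_samples=5, seed=0):
    """Run PAM on n_samples random subsets of size 40 + 2k and keep
    the medoid set with the smallest average distance, measured over
    the entire data set."""
    rng = np.random.default_rng(seed)
    best_cost, best_medoids = np.inf, None
    for _ in range(n_samples):
        idx = rng.choice(len(X), size=min(40 + 2 * k, len(X)), replace=False)
        S = X[idx]
        D = np.linalg.norm(S[:, None, :] - S[None, :, :], axis=2)
        medoids = idx[pam_swap(D, pam_build(D, k))]   # sketches defined above
        # assign every object of the full data set to its nearest medoid
        dists = np.linalg.norm(X[:, None, :] - X[medoids][None, :, :], axis=2)
        cost = dists.min(axis=1).mean()               # quality of the clustering
        if cost < best_cost:
            best_cost, best_medoids = cost, medoids
    return best_medoids, best_cost
```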
Hierarchical Cluster Algorithms
The clustering method depends on how
the similarity between two clusters is
measured :
UPGMA : Unweighted Pair-Group Method
using Arithmetic averages
SLINK : Single Linkage
CLINK : Complete Linkage
Ward's minimum variance
Agglomerative vs. Divisive ( figure )
Hierarchical Clustering Methods
STEP 1 : Obtain the Data Matrix
STEP 2 : Standardize the Data Matrix
STEP 3 : Compute the Resemblance Matrix
STEP 4 : Execute the Clustering Method
STEP 5 : Rearrange the Data and
Resemblance Matrices
STEP 6 : Compute the Cophenetic
Correlation Coefficient
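One concrete way to carry out steps 1 through 6 is with SciPy's hierarchical clustering tools; a minimal sketch, using z-scores for step 2 (one common standardization) and random data standing in for step 1:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

X = np.random.rand(20, 4)                     # step 1: data matrix
Z = (X - X.mean(axis=0)) / X.std(axis=0)      # step 2: standardize
d = pdist(Z)                                  # step 3: resemblance (distance) matrix
tree = linkage(d, method='average')           # step 4: UPGMA; 'single' gives SLINK,
                                              # 'complete' CLINK, 'ward' Ward's method
# step 5 corresponds to reordering the matrices for dendrogram display;
# step 6: cophenetic correlation between the tree and the original distances
c, _ = cophenet(tree, d)
print(f"cophenetic correlation coefficient = {c:.3f}")
```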
Hierarchical Clustering Method ( worked example, figures ) :
Step 1 : data matrix ; Step 2 : standardized data matrix ;
Step 3 : resemblance matrix
Step 4 using UPGMA : hierarchical tree
Step 5 : rearranged data and resemblance matrices ;
Step 6 : cophenetic correlation coefficient
Step 4 using SLINK : note the chaining effect,
compared with CLINK and UPGMA
Step 4 using Ward's minimum variance
Density Search Algorithm
A cluster is defined as a region in which the
density of objects is locally higher than in
other regions
Two types :
The density near an object is defined as the
number of objects within a sphere of fixed
radius R
The density is defined as the inverse of the
dissimilarity to the T-th nearest object, or the
inverse of the average dissimilarity to the T
nearest objects
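Minimal sketches of the two density definitions, assuming Euclidean distance over a NumPy array:

```python
import numpy as np

def density_fixed_radius(X, R):
    """Type 1: number of objects within a sphere of fixed radius R
    around each object (the object itself is excluded)."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    return (D <= R).sum(axis=1) - 1

def density_nearest(X, T):
    """Type 2: inverse of the dissimilarity to the T-th nearest
    object (column 0 of the sorted distances is the object itself)."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    return 1.0 / np.sort(D, axis=1)[:, T]
```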
Density-connected sets
Spatial Database
characterized by
spatial attributes : points or spatially extended
objects such as polygons in some d-dimensional
space
non-spatial attributes : represent additional
properties of a spatial object
Requirements for clustering algorithms
targeted at spatial databases :
Minimal requirements of domain knowledge to
determine the input parameters
Discovery of clusters with arbitrary shape
Good efficiency on large databases
Fast algorithms for clustering
very large data sets
Revisions of some existing clustering methods :
using carefully designed search methods
CLARANS : randomized search
using an organizing structure
BIRCH : CF tree
using organizing indices
DBSCAN : R*-tree
=> all target numeric data
Fast algorithms for clustering
very large data sets
CLARANS - for large databases
BIRCH - for large databases ; reduces
memory usage and I/O cost
CURE - arbitrary shapes
DBSCAN, GDBSCAN
DBCLASD - arbitrary shapes, no input
parameters required
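A brute-force sketch of DBSCAN's growth of density-connected sets; the published algorithm answers the neighbourhood queries through an R*-tree rather than a full distance matrix, and the parameter names below follow the paper's Eps and MinPts:

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Grow one cluster per unvisited core point (a point with at
    least min_pts neighbours within eps); points reached by no core
    point keep the label -1 (noise)."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbours = [np.flatnonzero(row <= eps) for row in D]
    labels = np.full(len(X), -1)
    cluster = 0
    for p in range(len(X)):
        if labels[p] != -1 or len(neighbours[p]) < min_pts:
            continue                          # already clustered, or not core
        labels[p] = cluster
        frontier = list(neighbours[p])
        while frontier:                       # expand the density-connected set
            q = frontier.pop()
            if labels[q] == -1:
                labels[q] = cluster
                if len(neighbours[q]) >= min_pts:
                    frontier.extend(neighbours[q])   # q is core: keep expanding
        cluster += 1
    return labels
```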
Figure : timeline of clustering methods, 1984 - 1998 :
the k-medoid family ( PAM, CLARA, CLARANS ),
density-based algorithms ( DBSCAN, GDBSCAN, DBCLASD ),
and the hierarchical methods BIRCH and CURE
