
Clustering Part 2 (continued)

1. BIRCH (skipped)

2. Density-based Clustering --- DBSCAN and DENCLUE

3. Grid-based Approaches --- STING and CLIQUE (skipped in 2009)
4. SOM (skipped)
5. Cluster Validation --- covered in a separate set of transparencies
6. Outlier/Anomaly Detection --- covered in a separate set of transparencies

1
BIRCH (1996)
 Birch: Balanced Iterative Reducing and Clustering using
Hierarchies, by Zhang, Ramakrishnan, Livny (SIGMOD’96)
 Incrementally construct a CF (Clustering Feature) tree, a
hierarchical data structure for multiphase clustering
 Phase 1: scan DB to build an initial in-memory CF tree (a
multi-level compression of the data that tries to preserve
the inherent clustering structure of the data)
 Phase 2: use an arbitrary clustering algorithm to cluster the
leaf nodes of the CF-tree
 Scales linearly: finds a good clustering with a single scan and
improves the quality with a few additional scans
 Weakness: handles only numeric data and is sensitive to the
order of the data records.
2
Clustering Feature Vector

Clustering Feature: CF = (N, LS, SS)


N: Number of data points
LS: Ni=1=Xi
SS: Ni=1=Xi2 CF = (5, (16,30),(54,190))
10

9
(3,4)
(2,6)
8

(4,5)
5

1
(4,7)
(3,8)
0
0 1 2 3 4 5 6 7 8 9 10
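Below is a minimal illustrative sketch (not BIRCH's actual code; the function names are hypothetical) of how a CF triple can be computed and how two CFs are merged via the additivity property:

```python
import numpy as np

def clustering_feature(points):
    """Compute the CF triple (N, LS, SS) for a set of d-dimensional points."""
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    ls = pts.sum(axis=0)          # linear sum of the points
    ss = (pts ** 2).sum(axis=0)   # per-dimension sum of squares
    return n, ls, ss

def merge_cf(cf_a, cf_b):
    """CF additivity: the CF of the union of two disjoint point sets
    is the component-wise sum of their CFs."""
    return cf_a[0] + cf_b[0], cf_a[1] + cf_b[1], cf_a[2] + cf_b[2]

# The five points from the example give CF = (5, (16, 30), (54, 190)).
print(clustering_feature([(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]))
```

The additivity property is what lets the CF tree absorb points incrementally: inserting a point into a node only requires adding its CF to the CFs along the path.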

3
CF Tree

B = 7 (branching factor), L = 6 (maximum number of CF entries per leaf)

[Figure: CF tree with a root node, non-leaf nodes, and leaf nodes. Each non-leaf node holds CF entries CF1, CF2, ... with pointers child1, child2, ... to its children; leaf nodes hold CF entries and are chained together via prev/next pointers.]
4
Chapter 8. Cluster Analysis

 What is Cluster Analysis?


 Types of Data in Cluster Analysis
 A Categorization of Major Clustering Methods
 Partitioning Methods
 Hierarchical Methods
 Density-Based Methods
 Grid-Based Methods
 Model-Based Clustering Methods
 Outlier Analysis
 Summary

5
Density-Based Clustering Methods
 Clustering based on density (local cluster criterion), such as
density-connected points or based on an explicitly
constructed density function
 Major features:

Discover clusters of arbitrary shape

Handle noise

One scan

Need density parameters
 Several interesting studies:
 DBSCAN: Ester, et al. (KDD’96)

 OPTICS: Ankerst, et al (SIGMOD’99).

 DENCLUE: Hinneburg & Keim (KDD’98)

 CLIQUE: Agrawal, et al. (SIGMOD’98)

6
Density-Based Clustering: Background
 Two parameters:
 Eps: Maximum radius of the neighbourhood
 MinPts: Minimum number of points in an Eps-
neighbourhood of that point
 NEps(p): {q belongs to D | dist(p,q) <= Eps}
 Directly density-reachable: A point p is directly density-
reachable from a point q wrt. Eps, MinPts if
 1) p belongs to NEps(q)
 2) core point condition: |NEps(q)| >= MinPts
[Figure: p lies in the Eps-neighbourhood of a core point q; MinPts = 5, Eps = 1 cm]
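As a small illustration of these definitions (an assumed sketch, not code from the DBSCAN paper), the Eps-neighbourhood and the core point condition can be written as:

```python
import numpy as np

def eps_neighbourhood(D, p, eps):
    """N_Eps(p): indices of all points of D within distance eps of point p."""
    D = np.asarray(D, dtype=float)
    return np.where(np.linalg.norm(D - D[p], axis=1) <= eps)[0]

def is_core_point(D, p, eps, min_pts):
    """Core point condition: |N_Eps(p)| >= MinPts."""
    return len(eps_neighbourhood(D, p, eps)) >= min_pts
```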
7
Density-Based Clustering: Background (II)

 Density-reachable:
 A point p is density-reachable from a point q wrt. Eps, MinPts
if there is a chain of points p1, …, pn with p1 = q and pn = p
such that pi+1 is directly density-reachable from pi
[Figure: chain of points from q to p, each directly density-reachable from the previous one]
 Density-connected:
 A point p is density-connected to a point q wrt. Eps, MinPts
if there is a point o such that both p and q are
density-reachable from o wrt. Eps and MinPts.
[Figure: p and q are both density-reachable from a common point o]
8
DBSCAN: Density Based Spatial
Clustering of Applications with Noise

 Relies on a density-based notion of cluster: A cluster is
defined as a maximal set of density-connected points
 Discovers clusters of arbitrary shape in spatial databases
with noise
[Figure: core, border, and outlier points (Eps = 1 cm, MinPts = 5); border points are density-reachable from a core point, outliers are not]

9
DBSCAN: The Algorithm

 Arbitrarily select a point p


 Retrieve all points density-reachable from p wrt Eps
and MinPts.
 If p is a core point, a cluster is formed.
 If p is not a core point, no points are density-
reachable from p and DBSCAN visits the next point of
the database.
 Continue the process until all of the points have been
processed.
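The loop above can be turned into a short, self-contained sketch (illustrative only, not the original KDD'96 implementation); eps and min_pts play the roles of Eps and MinPts:

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Assign a cluster id to every point; points left at -1 are noise/outliers."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    labels = np.full(n, -1)
    visited = np.zeros(n, dtype=bool)

    def region_query(i):
        # Eps-neighbourhood of point i (includes i itself)
        return list(np.where(np.linalg.norm(X - X[i], axis=1) <= eps)[0])

    cluster_id = 0
    for p in range(n):
        if visited[p]:
            continue
        visited[p] = True
        neighbours = region_query(p)
        if len(neighbours) < min_pts:
            continue                      # p is not a core point; move on
        labels[p] = cluster_id            # p is a core point: start a new cluster
        seeds = neighbours
        while seeds:
            q = seeds.pop()
            if labels[q] == -1:
                labels[q] = cluster_id    # q joins the cluster (core or border point)
            if not visited[q]:
                visited[q] = True
                q_neighbours = region_query(q)
                if len(q_neighbours) >= min_pts:
                    seeds.extend(q_neighbours)   # q is also a core point: keep expanding
        cluster_id += 1
    return labels
```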
10
DENCLUE: using density functions
 DENsity-based CLUstEring by Hinneburg & Keim (KDD’98)
 Major features
 Solid mathematical foundation
 Good for data sets with large amounts of noise
 Allows a compact mathematical description of arbitrarily
shaped clusters in high-dimensional data sets
 Significantly faster than existing algorithms (faster than
DBSCAN by a factor of up to 45)
 But needs a large number of parameters

11
Denclue: Technical Essence
 Influence function: describes the impact of a data point within its
neighborhood.
 Overall density of the data space can be calculated as the sum of
the influence function of all data points.
 Clusters can be determined mathematically by identifying density
attractors; objects that are associated with the same density attractor
belong to the same cluster
 Density attractors are local maxima of the overall density function.
 Uses grid-cells to speed up computations; only data points in
neighboring grid-cells are used to determine the density for a point.

12
DENCLUE Influence Function and its Gradient

 Example
$$f_{Gaussian}(x, y) = e^{-\frac{d(x, y)^2}{2\sigma^2}}$$

$$f^{D}_{Gaussian}(x) = \sum_{i=1}^{N} e^{-\frac{d(x, x_i)^2}{2\sigma^2}}$$

$$\nabla f^{D}_{Gaussian}(x, x_i) = \sum_{i=1}^{N} (x_i - x) \cdot e^{-\frac{d(x, x_i)^2}{2\sigma^2}}$$

13
Example: Density Computation

D={x1,x2,x3,x4}

$f^{D}_{Gaussian}(x) = influence(x_1) + influence(x_2) + influence(x_3) + influence(x_4) = 0.04 + 0.06 + 0.08 + 0.6 = 0.78$

[Figure: data points x1, x2, x3, x4 and two query locations x and y; the influences of x1, x2, x3, x4 at x are 0.04, 0.06, 0.08, and 0.6]

Remark: the density value of y would be larger than the one for x
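A small sketch of this density computation (assumed helper names and a hypothetical kernel width sigma; not DENCLUE's actual implementation):

```python
import numpy as np

def gaussian_influence(x, xi, sigma):
    """Influence of data point xi at location x (Gaussian influence function)."""
    d2 = np.sum((np.asarray(x, dtype=float) - np.asarray(xi, dtype=float)) ** 2)
    return np.exp(-d2 / (2 * sigma ** 2))

def density(x, data, sigma):
    """Overall density at x: sum of the influences of all data points."""
    return sum(gaussian_influence(x, xi, sigma) for xi in data)

def density_gradient(x, data, sigma):
    """Gradient of the overall density at x; it points uphill toward a density attractor."""
    x = np.asarray(x, dtype=float)
    return sum((np.asarray(xi, dtype=float) - x) * gaussian_influence(x, xi, sigma)
               for xi in data)
```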
14
Density Attractor

15
Examples of DENCLUE Clusters

16
Basic Steps DENCLUE Algorithms

1. Determine density attractors


2. Associate data objects with density
attractors using hill climbing (this gives an
initial clustering); a sketch of this step follows below
3. Merge the initial clusters further relying
on a hierarchical clustering approach
(optional)
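A minimal sketch of step 2 under the Gaussian influence function (the step size, tolerance, and iteration limit are illustrative assumptions, not the original DENCLUE code):

```python
import numpy as np

def hill_climb(x, data, sigma, step=0.05, tol=1e-4, max_iter=200):
    """Follow the density gradient from x until it (approximately) reaches a
    local maximum of the overall density, i.e. a density attractor."""
    x = np.asarray(x, dtype=float)
    data = np.asarray(data, dtype=float)
    for _ in range(max_iter):
        # gradient of the Gaussian-kernel density at the current position
        w = np.exp(-np.sum((data - x) ** 2, axis=1) / (2 * sigma ** 2))
        grad = (w[:, None] * (data - x)).sum(axis=0)
        if np.linalg.norm(grad) < tol:
            break
        x = x + step * grad / (np.linalg.norm(grad) + 1e-12)
    return x  # approximate density attractor for the starting object

# Objects whose climbs end at (approximately) the same attractor form one
# initial cluster (step 2); nearby attractors can then be merged (step 3).
```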

17
Chapter 8. Cluster Analysis

 What is Cluster Analysis?


 Types of Data in Cluster Analysis
 A Categorization of Major Clustering Methods
 Partitioning Methods
 Hierarchical Methods
 Density-Based Methods
 Grid-Based Methods
 Model-Based Clustering Methods
 Outlier Analysis
 Summary

18
Steps of Grid-based Clustering Algorithms

Basic Grid-based Algorithm


1. Define a set of grid-cells
2. Assign objects to the appropriate grid cell and
compute the density of each cell.
3. Eliminate cells whose density is below a
certain threshold τ.
4. Form clusters from contiguous (adjacent)
groups of dense cells (usually minimizing a
given objective function).
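The four steps above can be sketched as follows (an assumed, simplified implementation; cell_size and min_density are hypothetical parameters, and step 4 simply merges adjacent dense cells without an objective function):

```python
import numpy as np
from collections import deque
from itertools import product

def grid_clusters(X, cell_size, min_density):
    """Basic grid-based clustering: keep dense cells, merge adjacent ones."""
    X = np.asarray(X, dtype=float)
    # Steps 1-2: assign each object to a grid cell and count objects per cell
    cells = {}
    for i, x in enumerate(X):
        key = tuple(int(v) for v in np.floor(x / cell_size))
        cells.setdefault(key, []).append(i)
    # Step 3: eliminate cells whose density is below the threshold
    dense = {k: idx for k, idx in cells.items() if len(idx) >= min_density}

    def neighbours(key):
        # all cells differing by at most 1 in every dimension (excluding the cell itself)
        for off in product((-1, 0, 1), repeat=len(key)):
            if any(off):
                yield tuple(k + o for k, o in zip(key, off))

    # Step 4: form clusters from contiguous (adjacent) dense cells (BFS)
    clusters, seen = [], set()
    for start in dense:
        if start in seen:
            continue
        seen.add(start)
        members, queue = [], deque([start])
        while queue:
            c = queue.popleft()
            members.extend(dense[c])
            for nb in neighbours(c):
                if nb in dense and nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        clusters.append(members)
    return clusters  # each cluster is a list of object indices
```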
19
Advantages of Grid-based Clustering
Algorithms

 Fast:
 No distance computations
 Clustering is performed on summaries and not on
individual objects; complexity is usually
O(#populated-grid-cells) and not O(#objects)
 Easy to determine which clusters are neighboring
 Shapes are limited to unions of rectangular grid-cells

20
Grid-Based Clustering Methods
 Several interesting methods (in addition to the basic grid-
based algorithm)
 STING (a STatistical INformation Grid approach) by
Wang, Yang and Muntz (1997)
 CLIQUE: Agrawal, et al. (SIGMOD’98)

21
STING: A Statistical Information
Grid Approach
 Wang, Yang and Muntz (VLDB’97)
 The spatial area is divided into rectangular cells
 There are several levels of cells corresponding to different
levels of resolution

22
STING: A Statistical Information
Grid Approach (2)
 Main contribution of STING is the proposal of a data structure that
can be used for many purposes (e.g. SCMRG; BIRCH uses a similar
kind of structure)
 The data structure is used to form clusters based on queries
 Each cell at a high level is partitioned into a number of smaller
cells in the next lower level
 Statistical info of each cell is calculated and stored beforehand
and is used to answer queries
 Parameters of higher level cells can be easily calculated from
parameters of lower level cell
 count, mean, standard deviation (s), min, max

 type of distribution—normal, uniform, etc.

 Use a top-down approach to answer spatial data queries


 Clusters are formed by merging cells that match a given query
description (see next slide)
23
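The claim that higher-level cell parameters can be computed from lower-level ones can be illustrated with a small sketch (an assumed dictionary layout for the per-cell statistics, not STING's actual data structure):

```python
import numpy as np

def merge_cell_stats(children):
    """Derive the (count, mean, std, min, max) statistics of a higher-level cell
    from the statistics of its lower-level child cells."""
    n = sum(c["count"] for c in children)
    mean = sum(c["count"] * c["mean"] for c in children) / n
    # recombine second moments: E[X^2] per child is std^2 + mean^2
    ex2 = sum(c["count"] * (c["std"] ** 2 + c["mean"] ** 2) for c in children) / n
    return {"count": n,
            "mean": mean,
            "std": float(np.sqrt(max(ex2 - mean ** 2, 0.0))),
            "min": min(c["min"] for c in children),
            "max": max(c["max"] for c in children)}
```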
STING: Query Processing (3)
Use a top-down approach to answer spatial data queries
1. Start from a pre-selected layer—typically with a small number of
cells
2. From the pre-selected layer until you reach the bottom layer do
the following:
 For each cell in the current level compute the confidence interval
indicating a cell’s relevance to a given query;
 If it is relevant, include the cell in a cluster
 If it is irrelevant, remove the cell from further consideration
 otherwise, look for relevant cells at the next lower layer
3. Combine relevant cells into relevant regions (based on grid-
neighborhood) and return the clusters obtained in this way as the
answer.

24
STING: A Statistical Information
Grid Approach (3)
 Advantages:
 Query-independent, easy to parallelize, incremental

update
 O(K), where K is the number of grid cells at the

lowest level
 Can be used in conjunction with a grid-based

clustering algorithm
 Disadvantages:
 All the cluster boundaries are either horizontal or

vertical, and no diagonal boundary is detected

25
Subspace Clustering

 Clustering in very high-dimensional spaces is very difficult


 High dimensional attribute spaces tend to be sparse, so it is
hard to find any clusters
 It is very difficult to create summaries of clusters in very
high-dimensional spaces
 This creates the motivation for subspace clustering:
 Find interesting subspaces (areas that are dense with
respect to the attributes belonging to the subspace)
 Find clusters for each interesting subspace
 Remark: multiple, overlapping clusters might be
obtained; basically one clustering for each subspace.
26
CLIQUE (Clustering In QUEst)
 Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD’98).
 Automatically identifies subspaces of a high dimensional
data space that allow better clustering than the original space
 CLIQUE can be considered as both density-based and grid-
based
 It partitions each dimension into the same number of
equal-length intervals
 It partitions an m-dimensional data space into non-
overlapping rectangular units
 A unit is dense if the fraction of total data points
contained in the unit exceeds an input model parameter (a
density threshold)
 A cluster is a maximal set of connected dense units within
a subspace
27
CLIQUE: The Major Steps
 Partition the data space and find the number of points
that lie inside each cell of the partition.
 Identify the subspaces that contain clusters using the
Apriori principle
 Identify clusters:
 Determine dense units in all subspaces of interest
 Determine connected dense units in all subspaces of
interest.
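A compact sketch of the first two steps (partitioning/counting and the Apriori-style bottom-up search over subspaces); the function names, the number of intervals, and the threshold tau are illustrative assumptions, not CLIQUE's original code:

```python
import numpy as np
from collections import Counter
from itertools import combinations

def dense_units(X, subspace, n_intervals, tau):
    """Partition each dimension of 'subspace' into n_intervals equal-length
    intervals and return the units whose fraction of all points exceeds tau."""
    X = np.asarray(X, dtype=float)
    lo, span = X.min(axis=0), np.ptp(X, axis=0) + 1e-12
    counts = Counter()
    for x in X:
        unit = tuple(min(int((x[d] - lo[d]) / span[d] * n_intervals),
                         n_intervals - 1) for d in subspace)
        counts[unit] += 1
    return {u for u, c in counts.items() if c / len(X) > tau}

def interesting_subspaces(X, n_intervals, tau, max_dim):
    """Bottom-up (Apriori-style) search: a k-dimensional subspace is examined
    only if all of its (k-1)-dimensional projections contain dense units."""
    X = np.asarray(X, dtype=float)
    level = {(d,): dense_units(X, (d,), n_intervals, tau) for d in range(X.shape[1])}
    level = {s: u for s, u in level.items() if u}
    result = dict(level)
    for k in range(2, max_dim + 1):
        candidates = {tuple(sorted(set(a) | set(b)))
                      for a in level for b in level if len(set(a) | set(b)) == k}
        candidates = {c for c in candidates
                      if all(s in level for s in combinations(c, k - 1))}
        level = {c: dense_units(X, c, n_intervals, tau) for c in candidates}
        level = {s: u for s, u in level.items() if u}
        result.update(level)
    return result  # maps each interesting subspace to its set of dense units
```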

28
[Figure: CLIQUE example with density threshold τ = 3. The data space is partitioned into equal-length intervals along salary (×$10,000), vacation (weeks), and age; dense units are identified in the 2-D subspaces (age, salary) and (age, vacation) and combined to locate dense regions in the 3-D (age, vacation, salary) space.]
29
Strength and Weakness of CLIQUE
 Strength
 It automatically finds subspaces of the highest

dimensionality such that high density clusters exist in those


subspaces
 It is insensitive to the order of records in input and does not

presume some canonical data distribution


 It scales linearly with the size of input and has good

scalability as the number of dimensions in the data increases


 Weakness
 The accuracy of the clustering result may be degraded in

exchange for the simplicity of the method


 Quite expensive

30
Self-organizing feature maps (SOMs)

 Clustering is also performed by having several


units competing for the current object
 The unit whose weight vector is closest to the
current object wins
 The winner and its neighbors learn by having
their weights adjusted
 SOMs are believed to resemble processing that
can occur in the brain
 Useful for visualizing high-dimensional data in
2- or 3-D space
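A minimal competitive-learning sketch of this idea (the grid size and the learning-rate and neighbourhood schedules are illustrative assumptions, not a specific SOM library's API):

```python
import numpy as np

def train_som(X, grid_shape=(10, 10), epochs=20, lr0=0.5, radius0=3.0):
    """Minimal SOM training loop: for each object, find the winning unit and
    pull the winner and its grid neighbours toward the object."""
    rng = np.random.default_rng(0)
    X = np.asarray(X, dtype=float)
    rows, cols = grid_shape
    weights = rng.random((rows, cols, X.shape[1]))           # one weight vector per unit
    grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), -1)
    for t in range(epochs):
        lr = lr0 * (1 - t / epochs)                          # decaying learning rate
        radius = max(radius0 * (1 - t / epochs), 0.5)        # shrinking neighbourhood
        for x in rng.permutation(X):
            # winner = unit whose weight vector is closest to the current object
            d = np.linalg.norm(weights - x, axis=2)
            winner = np.unravel_index(np.argmin(d), d.shape)
            # Gaussian neighbourhood on the 2-D grid, centred at the winner
            gdist2 = np.sum((grid - np.array(winner)) ** 2, axis=2)
            h = np.exp(-gdist2 / (2 * radius ** 2))[..., None]
            weights += lr * h * (x - weights)                # move units toward x
    return weights
```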

31
References (1)
 R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of
high dimensional data for data mining applications. SIGMOD'98
 M. R. Anderberg. Cluster Analysis for Applications. Academic Press, 1973.
 M. Ankerst, M. Breunig, H.-P. Kriegel, and J. Sander. Optics: Ordering points to identify
the clustering structure, SIGMOD’99.
 P. Arabie, L. J. Hubert, and G. De Soete. Clustering and Classification. World Scientific, 1996.
 M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering
clusters in large spatial databases. KDD'96.
 M. Ester, H.-P. Kriegel, and X. Xu. Knowledge discovery in large spatial databases:
Focusing techniques for efficient class identification. SSD'95.
 D. Fisher. Knowledge acquisition via incremental conceptual clustering. Machine Learning,
2:139-172, 1987.
 D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An approach based
on dynamic systems. In Proc. VLDB’98.
 S. Guha, R. Rastogi, and K. Shim. Cure: An efficient clustering algorithm for large
databases. SIGMOD'98.
 A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.
32
References (2)

 L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: an Introduction to Cluster


Analysis. John Wiley & Sons, 1990.
 E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large datasets.
VLDB’98.
 G. J. McLachlan and K. E. Basford. Mixture Models: Inference and Applications to
Clustering. John Wiley and Sons, 1988.
 P. Michaud. Clustering techniques. Future Generation Computer Systems, 13, 1997.
 R. Ng and J. Han. Efficient and effective clustering method for spatial data mining.
VLDB'94.
 E. Schikuta. Grid clustering: An efficient hierarchical clustering method for very large data
sets. Proc. 1996 Int. Conf. on Pattern Recognition, 101-105.
 G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: A multi-resolution clustering
approach for very large spatial databases. VLDB’98.
 W. Wang, J. Yang, and R. Muntz. STING: A Statistical Information Grid Approach to Spatial Data
Mining. VLDB’97.
 T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: an efficient data clustering method for
very large databases. SIGMOD'96.

33