1. BIRCH (skipped)
BIRCH (1996)
Birch: Balanced Iterative Reducing and Clustering using
Hierarchies, by Zhang, Ramakrishnan, Livny (SIGMOD’96)
Incrementally construct a CF (Clustering Feature) tree, a
hierarchical data structure for multiphase clustering
Phase 1: scan DB to build an initial in-memory CF tree (a
multi-level compression of the data that tries to preserve
the inherent clustering structure of the data)
Phase 2: use an arbitrary clustering algorithm to cluster the
leaf nodes of the CF-tree
Scales linearly: finds a good clustering with a single scan and
improves the quality with a few additional scans
Weakness: handles only numeric data, and is sensitive to the
order of the data records.
Clustering Feature Vector
CF = (N, LS, SS)
N: number of data points
LS: linear sum of the N points
SS: square sum of the N points
Example: for the five points (3,4), (2,6), (4,5), (4,7), (3,8):
CF = (5, (16,30), (54,190))
(Figure: the five points plotted in the plane.)
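A minimal Python sketch computing the CF vector for the five points shown, and demonstrating CF additivity, the property that lets BIRCH merge subcluster summaries without revisiting the raw data (the helper names cf and merge are illustrative):

```python
# Sketch of BIRCH's Clustering Feature: CF = (N, LS, SS), where LS and SS
# are the per-dimension linear and square sums of the points.

def cf(points):
    """Return the CF vector (N, LS, SS) for a list of 2-D points."""
    n = len(points)
    ls = tuple(sum(c) for c in zip(*points))                 # linear sum
    ss = tuple(sum(x * x for x in c) for c in zip(*points))  # square sum
    return n, ls, ss

def merge(cf1, cf2):
    """CF additivity: the CF of a union is the component-wise sum."""
    n1, ls1, ss1 = cf1
    n2, ls2, ss2 = cf2
    return (n1 + n2,
            tuple(a + b for a, b in zip(ls1, ls2)),
            tuple(a + b for a, b in zip(ss1, ss2)))

points = [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]
print(cf(points))  # (5, (16, 30), (54, 190))
```

Additivity is what makes the CF tree incrementally maintainable: inserting a point or merging two nodes only touches the summaries along one root-to-leaf path.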
CF Tree
(Figure: a CF tree. The root and each non-leaf node hold CF entries CF1, CF2, CF3, …, each with a child pointer child1, child2, child3, …; leaf nodes hold CF entries summarizing subclusters.)
Chapter 8. Cluster Analysis
Density-Based Clustering Methods
Clustering based on density (local cluster criterion), such as
density-connected points or based on an explicitly
constructed density function
Major features:
Discover clusters of arbitrary shape
Handle noise
One scan
Need density parameters
Several interesting studies:
DBSCAN: Ester, et al. (KDD’96)
Density-Based Clustering: Background
Two parameters:
Eps: maximum radius of the neighbourhood
MinPts: minimum number of points in an Eps-neighbourhood of that point
NEps(p) = {q belongs to D | dist(p,q) <= Eps}
Directly density-reachable: A point p is directly density-reachable from a point q wrt. Eps, MinPts if
1) p belongs to NEps(q)
2) core point condition: |NEps(q)| >= MinPts
(Figure: points p and q with MinPts = 5, Eps = 1 cm.)
Density-Based Clustering: Background (II)
Density-reachable:
A point p is density-reachable from a point q wrt. Eps, MinPts if there is a chain of points p1, …, pn, with p1 = q and pn = p, such that pi+1 is directly density-reachable from pi.
Density-connected:
A point p is density-connected to a point q wrt. Eps, MinPts if there is a point o such that both p and q are density-reachable from o wrt. Eps and MinPts.
(Figures: a chain q = p1 → … → p illustrating density-reachability, and a point o from which both p and q are density-reachable.)
DBSCAN: Density Based Spatial
Clustering of Applications with Noise
(Figure: with Eps = 1 cm and MinPts = 5 — a core point, border points density-reachable from a core point, and an outlier.)
DBSCAN: The Algorithm
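The pseudocode for this slide did not survive extraction. The following is a hedged sketch of the standard DBSCAN procedure in Python, not the authors' exact pseudocode; the parameter names eps and min_pts mirror the Eps/MinPts definitions above:

```python
# Minimal DBSCAN sketch (after Ester et al., KDD'96).
# Labels each point with a cluster id (>= 0) or -1 for noise/outliers.
from math import dist

def dbscan(points, eps, min_pts):
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbours(i):
        # Eps-neighbourhood N_Eps(i), including the point itself
        return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:       # not a core point (may become a border point later)
            labels[i] = NOISE
            continue
        labels[i] = cluster            # core point: start a new cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:     # border point: density-reachable, claim it
                labels[j] = cluster
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            nbrs = neighbours(j)
            if len(nbrs) >= min_pts:   # j is also core: expand the cluster through it
                queue.extend(nbrs)
        cluster += 1
    return labels
```

The naive neighbourhood scan makes this O(n²); the paper's implementation uses an R*-tree index to reach roughly O(n log n).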
Denclue: Technical Essence
Influence function: describes the impact of a data point within its
neighborhood.
Overall density of the data space can be calculated as the sum of
the influence function of all data points.
Clusters can be determined mathematically by identifying density
attractors; objects that are associated with the same density attractor
belong to the same cluster.
Density attractors are local maxima of the overall density function.
Uses grid-cells to speed up computations; only data points in
neighboring grid-cells are used to determine the density for a point.
DENCLUE Influence Function and its Gradient
Example
$$f_{Gaussian}(x, y) = e^{-\frac{d(x, y)^2}{2\sigma^2}}$$

$$f^D_{Gaussian}(x) = \sum_{i=1}^{N} e^{-\frac{d(x, x_i)^2}{2\sigma^2}}$$

$$\nabla f^D_{Gaussian}(x, x_i) = \sum_{i=1}^{N} (x_i - x)\, e^{-\frac{d(x, x_i)^2}{2\sigma^2}}$$
Example: Density Computation
D = {x1, x2, x3, x4}
(Figure: the four data points with two query points x and y; the marked influence values are 0.04, 0.08, 0.06, and 0.6.)
Remark: the density value of y would be larger than the one for x.
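A minimal sketch of this density computation, summing the Gaussian influence of every point in D at a query point; σ and the coordinates below are illustrative assumptions, since the slide's actual values are not recoverable:

```python
from math import dist, exp

def density(x, data, sigma):
    """Overall density f^D_Gaussian(x): sum of Gaussian influences of all points in D."""
    return sum(exp(-dist(x, p) ** 2 / (2 * sigma ** 2)) for p in data)

# Illustrative coordinates: three clustered points and one isolated point.
D = [(1.0, 1.0), (1.5, 1.2), (2.0, 0.8), (6.0, 5.0)]
print(density((1.5, 1.0), D, sigma=1.0))  # query point inside the cluster
print(density((6.5, 5.5), D, sigma=1.0))  # query point near the isolated x4
```

A query point near several data points accumulates several large influence terms, so its density exceeds that of a point influenced by only one neighbour, which is the slide's remark about y versus x.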
Density Attractor
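This slide's illustration is missing; a density attractor can be sketched as the endpoint of hill climbing along the density gradient. The fixed step size below is a simplification of DENCLUE's actual adaptive hill climbing, and σ, the step, and the tolerance are assumed values:

```python
from math import dist, exp

def gradient(x, data, sigma):
    """Gradient of the overall Gaussian density at x."""
    gx = gy = 0.0
    for p in data:
        w = exp(-dist(x, p) ** 2 / (2 * sigma ** 2))
        gx += (p[0] - x[0]) * w
        gy += (p[1] - x[1]) * w
    return gx, gy

def density_attractor(x, data, sigma, step=0.1, tol=1e-6, max_iter=1000):
    """Hill-climb from x until the gradient (nearly) vanishes: a local maximum."""
    for _ in range(max_iter):
        gx, gy = gradient(x, data, sigma)
        if gx * gx + gy * gy < tol:
            break
        x = (x[0] + step * gx, x[1] + step * gy)
    return x
```

Starting points drawn toward the same attractor belong to the same cluster, which is exactly the cluster criterion stated above.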
Examples of DENCLUE Clusters
Basic Steps of the DENCLUE Algorithm
Steps of Grid-based Clustering Algorithms
fast:
No distance computations
clusters are formed by merging neighboring dense grid cells
Shapes are limited to unions of rectangular grid cells
Grid-Based Clustering Methods
Several interesting methods (in addition to the basic grid-
based algorithm)
STING (a STatistical INformation Grid approach) by
Wang, Yang and Muntz (1997)
CLIQUE: Agrawal, et al. (SIGMOD’98)
STING: A Statistical Information
Grid Approach
Wang, Yang and Muntz (VLDB’97)
The spatial area is divided into rectangular cells
There are several levels of cells corresponding to different
levels of resolution
STING: A Statistical Information
Grid Approach (2)
Main contribution of STING is the proposal of a data structure that
can be used for many purposes (e.g. SCMRG; BIRCH uses a similar
kind of hierarchical summary structure)
The data structure is used to form clusters based on queries
Each cell at a high level is partitioned into a number of smaller
cells in the next lower level
Statistical info of each cell is calculated and stored beforehand
and is used to answer queries
Parameters of higher level cells can be easily calculated from
parameters of lower level cells:
count, mean, s (standard deviation), min, max
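A hedged sketch of that bottom-up roll-up: a parent cell's statistics are derived from its children's stored summaries without rescanning the data. The dict layout and the aggregate helper are illustrative assumptions, not STING's actual implementation:

```python
# Sketch: derive a STING parent cell's (count, mean, s, min, max)
# from its child cells' stored statistics.
from math import sqrt

def aggregate(children):
    """children: list of dicts with keys n, mean, s, min, max."""
    n = sum(c["n"] for c in children)
    mean = sum(c["n"] * c["mean"] for c in children) / n
    # E[X^2] is recoverable from each child's s and mean, so variances combine exactly
    ex2 = sum(c["n"] * (c["s"] ** 2 + c["mean"] ** 2) for c in children) / n
    s = sqrt(ex2 - mean ** 2)
    return {"n": n, "mean": mean, "s": s,
            "min": min(c["min"] for c in children),
            "max": max(c["max"] for c in children)}
```

Because each statistic is exactly recoverable from the children's summaries, queries can stop descending the hierarchy as soon as a cell's statistics answer them, which is what makes STING query-independent and easy to update incrementally.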
STING: A Statistical Information
Grid Approach (3)
Advantages:
Query-independent, easy to parallelize, incremental
update
O(K), where K is the number of grid cells at the
lowest level
Can be used in conjunction with a grid-based
clustering algorithm
Disadvantages:
All the cluster boundaries are either horizontal or
vertical; no diagonal boundaries are detected
Subspace Clustering
(Figure: CLIQUE example with density threshold = 3. Two 1-D projections over age (20–60), one for Salary (10,000) and one for Vacation (week), identify dense intervals on the age axis; intersecting them yields candidate dense units in the corresponding 2-D subspaces.)
Strength and Weakness of CLIQUE
Strength
It automatically finds subspaces of the highest
dimensionality such that high-density clusters exist in those subspaces
Self-organizing feature maps (SOMs)
References (1)
R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of
high dimensional data for data mining applications. SIGMOD'98
M. R. Anderberg. Cluster Analysis for Applications. Academic Press, 1973.
M. Ankerst, M. Breunig, H.-P. Kriegel, and J. Sander. Optics: Ordering points to identify
the clustering structure, SIGMOD’99.
P. Arabie, L. J. Hubert, and G. De Soete. Clustering and Classification. World Scientific, 1996.
M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering
clusters in large spatial databases. KDD'96.
M. Ester, H.-P. Kriegel, and X. Xu. Knowledge discovery in large spatial databases:
Focusing techniques for efficient class identification. SSD'95.
D. Fisher. Knowledge acquisition via incremental conceptual clustering. Machine Learning,
2:139-172, 1987.
D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An approach based
on dynamic systems. In Proc. VLDB’98.
S. Guha, R. Rastogi, and K. Shim. Cure: An efficient clustering algorithm for large
databases. SIGMOD'98.
A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.