
Clustering Part 2 (continued)

1. BIRCH (skipped)

2. Density-based Clustering --- DBSCAN and DENCLUE

3. Grid-based Approaches --- STING and CLIQUE (skipped in 2009)
4. SOM (skipped)
5. Cluster Validation --- covered in a separate set of transparencies
6. Outlier/Anomaly Detection --- covered in a separate set of transparencies

1
BIRCH (1996)
 Birch: Balanced Iterative Reducing and Clustering using
Hierarchies, by Zhang, Ramakrishnan, Livny (SIGMOD’96)
 Incrementally construct a CF (Clustering Feature) tree, a
hierarchical data structure for multiphase clustering
 Phase 1: scan DB to build an initial in-memory CF tree (a
multi-level compression of the data that tries to preserve
the inherent clustering structure of the data)
 Phase 2: use an arbitrary clustering algorithm to cluster the
leaf nodes of the CF-tree
 Scales linearly: finds a good clustering with a single scan and
improves the quality with a few additional scans
 Weakness: handles only numeric data and is sensitive to the
order of the data records.
2
Clustering Feature Vector

Clustering Feature: CF = (N, LS, SS)


N: Number of data points
LS: Ni=1=Xi
SS: Ni=1=Xi2 CF = (5, (16,30),(54,190))
10

9
(3,4)
(2,6)
8

(4,5)
5

1
(4,7)
(3,8)
0
0 1 2 3 4 5 6 7 8 9 10
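Below is a minimal illustrative sketch (not BIRCH's actual code; the function names are hypothetical) of how a CF triple can be computed and how two CFs are merged via the additivity property:

```python
import numpy as np

def clustering_feature(points):
    """Compute the CF triple (N, LS, SS) for a set of d-dimensional points."""
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    ls = pts.sum(axis=0)          # linear sum of the points
    ss = (pts ** 2).sum(axis=0)   # per-dimension sum of squares
    return n, ls, ss

def merge_cf(cf_a, cf_b):
    """CF additivity: the CF of the union of two disjoint point sets
    is the component-wise sum of their CFs."""
    return cf_a[0] + cf_b[0], cf_a[1] + cf_b[1], cf_a[2] + cf_b[2]

# The five points from the example give CF = (5, (16, 30), (54, 190)).
print(clustering_feature([(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]))
```

The additivity property is what lets the CF tree absorb points incrementally: inserting a point into a node only requires adding its CF to the CFs along the path.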

3
CF Tree

B = 7 (branching factor), L = 6 (maximum number of CF entries per leaf)

[Figure: CF tree with a root node, non-leaf nodes, and leaf nodes. Each non-leaf node holds CF entries CF1, CF2, ... with pointers child1, child2, ... to its children; leaf nodes hold CF entries and are chained together via prev/next pointers.]
4
Chapter 8. Cluster Analysis

 What is Cluster Analysis?


 Types of Data in Cluster Analysis
 A Categorization of Major Clustering Methods
 Partitioning Methods
 Hierarchical Methods
 Density-Based Methods
 Grid-Based Methods
 Model-Based Clustering Methods
 Outlier Analysis
 Summary

5
Density-Based Clustering Methods
 Clustering based on density (local cluster criterion), such as
density-connected points or based on an explicitly
constructed density function
 Major features:

Discover clusters of arbitrary shape

Handle noise

One scan

Need density parameters
 Several interesting studies:
 DBSCAN: Ester, et al. (KDD’96)

 OPTICS: Ankerst, et al (SIGMOD’99).

 DENCLUE: Hinneburg & Keim (KDD’98)

 CLIQUE: Agrawal, et al. (SIGMOD’98)

6
Density-Based Clustering: Background
 Two parameters:
 Eps: Maximum radius of the neighbourhood
 MinPts: Minimum number of points in an Eps-
neighbourhood of that point
 NEps(p): {q belongs to D | dist(p,q) <= Eps}
 Directly density-reachable: A point p is directly density-
reachable from a point q wrt. Eps, MinPts if
 1) p belongs to NEps(q)
 2) core point condition: |NEps(q)| >= MinPts
[Figure: p lies in the Eps-neighbourhood of a core point q; MinPts = 5, Eps = 1 cm]
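As a small illustration of these definitions (an assumed sketch, not code from the DBSCAN paper), the Eps-neighbourhood and the core point condition can be written as:

```python
import numpy as np

def eps_neighbourhood(D, p, eps):
    """N_Eps(p): indices of all points of D within distance eps of point p."""
    D = np.asarray(D, dtype=float)
    return np.where(np.linalg.norm(D - D[p], axis=1) <= eps)[0]

def is_core_point(D, p, eps, min_pts):
    """Core point condition: |N_Eps(p)| >= MinPts."""
    return len(eps_neighbourhood(D, p, eps)) >= min_pts
```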
7
Density-Based Clustering: Background (II)

 Density-reachable:
 A point p is density-reachable from a point q wrt. Eps, MinPts
if there is a chain of points p1, …, pn with p1 = q and pn = p
such that pi+1 is directly density-reachable from pi
[Figure: chain of points from q to p, each directly density-reachable from the previous one]
 Density-connected:
 A point p is density-connected to a point q wrt. Eps, MinPts
if there is a point o such that both p and q are
density-reachable from o wrt. Eps and MinPts.
[Figure: p and q are both density-reachable from a common point o]
8
DBSCAN: Density Based Spatial
Clustering of Applications with Noise

 Relies on a density-based notion of cluster: A cluster is
defined as a maximal set of density-connected points
 Discovers clusters of arbitrary shape in spatial databases
with noise
[Figure: core, border, and outlier points (Eps = 1 cm, MinPts = 5); border points are density-reachable from a core point, outliers are not]

9
DBSCAN: The Algorithm

 Arbitrarily select a point p


 Retrieve all points density-reachable from p wrt Eps
and MinPts.
 If p is a core point, a cluster is formed.
 If p is not a core point, no points are density-
reachable from p and DBSCAN visits the next point of
the database.
 Continue the process until all of the points have been
processed.
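The loop above can be turned into a short, self-contained sketch (illustrative only, not the original KDD'96 implementation); eps and min_pts play the roles of Eps and MinPts:

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Assign a cluster id to every point; points left at -1 are noise/outliers."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    labels = np.full(n, -1)
    visited = np.zeros(n, dtype=bool)

    def region_query(i):
        # Eps-neighbourhood of point i (includes i itself)
        return list(np.where(np.linalg.norm(X - X[i], axis=1) <= eps)[0])

    cluster_id = 0
    for p in range(n):
        if visited[p]:
            continue
        visited[p] = True
        neighbours = region_query(p)
        if len(neighbours) < min_pts:
            continue                      # p is not a core point; move on
        labels[p] = cluster_id            # p is a core point: start a new cluster
        seeds = neighbours
        while seeds:
            q = seeds.pop()
            if labels[q] == -1:
                labels[q] = cluster_id    # q joins the cluster (core or border point)
            if not visited[q]:
                visited[q] = True
                q_neighbours = region_query(q)
                if len(q_neighbours) >= min_pts:
                    seeds.extend(q_neighbours)   # q is also a core point: keep expanding
        cluster_id += 1
    return labels
```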
10
DENCLUE: using density functions
 DENsity-based CLUstEring by Hinneburg & Keim (KDD’98)
 Major features
 Solid mathematical foundation
 Good for data sets with large amounts of noise
 Allows a compact mathematical description of arbitrarily
shaped clusters in high-dimensional data sets
 Significantly faster than existing algorithms (faster than
DBSCAN by a factor of up to 45)
 But needs a large number of parameters

11
Denclue: Technical Essence
 Influence function: describes the impact of a data point within its
neighborhood.
 Overall density of the data space can be calculated as the sum of
the influence function of all data points.
 Clusters can be determined mathematically by identifying density
attractors; objects that are associated with the same density attractor
belong to the same cluster
 Density attractors are local maxima of the overall density function.
 Uses grid-cells to speed up computations; only data points in
neighboring grid-cells are used to determine the density for a point.

12
DENCLUE Influence Function and its Gradient

 Example
$$f_{Gaussian}(x, y) = e^{-\frac{d(x, y)^2}{2\sigma^2}}$$

$$f^{D}_{Gaussian}(x) = \sum_{i=1}^{N} e^{-\frac{d(x, x_i)^2}{2\sigma^2}}$$

$$\nabla f^{D}_{Gaussian}(x, x_i) = \sum_{i=1}^{N} (x_i - x) \cdot e^{-\frac{d(x, x_i)^2}{2\sigma^2}}$$

13
Example: Density Computation

D={x1,x2,x3,x4}

$f^{D}_{Gaussian}(x) = influence(x_1) + influence(x_2) + influence(x_3) + influence(x_4) = 0.04 + 0.06 + 0.08 + 0.6 = 0.78$

[Figure: data points x1, x2, x3, x4 and two query locations x and y; the influences of x1, x2, x3, x4 at x are 0.04, 0.06, 0.08, and 0.6]

Remark: the density value of y would be larger than the one for x
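A small sketch of this density computation (assumed helper names and a hypothetical kernel width sigma; not DENCLUE's actual implementation):

```python
import numpy as np

def gaussian_influence(x, xi, sigma):
    """Influence of data point xi at location x (Gaussian influence function)."""
    d2 = np.sum((np.asarray(x, dtype=float) - np.asarray(xi, dtype=float)) ** 2)
    return np.exp(-d2 / (2 * sigma ** 2))

def density(x, data, sigma):
    """Overall density at x: sum of the influences of all data points."""
    return sum(gaussian_influence(x, xi, sigma) for xi in data)

def density_gradient(x, data, sigma):
    """Gradient of the overall density at x; it points uphill toward a density attractor."""
    x = np.asarray(x, dtype=float)
    return sum((np.asarray(xi, dtype=float) - x) * gaussian_influence(x, xi, sigma)
               for xi in data)
```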
14
Density Attractor

15
Examples of DENCLUE Clusters

16
Basic Steps DENCLUE Algorithms

1. Determine density attractors


2. Associate data objects with density
attractors using hill climbing (this gives an
initial clustering); a sketch of this step follows below
3. Merge the initial clusters further relying
on a hierarchical clustering approach
(optional)
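A minimal sketch of step 2 under the Gaussian influence function (the step size, tolerance, and iteration limit are illustrative assumptions, not the original DENCLUE code):

```python
import numpy as np

def hill_climb(x, data, sigma, step=0.05, tol=1e-4, max_iter=200):
    """Follow the density gradient from x until it (approximately) reaches a
    local maximum of the overall density, i.e. a density attractor."""
    x = np.asarray(x, dtype=float)
    data = np.asarray(data, dtype=float)
    for _ in range(max_iter):
        # gradient of the Gaussian-kernel density at the current position
        w = np.exp(-np.sum((data - x) ** 2, axis=1) / (2 * sigma ** 2))
        grad = (w[:, None] * (data - x)).sum(axis=0)
        if np.linalg.norm(grad) < tol:
            break
        x = x + step * grad / (np.linalg.norm(grad) + 1e-12)
    return x  # approximate density attractor for the starting object

# Objects whose climbs end at (approximately) the same attractor form one
# initial cluster (step 2); nearby attractors can then be merged (step 3).
```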

17
Chapter 8. Cluster Analysis

 What is Cluster Analysis?


 Types of Data in Cluster Analysis
 A Categorization of Major Clustering Methods
 Partitioning Methods
 Hierarchical Methods
 Density-Based Methods
 Grid-Based Methods
 Model-Based Clustering Methods
 Outlier Analysis
 Summary

18
Steps of Grid-based Clustering Algorithms

Basic Grid-based Algorithm


1. Define a set of grid-cells
2. Assign objects to the appropriate grid cell and
compute the density of each cell.
3. Eliminate cells whose density is below a
certain threshold τ.
4. Form clusters from contiguous (adjacent)
groups of dense cells (usually minimizing a
given objective function).
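The four steps above can be sketched as follows (an assumed, simplified implementation; cell_size and min_density are hypothetical parameters, and step 4 simply merges adjacent dense cells without an objective function):

```python
import numpy as np
from collections import deque
from itertools import product

def grid_clusters(X, cell_size, min_density):
    """Basic grid-based clustering: keep dense cells, merge adjacent ones."""
    X = np.asarray(X, dtype=float)
    # Steps 1-2: assign each object to a grid cell and count objects per cell
    cells = {}
    for i, x in enumerate(X):
        key = tuple(int(v) for v in np.floor(x / cell_size))
        cells.setdefault(key, []).append(i)
    # Step 3: eliminate cells whose density is below the threshold
    dense = {k: idx for k, idx in cells.items() if len(idx) >= min_density}

    def neighbours(key):
        # all cells differing by at most 1 in every dimension (excluding the cell itself)
        for off in product((-1, 0, 1), repeat=len(key)):
            if any(off):
                yield tuple(k + o for k, o in zip(key, off))

    # Step 4: form clusters from contiguous (adjacent) dense cells (BFS)
    clusters, seen = [], set()
    for start in dense:
        if start in seen:
            continue
        seen.add(start)
        members, queue = [], deque([start])
        while queue:
            c = queue.popleft()
            members.extend(dense[c])
            for nb in neighbours(c):
                if nb in dense and nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        clusters.append(members)
    return clusters  # each cluster is a list of object indices
```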
19
Advantages of Grid-based Clustering
Algorithms

 Fast:
 No distance computations
 Clustering is performed on summaries and not on
individual objects; complexity is usually
O(#populated-grid-cells) and not O(#objects)
 Easy to determine which clusters are neighboring
 Shapes are limited to unions of rectangular grid-cells

20
Grid-Based Clustering Methods
 Several interesting methods (in addition to the basic grid-
based algorithm)
 STING (a STatistical INformation Grid approach) by
Wang, Yang and Muntz (1997)
 CLIQUE: Agrawal, et al. (SIGMOD’98)

21
STING: A Statistical Information
Grid Approach
 Wang, Yang and Muntz (VLDB’97)
 The spatial area is divided into rectangular cells
 There are several levels of cells corresponding to different
levels of resolution

22
STING: A Statistical Information
Grid Approach (2)
 Main contribution of STING is the proposal of a data structure that
can be used for many purposes (e.g. SCMRG; BIRCH uses a similar
kind of structure)
 The data structure is used to form clusters based on queries
 Each cell at a high level is partitioned into a number of smaller
cells in the next lower level
 Statistical info of each cell is calculated and stored beforehand
and is used to answer queries
 Parameters of higher level cells can be easily calculated from
parameters of lower level cell
 count, mean, standard deviation (s), min, max

 type of distribution—normal, uniform, etc.

 Use a top-down approach to answer spatial data queries


 Clusters are formed by merging cells that match a given query
description (see next slide)
23
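The claim that higher-level cell parameters can be computed from lower-level ones can be illustrated with a small sketch (an assumed dictionary layout for the per-cell statistics, not STING's actual data structure):

```python
import numpy as np

def merge_cell_stats(children):
    """Derive the (count, mean, std, min, max) statistics of a higher-level cell
    from the statistics of its lower-level child cells."""
    n = sum(c["count"] for c in children)
    mean = sum(c["count"] * c["mean"] for c in children) / n
    # recombine second moments: E[X^2] per child is std^2 + mean^2
    ex2 = sum(c["count"] * (c["std"] ** 2 + c["mean"] ** 2) for c in children) / n
    return {"count": n,
            "mean": mean,
            "std": float(np.sqrt(max(ex2 - mean ** 2, 0.0))),
            "min": min(c["min"] for c in children),
            "max": max(c["max"] for c in children)}
```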
STING: Query Processing (3)
Use a top-down approach to answer spatial data queries
1. Start from a pre-selected layer—typically with a small number of
cells
2. From the pre-selected layer until you reach the bottom layer do
the following:
 For each cell in the current level compute the confidence interval
indicating a cell’s relevance to a given query;
 If it is relevant, include the cell in a cluster
 If it is irrelevant, remove the cell from further consideration
 otherwise, look for relevant cells at the next lower layer
3. Combine relevant cells into relevant regions (based on grid-
neighborhood) and return the clusters obtained in this way as the
answer.

24
STING: A Statistical Information
Grid Approach (3)
 Advantages:
 Query-independent, easy to parallelize, incremental

update
 O(K), where K is the number of grid cells at the

lowest level
 Can be used in conjunction with a grid-based

clustering algorithm
 Disadvantages:
 All the cluster boundaries are either horizontal or

vertical, and no diagonal boundary is detected

25
Subspace Clustering

 Clustering in very high-dimensional spaces is very difficult


 High dimensional attribute spaces tend to be sparse, so it is
hard to find any clusters
 It is very difficult to create summaries of clusters in very
high-dimensional spaces
 This creates the motivation for subspace clustering:
 Find interesting subspaces (areas that are dense with
respect to the attributes belonging to the subspace)
 Find clusters for each interesting subspace
 Remark: multiple, overlapping clusters might be
obtained; basically one clustering for each subspace.
26
CLIQUE (Clustering In QUEst)
 Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD’98).
 Automatically identifies subspaces of a high dimensional
data space that allow better clustering than the original space
 CLIQUE can be considered as both density-based and grid-
based
 It partitions each dimension into the same number of
equal-length intervals
 It partitions an m-dimensional data space into non-
overlapping rectangular units
 A unit is dense if the fraction of total data points
contained in the unit exceeds an input model parameter (a
density threshold)
 A cluster is a maximal set of connected dense units within
a subspace
27
CLIQUE: The Major Steps
 Partition the data space and find the number of points
that lie inside each cell of the partition.
 Identify the subspaces that contain clusters using the
Apriori principle
 Identify clusters:
 Determine dense units in all subspaces of interest
 Determine connected dense units in all subspaces of
interest.
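A compact sketch of the first two steps (partitioning/counting and the Apriori-style bottom-up search over subspaces); the function names, the number of intervals, and the threshold tau are illustrative assumptions, not CLIQUE's original code:

```python
import numpy as np
from collections import Counter
from itertools import combinations

def dense_units(X, subspace, n_intervals, tau):
    """Partition each dimension of 'subspace' into n_intervals equal-length
    intervals and return the units whose fraction of all points exceeds tau."""
    X = np.asarray(X, dtype=float)
    lo, span = X.min(axis=0), np.ptp(X, axis=0) + 1e-12
    counts = Counter()
    for x in X:
        unit = tuple(min(int((x[d] - lo[d]) / span[d] * n_intervals),
                         n_intervals - 1) for d in subspace)
        counts[unit] += 1
    return {u for u, c in counts.items() if c / len(X) > tau}

def interesting_subspaces(X, n_intervals, tau, max_dim):
    """Bottom-up (Apriori-style) search: a k-dimensional subspace is examined
    only if all of its (k-1)-dimensional projections contain dense units."""
    X = np.asarray(X, dtype=float)
    level = {(d,): dense_units(X, (d,), n_intervals, tau) for d in range(X.shape[1])}
    level = {s: u for s, u in level.items() if u}
    result = dict(level)
    for k in range(2, max_dim + 1):
        candidates = {tuple(sorted(set(a) | set(b)))
                      for a in level for b in level if len(set(a) | set(b)) == k}
        candidates = {c for c in candidates
                      if all(s in level for s in combinations(c, k - 1))}
        level = {c: dense_units(X, c, n_intervals, tau) for c in candidates}
        level = {s: u for s, u in level.items() if u}
        result.update(level)
    return result  # maps each interesting subspace to its set of dense units
```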

28
[Figure: CLIQUE example with density threshold τ = 3. The data space is partitioned into equal-length intervals along salary (×$10,000), vacation (weeks), and age; dense units are identified in the 2-D subspaces (age, salary) and (age, vacation) and combined to locate dense regions in the 3-D (age, vacation, salary) space.]
29
Strength and Weakness of CLIQUE
 Strength
 It automatically finds subspaces of the highest

dimensionality such that high density clusters exist in those


subspaces
 It is insensitive to the order of records in input and does not

presume some canonical data distribution


 It scales linearly with the size of input and has good

scalability as the number of dimensions in the data increases


 Weakness
 The accuracy of the clustering result may be degraded in

exchange for the simplicity of the method


 Quite expensive

30
Self-organizing feature maps (SOMs)

 Clustering is also performed by having several


units competing for the current object
 The unit whose weight vector is closest to the
current object wins
 The winner and its neighbors learn by having
their weights adjusted
 SOMs are believed to resemble processing that
can occur in the brain
 Useful for visualizing high-dimensional data in
2- or 3-D space
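A minimal competitive-learning sketch of this idea (the grid size and the learning-rate and neighbourhood schedules are illustrative assumptions, not a specific SOM library's API):

```python
import numpy as np

def train_som(X, grid_shape=(10, 10), epochs=20, lr0=0.5, radius0=3.0):
    """Minimal SOM training loop: for each object, find the winning unit and
    pull the winner and its grid neighbours toward the object."""
    rng = np.random.default_rng(0)
    X = np.asarray(X, dtype=float)
    rows, cols = grid_shape
    weights = rng.random((rows, cols, X.shape[1]))           # one weight vector per unit
    grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), -1)
    for t in range(epochs):
        lr = lr0 * (1 - t / epochs)                          # decaying learning rate
        radius = max(radius0 * (1 - t / epochs), 0.5)        # shrinking neighbourhood
        for x in rng.permutation(X):
            # winner = unit whose weight vector is closest to the current object
            d = np.linalg.norm(weights - x, axis=2)
            winner = np.unravel_index(np.argmin(d), d.shape)
            # Gaussian neighbourhood on the 2-D grid, centred at the winner
            gdist2 = np.sum((grid - np.array(winner)) ** 2, axis=2)
            h = np.exp(-gdist2 / (2 * radius ** 2))[..., None]
            weights += lr * h * (x - weights)                # move units toward x
    return weights
```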

31
References (1)
 R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of
high dimensional data for data mining applications. SIGMOD'98
 M. R. Anderberg. Cluster Analysis for Applications. Academic Press, 1973.
 M. Ankerst, M. Breunig, H.-P. Kriegel, and J. Sander. Optics: Ordering points to identify
the clustering structure, SIGMOD’99.
 P. Arabie, L. J. Hubert, and G. De Soete. Clustering and Classification. World Scientific, 1996.
 M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering
clusters in large spatial databases. KDD'96.
 M. Ester, H.-P. Kriegel, and X. Xu. Knowledge discovery in large spatial databases:
Focusing techniques for efficient class identification. SSD'95.
 D. Fisher. Knowledge acquisition via incremental conceptual clustering. Machine Learning,
2:139-172, 1987.
 D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An approach based
on dynamic systems. In Proc. VLDB’98.
 S. Guha, R. Rastogi, and K. Shim. Cure: An efficient clustering algorithm for large
databases. SIGMOD'98.
 A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.
32
References (2)

 L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: an Introduction to Cluster


Analysis. John Wiley & Sons, 1990.
 E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large datasets.
VLDB’98.
 G. J. McLachlan and K. E. Basford. Mixture Models: Inference and Applications to
Clustering. John Wiley and Sons, 1988.
 P. Michaud. Clustering techniques. Future Generation Computer Systems, 13, 1997.
 R. Ng and J. Han. Efficient and effective clustering method for spatial data mining.
VLDB'94.
 E. Schikuta. Grid clustering: An efficient hierarchical clustering method for very large data
sets. Proc. 1996 Int. Conf. on Pattern Recognition, 101-105.
 G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: A multi-resolution clustering
approach for very large spatial databases. VLDB’98.
 W. Wang, J. Yang, and R. Muntz. STING: A Statistical Information Grid Approach to Spatial Data
Mining. VLDB’97.
 T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: an efficient data clustering method for
very large databases. SIGMOD'96.

33