
Birch: An Efficient Data Clustering Method for Very Large Databases
Tian Zhang, Raghu Ramakrishnan, Miron Livny
CPSC 504
Presenter: Joel Lanir
Discussion: Dan Li

Outline
What is data clustering
Data clustering applications
Previous approaches
Birch's goal
Clustering Feature
Birch clustering algorithm
Clustering example

What is Data Clustering?
A cluster is a closely-packed group.
A cluster is a collection of data objects that are similar to one another and treated collectively as a group.
Data clustering is the partitioning of a dataset into clusters.

Data Clustering
Helps understand the natural grouping or structure in a dataset.
Works on large sets of multidimensional data.
The data space is usually not uniformly occupied.
Identifies the sparse and crowded places.
Helps visualization.

Discussion
Can you give some examples of very large databases? What applications can you imagine that require clustering over such large databases?
What special requirements do large databases pose on clustering, or more generally on data mining?

Some Clustering Applications
Biology: building groups of genes with related patterns.
Marketing: partitioning the population of consumers into market segments.
Division of WWW pages into genres.
Image segmentation for object recognition.
Land use: identification of areas of similar land use from satellite images.
Insurance: identifying groups of policy holders with a high average claim cost.

Data Clustering: Previous Approaches
Distance based (statistics): there must be a distance metric between two items.
Probability based (machine learning): makes the wrong assumption that the distributions on attributes are independent of each other.
Probability representations of clusters are expensive.

Previous Approaches: Problems
They assume that all data points fit in memory and can be scanned frequently.
They ignore the fact that not all data points are equally important.
Close data points are not gathered together.
They inspect all data points over multiple iterations, which is expensive.
These approaches do not deal with dataset and memory size issues!

Clustering Parameters
Centroid: the Euclidean center of the cluster.
Radius: the average distance of the points to the center.
Diameter: the average pairwise distance within a cluster.
Radius and diameter are measures of the tightness of a cluster around its center. We wish to keep these low.

Clustering Parameters (cont.)
Other measurements (like the Euclidean distance between the centroids of two clusters) measure how far apart two clusters are.
A good quality clustering produces tight clusters (high intra-cluster similarity) that are well separated from each other (low inter-cluster similarity).
A good quality clustering can help find hidden patterns.
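As a rough illustration of these parameters (not code from the paper), the sketch below computes the centroid, radius and diameter of a small cluster with NumPy. It uses the root-mean-square forms of radius and diameter given in the BIRCH paper; the function names are my own.

    import numpy as np

    def centroid(points):
        # Centroid: the Euclidean center (mean) of the points.
        return points.mean(axis=0)

    def radius(points):
        # Radius: RMS distance of the points to the centroid.
        c = centroid(points)
        return np.sqrt(((points - c) ** 2).sum(axis=1).mean())

    def diameter(points):
        # Diameter: RMS pairwise distance between points in the cluster.
        diffs = points[:, None, :] - points[None, :, :]
        d2 = (diffs ** 2).sum(axis=-1)
        n = len(points)
        return np.sqrt(d2.sum() / (n * (n - 1)))

    pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
    print(centroid(pts), radius(pts), diameter(pts))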

Birch's Goals
Minimize running time and data scans, thus formulating the problem for large databases.
Make clustering decisions without scanning the whole data.
Exploit the non-uniformity of the data: treat dense areas as one, and remove outliers (noise).

Clustering Feature (CF)
A CF is a compact summary of the data points in a cluster.
It holds enough information to calculate intra-cluster distances.
The additivity theorem allows us to merge sub-clusters.

Clustering Feature (CF)
Given N d-dimensional data points in a cluster {Xi}, i = 1, 2, ..., N:
CF = (N, LS, SS)
N is the number of data points in the cluster,
LS is the linear sum of the N data points,
SS is the square sum of the N data points.

CF Additivity Theorem
If CF1 = (N1, LS1, SS1) and CF2 = (N2, LS2, SS2) are the CF entries of two disjoint sub-clusters, then the CF entry of the sub-cluster formed by merging the two disjoint sub-clusters is:
CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2)
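A minimal Python sketch of a CF, assuming the class and method names below (they are illustrative, not from the authors' implementation). It shows how (N, LS, SS) is built from points, how the additivity theorem merges two CFs, and how the centroid and radius can be recovered from the summary alone.

    import numpy as np

    class CF:
        def __init__(self, n, ls, ss):
            self.n = n      # N: number of points in the sub-cluster
            self.ls = ls    # LS: linear sum of the points (d-dimensional vector)
            self.ss = ss    # SS: square sum of the points (scalar)

        @classmethod
        def from_points(cls, points):
            points = np.asarray(points, dtype=float)
            return cls(len(points), points.sum(axis=0), float((points ** 2).sum()))

        def merge(self, other):
            # Additivity theorem: CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2)
            return CF(self.n + other.n, self.ls + other.ls, self.ss + other.ss)

        def centroid(self):
            return self.ls / self.n

        def radius(self):
            # RMS distance of the points to the centroid, from (N, LS, SS) alone:
            # R^2 = SS/N - ||LS/N||^2
            return np.sqrt(max(self.ss / self.n - float((self.centroid() ** 2).sum()), 0.0))

    a = CF.from_points([[0, 0], [2, 0]])
    b = CF.from_points([[0, 2], [2, 2]])
    m = a.merge(b)
    print(m.n, m.centroid(), m.radius())   # 4 [1. 1.] 1.414...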

CF Tree
B = max. number of CF entries in a non-leaf node
L = max. number of CF entries in a leaf node
T = max. radius of a sub-cluster
[Diagram: the root and every non-leaf node hold entries CF1 ... CFb, each with a child pointer child1 ... childb; leaf nodes hold entries CF1 ... CFL and are chained together with prev/next pointers.]

CF Tree (cont.)
T is the threshold for the diameter or radius of the leaf entries.
The tree size is a function of T: the bigger T is, the smaller the tree will be.
The CF tree is built dynamically as the data is scanned.
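A sketch of the node structures behind that diagram, reusing the CF class from the earlier sketch. The class names, and the example values of B, L and T, are assumptions for illustration only; the code does not enforce the limits itself.

    from dataclasses import dataclass, field
    from typing import List, Optional

    B = 4     # assumed: max CF entries in a non-leaf node
    L = 4     # assumed: max CF entries in a leaf node
    T = 0.5   # assumed: radius/diameter threshold for a leaf sub-cluster

    @dataclass
    class NonLeafNode:
        entries: List["CF"] = field(default_factory=list)   # one CF summarizing each child subtree
        children: list = field(default_factory=list)         # child nodes, same length as entries

    @dataclass
    class LeafNode:
        entries: List["CF"] = field(default_factory=list)   # sub-cluster CFs, each kept within threshold T
        prev: Optional["LeafNode"] = None                     # leaves are chained with prev/next pointers
        next: Optional["LeafNode"] = None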

CF Tree Insertion
Identify the appropriate leaf: recursively descend the CF tree, choosing the closest child node according to a chosen distance metric.
Modify the leaf: test whether the leaf can absorb the new point without violating the threshold. If there is no room, split the node.
Modify the path: update the CF information up the path (a rough sketch follows the next slide).

Birch Clustering Algorithm
Phase 1: scan all data and build an initial in-memory CF tree.
Phase 2: condense the tree to a desirable size by building a smaller CF tree.
Phase 3: global clustering.
Phase 4: cluster refining; this is optional, and requires more passes over the data to refine the results.
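A rough sketch of the insertion step, under the assumed CF, LeafNode and NonLeafNode classes from the earlier sketches. It simplifies the real algorithm: leaf splitting is only noted in a comment, and path updates are reduced to merging the new point's CF into each entry on the way down.

    import numpy as np

    def insert_point(node, x, threshold):
        x = np.asarray(x, dtype=float)
        point_cf = CF.from_points([x])
        if isinstance(node, LeafNode):
            if node.entries:
                # Pick the closest leaf entry by centroid distance.
                i = int(np.argmin([np.linalg.norm(e.centroid() - x) for e in node.entries]))
                merged = node.entries[i].merge(point_cf)
                if merged.radius() <= threshold:     # the entry absorbs the point
                    node.entries[i] = merged
                    return
            node.entries.append(point_cf)            # otherwise start a new sub-cluster
            # (a real implementation splits the leaf when len(node.entries) > L)
        else:
            # Descend into the closest child, then update the CF summary on the path.
            i = int(np.argmin([np.linalg.norm(e.centroid() - x) for e in node.entries]))
            insert_point(node.children[i], x, threshold)
            node.entries[i] = node.entries[i].merge(point_cf)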

Birch Phase 1
Start with an initial threshold and insert points into the tree.
If memory runs out, increase the threshold value and rebuild a smaller tree by reinserting the values from the old tree, and then the remaining values.
A good initial threshold is important but hard to figure out.
Outliers can be removed when the tree is rebuilt.

Birch Phase 2
Optional.
The global clustering algorithm used in Phase 3 performs well only within a certain input size range, so Phase 2 condenses the tree to prepare it for Phase 3.
Removes outliers and groups crowded sub-clusters.
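A very rough sketch of the Phase 1 control flow, built on the insert_point sketch above. It models the whole tree as a single leaf and "running out of memory" as exceeding a fixed number of entries, and it reinserts old entries by their centroids only; the real algorithm reinserts the full CF entries into a new tree. Everything here is assumed for illustration.

    def build_cf_tree(points, threshold, max_entries=50):
        root = LeafNode()
        for x in points:
            insert_point(root, x, threshold)
            if len(root.entries) > max_entries:       # stand-in for the memory check
                threshold *= 2.0                       # heuristic threshold increase (assumed)
                old_entries, root = root.entries, LeafNode()
                for e in old_entries:                  # simplification: reinsert by centroid only
                    insert_point(root, e.centroid(), threshold)
        return root, threshold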

Birch Phase 3
Problems after Phase 1:
Input order affects the results.
Splitting is triggered by node size, not by cluster quality.
Phase 3: cluster all leaf entries on their CF values using an existing clustering algorithm.
Algorithm used here: agglomerative hierarchical clustering.

Birch Phase 4
Optional.
Additional scan(s) of the dataset, attaching each item to the closest of the centroids found in Phase 3.
Recalculate the centroids and redistribute the items.
Always converges.
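A minimal sketch of a Phase 4-style refinement pass, assuming the centroids from Phase 3 are given as an array: one extra scan assigns every point to its nearest centroid and then recomputes the centroids, much like a single k-means step. The function name is illustrative.

    import numpy as np

    def refine(points, centroids, passes=1):
        points = np.asarray(points, dtype=float)
        centroids = np.asarray(centroids, dtype=float)
        for _ in range(passes):
            # Assign each point to the nearest centroid.
            d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=-1)
            labels = d.argmin(axis=1)
            # Recompute each centroid as the mean of its assigned points.
            for k in range(len(centroids)):
                members = points[labels == k]
                if len(members):
                    centroids[k] = members.mean(axis=0)
        return labels, centroids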

Clustering Example
Pixel classification in images.
[Figure: from top to bottom, the visible wavelength band, the near-infrared band, the K-means clustering into 5 classes, and the BIRCH classification; the feature-space plots are labelled band1, band2 and band224.]

Conclusions
Birch performs faster than the existing algorithms on large datasets.
It scans the whole data only once.
It handles outliers.

Discussion
After reading the two papers for data mining, what do you think are the criteria for judging whether a data mining algorithm is good?
Efficiency?
I/O cost?
Memory/disk requirements?
Stability?
Immunity to abnormal data?

Thanks for listening
