Data Clustering

Distance based (statistics):
- Requires a distance metric between two items.

Probability based (machine learning):
- Makes the wrong assumption that distributions on attributes are independent of each other.
- Probability representations of clusters are expensive.

Previous Approaches

- Assume that all data points are in memory and can be scanned frequently.
- Ignore the fact that not all data points are equally important.
- Close data points are not gathered together.
- Inspecting all data points over multiple iterations is expensive.

These approaches do not deal with dataset and memory size issues!
Clustering Feature (CF)

Given N d-dimensional data points in a cluster {Xi}, i = 1, 2, ..., N, the clustering feature is

    CF = (N, LS, SS)

where
- N is the number of data points in the cluster,
- LS is the linear sum of the N data points,
- SS is the square sum of the N data points.

CF Additivity Theorem

If CF1 = (N1, LS1, SS1) and CF2 = (N2, LS2, SS2) are the CF entries of two disjoint subclusters, then the CF entry of the subcluster formed by merging the two disjoint subclusters is

    CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2)
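The CF vector and the additivity theorem translate directly into code. Below is a minimal sketch in Python; the class name CF and its method names are illustrative, not from the BIRCH paper, and SS is kept as a scalar square sum (the sum of Xi . Xi):

```python
import numpy as np

class CF:
    """Clustering feature (N, LS, SS) of a subcluster."""
    def __init__(self, n, ls, ss):
        self.n = n                              # N: number of points
        self.ls = np.asarray(ls, dtype=float)   # LS: linear sum, a d-vector
        self.ss = float(ss)                     # SS: square sum, sum of Xi . Xi

    @classmethod
    def from_point(cls, x):
        x = np.asarray(x, dtype=float)
        return cls(1, x.copy(), float(x @ x))

    def __add__(self, other):
        # CF Additivity Theorem: merging disjoint subclusters
        # is entry-wise addition of the CF vectors.
        return CF(self.n + other.n, self.ls + other.ls, self.ss + other.ss)

    def centroid(self):
        return self.ls / self.n

    def radius(self):
        # R = sqrt(average squared distance from the centroid); from
        # (N, LS, SS) alone: R^2 = (SS - |LS|^2 / N) / N.
        return np.sqrt(max(self.ss - self.ls @ self.ls / self.n, 0.0) / self.n)
```

Because N, LS, and SS are additive, statistics such as the centroid and radius of any merged subcluster can be computed without revisiting the raw points, which is what allows the tree to summarize the data in a single scan.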
CF Tree

[Diagram: a CF tree. The root and each non-leaf node hold up to b entries (CF1, child1), ..., (CFb, childb); each leaf node holds up to L entries CF1, ..., CFL and is linked to its neighboring leaves by prev/next pointers.]

- T is the threshold for the diameter or radius of the leaf node entries.
- The tree size is a function of T: the bigger T is, the smaller the tree will be.
- The CF tree is built dynamically as data is scanned.
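To illustrate how T governs insertion, here is a deliberately simplified leaf-level sketch building on the CF class above. It skips non-leaf navigation, node splits, and the branching factor b: a new point is absorbed by the closest entry if the merged entry still satisfies the radius threshold, and otherwise starts a new entry.

```python
def insert_point(leaf_entries, x, T):
    """Insert x into a flat list of leaf CF entries under radius threshold T."""
    new_cf = CF.from_point(x)
    if leaf_entries:
        # Find the entry whose centroid is closest to the new point.
        i = min(range(len(leaf_entries)),
                key=lambda j: np.linalg.norm(
                    leaf_entries[j].centroid() - new_cf.centroid()))
        merged = leaf_entries[i] + new_cf
        if merged.radius() <= T:
            leaf_entries[i] = merged   # absorb the point via additivity
            return
    leaf_entries.append(new_cf)        # otherwise start a new subcluster
```

A larger T lets each entry absorb more points, so fewer entries are created and the tree shrinks, matching the note above.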
Birch - Phase 1

- Start with an initial threshold and insert points into the tree.
- If memory runs out, increase the threshold value and rebuild a smaller tree by reinserting values from the old tree and then the remaining values.
- A good initial threshold is important but hard to figure out.
- Outliers are removed when the tree is rebuilt.

Birch - Phase 2

- Optional.
- Phase 3 sometimes has a minimum input size at which it performs well, so Phase 2 prepares the tree for Phase 3.
- Removes outliers and groups crowded subclusters.
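For experimenting with the phases end to end, scikit-learn ships a BIRCH implementation whose threshold and branching_factor parameters correspond to T and b above. A minimal usage sketch, with made-up toy data for illustration:

```python
import numpy as np
from sklearn.cluster import Birch

# Three well-separated 2-D blobs as toy input.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(100, 2))
               for c in (0.0, 3.0, 6.0)])

# threshold ~ T, branching_factor ~ b; n_clusters drives the global phase.
model = Birch(threshold=0.5, branching_factor=50, n_clusters=3)
labels = model.fit_predict(X)   # one scan builds the CF tree, then clusters globally
print(np.bincount(labels))      # expect roughly 100 points per cluster
```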
Conclusions

- Birch performs faster than existing algorithms on large datasets.
- Scans the whole dataset only once.
- Handles outliers.

Discussion

After reading the two papers for data mining, what do you think are the criteria for saying that a data mining algorithm is good?
- Efficiency?
- I/O cost?
- Memory/disk requirements?
- Stability?
- Immunity to abnormal data?