BIRCH vs Coresets
Modified BIRCH
Johannes Blömer, Daniel Kuntze, Kathrin Bujna (Paderborn)
Hendrik Fichtenberger, Marc Gillé, Chris Schwiegelshohn, Christian Sohler, Melanie Schmidt (Dortmund)
End
Introduction
BIRCH vs Coresets
Modified BIRCH
End
... in Data Streams:
Points arrive in a stream one after the other
arbitrary order
only one pass over the data allowed
limited storage capacity
In Practice: BIRCH as a well-known heuristic
In Theory: Coreset Theory
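Clustering Feature
Fact
Let $\mu = \frac{1}{|P|} \sum_{p \in P} p$ be the centroid of a finite point set $P$. For every point $z$, the sum of the squared distances satisfies the equation
$$\sum_{p \in P} \|p - z\|^2 = \sum_{p \in P} \|p - \mu\|^2 + |P| \cdot \|\mu - z\|^2 .$$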
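A minimal sketch of how this fact lets a small summary answer cost queries without storing the points. It assumes the standard BIRCH clustering feature, that is, the triple (number of points, sum of the points, sum of the squared norms); the class name `ClusteringFeature` and its method names are illustrative and not taken from the slides.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class ClusteringFeature:
    """Hypothetical summary of a point set P: |P|, sum of p, sum of ||p||^2."""
    n: int
    linear_sum: List[float]
    squared_sum: float

    @classmethod
    def from_points(cls, points: Sequence[Sequence[float]]) -> "ClusteringFeature":
        dim = len(points[0])
        linear_sum = [sum(p[i] for p in points) for i in range(dim)]
        squared_sum = sum(sum(x * x for x in p) for p in points)
        return cls(len(points), linear_sum, squared_sum)

    def centroid(self) -> List[float]:
        return [s / self.n for s in self.linear_sum]

    def cost_to_centroid(self) -> float:
        # sum ||p - mu||^2 = sum ||p||^2 - ||sum p||^2 / |P|
        return self.squared_sum - sum(s * s for s in self.linear_sum) / self.n

    def cost_to(self, z: Sequence[float]) -> float:
        # The fact: sum ||p - z||^2 = sum ||p - mu||^2 + |P| * ||mu - z||^2
        mu = self.centroid()
        shift = sum((m - c) ** 2 for m, c in zip(mu, z))
        return self.cost_to_centroid() + self.n * shift


def direct_cost(points: Sequence[Sequence[float]], z: Sequence[float]) -> float:
    """Reference computation that touches every point."""
    return sum(sum((x - c) ** 2 for x, c in zip(p, z)) for p in points)


points = [(0.0, 0.0), (2.0, 0.0), (1.0, 3.0)]
cf = ClusteringFeature.from_points(points)
z = (1.0, -1.0)
print(cf.cost_to(z), direct_cost(points, z))
```

Both printed values agree, so the cost of assigning all summarized points to one center can be computed from three stored quantities.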
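BIRCH
uses Clustering Features
CFs are stored in a CF Tree, nodes contain the CFs
p is added to the CF representing subset S if the radius
$$\sqrt{\frac{\sum_{q \in S \cup \{p\}} \|q - \mu_{S \cup \{p\}}\|^2}{|S| + 1}}$$
stays within the threshold $T$, where $\mu_{S \cup \{p\}}$ is the centroid of $S \cup \{p\}$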
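The insertion rule can be sketched as follows. This is an illustration under assumptions, not the original BIRCH code: the CF is taken to be the triple (count, linear sum, sum of squared norms), the comparison is taken to be "radius at most T", and the function names `add_point`, `radius` and `try_insert` are made up for this example.

```python
import math
from typing import List, Sequence, Tuple

# Assumed CF layout: (count, linear sum, sum of squared norms)
CF = Tuple[int, List[float], float]


def add_point(cf: CF, p: Sequence[float]) -> CF:
    """Return the CF of S united with {p}, without modifying the input CF."""
    n, ls, ss = cf
    return (n + 1, [a + b for a, b in zip(ls, p)], ss + sum(x * x for x in p))


def radius(cf: CF) -> float:
    """Root of the mean squared distance of the summarized points to their centroid."""
    n, ls, ss = cf
    mean_sq = ss / n - sum((s / n) ** 2 for s in ls)
    return math.sqrt(max(mean_sq, 0.0))  # clamp tiny negative rounding errors


def try_insert(cf: CF, p: Sequence[float], threshold: float) -> Tuple[CF, bool]:
    """Absorb p if the radius after insertion is at most the threshold (assumed rule)."""
    candidate = add_point(cf, p)
    if radius(candidate) <= threshold:
        return candidate, True
    return cf, False


# S = {(0, 0), (2, 0)} summarized as a CF
cf_s: CF = (2, [2.0, 0.0], 4.0)
near_cf, near_ok = try_insert(cf_s, (1.0, 0.0), threshold=1.0)
far_cf, far_ok = try_insert(cf_s, (10.0, 0.0), threshold=1.0)
print(near_ok, far_ok)
```

A point that is rejected would have to go to another CF or open a new one; that part of the tree logic is not shown here.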
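Coreset Theory
Coresets
Given a set of points $P$, a weighted subset $S \subset P$ is a $(k, \varepsilon)$-coreset if for all sets $C$ of $k$ centers it holds
$$|\mathrm{cost}_w(S, C) - \mathrm{cost}(P, C)| \le \varepsilon \cdot \mathrm{cost}(P, C),$$
where $\mathrm{cost}_w(S, C) = \sum_{p \in S} \min_{c \in C} w(p) \, \|p - c\|^2$.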
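To make the definition concrete, the sketch below evaluates the coreset inequality for one given center set. It assumes that cost(P, C) is the unweighted k-means cost (the sum over all points of the squared distance to the nearest center); the function names are illustrative. Passing the check for a few center sets does not prove the coreset property, which has to hold for all sets of k centers.

```python
from typing import Optional, Sequence

Point = Sequence[float]


def kmeans_cost(points: Sequence[Point], centers: Sequence[Point],
                weights: Optional[Sequence[float]] = None) -> float:
    """Sum over points of weight times squared distance to the nearest center."""
    if weights is None:
        weights = [1.0] * len(points)
    total = 0.0
    for p, w in zip(points, weights):
        nearest = min(sum((x - c) ** 2 for x, c in zip(p, center)) for center in centers)
        total += w * nearest
    return total


def satisfies_bound(P: Sequence[Point], S: Sequence[Point], w: Sequence[float],
                    C: Sequence[Point], eps: float) -> bool:
    """Check |cost_w(S, C) - cost(P, C)| <= eps * cost(P, C) for this one C."""
    full = kmeans_cost(P, C)
    return abs(kmeans_cost(S, C, w) - full) <= eps * full


P = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
S = [(0.0, 0.0), (10.0, 0.0)]   # weighted subset of P
w = [2.0, 2.0]

good_centers = [(0.0, 1.0), (10.0, 1.0)]
bad_centers = [(0.0, 0.0), (10.0, 0.0)]
print(satisfies_bound(P, S, w, good_centers, 0.1))
print(satisfies_bound(P, S, w, bad_centers, 0.1))
```

The second center set shows why the quantifier matters: the same weighted subset that looks exact for one choice of centers violates the bound for another, so it is not a coreset for this value of eps.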
Coreset constructions
01: Agarwal, Har-Peled and Varadarajan: Coreset concept
02: Bădoiu, Har-Peled and Indyk: First coreset construction for clustering problems
04: Har-Peled and Mazumdar: Coreset of size $O(k \varepsilon^{-d} \log n)$, maintainable in data streams
05: Frahling and Sohler: Coreset of size $O(k \varepsilon^{-d} \log n)$, insertion-deletion data streams
06: Chen: Coresets for metric and Euclidean k-median and k-means, polynomial in $d$, $n$ and $\varepsilon^{-1}$
07: Feldman, Monemizadeh, Sohler: weak coresets, $\mathrm{poly}(k, \varepsilon^{-1})$
10: Langberg, Schulman: $O(d^2 k^3 / \varepsilon^2)$
11: Feldman, Langberg: $O(dk / \varepsilon^2)$
Clustering in Data Streams: Improving BIRCH
Merge & Reduce: Coreset Construction → Streaming Algorithms
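The Merge & Reduce idea can be sketched as a binary counter over buckets: every full batch is reduced to a summary, and two summaries on the same level are merged and reduced again. The sketch is generic and makes several assumptions: `reduce_fn` stands for any coreset construction supplied by the caller, the class name `MergeReduce` and the batch size are illustrative, and `collapse_to_centroid` is only a toy reduction used to exercise the bookkeeping. It preserves the total weight but carries no coreset guarantee.

```python
from typing import Callable, Dict, List, Tuple

Point = Tuple[float, ...]
Weighted = List[Tuple[Point, float]]  # (point, weight) pairs


class MergeReduce:
    """Hypothetical streaming wrapper around a reduction supplied by the caller."""

    def __init__(self, batch_size: int, reduce_fn: Callable[[Weighted], Weighted]):
        self.batch_size = batch_size
        self.reduce_fn = reduce_fn
        self.buffer: Weighted = []
        self.levels: Dict[int, Weighted] = {}  # at most one summary per level

    def insert(self, point: Point) -> None:
        self.buffer.append((point, 1.0))
        if len(self.buffer) < self.batch_size:
            return
        summary, level = self.reduce_fn(self.buffer), 0
        self.buffer = []
        # Carry like a binary counter: merge equal levels, then reduce again.
        while level in self.levels:
            summary = self.reduce_fn(self.levels.pop(level) + summary)
            level += 1
        self.levels[level] = summary

    def summary(self) -> Weighted:
        result = list(self.buffer)
        for level_summary in self.levels.values():
            result.extend(level_summary)
        return result


def collapse_to_centroid(items: Weighted) -> Weighted:
    """Toy reduction (assumption): replace the set by its weighted centroid."""
    total = sum(w for _, w in items)
    dim = len(items[0][0])
    centroid = tuple(sum(p[i] * w for p, w in items) / total for i in range(dim))
    return [(centroid, total)]


mr = MergeReduce(batch_size=2, reduce_fn=collapse_to_centroid)
for i in range(7):
    mr.insert((float(i),))
after_seven = mr.summary()
mr.insert((7.0,))
after_eight = mr.summary()
print(len(after_seven), after_eight)
```

With n points and batch size b, at most about log2(n / b) summaries are stored at any time, which is where the logarithmic overhead of streaming coresets built this way comes from.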
StreamKM++
also an outcome of this SPP
practical k-means streaming algorithm
computes a coreset, moderate storage requirement
better solutions than BIRCH, but slower
Lemma
Depending on the threshold $T$, BIRCH either needs $|P|/d$ CFs or it computes no constant-factor approximation.
(For a generalized example)
Insertion of a point (figure, with labels involving $f(\varepsilon)$ and OPT)
Analysis: Quality
Inspired by known coreset constructions
Distinguish between points close to optimal centers (→ packing argument)
and points far away from the centers (error negligible compared to the clustering cost)
Theorem
The modified BIRCH algorithm computes a $(k, \varepsilon)$-coreset if OPT is known, and it can be adapted to the case that OPT is not known. The size of the coreset is
!
k d
k
log n log2 log n .
O
+ 2cd
f ()
f ()