
Introduction

BIRCH vs Coresets

Modified BIRCH

Clustering in Data Streams: Improving BIRCH


Project: Practical Theory for Clustering Algorithms

Johannes Blömer, Daniel Kuntze, Kathrin Bujna (Paderborn)
Christian Sohler, Melanie Schmidt (Dortmund)
Hendrik Fichtenberger, Marc Gille, Chris Schwiegelshohn

Clustering Algorithms: Practice and Theory

Clustering: Grouping of similar objects according to some distance measure

The k-means Problem

Given a point set $P \subset \mathbb{R}^d$, compute a set $C \subset \mathbb{R}^d$ with $|C| = k$ centers which minimizes

$$\mathrm{cost}(P, C) = \sum_{p \in P} \min_{c \in C} \|c - p\|^2,$$

the sum of the squared distances.
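
The k-means objective can be evaluated directly from its definition. The following is a minimal sketch in plain Python; the function names `squared_distance` and `kmeans_cost` and the toy data are illustrative assumptions, not part of the slides.

```python
def squared_distance(p, c):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((pi - ci) ** 2 for pi, ci in zip(p, c))


def kmeans_cost(points, centers):
    """Sum over all points of the squared distance to the closest center."""
    return sum(min(squared_distance(p, c) for c in centers) for p in points)


# Toy instance (made up for illustration): three points, k = 2 centers.
points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)]
centers = [(0.5, 0.0), (10.0, 0.0)]

# The two left points each pay 0.25, the right point pays 0.
print(kmeans_cost(points, centers))
```

This brute-force evaluation touches every point for every center, which is exactly what a streaming algorithm with limited storage cannot afford to do on the full input.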


. . . in Data Streams:
Points arrive in a stream one after the other
arbitrary order
only one pass over the data allowed
limited storage capacity
In Practice: BIRCH as a well-known heuristic
In Theory: Coreset Theory


Data Stream Clustering in Practice and Theory

Clustering Feature

Fact
The sum of the squared distances satisfies the equation

$$\sum_{p \in P} \|p - z\|^2 = \sum_{p \in P} \|p - \mu\|^2 + |P| \, \|\mu - z\|^2$$

where $\mu$ is the centroid of $P$.

Simply store $|P|$, $\sum_{p \in P} p$ and $\sum_{p \in P} \|p\|^2$:

$$\sum_{p \in P} \|p\|^2 = \sum_{p \in P} \|p - \mu\|^2 + |P| \, \|\mu\|^2$$

$$\sum_{p \in P} \|p - z\|^2 = \sum_{p \in P} \|p\|^2 - |P| \, \|\mu\|^2 + |P| \, \|\mu - z\|^2$$
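
The three stored values are enough to evaluate the cost of a point set with respect to any center without keeping the points. Below is a minimal sketch of such a clustering feature; the class name `ClusteringFeature`, its method names, and the sample points are assumptions made for illustration.

```python
class ClusteringFeature:
    """Stores |P|, the sum of the points, and the sum of squared norms."""

    def __init__(self, dim):
        self.n = 0                      # |P|
        self.linear_sum = [0.0] * dim   # sum of all points
        self.squared_sum = 0.0          # sum of ||p||^2

    def add(self, p):
        """Update the three statistics with one new point."""
        self.n += 1
        self.linear_sum = [s + x for s, x in zip(self.linear_sum, p)]
        self.squared_sum += sum(x * x for x in p)

    def centroid(self):
        return [s / self.n for s in self.linear_sum]

    def cost_to(self, z):
        """sum ||p - z||^2 = sum ||p||^2 - |P| ||mu||^2 + |P| ||mu - z||^2."""
        mu = self.centroid()
        mu_norm2 = sum(m * m for m in mu)
        mu_z_dist2 = sum((m - zi) ** 2 for m, zi in zip(mu, z))
        return self.squared_sum - self.n * mu_norm2 + self.n * mu_z_dist2


# Compare against the direct computation on a small made-up point set.
pts = [(0.0, 0.0), (2.0, 0.0), (4.0, 3.0)]
z = (1.0, 1.0)

cf = ClusteringFeature(dim=2)
for p in pts:
    cf.add(p)

direct = sum(sum((pi - zi) ** 2 for pi, zi in zip(p, z)) for p in pts)
print(cf.cost_to(z), direct)
```

Because all three statistics are sums, two clustering features can be combined by adding them componentwise, which is what makes them convenient in a single pass over a stream.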


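BIRCH
uses Clustering Features
CFs are stored in a CF Tree, nodes contain the CFs

Insertion of a new point

When a new point $p$ is added to the CF Tree, BIRCH searches for the closest CF according to

$$\sum_{q' \in S \cup \{p\}} \left\| q' - \frac{\sum_{q \in S \cup \{p\}} q}{|S| + 1} \right\|^2 - \sum_{q' \in S} \left\| q' - \frac{\sum_{q \in S} q}{|S|} \right\|^2$$

$p$ is added to the CF representing subset $S$ if

$$\sqrt{\frac{\sum_{q \in S \cup \{p\}} \|q - \mu_{S \cup \{p\}}\|^2}{|S| + 1}} \le T$$

for a given threshold $T$, where $\mu_{S \cup \{p\}}$ is the centroid of $S \cup \{p\}$.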
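
Both quantities above can be computed from the stored statistics alone. The sketch below is a simplified illustration that keeps the CFs in a flat list instead of a CF Tree; the tuple layout `(n, linear_sum, squared_sum)` and all function names are assumptions, not BIRCH's actual implementation.

```python
import math


def cf_add(cf, p):
    """Return the CF (n, linear_sum, squared_sum) after adding point p."""
    n, ls, ss = cf
    return (n + 1, tuple(a + b for a, b in zip(ls, p)), ss + sum(x * x for x in p))


def cf_cost(cf):
    """Sum of squared distances of the represented points to their centroid."""
    n, ls, ss = cf
    return ss - sum(x * x for x in ls) / n


def cost_increase(cf, p):
    """The closeness measure: cost of S with p minus cost of S."""
    return cf_cost(cf_add(cf, p)) - cf_cost(cf)


def radius_after_insert(cf, p):
    """Square root of the average squared distance to the centroid of S with p."""
    new = cf_add(cf, p)
    return math.sqrt(max(cf_cost(new), 0.0) / new[0])


def birch_insert(cfs, p, threshold):
    """Insert p into the closest CF if the radius test passes, else open a new CF."""
    if cfs:
        best = min(range(len(cfs)), key=lambda i: cost_increase(cfs[i], p))
        if radius_after_insert(cfs[best], p) <= threshold:
            cfs[best] = cf_add(cfs[best], p)
            return best
    cfs.append(cf_add((0, (0.0,) * len(p), 0.0), p))
    return len(cfs) - 1


# Made-up stream with two well separated groups and threshold T = 0.5.
cfs = []
for point in [(0.0, 0.0), (0.2, 0.0), (5.0, 5.0), (5.1, 5.0)]:
    birch_insert(cfs, point, threshold=0.5)

print([cf[0] for cf in cfs])  # number of points represented by each CF
```

Note that the decision depends only on the threshold and the average spread of a CF, not on how much error the insertion induces in the k-means cost.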


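Coreset Theory
Coresets
Given a set of points $P$, a weighted subset $S \subset P$ is a $(k, \varepsilon)$-coreset if for all sets $C$ of $k$ centers it holds

$$|\mathrm{cost}_w(S, C) - \mathrm{cost}(P, C)| \le \varepsilon \cdot \mathrm{cost}(P, C)$$

where $\mathrm{cost}_w(S, C) = \sum_{p \in S} \min_{c \in C} w(p) \|p - c\|^2$.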
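
The coreset condition can be checked numerically for any single choice of centers. The sketch below does this for a made-up one-dimensional example; the function names and data are assumptions, and checking a few center sets only illustrates the definition, since the guarantee has to hold for all sets of $k$ centers.

```python
def sq_dist(p, c):
    return sum((pi - ci) ** 2 for pi, ci in zip(p, c))


def cost(points, centers):
    """Unweighted k-means cost of points with respect to centers."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


def weighted_cost(weighted_points, centers):
    """cost_w(S, C): each point p contributes w(p) times its squared distance."""
    return sum(w * min(sq_dist(p, c) for c in centers) for p, w in weighted_points)


def relative_error(P, S, C):
    """|cost_w(S, C) - cost(P, C)| / cost(P, C); assumes cost(P, C) > 0."""
    return abs(weighted_cost(S, C) - cost(P, C)) / cost(P, C)


# Made-up data: four points, summarized by two of them with weight 2 each.
P = [(0.0,), (2.0,), (10.0,), (12.0,)]
S = [((0.0,), 2.0), ((10.0,), 2.0)]

good_centers = [(1.0,), (11.0,)]
bad_centers = [(2.0,), (12.0,)]

print(relative_error(P, S, good_centers))  # S matches the cost of P here
print(relative_error(P, S, bad_centers))   # here the weighted cost is twice as large
```

The second center set shows why the quantifier over all $C$ matters: a summary that looks perfect for one solution can be far off for another.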


Coreset constructions
01: Agarwal, Har-Peled and Varadarajan: Coreset concept
02: Bădoiu, Har-Peled and Indyk: First coreset construction for clustering problems
04: Har-Peled and Mazumdar: Coreset of size $O(k \varepsilon^{-d} \log n)$, maintainable in data streams
05: Frahling and Sohler: Coreset of size $O(k \varepsilon^{-d} \log n)$, insertion-deletion data streams
06: Chen: Coresets for metric and Euclidean k-median and k-means, polynomial in $d$, $n$ and $\varepsilon^{-1}$
07: Feldman, Monemizadeh, Sohler: weak coresets, $\mathrm{poly}(k, \varepsilon^{-1})$
10: Langberg, Schulman: $O(d^2 k^3 / \varepsilon^2)$
11: Feldman, Langberg: $O(dk / \varepsilon^2)$
Merge & Reduce: Coreset Construction → Streaming Algorithms.
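
The Merge & Reduce idea can be sketched as a binary counter over summaries: each level holds at most one summary, and two summaries of the same level are merged and reduced into one summary of the next level. In the sketch below, `reduce_placeholder` is a stand-in that samples proportionally to weight and reweights; it is an assumption for illustration only and carries none of the guarantees of the coreset constructions listed above. All names and parameters are hypothetical.

```python
import random


def reduce_placeholder(weighted_points, size, rng):
    """Stand-in for a coreset construction: weighted sampling that keeps total weight."""
    if len(weighted_points) <= size:
        return list(weighted_points)
    total = sum(w for _, w in weighted_points)
    chosen = rng.choices(weighted_points, weights=[w for _, w in weighted_points], k=size)
    return [(p, total / size) for p, _ in chosen]


def merge_and_reduce(stream, block_size, rng):
    """One pass over the stream; buckets[i] summarizes block_size * 2**i points."""
    buckets = []
    block = []

    def push(summary):
        level = 0
        while True:
            if level == len(buckets):
                buckets.append(None)
            if buckets[level] is None:
                buckets[level] = summary
                return
            # Merge two summaries of the same level and reduce them to one.
            summary = reduce_placeholder(buckets[level] + summary, block_size, rng)
            buckets[level] = None
            level += 1

    for p in stream:
        block.append((p, 1.0))
        if len(block) == block_size:
            push(block)
            block = []

    # The final summary is the union of the partial block and all buckets.
    result = list(block)
    for bucket in buckets:
        if bucket is not None:
            result.extend(bucket)
    return result


rng = random.Random(0)
stream = [(float(i),) for i in range(1000)]
summary = merge_and_reduce(stream, block_size=50, rng=rng)
print(len(summary), sum(w for _, w in summary))
```

At any time only a logarithmic number of buckets is occupied, which is where the additional logarithmic factors in streaming coreset sizes typically come from.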


StreamKM++
also an outcome of this SPP
practical k-means streaming algorithm
computes a coreset, moderate storage requirement
better solutions than BIRCH, but slower

Motivations to Improve BIRCH


Analyzable BIRCH is valuable
Might outperform both StreamKM++ and BIRCH
Hope of keeping good practical properties


Small Change, Huge Effect

When does BIRCH perform badly?


[Figure: example point set on a number line with axis ticks from -7 to 0]

Lemma
Depending on the threshold $T$, BIRCH either needs $|P|/d$ CFs or it computes no constant factor approximation.
(For a generalized example)


Lessons from Coreset Theory

Base insertion decision on induced error
Error can be bounded if the clustering cost of a CF is small
Use packing arguments to bound number of CFs

Insertion of a point
1. Levels of clustering features, start at top level
2. Search for closest CF
3. Point has to lie within the radius of the CF
   (Radius decreases by constant factor per level)
4. Add point if clustering cost of CF stays below $\frac{f(\varepsilon)}{k \, g_i} \cdot \mathrm{OPT}$
5. If insertion fails, open a new CF or go one level down
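
One possible reading of the insertion steps above is sketched below. Several details are assumptions made only for illustration: each CF is anchored at the first point it received (`ref`), a new CF is opened when no CF on the current level is within the radius, the algorithm descends when the closest CF would exceed the cost threshold, and the threshold is kept constant across levels instead of depending on the level. The names `Node`, `insert`, `radius`, `threshold` and `shrink` are hypothetical.

```python
import math


class Node:
    """A clustering feature with a reference point and child CFs one level down."""

    def __init__(self, p):
        self.ref = tuple(p)  # assumed anchor: the first point inserted
        self.n = 1
        self.ls = list(p)
        self.ss = sum(x * x for x in p)
        self.children = []

    def cost_with(self, p):
        """Clustering cost of this CF (w.r.t. its centroid) if p were added."""
        n = self.n + 1
        ls = [a + b for a, b in zip(self.ls, p)]
        ss = self.ss + sum(x * x for x in p)
        return ss - sum(x * x for x in ls) / n

    def absorb(self, p):
        self.n += 1
        self.ls = [a + b for a, b in zip(self.ls, p)]
        self.ss += sum(x * x for x in p)


def insert(roots, p, threshold, radius, shrink=0.5):
    """Walk down the levels until p is absorbed or a new CF is opened."""
    siblings, r = roots, radius
    while True:
        near = [c for c in siblings if math.dist(c.ref, p) <= r]
        if not near:
            siblings.append(Node(p))   # no CF within the radius: open a new CF
            return
        closest = min(near, key=lambda c: math.dist(c.ref, p))
        if closest.cost_with(p) <= threshold:
            closest.absorb(p)          # cost stays below the threshold: add point
            return
        siblings, r = closest.children, r * shrink  # go one level down


# Made-up stream: the third point would push the first CF over the threshold.
roots = []
for point in [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0), (3.0, 0.0)]:
    insert(roots, point, threshold=0.006, radius=1.0)

print(len(roots), roots[0].n, len(roots[0].children))
```

In contrast to the radius test of the original insertion rule, the decision here is driven by the clustering cost a CF accumulates, which is the quantity the error analysis can bound.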


Analysis: Quality
Inspired by known coreset constructions
Distinguish between points close to optimal centers (→ packing argument)
and points far away from the centers (error negligible compared to clustering cost)


Analysis: Number of CFs

1. Bound number of levels:
   Constant factor between radii → number of points until full doubles → logarithmic in the number of points
2. Number of full CFs:
   can be bounded by lower bound on their clustering cost
3. Two types of non-full CFs:
   Children of full CFs (→ bound carries over)
   and non-full CFs on the first level (→ packing argument)

And if OPT is not known?

dynamically increase threshold
Analysis still works, but gets more involved


Theorem
The modified BIRCH algorithm computes a $(k, \varepsilon)$-coreset if OPT is known and can be modified for the case that OPT is not known. The size of the coreset is
!


k d
k
log n log2 log n .
O
+ 2cd
f ()
f ()


And what is still missing ...

... is the experimental analysis. This is the next step :-)


Thank you for your attention!
