
Clustering!

adapted from:
Doug Downey and Bryan Pardo, Northwestern University

Bagging
Use bootstrapping to generate L training sets
and train one base-learner on each
(Breiman, 1996)
Combine the base-learners' predictions by voting
Unstable algorithms (e.g., decision trees) benefit most from bagging
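
A minimal bagging sketch in Python, assuming scikit-learn is available; the decision-tree base learner, L = 10, and integer class labels are illustrative assumptions, not part of the slides:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, L=10, rng=np.random.default_rng(0)):
    """Train L base-learners, each on a bootstrap sample of (X, y)."""
    learners = []
    n = len(X)
    for _ in range(L):
        idx = rng.integers(0, n, size=n)  # sample n points with replacement
        learners.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return learners

def bagging_predict(learners, X):
    """Combine base-learners by majority vote (assumes integer labels)."""
    votes = np.stack([m.predict(X) for m in learners])  # shape (L, n)
    return np.apply_along_axis(lambda c: np.bincount(c).argmax(), 0, votes)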

Boosting

Given a large training set, randomly divide it
into 3 sets (X1, X2, and X3)
Use X1 to train D1
Test D1 with X2
Training set for D2 = all instances from X2
misclassified by D1 (plus an equal number of
instances from X2 that D1 classifies correctly)
Test D1 and D2 with X3
Training set for D3 = the instances from X3 on
which D1 and D2 disagree
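
A sketch of this three-learner scheme in Python, assuming scikit-learn; the shallow-tree base learner is an illustrative choice:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def boost_three(X, y, rng=np.random.default_rng(0)):
    """Train D1, D2, D3 as described above.
    Assumes the misclassified and disagreement sets are non-empty."""
    idx = rng.permutation(len(X))
    X1, X2, X3 = np.array_split(X[idx], 3)
    y1, y2, y3 = np.array_split(y[idx], 3)

    D1 = DecisionTreeClassifier(max_depth=2).fit(X1, y1)

    # D2: X2 instances D1 misclassifies, plus as many that it gets right
    wrong = D1.predict(X2) != y2
    keep = np.concatenate([np.flatnonzero(wrong),
                           np.flatnonzero(~wrong)[:wrong.sum()]])
    D2 = DecisionTreeClassifier(max_depth=2).fit(X2[keep], y2[keep])

    # D3: X3 instances on which D1 and D2 disagree
    disagree = D1.predict(X3) != D2.predict(X3)
    D3 = DecisionTreeClassifier(max_depth=2).fit(X3[disagree], y3[disagree])
    return D1, D2, D3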

AdaBoost
Generate a sequence of base-learners, each
focusing on the previous ones' errors
(Freund and Schapire, 1996)
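
A compact discrete-AdaBoost sketch in Python (one standard formulation, not necessarily the slide author's); it assumes labels in {-1, +1} and uses decision stumps:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, T=20):
    """Discrete AdaBoost; y must be in {-1, +1}."""
    n = len(X)
    w = np.full(n, 1.0 / n)                 # instance weights
    learners, alphas = [], []
    for _ in range(T):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = w[pred != y].sum()
        if err >= 0.5:                      # no better than chance: stop
            break
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))
        w *= np.exp(-alpha * y * pred)      # upweight the mistakes
        w /= w.sum()
        learners.append(stump)
        alphas.append(alpha)
    return learners, alphas

def adaboost_predict(learners, alphas, X):
    return np.sign(sum(a * m.predict(X) for a, m in zip(alphas, learners)))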

Mixture of Experts
Voting where the weights are input-dependent (gating):

    y = \sum_{j=1}^{L} w_j(x) \, d_j(x)

(Jacobs et al., 1991)
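
A minimal numpy sketch of the gating idea, assuming experts and a gating network that are already trained (all names here are illustrative):

import numpy as np

def mixture_predict(x, experts, gate):
    """y(x) = sum_j w_j(x) d_j(x), with softmax gating weights."""
    d = np.array([e(x) for e in experts])  # expert outputs d_j(x)
    scores = gate(x)                       # one gating score per expert
    w = np.exp(scores - scores.max())
    w /= w.sum()                           # softmax: input-dependent weights
    return w @ d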

Stacking

Combiner f() is another learner
(Wolpert, 1992)

Cascading
Use d_j only if the preceding ones are not confident
Cascade learners in order of complexity

Clustering

Grouping data into (hopefully useful) sets.


[Figure: scatter plot of two example clusters, labeled
"Things on the left" and "Things on the right"]

Clustering

Unsupervised Learning

No labels

Why do clustering?

Labeling is costly
Data pre-processing
    Text Classification (e.g., search engines, Google Sets)
Hypothesis Generation / Data Understanding
    Clusters might suggest natural groups
Visualization

Some definitions

Let X be the dataset:

    X = \{ x_1, x_2, x_3, \ldots, x_n \}

An m-clustering of X is a partition of X into m
sets (clusters) C_1, ..., C_m such that:

1. Clusters are non-empty:   C_i \neq \emptyset, \; 1 \le i \le m
2. Clusters cover all of X:  \bigcup_{i=1}^{m} C_i = X
3. Clusters do not overlap:  C_i \cap C_j = \emptyset \text{ if } i \neq j

How many possible clusterings?

The number of ways to partition n data points into m
non-empty clusters is a Stirling number of the second kind:

    S(n, m) = \frac{1}{m!} \sum_{i=0}^{m} (-1)^i \binom{m}{i} (m - i)^n

(n = size of dataset, m = number of clusters)

S(15, 3) = 2,375,101
S(20, 4) = 45,232,115,901
S(100, 5) \approx 10^{68}
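
These counts are quick to verify in Python with exact integer arithmetic:

from math import comb, factorial

def stirling2(n, m):
    """Stirling number of the second kind: ways to partition
    n items into m non-empty sets."""
    total = sum((-1) ** i * comb(m, i) * (m - i) ** n for i in range(m + 1))
    return total // factorial(m)   # the sum is always divisible by m!

print(stirling2(15, 3))            # 2375101
print(stirling2(20, 4))            # 45232115901
print(len(str(stirling2(100, 5)))) # 68 digits, i.e., S(100,5) ~ 10^68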

What does this mean?

We can't try all possible clusterings.
Clustering algorithms look at only a small fraction
of all partitions of the data.
The exact partitions tried depend on the kind
of clustering used.

Who is right?
Different techniques cluster the same data
set DIFFERENTLY.
Who is right? Is there a right clustering?

Classic Example: Half Moons

From Batra et al., http://www.cs.cmu.edu/~rahuls/pub/bmvc2008-clustering-rahuls.pdf

Steps in Clustering
Select Features
Define a Proximity Measure
Define Clustering Criterion
Define a Clustering Algorithm
Validate the Results
Interpret the Results

Kinds of Clustering

Sequential
    Fast

Cost Optimization
    Fixed number of clusters (typically)

Hierarchical
    Start with many clusters,
    join clusters at each step

A Sequential Clustering Method

Basic Sequential Algorithmic Scheme (BSAS)
S. Theodoridis and K. Koutroumbas, Pattern Recognition,
Academic Press, London, England, 1999

Assumption: the number of clusters is not known in advance.

m = 1
C_1 = {x_1}
For i = 2 to n
    Find C_k : d(x_i, C_k) = min_j d(x_i, C_j)
    If d(x_i, C_k) > Θ and m < q
        m = m + 1
        C_m = {x_i}
    Else
        C_k = C_k ∪ {x_i}
    End
End

d(x, C) = the distance between feature vector x and cluster C
Θ = the threshold of dissimilarity
q = the maximum number of clusters
n = the number of data points
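
A runnable sketch of BSAS in Python; measuring d(x, C) as the distance to the cluster mean is one common choice (the slide leaves it open):

import numpy as np

def bsas(X, theta, q):
    """Basic Sequential Algorithmic Scheme.
    X: (n, d) data array; theta: dissimilarity threshold;
    q: maximum number of clusters.
    Returns a list of clusters, each a list of row indices into X."""
    clusters = [[0]]                     # C_1 = {x_1}
    means = [X[0].astype(float)]         # d(x, C) = distance to cluster mean
    for i in range(1, len(X)):
        dists = [np.linalg.norm(X[i] - mu) for mu in means]
        k = int(np.argmin(dists))
        if dists[k] > theta and len(clusters) < q:
            clusters.append([i])         # start a new cluster
            means.append(X[i].astype(float))
        else:
            clusters[k].append(i)        # merge into the nearest cluster
            means[k] = X[clusters[k]].mean(axis=0)
    return clusters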

A Cost-Optimization Method

K-means clustering
J. B. MacQueen (1967): "Some Methods for Classification and
Analysis of Multivariate Observations," Proceedings of the 5th
Berkeley Symposium on Mathematical Statistics and Probability,
Berkeley, University of California Press, 1:281-297

A greedy algorithm
Partitions n examples into k clusters
Minimizes the sum of the squared distances
to the cluster centers

The K-means Algorithm

1. Place K points into the space represented by the
   objects being clustered. These points represent
   the initial group centroids (means).
2. Assign each object to the group with the closest
   centroid (mean).
3. When all objects have been assigned, recalculate the
   positions of the K centroids (means).
4. Repeat steps 2 and 3 until the centroids no longer
   move.
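
A compact numpy implementation of these four steps; initializing from K randomly chosen data points is one common choice (step 1 leaves the placement open):

import numpy as np

def kmeans(X, K, n_iter=100, rng=np.random.default_rng(0)):
    """Lloyd's K-means. X: (n, d) array. Returns (centroids, labels)."""
    # Step 1: pick K distinct data points as initial centroids
    centroids = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    for _ in range(n_iter):
        # Step 2: assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its points
        new = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                        else centroids[k] for k in range(K)])
        # Step 4: stop once the centroids no longer move
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids, labels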

K-means clustering

The way to initialize the means is not specified.
    Randomly choose k samples?
    Results depend on the initial means
    Try multiple starting points?
Assumes K is known.
    How do we choose it?

k-Means Clustering

Find k reference vectors (centroids) that
best represent the data
Reference vectors: m_j, j = 1, ..., k
Assign each x to the nearest (most similar) reference:

    x \Rightarrow m_i \text{ where } \| x - m_i \| = \min_j \| x - m_j \|

Encoding/Decoding

Reconstruction Error

    E(\{ m_i \}_{i=1}^{k} \mid X) = \sum_t \sum_i b_i^t \, \| x^t - m_i \|^2

    b_i^t = \begin{cases} 1 & \text{if } \| x^t - m_i \| = \min_j \| x^t - m_j \| \\ 0 & \text{otherwise} \end{cases}
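
The same error as a direct numpy transcription, usable with the kmeans output above:

import numpy as np

def reconstruction_error(X, centroids):
    """E({m_i} | X): each point contributes its squared distance
    to its nearest centroid (b_i^t selects the nearest one)."""
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return np.sum(dists.min(axis=1) ** 2)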

k-means Clustering

Leader Cluster Algorithm

An instance far away from all centroids (dist >
threshold) becomes a new centroid
A cluster that covers a large number of instances
(num > threshold) is split into 2 clusters
A cluster that covers too few instances (num <
threshold) can be removed (with its instances
reassigned, perhaps to another random data point)

Choosing K

Defined by the application, e.g., image quantization
PCA
Incremental (leader-cluster) algorithm: add clusters
one at a time until the reconstruction-error curve
shows an elbow
Manual check for meaning
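
A sketch of the elbow check, reusing the kmeans and reconstruction_error helpers above and assuming X is the (n, d) data array:

# Plot (or print) reconstruction error vs. K and look for the
# "elbow" where adding another cluster stops paying off.
for K in range(1, 11):
    centroids, _ = kmeans(X, K)
    print(K, reconstruction_error(X, centroids))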

Supervised Learning After Clustering

Naïve Bayes Mood Classifier

Training Data

Human-Powered Compression
Label each of the following moods with one of the
following seven categories: happy, sad, angry, fearful,
disgusted, surprised, or none of the above.

Pleased
Jubilant
Recumbent
Ditzy
Weird
Geeky
Blank
Dirty
Thirsty

Guilty
Hot
Worried
Nervous
Hungry
Nostalgic
Artistic
Crushed
Giggly

LiveJournal Mood Hierarchy

angry (#2)

aggravated (#1)
annoyed (#3)
bitchy (#110)
cranky (#8)
cynical (#104)
enraged (#12)
frustrated (#47)
grumpy (#95)
infuriated (#19)
irate (#20)
irritated (#112)
moody (#23)
pissed off (#24)
stressed (#28)

rushed (#100)

awake (#87)
confused (#6)

determined (#45)

predatory (#118)

devious (#130)
energetic (#11)

curious (#56)

bouncy (#59)
hyper (#52)

enthralled (#13)
happy (#15)

amused (#44)
cheerful (#125)
chipper (#99)
ecstatic (#98)
excited (#41)

K-Means Clustering

Happy: Energetic, Bouncy, Happy, Hyper, Cheerful,
Ecstatic, Excited, Jubilant, Giddy, Giggly

Sad: Confused, Crappy, Crushed, Depressed, Distressed,
Envious, Gloomy, Guilty, Intimidated, Jealous, Lonely,
Rejected, Sad, Scared

Angry: Aggravated, Angry, Bitchy, Enraged, Infuriated,
Irate, Pissed off

K-Means Clustering

[Bar chart: number of posts per mood, with the moods along the
x-axis grouped by cluster and post counts (up to 18,000) on the y-axis]
