adapted from:
Doug Downey and Bryan Pardo, Northwestern University
Bagging
Use bootstrapping to generate L training sets and train one base learner on each (Breiman, 1996)
Combine the base learners by voting
Unstable algorithms profit from bagging
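The bootstrap-then-vote recipe can be sketched in a few lines. The threshold "stump" base learner and the toy 1-D data below are invented for illustration; any unstable learner could stand in its place:

```python
import random
from collections import Counter

def bootstrap(data, rng):
    """Sample len(data) points with replacement (one bootstrap replicate)."""
    return [rng.choice(data) for _ in data]

def train_stump(sample):
    """Toy base learner: threshold at the midpoint of the two class means."""
    pos = [x for x, y in sample if y == 1]
    neg = [x for x, y in sample if y == 0]
    if not pos or not neg:                    # degenerate bootstrap sample
        t = sum(x for x, _ in sample) / len(sample)
    else:
        t = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: 1 if x >= t else 0

def bagged_predict(learners, x):
    """Majority vote over the L base learners."""
    votes = Counter(h(x) for h in learners)
    return votes.most_common(1)[0][0]

rng = random.Random(0)
data = [(x, 0) for x in (1.0, 1.2, 0.8, 1.1)] + [(x, 1) for x in (3.0, 3.2, 2.9, 3.1)]
L = 7                                         # number of bootstrap replicates
learners = [train_stump(bootstrap(data, rng)) for _ in range(L)]
```

Even when individual stumps land on bad thresholds (an unstable learner's failure mode), the vote over L replicates smooths them out.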
Boosting: AdaBoost
Generate a sequence of base learners, each focusing on the previous ones' errors (Freund and Schapire, 1996)
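A minimal AdaBoost.M1-style sketch of that reweighting loop, assuming labels in {-1, +1} and a hypothetical pool of candidate weak learners from which the lowest-weighted-error one is picked each round:

```python
import math

def adaboost(examples, weak_learners, rounds):
    """Reweight examples so each new learner focuses on earlier mistakes."""
    n = len(examples)
    w = [1.0 / n] * n                       # uniform initial weights
    ensemble = []                           # (alpha, h) pairs
    for _ in range(rounds):
        def weighted_error(h):
            return sum(wi for wi, (x, y) in zip(w, examples) if h(x) != y)
        h = min(weak_learners, key=weighted_error)
        err = weighted_error(h)
        if err == 0:                        # perfect learner: take it and stop
            ensemble.append((1.0, h))
            break
        if err >= 0.5:                      # no better than chance: stop
            break
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, h))
        # up-weight mistakes, down-weight correct examples, renormalize
        w = [wi * math.exp(alpha if h(x) != y else -alpha)
             for wi, (x, y) in zip(w, examples)]
        z = sum(w)
        w = [wi / z for wi in w]
    return ensemble

def predict(ensemble, x):
    """Weighted vote of the base learners (labels in {-1, +1})."""
    return 1 if sum(alpha * h(x) for alpha, h in ensemble) >= 0 else -1

examples = [(x, -1) for x in (1, 2, 3)] + [(x, 1) for x in (6, 7, 8)]
stumps = [lambda x, t=t: 1 if x > t else -1 for t in range(9)]
ens = adaboost(examples, stumps, rounds=3)
```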
Mixture of Experts
Voting where the weights are input-dependent (gating):
$y = \sum_{j=1}^{L} w_j d_j$
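The gating equation can be sketched directly. The softmax gates and constant experts below are hypothetical stand-ins for trained networks; the point is that the weights $w_j$ are recomputed per input:

```python
import math

def softmax(zs):
    """Normalize gate scores into weights that sum to 1."""
    m = max(zs)
    es = [math.exp(z - m) for z in zs]
    s = sum(es)
    return [e / s for e in es]

def mixture_predict(x, experts, gates):
    """y = sum_j w_j(x) d_j(x): input-dependent weighted vote."""
    w = softmax([g(x) for g in gates])
    return sum(wj * d(x) for wj, d in zip(w, experts))

# hypothetical experts and gating functions for a 1-D input
experts = [lambda x: 0.0, lambda x: 1.0]
gates = [lambda x: -x, lambda x: x]       # the second gate wins for large x
```

For large positive x nearly all weight goes to the second expert; for large negative x, to the first.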
Stacking
The combiner f(·) is another learner (Wolpert, 1992)
Cascading
Use d_j only if the preceding learners are not confident
Cascade learners in order of complexity
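A sketch of this confidence-gated cascade, assuming the hypothetical interface that each learner returns a (label, confidence) pair, with learners ordered cheapest first:

```python
def cascade_predict(x, learners, threshold=0.8):
    """Query learners in order of complexity; accept the first prediction
    whose confidence clears the threshold, else fall through."""
    for h in learners[:-1]:
        label, conf = h(x)
        if conf >= threshold:
            return label
    return learners[-1](x)[0]   # last, most complex learner always answers

# hypothetical cheap and expensive classifiers on a 1-D score
def cheap(x):
    return ("spam", 0.95) if x > 10 else ("ham", 0.6)

def expensive(x):
    return ("spam", 0.99) if x > 5 else ("ham", 0.99)
```

Easy inputs never reach the expensive learner, which is the efficiency argument for cascading.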
Clustering
Unsupervised Learning
No labels
Why do clustering?
Labeling is costly
Data pre-processing
Text Classification (e.g., search engines, Google Sets)
Hypothesis Generation/Data Understanding
Clusters might suggest natural groups.
Visualization
Some definitions
Data set: $X = \{x_1, x_2, x_3, \ldots, x_n\}$
Clusters are disjoint: $C_i \cap C_j = \emptyset$ if $i \neq j$
Clusters cover the data: $\bigcup_i C_i = X$
Number of clusters
The number of ways to partition n examples into m non-empty clusters is the Stirling number of the second kind:
$S(n, m) = \frac{1}{m!} \sum_{i=0}^{m} (-1)^i \binom{m}{i} (m - i)^n$
S(15, 3) = 2,375,101
S(20, 4) = 45,232,115,901
S(100, 5) ≈ 10^68
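These counts can be checked directly from the formula:

```python
from math import comb, factorial

def stirling2(n, m):
    """Stirling number of the second kind:
    S(n, m) = (1/m!) * sum_{i=0}^{m} (-1)^i * C(m, i) * (m - i)^n,
    the number of ways to partition n items into m non-empty clusters."""
    total = sum((-1) ** i * comb(m, i) * (m - i) ** n for i in range(m + 1))
    return total // factorial(m)   # the sum is always divisible by m!
```

The explosion of these numbers is why exhaustive search over clusterings is hopeless and greedy algorithms like k-means are used instead.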
Who is right?
Different techniques cluster the same data set DIFFERENTLY.
Who is right? Is there a right clustering?
Steps in Clustering
Select Features
Define a Proximity Measure
Define Clustering Criterion
Define a Clustering Algorithm
Validate the Results
Interpret the Results
Kinds of Clustering
Sequential: fast
Cost optimization: fixed number of clusters (typically)
Hierarchical
A cost-optimization method: k-means clustering
A greedy algorithm that partitions n examples into k clusters, minimizing the sum of the squared distances to the cluster centers:
1. Choose k initial cluster centers
2. Assign each example to its nearest center
3. Recompute each center as the mean of its assigned examples
4. Repeat steps 2-3 until the assignments stop changing
k-Means Clustering
Find k reference vectors (centroids) $m_j$, $j = 1, \ldots, k$, which best represent the data
Use the nearest (most similar) reference vector: encoding/decoding
Reconstruction error:
$E(\{m_i\}_{i=1}^{k} \mid X) = \sum_t \sum_i b_i^t \, \lVert x^t - m_i \rVert^2$
$b_i^t = \begin{cases} 1 & \text{if } \lVert x^t - m_i \rVert = \min_j \lVert x^t - m_j \rVert \\ 0 & \text{otherwise} \end{cases}$
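The reconstruction-error view leads directly to the alternating algorithm: fix the centroids and compute the hard assignments $b_i^t$, then fix the assignments and move each centroid to the mean of its cluster. A minimal 1-D sketch with a naive initialization (the first k points):

```python
def kmeans(X, k, iters=20):
    """Greedy k-means: alternate assignments and centroid updates to
    (locally) minimize E = sum_t sum_i b_i^t * ||x^t - m_i||^2."""
    m = X[:k]                               # naive init: first k points
    for _ in range(iters):
        # assignment step: b_i^t = 1 for the nearest reference vector
        clusters = [[] for _ in range(k)]
        for x in X:
            i = min(range(k), key=lambda j: (x - m[j]) ** 2)
            clusters[i].append(x)
        # update step: move each m_i to the mean of its cluster
        m = [sum(c) / len(c) if c else m[i] for i, c in enumerate(clusters)]
    return m, clusters

X = [1.0, 1.1, 0.9, 5.0, 5.2, 4.8]
centroids, clusters = kmeans(X, 2)
```

Because the procedure is greedy it converges only to a local minimum of E; in practice one reruns it from several random initializations and keeps the best.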
Choosing K
Defined by the application, e.g., image quantization
PCA
Incremental (leader-cluster) algorithm: add clusters one at a time until an elbow appears in the reconstruction error
Manual check for meaning
Training Data
Pleased
Jubilant
Recumbent
Ditzy
Weird
Geeky
Blank
Dirty
Thirsty
Guilty
Hot
Worried
Nervous
Hungry
Nostalgic
Artistic
Crushed
Giggly
angry (#2)
aggravated (#1)
annoyed (#3)
bitchy (#110)
cranky (#8)
cynical (#104)
enraged (#12)
frustrated (#47)
grumpy (#95)
infuriated (#19)
irate (#20)
irritated (#112)
moody (#23)
pissed off (#24)
stressed (#28)
rushed (#100)
awake (#87)
confused (#6)
determined (#45)
predatory (#118)
devious (#130)
energetic (#11)
curious (#56)
bouncy (#59)
hyper (#52)
enthralled (#13)
happy (#15)
amused (#44)
cheerful (#125)
chipper (#99)
ecstatic (#98)
excited (#41)
K-Means Clustering
Happy: energetic, bouncy, happy, hyper, cheerful, ecstatic, excited, jubilant, giddy, giggly
Sad: confused, crappy, crushed, depressed, distressed, envious, gloomy, guilty, intimidated, jealous, lonely, rejected, sad, scared
Angry: aggravated, angry, bitchy, enraged, infuriated, irate, pissed off
[Figure: bar chart with the clustered moods along the x-axis, grouped by cluster (angry, happy, sad)]
K-Means Clustering
[Figure: "Number of Posts per Mood" bar chart; y-axis runs from 2,000 to 18,000 posts]