
CS109/Stat121/AC209/E-109

Data Science
Classification and Clustering
Hanspeter Pfister & Joe Blitzstein
pfister@seas.harvard.edu / blitzstein@stat.harvard.edu

[Figure: least-squares geometry, showing y, the column space of X, and the residual.]
This Week
HW2 due 10/3 at 11:59 pm.

Reminder: for this and all assignments, make sure to check through your submission, both before submitting and by re-downloading from the dropbox and then looking through the file.

No homeworks will be accepted more than 2 days late, since a maximum of 2 late days can be applied to an assignment.

Friday lab 10-11:30 am in MD G115


Classification vs. Clustering

Classification is supervised learning. Clustering is unsupervised learning.

Classification has pre-defined classes and training data with labels: use the x's to predict the y's.

Clustering has no pre-defined classes: group data points into clusters to try to find structure in the data.
Classification vs. Clustering

http://www.kontagent.com/kaleidoscope/2013/01/09/kscope-profile-what-george-clooney-can-teach-you-about-ltv-and-machine-learning/
Discriminative vs. Generative Classifiers

What to model and what not to model?

discriminative: directly model p(y|x)


generative: give a full model, p(x,y) = p(x)p(y|x) = p(y)p(x|y)
Classification via Logistic Regression
logit(p) = β_0 + β_1 x_1 + ... + β_{k-1} x_{k-1}

where p is the probability of being in group 1 (logistic regression can also be extended to the case of more than 2 groups)

modeling approach: model p(y|x), don't model p(x)

http://cvxopt.org/examples/book/logreg.html
[Figures: a linear decision boundary and a nonlinear decision boundary.]

Conway and White, Machine Learning for Hackers


glm
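To make the glm reference concrete, here is a minimal sketch of fitting a logistic classifier in R; the simulated data frame and coefficient values are made up for illustration, not taken from the slides.

n = 200
df = data.frame(x1 = rnorm(n), x2 = rnorm(n))
df$y = rbinom(n, 1, plogis(-0.5 + 2*df$x1 - df$x2))    # labels simulated from an assumed logistic model
fit = glm(y ~ x1 + x2, data = df, family = binomial)   # logit link is the default for the binomial family
phat = predict(fit, type = "response")                 # fitted probabilities P(y = 1 | x)
yhat = as.numeric(phat > 0.5)                          # classify with a 0.5 threshold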
Generative Models
P(Y=1 | X=x) = f(x|Y=1) P(Y=1) / [ f(x|Y=1) P(Y=1) + f(x|Y=0) P(Y=0) ]

(by Bayes' rule)

Then can model the densities f(x|Y=1), f(x|Y=0).


Gaussian and Linear Classifiers
Take the conditional X|Y to be Multivariate Normal. This leads to a quadratic decision boundary. If the covariance matrices for the two groups are assumed equal, then we get a linear decision boundary.
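These two cases are quadratic and linear discriminant analysis (QDA and LDA). A minimal sketch in R using the MASS package, run on the built-in iris data restricted to two species (an illustration, not from the slides):

library(MASS)                                                   # provides lda() and qda()
d = subset(iris, Species != "setosa")
d$Species = droplevels(d$Species)
fit_qda = qda(Species ~ Sepal.Length + Sepal.Width, data = d)   # separate covariances: quadratic boundary
fit_lda = lda(Species ~ Sepal.Length + Sepal.Width, data = d)   # shared covariance: linear boundary
table(predict(fit_lda, d)$class, d$Species)                     # in-sample confusion matrix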
Naive Bayes
Naive conditional independence assumption:

f_j(x_1, ..., x_d) = f_j1(x_1) f_j2(x_2) ... f_jd(x_d)

Often unrealistic, but it still may be useful, especially since it leads to a drastic reduction in the number of parameters to estimate.
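As a concrete illustration (not from the slides), a naive Bayes classifier can be fit in R with the e1071 package, assuming it is installed; numeric features are modeled with one univariate Gaussian per feature per class.

library(e1071)                                   # provides naiveBayes()
fit_nb = naiveBayes(Species ~ ., data = iris)    # one univariate density per feature per class
pred = predict(fit_nb, iris)
mean(pred == iris$Species)                       # in-sample accuracy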
[Excerpt from Domingos, "A Few Useful Things to Know About Machine Learning," on combating overfitting, with Figure 2: test-set accuracy (%) vs. number of examples. Caption: "Naive Bayes can outperform a state-of-the-art rule learner (C4.5rules) even when the true classifier is a set of rules."]

Domingos, http://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf
What kind of regression model should we use?

A Regression Problem: y = f(x) + noise. Can we learn f from this data? Let's consider three methods.
Moore, www.cs.cmu.edu/~awm/tutorials
Linear Regression (linear in x)

Moore, www.cs.cmu.edu/~awm/tutorials
Quadratic Regression (quadratic in x)

Moore, www.cs.cmu.edu/~awm/tutorials
Join-the-dots (connect the dots)
Also known as piecewise linear nonparametric regression, if that makes you feel better.

Moore, www.cs.cmu.edu/~awm/tutorials
What do we really want?


Moore, www.cs.cmu.edu/~awm/tutorials
Why not choose the method with the best fit to the data?

How well are you going to predict future data drawn from the same distribution?


Underfitting vs. Overfitting


Shalizi, http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/ADAfaEPoV.pdf
plot(x,y2)
curve(7*x^2-0.5*x,add=TRUE,col="grey")

Figure 3.2: Scatter-plot showing sample data and the true, quadratic regression curve
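The slides do not show how x and y2 were generated; a plausible reconstruction, assuming x is uniform on [-2, 2] and the noise around the true quadratic curve is Gaussian (the sample size and noise level here are guesses):

n = 300                                          # sample size: a guess
x = runif(n, -2, 2)
y2 = 7*x^2 - 0.5*x + rnorm(n, sd = 5)            # true curve plus noise; sd = 5 is a guess
plot(x, y2)
curve(7*x^2 - 0.5*x, add = TRUE, col = "grey")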
In-Sample MSE

[Plot: in-sample mean squared error (log scale) vs. polynomial degree.]

Shalizi, http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/ADAfaEPoV.pdf
mse.q = vector(length=10)
for (degree in 0:9) {   # x and y2 as in the scatterplot above
  fit = if (degree == 0) lm(y2 ~ 1) else lm(y2 ~ poly(x, degree))
  mse.q[degree+1] = mean(residuals(fit)^2) }   # in-sample MSE for each degree
Out-Of-Sample MSE

[Plot: out-of-sample mean squared error (log scale) vs. polynomial degree.]

Shalizi, http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/ADAfaEPoV.pdf
gmse.q = vector(length=10)
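A hedged sketch of how gmse.q (declared above) is presumably filled in: simulate a fresh sample from the same model and measure each fitted polynomial's mean squared error on it (the new-sample details below are assumptions, not from the slides).

x.new = runif(length(x), -2, 2)                           # fresh data from the same design (assumed)
y2.new = 7*x.new^2 - 0.5*x.new + rnorm(length(x), sd = 5)
for (degree in 0:9) {
  fit = if (degree == 0) lm(y2 ~ 1) else lm(y2 ~ poly(x, degree))
  pred = predict(fit, newdata = data.frame(x = x.new))
  gmse.q[degree+1] = mean((y2.new - pred)^2)              # out-of-sample (generalization) MSE
}
plot(0:9, gmse.q, log = "y", xlab = "polynomial degree", ylab = "mean squared error")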
kNN (k Nearest Neighbors)

http://scott.fortmann-roe.com/docs/BiasVariance.html

Choice of k is another bias-variance tradeoff.
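For concreteness (not from the slides), kNN classification in R can be done with the class package; the train/test split and the choice k = 5 below are arbitrary.

library(class)                                            # provides knn()
idx = sample(nrow(iris), 100)
pred = knn(train = iris[idx, 1:4], test = iris[-idx, 1:4],
           cl = iris$Species[idx], k = 5)                 # majority vote among the 5 nearest neighbors
mean(pred == iris$Species[-idx])                          # test-set accuracy; try several values of k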


kNN in Collaborative Filtering
This was the most common tool for recommendation systems when the Netflix Prize began, and it remained an integral tool for most of the successful teams.

r̂_ui = ( Σ_{j ∈ N(i;u)} s_ij r_uj ) / ( Σ_{j ∈ N(i;u)} s_ij )

Both user-oriented and item-oriented versions are useful.

How should k be chosen? How should the weights be chosen?
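A small sketch of the formula above in R, in its item-oriented form: to predict user u's rating of item i, average u's ratings of the k items most similar to i, weighted by similarity. The toy ratings matrix and the cosine similarity over co-rated users are illustrative assumptions, not from the slides.

R = matrix(c(5, 4, NA,
             4, NA, 3,
             1, 2, 2,
             NA, 5, 4), nrow = 4, byrow = TRUE)           # users x items; NA = unrated
cos_sim = function(a, b) { ok = !is.na(a) & !is.na(b)
  sum(a[ok]*b[ok]) / sqrt(sum(a[ok]^2) * sum(b[ok]^2)) }
predict_rating = function(R, u, i, k = 2) {
  s = sapply(1:ncol(R), function(j) if (j == i) NA else cos_sim(R[, i], R[, j]))
  cand = which(!is.na(R[u, ]) & !is.na(s))                               # items user u has rated
  nbrs = cand[order(s[cand], decreasing = TRUE)][1:min(k, length(cand))] # N(i; u): top-k neighbors
  sum(s[nbrs] * R[u, nbrs]) / sum(s[nbrs])                               # weighted average, as in the formula
}
predict_rating(R, u = 1, i = 3)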


A Geometry Puzzle
(Steele, Ch. 4: On Geometry and Sums of Squares)

In R^2, one places a unit circle in each quadrant of the square [-2, 2]^2. A non-overlapping circle of maximal radius is then centered at the origin.

Fig. 4.1. This arrangement of 5 = 2^2 + 1 circles in [-2, 2]^2 has a natural generalization to an arrangement of 2^d + 1 spheres in [-2, 2]^d. This general arrangement then provokes a question which a practical person might find perplexing or even silly. Does the central sphere stay inside the box [-2, 2]^d for all values of d?
Steele, The Cauchy-Schwarz Master Class
For dimensions 10 and higher, it extends outside the box. In fact, as the dimension increases, the % of its volume inside the box goes to 0.
Curse of Dimensionality
For n indep. Unif(-1,1) r.v.s, what is the probability
that the random vector is in the unit ball?

n     probability
2     0.79
3     0.52
6     0.08
10    0.002
15    0.00001

In many high-dimensional settings, the vast majority of data will be near the boundaries, not in the center.
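A quick Monte Carlo check of the table above (a sketch; the number of replications is arbitrary, and the smallest probabilities need more replications to estimate well):

in_ball_prob = function(n, reps = 1e5) {
  x = matrix(runif(reps * n, -1, 1), nrow = reps)   # reps points uniform in [-1, 1]^n
  mean(rowSums(x^2) <= 1)                           # fraction landing inside the unit ball
}
sapply(c(2, 3, 6, 10, 15), in_ball_prob)            # roughly 0.79, 0.52, 0.08, 0.002, 0.00001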
Blessing of Dimensionality
In statistics, curse of dimensionality is often used to refer
to the difficulty of fitting a model when many possible
predictors are available. But this expression bothers me,
because more predictors is more data, and it should not be
a curse to have more data....

With multilevel modeling, there is no curse of


dimensionality. When many measurements are taken on
each observation, these measurements can themselves be
grouped. Having more measurements in a group gives us
more data to estimate group-level parameters (such as the
standard deviation of the group effects and also coefficients
for group-level predictors, if available).
In all the realistic curse-of-dimensionality problems I've seen, the dimensions (the predictors) have a structure. The data don't sit in an abstract K-dimensional space; they are units with K measurements that have names, orderings, etc.

Gelman, http://andrewgelman.com/2004/10/27/the_blessing_of/
k-means Clustering

Lossy Compression (Moore): suppose you transmit the coordinates of points drawn from this dataset, you can install decoding software at the receiver, and you are only allowed to send a few bits per point, so it will be a lossy transmission. Loss = Sum Squared Error between the decoded coords and the original coords. What encoder/decoder will lose the least information?

K-means
1. Ask user how many clusters they'd like (e.g. k=5).
2. Randomly guess k cluster Center locations. [guess cluster means]
3. Each datapoint finds out which Center it's closest to. (Thus each Center owns a set of datapoints.) [each cluster mean takes responsibility for the data closest to it]
4. Each Center finds the centroid of the points it owns. [recompute and iterate]

Moore, www.cs.cmu.edu/~awm/tutorials
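A minimal sketch of the loop above (Lloyd's algorithm) in base R; in practice one would just call the built-in kmeans(X, centers = k). Empty clusters and convergence checks are not handled here.

simple_kmeans = function(X, k, iters = 20) {
  centers = X[sample(nrow(X), k), , drop = FALSE]            # step 2: random initial guess
  for (it in 1:iters) {
    d = as.matrix(dist(rbind(centers, X)))[1:k, -(1:k)]      # k x n center-to-point distances
    owner = apply(d, 2, which.min)                           # step 3: each point's closest Center
    for (j in 1:k)                                           # step 4: move each Center to the centroid
      centers[j, ] = colMeans(X[owner == j, , drop = FALSE]) #         of the points it owns
  }
  list(centers = centers, cluster = owner)
}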
K-means issues
number of clusters?
initial guess?
hard clustering vs. soft clustering?
non-Multivariate Normal looking shapes?
[Figure from MacKay (Ch. 20, "An Example Inference Task: Clustering"): two panels, (a) and (b), with both axes running from 0 to 10.]
MacKay, http://www.inference.phy.cam.ac.uk/itila/Potter.html
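On the hard vs. soft clustering question above: a minimal sketch of soft assignments, where every point gets a responsibility for each cluster instead of a single owner. The spherical Gaussian kernel with a fixed sigma is an illustrative assumption; a fuller treatment would fit a mixture model by EM.

soft_assign = function(X, centers, sigma = 1) {
  k = nrow(centers)
  d2 = as.matrix(dist(rbind(centers, X)))[1:k, -(1:k)]^2   # squared distances to each cluster center
  w = exp(-d2 / (2 * sigma^2))                             # unnormalized Gaussian responsibilities
  t(w) / colSums(w)                                        # rows (points) now sum to 1 over clusters
}
# e.g. km = kmeans(X, 5); head(soft_assign(X, km$centers))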
