
CS109/Stat121/AC209/E-109

Data Science
Classification and Clustering
Hanspeter Pfister & Joe Blitzstein
pfister@seas.harvard.edu / blitzstein@stat.harvard.edu

[Figure: least-squares geometry, showing y, the column space of X, and the residual.]
This Week
HW2 due 10/3 at 11:59 pm.

Reminder: for this and all assignments, make sure to check through your submission, both before submitting and by re-downloading from the dropbox and then looking through the file.

No homeworks will be accepted more than 2 days late, since a maximum of 2 late days can be applied to an assignment.

Friday lab 10-11:30 am in MD G115


Classification vs. Clustering

Classification is supervised learning. Clustering is unsupervised learning.

Classification has pre-defined classes and training data with labels: use the x's to predict the y's.

Clustering has no pre-defined classes: group data points into clusters to try to find structure in the data.
Classification vs. Clustering

http://www.kontagent.com/kaleidoscope/2013/01/09/kscope-profile-what-george-clooney-can-teach-you-about-ltv-and-machine-learning/
Discriminative vs. Generative Classifiers

What to model and what not to model?

discriminative: directly model p(y|x)


generative: give a full model, p(x,y) = p(x)p(y|x) = p(y)p(x|y)
Classification via Logistic Regression
logit(p) = β_0 + β_1 x_1 + ... + β_{k-1} x_{k-1}

where p is the probability of being in group 1 (logistic regression can also be extended to the case of more than 2 groups)

modeling approach: model p(y|x), don't model p(x)

http://cvxopt.org/examples/book/logreg.html
[Figures: a linear decision boundary and a nonlinear decision boundary.]

Conway and White, Machine Learning for Hackers


glm
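To make the glm reference concrete, here is a minimal sketch of fitting a logistic classifier in R; the simulated data frame and coefficient values are made up for illustration, not taken from the slides.

n = 200
df = data.frame(x1 = rnorm(n), x2 = rnorm(n))
df$y = rbinom(n, 1, plogis(-0.5 + 2*df$x1 - df$x2))    # labels simulated from an assumed logistic model
fit = glm(y ~ x1 + x2, data = df, family = binomial)   # logit link is the default for the binomial family
phat = predict(fit, type = "response")                 # fitted probabilities P(y = 1 | x)
yhat = as.numeric(phat > 0.5)                          # classify with a 0.5 threshold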
Generative Models
P(Y=1 | X=x) = f(x|Y=1) P(Y=1) / [ f(x|Y=1) P(Y=1) + f(x|Y=0) P(Y=0) ]

(by Bayes' rule)

Then can model the densities f(x|Y=1), f(x|Y=0).


Gaussian and Linear Classifiers
Take the conditional X|Y to be Multivariate Normal. This leads to a quadratic decision boundary. If the covariance matrices for the two groups are assumed equal, then we get a linear decision boundary.
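These two cases are quadratic and linear discriminant analysis (QDA and LDA). A minimal sketch in R using the MASS package, run on the built-in iris data restricted to two species (an illustration, not from the slides):

library(MASS)                                                   # provides lda() and qda()
d = subset(iris, Species != "setosa")
d$Species = droplevels(d$Species)
fit_qda = qda(Species ~ Sepal.Length + Sepal.Width, data = d)   # separate covariances: quadratic boundary
fit_lda = lda(Species ~ Sepal.Length + Sepal.Width, data = d)   # shared covariance: linear boundary
table(predict(fit_lda, d)$class, d$Species)                     # in-sample confusion matrix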
Naive Bayes
Naive conditional independence assumption:

f_j(x_1, ..., x_d) = f_j1(x_1) f_j2(x_2) ... f_jd(x_d)

Often unrealistic, but it still may be useful, especially since it leads to a drastic reduction in the number of parameters to estimate.
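As a concrete illustration (not from the slides), a naive Bayes classifier can be fit in R with the e1071 package, assuming it is installed; numeric features are modeled with one univariate Gaussian per feature per class.

library(e1071)                                   # provides naiveBayes()
fit_nb = naiveBayes(Species ~ ., data = iris)    # one univariate density per feature per class
pred = predict(fit_nb, iris)
mean(pred == iris$Species)                       # in-sample accuracy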
[Excerpt from Domingos, "A Few Useful Things to Know About Machine Learning," on combating overfitting, with Figure 2: test-set accuracy (%) vs. number of examples. Caption: "Naive Bayes can outperform a state-of-the-art rule learner (C4.5rules) even when the true classifier is a set of rules."]

Domingos, http://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf
What kind of regression model should we use?

A Regression Problem: y = f(x) + noise. Can we learn f from this data? Let's consider three methods.
Moore, www.cs.cmu.edu/~awm/tutorials
Linear Regression (linear in x)

Moore, www.cs.cmu.edu/~awm/tutorials
Quadratic Regression (quadratic in x)

Moore, www.cs.cmu.edu/~awm/tutorials
Join-the-dots (connect the dots)
Also known as piecewise linear nonparametric regression, if that makes you feel better.

Moore, www.cs.cmu.edu/~awm/tutorials
What do we really want?


Moore, www.cs.cmu.edu/~awm/tutorials
Why not choose the method with the best fit to the data?

How well are you going to predict future data drawn from the same distribution?


Underfitting vs. Overfitting


Shalizi, http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/ADAfaEPoV.pdf
plot(x,y2)
curve(7*x^2-0.5*x,add=TRUE,col="grey")

Figure 3.2: Scatter-plot showing sample data and the true, quadratic regression curve
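The slides do not show how x and y2 were generated; a plausible reconstruction, assuming x is uniform on [-2, 2] and the noise around the true quadratic curve is Gaussian (the sample size and noise level here are guesses):

n = 300                                          # sample size: a guess
x = runif(n, -2, 2)
y2 = 7*x^2 - 0.5*x + rnorm(n, sd = 5)            # true curve plus noise; sd = 5 is a guess
plot(x, y2)
curve(7*x^2 - 0.5*x, add = TRUE, col = "grey")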
In-Sample MSE

[Plot: in-sample mean squared error (log scale) vs. polynomial degree.]

Shalizi, http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/ADAfaEPoV.pdf
mse.q = vector(length=10)
for (degree in 0:9) {   # x and y2 as in the scatterplot above
  fit = if (degree == 0) lm(y2 ~ 1) else lm(y2 ~ poly(x, degree))
  mse.q[degree+1] = mean(residuals(fit)^2) }   # in-sample MSE for each degree
Out-Of-Sample MSE

[Plot: out-of-sample mean squared error (log scale) vs. polynomial degree.]

Shalizi, http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/ADAfaEPoV.pdf
gmse.q = vector(length=10)
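A hedged sketch of how gmse.q (declared above) is presumably filled in: simulate a fresh sample from the same model and measure each fitted polynomial's mean squared error on it (the new-sample details below are assumptions, not from the slides).

x.new = runif(length(x), -2, 2)                           # fresh data from the same design (assumed)
y2.new = 7*x.new^2 - 0.5*x.new + rnorm(length(x), sd = 5)
for (degree in 0:9) {
  fit = if (degree == 0) lm(y2 ~ 1) else lm(y2 ~ poly(x, degree))
  pred = predict(fit, newdata = data.frame(x = x.new))
  gmse.q[degree+1] = mean((y2.new - pred)^2)              # out-of-sample (generalization) MSE
}
plot(0:9, gmse.q, log = "y", xlab = "polynomial degree", ylab = "mean squared error")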
kNN (k Nearest Neighbors)

http://scott.fortmann-roe.com/docs/BiasVariance.html

Choice of k is another bias-variance tradeoff.
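For concreteness (not from the slides), kNN classification in R can be done with the class package; the train/test split and the choice k = 5 below are arbitrary.

library(class)                                            # provides knn()
idx = sample(nrow(iris), 100)
pred = knn(train = iris[idx, 1:4], test = iris[-idx, 1:4],
           cl = iris$Species[idx], k = 5)                 # majority vote among the 5 nearest neighbors
mean(pred == iris$Species[-idx])                          # test-set accuracy; try several values of k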


kNN in Collaborative Filtering
This was the most common tool for recommendation systems when the Netflix Prize began, and it remained an integral tool for most of the successful teams.

r̂_ui = ( Σ_{j ∈ N(i;u)} s_ij r_uj ) / ( Σ_{j ∈ N(i;u)} s_ij )

Both user-oriented and item-oriented versions are useful.

How should k be chosen? How should the weights be chosen?
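A small sketch of the formula above in R, in its item-oriented form: to predict user u's rating of item i, average u's ratings of the k items most similar to i, weighted by similarity. The toy ratings matrix and the cosine similarity over co-rated users are illustrative assumptions, not from the slides.

R = matrix(c(5, 4, NA,
             4, NA, 3,
             1, 2, 2,
             NA, 5, 4), nrow = 4, byrow = TRUE)           # users x items; NA = unrated
cos_sim = function(a, b) { ok = !is.na(a) & !is.na(b)
  sum(a[ok]*b[ok]) / sqrt(sum(a[ok]^2) * sum(b[ok]^2)) }
predict_rating = function(R, u, i, k = 2) {
  s = sapply(1:ncol(R), function(j) if (j == i) NA else cos_sim(R[, i], R[, j]))
  cand = which(!is.na(R[u, ]) & !is.na(s))                               # items user u has rated
  nbrs = cand[order(s[cand], decreasing = TRUE)][1:min(k, length(cand))] # N(i; u): top-k neighbors
  sum(s[nbrs] * R[u, nbrs]) / sum(s[nbrs])                               # weighted average, as in the formula
}
predict_rating(R, u = 1, i = 3)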


A Geometry Puzzle
(Steele, Ch. 4: On Geometry and Sums of Squares)

In R^2, one places a unit circle in each quadrant of the square [-2, 2]^2. A non-overlapping circle of maximal radius is then centered at the origin.

Fig. 4.1. This arrangement of 5 = 2^2 + 1 circles in [-2, 2]^2 has a natural generalization to an arrangement of 2^d + 1 spheres in [-2, 2]^d. This general arrangement then provokes a question which a practical person might find perplexing or even silly. Does the central sphere stay inside the box [-2, 2]^d for all values of d?
Steele, The Cauchy-Schwarz Master Class
For dimensions 10 and higher, it extends outside the box. In fact, as the dimension increases, the % of its volume inside the box goes to 0.
Curse of Dimensionality
For n indep. Unif(-1,1) r.v.s, what is the probability
that the random vector is in the unit ball?

n     probability
2     0.79
3     0.52
6     0.08
10    0.002
15    0.00001

In many high-dimensional settings, the vast majority of data will be near the boundaries, not in the center.
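A quick Monte Carlo check of the table above (a sketch; the number of replications is arbitrary, and the smallest probabilities need more replications to estimate well):

in_ball_prob = function(n, reps = 1e5) {
  x = matrix(runif(reps * n, -1, 1), nrow = reps)   # reps points uniform in [-1, 1]^n
  mean(rowSums(x^2) <= 1)                           # fraction landing inside the unit ball
}
sapply(c(2, 3, 6, 10, 15), in_ball_prob)            # roughly 0.79, 0.52, 0.08, 0.002, 0.00001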
Blessing of Dimensionality
In statistics, curse of dimensionality is often used to refer
to the difficulty of fitting a model when many possible
predictors are available. But this expression bothers me,
because more predictors is more data, and it should not be
a curse to have more data....

With multilevel modeling, there is no curse of


dimensionality. When many measurements are taken on
each observation, these measurements can themselves be
grouped. Having more measurements in a group gives us
more data to estimate group-level parameters (such as the
standard deviation of the group effects and also coefficients
for group-level predictors, if available).
In all the realistic curse-of-dimensionality problems I've seen, the dimensions (the predictors) have a structure. The data don't sit in an abstract K-dimensional space; they are units with K measurements that have names, orderings, etc.

Gelman, http://andrewgelman.com/2004/10/27/the_blessing_of/
k-means Clustering

Lossy Compression (Moore): suppose you transmit the coordinates of points drawn from this dataset, you can install decoding software at the receiver, and you are only allowed to send a few bits per point, so it will be a lossy transmission. Loss = Sum Squared Error between the decoded coords and the original coords. What encoder/decoder will lose the least information?

K-means
1. Ask user how many clusters they'd like (e.g. k=5).
2. Randomly guess k cluster Center locations. [guess cluster means]
3. Each datapoint finds out which Center it's closest to. (Thus each Center owns a set of datapoints.) [each cluster mean takes responsibility for the data closest to it]
4. Each Center finds the centroid of the points it owns. [recompute and iterate]

Moore, www.cs.cmu.edu/~awm/tutorials
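A minimal sketch of the loop above (Lloyd's algorithm) in base R; in practice one would just call the built-in kmeans(X, centers = k). Empty clusters and convergence checks are not handled here.

simple_kmeans = function(X, k, iters = 20) {
  centers = X[sample(nrow(X), k), , drop = FALSE]            # step 2: random initial guess
  for (it in 1:iters) {
    d = as.matrix(dist(rbind(centers, X)))[1:k, -(1:k)]      # k x n center-to-point distances
    owner = apply(d, 2, which.min)                           # step 3: each point's closest Center
    for (j in 1:k)                                           # step 4: move each Center to the centroid
      centers[j, ] = colMeans(X[owner == j, , drop = FALSE]) #         of the points it owns
  }
  list(centers = centers, cluster = owner)
}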
K-means issues
number of clusters?
initial guess?
hard clustering vs. soft clustering?
non-Multivariate Normal looking shapes?
[Figure from MacKay (Ch. 20, "An Example Inference Task: Clustering"): two panels, (a) and (b), with both axes running from 0 to 10.]
MacKay, http://www.inference.phy.cam.ac.uk/itila/Potter.html
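On the hard vs. soft clustering question above: a minimal sketch of soft assignments, where every point gets a responsibility for each cluster instead of a single owner. The spherical Gaussian kernel with a fixed sigma is an illustrative assumption; a fuller treatment would fit a mixture model by EM.

soft_assign = function(X, centers, sigma = 1) {
  k = nrow(centers)
  d2 = as.matrix(dist(rbind(centers, X)))[1:k, -(1:k)]^2   # squared distances to each cluster center
  w = exp(-d2 / (2 * sigma^2))                             # unnormalized Gaussian responsibilities
  t(w) / colSums(w)                                        # rows (points) now sum to 1 over clusters
}
# e.g. km = kmeans(X, 5); head(soft_assign(X, km$centers))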
