
Machine Learning

Dr. Faraz Akram

Riphah International University


Unsupervised learning

Clustering
Supervised learning

[Figure: examples with labels (label1, label3, label4, label5) feeding a model/predictor]

Supervised learning: given labeled examples

Unsupervised learning

Unsupervised learning: given data, i.e. examples, but no labels

Unsupervised learning: Clustering

[Figure: raw data -> extract features (one vector f1, f2, f3, ..., fn per example) -> group into classes/clusters]

No supervision: we're only given data and want to find natural groupings

What is Clustering
A grouping of data objects such that the objects within a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups
Examples in a cluster are very similar
Examples within different clusters are very different
Clustering example
Image segmentation:
Goal: Break up the image into meaningful or perceptually similar regions

K-Means clustering
An iterative clustering algorithm
Initialize: Pick K random points as cluster centers
Repeat:
1. Assign data points to closest cluster center
2. Change the cluster center to the average of its assigned points
Stop when no point assignments change
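
The loop above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the function name `kmeans`, the `seed` and `max_iters` parameters, and the toy data are all hypothetical, the initial centers are drawn from the data points themselves, and distance is taken to be Euclidean.

```python
import math
import random


def kmeans(points, k, seed=0, max_iters=100):
    """Cluster `points` (a list of equal-length tuples) into k groups."""
    rng = random.Random(seed)
    # Initialize: pick K random data points as the starting centers (assumption).
    centers = rng.sample(points, k)
    assignments = None
    for _ in range(max_iters):
        # Step 1: assign every point to its closest center.
        new_assignments = [
            min(range(k), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        # Stop when no point assignments change.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 2: move each center to the average of its assigned points.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # an empty cluster keeps its old center
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, assignments


# Toy data: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, assignments = kmeans(points, k=2)
print(centers, assignments)
```

Which group ends up as cluster 0 depends on the random start, so only the grouping itself is meaningful, not the label numbers.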

K-means: an example
K-means: Initialize centers randomly
K-means: assign points to nearest center
K-means: readjust centers
K-means: assign points to nearest center
K-means: readjust centers
K-means: assign points to nearest center
K-means: readjust centers
K-means: assign points to nearest center

No changes: Done
K-means

Iterate:
Assign/cluster each example to closest center
Recalculate centers as the mean of the points in a cluster

How do we do this?
K-means

Iterate:
Assign/cluster each example to closest center
iterate over each point:
- get distance to each cluster center
- assign to closest center (hard cluster)
Recalculate centers as the mean of the points in a cluster
What distance measure should we use?
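
One common answer is Euclidean distance, which is also what the distance columns in the medicine example in these slides correspond to. A small sketch of the assignment step under that assumption follows; the helper name `closest_center` is hypothetical.

```python
import math


def closest_center(point, centers):
    """Return the index of the center nearest to `point` (Euclidean distance)."""
    distances = [math.dist(point, c) for c in centers]
    return distances.index(min(distances))


# Two centers and four points, matching the medicine example's starting state.
centers = [(1.0, 1.0), (2.0, 1.0)]
points = [(1.0, 1.0), (2.0, 1.0), (4.0, 3.0), (5.0, 4.0)]
labels = [closest_center(p, centers) for p in points]
print(labels)  # [0, 1, 1, 1]
```

Other measures (for example Manhattan distance) could be swapped in by replacing `math.dist`, but then the mean is no longer guaranteed to be the best center for that measure.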


K-means

Iterate:
Assign/cluster each example to closest center
Recalculate centers as the mean of the points in
a cluster

Where are the cluster centers?


How do we calculate these?
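
A center is recalculated as the coordinate-wise mean of the points assigned to it. A minimal sketch, with the hypothetical helper name `mean_point` and three sample points:

```python
def mean_point(points):
    """Component-wise average of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


cluster = [(2.0, 1.0), (4.0, 3.0), (5.0, 4.0)]
center = mean_point(cluster)
print(center)  # roughly (3.67, 2.67)
```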


Example

We have 4 types of medicine, and each medicine has two features. Our goal is to group these objects into K=2 groups.

Medicine   Weight index (X)   pH (Y)
A          1                  1
B          2                  1
C          4                  3
D          5                  4

[Figure: scatter plot of medicines A to D]

Iteration-1
Initial value of centroids: Suppose we use medicine A and medicine B as the first centroids, C-1 = (1, 1) and C-2 = (2, 1).

Medicine   X   Y   Dist-C1   Dist-C2   Cluster
A          1   1   0         1         C-1
B          2   1   1         0         C-2
C          4   3   3.61      2.83      C-2
D          5   4   5         4.24      C-2

[Figure: scatter plot of the medicines with the initial centroids]

Recompute centroids: C-1 = (1, 1); C-2 = ((2+4+5)/3, (1+3+4)/3) = (3.67, 2.67)
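
The numbers in this iteration can be checked with a short script. It assumes Euclidean distance (which reproduces the distance columns of the table); the variable names are illustrative only.

```python
import math

medicines = {"A": (1, 1), "B": (2, 1), "C": (4, 3), "D": (5, 4)}
centroids = [(1, 1), (2, 1)]  # medicines A and B as the first centroids

clusters = {}
for name, xy in medicines.items():
    dists = [math.dist(xy, c) for c in centroids]
    label = dists.index(min(dists))  # 0 means C-1, 1 means C-2
    clusters.setdefault(label, []).append(xy)
    print(name, [round(d, 2) for d in dists], f"C-{label + 1}")

# New centroid of each cluster = mean of its members (rounded for display).
new_centroids = [
    tuple(round(sum(v) / len(clusters[j]), 2) for v in zip(*clusters[j]))
    for j in sorted(clusters)
]
print(new_centroids)  # [(1.0, 1.0), (3.67, 2.67)]
```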

Iteration-2

Medicine   X   Y   Dist-C1   Dist-C2   Cluster
A          1   1   0         3.14      C-1
B          2   1   1         2.36      C-1
C          4   3   3.61      0.47      C-2
D          5   4   5         1.89      C-2

[Figure: scatter plot of the medicines with the updated centroids]

Recompute centroids: C-1 = ((1+2)/2, (1+1)/2) = (1.5, 1); C-2 = ((4+5)/2, (3+4)/2) = (4.5, 3.5)
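
The sketch below starts from the centroids implied by the Iteration-1 cluster column, (1, 1) and the mean of B, C and D, repeats the assignment, and then checks what a third pass would do. Euclidean distance and the helper names `assign` and `recompute` are assumptions made for illustration.

```python
import math

points = {"A": (1, 1), "B": (2, 1), "C": (4, 3), "D": (5, 4)}
centroids = [(1.0, 1.0), (11 / 3, 8 / 3)]  # centroids after Iteration-1


def assign(centroids):
    """Label each medicine with the index of its nearest centroid."""
    return {
        name: min(range(len(centroids)), key=lambda j: math.dist(xy, centroids[j]))
        for name, xy in points.items()
    }


def recompute(labels, k=2):
    """Mean of the points assigned to each of the k centroids."""
    new = []
    for j in range(k):
        members = [points[n] for n, lab in labels.items() if lab == j]
        new.append(tuple(sum(v) / len(members) for v in zip(*members)))
    return new


labels_2 = assign(centroids)     # Iteration-2 assignment: A, B -> C-1; C, D -> C-2
centroids = recompute(labels_2)  # (1.5, 1.0) and (4.5, 3.5)
labels_3 = assign(centroids)     # a third pass with the new centroids
print(centroids, labels_3 == labels_2)
```

The third assignment equals the second, so under these assumptions the procedure stops with clusters {A, B} and {C, D}.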

K-means variations/parameters
Initial (seed) cluster centers

Convergence
A fixed number of iterations
Partitions unchanged
Cluster centers don't change
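
The three convergence rules can be combined into one check. This is only a sketch: the function name `should_stop` and the `max_iters` and `tol` parameters are hypothetical, and "centers don't change" is interpreted as "no center moved by more than a small tolerance".

```python
import math


def should_stop(iteration, old_labels, new_labels, old_centers, new_centers,
                max_iters=100, tol=1e-6):
    """Return True if any of the three stopping rules is met."""
    if iteration >= max_iters:    # a fixed number of iterations
        return True
    if old_labels == new_labels:  # partitions unchanged
        return True
    # cluster centers don't change (largest movement is within the tolerance)
    shift = max(math.dist(a, b) for a, b in zip(old_centers, new_centers))
    return shift <= tol


# Labels changed and the center moved, so keep iterating.
print(should_stop(1, [0, 1], [1, 1], [(0.0, 0.0)], [(5.0, 5.0)]))  # False
```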
K-means: Initialize centers randomly

What would happen here?

Seed selection ideas?


Seed choice
Results can vary drastically based on random seed selection

Some seeds can result in poor convergence rate, or convergence to sub-optimal clusterings

Common choices
Random point in feature space
Random point from dataset
Points least similar to any existing center (furthest centers heuristic)
Try out multiple starting points
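
"Try out multiple starting points" can be sketched as running K-means several times and keeping the best run. The slides do not say how to compare runs; the sketch below assumes the sum of squared distances from each point to its center (lower is better). The names `kmeans`, `sse`, `best_of`, `restarts` and `iters` are hypothetical.

```python
import math
import random


def kmeans(points, k, rng, iters=50):
    """Plain K-means from one random start; returns (centers, labels)."""
    centers = rng.sample(points, k)  # seeds: random points from the dataset

    def nearest(p):
        return min(range(k), key=lambda j: math.dist(p, centers[j]))

    for _ in range(iters):
        labels = [nearest(p) for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = tuple(sum(v) / len(members) for v in zip(*members))
    return centers, [nearest(p) for p in points]


def sse(points, centers, labels):
    """Sum of squared distances from each point to its assigned center."""
    return sum(math.dist(p, centers[lab]) ** 2 for p, lab in zip(points, labels))


def best_of(points, k, restarts=10, seed=0):
    """Run K-means `restarts` times and keep the run with the lowest SSE."""
    rng = random.Random(seed)
    runs = [kmeans(points, k, rng) for _ in range(restarts)]
    return min(runs, key=lambda run: sse(points, run[0], run[1]))


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (20, 0), (20, 1), (21, 0)]
single = best_of(points, k=3, restarts=1)
best = best_of(points, k=3, restarts=10)
print(sse(points, *single), sse(points, *best))
```

With the same seed, the single run is also the first of the ten restarts, so the best-of-ten score can never be worse than the single-run score.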
Furthest centers heuristic

μ1 = pick random point

for i = 2 to K:
    μi = point that is furthest from any previous centers
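
A runnable sketch of this heuristic follows. It reads "furthest from any previous centers" as "the point whose distance to its nearest already-chosen center is largest", which is an interpretation rather than something the slide states; the function name `furthest_centers` and the toy data are hypothetical.

```python
import math
import random


def furthest_centers(points, k, seed=0):
    """Pick k seed centers: one at random, then repeatedly the furthest point."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]  # first center: a random point
    while len(centers) < k:
        # distance from a point to its nearest already-chosen center
        def gap(p):
            return min(math.dist(p, c) for c in centers)

        centers.append(max(points, key=gap))  # next center: the furthest point
    return centers


# Three small groups; the heuristic should pick one seed from each.
points = [(0, 0), (0, 1), (10, 10), (10, 11), (20, 0), (21, 0)]
centers = furthest_centers(points, k=3)
print(centers)
```

After the first random pick every later choice is deterministic, and a single far-away outlier would always be selected.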
K-means: Initialize furthest from centers

Pick a random point for the first center


K-means: Initialize furthest from centers

What point will be chosen next?


K-means: Initialize furthest from centers

Furthest point from center

What point will be chosen next?


K-means: Initialize furthest from centers

Furthest point from center

What point will be chosen next?


K-means: Initialize furthest from centers

Furthest point from center

Any issues/concerns with this approach?


Furthest points concerns

If k = 4, which points will get chosen?


Furthest points concerns

If we do a number of trials, will we get different centers?

Furthest points concerns

Doesn't deal well with outliers


K-means++

μ1 = pick random point

for k = 2 to K:
    for i = 1 to N:
        si = min d(xi, μ1…k-1) // smallest distance to any center

    μk = randomly pick point proportionate to s

How does this help?


K-means++
μ1 = pick random point

for k = 2 to K:
    for i = 1 to N:
        si = min d(xi, μ1…k-1) // smallest distance to any center

    μk = randomly pick point proportionate to s

- Makes it possible to select other points
- if #points >> #outliers, we will pick good points
- Makes it non-deterministic, which will help with random runs
- Nice theoretical guarantees!
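
A sketch of this seeding step in Python is given below. One deliberate difference is labeled in the code: the slide samples "proportionate to s", while the standard K-means++ formulation samples proportionally to the squared distance, and the sketch uses the squared version. The function name `kmeans_pp_init` and the toy data are hypothetical, and the points are assumed to be distinct with K smaller than the number of points.

```python
import math
import random


def kmeans_pp_init(points, k, seed=0):
    """Choose k seed centers with distance-weighted random sampling."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]  # first center: a random point
    for _ in range(1, k):
        # s_i = smallest distance from point i to any center chosen so far
        s = [min(math.dist(p, c) for c in centers) for p in points]
        # Assumption: weight by s_i squared (standard K-means++);
        # the slide's pseudocode weights by s_i itself.
        weights = [d ** 2 for d in s]
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers


points = [(0, 0), (0, 1), (10, 10), (10, 11), (20, 0), (21, 0)]
centers = kmeans_pp_init(points, k=3)
print(centers)
```

Points that are already centers get weight zero, so they cannot be picked twice, while a lone outlier is likely but not certain to be chosen.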
Pros
Easy to use
Good initial method

Cons
Need to know K
Problems when clusters are of different sizes or densities
Can't handle outliers well

Summary
Definition of clustering
Difference between supervised and unsupervised learning.
Finding labels for each datum.
Clustering algorithms
K-means
There are always K clusters.
Find new mean value.
Find new clusters.
Stop when nothing changes in clusters (or changes are less than a very small value).
