
Linear Algebra for Machine Learning

Main Reference
University of Washington
CSS 581 - Introduction to Machine Learning
Instructor: J Jeffry Howbert
Lecture 2: Math Essentials

Probability
Linear algebra
Linear algebra applications

1) Operations on or between vectors and matrices


2) Coordinate transformations
3) Dimensionality reduction
4) Linear regression
5) Solution of linear systems of equations
6) Many others

Applications 1) to 4) are directly relevant to this course. Today we'll start with 1).

Jeff Howbert Introduction to Machine Learning Winter 2014 #


Why vectors and matrices?

Most common form of data organization for machine learning is a 2D array, where
  rows represent samples (records, items, datapoints)
  columns represent attributes (features, variables)
Natural to think of each sample as a vector of attributes, and whole array as a matrix

Refund   Marital Status   Taxable Income   Cheat
Yes      Single           125K             No
No       Married          100K             No
No       Single           70K              No
Yes      Married          120K             No
No       Divorced         95K              Yes
No       Married          60K              No
Yes      Divorced         220K             No
No       Single           85K              Yes
No       Married          75K              No
No       Single           90K              Yes

(each row of the table is a vector; the table as a whole is a matrix)



Lecture 8: Regression

Linear Regression
Regression Trees



slide thanks to Greg Shakhnarovich (CS195-5, Brown Univ., 2006)



Loss function

Suppose target labels come from set $Y$
  Binary classification: $Y = \{0, 1\}$
  Regression: $Y = \mathbb{R}$ (real numbers)
A loss function maps decisions to costs:
  $L(y, \hat{y})$ defines the penalty for predicting $\hat{y}$ when the true value is $y$.
Standard choice for classification: 0/1 loss (same as misclassification error)
  $L_{0/1}(y, \hat{y}) = \begin{cases} 0 & \text{if } y = \hat{y} \\ 1 & \text{otherwise} \end{cases}$
Standard choice for regression: squared loss
  $L(y, \hat{y}) = (y - \hat{y})^2$
Least squares linear fit to data

Most popular estimation method is least squares:
  Determine linear coefficients $w$ that minimize sum of squared loss (SSL).
  Use standard (multivariate) differential calculus:
    differentiate SSL with respect to $w$
    set each partial derivative to zero
    solve for each $w_i$
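In one dimension:
  $SSL = \sum_{j=1}^{N} (y_j - (w_0 + w_1 x_j))^2$, where $N$ = number of samples
  $w_1 = \frac{\mathrm{cov}[x, y]}{\mathrm{var}[x]}$, $w_0 = \bar{y} - w_1 \bar{x}$, where $\bar{x}, \bar{y}$ = means of training $x, y$
  $\hat{y}_t = w_0 + w_1 x_t$ for test sample $x_t$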
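The one-dimensional closed form above can be sketched in a few lines of Python. This is a minimal illustration, not code from the course: the function names `fit_line` and `predict` and the toy data are assumptions made for the example.

```python
def fit_line(xs, ys):
    """Closed-form 1-D least squares; returns (w0, w1)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # cov[x, y] and var[x]; the 1/N factors cancel in the ratio
    cov_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    var_x = sum((x - x_mean) ** 2 for x in xs)
    w1 = cov_xy / var_x
    w0 = y_mean - w1 * x_mean
    return w0, w1


def predict(w0, w1, x_t):
    """Prediction for a test sample x_t."""
    return w0 + w1 * x_t


# Toy data (made up) lying exactly on y = 1 + 2x
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w0, w1 = fit_line(xs, ys)
print(w0, w1, predict(w0, w1, 4.0))
```

Because the toy points lie exactly on a line, the fit should recover that line's intercept and slope up to floating-point error.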
Least squares linear fit to data

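Multiple dimensions
  To simplify notation and derivation, add a new feature $x_0 = 1$ to feature vector $\mathbf{x}$:
    $\hat{y} = w_0 \cdot 1 + \sum_{i=1}^{d} w_i x_i = \mathbf{w} \cdot \mathbf{x}$
  Calculate SSL and determine $\mathbf{w}$:
    $SSL = \sum_{j=1}^{N} \left( y_j - \sum_{i=0}^{d} w_i x_{ji} \right)^2 = (\mathbf{y} - \mathbf{X}\mathbf{w})^T (\mathbf{y} - \mathbf{X}\mathbf{w})$
    $\mathbf{y}$ = vector of all training responses $y_j$
    $\mathbf{X}$ = matrix of all training samples $\mathbf{x}_j$ (so $x_{ji}$ is feature $i$ of sample $j$)
    $\mathbf{w} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$
    $\hat{y}_t = \mathbf{w} \cdot \mathbf{x}_t$ for test sample $\mathbf{x}_t$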
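As a rough sketch of the multi-dimensional solution, the normal equations can be solved with the Python standard library alone. The helper names (`transpose`, `matmul`, `solve`, `fit_least_squares`, `predict`) and the toy data are assumptions for illustration. The sketch solves the linear system $(X^T X) w = X^T y$ by Gauss-Jordan elimination rather than forming the inverse explicitly, and it assumes $X^T X$ is invertible.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def solve(a, b):
    """Solve a w = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # pick the row with the largest entry in this column as pivot
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def fit_least_squares(samples, y):
    """Return w solving (X^T X) w = X^T y, with x0 = 1 prepended to each sample."""
    X = [[1.0] + list(s) for s in samples]
    Xt = transpose(X)
    XtX = matmul(Xt, X)
    Xty = [sum(xi * yi for xi, yi in zip(row, y)) for row in Xt]
    return solve(XtX, Xty)


def predict(w, x_t):
    """Prediction w . x_t, where x_t is given without the leading 1."""
    return w[0] + sum(wi * xi for wi, xi in zip(w[1:], x_t))


# Toy data (made up) generated from y = 1 + 2*x1 - 3*x2
samples = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
y = [1.0, 3.0, -2.0, 0.0, 2.0]
w = fit_least_squares(samples, y)
print(w, predict(w, (3.0, 2.0)))
```

Solving the system directly is usually preferred over computing the inverse because it does less work and tends to be better behaved numerically.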
Least squares linear fit to data



Lecture 9: Recommendation Systems



Netflix Viewing Recommendations
Recommender Systems
DOMAIN: some field of activity where users buy, view,
consume, or otherwise experience items

PROCESS:
1. Users provide ratings on items they have experienced
2. Take all < user, item, rating > data and build a predictive model
3. For a user who hasn't experienced a particular item, use model to predict how well they will like it (i.e. predict rating)
Roles of Recommender Systems
Help users deal with paradox of choice

Allow online sites to:


Increase likelihood of sales
Retain customers by providing positive search experience

Considered essential in operation of:


Online retailing, e.g. Amazon, Netflix, etc.
Social networking sites
Amazon.com Product Recommendations
Social Network Recommendations
Recommendations on essentially every category of
interest known to mankind
Friends
Groups
Activities
Media (TV shows, movies, music, books)
News stories
Ad placements
All based on connections in underlying social network
graph, and the expressed likes and dislikes of yourself
and your connections
Types of Recommender Systems
Base predictions on either:

content-based approach
explicit characteristics of users and items

collaborative filtering approach


implicit characteristics based on similarity of users' preferences to those of other users
The Netflix Prize Contest
GOAL: use training data to build a recommender system, which, when applied to qualifying data, improves error rate by 10% relative to Netflix's existing system

PRIZE: first team to 10% wins $1,000,000


Annual Progress Prizes of $50,000 also possible
The Netflix Prize Contest

PARTICIPATION:
51051 contestants on 41305 teams from 186 different
countries
44014 valid submissions from 5169 different teams
The Netflix Prize Data
Netflix released three datasets
480,189 users (anonymous)
17,770 movies
ratings on integer scale 1 to 5

Training set: 99,072,112 < user, movie > pairs with ratings
Probe set: 1,408,395 < user, movie > pairs with ratings
Qualifying set of 2,817,131 < user, movie > pairs with no
ratings
Model Building and Submission Process
training set (99,072,112 ratings) and probe set (1,408,395 ratings): ratings known
  used for tuning and validating the MODEL
qualifying set (ratings unknown): MODEL makes predictions, which are scored on two subsets
  quiz set (1,408,342 pairs): RMSE on public leaderboard
  test set (1,408,789 pairs): RMSE kept secret for final scoring
Why the Netflix Prize Was Hard

Massive dataset
Very sparse matrix
  only 1.2% occupied
Extreme variation in number of ratings per user
Statistical properties of qualifying and probe sets different from training set

[Figure: user x movie rating matrix, users 1 to 480189 by movies 1 to 17770, with only a few cells filled in]
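The 1.2% figure can be checked from the dataset sizes quoted earlier. The short calculation below is a sketch that assumes the occupied cells are the training-set and probe-set ratings combined; the variable names are chosen for the example.

```python
n_users = 480_189
n_movies = 17_770
n_ratings = 99_072_112 + 1_408_395   # training set + probe set (assumed to be all known ratings)

density = n_ratings / (n_users * n_movies)   # fraction of matrix cells that hold a rating
avg_per_user = n_ratings / n_users           # mean number of ratings per user

print(f"occupied fraction: {density:.2%}")
print(f"mean ratings per user: {avg_per_user:.0f}")
```

The mean hides the extreme variation mentioned on the slide: some users rated only a handful of movies and others rated thousands.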
Dealing with Size of the Data
MEMORY:
2 GB bare minimum for common algorithms
4+ GB required for some algorithms
need 64-bit machine with 4+ GB RAM if serious
SPEED:
Program in languages that compile to fast machine code
64-bit processor
Exploit low-level parallelism in code (SIMD on Intel x86/x64)
Common Types of Algorithms
Global effects
Nearest neighbors
Matrix factorization
Restricted Boltzmann machine
Clustering
Etc.
Nearest Neighbors in Action

[Figure: user x movie rating matrix. To predict the unknown rating (?) for user 2, ratings from a user with identical preferences get strong weight, and ratings from a user with similar preferences get moderate weight.]
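One way to turn the weighting idea in the figure into code is a similarity-weighted average over other users who rated the target movie. The sketch below is only an illustration: the similarity measure (1 / (1 + mean squared difference) over co-rated movies), the function names, and the toy ratings are all assumptions, not the method used in the lecture or the contest.

```python
def similarity(a, b):
    """Hypothetical similarity: 1 / (1 + mean squared difference) over co-rated movies."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    msd = sum((a[m] - b[m]) ** 2 for m in common) / len(common)
    return 1.0 / (1.0 + msd)


def predict(ratings, user, movie):
    """Similarity-weighted average of other users' ratings for the movie."""
    num = 0.0
    den = 0.0
    for other, their in ratings.items():
        if other == user or movie not in their:
            continue
        w = similarity(ratings[user], their)
        num += w * their[movie]
        den += w
    return num / den if den else None


# Toy ratings (made up)
ratings = {
    "user 2": {"movie 1": 2, "movie 2": 3, "movie 3": 3},
    "user 5": {"movie 1": 2, "movie 2": 3, "movie 3": 3, "movie 4": 5},  # identical on shared movies
    "user 8": {"movie 1": 3, "movie 2": 3, "movie 3": 4, "movie 4": 4},  # similar, not identical
}
prediction = predict(ratings, "user 2", "movie 4")
print(prediction)
```

Here the user with identical ratings gets weight 1.0 and the merely similar user gets weight 0.6, so the prediction lands closer to the identical user's rating.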
Matrix Factorization in Action

[Figure: the sparse user x movie rating matrix (users 1 to 480189, movies 1 to 17770) is decomposed by a reduced-rank singular value decomposition (sort of) into two factor matrices: a movie matrix (factors 1 to 5 by movies 1 to 17770) and a user matrix (users 1 to 480189 by factors 1 to 5), each filled with < a bunch of numbers >.]
Matrix Factorization in Action

[Figure: the movie factor matrix (factors 1 to 5 by movies 1 to 17770) and the user factor matrix (users 1 to 480189 by factors 1 to 5) are used to fill in the rating matrix. Multiply and add factor vectors (dot product) for the desired < user, movie > prediction, marked (?) in the rating matrix.]
Netflix Prize Progress: Major Milestones
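[Figure: bar chart of RMS error on the quiz set (axis from 0.85 to 1.05) for the trivial algorithm, Cinematch, the 2007 Progress Prize, the 2008 Progress Prize, and the Grand Prize, with improvement labels 8.43%, 9.44%, 10.09%, and 10.00%. An annotation marks "me, starting June, 2008".]

          2007 Progress Prize   2008 Progress Prize    Grand Prize
DATE:     Oct. 2007             Oct. 2008              July 2009
WINNER:   BellKor               BellKor in BigChaos    ???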
July 26, 18:43 GMT Contest Over!
Final Test Scores
Netflix Prize: What Did I Learn?
Several new machine learning algorithms
A lot about optimizing predictive models
Stochastic gradient descent
Regularization
A lot about optimizing code for speed and memory usage
Some linear algebra
Enough to come up with one original approach that actually
worked

Money and fame make people crazy, in both good ways and bad

COST: about 1000 hours of my free time over 13 months


Lecture 10: Collaborative Filtering

Matrix Factorization Approach



Stochastic gradient descent

Stochastic (definition):
1. involving a random variable
2. involving chance or probability; probabilistic



Stochastic gradient descent

Application to training a machine learning model:


1. Choose one sample from training set
2. Calculate loss function for that single sample
3. Calculate gradient from loss function
4. Update model parameters a single step based on
gradient and learning rate
5. Repeat from 1) until stopping criterion is satisfied
Typically entire training set is processed multiple
times before stopping.
Order in which samples are processed can be
fixed or random.
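The five steps above can be sketched for a simple model. The example below applies them to a one-feature linear model with squared loss; the function name `sgd_linear`, the hyperparameter values, the fixed number of passes used as the stopping criterion, and the toy data are assumptions for illustration.

```python
import random


def sgd_linear(xs, ys, learning_rate=0.05, epochs=200, seed=0):
    """Fit y = w0 + w1*x by stochastic gradient descent on squared loss."""
    rng = random.Random(seed)
    w0, w1 = 0.0, 0.0
    order = list(range(len(xs)))
    for _ in range(epochs):            # entire training set is processed multiple times
        rng.shuffle(order)             # random processing order
        for j in order:                # 1. choose one sample
            error = ys[j] - (w0 + w1 * xs[j])   # 2. loss for this sample is error**2
            grad_w0 = -2.0 * error              # 3. gradient of the loss
            grad_w1 = -2.0 * error * xs[j]
            w0 -= learning_rate * grad_w0       # 4. single step against the gradient
            w1 -= learning_rate * grad_w1
    return w0, w1                      # 5. stop after a fixed number of passes


# Toy data (made up) lying exactly on y = 1 + 2x
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w0, w1 = sgd_linear(xs, ys)
print(w0, w1)
```

With noise-free data the estimates should approach the true intercept and slope; with noisy data they would hover near the least squares solution instead, with the amount of jitter depending on the learning rate.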
Matrix factorization in action

[Figure: the training data (sparse user x movie rating matrix, users 1 to 480189, movies 1 to 17770) goes through factorization (the training process) to produce two factor matrices: a movie matrix (factors 1 to 5 by movies 1 to 17770) and a user matrix (users 1 to 480189 by factors 1 to 5), each filled with < a bunch of numbers >.]



Matrix factorization in action

[Figure: the movie factor matrix (factors 1 to 5 by movies 1 to 17770) and the user factor matrix (users 1 to 480189 by factors 1 to 5) are used to fill in the rating matrix. Multiply and add factor vectors (dot product) for the desired < user, movie > prediction, marked (?) in the rating matrix.]



Matrix factorization

Notation
Number of users = I
Number of items = J
Number of factors per user / item = F
User of interest = i
Item of interest = j
Factor index = f

User matrix U dimensions = I x F


Item matrix V dimensions = J x F



Matrix factorization

Prediction $\hat{r}_{ij}$ for user, item pair i, j :
  $\hat{r}_{ij} = \sum_{f=1}^{F} U_{if} V_{jf}$

Loss for prediction where true rating is $r_{ij}$ :
  $L(r_{ij}, \hat{r}_{ij}) = (r_{ij} - \hat{r}_{ij})^2 = \left( r_{ij} - \sum_{f=1}^{F} U_{if} V_{jf} \right)^2$

Using squared loss; other loss functions possible
Loss function contains F model variables from U and F model variables from V
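The prediction and loss above translate directly into code. In this small sketch the factor matrices are plain lists of rows, and the names `predict_rating` and `squared_loss` and the toy values are assumptions for illustration.

```python
def predict_rating(U, V, i, j):
    """Dot product of user i's and item j's factor vectors."""
    return sum(u * v for u, v in zip(U[i], V[j]))


def squared_loss(r_ij, r_hat):
    """Squared loss between true rating and prediction."""
    return (r_ij - r_hat) ** 2


# Toy factor matrices (made up): I = 2 users, J = 3 items, F = 2 factors
U = [[1.0, 0.5],
     [0.0, 2.0]]
V = [[2.0, 1.0],
     [1.0, 1.0],
     [0.5, 0.0]]

r_hat = predict_rating(U, V, 0, 1)   # 1.0*1.0 + 0.5*1.0
loss = squared_loss(4.0, r_hat)
print(r_hat, loss)
```

Only row i of U and row j of V enter the computation, which is why the loss for one sample involves just 2F model variables.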
Matrix factorization

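Gradient of loss function for sample i, j (with $f'$ as the summation index):

  $\frac{\partial L(r_{ij}, \hat{r}_{ij})}{\partial U_{if}} = \frac{\partial}{\partial U_{if}} \left( r_{ij} - \sum_{f'=1}^{F} U_{if'} V_{jf'} \right)^2 = -2 \left( r_{ij} - \sum_{f'=1}^{F} U_{if'} V_{jf'} \right) V_{jf}$

  $\frac{\partial L(r_{ij}, \hat{r}_{ij})}{\partial V_{jf}} = \frac{\partial}{\partial V_{jf}} \left( r_{ij} - \sum_{f'=1}^{F} U_{if'} V_{jf'} \right)^2 = -2 \left( r_{ij} - \sum_{f'=1}^{F} U_{if'} V_{jf'} \right) U_{if}$

for f = 1 to F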



Matrix factorization

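Let's simplify the notation:

  let $e = r_{ij} - \sum_{f=1}^{F} U_{if} V_{jf}$ (the prediction error)

  $\frac{\partial L(r_{ij}, \hat{r}_{ij})}{\partial U_{if}} = \frac{\partial e^2}{\partial U_{if}} = -2 e V_{jf}$

  $\frac{\partial L(r_{ij}, \hat{r}_{ij})}{\partial V_{jf}} = \frac{\partial e^2}{\partial V_{jf}} = -2 e U_{if}$

for f = 1 to F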
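A quick way to gain confidence in these gradient formulas is to compare them with finite differences. The sketch below does that for the derivative of the squared loss with respect to each user factor; the function names, the step size `h`, and the toy values are assumptions for illustration.

```python
def loss(U_i, V_j, r_ij):
    """Squared loss e**2 for one sample, given the two factor vectors."""
    e = r_ij - sum(u * v for u, v in zip(U_i, V_j))
    return e * e


def analytic_gradients(U_i, V_j, r_ij):
    """Gradients of e**2: -2*e*V_jf for each U_if, and -2*e*U_if for each V_jf."""
    e = r_ij - sum(u * v for u, v in zip(U_i, V_j))
    grad_U = [-2.0 * e * v for v in V_j]
    grad_V = [-2.0 * e * u for u in U_i]
    return grad_U, grad_V


def numeric_gradient_U(U_i, V_j, r_ij, f, h=1e-6):
    """Central finite difference of the loss with respect to U_if."""
    plus = U_i[:]
    plus[f] += h
    minus = U_i[:]
    minus[f] -= h
    return (loss(plus, V_j, r_ij) - loss(minus, V_j, r_ij)) / (2 * h)


# Toy values (made up), F = 3
U_i = [0.3, -0.2, 0.5]
V_j = [0.1, 0.4, -0.6]
r_ij = 3.0

grad_U, grad_V = analytic_gradients(U_i, V_j, r_ij)
numeric = [numeric_gradient_U(U_i, V_j, r_ij, f) for f in range(len(U_i))]
print(grad_U)
print(numeric)
```

The two printed lists should agree to several decimal places; the same check works for the item factors by perturbing `V_j` instead.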



Matrix factorization

Set learning rate = $\eta$

Then the factor matrix updates for sample i, j are:
  $U_{if} \leftarrow U_{if} + 2 \eta e V_{jf}$
  $V_{jf} \leftarrow V_{jf} + 2 \eta e U_{if}$

for f = 1 to F



Matrix factorization

SGD for training a matrix factorization:

1. Decide on F = dimension of factors


2. Initialize factor matrices with small random values
3. Choose one sample from training set
4. Calculate loss function for that single sample
5. Calculate gradient from loss function
6. Update 2F model parameters a single step using gradient and learning rate
7. Repeat from 3) until stopping criterion is satisfied
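The seven steps can be put together in a compact training loop. This is a sketch under stated assumptions: the function names, hyperparameter values, the fixed number of passes used as the stopping criterion, and the toy ratings are all chosen for illustration, and no regularization is applied.

```python
import random


def train_mf(samples, n_users, n_items, F=2, learning_rate=0.02, epochs=500, seed=0):
    """SGD matrix factorization on (i, j, r_ij) triples; returns (U, V)."""
    rng = random.Random(seed)
    # Step 2: initialize factor matrices with small random values
    U = [[rng.uniform(-0.1, 0.1) for _ in range(F)] for _ in range(n_users)]
    V = [[rng.uniform(-0.1, 0.1) for _ in range(F)] for _ in range(n_items)]
    samples = list(samples)
    for _ in range(epochs):                  # step 7: fixed number of passes as stopping rule
        rng.shuffle(samples)
        for i, j, r_ij in samples:           # step 3: one sample
            # steps 4-5: prediction error e; loss is e**2, gradients are -2eV and -2eU
            e = r_ij - sum(u * v for u, v in zip(U[i], V[j]))
            for f in range(F):               # step 6: update 2F parameters
                u_old = U[i][f]              # keep old value so both updates use the same e, U, V
                U[i][f] += 2 * learning_rate * e * V[j][f]
                V[j][f] += 2 * learning_rate * e * u_old
    return U, V


def sum_squared_error(samples, U, V):
    return sum((r - sum(u * v for u, v in zip(U[i], V[j]))) ** 2 for i, j, r in samples)


# Toy ratings (made up): 3 users x 2 items, exactly rank 1
toy = [(0, 0, 1.0), (0, 1, 2.0), (1, 0, 2.0), (1, 1, 4.0), (2, 0, 3.0), (2, 1, 6.0)]
U, V = train_mf(toy, n_users=3, n_items=2)
final_sse = sum_squared_error(toy, U, V)
print(final_sse)
```

On real data the stopping criterion would more likely be based on error on held-out ratings than on a fixed number of passes.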



Matrix factorization

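Must use some form of regularization (usually L2):

  $L(r_{ij}, \hat{r}_{ij}) = \left( r_{ij} - \sum_{f=1}^{F} U_{if} V_{jf} \right)^2 + \lambda \sum_{f=1}^{F} U_{if}^2 + \lambda \sum_{f=1}^{F} V_{jf}^2$

Update rules become:

  $U_{if} \leftarrow U_{if} + 2 \eta (e V_{jf} - \lambda U_{if})$
  $V_{jf} \leftarrow V_{jf} + 2 \eta (e U_{if} - \lambda V_{jf})$

for f = 1 to F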
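The effect of the L2 term is easiest to see on a single update. In the sketch below, `learning_rate` and `lam` stand for the learning rate and the regularization strength; those names, the function name, and the toy values are assumptions for illustration. The example is chosen so that the prediction error is zero, which isolates the shrinkage caused by the penalty.

```python
def regularized_update(U_i, V_j, r_ij, learning_rate, lam):
    """One L2-regularized SGD step for sample (i, j); returns new factor vectors."""
    e = r_ij - sum(u * v for u, v in zip(U_i, V_j))
    new_U = [u + 2 * learning_rate * (e * v - lam * u) for u, v in zip(U_i, V_j)]
    new_V = [v + 2 * learning_rate * (e * u - lam * v) for u, v in zip(U_i, V_j)]
    return new_U, new_V


# Toy values (made up): the prediction already equals the rating, so e = 0
U_i = [1.0, 0.0]
V_j = [0.5, 2.0]
new_U, new_V = regularized_update(U_i, V_j, r_ij=0.5, learning_rate=0.1, lam=0.1)
print(new_U, new_V)
```

Even with zero error, every nonzero factor moves toward zero by a fraction 2 * learning_rate * lam of its value, which is how the penalty discourages large factor values.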



Lecture 17: Dimensionality Reduction

Some slides thanks to Xiaoli Fern (CS534, Oregon State Univ., 2011).

Some figures taken from "An Introduction to Statistical Learning, with applications in R" (Springer,
2013) with permission of the authors, G. James, D. Witten, T. Hastie and R. Tibshirani.



Dimensionality reduction

Many modern data domains involve huge numbers of features / dimensions
  Documents: thousands of words, millions of bigrams
  Images: thousands to millions of pixels
  Genomics: thousands of genes, millions of DNA polymorphisms
Why reduce dimensions?

High dimensionality has many costs
  Redundant and irrelevant features degrade performance of some ML algorithms
  Difficulty in interpretation and visualization
  Computation may become infeasible
    what if your algorithm scales as $O(n^3)$?
  Curse of dimensionality
Approaches to dimensionality reduction

Feature selection
Select subset of existing features (without
modification)
Lecture 5 and Project 1
Model regularization
L2 reduces effective dimensionality
L1 reduces actual dimensionality
Combine (map) existing features into smaller
number of new features
Linear combination (projection)
Nonlinear combination
Linear dimensionality reduction

Linearly project n-dimensional data onto a k-dimensional space
  k < n, often k << n
  Example: project space of $10^4$ words into 3 dimensions

There are infinitely many k-dimensional subspaces we can project the data onto.

Which one should we choose?
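To see why the choice of subspace matters, the sketch below projects the same 2-D points onto two different lines (k = 1) and compares how much variance each projection keeps. Using projected variance as the yardstick is only one possible criterion, and the function names and toy points are assumptions for illustration.

```python
import math


def project(points, direction):
    """Coordinates of each point along the unit vector in the given direction."""
    norm = math.sqrt(sum(d * d for d in direction))
    unit = [d / norm for d in direction]
    return [sum(p * u for p, u in zip(point, unit)) for point in points]


def variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)


# Toy points (made up) lying on the line y = x / 2
points = [(-2.0, -1.0), (-1.0, -0.5), (0.0, 0.0), (1.0, 0.5), (2.0, 1.0)]

var_along = variance(project(points, (2.0, 1.0)))     # line through the data
var_across = variance(project(points, (-1.0, 2.0)))   # line perpendicular to it
print(var_along, var_across)
```

Projecting along the data keeps all of its spread, while projecting across it collapses every point to the same coordinate, so the two subspaces are far from equivalent.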


LDA for two classes



Unsupervised dimensionality reduction

Consider data without class labels


Try to find a more compact representation of the
data



Principal component analysis (PCA)

Widely used method for unsupervised, linear dimensionality reduction

GOAL: account for variance of data in as few dimensions as possible (using linear projection)



PCA: conceptual algorithm

Find a line, such that when the data is projected onto that line, it has the maximum variance.



PCA: conceptual algorithm

Find a second line, orthogonal to the first, that has maximum projected variance.



PCA: conceptual algorithm

Repeat until we have k orthogonal lines

The projected position of a point on these lines gives the coordinates in the k-dimensional reduced space.



Steps in principal component analysis

Mean center the data

Compute covariance matrix $\Sigma$

Calculate eigenvalues and eigenvectors of $\Sigma$
  Eigenvector with largest eigenvalue $\lambda_1$ is 1st principal component (PC)
  Eigenvector with kth largest eigenvalue $\lambda_k$ is kth PC
  $\lambda_k / \sum_i \lambda_i$ = proportion of variance captured by kth PC
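The steps above can be sketched with the standard library only. In this illustration the eigenvectors are found by power iteration with deflation, which is a stand-in chosen for brevity rather than the eigen-solver a numerical library would use. The function names and toy data are assumptions, and the sketch assumes the starting vector is not orthogonal to the component being sought.

```python
import math


def mean_center(data):
    n = len(data)
    means = [sum(col) / n for col in zip(*data)]
    return [[x - m for x, m in zip(row, means)] for row in data]


def covariance_matrix(centered):
    n = len(centered)
    d = len(centered[0])
    return [[sum(row[a] * row[b] for row in centered) / (n - 1) for b in range(d)]
            for a in range(d)]


def top_eigenpair(C, iterations=500):
    """Power iteration: largest eigenvalue of C and its unit eigenvector."""
    d = len(C)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(C[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    eigenvalue = sum(v[a] * sum(C[a][b] * v[b] for b in range(d)) for a in range(d))
    return eigenvalue, v


def pca(data, k):
    """Return a list of (eigenvalue, component, proportion of variance) for the top k PCs."""
    C = covariance_matrix(mean_center(data))
    d = len(C)
    total_variance = sum(C[a][a] for a in range(d))   # trace = sum of all eigenvalues
    components = []
    for _ in range(k):
        lam, v = top_eigenpair(C)
        components.append((lam, v, lam / total_variance))
        # deflation: remove the component just found so the next pass finds the next PC
        C = [[C[a][b] - lam * v[a] * v[b] for b in range(d)] for a in range(d)]
    return components


# Toy data (made up): points scattered close to the line y = 2x
data = [[-2.0, -4.1], [-1.0, -1.9], [0.0, 0.1], [1.0, 2.0], [2.0, 3.9]]
components = pca(data, k=2)
for lam, v, share in components:
    print(lam, v, share)
```

For these points the first principal component should point roughly along (1, 2) and capture nearly all of the variance, with the second component orthogonal to it.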
PCA: choosing the dimension k



PCA example: face recognition

A typical image of size 256 x 128 pixels is described by 256 x 128 = 32768 dimensions.
Each face image lies somewhere in this high-dimensional space.
Images of faces are generally similar in overall configuration, thus
  They cannot be randomly distributed in this space.
  We should be able to describe them in a much lower-dimensional space.
PCA for face images: eigenfaces



Face recognition in eigenface space

(Turk and Pentland 1991)



Face image retrieval
