
# Linear Algebra for Machine Learning

Main reference: University of Washington, CSS 581 - Introduction to Machine Learning
Instructor: J Jeffry Howbert

Lecture 2: Math Essentials
- Probability
- Linear algebra
- Linear algebra applications

1) Operations on or between vectors and matrices
2) Coordinate transformations
3) Dimensionality reduction
4) Linear regression
5) Solution of linear systems of equations
6) Many others

## Jeff Howbert Introduction to Machine Learning Winter 2014 #

## Why vectors and matrices?

Vectors and matrices are the most common form of data organization for machine learning:
- rows represent samples (records, items, datapoints)
- columns represent attributes (features, variables)

It is natural to think of each sample as a vector of attributes, and the whole array as a matrix. For example:

| Refund | Marital Status | Taxable Income | Class |
|--------|----------------|----------------|-------|
| No     | Married        | 100K           | No    |
| No     | Single         | 70K            | No    |
| Yes    | Married        | 120K           | No    |
| No     | Divorced       | 95K            | Yes   |
| No     | Married        | 60K            | No    |
| Yes    | Divorced       | 220K           | No    |
| No     | Single         | 85K            | Yes   |
| No     | Married        | 75K            | No    |
| No     | Single         | 90K            | Yes   |


Lecture 8: Regression

Linear Regression
Regression Trees


slide thanks to Greg Shakhnarovich (CS195-5, Brown Univ., 2006)


## Loss function

Suppose target labels come from a set Y:
- Binary classification: Y = {0, 1}
- Regression: Y = ℝ (real numbers)

A loss function maps decisions to costs: L(ŷ, y) defines the penalty for predicting ŷ when the true value is y.

Standard choice for classification: 0/1 loss (same as misclassification error):

$$ L_{0/1}(\hat{y}, y) = \begin{cases} 0 & \text{if } \hat{y} = y \\ 1 & \text{otherwise} \end{cases} $$

Standard choice for regression: squared loss:

$$ L(\hat{y}, y) = (\hat{y} - y)^2 $$
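The two standard losses above are one-liners in code. This is a minimal sketch; the function names are illustrative, not from the lecture.

```python
def zero_one_loss(y_pred, y_true):
    """0/1 loss: 0 if the prediction matches the label, 1 otherwise."""
    return 0 if y_pred == y_true else 1

def squared_loss(y_pred, y_true):
    """Squared loss: penalizes regression errors quadratically."""
    return (y_pred - y_true) ** 2
```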
## Least squares linear fit to data

The most popular estimation method is least squares:
- Determine linear coefficients w that minimize the sum of squared loss (SSL).
- Use standard (multivariate) differential calculus:
  - differentiate SSL with respect to w
  - find the zeros of each partial derivative
  - solve for each wᵢ

In one dimension:

$$ \mathrm{SSL} = \sum_{j=1}^{N} \big( y_j - (w_0 + w_1 x_j) \big)^2, \qquad N = \text{number of samples} $$

$$ w_1 = \frac{\mathrm{cov}[x, y]}{\mathrm{var}[x]}, \qquad w_0 = \bar{y} - w_1 \bar{x}, \qquad \bar{x}, \bar{y} = \text{means of training } x, y $$

$$ \hat{y}_t = w_0 + w_1 x_t \quad \text{for test sample } x_t $$
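The closed-form 1-D solution w₁ = cov[x, y] / var[x], w₀ = ȳ − w₁x̄ can be sketched directly in NumPy (the function name is illustrative):

```python
import numpy as np

def fit_1d_least_squares(x, y):
    """Closed-form least squares fit of y = w0 + w1*x."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    w1 = np.cov(x, y, bias=True)[0, 1] / np.var(x)  # slope = cov/var
    w0 = y.mean() - w1 * x.mean()                   # intercept
    return w0, w1

# Example: points on the line y = 2x + 1 are recovered exactly.
w0, w1 = fit_1d_least_squares([0, 1, 2, 3], [1, 3, 5, 7])
```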
## Least squares linear fit to data

Multiple dimensions: to simplify notation and derivation, add a new feature x₀ = 1 to the feature vector x:

$$ \hat{y} = w_0 \cdot 1 + \sum_{i=1}^{d} w_i x_i = \mathbf{w} \cdot \mathbf{x} $$

Calculate SSL and determine w:

$$ \mathrm{SSL} = \sum_{j=1}^{N} \Big( y_j - \sum_{i=0}^{d} w_i x_{ji} \Big)^2 = (\mathbf{y} - \mathbf{X}\mathbf{w})^T (\mathbf{y} - \mathbf{X}\mathbf{w}) $$

where y = vector of all training responses yⱼ, and X = matrix of all training samples xⱼ.

$$ \mathbf{w} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y} $$

$$ \hat{y}_t = \mathbf{w} \cdot \mathbf{x}_t \quad \text{for test sample } \mathbf{x}_t $$
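A sketch of the multivariate solution w = (XᵀX)⁻¹Xᵀy, with the bias feature x₀ = 1 prepended as on the slide. Using `lstsq` rather than an explicit matrix inverse is a standard numerical-stability choice, not something the slides specify:

```python
import numpy as np

def fit_least_squares(X, y):
    """Least squares fit; prepends the bias feature x0 = 1."""
    X = np.column_stack([np.ones(len(X)), np.asarray(X, float)])
    # lstsq solves the normal equations without forming (X^T X)^-1.
    w, *_ = np.linalg.lstsq(X, np.asarray(y, float), rcond=None)
    return w

def predict(w, x_t):
    """ŷ_t = w · (1, x_t) for a test sample x_t."""
    return w[0] + np.dot(w[1:], x_t)

# Example: recover y = 1 + 2a + 3b from five exact samples.
w = fit_least_squares([[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]],
                      [1, 3, 4, 6, 8])
```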

Lecture 9: Recommendation Systems


Netflix Viewing Recommendations

## Recommender Systems

DOMAIN: some field of activity where users buy, view, consume, or otherwise experience items.

PROCESS:
1. Users provide ratings on items they have experienced.
2. Take all <user, item, rating> data and build a predictive model.
3. For a user who hasn't experienced a particular item, use the model to predict how well they will like it (i.e., predict the rating).
## Roles of Recommender Systems

Help users deal with the paradox of choice.

Allow online sites to:
- Increase the likelihood of sales
- Retain customers by providing a positive search experience

Considered essential in the operation of:
- Online retailing, e.g. Amazon, Netflix, etc.
- Social networking sites
## Amazon.com Product Recommendations

## Social Network Recommendations

Recommendations on essentially every category of interest known to mankind:
- Friends
- Groups
- Activities
- Media (TV shows, movies, music, books)
- News stories

All based on connections in the underlying social network graph, and your expressed likes and dislikes.
## Types of Recommender Systems

Base predictions on either:
- Content-based approach: explicit characteristics of users and items
- Collaborative filtering approach: implicit characteristics based on similarity of a user's preferences to those of other users
## The Netflix Prize Contest

GOAL: use training data to build a recommender system which, when applied to qualifying data, improves the error rate by 10% relative to Netflix's existing system.

PRIZE: first team to 10% wins $1,000,000. Annual Progress Prizes of $50,000 also possible.

PARTICIPATION:
- 51,051 contestants on 41,305 teams from 186 different countries
- 44,014 valid submissions from 5,169 different teams
## The Netflix Prize Data

Netflix released three datasets:
- 480,189 users (anonymous)
- 17,770 movies
- ratings on an integer scale from 1 to 5

Training set: 99,072,112 <user, movie> pairs with ratings
Probe set: 1,408,395 <user, movie> pairs with ratings
Qualifying set: 2,817,131 <user, movie> pairs with no ratings
## Model Building and Submission Process

- The training set (99,072,112 ratings) is used to build the model.
- The probe set (1,408,395 known ratings) is used for tuning and validation.
- The model makes predictions on the qualifying set (ratings unknown), which is split into:
  - a quiz set (1,408,342 pairs): RMSE shown on the public leaderboard
  - a test set (1,408,789 pairs): RMSE kept secret for final scoring
## Why the Netflix Prize Was Hard

[Figure: the 480,189-user × 17,770-movie ratings matrix, shown as a sparse grid of ratings]

- Massive dataset
- Very sparse matrix: only 1.2% occupied
- Extreme variation in number of ratings per user
- Statistical properties of qualifying and probe sets different from training set
## Dealing with Size of the Data

MEMORY:
- 2 GB bare minimum for common algorithms
- 4+ GB required for some algorithms
- need a 64-bit machine with 4+ GB RAM if serious

SPEED:
- Program in languages that compile to fast machine code
- 64-bit processor
- Exploit low-level parallelism in code (SIMD on Intel x86/x64)
## Common Types of Algorithms

- Global effects
- Nearest neighbors
- Matrix factorization
- Restricted Boltzmann machine
- Clustering
- Etc.
## Nearest Neighbors in Action

[Figure: sparse user × movie ratings matrix; a missing rating "?" is predicted from other users' ratings of that movie, weighted by preference similarity]

- Users with identical preferences: strong weight
- Users with similar preferences: moderate weight
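The weighting idea above can be sketched as user-based nearest-neighbor prediction. This is an illustrative implementation, not the contest algorithm: the function name and the similarity choice (cosine over co-rated movies) are assumptions.

```python
import numpy as np

def predict_rating(R, user, movie):
    """R: users × movies array with np.nan for missing ratings."""
    num = den = 0.0
    for other in range(R.shape[0]):
        if other == user or np.isnan(R[other, movie]):
            continue
        # Movies rated by both the target user and this neighbor.
        shared = ~np.isnan(R[user]) & ~np.isnan(R[other])
        if not shared.any():
            continue
        a, b = R[user, shared], R[other, shared]
        sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
        num += sim * R[other, movie]   # similar users weigh more
        den += abs(sim)
    return num / den if den else np.nan

R = np.array([[5.0, 4.0, np.nan],
              [5.0, 4.0, 2.0],
              [1.0, 1.0, 5.0]])
p = predict_rating(R, 0, 2)   # predict user 0's rating of movie 2
```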
## Matrix Factorization in Action

[Figure: the 480,189-user × 17,770-movie ratings matrix is approximated by a reduced-rank singular value decomposition (sort of): a user matrix (users × 5 factors) and a movie matrix (5 factors × movies), each filled with "a bunch of numbers"]
## Matrix Factorization in Action

[Figure: to predict a desired <user, movie> rating, multiply and add the corresponding user and movie factor vectors (dot product)]
## Netflix Prize Progress: Major Milestones

[Figure: RMS error on the quiz set, from a trivial algorithm (~1.05) through Cinematch down toward the Grand Prize (~0.85); "me, starting June 2008"]

| Milestone | Improvement | Date | Winner |
|---|---|---|---|
| 2007 Progress Prize | 8.43% | Oct. 2007 | BellKor |
| 2008 Progress Prize | 9.44% | Oct. 2008 | BellKor in BigChaos |
| Grand Prize | 10.09% (10.00% required) | July 2009 | ??? |

July 26, 18:43 GMT: Contest Over!

## Final Test Scores
## Netflix Prize: What Did I Learn?

- Several new machine learning algorithms
- A lot about optimizing predictive models
  - Regularization
- A lot about optimizing code for speed and memory usage
- Some linear algebra: enough to come up with one original approach that actually worked
- Money and fame make people crazy, in both good ways and bad

COST: about 1000 hours of my free time over 13 months

Lecture 10: Collaborative Filtering


Stochastic (definition):
1. involving a random variable
2. involving chance or probability; probabilistic

## Application to training a machine learning model:

1. Choose one sample from the training set
2. Calculate the loss function for that single sample
3. Calculate the gradient from the loss function
4. Update the model parameters a single step based on the gradient and the learning rate
5. Repeat from 1) until the stopping criterion is satisfied

Typically the entire training set is processed multiple times before stopping. The order in which samples are processed can be fixed or random.
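The stochastic loop above can be sketched for the 1-D linear model y ≈ w₀ + w₁x with squared loss. The function name, learning rate, and epoch count are illustrative choices, not from the lecture:

```python
import random

def sgd_linear(samples, lr=0.01, epochs=500, seed=0):
    samples = list(samples)
    rng = random.Random(seed)
    w0 = w1 = 0.0
    for _ in range(epochs):            # whole set processed many times
        rng.shuffle(samples)           # order can be fixed or random
        for x, y in samples:           # 1. one sample at a time
            e = y - (w0 + w1 * x)      # 2-3. error drives the gradient
            w0 += 2 * lr * e           # 4. single gradient step
            w1 += 2 * lr * e * x
    return w0, w1

# Example: recover y = 2x + 1 from four noiseless samples.
w0, w1 = sgd_linear([(0, 1), (1, 3), (2, 5), (3, 7)])
```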
## Matrix factorization in action

[Figure: the factorization (training process) learns the user-factor and movie-factor matrices ("a bunch of numbers") from the known ratings in the training data]

## Matrix factorization in action

[Figure: to predict a desired <user, movie> rating, multiply and add the corresponding user and movie factor vectors (dot product)]

## Matrix factorization

Notation:
- Number of users = I
- Number of items = J
- Number of factors per user / item = F
- User of interest = i
- Item of interest = j
- Factor index = f
- User matrix U, dimensions = I × F
- Item matrix V, dimensions = J × F

## Matrix factorization

The predicted rating is:

$$ \hat{r}_{ij} = \sum_{f=1}^{F} U_{if} V_{jf} $$

Loss for a prediction where the true rating is rᵢⱼ (using squared loss; other loss functions are possible):

$$ L(\hat{r}_{ij}, r_{ij}) = (\hat{r}_{ij} - r_{ij})^2 = \Big( r_{ij} - \sum_{f=1}^{F} U_{if} V_{jf} \Big)^2 $$

The loss function contains F model variables from U and F model variables from V.
## Matrix factorization

Gradient of the loss function for sample i, j:

$$ \frac{\partial L(\hat{r}_{ij}, r_{ij})}{\partial U_{if}} = \frac{\partial}{\partial U_{if}} \Big( r_{ij} - \sum_{f=1}^{F} U_{if} V_{jf} \Big)^2 = -2 \Big( r_{ij} - \sum_{f=1}^{F} U_{if} V_{jf} \Big) V_{jf} $$

$$ \frac{\partial L(\hat{r}_{ij}, r_{ij})}{\partial V_{jf}} = \frac{\partial}{\partial V_{jf}} \Big( r_{ij} - \sum_{f=1}^{F} U_{if} V_{jf} \Big)^2 = -2 \Big( r_{ij} - \sum_{f=1}^{F} U_{if} V_{jf} \Big) U_{if} $$

for f = 1 to F.

## Matrix factorization

Let's simplify the notation. Let

$$ e = r_{ij} - \sum_{f=1}^{F} U_{if} V_{jf} \quad \text{(the prediction error)} $$

Then:

$$ \frac{\partial L(\hat{r}_{ij}, r_{ij})}{\partial U_{if}} = \frac{\partial e^2}{\partial U_{if}} = -2 e V_{jf} $$

$$ \frac{\partial L(\hat{r}_{ij}, r_{ij})}{\partial V_{jf}} = \frac{\partial e^2}{\partial V_{jf}} = -2 e U_{if} $$

for f = 1 to F.

## Matrix factorization

Set the learning rate = η. Then the gradient-descent factor matrix updates for sample i, j are:

$$ U_{if} \leftarrow U_{if} + 2 \eta e V_{jf} $$

$$ V_{jf} \leftarrow V_{jf} + 2 \eta e U_{if} $$

for f = 1 to F.

## Matrix factorization

1. Decide on F = dimension of factors
2. Initialize factor matrices with small random values
3. Choose one sample from the training set
4. Calculate the loss function for that single sample
5. Calculate the gradient from the loss function
6. Update the 2F model parameters a single step using the learning rate η
7. Repeat from 3) until the stopping criterion is satisfied
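Steps 1-7 can be sketched as a compact training loop (without regularization). F, the learning rate, and the epoch count are illustrative values, and the function name is an assumption:

```python
import numpy as np

def factorize(ratings, I, J, F=2, lr=0.01, epochs=500, seed=0):
    """ratings: list of (i, j, r) training triples."""
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.standard_normal((I, F))   # 2. small random init
    V = 0.1 * rng.standard_normal((J, F))
    for _ in range(epochs):
        for i, j, r in ratings:             # 3. one sample at a time
            e = r - U[i] @ V[j]             # 4-5. prediction error e
            # 6. update 2F parameters; tuple assignment uses old values
            U[i], V[j] = (U[i] + 2 * lr * e * V[j],
                          V[j] + 2 * lr * e * U[i])
    return U, V

# Example: fit a tiny rank-1 ratings matrix [[4, 2], [2, 1]].
ratings = [(0, 0, 4.0), (0, 1, 2.0), (1, 0, 2.0), (1, 1, 1.0)]
U, V = factorize(ratings, 2, 2)
err = sum((r - U[i] @ V[j]) ** 2 for i, j, r in ratings)
```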

## Matrix factorization

Must use some form of regularization (usually L2), with regularization weight λ:

$$ L(\hat{r}_{ij}, r_{ij}) = \Big( r_{ij} - \sum_{f=1}^{F} U_{if} V_{jf} \Big)^2 + \lambda \Big( \sum_{f=1}^{F} U_{if}^2 + \sum_{f=1}^{F} V_{jf}^2 \Big) $$

The update rules become:

$$ U_{if} \leftarrow U_{if} + 2 \eta (e V_{jf} - \lambda U_{if}) $$

$$ V_{jf} \leftarrow V_{jf} + 2 \eta (e U_{if} - \lambda V_{jf}) $$

for f = 1 to F.
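As a sketch, the regularized update for one sample looks like this in NumPy; the names `sgd_step`, `eta`, and `lam` (λ) are illustrative:

```python
import numpy as np

def sgd_step(U, V, i, j, r, eta=0.01, lam=0.02):
    """One L2-regularized SGD update on factor rows U[i], V[j]."""
    e = r - U[i] @ V[j]                  # prediction error
    # Tuple assignment so both updates use the pre-step values.
    U[i], V[j] = (U[i] + 2 * eta * (e * V[j] - lam * U[i]),
                  V[j] + 2 * eta * (e * U[i] - lam * V[j]))
    return e

# One step on a single rating r = 3.0 with all factors at 0.5:
U = np.array([[0.5, 0.5]])
V = np.array([[0.5, 0.5]])
e = sgd_step(U, V, 0, 0, 3.0)
```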

Lecture 17: Dimensionality Reduction

Some slides thanks to Xiaoli Fern (CS534, Oregon State Univ., 2011).

Some figures taken from "An Introduction to Statistical Learning, with applications in R" (Springer, 2013) with permission of the authors, G. James, D. Witten, T. Hastie and R. Tibshirani.


## Dimensionality reduction

Many modern data domains involve huge numbers of features / dimensions:
- Text: words, bigrams
- Genomics: thousands of genes, millions of DNA polymorphisms
## Why reduce dimensions?

- Redundant and irrelevant features degrade performance of some ML algorithms
- Computation may become infeasible: what if your algorithm scales as O(n³)?
- Curse of dimensionality
## Approaches to dimensionality reduction

Feature selection:
- Select a subset of existing features (without modification)
- Lecture 5 and Project 1

Model regularization:
- L2 reduces effective dimensionality
- L1 reduces actual dimensionality

Combine (map) existing features into a smaller number of new features:
- Linear combination (projection)
- Nonlinear combination
## Linear dimensionality reduction

Linearly project n-dimensional data onto a k-dimensional space:
- k < n, often k << n
- Example: project a space of 10⁴ words into 3 dimensions

There are infinitely many k-dimensional subspaces we can project the data onto. Which one should we choose?
LDA for two classes


## Unsupervised dimensionality reduction

Consider data without class labels. Try to find a more compact representation of the data.

## Principal component analysis (PCA)

A widely used method for unsupervised, linear dimensionality reduction.

GOAL: account for the variance of the data in as few dimensions as possible (using linear projection).

## PCA: conceptual algorithm

1. Find a line such that, when the data is projected onto that line, the projection has maximum variance.
2. Find a second line, orthogonal to the first, that has maximum projected variance.
3. Repeat until there are k orthogonal lines. The projected position of a point on these lines gives its coordinates in the k-dimensional reduced space.

## Steps in principal component analysis

- Calculate the covariance matrix Σ of the data
- Calculate the eigenvalues and eigenvectors of Σ
- The eigenvector with the largest eigenvalue λ₁ is the 1st principal component (PC)
- The eigenvector with the kth largest eigenvalue λₖ is the kth PC
- λₖ / Σᵢ λᵢ = proportion of variance captured by the kth PC
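The steps above map directly onto NumPy: center the data, eigendecompose the covariance matrix, and sort the components by eigenvalue. A minimal sketch (the function name is illustrative):

```python
import numpy as np

def pca(X, k):
    Xc = X - X.mean(axis=0)                 # center the data
    cov = np.cov(Xc, rowvar=False)          # covariance matrix Σ
    evals, evecs = np.linalg.eigh(cov)      # eigh: for symmetric matrices
    order = np.argsort(evals)[::-1]         # largest eigenvalue first
    evals, evecs = evals[order], evecs[:, order]
    explained = evals / evals.sum()         # variance proportion per PC
    return Xc @ evecs[:, :k], explained     # projected coords, ratios

# Example: data almost on a line; one PC captures nearly all variance.
X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.1]])
Z, explained = pca(X, 1)
```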
PCA: choosing the dimension k


## PCA example: face recognition

A typical image of size 256 × 128 pixels is described by 256 × 128 = 32768 dimensions. Each face image lies somewhere in this high-dimensional space. Images of faces are generally similar in overall configuration, thus:
- They cannot be randomly distributed in this space.
- We should be able to describe them in a much lower-dimensional space.
PCA for face images: eigenfaces


Face recognition in eigenface space


Face image retrieval