Lecture 4
EXAMPLE: OLD FAITHFUL

[Figure: scatter plot of waiting time (min) between eruptions against the duration of the last eruption]

Can we meaningfully predict the time between eruptions only using the duration of the last eruption?
Refresher
In one dimension the model is $y = w_0 + w_1 x$: $w_1$ is the slope, and $w_0$ is called the intercept, bias, shift, or offset.
HIGHER DIMENSIONS

Two inputs
[Figure: fitted plane over two inputs $x_1$ and $x_2$]
REGRESSION: PROBLEM DEFINITION

Data
Input: $x \in \mathbb{R}^d$ (i.e., measurements, covariates, features, independent variables)
Output: $y \in \mathbb{R}$ (i.e., response, dependent variable)

Goal
Find a function $f : \mathbb{R}^d \to \mathbb{R}$ such that $y \approx f(x; w)$ for the data pair $(x, y)$.
$f(x; w)$ is called a regression function. Its free parameters are $w$.

Model learning
We have a set of training data $(x_1, y_1), \dots, (x_n, y_n)$. We want to use these data to learn a $w$ such that $y_i \approx f(x_i; w)$. But we first need an objective function to tell us what a good value of $w$ is.
Least squares
The least squares objective tells us to pick the $w$ that minimizes the sum of squared errors:
$$w_{\mathrm{LS}} = \arg\min_{w} \sum_{i=1}^{n} \big(y_i - f(x_i; w)\big)^2 \equiv \arg\min_{w} \mathcal{L}.$$
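As a minimal sketch (the function and variable names below are illustrative, not from the lecture), the sum-of-squared-errors objective for a candidate $w$ can be computed directly:

```python
import numpy as np

def squared_error_loss(w, X, y):
    """Sum of squared errors L(w) = sum_i (y_i - x_i^T w)^2.

    X holds the x_i^T as its rows; y holds the responses y_i.
    """
    residuals = y - X @ w
    return residuals @ residuals
```

Minimizing this function over $w$ is exactly the least squares problem above.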
LEAST SQUARES IN PICTURES

Observations:
The vertical length from each point to the fitted surface is that point's error.

2-dimensional problem
Input: (education, seniority) $\in \mathbb{R}^2$.
Output: income $\in \mathbb{R}$.
Thus far
We have data pairs $(x_i, y_i)$ of measurements $x_i \in \mathbb{R}^d$ and a response $y_i \in \mathbb{R}$.
We believe there is a linear relationship between $x_i$ and $y_i$,
$$y_i = w_0 + \sum_{j=1}^{d} x_{ij} w_j + \epsilon_i.$$
To simplify notation, absorb the intercept by prepending a 1 to each input and stack the data row-wise:
$$x_i = \begin{bmatrix} 1 \\ x_{i1} \\ x_{i2} \\ \vdots \\ x_{id} \end{bmatrix}, \qquad X = \begin{bmatrix} 1 & x_{11} & \cdots & x_{1d} \\ 1 & x_{21} & \cdots & x_{2d} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n1} & \cdots & x_{nd} \end{bmatrix} = \begin{bmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_n^T \end{bmatrix}$$
Original least squares objective function: $\mathcal{L} = \sum_{i=1}^{n} \big(y_i - w_0 - \sum_{j=1}^{d} x_{ij} w_j\big)^2$

Using vectors, this can now be written: $\mathcal{L} = \sum_{i=1}^{n} (y_i - x_i^T w)^2$

Solving gives
$$\sum_{i=1}^{n} -2 y_i x_i + 2 x_i x_i^T w = 0 \quad\Rightarrow\quad w_{\mathrm{LS}} = \Big(\sum_{i=1}^{n} x_i x_i^T\Big)^{-1} \sum_{i=1}^{n} y_i x_i.$$
LEAST SQUARES IN MATRIX FORM
$$\nabla_w \mathcal{L} = 2 X^T X w - 2 X^T y = 0 \quad\Rightarrow\quad w_{\mathrm{LS}} = (X^T X)^{-1} X^T y.$$
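A minimal NumPy sketch of this closed-form solution, on made-up one-dimensional data (the values are illustrative only):

```python
import numpy as np

# Toy data: n = 5 points with a single feature (values are made up).
x = np.array([-1.0, 0.0, 1.0, 2.0, 3.0])
y = np.array([0.9, 2.1, 2.9, 4.2, 4.8])

# Prepend a column of ones so w[0] plays the role of the intercept w0.
X = np.column_stack([np.ones_like(x), x])

# Closed-form least squares: solve (X^T X) w = X^T y.
# np.linalg.solve is preferred over forming the inverse explicitly.
w_ls = np.linalg.solve(X.T @ X, X.T @ y)
print(w_ls)  # [w0, w1]
```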
RECALL FROM LINEAR ALGEBRA

Recall: matrix-vector product $\big(X^T y = \sum_{i=1}^{n} y_i x_i\big)$:
$$\begin{bmatrix} | & | & & | \\ x_1 & x_2 & \cdots & x_n \\ | & | & & | \end{bmatrix} \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = y_1 x_1 + y_2 x_2 + \cdots + y_n x_n$$

Recall: matrix-matrix product $\big(X^T X = \sum_{i=1}^{n} x_i x_i^T\big)$:
$$\begin{bmatrix} | & | & & | \\ x_1 & x_2 & \cdots & x_n \\ | & | & & | \end{bmatrix} \begin{bmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_n^T \end{bmatrix} = x_1 x_1^T + \cdots + x_n x_n^T.$$
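These identities are easy to check numerically; a small sketch with random data (the sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 3
X = rng.normal(size=(n, d))  # rows are the x_i^T
y = rng.normal(size=n)

# X^T y as a weighted sum of the vectors x_i.
assert np.allclose(X.T @ y, sum(y[i] * X[i] for i in range(n)))

# X^T X as a sum of outer products x_i x_i^T.
assert np.allclose(X.T @ X, sum(np.outer(X[i], X[i]) for i in range(n)))
```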
LEAST SQUARES LINEAR REGRESSION: KEY EQUATIONS

Making predictions
We use $w_{\mathrm{LS}}$ to make predictions: for a new input $x_0$, predict $y_0 \approx x_0^T w_{\mathrm{LS}}$.

Potential issues
Calculating $w_{\mathrm{LS}} = (X^T X)^{-1} X^T y$ assumes $(X^T X)^{-1}$ exists.
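When $X^T X$ is singular (e.g., linearly dependent columns), the inverse does not exist. A sketch of one common workaround, using NumPy's minimum-norm least squares solver (the data here are contrived so that the columns are dependent):

```python
import numpy as np

# Contrived design matrix with a duplicated column, so X^T X is singular.
X = np.array([[1.0, 2.0, 2.0],
              [1.0, 3.0, 3.0],
              [1.0, 5.0, 5.0]])
y = np.array([1.0, 2.0, 4.0])

# np.linalg.solve(X.T @ X, X.T @ y) would fail here.
# np.linalg.lstsq returns the minimum-norm least squares solution instead.
w, residuals, rank, svals = np.linalg.lstsq(X, y, rcond=None)
print(rank)  # 2 < 3 columns: X^T X is not invertible
```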
BROADENING LINEAR REGRESSION

[Figure: the same data fit first with a line, $y = w_0 + w_1 x$, and then with a cubic, $y = w_0 + w_1 x + w_2 x^2 + w_3 x^3$]
POLYNOMIAL REGRESSION IN $\mathbb{R}$
$$X = \begin{bmatrix} 1 & x_1 & x_1^2 & \cdots & x_1^p \\ 1 & x_2 & x_2^2 & \cdots & x_2^p \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_n & x_n^2 & \cdots & x_n^p \end{bmatrix}$$
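A minimal sketch of building this design matrix and reusing the same closed-form solution (the data values are illustrative):

```python
import numpy as np

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = np.array([9.5, 2.8, 0.2, 1.1, 4.3, 8.9])  # made-up responses
p = 3  # polynomial order

# Columns [1, x, x^2, ..., x^p], matching the matrix above.
X = np.vander(x, N=p + 1, increasing=True)

# Polynomial regression is still linear in w, so the same solution applies.
w_ls = np.linalg.solve(X.T @ X, X.T @ y)
```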
POLYNOMIAL REGRESSION ($M$TH ORDER)

[Figure: $M$th-order polynomial fits of $t$ against $x$ for $M = 0, 1, 3, 9$]
POLYNOMIAL REGRESSION IN TWO DIMENSIONS

For example,
$g_s(x_i) = x_{ij}^2$
$g_s(x_i) = \log x_{ij}$
$g_s(x_i) = I(x_{ij} < a)$
$g_s(x_i) = I(x_{ij} < x_{ij'})$
One caveat is that, as the number of functions increases, we need more data
to avoid overfitting.
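A small sketch of such a feature expansion for two-dimensional inputs (the particular choice of functions $g_s$ below is illustrative, not prescribed by the lecture):

```python
import numpy as np

def feature_map(x, a=1.0):
    """Map a raw input x = (x1, x2) to a vector of features g_s(x)."""
    x1, x2 = x
    return np.array([
        1.0,                # intercept
        x1, x2,             # linear terms
        x1 ** 2, x2 ** 2,   # squared terms
        x1 * x2,            # interaction
        float(x1 < a),      # indicator I(x1 < a)
    ])

# Expanded design matrix: one row of features per raw input.
X_raw = np.array([[0.5, 1.2], [2.0, 0.3], [1.5, 1.5]])
X = np.vstack([feature_map(x) for x in X_raw])
```

The model remains linear in $w$, so $w_{\mathrm{LS}}$ is computed exactly as before; only the design matrix changes.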
GEOMETRY OF LEAST SQUARES REGRESSION
$$y = w_0 + x_1 w_1 + x_2 w_2$$

[Figure: two pictures, (a) and (b), illustrating the geometry of least squares]
There are some key differences between (a) and (b) worth highlighting as you try to develop the corresponding intuitions.

(a) Can be shown for all $n$, but only for $x_i \in \mathbb{R}^2$ (not counting the added 1).

(b) This corresponds to $n = 3$ and one-dimensional data: $X = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ 1 & x_3 \end{bmatrix}$.