
Machine Learning: Business Applications

Francisco Rosales Marticorena, PhD.


frosales@esan.edu.pe

04.04.19 – 30.05.19
ESAN Graduate School of Business
General Course Information

Course: Machine Learning: Business Applications
Academic area: Executive Specialization Program
Year and term: 2019 – I
Instructor: Francisco Rosales Marticorena, PhD.
Email: frosales@esan.edu.pe
Phone: (511) 317-7200 / 444340

1
Course Description

This course presents machine learning methods with emphasis on supervised learning problems of classification and regression.
The course contains sessions on mathematical foundations, methodological development, and applications.
The R software will be used to solve case studies.

2
Course Objectives

Improve the quantitative skills of analysts and managers to interpret the results of methods that learn from data.
Properly use the basic mathematical concepts involved in supervised machine learning methods.
Use the R software and its specialized libraries to develop their own implementations or use third-party ones.

3
Course Schedule

1 Basic Statistics:
Session 1: Introduction
Session 2: R Software
Session 3: Linear Regression
2 Linear Methods:
Session 4: Classification Models
Session 5: Resampling Methods
Session 6: Regularization
Session 7: Dimension Reduction
Session 8: Workshop
3 Non-Linear Methods:
Session 9: Splines
Session 10: GAMs
Sessions 11, 12: Decision Trees
Sessions 13, 14: Support Vector Machines
Session 15: Final Exam

4
Methodology

The instructor's lectures will be complemented with activities carried out by the students in and outside the classroom:

Participate in class.
Read the bibliography indicated in the syllabus.
Do the homework assignments.
Take the scheduled evaluations.

5
Evaluation

Final Grade = six homework assignments (60%) + one final exam (40%).

Homework assignments may be done individually or in pairs.
The final exam is individual.
The final exam is mandatory and will be taken on 30.05.19.

6
References

[EH06] Everitt, B. and T. Hothorn (2006). A Handbook of Statistical Analyses Using R. Chapman & Hall/CRC.
[JO13] James, G., D. Witten, T. Hastie and R. Tibshirani (2013). An Introduction to Statistical Learning with Applications in R. Springer Series in Statistics.
[HT09] Hastie, T., R. Tibshirani and J. Friedman (2009). The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer Series in Statistics.
[RS05] Ramsay, J.O. and B.W. Silverman (2005). Functional Data Analysis. Springer Series in Statistics.
[RO03] Ruppert, D., M. Wand and R. Carroll (2003). Semiparametric Regression. Cambridge Series in Statistical and Probabilistic Mathematics.
[W06] Wood, S. (2006). Generalized Additive Models: An Introduction with R. Chapman & Hall/CRC.

7
Instructor

Education:

PhD. Mathematics and Computer Science. University of Göttingen.
MSc. Applied Mathematics and Statistics. SUNY Stony Brook.
MSc. Mathematics. PUCP.
Bachelor's and Licentiate. Economics. UP.

Experience:

2019: Research Professor. IT. U. ESAN.
2018: Manager. Financial Services Office. EY Perú.
2017: Research Professor. Finance. U. del Pacífico.
2011–2016: Research Associate. IMS. U. Göttingen.
2005–2008: Research Scientist. CGIAR. CIP.

8
Participants

Expected roles: analyst, manager, etc.
Sectors: banking, insurance, regulators, etc.
Languages: Python, R, C++, Matlab, etc.

9
Materials

From a phone or a web browser:

https://github.com/LFRM/Lectures

10
Introduction
Overview

Machine Learning is a toolbox for understanding data using statistics.

Objectives
1 Prediction / to predict something
2 Inference / to explain something
Problems
1 Supervised learning: "input ⇒ output" structure.
2 Unsupervised learning: only "input" structure.
Methods
1 Regression
2 Classification
3 Clustering

11
Basics

The general model follows:

    Y = f(X) + ε,   X = {X_1, ..., X_p},

where

Y is called the response / dependent variable
X are called the features / independent variables / predictors
f is a non-random function
ε is a random error term, independent of X, with zero mean.

In this course: we find f to predict / explain.

12
Basics

Usual Steps:

1 We observe predictors X and response Y.
2 We characterize their relationship

    Y = f(X) + ε.   (1)

3 We estimate f̂ "somehow".
4 We use f̂ on X to make the prediction Ŷ

    Ŷ = f̂(X).   (2)

Focus depends on Goals:

Prediction: we care mostly about Ŷ.
Inference: we care mostly about f̂.

13
Estimation Error

Proposition 1

    E[(Y − Ŷ)²] = E[(f(X) − f̂(X))²] + Var[ε],   (3)

where the first term on the right is the reducible error and the second is the irreducible error.

Proof. Trivial. Direct substitution from (1) and (2). ∎

Interpretation of Proposition 1:

The magnitude of the estimation error has a reducible and an irreducible component.
We cannot reduce the estimation error below Var[ε].

14
Estimation of f : Parametric vs. Non-Parametric

Parametric Methods:

Impose rigid structure on f , e.g. f is linear.


Trade-off: easy to interpret vs. bad accuracy.

Non-Parametric Methods:

Impose flexible structure on f , e.g. f is a piecewise polynomial.


Trade-off: difficult to interpret vs. good accuracy.

15
Estimation of f : Parametric vs. Non-Parametric

Source: [JO13]

16
Estimation of f : Regression vs. Classification

Discrete response ⇒ classification / continuous response ⇒ regression.


Specific methods for regression or classification.
Some methods deal with both, e.g. K-nearest neighbors, boosting.

17
Estimation of f : Assessing Model Accuracy

Definition 1.1 (Mean Squared Error)

The Mean Squared Error (MSE) is defined as

    MSE := Ave{(y_i − f̂(x_i))²}   (4)
         = (1/n) Σ_{i=1}^n (y_i − f̂(x_i))².

We call it "train" MSE if we compute it with training data (X, Y) and "test" MSE if we compute it with test data (X₀, Y₀).

Train MSE can be reduced arbitrarily (overfitting).
Test MSE cannot be reduced arbitrarily.
Test MSE is used for model selection.
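
As an illustration, a minimal R sketch with simulated data (all variable names are made up for the example) computes both quantities:

set.seed(1)
# simulated train and test data (illustrative only)
x_tr <- runif(100); y_tr <- sin(2 * pi * x_tr) + rnorm(100, sd = 0.3)
x_te <- runif(100); y_te <- sin(2 * pi * x_te) + rnorm(100, sd = 0.3)

fit <- lm(y_tr ~ poly(x_tr, 5))             # a flexible polynomial fit
mse_train <- mean((y_tr - fitted(fit))^2)   # train MSE
y_hat_te  <- predict(fit, newdata = data.frame(x_tr = x_te))
mse_test  <- mean((y_te - y_hat_te)^2)      # test MSE, used for model selection
c(mse_train, mse_test)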

18
Estimation of f : Assessing Model Accuracy

Left: data (circles), true function (black), linear fit (orange), spline fit 1 (blue),
spline fit 2 (green). Right: train MSE (gray), test MSE (red).

Source: [JO13]
19
Estimation of f : Bias - Variance Trade-off

Proposition 2

Given x₀, the expected test MSE can be decomposed into the sum of three quantities: the variance of f̂(x₀), the squared bias of f̂(x₀), and the variance of the error term.

Proof. From Proposition 1 we know that

    E[(y₀ − f̂(x₀))²] = E[(f(x₀) − f̂(x₀))²] + Var[ε]
                     = E[({f(x₀) − E[f̂(x₀)]} − {f̂(x₀) − E[f̂(x₀)]})²] + Var[ε].

Thus,

    E[(y₀ − f̂(x₀))²] = E[(f(x₀) − E[f̂(x₀)])²] + E[(f̂(x₀) − E[f̂(x₀)])²] + Var[ε],

where the terms on the right are the squared bias of f̂, the variance of f̂, and the variance of the error, respectively. ∎
20
Estimation of f : Bias - Variance Trade-off

Interpretation of proposition 2:

The reducible error is a mixture of bias and variance.


To minimize the expected test MSE we need a method that
minimizes both Bias and Variance.

Methods:

More flexible methods: ↓ bias and ↑ variance.
Less flexible methods: ↑ bias and ↓ variance.
The adequacy of the method depends on the data.

21
Estimation of f : Classification Setting

Definition 1.2 (Error Rate)

The error rate (ER) is the proportion of mistakes that the classifier makes on a given data set:

    ER := Ave{I(ŷ_i ≠ y_i)}   (5)
        = (1/n) Σ_{i=1}^n I(ŷ_i ≠ y_i).

Definition 1.3 (Bayes Classifier)

The Bayes classifier minimizes (5) by selecting, for each x₀ in the data set, the class j that attains

    max_j P[Y = j | X = x₀].   (6)

22
Estimation of f : Classification Setting

Note that the Bayes classifier:

Is theoretical, i.e. the conditional probability in (6) is unknown.
Provides a lower bound on the ER:

    ER ≥ 1 − E[max_j P[Y = j | X = x₀]].

23
Estimation of f : Classification Setting

Definition 1.4 (K-Nearest Neighbors)

K-Nearest Neighbors (KNN) is a classification method that requires a positive integer K. For a test observation x₀, it identifies the K points nearest to x₀, denoted N₀, and estimates the conditional probability of class j as

    P̂[Y = j | X = x₀] = (1/K) Σ_{i ∈ N₀} I(y_i = j),

i.e. as the fraction of the points in N₀ whose response values equal j.

Note that:

K has a strong effect on the classification obtained by KNN.
Small K: ↓ bias and ↑ variance.
Large K: ↑ bias and ↓ variance.
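
A minimal R sketch of KNN classification, assuming the class package is available; the simulated data and variable names are illustrative only:

library(class)
set.seed(1)
# two-class simulated data (illustrative)
x  <- matrix(rnorm(200 * 2), ncol = 2)
y  <- factor(ifelse(x[, 1] + x[, 2] > 0, 1, 0))
tr <- sample(200, 140)                       # train indices

# KNN prediction on the held-out points for two choices of K
yhat_k1  <- knn(train = x[tr, ], test = x[-tr, ], cl = y[tr], k = 1)
yhat_k15 <- knn(train = x[tr, ], test = x[-tr, ], cl = y[tr], k = 15)
mean(yhat_k1  != y[-tr])                     # test error rate, small K
mean(yhat_k15 != y[-tr])                     # test error rate, large K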

24
Estimation of f : Classification Setting

[Figure: simulated classification example — panels "Train data + True f" and "Test data + KNN fit", y vs. x; plot points omitted.]
25
R Software
The R Project for Statistical Computing

This is an extremely short introduction to R.


Click here to download the software.
For more information look at my lecture notes in Applied Statistics.

For now, let's run R online:

26
Basic Commands

The call for the funcname function is funcname(arg1,arg2),


where arg1 and arg2 are args. of the function.
1 What does c() evaluated at the arguments 1 and 2 do?
2 Does the order of the arguments matter?
We can store values using “<-” or “=”.
> x <- c(1,2)
> y = c(4,-1)
> x + y
[1] 5 1
Let z1 = 1000 and z2 = rep(1000, 3). Compute
1 x + z1
2 x + z2

27
Basic Commands

To ask for help on function funcname use ?funcname


To list all the objects in the workspace use ls().
To delete object obj from the workspace use rm(obj).
To delete every object use rm(list = ls())

28
Basic Commands

To create a matrix use matrix


> x <- matrix(data = 1:4, nrow = 2, ncol = 2)
> x
[,1] [,2]
[1,] 1 3
[2,] 2 4
Compute
1 z <- x ^ 2
2 r <- matrix(1, 2, 4)
3 q <- z %*% r

29
Basic Commands

To create realizations of a normal random variable use rnorm. To
specify the mean, use mean and, to specify the standard deviation,
use sd
> x <- rnorm(10000)
> y <- rnorm(10000, mean = 50, sd = 0.1)
To compute the correlation between two r.v.s use cor.
> z <- x + y
> cor(x, z)
[1] 0.9951
To reproduce the exact same random numbers use set.seed() with
an arbitrary integer argument, e.g. set.seed(123)
> set.seed(123)
> rnorm(5)
[1] -0.56047565 -0.23017749 1.55870831 ...

30
Graphics

There are various functions for plotting. See ?plot


> x = rnorm(100) + 1:100
> y = rnorm(100) + seq(-1, -100, length = 100)
> plot(x, y)
> plot(x, y, xlab = "this is my x-axis",
ylab = "this is my y lab", main = "Plot x vs y")
To save the output use pdf(), or jpeg()
> pdf("myfgure.pdf")
> plot(x, y, color = 2, lwd = 3)
> dev.off()
null device
1

31
Graphics

Let f : R² → R.

To plot the contour of f use contour or image.


> x = 1:10
> y = x
> f = outer( x, y, function (x, y) cos(y) / (1 + x ^ 2) )
> contour(x, y, f)
> contour(x, y, f, nlevels = 45)
> fa = ( f - t(f) ) / 2
> contour( x, y, fa, nlevels = 15)
> image(x, y, fa)
To plot f in three dimensions use persp.
> persp(x, y, fa)
> persp(x, y, fa, theta = 30, phi = 20)
> persp(x, y, fa, theta = 30, phi = 40)

32
Indexing Matrix Data

Extracting part of a data set can be done in different ways:

> A = t(matrix(1:16, 4, 4))


> A
[,1] [,2] [,3] [,4]
[1,] 1 2 3 4
[2,] 5 6 7 8
[3,] 9 10 11 12
[4,] 13 14 15 16

Compute
1 A[2,3]
2 A[1,]
3 A[1:3, c(2,4)]
4 A[,-4]
5 dim(A)
33
Loading Data

To import data use read.table() or read.csv(); and to visualize


the imported data use fix().
> Auto = read.csv("Auto.csv", header = T, sep=" ")
> fix(Auto)
> dim(Auto)
[1] 392 9
To list the variable names in the data set use names().
> names(Auto)
[1] "mpg" "cylinders" "displacement" "horsepower"
[5] "weight" "acceleration" "year" "origin"
[9] "name"
To export data use write.table()

34
Additional Graphical and Numerical Summaries

To access a variable cylinders in data frame Auto, we use


> Auto$cylinders
To avoid using the dollar symbol, we can simply attach the data, so
that the variables in the data frame can be accessed by name.
> attach(Auto)
> plot(cylinders, mpg)
Plot the following:
1 mpg vs. cylinders using red circles
2 mpg vs. cylinders using red circles, with axis labels "mpg" and
"cylinders" resp.

35
Additional Graphical and Numerical Summaries

To plot a histogram use hist()


> hist(mpg)
> hist(mpg, col = 2)
To create a scatterplot matrix use pairs()
> pairs(Auto)
> pairs(~ mpg + displacement + horsepower
+ weight + acceleration, Auto)
To print a summary of a given variable or data frame, use
summary()
> summary(mpg)
> summary(Auto)

36
Linear Regression
Motivation

What?

Supervised learning method for continuous response.

Why?

Well documented starting point.
Recall: not flexible, typically ↑ bias and ↓ variance.
Fancy approaches are extensions/generalizations of linear regression.

37
Simple Linear Regression

Model:

    Y = f(X) + ε,   f(X) = β₀ + β₁X,   (7)

where ε is a centered random noise, uncorrelated with X.

[Figure: two simulated data sets, "Example 1" and "Example 2", showing y vs. x with a linear trend; plot points omitted.]

38
Estimation

Idea: estimate f(X) via

    f̂(X) = β̂₀ + β̂₁X,   (8)

e.g. identifying the β̂₀, β̂₁ that fulfill the least squares criterion.
Least Squares Criterion:

    (β̂₀, β̂₁) = argmin_{β₀,β₁} RSS,   (9)

    RSS := Σ_{i=1}^n e_i²,   e_i := y_i − ŷ_i,   (10)

where RSS denotes the "Residual Sum of Squares" of the model.

39
Estimation

FOC:

    ∂RSS/∂β̂₀ = β̂₀ + β̂₁x̄ − ȳ = 0,   (11)
    ∂RSS/∂β̂₁ = β̂₀ Σ_{i=1}^n x_i + β̂₁ Σ_{i=1}^n x_i² − Σ_{i=1}^n x_i y_i = 0,   (12)

with solution

    β̂₀ = ȳ − β̂₁x̄,   (13)
    β̂₁ = Σ_{i=1}^n (x_i − x̄)(y_i − ȳ) / Σ_{i=1}^n (x_i − x̄)².   (14)

RSS is convex, thus the SOC is fulfilled.
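
In R, the closed-form least squares solution (13)-(14) is what lm() computes; a minimal sketch with simulated data (names are illustrative):

set.seed(1)
x <- rnorm(100)
y <- 2 + 3 * x + rnorm(100)                  # true beta0 = 2, beta1 = 3

fit <- lm(y ~ x)                             # least squares fit
coef(fit)                                    # beta0_hat, beta1_hat

# the same slope by hand, as in (14)
sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)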

40
Example: Sample Mean Estimator

Recall:

Let Z ∼ (µ, σ), where the mean µ and s.d. σ are unknown.
Sample mean estimator: collect n random samples of Z, and estimate µ by µ̂ = (1/n) Σ_{i=1}^n z_i. Repeat the previous step m times with different samples of the same size: {µ̂⁽¹⁾, µ̂⁽²⁾, ..., µ̂⁽ᵐ⁾}.
Unbiasedness: we say µ̂ is an unbiased estimator of µ, since

    (1/m) Σ_{k=1}^m µ̂⁽ᵏ⁾ → µ  as m → ∞.

Variance: defined as

    SE(µ̂)² = σ²/n,

it measures the precision of the sample mean estimator.

41
Bias in Simple Linear Regression

Given some data we estimate β₀ and β₁ by β̂₀ and β̂₁; these estimators are r.v.s that vary with the random sample.
Unbiasedness: if we consider m random samples, we obtain β̂₀⁽ᵏ⁾ and β̂₁⁽ᵏ⁾, for k = 1, ..., m. It can be shown that

    (1/m) Σ_{k=1}^m β̂₀⁽ᵏ⁾ → β₀  and  (1/m) Σ_{k=1}^m β̂₁⁽ᵏ⁾ → β₁  as m → ∞,

thus f̂(X) is an unbiased estimator of f(X), that is

    (1/m) Σ_{k=1}^m f̂(X)⁽ᵏ⁾ → f(X)  as m → ∞,

where f(X) is the "population line" and f̂(X)⁽ᵏ⁾ is a "sample estimate of the line" obtained with the least squares criterion.

42
Variance

Variance: the precision of the estimators follows:

    SE[β̂₀]² = σ² [ 1/n + x̄² / Σ_{i=1}^n (x_i − x̄)² ],   (15)
    SE[β̂₁]² = σ² / Σ_{i=1}^n (x_i − x̄)²,   (16)

where we have used Var[ε_i] = σ² and Cov(ε_i, ε_j) = 0 for i ≠ j.

Note that σ is not known, and it is estimated by

    σ̂ = sqrt( RSS / (n − 2) ),   (17)

sometimes called the "residual standard error" (RSE).

43
Confidence Bands

A confidence band is a range of values that contains the true unknown value of the parameter with a certain probability.
Example: a 95% confidence band says that with 95% probability the band contains the true value of the parameter.
In linear regression, if we add the assumption that ε is normally distributed, the bands for β̂_i look approximately like

    [β̂_i − 2·SE(β̂_i), β̂_i + 2·SE(β̂_i)],   i = 0, 1.   (18)

44
Hypothesis testing

Consider the hypothesis "there is no relation between X and Y":

    H₀: β₁ = 0
    Hₐ: β₁ ≠ 0

We test H₀ by computing the quantity

    t = (β̂₁ − 0) / SE[β̂₁],   (19)

which is distributed t with n − 2 degrees of freedom (t_{n−2} in short), and is called the t-statistic.
We reject H₀ if the t-statistic is "large" ⇔ if the p-value is "small", where "small" means, e.g., less than 0.05.

45
Hypothesis Testing

Remark 3.1
To see that t ∼ t_{n−2} in (19), recall that a r.v. X = Z / sqrt(V/ν) is distributed t_ν if Z ∼ N(0, 1) and V ∼ χ²_ν, where χ²_ν denotes the chi-squared distribution with ν degrees of freedom. The denominator of (19) reads

    SE[β̂₁] = sqrt( (RSS/σ²) / (n − 2) ) · sqrt( σ² / Σ_{i=1}^n (x_i − x̄)² ),   (20)

where RSS/σ² ∼ χ²_{n−2}, while the numerator follows

    (β̂₁ − 0) / sqrt( σ² / Σ_{i=1}^n (x_i − x̄)² ) ∼ N(0, 1).   (21)

Writing (21) over (20) we obtain the desired expression.

46
Model Accuracy

Residual Standard Error (RSE):

    RSE := sqrt( RSS / (n − 2) ) = sqrt( Σ_{i=1}^n e_i² / (n − 2) ) = sqrt( Σ_{i=1}^n (y_i − ŷ_i)² / (n − 2) ).

Problem: it is not clear what a "good" RSE is.

R-Squared (R²):

    R² = (TSS − RSS) / TSS = 1 − RSS/TSS,   (22)
    TSS = Σ_{i=1}^n (y_i − ȳ)².

Interpretation: R² is between 0 and 1. It indicates the proportion of the variability of Y that can be explained by X. Further, it can be shown that R² = Cor(X, Y)², by plugging (13) and (14) into (22).

47
Multiple Linear Regression

Model:

    Y = f(X) + ε,   f(X) = β₀ + β₁X₁ + ··· + β_pX_p,   (23)

where ε is a centered random noise, uncorrelated with X.

48
Multiple Linear Regression

[Figure: two simulated 3D examples — "Data 1 / True Function 1" and "Data 2 / True Function 2" — showing y as a function of x1 and x2; plot points omitted.]

49
Hypothesis Testing

Is there at least one X_j that explains Y?

    H₀: β₁ = β₂ = ··· = β_p = 0
    Hₐ: at least one β_j is non-zero

Compute the F-statistic

    F = [(TSS − RSS)/p] / [RSS/(n − p − 1)].

Recall that the r.v. X = (U₁/d₁)/(U₂/d₂) is distributed F_{d₁,d₂}, for U₁ ∼ χ²_{d₁}, U₂ ∼ χ²_{d₂}. If the p-value is small (e.g. < 0.05) we reject H₀. This means that there is at least one X_j that explains Y.

50
Variable Selection

How do we select a good subset of X_j's out of all predictors?

Idea: try all possible models. Problem: we need to evaluate 2^p models. If p = 30, this is more than a billion models.
Forward selection: start with a model that only uses the intercept to predict Y ("the best model with zero variables") and add one variable to the model at a time. To select "the best model with one variable", evaluate all of them and select the one with the smallest RSS. Repeat the idea to select "the best model with 2, 3, ..., k variables". Repeat until some stopping criterion is reached.
Backward selection: start with a model that has all variables. Remove one variable at a time, selecting the one with the largest p-value. Re-run the model and repeat the process until some stopping criterion is reached.

Note that backward selection cannot be done if p > n. The stopping criterion can be related to some target p-value for the predictors in the model.
51
Model Accuracy

Advantages and disadvantages:

RSS: not scaled.
R-squared: scaled in [0, 1]. In fact it can be shown that R² = Cor(Y, Ŷ)². Problem: it increases with p.

52
Predictors

We can compute ŷ_i = f̂(x_i) as a predictor of

    y_i = f(x_i) + ε,   f(x_i) = β₀ + Σ_{j=1}^p β_j x_{i,j}.

Confidence intervals: compare f̂(X) with f(X), i.e. include only reducible errors.
Prediction intervals: compare f̂(X) with f(X) + ε, i.e. include reducible and irreducible errors.

53
Predictors: Boston Example

[Figure: Boston example — panels "Train data + Estimated f" and "Confidence Interval + Prediction Interval", y vs. x; plot points omitted.]

54
Other Considerations

Qualitative predictors. An m-level predictor that induces m sub-models requires m − 1 indicator variables.
Interaction between predictors. If of the form x_i × x_j, we no longer have a linear model.

55
Potential Problems

The true relation f(X) is non-linear. Look for patterns in the error (e_i) plots, e.g. errors not centered.
Serial correlation between error terms. Look for persistence in the e_i plots.
Non-constant variance of errors. Check the ŷ_i vs. e_i plot.
Outliers. Check the Ŷ vs. Y plot.

56
Potential Problems

High leverage points. For simple linear regression, check

    h_i = 1/n + (x_i − x̄)² / Σ_{j=1}^n (x_j − x̄)²,   h_i ∈ [1/n, 1],   (24)

called the "leverage statistic". h_i ≈ 1 means high leverage at i.

Collinearity. Check the variance inflation factor

    VIF(β̂_j) = 1 / (1 − R²_{X_j|X_−j}),   VIF(β̂_j) ∈ [1, ∞),   (25)

where R²_{X_j|X_−j} is the R-squared of the regression of X_j against all other predictors except X_j.

57
Comparison with KNN

If p is small and the relation is non-linear, KNN is better.
If p is large, KNN has problems due to the "curse of dimensionality".
The plot below shows 10 realizations of x, y, z ∼ U(0, 1) plotted on the line, the square, and the cube. See how the separation between the points increases with the dimensionality.

[Figure: 1D, 2D, and 3D scatterplots of 10 uniform draws illustrating the curse of dimensionality; plot points omitted.]

58
Homework

Due 4.25.19 - 19:30.

1 Chapter 2: Exercises 2, 4, 7, 9 from [JO13].


2 Chapter 3: Exercises 3, 4, 9, 14 from [JO13].

59
Classification Models
Linear Regression

Source: [JO13]
60
Linear Regression

Remark 4.1
Note that linear regression fails for classification problems. Consider Y coded as 0/1 and p = 1. In linear regression we estimate f̂ : R^p → R, meaning that we are mapping onto the whole real line, and not only onto {0, 1}. This means we can predict, e.g., 0.7, which is meaningless.

61
Linear Regression

Source: [JO13]

62
Simple Logistic Regression

Consider the logistic regression model for Y coded as 0/1 and p = 1:

    P(Y = 1 | X) = p(X),   (26)
    p(X) = g ∘ f(X),   f(X) = β₀ + β₁X,   g(x) = e^x / (1 + e^x),

where g is called the logistic function.

Remark 4.2
Note that β₁ is not the marginal contribution of X to P(Y = 1 | X), but the marginal contribution of X to the log-odds

    log( p(X) / (1 − p(X)) ) = β₀ + β₁X,   (27)

where 1 − p(X) = 1 − P(Y = 1 | X) = P(Y = 0 | X).

63
Estimation & Prediction

The estimation is done via maximization of a likelihood function

    (β̂₀, β̂₁) = argmax_{β₀,β₁} ℓ(β₀, β₁),   (28)

    ℓ(β₀, β₁) := { Π_{i: y_i = 1} p(x_i) } × { Π_{j: y_j = 0} (1 − p(x_j)) },   (29)

where p(x_i) = P(y_i = 1 | x_i); the log-likelihood log ℓ(β₀, β₁) is concave, so the maximization is well behaved.

Once the parameters are estimated, the prediction is computed as

    p̂(X) = e^{β̂₀ + β̂₁X} / (1 + e^{β̂₀ + β̂₁X}),

and the classification follows a rule of the form:

    ŷ_i = 1 if p̂(X) > H,  ŷ_i = 0 if p̂(X) ≤ H,

where H is a threshold value, e.g. H = 0.5.
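
A minimal R sketch of this estimation and classification rule via glm(); the data are simulated and the threshold H = 0.5 follows the example above:

set.seed(1)
x <- rnorm(200)
p <- exp(-1 + 2 * x) / (1 + exp(-1 + 2 * x))  # true p(X)
y <- rbinom(200, 1, p)

fit  <- glm(y ~ x, family = binomial)         # ML estimation of beta0, beta1
phat <- predict(fit, type = "response")       # p_hat(X)
yhat <- ifelse(phat > 0.5, 1, 0)              # classification rule with H = 0.5
mean(yhat != y)                               # training error rate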

64
Multiple Logistic Regression

Consider the logistic regression model for Y coded as 0/1 and p > 1:

    log( p(X) / (1 − p(X)) ) = β₀ + β₁X₁ + β₂X₂ + ··· + β_pX_p,   (30)

where

    p(X) = e^{β₀ + β₁X₁ + ··· + β_pX_p} / (1 + e^{β₀ + β₁X₁ + ··· + β_pX_p}).

Estimation and prediction: analogous to the simple logistic model case.
Logistic regression for more than 2 response classes is not used often. For these cases we consider linear discriminant analysis.

65
Bayesian Classifier

Consider the Bayes Theorem for K response classes:

    p_k(X) = π_k f_k(x) / Σ_{ℓ=1}^K π_ℓ f_ℓ(x),   (31)

where:

π_k = P(Y = k): prior probability that a randomly chosen observation comes from the k-th class.
f_k(x) = P(X = x | Y = k): density function of X for an observation that comes from the k-th class.
p_k(X) = P(Y = k | X = x): posterior probability that a randomly chosen observation comes from the k-th class.

Idea: π_k is easy, f_k(x) is difficult. With a reasonable f̂_k(x), we can approximate the Bayes classifier (the classifier with the smallest error rate).

66
Bayesian Classifier (example for p=1)

Let K = 2, with π₁ = π₂, and f_k(x) ∼ N(µ_k, σ), thus:

    f_k(x) = (1/(√(2π)σ)) exp{ −(x − µ_k)² / (2σ²) }.   (32)

As a consequence of (31), the posterior distribution follows

    p_k(X) = π_k (1/(√(2π)σ)) exp{ −(x − µ_k)²/(2σ²) } / Σ_{ℓ=1}^K π_ℓ (1/(√(2π)σ)) exp{ −(x − µ_ℓ)²/(2σ²) },   (33)

whose log, up to terms that do not depend on k, equals

    (µ_k/σ²) x − µ_k²/(2σ²) + log(π_k),

thus the Bayes classifier selects the class k that maximizes

    δ_k(x) = (µ_k/σ²) x − µ_k²/(2σ²) + log(π_k).   (34)

67
Bayesian Classifier (example for p=1)

Using (34), and since π₁ = π₂, class 1 is selected if

    δ₁(X) > δ₂(X)
    (µ₁/σ²) X − µ₁²/(2σ²) > (µ₂/σ²) X − µ₂²/(2σ²)
    X > (µ₁ + µ₂)/2   (assuming µ₁ > µ₂),

and class 2 otherwise.

68
Linear Discriminant Analysis

Let p = 1, and f_k(x) ∼ N(µ_k, σ), thus:

LDA uses estimators of σ, µ_k and π_k, and plugs them into (34) as

    δ̂_k(x) = (µ̂_k/σ̂²) x − µ̂_k²/(2σ̂²) + log(π̂_k).   (35)

In particular, the required sample estimates follow

    π̂_k = n_k / n,   (36)
    µ̂_k = (1/n_k) Σ_{i: y_i = k} x_i,   (37)
    σ̂² = (1/(n − K)) Σ_{k=1}^K Σ_{i: y_i = k} (x_i − µ̂_k)².   (38)

LDA is "linear" since (35) is a linear function of x.
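
A minimal R sketch of LDA, assuming the MASS package is available; the data frame and variable names are illustrative only:

library(MASS)
set.seed(1)
# two classes with different means and a common sd (the LDA setting)
x  <- c(rnorm(100, mean = -1), rnorm(100, mean = 1))
y  <- factor(rep(c(0, 1), each = 100))
df <- data.frame(x = x, y = y)

fit  <- lda(y ~ x, data = df)   # estimates pi_k, mu_k and the common variance
fit$prior                       # pi_hat_k
fit$means                       # mu_hat_k
yhat <- predict(fit)$class      # classification with the plug-in rule (35)
table(yhat, df$y)               # confusion table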

69
Linear Discriminant Analysis

Source: [JO13]

70
Linear Discriminant Analysis

Let p ≥ 1, and f_k(x) ∼ N(µ_k, Σ), thus:

    f_k(x) = (1 / ((2π)^{p/2} |Σ|^{1/2})) exp{ −(1/2)(x − µ_k)ᵀ Σ⁻¹ (x − µ_k) },   (39)

where the class means are µ_k and the common covariance is Σ. Analogously to the p = 1 case,

We are interested in selecting the k that maximizes

    δ_k(x) = xᵀ Σ⁻¹ µ_k − (1/2) µ_kᵀ Σ⁻¹ µ_k + log(π_k).   (40)

We face a decision boundary between classes k and ℓ given by

    xᵀ Σ⁻¹ µ_k − (1/2) µ_kᵀ Σ⁻¹ µ_k + log(π_k) = xᵀ Σ⁻¹ µ_ℓ − (1/2) µ_ℓᵀ Σ⁻¹ µ_ℓ + log(π_ℓ).
2 2

71
Linear Discriminant Analysis

Source: [JO13] 72
Assessment of Binary Classifiers

Confusion matrix:

                        Condition positive               Condition negative
Predicted positive      True positive                    False positive (Type I error)
Predicted negative      False negative (Type II error)   True negative

where:
True positive rate = (sum of true positives)/(sum of condition positives)
False positive rate = (sum of false positives)/(sum of condition negatives)

73
Assessment of Binary Classifiers

The confusion matrix:

Shows what kind of mistakes we are making, i.e. predicting 0 when the truth is 1, and vice versa.
Varies according to the selected threshold level. In some cases one type of error could be preferable to the other.

74
Assessment of Binary Classifiers

Source: [JO13]
75
Assessment of Binary Classifiers

To evaluate a model in terms of all possible threshold levels we use the ROC curve. A measure of performance of the model is the AUC ∈ [0, 1], i.e. the area under the curve.
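
A sketch of an ROC curve and its AUC, assuming the pROC package is installed (the package choice is an assumption, not part of the course material); data and variable names are illustrative:

library(pROC)   # assumed available; any ROC package would do
set.seed(1)
x <- rnorm(300)
y <- rbinom(300, 1, plogis(-0.5 + 1.5 * x))     # simulated binary response
fit  <- glm(y ~ x, family = binomial)
phat <- predict(fit, type = "response")

roc_obj <- roc(response = y, predictor = phat)  # ROC over all thresholds
auc(roc_obj)                                    # area under the curve
plot(roc_obj)                                   # ROC curve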

Source: [JO13]
76
Quadratic Discriminant Analysis

Same as LDA, but each class has its own covariance matrix.
Thus, we are interested in finding the k that maximizes

    δ_k(x) = −(1/2)(x − µ_k)ᵀ Σ_k⁻¹ (x − µ_k) − (1/2) log|Σ_k| + log(π_k)
           = −(1/2) xᵀ Σ_k⁻¹ x + µ_kᵀ Σ_k⁻¹ x − (1/2) µ_kᵀ Σ_k⁻¹ µ_k − (1/2) log|Σ_k| + log(π_k),   (41)

where the first term is a quadratic form and thus the decision boundaries have curvature.
Equation (41) means that we need to estimate Σ_k for each k.

77
Quadratic Discriminant Analysis

Remark 4.3
QDA is more flexible than LDA. Intuitively this means that when
compared to each other:

LDA: ↑ Bias, ↓ Variance
QDA: ↓ Bias, ↑ Variance

78
Linear Discriminant Analysis

Source: [JO13]
79
Comparison of Classification Methods

Logistic regression vs. LDA (K = 2, p = 1)

In LDA, note that since p₂(x) = 1 − p₁(x),

    log( p₁(x) / (1 − p₁(x)) ) = [log(π₁) − log(π₂) + (µ₂² − µ₁²)/(2σ²)] + [(µ₁ − µ₂)/σ²] x
                               = c₀ + c₁ x,

which is clearly linear in x, just as in (27).

Difference in fitting procedures: in logistic regression we use ML, and in LDA we use the plug-in estimators π̂_k, µ̂_k and σ̂.

80
Comparison of Classification Methods

KNN vs Logistic regression / LDA

KNN is completely non-parametric.


KNN does not deliver a specific form for the decision boundary, e.g.
linear or quadratic.
KNN does not give a list of coefficients, or statistics.

Remark 4.4
KNN is more flexible than QDA. Intuitively this means that when
compared to each other:

QDA: ↑ Bias, ↓ Variance
KNN: ↓ Bias, ↑ Variance

81
Resampling Methods
Overview

Problem: want to select the model with the best test error
performance, but usually do not have available test sets.
Idea: create artificial test sets by sampling.
Limitation: sampling a repeated number of times demands some computational power.
Sampling methods:
1 Cross validation: Validation set, LOOCV, and k-fold
2 Bootstrap

82
Cross Validation: Validation Set

Main idea: decompose

    data set = training set + validation set,

and fit the model using the training set (70% of obs.). Then compute the MSE using the validation set (30% of obs.). Select the model with the smallest MSE.
Drawbacks:
1 Test MSEs are too variable because samples can be very dissimilar.
2 Test MSEs are too high because we are only using part of the data.

83
Cross Validation: Validation Set

Source: [JO13]

84
Cross Validation: LOOCV

Main idea: decompose

    data set = training set + one validation point (x_i, y_i),

fit with the training set and evaluate MSE_i on the validation point. Do this in a loop, selecting each point in the data set as the validation point, and compute the performance measure

    CV(n) = (1/n) Σ_{i=1}^n MSE_i,   MSE_i = (y_i − ŷ_i^{(−i)})².   (42)

Advantages (wrt the validation set approach):
1 ↓ Variance: small sample variability.
2 ↓ Bias: uses almost all the data.
Disadvantage: n fits required. Can be very slow.

85
Cross Validation: LOOCV

Remark 5.1
Consider simple linear regression. We stress that ŷ_i^{(−i)} = β̂₀^{(−i)} + β̂₁^{(−i)} x_i in expression (42) is computed using

    β̂₀^{(−i)} = (Σ_{j≠i} y_j)/(n − 1) − β̂₁^{(−i)} (Σ_{j≠i} x_j)/(n − 1),
    β̂₁^{(−i)} = Σ_{j≠i} (x_j − x̄^{(−i)})(y_j − ȳ^{(−i)}) / Σ_{j≠i} (x_j − x̄^{(−i)})²,

where x̄^{(−i)} = (Σ_{j≠i} x_j)/(n − 1) and ȳ^{(−i)} = (Σ_{j≠i} y_j)/(n − 1).

A shortcut to compute the expression in (42) is

    CV(n) = (1/n) Σ_{i=1}^n ( (y_i − ŷ_i)/(1 − h_i) )²,   h_i = 1/n + (x_i − x̄)² / Σ_{j=1}^n (x_j − x̄)²,   (43)

which is much easier to evaluate.


86
Cross Validation: LOOCV

Source: [JO13]

87
Cross Validation: k-Fold

Main idea: decompose

    data set = training set + one validation "block",

fit with the training set and evaluate MSE_i on the validation block. Do this in a loop, selecting each of the k disjoint blocks (e.g. k = 5, 10 are common choices) as the validation block, and compute

    CV(k) = (1/k) Σ_{i=1}^k MSE_i,   MSE_i = (k/n) Σ_{j ∈ I_i} (y_j − ŷ_j^{(−I_i)})²,   (44)

where I_i is the collection of indices in block i.

Properties (opposite of LOOCV):
1 It is faster.
2 It is not better: ↑ variance, ↑ bias.
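
A minimal R sketch of k-fold CV, assuming the boot package is available; the data are simulated and k = 10 is one of the common choices mentioned above:

library(boot)
set.seed(1)
df <- data.frame(x = runif(200))
df$y <- sin(2 * pi * df$x) + rnorm(200, sd = 0.3)

cv_err <- sapply(1:5, function(d) {
  fit <- glm(y ~ poly(x, d), data = df)       # gaussian glm = least squares fit
  cv.glm(df, fit, K = 10)$delta[1]            # 10-fold CV estimate of the test MSE
})
which.min(cv_err)                             # polynomial degree selected by CV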

88
Cross Validation: k-Fold

Source: [JO13]

89
Cross-Validation on Classification Problems

Remark 5.2
Instead of the MSE, we use the ER. We write it once again here for clarity, for the case of n observations:

    ER = (1/n) Σ_{i=1}^n I(y_i ≠ ŷ_i),

thus, for example, the LOOCV for classification problems follows

    CV(n) = (1/n) Σ_{i=1}^n ER_i,   ER_i = I(y_i ≠ ŷ_i^{(−i)}).

90
Bootstrap

Idea: sample the data with replacement to obtain a data set of the same size. Used to quantify the uncertainty of a given estimator.

Example 5.1
Consider the problem of selecting the optimal investment allocation α* when choosing between assets X and Y, in such a way that the variance of the portfolio return αX + (1 − α)Y is minimized. Assume the sample estimates σ̂²_X, σ̂²_Y, σ̂_XY are available.
Solution. It is easy to see that

    α̂* = argmin_α Var(αX + (1 − α)Y)
        = (σ̂²_Y − σ̂_XY) / (σ̂²_X + σ̂²_Y − 2σ̂_XY).

To quantify the uncertainty of this estimator, we can simply use the bootstrap.
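
A minimal R sketch of the bootstrap for Example 5.1, assuming the boot package is available; the returns in df are simulated for illustration:

library(boot)
set.seed(1)
df <- data.frame(X = rnorm(100), Y = rnorm(100, sd = 1.5))   # simulated returns

alpha_fn <- function(data, index) {
  X <- data$X[index]; Y <- data$Y[index]
  (var(Y) - cov(X, Y)) / (var(X) + var(Y) - 2 * cov(X, Y))   # alpha_hat on a resample
}

boot(df, alpha_fn, R = 1000)   # bootstrap SE of alpha_hat over 1000 resamples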
91
Bootstrap

Source: [JO13] 92
Homework

Due 5.2.19 - 19:30.

1 Chapter 4: Exercises 6, 10 from [JO13].


2 Chapter 5: Exercises 3, 8 from [JO13].

93
Model Selection
Subset Selection

Recall:

    y_i = β₀ + β₁x_{i,1} + β₂x_{i,2} + ··· + β_px_{i,p} + ε_i,   i = 1, 2, ..., n,

where p < n, and Cov(ε_i, ε_j) = 0, i ≠ j.

Problem: if p ↑, the estimate f̂(x₀) has a large variance.
Idea: do variable selection systematically, s.t. p ↓.

Remark 6.1
Note that the total number of possible models, summed over all model sizes, is

    Σ_{i=0}^p (p choose i) = (p choose 0) + (p choose 1) + ··· + (p choose p) = 2^p,

which makes it unfeasible to test and compare all models in practice.

94
Best Subset Selection

Algorithm 1 (Best Subset Selection)

1 Denote the null model as M₀.
2 For k = 1, ..., p do
   1 Fit all (p choose k) models containing k predictors.
   2 Pick the model with the smallest RSS (or largest R²) and call it M_k.
3 Choose the optimal model among M₀, ..., M_p.

Remark 6.2 (Best Subset Selection in Logistic Regression)

Instead of using RSS in the second sub-step of step 2, consider the "deviance", defined as

    −2 × L(β̂₀, ..., β̂_p),   (45)

where L(·) is the corresponding log-likelihood function.
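
A minimal R sketch of Algorithm 1 using the leaps package (a package choice assumed here, not prescribed by the course); data are simulated:

library(leaps)
set.seed(1)
x <- matrix(rnorm(100 * 8), ncol = 8)
y <- 1 + 2 * x[, 1] - 3 * x[, 4] + rnorm(100)
df <- data.frame(y = y, x)

fit  <- regsubsets(y ~ ., data = df, nvmax = 8)   # best model M_k for each size k
summ <- summary(fit)
summ$rss                          # smallest RSS within each model size
which.min(summ$bic)               # one way to choose among M_0, ..., M_p (step 3)
coef(fit, which.min(summ$bic))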

95
Forward Stepwise Selection

Algorithm 2 (Forward Stepwise Selection)

1 Denote the null model as M₀.
2 For k = 0, ..., p − 1 do
   1 Fit all p − k models that add one predictor to M_k.
   2 Pick the model with the smallest RSS (or largest R²) and call it M_{k+1}.
3 Choose the optimal model among M₀, ..., M_p.

Remark 6.3
Note that the number of models being tested in Algorithm 2 is

    1 + p + (p − 1) + ··· + 1 = 1 + p(p + 1)/2 < 2^p,

i.e. much smaller than that of Algorithm 1.

96
Backward Stepwise Selection

Algorithm 3 (Backward Stepwise Selection)

1 Denote the full model as M_p.
2 For k = p, p − 1, ..., 1 do
   1 Fit all k models that remove one predictor from M_k.
   2 Pick the model with the smallest RSS (or largest R²) and call it M_{k−1}.
3 Choose the optimal model among M₀, ..., M_p.

Remark 6.4
1 The number of models tested in Algorithms 3 and 2 is the same.
2 The best models resulting from Algorithms 1, 2 and 3 need not coincide.

97
Choosing the Optimal Model

Two ways:

1 Indirectly estimate the test error by making an adjustment to the


training error.
1 Mallow’s Cp
2 AIC
3 BIC
4 Adjusted R 2
2 Directly estimate the test error by
1 Validation set approach
2 Cross validation approach

98
Cp , AIC, BIC, Adjusted R 2

Main idea: provide selection criteria after controlling for the model size d.

1 Mallow's Cp (the smaller, the better)

    Cp = (1/n) × (RSS + 2dσ̂²)   (46)

2 Akaike Information Criterion (the smaller, the better)

    AIC = (1/(nσ̂²)) × (RSS + 2dσ̂²)   (47)

3 Bayesian Information Criterion (the smaller, the better)

    BIC = (1/(nσ̂²)) × (RSS + log(n)dσ̂²)   (48)

4 Adjusted R² (the larger, the better)

    Adj. R² = 1 − [RSS/(n − d − 1)] / [TSS/(n − 1)]   (49)

99
Cp , AIC, BIC, Adjusted R 2

Remark 6.5

1 The computation of σ̂² in all the alternatives presented before is done for the full model, that is, considering all p predictors.
2 The direct estimation of the test error is less common for variable selection. It can be done if σ̂² is hard to estimate, or if d (the degrees of freedom of the model) is not easily computed.

100
Regularization
Regularization Methods

Idea: Don’t do variable selection. Instead run the model with all
variables, but shrink the betas so that:
§
Ò Biasrfˆpx0 qs, đ Varrfˆpx0 qs,
§

§
where the big arrow đ means that we expect the benefits from reducing
§
the variance will compensate for the increase in bias.

Alternatives:

1 Ridge Regression
2 LASSO

101
Ridge Regression

RR minimizes a loss function given by:

    Σ_{i=1}^n ( y_i − β₀ − Σ_{j=1}^p β_j x_{i,j} )² + λ Σ_{k=1}^p β_k²,   λ > 0,   (50)

where the first term is the least squares loss and the second the RR penalty, and:

The shrinkage penalty reduces the value of the estimated β̂_j's.
For a tuning parameter λ → 0 the penalty has no effect, and for λ → ∞ the penalty has all the control.

102
Ridge Regression

Remark 7.1
Note that each β̂_j depends on the scaling of all predictors x₁, ..., x_p. The usual recommendation is then to transform the variables x_j to

    x̃_{i,j} = x_{i,j} / sqrt( Σ_{i=1}^n (x_{i,j} − x̄_j)² ).   (51)

Remark 7.2
Note that RR performs better than LS if the gain from variance reduction surpasses the loss from the bias increase. This is the case when the LS variance is particularly high, which happens when p → n.

103
LASSO (Least Absolute Shrinkage and Selection Operator)

LASSO minimizes a loss function given by

    Σ_{i=1}^n ( y_i − β₀ − Σ_{j=1}^p β_j x_{i,j} )² + λ Σ_{k=1}^p |β_k|,   λ > 0,   (52)

where the first term is the least squares loss and the second the LASSO penalty.

This is very similar to RR, but LASSO

Penalizes a different norm of β.
Can set some β̂_j's to zero, thus variable selection is automatic.

104
LASSO

Remark 7.3
Recall that the ℓ_q-norm of a vector x ∈ R^n is defined as

    ||x||_q = ( Σ_{i=1}^n |x_i|^q )^{1/q},   (53)

thus RR and LASSO both penalize a norm of the vector β ∈ R^p, but RR penalizes its ℓ₂ norm while LASSO penalizes its ℓ₁ norm.

105
LASSO

Unit spheres in R²:

[Figure: unit balls in R² for the ℓ2, ℓ1, ℓ∞ and general ℓp norms; plots omitted.]

106
LASSO

Remark 7.4
Note that finding β̂ in LASSO can be written equivalently as

    β̂₁, ..., β̂_p = argmin_{β₁,...,β_p} Σ_{i=1}^n ( y_i − β₀ − Σ_{j=1}^p β_j x_{i,j} )² + λ Σ_{k=1}^p |β_k|,

or as

    β̂₁, ..., β̂_p = argmin_{β₁,...,β_p} { Σ_{i=1}^n ( y_i − β₀ − Σ_{j=1}^p β_j x_{i,j} )² :  Σ_{k=1}^p |β_k| ≤ s }.

Since the RSS is a convex function, its contour sets in R^p are ellipsoids; the constraint region creates the possibility of a corner solution, which means that some β̂_j's are zero.
107
Source: [JO13]

108
Selecting the Tuning Parameters

In RR and LASSO, the tuning parameter λ has been considered as given. In practice this parameter needs to be estimated from the data by cross-validation, e.g. LOOCV or k-fold CV; a popular choice is LOOCV. A sketch is given below.
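
A minimal R sketch of RR and LASSO with λ chosen by cross-validation, assuming the glmnet package is available (alpha = 0 gives ridge, alpha = 1 gives LASSO); data are simulated:

library(glmnet)
set.seed(1)
x <- matrix(rnorm(100 * 20), ncol = 20)
y <- 1 + 2 * x[, 1] - 3 * x[, 2] + rnorm(100)

ridge <- glmnet(x, y, alpha = 0)          # RR path over a grid of lambdas
lasso <- glmnet(x, y, alpha = 1)          # LASSO path

cv_lasso <- cv.glmnet(x, y, alpha = 1)    # k-fold CV (10-fold by default)
cv_lasso$lambda.min                       # lambda with the smallest CV error
coef(cv_lasso, s = "lambda.min")          # note: some coefficients are exactly zero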

109
Dimension Reduction
Overview

Idea: to reduce the variance, don't use X₁, ..., X_p directly. Instead use Z₁, ..., Z_M, M < p, such that Z_m ⊥ Z_r for m ≠ r, where

    Z_m ∈ span{X₁, ..., X_p},   m = 1, ..., M.

Here we discuss:

1 Principal Component Analysis (PCA)


2 Partial Least Squares (PLS)

110
Overview

Remark 8.1
Note that the change of variables from {X_j}_{j=1}^p to {Z_m}_{m=1}^M can in itself reduce the variance of the overall estimator f̂(x₀). In particular, consider

    y_i = θ₀ + Σ_{m=1}^M θ_m z_{i,m} + ε_i,   θ₀, ..., θ_M ∈ R,
    z_{i,m} = Σ_{j=1}^p φ_{j,m} x_{i,j},   φ_{1,m}, ..., φ_{p,m} ∈ R.

Rearranging the terms we arrive at

    y_i = θ₀ + Σ_{j=1}^p ( Σ_{m=1}^M θ_m φ_{j,m} ) x_{i,j} + ε_i,

where the term in parentheses plays the role of β_j, i.e. the β_j's are constrained and thus vary less.


111
Principal Component Analysis

First Principal Component:

    φ̂_{1,1}, ..., φ̂_{p,1} = argmax_{φ_{1,1},...,φ_{p,1}} { Var[ Σ_{j=1}^p φ_{j,1}(x_j − x̄_j) ] :  Σ_{j=1}^p φ_{j,1}² = 1 },

    z₁ = Σ_{j=1}^p φ̂_{j,1}(x_j − x̄_j),

where φ̂_{1,1}, ..., φ̂_{p,1} are the "loadings" of z₁, and the z_{i,1} are its "scores".

Remark 8.2
The first principal component is the line that minimizes the sum of the squared perpendicular distances between each point and the line.

112
Principal Component Analysis

Second Principal Component:

    φ̂_{1,2}, ..., φ̂_{p,2} = argmax_{φ_{1,2},...,φ_{p,2}} { Var[ Σ_{j=1}^p φ_{j,2}(x_j − x̄_j) ] :  Cov(z₂, z₁) = 0,  Σ_{j=1}^p φ_{j,2}² = 1 },

    z₂ = Σ_{j=1}^p φ̂_{j,2}(x_j − x̄_j),

and we can proceed in the same manner up to the p-th PC.

113
Principal Component Regression

Consider LS using the PCs:

    y_i = θ₀ + Σ_{m=1}^M θ_m z_{i,m} + ε_i,

where ε_i is the noise under the usual assumptions, and M can be selected via cross-validation.

Remark 8.3
PCR assumes the directions in which X₁, ..., X_p show the most variation are the directions associated with Y.

Remark 8.4
Note that PCR uses all features in each component, thus it is a dimension reduction method, but not a feature selection method.

114
Partial Least Squares

The computation is initialized by setting e⁽¹⁾ = y.

For k = 1, ..., M do:

1 Regress e⁽ᵏ⁾ on x_j, j = 1, ..., p, and use the coefficients φ̂_{j,k} to compute

    z_k = Σ_{j=1}^p φ̂_{j,k} x_j.

2 Regress e⁽ᵏ⁾ on z_k and save the residuals e⁽ᵏ⁺¹⁾.

The selection of M can also be done via cross-validation; see the sketch below.

Remark 8.5
PLS is a supervised version of PCR, in that the information in Y is considered when computing the z_k.
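
A minimal R sketch of PCR and PLS, assuming the pls package is available; M is chosen by cross-validation as described above and the data are simulated:

library(pls)
set.seed(1)
x <- matrix(rnorm(100 * 10), ncol = 10)
y <- drop(1 + x %*% rnorm(10, sd = 0.5) + rnorm(100))
df <- data.frame(y = y, x)

pcr_fit <- pcr(y ~ ., data = df, scale = TRUE, validation = "CV")   # unsupervised components
pls_fit <- plsr(y ~ ., data = df, scale = TRUE, validation = "CV")  # supervised components
validationplot(pcr_fit, val.type = "MSEP")   # pick M with the smallest CV error
validationplot(pls_fit, val.type = "MSEP")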

115
Homework

Due 5.9.19 - 19:30.

1 Chapter 6: Exercises 2, 3, 8 (a,b,c,d), 11(a,b,c) from [JO13].

116
Lab: Forecasting Soccer Matches
Machine Learning, Big Data and other Fancy Words

Worldwide hype

Effective forecast and classification methods.


Tech: Google, Spotify, Waze, etc.
Finance: high frequency asset pricing, insurance policies

and in Peru...

RIMAC data science challenge¹: predict car insurance default for good
customers. The three best forecasts get a data scientist job!
Fintech: LatinFintech, Bitinka, Culqui, Kambista, etc.

1 Hackathon organized by Hackspace to recruit data scientists.


117
Statistical Oracles in Soccer

Popular Events

Google² → FIFA 2014 World Cup → 14/16 (≈ 88%) accuracy
Goldman Sachs³ → Euro 2016 → got bookmakers' odds
Mister Chip⁴ → Argentina–Peru → 53% chance of going to Russia 2018.

Google⁵ says it can beat "Paul the Octopus".

Beating "Paul the Octopus" is not trivial. Actually, it means getting an accuracy above 11/13 (≈ 85%).

2 https://github.com/GoogleCloudPlatform/ipython-soccer-predictions
3 http://www.goldmansachs.com/our-thinking/macroeconomic-insights/

euro-cup-2016/
4 From twitter account: @2010misterchip.
5 Claim made by J. Tigani (Google I/O) at a big data conference (Strata) in 2014.

118
Objective of this Application

Replicate Google’s idea:

1 Got the code → GitHub repository
2 Bought some data → OPTA⁶
3 Got the momentum → Peru vs. Argentina

Forecast Argentina–Peru:

1 Update data: Copa América Centenario & Qualifiers (so far).


2 Tweak parameters.
3 Larger effort: Survey of classification methods in finance.

6 www.optasportspro.com

119
Data Structure

Three leagues:

1 United States: Major League Soccer.


2 England: Premier League.
3 Spain: La Liga.

Features × 2:

1 Correct passes
2 Incorrect passes
3 Ratio correct/incorrect passes
4 Good passes at 80% top field
5 Bad passes at 70% top field
6 Shots
7 Corners
8 Cards
9 Fouls
10 Expected goals⁷

7 Variable generated by OPTA.


120
Idea:

Assess how the teams arrive at the game, i.e. build a matrix with teams
on one side and features on the other side⁸.

Details:

We keep only matches that resulted in a win/loss.


We drop matches without complete information: new teams.
We split the dataset into a test set (30%) and a training set (70%).

8 Features are computed as moving averages over the last 6 games.


121
Lasso Regularization

We aim to maximize
L(\beta; \lambda) := \log \left\{ \prod_{i=1}^{n} p_i(\beta)^{y_i} \big( 1 - p_i(\beta) \big)^{1 - y_i} \right\} - \lambda \|\beta\|_1,

where \|x\|_1 = \sum_{i=1}^{n} |x_i|, x ∈ R^n, and λ is the regularization parameter.

Details:

Regularized logit regression problem with Manhattan norm.


β is the coefficient vector corresponding to the features.
Regularization avoids overfitting; λ is selected via cross-validation.
Automatic feature selection, i.e. some coefficients are zero.
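A minimal sketch of this fit with the glmnet package (alpha = 1 gives the lasso penalty); the feature matrix x and the 0/1 win indicator y are simulated here only so the code is runnable:

  library(glmnet)
  set.seed(1)
  x  <- matrix(rnorm(200 * 10), ncol = 10)               # 200 matches, 10 features (hypothetical)
  y  <- rbinom(200, 1, plogis(x[, 1] - 0.5 * x[, 2]))    # simulated win indicator
  cv <- cv.glmnet(x, y, family = "binomial", alpha = 1)  # lambda chosen by cross-validation
  coef(cv, s = "lambda.min")                             # some coefficients are exactly zero
  pred <- predict(cv, newx = x, s = "lambda.min", type = "class")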

122
Classification Accuracy:
The measure of accuracy of the model is given by
ACC = \frac{TP + TN}{TP + TN + FP + FN},
where TP: true positive, TN: true negative, FP: false positive, and FN:
false negative.
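A minimal sketch of the computation in R, for hypothetical prediction and truth vectors coded as 0/1:

  truth <- c(1, 0, 1, 1, 0, 0, 1, 0)                 # hypothetical labels
  pred  <- c(1, 0, 0, 1, 0, 1, 1, 0)                 # hypothetical predictions
  tab   <- table(Predicted = pred, True = truth)     # confusion matrix
  acc   <- sum(diag(tab)) / sum(tab)                 # (TP + TN) / (TP + TN + FP + FN)
  acc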

123
Results: 2014 FIFA World Cup

               Quarter Finals   Matches (no draws)   All Matches

Reported       88%              77%                  –
Replicated     X                X                    53%

Highlights:

Google reported the quarter final’s accuracy (only) at Strata.


and the accuracy excluding draws (only) at its GitHub repository.
The overall accuracy is not reported, but it is very relevant.

124
Results: 2018 FIFA World Cup (Quals)

Highlights:

The accuracy excluding draws is 85%.


The overall accuracy is 68%.

125
Details: Last 10 Games:

Team A   Team B   P(A > B)   Expected   True


Uruguay Argentina 66% Uruguay draw
Peru Bolivia 72% Peru Peru
Ecuador Brasil 7% Brasil Brasil
Paraguay Chile 13% Chile Paraguay
Venezuela Colombia 31% Colombia draw
Venezuela Argentina 31% Argentina draw
Colombia Brasil 46% Brasil draw
Chile Bolivia 47% Bolivia Bolivia
Peru Ecuador 58% Peru Peru
Uruguay Paraguay 53% Uruguay Uruguay

126
Details: Coming 10 Games:

Team A   Team B   P(A > B)   Expected


Peru Argentina 39% Argentina
Brasil Bolivia 78% Brasil
Ecuador Chile 30% Chile
Paraguay Colombia 8% Colombia
Venezuela Uruguay 63% Venezuela
Ecuador Argentina 63% Ecuador
Uruguay Bolivia 63% Uruguay
Chile Brasil 37% Brasil
Peru Colombia 63% Peru
Venezuela Paraguay 37% Paraguay

But that didn’t happen now. Did it?

127
Concluding Remarks

On Google’s exercise:

Transparency is key. Lots of public GitHub repositories.


Google’s statement is not accurate.
Results better than pure luck.

On Argentina–Perú:

We could forecast Ecuador–Peru.


We obtain a 39% probability of Peru beating Argentina.
Put your money where your mouth is?

128
Simple Non-Linear Methods
Simple Non-Linear Methods

Idea: Improve predictive power by relaxing the linearity assumption in

y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \qquad \operatorname{Cov}(\epsilon_i, \epsilon_j) = \sigma^2 \mathbf{1}_{\{i=j\}}.

Examples:

1 Polynomial regression
2 Step Functions
3 Regression Splines
4 Smoothing Splines
5 Local Regression

129
Polynomial Regression

Model:
y_i = \beta_0 + \sum_{j=1}^{d} \beta_j x_i^j + \epsilon_i, \qquad \operatorname{Cov}(\epsilon_i, \epsilon_j) = \sigma^2 \mathbf{1}_{\{i=j\}}. \qquad (54)

Estimation: direct least squares


Tuning parameter d: selected via CV / ANOVA testing
Problem: boundary effects if d > 4.

Remark 10.1
The boundary effect creates large uncertainty at the borders. Note that

\operatorname{Var}[\hat f(x_0)] = \ell_0^\top \hat C \ell_0, \qquad \hat C_{i,j} = \operatorname{Cov}(\hat\beta_i, \hat\beta_j), \qquad \ell_0 = (1, x_0, \ldots, x_0^d)^\top,

where x_0 is an observation in the test set.
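A minimal sketch in R using the Wage data from the ISLR package (assumed installed), as in [JO13]; the widening confidence band near the borders illustrates Remark 10.1:

  library(ISLR)
  fit  <- lm(wage ~ poly(age, 4), data = Wage)      # degree-4 orthogonal polynomial
  grid <- data.frame(age = seq(min(Wage$age), max(Wage$age), length.out = 100))
  pr   <- predict(fit, newdata = grid, se.fit = TRUE)
  plot(Wage$age, Wage$wage, col = "grey")
  lines(grid$age, pr$fit, lwd = 2)
  lines(grid$age, pr$fit + 2 * pr$se.fit, lty = 2)  # approximate 95% band
  lines(grid$age, pr$fit - 2 * pr$se.fit, lty = 2)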

130
Polynomial Regression

Source: [JO13]
131
Step Functions

Model:
y_i = \sum_{k=0}^{K} \beta_k C_k(x_i) + \epsilon_i, \qquad \operatorname{Cov}(\epsilon_i, \epsilon_j) = \sigma^2 \mathbf{1}_{\{i=j\}}, \qquad (55)

where C_K(x_i) = I_{\{c_K \le x_i\}} at k = K, and C_k(x_i) = I_{\{c_k \le x_i < c_{k+1}\}} otherwise.

Estimation: direct least squares


Tuning parameter K : selected via cross validation
Problem: location of ck ’s is important, and not easy to determine.
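A minimal sketch with cut(), again on the Wage data (ISLR package assumed):

  library(ISLR)
  fit <- lm(wage ~ cut(age, breaks = 4), data = Wage)  # 4 intervals -> piecewise-constant fit
  coef(summary(fit))                                   # one coefficient per interval (plus intercept)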

132
Step Functions

Source: [JO13]
133
Regression Splines

Model:
q
ÿ K
ÿ
y i “ β0 ` βk xik ` βk`q bpxi , εk q ` i , Covpi , j q “ σ 2 1ti“ju , (56)
k“1 k“1

where
#
pxi ´ εqq if xi ą ε
bpxi , εq “
0 if otherwise

Estimation: direct least squares


Tuning parameters K and q: K can be selected via CV; q = 3 is usual.
Problem 1: the location of the ε_k's is important, and not easy to determine.
Problem 2: boundary effects if q > 4.
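A minimal sketch of a cubic regression spline with the splines package, placing the knots ε_k at the quartiles of age (ISLR's Wage data assumed):

  library(ISLR); library(splines)
  knots <- quantile(Wage$age, probs = c(0.25, 0.5, 0.75))
  fit   <- lm(wage ~ bs(age, knots = knots, degree = 3), data = Wage)
  length(coef(fit))   # K + q + 1 = 7 coefficients (including the intercept)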

134
Regression Splines

Source: [JO13]
135
Regression Splines + Boundary Conditions = Natural splines

Remark 10.2
The boundary problems of regression splines can be solved using
additional boundary conditions that force the fit to be linear at the
boundaries. Regression splines for which such conditions hold are called
“natural splines”.
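A minimal sketch of the natural-spline counterpart with ns() from the splines package; the fit is forced to be linear beyond the boundary knots:

  library(ISLR); library(splines)
  fit_ns <- lm(wage ~ ns(age, df = 4), data = Wage)  # ns() places knots automatically for a given df
  summary(fit_ns)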

136
Regression Splines + Boundary Conditions = Natural splines

Source: [JO13]

137
Regression Splines + Boundary Conditions = Natural splines

Source: [JO13]
138
Regression Splines + Boundary Conditions = Natural splines

Source: [JO13]

139
Smoothing Splines

Idea: Set K = n and q = 3, and add a roughness penalty λ.

The optimal smoothing spline g(·) is found as

\hat g = \arg\min_{g} \Bigg\{ \underbrace{\sum_{i=1}^{n} \big( y_i - g(x_i) \big)^2}_{\text{Loss}} + \underbrace{\lambda \int g''(t)^2 \, dt}_{\text{Penalty term}} \Bigg\}, \qquad \lambda > 0, \qquad (57)

where g(t) exactly projects the data using a spline basis.

Estimation: penalized least squares


Tuning parameter λ: estimated using CV.
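A minimal sketch with base R's smooth.spline(); the data are simulated here purely for illustration:

  set.seed(1)
  x   <- sort(runif(200, 0, 10))
  y   <- sin(x) + rnorm(200, sd = 0.3)          # hypothetical noisy curve
  fit <- smooth.spline(x, y, cv = TRUE)         # lambda chosen by leave-one-out CV (cf. Remark 10.5 below)
  fit$df                                        # effective degrees of freedom tr{S_lambda}
  fit$lambda                                    # selected roughness penalty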

140
Smoothing Splines

Remark 10.3
Note that in (57):

λ → 0 leads to interpolating the data, i.e. df_{λ=0} = n.
λ → ∞ leads to simple least squares, i.e. df_{λ=∞} = 2. In general, it
can be shown that

df_\lambda = \operatorname{tr}\{S_\lambda\}, \qquad \text{where} \quad \hat g_\lambda = S_\lambda y. \qquad (58)

As λ ↑, the bias ↑ and the variance ↓.

Remark 10.4
It can be shown that given q “ 3, the function g pxq that minimizes (57)
is a natural cubic spline. Hence, cubic smoothing splines do not have
boundary effects.

141
Smoothing Splines

Remark 10.5
It turns out that the computation of LOOCV is particularly fast for
smoothing splines. In fact
RSS_{CV}(\lambda) = \sum_{i=1}^{n} \Big( y_i - \hat g_\lambda^{(-i)}(x_i) \Big)^2 = \sum_{i=1}^{n} \left[ \frac{y_i - \hat g_\lambda(x_i)}{1 - [S_\lambda]_{i,i}} \right]^2, \qquad (59)

is a quantity that can be minimized with respect to λ.

142
Smoothing Splines

Source: [JO13]
143
Local Linear Regression

Algorithm 4 (Local Regression at X = x_0)

1 Gather the fraction s = k/n of training points whose x_i are closest to x_0.
2 Assign a weight K_{i0} = K(x_i, x_0) to each point in this neighborhood, so that
the point furthest from x_0 has weight zero and the closest has the
highest weight. All but these k nearest neighbors get weight zero.
3 Fit a weighted least squares regression of the y_i on the x_i using the
aforementioned weights, i.e. find β̂_0 and β̂_1 such that

(\hat\beta_0, \hat\beta_1) = \arg\min_{\beta_0, \beta_1} \sum_{i=1}^{n} K_{i0} \, (y_i - \beta_0 - \beta_1 x_i)^2.

4 The fitted value at x_0 is given by \hat f(x_0) = \hat\beta_0 + \hat\beta_1 x_0.

144
Local Linear Regression

Estimation: least squares


Tuning parameter s: selected via CV.
Problem: a weighting (kernel) function K must be chosen.
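A minimal sketch with base R's loess(), where span plays the role of s = k/n and degree = 1 gives local linear fits (simulated data again):

  set.seed(1)
  x   <- sort(runif(200, 0, 10))
  y   <- sin(x) + rnorm(200, sd = 0.3)
  fit <- loess(y ~ x, span = 0.3, degree = 1)   # degree = 1 -> local linear regression
  plot(x, y, col = "grey")
  lines(x, predict(fit), lwd = 2)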

145
Local Linear Regression

Source: [JO13]
146
Local Linear Regression

Source: [JO13]

147
GAMs
Generalized Additive Models

It is an extension to the case of p regressors

y_i = \beta_0 + \beta_1 x_{i,1} + \cdots + \beta_p x_{i,p} + \epsilon_i, \qquad \operatorname{Cov}(\epsilon_i, \epsilon_j) = \sigma^2 \mathbf{1}_{\{i=j\}}.

It can be applied to discrete or continuous responses, as

y_i = \beta_0 + \sum_{j=1}^{p} f_j(x_{i,j}) + \epsilon_i,

where the contribution of each x_j enters additively via f_j(x_j).
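A minimal sketch with the mgcv package (assumed installed): smooth terms s(·) for the continuous regressors and a linear term for the factor, on ISLR's Wage data:

  library(ISLR); library(mgcv)
  fit <- gam(wage ~ s(age) + s(year, k = 5) + education, data = Wage)
  summary(fit)          # significance of each additive component
  plot(fit, pages = 1)  # individual effect of each x_j on y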

148
Generalized Additive Models

Source: [JO13]

149
Advantages and Disadvantages of GAMs

Advantages:

Can fit a non-linear f_j(·) to each x_j.
More accurate predictions.
Can look at the individual effect of each x_j on y.

Disadvantages:

The model is additive only.


If interactions exist, they should be explicitly added.

150
Local Linear Regression

Source: [JO13]

151
Homework

Due 5.23.19 - 19:30.

1 Chapter 7: Exercises 2, 3, 6, 8 from [JO13].

152
Tree Based Methods
Regression Trees

Idea: Segment predictor space into rectangular regions and predict based
on the mean/mode of the response within the region.

Called “trees” because the set of splitting rules can be summarized


in a “tree”.
Typically useful for interpretation, but not as accurate as other
tree-based methods: bagging, random forests, boosting.
We start with the continuous-response case (regression trees), and
then consider the discrete-response case (classification trees).

153
Regression Trees

Source: [JO13]
154
Binary Splitting

Idea: given a region X of the feature space, select a variable j and
a cut-off point s to make the split

R_1(j, s) = \{X \mid X_j \le s\} \quad \text{and} \quad R_2(j, s) = \{X \mid X_j > s\}. \qquad (60)

The selection of j and s corresponds to

(\hat j, \hat s) = \arg\min_{j, s} \left\{ \sum_{i : x_i \in R_1(j,s)} \big( y_i - \hat y_{R_1(j,s)} \big)^2 + \sum_{i : x_i \in R_2(j,s)} \big( y_i - \hat y_{R_2(j,s)} \big)^2 \right\}, \qquad (61)

which can be applied iteratively to the resulting R_1 and R_2 regions,
until some stopping criterion is reached, e.g. # obs ≥ 5.

155
Binary Splitting

Source: [JO13]

156
Complexity Cost & Tree Pruning

To avoid the overfitting of large trees, grow a large tree T_0, and
then prune it.
Given α, one can select the optimal tree T ⊂ T_0 using the cost
complexity criterion

C_\alpha(T) = \underbrace{\sum_{m=1}^{|T|} N_m Q_m(T)}_{\text{Weighted Impurity Measure}} + \underbrace{\alpha |T|}_{\text{Complexity Penalty}}, \qquad (62)

where |T| denotes the number of regions in a given subtree, and

Q_m(T) = \frac{1}{N_m} \sum_{x_i \in R_m} (y_i - \hat c_m)^2, \qquad \hat c_m = \frac{1}{N_m} \sum_{x_i \in R_m} y_i

is the so-called impurity cost, for N_m = \#\{x_i \in R_m\}.

157
Building a Regression Tree

Algorithm 5
1 Use binary splitting to grow a large tree T_0 on the training data,
stopping when each terminal node has fewer than a minimum number of
observations.
2 Apply cost complexity pruning to the large tree and obtain a
sequence of best subtrees T ⊂ T_0 as a function of α.
3 Use k-fold cross-validation and pick α̂ = arg min_α of the test MSE.
4 Return the subtree from step 2 that corresponds to α̂.
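A minimal sketch of Algorithm 5 with the rpart package; rpart's complexity parameter cp plays the role of α, and the cross-validated error is stored in the cp table (ISLR's Hitters data assumed, as in [JO13]):

  library(ISLR); library(rpart)
  set.seed(1)
  dat <- na.omit(Hitters)
  dat$Salary <- log(dat$Salary)
  fit <- rpart(Salary ~ ., data = dat, method = "anova",
               control = rpart.control(cp = 0.001, minsplit = 5))   # large tree T0
  printcp(fit)                                                      # CV error (xerror) per cp value
  best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
  pruned  <- prune(fit, cp = best_cp)                               # subtree corresponding to alpha-hat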

158
Classification Trees

Consider a discrete response y_i ∈ {1, 2, ..., K} and denote by

\hat p_{mk} = \frac{1}{N_m} \sum_{i : x_i \in R_m} I_{\{y_i = k\}}

the proportion of class-k observations in node m. Thus, the classification
k for region R_m follows k(m) = arg max_k \hat p_{mk}.
Measures of node impurity for classification problems:
1 Classification error rate: E_m = 1 - max_k(\hat p_{mk}).
2 Gini index: G = \sum_{k=1}^{K} \hat p_{mk} (1 - \hat p_{mk}), which measures the variance
across the K classes.
3 Entropy: D = -\sum_{k=1}^{K} \hat p_{mk} \log(\hat p_{mk}), with a similar interpretation to
the Gini index.
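A minimal sketch of the three impurity measures in R, for a hypothetical vector of class proportions in a node:

  impurity <- function(p) {
    p <- p[p > 0]                               # drop empty classes to avoid log(0)
    c(class_error = 1 - max(p),
      gini        = sum(p * (1 - p)),
      entropy     = -sum(p * log(p)))
  }
  impurity(c(0.7, 0.2, 0.1))                    # e.g. a node with three classes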

159
Trees vs. Linear Models

Linear regression assumes a model of the form

f(X) = \beta_0 + \sum_{j=1}^{p} X_j \beta_j.

Regression trees assume a model of the form

f(X) = \beta_0 + \sum_{m=1}^{M} c_m \cdot \mathbf{1}_{\{X \in R_m\}}.

160
Trees vs. Linear Models

Source: [JO13]
161
Advantages and Disadvantages of Trees

Advantages:

Easy to explain to people, can be displayed graphically, and are


easily interpreted.
Some people believe that decision trees more closely mirror human
decision-making.
Trees can easily handle qualitative predictors without the need to
create dummy variables.

Disadvantages:

High variance because of hierarchical modeling structure.


Bad predictive accuracy.
Non-robust.
Lack of smoothness of prediction surface.
Difficulty in capturing additive structure.
162
Bagging

Definition 12.1 (Bagging)


Bagging is a general purpose procedure for reducing the variance of a
statistical learning method, e.g. regression/classification trees.

Remark 12.1
Recall that averaging a set of independent random variables reduces the
variance. In fact, given independent r.v.s z_1, ..., z_n, each with variance
σ², the variance of z̄ is σ²/n.

163
Bagging

Idea of Bagging: take many training sets from the population, build
different trees for each set and average the predictions.
\hat f_{avg}(X) = \frac{1}{B} \sum_{b=1}^{B} \hat f^{\,b}(X),

where \hat f^{\,b}(X) is the prediction with training set b. However, since we
do not have access to different training sets, one can use the bootstrap,

\hat f_{avg}(X) = \frac{1}{B} \sum_{b=1}^{B} \hat f^{\,*b}(X),

where \hat f^{\,*b}(X) is one bootstrap estimation.

164
Random Forests

Idea: bagging for de-correlated trees.


Build decision trees on bootstrapped training samples.
Each time a split is considered, select m predictors out of p as
candidates (not all, e.g. m = √p).

Remark 12.2
Note that if m “ p, all predictors are considered at every split, thus we
are in the bagging case.
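A minimal sketch with the randomForest package: mtry = p reproduces bagging, while mtry ≈ √p gives a random forest (the log-salary setup from the regression-tree example is assumed, for illustration only):

  library(ISLR); library(randomForest)
  set.seed(1)
  dat <- na.omit(Hitters); dat$Salary <- log(dat$Salary)
  p   <- ncol(dat) - 1
  bag <- randomForest(Salary ~ ., data = dat, mtry = p, ntree = 500)                # bagging
  rf  <- randomForest(Salary ~ ., data = dat, mtry = floor(sqrt(p)), ntree = 500)   # random forest
  c(bagging = bag$mse[500], random_forest = rf$mse[500])   # out-of-bag MSE estimates
  importance(rf)                                           # variable importance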

165
Boosting

Idea: Take many small modified versions of the data set, and grow trees
sequentially, i.e. each tree grows using info. from previously grown trees.

166
Boosting

Algorithm 6 (Boosting)

1 Set \hat f(x) = 0 and r_i = y_i for all i in the training set.
2 For b = 1, 2, ..., B do:
(a) Fit a tree \hat f^{(b)} with d splits (d + 1 terminal nodes) to the data (X, r).
(b) Update \hat f by adding a shrunken version of the new tree:

\hat f(x) ← \hat f(x) + λ \hat f^{(b)}(x)

(c) Update the residuals:

r_i ← r_i - λ \hat f^{(b)}(x_i)

3 Output the boosted model

\hat f(x) = \sum_{b=1}^{B} λ \hat f^{(b)}(x).

167
Boosting

Tuning parameters:

1 B: number of trees. If too large can lead to overfitting.


2 λ: shrinkage parameter. Controls the learning speed of the model.
3 d: number of splits. Controls the complexity of boosted ensemble.
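A minimal sketch with the gbm package, where B, d and λ map to n.trees, interaction.depth and shrinkage (same illustrative log-salary data):

  library(ISLR); library(gbm)
  set.seed(1)
  dat <- na.omit(Hitters); dat$Salary <- log(dat$Salary)
  fit <- gbm(Salary ~ ., data = dat, distribution = "gaussian",
             n.trees = 5000, interaction.depth = 2, shrinkage = 0.01, cv.folds = 5)
  best_B <- gbm.perf(fit, method = "cv")   # number of trees B chosen by cross-validation
  summary(fit, n.trees = best_B)           # relative influence of each feature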

168
Support Vector Machines
Hyperplanes

Definition 13.1 (Hyperplane)


Let S denote a p-dimensional space. H is a hyperplane in S if it is a flat
affine subspace of S of dimension p − 1; it need not contain the null
element. In fact, for some β_0, β_1, ..., β_p ∈ R,

\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p = 0, \qquad (63)

defines a hyperplane in R^p in the sense that any x ∈ R^p that fulfills (63)
is a point in the hyperplane.
Example 13.1 (Hyperplane)
Note that

1 any line in R2 is a hyperplane in R2 , but only the ones crossing the


origin are subspaces of R2 .
2 any plane in R3 is a hyperplane in R3 , but only the ones crossing the
origin are subspaces of R3 .
169
Classification Using a Separating Hyperplane

Definition 13.2 (Separating Hyperplane)


Consider observations x_i = (x_{i,1}, x_{i,2}, ..., x_{i,p})^T ∈ R^p, and a binary
response y_i ∈ {−1, 1}. A separating hyperplane is a hyperplane for which

\beta_0 + \beta_1 x_{i,1} + \cdots + \beta_p x_{i,p} > 0 \quad \text{if } y_i = +1
\beta_0 + \beta_1 x_{i,1} + \cdots + \beta_p x_{i,p} < 0 \quad \text{if } y_i = -1,

or equivalently

y_i (\beta_0 + \beta_1 x_{i,1} + \cdots + \beta_p x_{i,p}) > 0.

If a separating hyperplane exists, then it is the natural classifier.
Given x*, the confidence of its classification depends on how far the
quantity β_0 + β_1 x*_1 + ... + β_p x*_p is from zero.
A separating hyperplane need not be unique.

170
Classification Using a Separating Hyperplane

Source: [JO13]
171
Maximal Margin Classifier

Definition 13.3 (Margin)


The margin is the minimal distance from any observation of the training
set to a given separating hyperplane.

Definition 13.4 (Maximal Margin Hyperplane)


The maximal margin hyperplane is a separating hyperplane that has the
largest margin.

Definition 13.5 (Support Vectors)


All the observations equidistant to the maximal margin hyperplane are
called support vectors.

The maximum margin hyperplane depends directly on the support


vectors, but not on the other observations.
The classification technique that uses the maximal margin
hyperplane is called maximal margin classifier.

172
Maximal Margin Classifier

Source: [JO13]
173
Maximal Margin Classifier

Source: [JO13]

174
Maximal Margin Classifier

The maximal margin hyperplane is the solution of


(\hat\beta_0, \ldots, \hat\beta_p, \hat M) = \arg\max_{\beta_0, \ldots, \beta_p, M} \left\{ M \; : \; \sum_{j=1}^{p} \beta_j^2 = 1, \;\; y_i \Big( \beta_0 + \sum_{j=1}^{p} \beta_j x_{i,j} \Big) \ge M, \; i = 1, \ldots, n \right\},

A separating hyperplane may not exist. This can be fixed using a


“soft margin”.
The generalization of the maximal margin classifier to the
non-separable case is known as the support vector classifier.

175
Support Vector Hyperplane

The support vector hyperplane is the solution of

(\hat\beta_0, \ldots, \hat\beta_p, \hat\epsilon_1, \ldots, \hat\epsilon_n, \hat M) =

\arg\max_{\beta_0, \ldots, \beta_p, \epsilon_1, \ldots, \epsilon_n, M} \left\{ M \; : \; \sum_{j=1}^{p} \beta_j^2 = 1, \;\; y_i \Big( \beta_0 + \sum_{j=1}^{p} \beta_j x_{i,j} \Big) \ge M (1 - \epsilon_i), \;\; \epsilon_i \ge 0, \;\; \sum_{i=1}^{n} \epsilon_i \le C, \; i = 1, \ldots, n \right\},

where ε_i, i = 1, ..., n, are called "slack variables".

If ε_i > 0, the i-th observation is on the wrong side of the margin.
If ε_i > 1, the i-th observation is on the wrong side of the hyperplane.
All observations with ε_i > 0 (in particular those with ε_i > 1) are support vectors.
The parameter C controls the severity of the accumulated violations.
If C ↓, then bias ↓ and variance ↑; if C ↑, then bias ↑ and variance ↓.
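A minimal sketch of a support vector classifier with the e1071 package, tuning the cost by cross-validation on simulated two-class data. Note that e1071's cost argument is the penalty of the soft-margin problem, so a large cost corresponds to a small budget C in the formulation above (the direction is inverted):

  library(e1071)
  set.seed(1)
  x <- matrix(rnorm(40 * 2), ncol = 2)
  y <- factor(rep(c(-1, 1), each = 20))
  x[y == 1, ] <- x[y == 1, ] + 1.5              # shift one class to make it (almost) separable
  dat <- data.frame(x = x, y = y)
  tun <- tune(svm, y ~ ., data = dat, kernel = "linear",
              ranges = list(cost = c(0.01, 0.1, 1, 10, 100)))   # CV over the cost parameter
  summary(tun$best.model)                       # chosen cost and number of support vectors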
176
Support Vector Hyperplane

Source: [JO13]
177
Support Vector Hyperplane

178
Support Vector Hyperplane

Source: [JO13]

179
Support Vector Hyperplane

Proposition 3 (Support Vector Hyperplane)


The support vector hyperplane can be represented as

f(x) = \beta_0 + \sum_{i \in S} \alpha_i \langle x, x_i \rangle, \qquad \langle x, x_i \rangle = x^\top x_i, \qquad (64)

where S is the set of indices of the support vectors, and α_i ∈ R.

Proof. From the FOC of the support vector classifier problem it follows
that β = \sum_{i=1}^{n} \delta_i y_i x_i. Thus

f(x) = \beta_0 + x^\top \sum_{i=1}^{n} \delta_i y_i x_i = \beta_0 + \sum_{i=1}^{n} \delta_i y_i \langle x, x_i \rangle = \beta_0 + \sum_{i \in S} \alpha_i \langle x, x_i \rangle,

where the last expression follows from α_i = δ_i y_i, and the fact that α_i ≠ 0
only for the support vectors. □

180
Support Vector Machines

Definition 13.6 (Support Vector Machines)


The support vector machine is an extension to the support vector
classifier that results from enlarging the feature space using kernels,
f(x) = \beta_0 + \sum_{i \in S} \alpha_i \langle x, x_i \rangle \quad \to \quad f(x) = \beta_0 + \sum_{i \in S} \alpha_i K(x, x_i),

where K(x, x_i) = \langle x, x_i \rangle is called the linear kernel. Other examples are

1 Polynomial kernel

K(x, x_i) = \big( 1 + x^\top x_i \big)^d, \qquad d > 0

2 Radial kernel

K(x, x_i) = \exp\big( -\gamma \|x - x_i\|_2^2 \big), \qquad \gamma > 0.
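A minimal sketch of an SVM with a radial kernel in e1071, tuning cost and γ jointly by cross-validation on data with a non-linear class boundary (simulated for illustration):

  library(e1071)
  set.seed(1)
  x <- matrix(rnorm(200 * 2), ncol = 2)
  y <- factor(ifelse(x[, 1]^2 + x[, 2]^2 > 1.5, 1, -1))   # circular boundary
  dat <- data.frame(x = x, y = y)
  tun <- tune(svm, y ~ ., data = dat, kernel = "radial",
              ranges = list(cost = c(0.1, 1, 10), gamma = c(0.5, 1, 2)))
  plot(tun$best.model, dat)                               # decision regions of the fitted SVM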

181
Support Vector Hyperplane

Source: [JO13]

182
SVMs: More than Two Classes

Consider a response with k possible classes. We can use two approaches:

1 One vs. One. Fit an SVM for each of the \binom{k}{2} possible pairs of classes. Do voting
at the observation level and classify accordingly.
2 One vs. All.
For all observations in class 1, code the response as +1, and as −1 for all
the other classes. Fit an SVM and compute β_{0,1}, β_{1,1}, ..., β_{p,1}.
Repeat the previous step for classes 2, 3, ..., k and collect
β_{0,i}, β_{1,i}, ..., β_{p,i}, i = 1, ..., k, in each case.
For an out-of-sample observation x*, compute

f(x^*; \beta_i) = \beta_{0,i} + x^{*\top} \beta_i, \qquad i = 1, \ldots, k,

and select the class i for which f(x^*; β_i) is the largest.
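A minimal sketch with e1071, which handles the multi-class case internally via the one-vs-one scheme (here on the built-in iris data, with 3 classes):

  library(e1071)
  fit <- svm(Species ~ ., data = iris, kernel = "radial")   # 3 classes -> 3 pairwise SVMs
  table(Predicted = predict(fit, iris), True = iris$Species)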

183
Homework

Due 5.30.19 - 19:30.

1 Chapter 8: Exercises 2, 11 from [JO13].


2 Chapter 9: Exercises 3 (a,b,c), 7 (a,b,c,d) from [JO13].

184
