
Machine Learning: Business Applications

Francisco Rosales Marticorena, PhD.


frosales@esan.edu.pe

04.04.19 – 30.05.19
ESAN Graduate School of Business
General Course Information

Course: Machine Learning: Business Applications
Academic area: Executive Specialization Program
Year and term: 2019 – I
Instructor: Francisco Rosales Marticorena, PhD.
Email: frosales@esan.edu.pe
Phone: (511) 317-7200 / 444340

1
Course Description

This course presents machine learning methods with emphasis on supervised learning problems of classification and regression.
The course contains sessions on mathematical foundations, methodological development, and applications.
The R software will be used to solve case studies.

2
Course Objectives

Improve the quantitative skills of analysts and managers to interpret the results of methods that learn from data.
Properly use the basic mathematical concepts involved in supervised machine learning methods.
Use the R software and its specialized libraries to develop their own implementations or use third-party ones.

3
Course Schedule

1 Basic Statistics:
Session 1: Introduction
Session 2: R Software
Session 3: Linear Regression
2 Linear Methods:
Session 4: Classification Models
Session 5: Resampling Methods
Session 6: Regularization
Session 7: Dimension Reduction
Session 8: Workshop
3 Non-Linear Methods:
Session 9: Splines
Session 10: GAMs
Sessions 11, 12: Decision Trees
Sessions 13, 14: Support Vector Machines
Session 15: Final Exam

4
Methodology

The instructor's lectures will be complemented with activities carried out by the students in and outside the classroom:

Participate in class.
Read the bibliography indicated in the syllabus.
Do the homework assignments.
Take the scheduled evaluations.

5
Evaluation

Final Grade = six homework assignments (60%) + one final exam (40%).

Homework assignments may be done individually or in pairs.
The final exam is individual.
The final exam is mandatory and will be taken on 30.05.19.

6
References

[EH06] Everitt, B. and T. Hothorn (2006). A Handbook of Statistical Analyses Using R. Chapman & Hall/CRC.
[JO13] James, G., D. Witten, T. Hastie and R. Tibshirani (2013). An Introduction to Statistical Learning with Applications in R. Springer Series in Statistics.
[HT09] Hastie, T., R. Tibshirani and J. Friedman (2009). The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer Series in Statistics.
[RS05] Ramsay, J.O. and B.W. Silverman (2005). Functional Data Analysis. Springer Series in Statistics.
[RO03] Ruppert, D., M. Wand and R. Carroll (2003). Semiparametric Regression. Cambridge Series in Statistical and Probabilistic Mathematics.
[W06] Wood, S. (2006). Generalized Additive Models: An Introduction with R. Chapman & Hall/CRC.

7
Instructor

Education:

PhD. Mathematics and Computer Science. University of Göttingen.
MSc. Applied Mathematics and Statistics. SUNY Stony Brook.
MSc. Mathematics. PUCP.
Bachelor's and Licentiate. Economics. UP.

Experience:

2019: Research Professor. IT. U. ESAN.
2018: Manager. Financial Services Office. EY Perú.
2017: Research Professor. Finance. U. del Pacífico.
2011–2016: Research Associate. IMS. U. Göttingen.
2005–2008: Research Scientist. CGIAR. CIP.

8
Participants

Expected roles: analyst, manager, etc.
Sectors: banking, insurance, regulators, etc.
Languages: Python, R, C++, Matlab, etc.

9
Materials

From a phone or a web browser:

https://github.com/LFRM/Lectures

10
Introduction
Overview

Machine Learning is a toolbox for understanding data using statistics.

Objectives
1 Prediction / to predict something
2 Inference / to explain something
Problems
1 Supervised learning: "input ⇒ output" structure.
2 Unsupervised learning: only "input" structure.
Methods
1 Regression
2 Classification
3 Clustering

11
Basics

The general model follows:

    Y = f(X) + ε,   X = {X_1, ..., X_p},

where

Y is called the response / dependent variable
X are called the features / independent variables / predictors
f is a non-random function
ε is a random error term, independent of X, with zero mean.

In this course: we find f to predict / explain.

12
Basics

Usual Steps:

1 We observe predictors X and response Y.
2 We characterize their relationship

    Y = f(X) + ε.   (1)

3 We estimate f̂ "somehow".
4 We use f̂ on X to make the prediction Ŷ

    Ŷ = f̂(X).   (2)

Focus depends on Goals:

Prediction: we care mostly about Ŷ.
Inference: we care mostly about f̂.

13
Estimation Error

Proposition 1

    E[(Y − Ŷ)²] = E[(f(X) − f̂(X))²] + Var[ε],   (3)

where the first term on the right is the reducible error and the second is the irreducible error.

Proof. Trivial. Direct substitution from (1) and (2). ∎

Interpretation of Proposition 1:

The magnitude of the estimation error has a reducible and an irreducible component.
We cannot reduce the estimation error below Var[ε].

14
Estimation of f : Parametric vs. Non-Parametric

Parametric Methods:

Impose rigid structure on f , e.g. f is linear.


Trade-off: easy to interpret vs. bad accuracy.

Non-Parametric Methods:

Impose flexible structure on f , e.g. f is a piecewise polynomial.


Trade-off: difficult to interpret vs. good accuracy.

15
Estimation of f : Parametric vs. Non-Parametric

Source: [JO13]

16
Estimation of f : Regression vs. Classification

Discrete response ⇒ classification / continuous response ⇒ regression.


Specific methods for regression or classification.
Some methods deal with both, e.g. K-nearest neighbors, boosting.

17
Estimation of f : Assessing Model Accuracy

Definition 1.1 (Mean Squared Error)

The Mean Squared Error (MSE) is defined as

    MSE := Ave{(y_i − f̂(x_i))²}   (4)
         = (1/n) Σ_{i=1}^n (y_i − f̂(x_i))².

We call it "train" MSE if we compute it with training data (X, Y) and "test" MSE if we compute it with test data (X₀, Y₀).

Train MSE can be reduced arbitrarily (overfitting).
Test MSE cannot be reduced arbitrarily.
Test MSE is used for model selection.
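
As an illustration, a minimal R sketch with simulated data (all variable names are made up for the example) computes both quantities:

set.seed(1)
# simulated train and test data (illustrative only)
x_tr <- runif(100); y_tr <- sin(2 * pi * x_tr) + rnorm(100, sd = 0.3)
x_te <- runif(100); y_te <- sin(2 * pi * x_te) + rnorm(100, sd = 0.3)

fit <- lm(y_tr ~ poly(x_tr, 5))             # a flexible polynomial fit
mse_train <- mean((y_tr - fitted(fit))^2)   # train MSE
y_hat_te  <- predict(fit, newdata = data.frame(x_tr = x_te))
mse_test  <- mean((y_te - y_hat_te)^2)      # test MSE, used for model selection
c(mse_train, mse_test)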

18
Estimation of f : Assessing Model Accuracy

Left: data (circles), true function (black), linear fit (orange), spline fit 1 (blue),
spline fit 2 (green). Right: train MSE (gray), test MSE (red).

Source: [JO13]
19
Estimation of f : Bias - Variance Trade-off

Proposition 2

Given x₀, the expected test MSE can be decomposed into the sum of three quantities: the variance of f̂(x₀), the squared bias of f̂(x₀), and the variance of the error term.

Proof. From Proposition 1 we know that

    E[(y₀ − f̂(x₀))²] = E[(f(x₀) − f̂(x₀))²] + Var[ε]
                     = E[({f(x₀) − E[f̂(x₀)]} − {f̂(x₀) − E[f̂(x₀)]})²] + Var[ε].

Thus,

    E[(y₀ − f̂(x₀))²] = E[(f(x₀) − E[f̂(x₀)])²] + E[(f̂(x₀) − E[f̂(x₀)])²] + Var[ε],

where the terms on the right are the squared bias of f̂, the variance of f̂, and the variance of the error, respectively. ∎
20
Estimation of f : Bias - Variance Trade-off

Interpretation of proposition 2:

The reducible error is a mixture of bias and variance.


To minimize the expected test MSE we need a method that
minimizes both Bias and Variance.

Methods:

More flexible methods: ↓ bias and ↑ variance.
Less flexible methods: ↑ bias and ↓ variance.
The adequacy of the method depends on the data.

21
Estimation of f : Classification Setting

Definition 1.2 (Error Rate)

The error rate (ER) is the proportion of mistakes that the classifier makes on a given data set:

    ER := Ave{I(ŷ_i ≠ y_i)}   (5)
        = (1/n) Σ_{i=1}^n I(ŷ_i ≠ y_i).

Definition 1.3 (Bayes Classifier)

The Bayes classifier minimizes (5) by selecting, for each x₀ in the data set, the class j that attains

    max_j P[Y = j | X = x₀].   (6)

22
Estimation of f : Classification Setting

Note that the Bayes classifier:

Is theoretical, i.e. the conditional probability in (6) is unknown.
Provides a lower bound on the ER:

    ER ≥ 1 − E[max_j P[Y = j | X = x₀]].

23
Estimation of f : Classification Setting

Definition 1.4 (K-Nearest Neighbors)

K-Nearest Neighbors (KNN) is a classification method that requires a positive integer K. For a test observation x₀, it identifies the K points nearest to x₀, denoted N₀, and estimates the conditional probability of class j as

    P̂[Y = j | X = x₀] = (1/K) Σ_{i ∈ N₀} I(y_i = j),

i.e. as the fraction of the points in N₀ whose response values equal j.

Note that:

K has a strong effect on the classification obtained by KNN.
Small K: ↓ bias and ↑ variance.
Large K: ↑ bias and ↓ variance.
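
A minimal R sketch of KNN classification, assuming the class package is available; the simulated data and variable names are illustrative only:

library(class)
set.seed(1)
# two-class simulated data (illustrative)
x  <- matrix(rnorm(200 * 2), ncol = 2)
y  <- factor(ifelse(x[, 1] + x[, 2] > 0, 1, 0))
tr <- sample(200, 140)                       # train indices

# KNN prediction on the held-out points for two choices of K
yhat_k1  <- knn(train = x[tr, ], test = x[-tr, ], cl = y[tr], k = 1)
yhat_k15 <- knn(train = x[tr, ], test = x[-tr, ], cl = y[tr], k = 15)
mean(yhat_k1  != y[-tr])                     # test error rate, small K
mean(yhat_k15 != y[-tr])                     # test error rate, large K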

24
Estimation of f : Classification Setting

[Figure: simulated classification example — panels "Train data + True f" and "Test data + KNN fit", y vs. x; plot points omitted.]
25
R Software
The R Project for Statistical Computing

This is an extremely short introduction to R.


Click here to download the software.
For more information look at my lecture notes in Applied Statistics.

For now, let's run R online:

26
Basic Commands

The call for the funcname function is funcname(arg1,arg2),


where arg1 and arg2 are args. of the function.
1 What does c() evaluated at the arguments 1 and 2 do?
2 Does the order of the arguments matter?
We can store values using “<-” or “=”.
> x <- c(1,2)
> y = c(4,-1)
> x + y
[1] 5 1
Let z1 = 1000 and z2 = rep(1000, 3). Compute
1 x + z1
2 x + z2

27
Basic Commands

To ask for help on function funcname use ?funcname


To list all the objects in the workspace use ls().
To delete object obj from the workspace use rm(obj).
To delete every object use rm(list = ls())

28
Basic Commands

To create a matrix use matrix


> x <- matrix(data = 1:4, nrow = 2, ncol = 2)
> x
[,1] [,2]
[1,] 1 3
[2,] 2 4
Compute
1 z <- x ^ 2
2 r <- matrix(1, 2, 4)
3 q <- z %*% r

29
Basic Commands

To create realizations of a normal random variable use rnorm. To
specify the mean, use mean and, to specify the standard deviation,
use sd
> x <- rnorm(10000)
> y <- rnorm(10000, mean = 50, sd = 0.1)
To compute the correlation between two r.v.s use cor.
> z <- x + y
> cor(x, z)
[1] 0.9951
To reproduce the exact same random numbers use set.seed() with
an arbitrary integer argument, e.g. set.seed(123)
> set.seed(123)
> rnorm(5)
[1] -0.56047565 -0.23017749 1.55870831 ...

30
Graphics

There are various functions for plotting. See ?plot


> x = rnorm(100) + 1:100
> y = rnorm(100) + seq(-1, -100, length = 100)
> plot(x, y)
> plot(x, y, xlab = "this is my x-axis",
ylab = "this is my y lab", main = "Plot x vs y")
To save the output use pdf(), or jpeg()
> pdf("myfgure.pdf")
> plot(x, y, color = 2, lwd = 3)
> dev.off()
null device
1

31
Graphics

Let f : R² → R.

To plot the contour of f use contour or image.


> x = 1:10
> y = x
> f = outer( x, y, function (x, y) cos(y) / (1 + x ^ 2) )
> contour(x, y, f)
> contour(x, y, f, nlevels = 45)
> fa = ( f - t(f) ) / 2
> contour( x, y, fa, nlevels = 15)
> image(x, y, fa)
To plot f in three dimensions use persp.
> persp(x, y, fa)
> persp(x, y, fa, theta = 30, phi = 20)
> persp(x, y, fa, theta = 30, phi = 40)

32
Indexing Matrix Data

Extracting part of a data set can be done in different ways:

> A = t(matrix(1:16, 4, 4))


> A
[,1] [,2] [,3] [,4]
[1,] 1 2 3 4
[2,] 5 6 7 8
[3,] 9 10 11 12
[4,] 13 14 15 16

Compute
1 A[2,3]
2 A[1,]
3 A[1:3, c(2,4)]
4 A[,-4]
5 dim(A)
33
Loading Data

To import data use read.table() or read.csv(); and to visualize


the imported data use fix().
> Auto = read.csv("Auto.csv", header = T, sep=" ")
> fix(Auto)
> dim(Auto)
[1] 392 9
To list the variable names in the data set use names().
> names(Auto)
[1] "mpg" "cylinders" "displacement" "horsepower"
[5] "weight" "acceleration" "year" "origin"
[9] "name"
To export data use write.table()

34
Additional Graphical and Numerical Summaries

To access a variable cylinders in data frame Auto, we use


> Auto$cylinders
To avoid using the dollar symbol, we can simply attach the data, so
that the variables in the data frame can be accessed by name.
> attach(Auto)
> plot(cylinders, mpg)
Plot the following:
1 mpg vs. cylinders using red circles
2 mpg vs. cylinders using red circles, with axis labels "mpg" and
"cylinders" resp.

35
Additional Graphical and Numerical Summaries

To plot a histogram use hist()


> hist(mpg)
> hist(mpg, col = 2)
To create a scatterplot matrix use pairs()
> pairs(Auto)
> pairs(~ mpg + displacement + horsepower
+ weight + acceleration, Auto)
To print a summary of a given variable or data frame, use
summary()
> summary(mpg)
> summary(Auto)

36
Linear Regression
Motivation

What?

Supervised learning method for continuous response.

Why?

Well documented starting point.
Recall: not flexible, typically ↑ bias and ↓ variance.
Fancy approaches are extensions/generalizations of linear regression.

37
Simple Linear Regression

Model:

    Y = f(X) + ε,   f(X) = β₀ + β₁X,   (7)

where ε is a centered random noise, uncorrelated with X.

[Figure: two simulated data sets, "Example 1" and "Example 2", showing y vs. x with a linear trend; plot points omitted.]

38
Estimation

Idea: estimate f(X) via

    f̂(X) = β̂₀ + β̂₁X,   (8)

e.g. identifying the β̂₀, β̂₁ that fulfill the least squares criterion.
Least Squares Criterion:

    (β̂₀, β̂₁) = argmin_{β₀,β₁} RSS,   (9)

    RSS := Σ_{i=1}^n e_i²,   e_i := y_i − ŷ_i,   (10)

where RSS denotes the "Residual Sum of Squares" of the model.

39
Estimation

FOC:

    ∂RSS/∂β̂₀ = β̂₀ + β̂₁x̄ − ȳ = 0,   (11)
    ∂RSS/∂β̂₁ = β̂₀ Σ_{i=1}^n x_i + β̂₁ Σ_{i=1}^n x_i² − Σ_{i=1}^n x_i y_i = 0,   (12)

with solution

    β̂₀ = ȳ − β̂₁x̄,   (13)
    β̂₁ = Σ_{i=1}^n (x_i − x̄)(y_i − ȳ) / Σ_{i=1}^n (x_i − x̄)².   (14)

RSS is convex, thus the SOC is fulfilled.
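
In R, the closed-form least squares solution (13)-(14) is what lm() computes; a minimal sketch with simulated data (names are illustrative):

set.seed(1)
x <- rnorm(100)
y <- 2 + 3 * x + rnorm(100)                  # true beta0 = 2, beta1 = 3

fit <- lm(y ~ x)                             # least squares fit
coef(fit)                                    # beta0_hat, beta1_hat

# the same slope by hand, as in (14)
sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)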

40
Example: Sample Mean Estimator

Recall:

Let Z ∼ (µ, σ), where the mean µ and s.d. σ are unknown.
Sample mean estimator: collect n random samples of Z, and estimate µ by µ̂ = (1/n) Σ_{i=1}^n z_i. Repeat the previous step m times with different samples of the same size: {µ̂⁽¹⁾, µ̂⁽²⁾, ..., µ̂⁽ᵐ⁾}.
Unbiasedness: we say µ̂ is an unbiased estimator of µ, since

    (1/m) Σ_{k=1}^m µ̂⁽ᵏ⁾ → µ  as m → ∞.

Variance: defined as

    SE(µ̂)² = σ²/n,

it measures the precision of the sample mean estimator.

41
Bias in Simple Linear Regression

Given some data we estimate β₀ and β₁ by β̂₀ and β̂₁; these estimators are r.v.s that vary with the random sample.
Unbiasedness: if we consider m random samples, we obtain β̂₀⁽ᵏ⁾ and β̂₁⁽ᵏ⁾, for k = 1, ..., m. It can be shown that

    (1/m) Σ_{k=1}^m β̂₀⁽ᵏ⁾ → β₀  and  (1/m) Σ_{k=1}^m β̂₁⁽ᵏ⁾ → β₁  as m → ∞,

thus f̂(X) is an unbiased estimator of f(X), that is

    (1/m) Σ_{k=1}^m f̂(X)⁽ᵏ⁾ → f(X)  as m → ∞,

where f(X) is the "population line" and f̂(X)⁽ᵏ⁾ is a "sample estimate of the line" obtained with the least squares criterion.

42
Variance

Variance: the precision of the estimators follows:

    SE[β̂₀]² = σ² [ 1/n + x̄² / Σ_{i=1}^n (x_i − x̄)² ],   (15)
    SE[β̂₁]² = σ² / Σ_{i=1}^n (x_i − x̄)²,   (16)

where we have used Var[ε_i] = σ² and Cov(ε_i, ε_j) = 0 for i ≠ j.

Note that σ is not known, and it is estimated by

    σ̂ = sqrt( RSS / (n − 2) ),   (17)

sometimes called the "residual standard error" (RSE).

43
Confidence Bands

A confidence band is a range of values that contains the true unknown value of the parameter with a certain probability.
Example: a 95% confidence band says that with 95% probability the band contains the true value of the parameter.
In linear regression, if we add the assumption that ε is normally distributed, the bands for β̂_i look approximately like

    [β̂_i − 2·SE(β̂_i), β̂_i + 2·SE(β̂_i)],   i = 0, 1.   (18)

44
Hypothesis testing

Consider the hypothesis "there is no relation between X and Y":

    H₀: β₁ = 0
    Hₐ: β₁ ≠ 0

We test H₀ by computing the quantity

    t = (β̂₁ − 0) / SE[β̂₁],   (19)

which is distributed t with n − 2 degrees of freedom (t_{n−2} in short), and is called the t-statistic.
We reject H₀ if the t-statistic is "large" ⇔ if the p-value is "small", where "small" means, e.g., less than 0.05.

45
Hypothesis Testing

Remark 3.1
To see that t ∼ t_{n−2} in (19), recall that a r.v. X = Z / sqrt(V/ν) is distributed t_ν if Z ∼ N(0, 1) and V ∼ χ²_ν, where χ²_ν denotes the chi-squared distribution with ν degrees of freedom. The denominator of (19) reads

    SE[β̂₁] = sqrt( (RSS/σ²) / (n − 2) ) · sqrt( σ² / Σ_{i=1}^n (x_i − x̄)² ),   (20)

where RSS/σ² ∼ χ²_{n−2}, while the numerator follows

    (β̂₁ − 0) / sqrt( σ² / Σ_{i=1}^n (x_i − x̄)² ) ∼ N(0, 1).   (21)

Writing (21) over (20) we obtain the desired expression.

46
Model Accuracy

Residual Standard Error (RSE):

    RSE := sqrt( RSS / (n − 2) ) = sqrt( Σ_{i=1}^n e_i² / (n − 2) ) = sqrt( Σ_{i=1}^n (y_i − ŷ_i)² / (n − 2) ).

Problem: it is not clear what a "good" RSE is.

R-Squared (R²):

    R² = (TSS − RSS) / TSS = 1 − RSS/TSS,   (22)
    TSS = Σ_{i=1}^n (y_i − ȳ)².

Interpretation: R² is between 0 and 1. It indicates the proportion of the variability of Y that can be explained by X. Further, it can be shown that R² = Cor(X, Y)², by plugging (13) and (14) into (22).

47
Multiple Linear Regression

Model:

    Y = f(X) + ε,   f(X) = β₀ + β₁X₁ + ··· + β_pX_p,   (23)

where ε is a centered random noise, uncorrelated with X.

48
Multiple Linear Regression

[Figure: two simulated 3D examples — "Data 1 / True Function 1" and "Data 2 / True Function 2" — showing y as a function of x1 and x2; plot points omitted.]

49
Hypothesis Testing

Is there at least one X_j that explains Y?

    H₀: β₁ = β₂ = ··· = β_p = 0
    Hₐ: at least one β_j is non-zero

Compute the F-statistic

    F = [(TSS − RSS)/p] / [RSS/(n − p − 1)].

Recall that the r.v. X = (U₁/d₁)/(U₂/d₂) is distributed F_{d₁,d₂}, for U₁ ∼ χ²_{d₁}, U₂ ∼ χ²_{d₂}. If the p-value is small (e.g. < 0.05) we reject H₀. This means that there is at least one X_j that explains Y.

50
Variable Selection

How do we select a good subset of X_j's out of all predictors?

Idea: try all possible models. Problem: we need to evaluate 2^p models. If p = 30, this is more than a billion models.
Forward selection: start with a model that only uses the intercept to predict Y ("the best model with zero variables") and add one variable to the model at a time. To select "the best model with one variable", evaluate all of them and select the one with the smallest RSS. Repeat the idea to select "the best model with 2, 3, ..., k variables". Repeat until some stopping criterion is reached.
Backward selection: start with a model that has all variables. Remove one variable at a time, selecting the one with the largest p-value. Re-run the model and repeat the process until some stopping criterion is reached.

Note that backward selection cannot be done if p > n. The stopping criterion can be related to some target p-value for the predictors in the model.
51
Model Accuracy

Advantages and disadvantages:

RSS: not scaled.
R-squared: scaled in [0, 1]. In fact it can be shown that R² = Cor(Y, Ŷ)². Problem: it increases with p.

52
Predictors

We can compute ŷ_i = f̂(x_i) as a predictor of

    y_i = f(x_i) + ε,   f(x_i) = β₀ + Σ_{j=1}^p β_j x_{i,j}.

Confidence intervals: compare f̂(X) with f(X), i.e. include only reducible errors.
Prediction intervals: compare f̂(X) with f(X) + ε, i.e. include reducible and irreducible errors.

53
Predictors: Boston Example

[Figure: Boston example — panels "Train data + Estimated f" and "Confidence Interval + Prediction Interval", y vs. x; plot points omitted.]

54
Other Considerations

Qualitative predictors. An m-level predictor that induces m sub-models requires m − 1 indicator variables.
Interaction between predictors. If of the form x_i × x_j, we no longer have a linear model.

55
Potential Problems

The true relation f(X) is non-linear. Look for patterns in the error (e_i) plots, e.g. errors not centered.
Serial correlation between error terms. Look for persistence in the e_i plots.
Non-constant variance of errors. Check the ŷ_i vs. e_i plot.
Outliers. Check the Ŷ vs. Y plot.

56
Potential Problems

High leverage points. For simple linear regression, check

    h_i = 1/n + (x_i − x̄)² / Σ_{j=1}^n (x_j − x̄)²,   h_i ∈ [1/n, 1],   (24)

called the "leverage statistic". h_i ≈ 1 means high leverage at i.

Collinearity. Check the variance inflation factor

    VIF(β̂_j) = 1 / (1 − R²_{X_j|X_−j}),   VIF(β̂_j) ∈ [1, ∞),   (25)

where R²_{X_j|X_−j} is the R-squared of the regression of X_j against all other predictors except X_j.

57
Comparison with KNN

If p is small and the relation is non-linear, KNN is better.
If p is large, KNN has problems due to the "curse of dimensionality".
The plot below shows 10 realizations of x, y, z ∼ U(0, 1) plotted on the line, the square, and the cube. See how the separation between the points increases with the dimensionality.

[Figure: 1D, 2D, and 3D scatterplots of 10 uniform draws illustrating the curse of dimensionality; plot points omitted.]

58
Homework

Due 4.25.19 - 19:30.

1 Chapter 2: Exercises 2, 4, 7, 9 from [JO13].


2 Chapter 3: Exercises 3, 4, 9, 14 from [JO13].

59
Classification Models
Linear Regression

Source: [JO13]
60
Linear Regression

Remark 4.1
Note that linear regression fails for classification problems. Consider Y coded as 0/1 and p = 1. In linear regression we estimate f̂ : R^p → R, meaning that we are mapping onto the whole real line, and not only onto {0, 1}. This means we can predict, e.g., 0.7, which is meaningless.

61
Linear Regression

Source: [JO13]

62
Simple Logistic Regression

Consider the logistic regression model for Y coded as 0/1 and p = 1:

    P(Y = 1 | X) = p(X),   (26)
    p(X) = g ∘ f(X),   f(X) = β₀ + β₁X,   g(x) = e^x / (1 + e^x),

where g is called the logistic function.

Remark 4.2
Note that β₁ is not the marginal contribution of X to P(Y = 1 | X), but the marginal contribution of X to the log-odds

    log( p(X) / (1 − p(X)) ) = β₀ + β₁X,   (27)

where 1 − p(X) = 1 − P(Y = 1 | X) = P(Y = 0 | X).

63
Estimation & Prediction

The estimation is done via maximization of a likelihood function

    (β̂₀, β̂₁) = argmax_{β₀,β₁} ℓ(β₀, β₁),   (28)

    ℓ(β₀, β₁) := { Π_{i: y_i = 1} p(x_i) } × { Π_{j: y_j = 0} (1 − p(x_j)) },   (29)

where p(x_i) = P(y_i = 1 | x_i); the log-likelihood log ℓ(β₀, β₁) is concave, so the maximization is well behaved.

Once the parameters are estimated, the prediction is computed as

    p̂(X) = e^{β̂₀ + β̂₁X} / (1 + e^{β̂₀ + β̂₁X}),

and the classification follows a rule of the form:

    ŷ_i = 1 if p̂(X) > H,  ŷ_i = 0 if p̂(X) ≤ H,

where H is a threshold value, e.g. H = 0.5.
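
A minimal R sketch of this estimation and classification rule via glm(); the data are simulated and the threshold H = 0.5 follows the example above:

set.seed(1)
x <- rnorm(200)
p <- exp(-1 + 2 * x) / (1 + exp(-1 + 2 * x))  # true p(X)
y <- rbinom(200, 1, p)

fit  <- glm(y ~ x, family = binomial)         # ML estimation of beta0, beta1
phat <- predict(fit, type = "response")       # p_hat(X)
yhat <- ifelse(phat > 0.5, 1, 0)              # classification rule with H = 0.5
mean(yhat != y)                               # training error rate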

64
Multiple Logistic Regression

Consider the logistic regression model for Y coded as 0/1 and p > 1:

    log( p(X) / (1 − p(X)) ) = β₀ + β₁X₁ + β₂X₂ + ··· + β_pX_p,   (30)

where

    p(X) = e^{β₀ + β₁X₁ + ··· + β_pX_p} / (1 + e^{β₀ + β₁X₁ + ··· + β_pX_p}).

Estimation and prediction: analogous to the simple logistic model case.
Logistic regression for more than 2 response classes is not used often. For these cases we consider linear discriminant analysis.

65
Bayesian Classifier

Consider the Bayes Theorem for K response classes:

    p_k(X) = π_k f_k(x) / Σ_{ℓ=1}^K π_ℓ f_ℓ(x),   (31)

where:

π_k = P(Y = k): prior probability that a randomly chosen observation comes from the k-th class.
f_k(x) = P(X = x | Y = k): density function of X for an observation that comes from the k-th class.
p_k(X) = P(Y = k | X = x): posterior probability that a randomly chosen observation comes from the k-th class.

Idea: π_k is easy, f_k(x) is difficult. With a reasonable f̂_k(x), we can approximate the Bayes classifier (the classifier with the smallest error rate).

66
Bayesian Classifier (example for p=1)

Let K = 2, with π₁ = π₂, and f_k(x) ∼ N(µ_k, σ), thus:

    f_k(x) = (1/(√(2π)σ)) exp{ −(x − µ_k)² / (2σ²) }.   (32)

As a consequence of (31), the posterior distribution follows

    p_k(X) = π_k (1/(√(2π)σ)) exp{ −(x − µ_k)²/(2σ²) } / Σ_{ℓ=1}^K π_ℓ (1/(√(2π)σ)) exp{ −(x − µ_ℓ)²/(2σ²) },   (33)

whose log, up to terms that do not depend on k, equals

    (µ_k/σ²) x − µ_k²/(2σ²) + log(π_k),

thus the Bayes classifier selects the class k that maximizes

    δ_k(x) = (µ_k/σ²) x − µ_k²/(2σ²) + log(π_k).   (34)

67
Bayesian Classifier (example for p=1)

Using (34), and since π₁ = π₂, class 1 is selected if

    δ₁(X) > δ₂(X)
    (µ₁/σ²) X − µ₁²/(2σ²) > (µ₂/σ²) X − µ₂²/(2σ²)
    X > (µ₁ + µ₂)/2   (assuming µ₁ > µ₂),

and class 2 otherwise.

68
Linear Discriminant Analysis

Let p = 1, and f_k(x) ∼ N(µ_k, σ), thus:

LDA uses estimators of σ, µ_k and π_k, and plugs them into (34) as

    δ̂_k(x) = (µ̂_k/σ̂²) x − µ̂_k²/(2σ̂²) + log(π̂_k).   (35)

In particular, the required sample estimates follow

    π̂_k = n_k / n,   (36)
    µ̂_k = (1/n_k) Σ_{i: y_i = k} x_i,   (37)
    σ̂² = (1/(n − K)) Σ_{k=1}^K Σ_{i: y_i = k} (x_i − µ̂_k)².   (38)

LDA is "linear" since (35) is a linear function of x.
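
A minimal R sketch of LDA, assuming the MASS package is available; the data frame and variable names are illustrative only:

library(MASS)
set.seed(1)
# two classes with different means and a common sd (the LDA setting)
x  <- c(rnorm(100, mean = -1), rnorm(100, mean = 1))
y  <- factor(rep(c(0, 1), each = 100))
df <- data.frame(x = x, y = y)

fit  <- lda(y ~ x, data = df)   # estimates pi_k, mu_k and the common variance
fit$prior                       # pi_hat_k
fit$means                       # mu_hat_k
yhat <- predict(fit)$class      # classification with the plug-in rule (35)
table(yhat, df$y)               # confusion table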

69
Linear Discriminant Analysis

Source: [JO13]

70
Linear Discriminant Analysis

Let p ≥ 1, and f_k(x) ∼ N(µ_k, Σ), thus:

    f_k(x) = (1 / ((2π)^{p/2} |Σ|^{1/2})) exp{ −(1/2)(x − µ_k)ᵀ Σ⁻¹ (x − µ_k) },   (39)

where the class means are µ_k and the common covariance is Σ. Analogously to the p = 1 case,

We are interested in selecting the k that maximizes

    δ_k(x) = xᵀ Σ⁻¹ µ_k − (1/2) µ_kᵀ Σ⁻¹ µ_k + log(π_k).   (40)

We face a decision boundary between classes k and ℓ given by

    xᵀ Σ⁻¹ µ_k − (1/2) µ_kᵀ Σ⁻¹ µ_k + log(π_k) = xᵀ Σ⁻¹ µ_ℓ − (1/2) µ_ℓᵀ Σ⁻¹ µ_ℓ + log(π_ℓ).
2 2

71
Linear Discriminant Analysis

Source: [JO13] 72
Assessment of Binary Classifiers

Confusion matrix:

                        Condition positive               Condition negative
Predicted positive      True positive                    False positive (Type I error)
Predicted negative      False negative (Type II error)   True negative

where:
True positive rate = (sum of true positives)/(sum of condition positives)
False positive rate = (sum of false positives)/(sum of condition negatives)

73
Assessment of Binary Classifiers

The confusion matrix:

Shows what kind of mistakes we are making, i.e. predicting 0 when the truth is 1, and vice versa.
Varies according to the selected threshold level. In some cases one type of error could be preferable to the other.

74
Assessment of Binary Classifiers

Source: [JO13]
75
Assessment of Binary Classifiers

To evaluate a model in terms of all possible threshold levels we use the ROC curve. A measure of performance of the model is the AUC ∈ [0, 1], i.e. the area under the curve.
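
A sketch of an ROC curve and its AUC, assuming the pROC package is installed (the package choice is an assumption, not part of the course material); data and variable names are illustrative:

library(pROC)   # assumed available; any ROC package would do
set.seed(1)
x <- rnorm(300)
y <- rbinom(300, 1, plogis(-0.5 + 1.5 * x))     # simulated binary response
fit  <- glm(y ~ x, family = binomial)
phat <- predict(fit, type = "response")

roc_obj <- roc(response = y, predictor = phat)  # ROC over all thresholds
auc(roc_obj)                                    # area under the curve
plot(roc_obj)                                   # ROC curve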

Source: [JO13]
76
Quadratic Discriminant Analysis

Same as LDA, but each class has its own covariance matrix.
Thus, we are interested in finding the k that maximizes

    δ_k(x) = −(1/2)(x − µ_k)ᵀ Σ_k⁻¹ (x − µ_k) − (1/2) log|Σ_k| + log(π_k)
           = −(1/2) xᵀ Σ_k⁻¹ x + µ_kᵀ Σ_k⁻¹ x − (1/2) µ_kᵀ Σ_k⁻¹ µ_k − (1/2) log|Σ_k| + log(π_k),   (41)

where the first term is a quadratic form and thus the decision boundaries have curvature.
Equation (41) means that we need to estimate Σ_k for each k.

77
Quadratic Discriminant Analysis

Remark 4.3
QDA is more flexible than LDA. Intuitively this means that when
compared to each other:

LDA: ↑ Bias, ↓ Variance
QDA: ↓ Bias, ↑ Variance

78
Linear Discriminant Analysis

Source: [JO13]
79
Comparison of Classification Methods

Logistic regression vs. LDA (K = 2, p = 1)

In LDA, note that since p₂(x) = 1 − p₁(x),

    log( p₁(x) / (1 − p₁(x)) ) = [log(π₁) − log(π₂) + (µ₂² − µ₁²)/(2σ²)] + [(µ₁ − µ₂)/σ²] x
                               = c₀ + c₁ x,

which is clearly linear in x, just as in (27).

Difference in fitting procedures: in logistic regression we use ML, and in LDA we use the plug-in estimators π̂_k, µ̂_k and σ̂.

80
Comparison of Classification Methods

KNN vs Logistic regression / LDA

KNN is completely non-parametric.


KNN does not deliver a specific form for the decision boundary, e.g.
linear or quadratic.
KNN does not give a list of coefficients, or statistics.

Remark 4.4
KNN is more flexible than QDA. Intuitively this means that when
compared to each other:

QDA: ↑ Bias, ↓ Variance
KNN: ↓ Bias, ↑ Variance

81
Resampling Methods
Overview

Problem: want to select the model with the best test error
performance, but usually do not have available test sets.
Idea: create artificial test sets by sampling.
Limitation: sampling a repeated number of times demands some computational power.
Sampling methods:
1 Cross validation: Validation set, LOOCV, and k-fold
2 Bootstrap

82
Cross Validation: Validation Set

Main idea: decompose

    data set = training set + validation set,

and fit the model using the training set (70% of obs.). Then compute the MSE using the validation set (30% of obs.). Select the model with the smallest MSE.
Drawbacks:
1 Test MSEs are too variable because samples can be very dissimilar.
2 Test MSEs are too high because we are only using part of the data.

83
Cross Validation: Validation Set

Source: [JO13]

84
Cross Validation: LOOCV

Main idea: decompose

    data set = training set + one validation point (x_i, y_i),

fit with the training set and evaluate MSE_i on the validation point. Do this in a loop, selecting each point in the data set as the validation point, and compute the performance measure

    CV(n) = (1/n) Σ_{i=1}^n MSE_i,   MSE_i = (y_i − ŷ_i^{(−i)})².   (42)

Advantages (wrt the validation set approach):
1 ↓ Variance: small sample variability.
2 ↓ Bias: uses almost all the data.
Disadvantage: n fits required. Can be very slow.

85
Cross Validation: LOOCV

Remark 5.1
Consider simple linear regression. We stress that ŷ_i^{(−i)} = β̂₀^{(−i)} + β̂₁^{(−i)} x_i in expression (42) is computed using

    β̂₀^{(−i)} = (Σ_{j≠i} y_j)/(n − 1) − β̂₁^{(−i)} (Σ_{j≠i} x_j)/(n − 1),
    β̂₁^{(−i)} = Σ_{j≠i} (x_j − x̄^{(−i)})(y_j − ȳ^{(−i)}) / Σ_{j≠i} (x_j − x̄^{(−i)})²,

where x̄^{(−i)} = (Σ_{j≠i} x_j)/(n − 1) and ȳ^{(−i)} = (Σ_{j≠i} y_j)/(n − 1).

A shortcut to compute the expression in (42) is

    CV(n) = (1/n) Σ_{i=1}^n ( (y_i − ŷ_i)/(1 − h_i) )²,   h_i = 1/n + (x_i − x̄)² / Σ_{j=1}^n (x_j − x̄)²,   (43)

which is much easier to evaluate.


86
Cross Validation: LOOCV

Source: [JO13]

87
Cross Validation: k-Fold

Main idea: decompose

    data set = training set + one validation "block",

fit with the training set and evaluate MSE_i on the validation block. Do this in a loop, selecting each of the k disjoint blocks (e.g. k = 5, 10 are common choices) as the validation block, and compute

    CV(k) = (1/k) Σ_{i=1}^k MSE_i,   MSE_i = (k/n) Σ_{j ∈ I_i} (y_j − ŷ_j^{(−I_i)})²,   (44)

where I_i is the collection of indices in block i.

Properties (opposite of LOOCV):
1 It is faster.
2 It is not better: ↑ variance, ↑ bias.
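
A minimal R sketch of k-fold CV, assuming the boot package is available; the data are simulated and k = 10 is one of the common choices mentioned above:

library(boot)
set.seed(1)
df <- data.frame(x = runif(200))
df$y <- sin(2 * pi * df$x) + rnorm(200, sd = 0.3)

cv_err <- sapply(1:5, function(d) {
  fit <- glm(y ~ poly(x, d), data = df)       # gaussian glm = least squares fit
  cv.glm(df, fit, K = 10)$delta[1]            # 10-fold CV estimate of the test MSE
})
which.min(cv_err)                             # polynomial degree selected by CV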

88
Cross Validation: k-Fold

Source: [JO13]

89
Cross-Validation on Classification Problems

Remark 5.2
Instead of the MSE, we use the ER. We write it once again here for clarity, for the case of n observations:

    ER = (1/n) Σ_{i=1}^n I(y_i ≠ ŷ_i),

thus, for example, the LOOCV for classification problems follows

    CV(n) = (1/n) Σ_{i=1}^n ER_i,   ER_i = I(y_i ≠ ŷ_i^{(−i)}).

90
Bootstrap

Idea: sample the data with replacement to obtain a data set of the same size. Used to quantify the uncertainty of a given estimator.

Example 5.1
Consider the problem of selecting the optimal investment allocation α* when choosing between assets X and Y, in such a way that the variance of the portfolio return αX + (1 − α)Y is minimized. Assume the sample estimates σ̂²_X, σ̂²_Y, σ̂_XY are available.
Solution. It is easy to see that

    α̂* = argmin_α Var(αX + (1 − α)Y)
        = (σ̂²_Y − σ̂_XY) / (σ̂²_X + σ̂²_Y − 2σ̂_XY).

To quantify the uncertainty of this estimator, we can simply use the bootstrap.
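
A minimal R sketch of the bootstrap for Example 5.1, assuming the boot package is available; the returns in df are simulated for illustration:

library(boot)
set.seed(1)
df <- data.frame(X = rnorm(100), Y = rnorm(100, sd = 1.5))   # simulated returns

alpha_fn <- function(data, index) {
  X <- data$X[index]; Y <- data$Y[index]
  (var(Y) - cov(X, Y)) / (var(X) + var(Y) - 2 * cov(X, Y))   # alpha_hat on a resample
}

boot(df, alpha_fn, R = 1000)   # bootstrap SE of alpha_hat over 1000 resamples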
91
Bootstrap

Source: [JO13] 92
Homework

Due 5.2.19 - 19:30.

1 Chapter 4: Exercises 6, 10 from [JO13].


2 Chapter 5: Exercises 3, 8 from [JO13].

93
Model Selection
Subset Selection

Recall:

    y_i = β₀ + β₁x_{i,1} + β₂x_{i,2} + ··· + β_px_{i,p} + ε_i,   i = 1, 2, ..., n,

where p < n, and Cov(ε_i, ε_j) = 0, i ≠ j.

Problem: if p ↑, the estimate f̂(x₀) has a large variance.
Idea: do variable selection systematically, s.t. p ↓.

Remark 6.1
Note that the total number of possible models, summed over all model sizes, is

    Σ_{i=0}^p (p choose i) = (p choose 0) + (p choose 1) + ··· + (p choose p) = 2^p,

which makes it unfeasible to test and compare all models in practice.

94
Best Subset Selection

Algorithm 1 (Best Subset Selection)

1 Denote the null model as M₀.
2 For k = 1, ..., p do
   1 Fit all (p choose k) models containing k predictors.
   2 Pick the model with the smallest RSS (or largest R²) and call it M_k.
3 Choose the optimal model among M₀, ..., M_p.

Remark 6.2 (Best Subset Selection in Logistic Regression)

Instead of using RSS in the second sub-step of step 2, consider the "deviance", defined as

    −2 × L(β̂₀, ..., β̂_p),   (45)

where L(·) is the corresponding log-likelihood function.
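
A minimal R sketch of Algorithm 1 using the leaps package (a package choice assumed here, not prescribed by the course); data are simulated:

library(leaps)
set.seed(1)
x <- matrix(rnorm(100 * 8), ncol = 8)
y <- 1 + 2 * x[, 1] - 3 * x[, 4] + rnorm(100)
df <- data.frame(y = y, x)

fit  <- regsubsets(y ~ ., data = df, nvmax = 8)   # best model M_k for each size k
summ <- summary(fit)
summ$rss                          # smallest RSS within each model size
which.min(summ$bic)               # one way to choose among M_0, ..., M_p (step 3)
coef(fit, which.min(summ$bic))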

95
Forward Stepwise Selection

Algorithm 2 (Forward Stepwise Selection)

1 Denote the null model as M₀.
2 For k = 0, ..., p − 1 do
   1 Fit all p − k models that add one predictor to M_k.
   2 Pick the model with the smallest RSS (or largest R²) and call it M_{k+1}.
3 Choose the optimal model among M₀, ..., M_p.

Remark 6.3
Note that the number of models being tested in Algorithm 2 is

    1 + p + (p − 1) + ··· + 1 = 1 + p(p + 1)/2 < 2^p,

i.e. much smaller than that of Algorithm 1.

96
Backward Stepwise Selection

Algorithm 3 (Backward Stepwise Selection)

1 Denote the full model as M_p.
2 For k = p, p − 1, ..., 1 do
   1 Fit all k models that remove one predictor from M_k.
   2 Pick the model with the smallest RSS (or largest R²) and call it M_{k−1}.
3 Choose the optimal model among M₀, ..., M_p.

Remark 6.4
1 The number of models tested in Algorithms 3 and 2 is the same.
2 The best models resulting from Algorithms 1, 2 and 3 need not coincide.

97
Choosing the Optimal Model

Two ways:

1 Indirectly estimate the test error by making an adjustment to the


training error.
1 Mallow’s Cp
2 AIC
3 BIC
4 Adjusted R 2
2 Directly estimate the test error by
1 Validation set approach
2 Cross validation approach

98
Cp , AIC, BIC, Adjusted R 2

Main idea: provide selection criteria after controlling for the model size d.

1 Mallow's Cp (the smaller, the better)

    Cp = (1/n) × (RSS + 2dσ̂²)   (46)

2 Akaike Information Criterion (the smaller, the better)

    AIC = (1/(nσ̂²)) × (RSS + 2dσ̂²)   (47)

3 Bayesian Information Criterion (the smaller, the better)

    BIC = (1/(nσ̂²)) × (RSS + log(n)dσ̂²)   (48)

4 Adjusted R² (the larger, the better)

    Adj. R² = 1 − [RSS/(n − d − 1)] / [TSS/(n − 1)]   (49)

99
Cp , AIC, BIC, Adjusted R 2

Remark 6.5

1 The computation of σ̂² in all the alternatives presented before is done for the full model, that is, considering all p predictors.
2 The direct estimation of the test error is less common for variable selection. It can be done if σ̂² is hard to estimate, or if d (the degrees of freedom of the model) is not easily computed.

100
Regularization
Regularization Methods

Idea: Don’t do variable selection. Instead run the model with all
variables, but shrink the betas so that:
§
Ò Biasrfˆpx0 qs, đ Varrfˆpx0 qs,
§

§
where the big arrow đ means that we expect the benefits from reducing
§
the variance will compensate for the increase in bias.

Alternatives:

1 Ridge Regression
2 LASSO

101
Ridge Regression

RR minimizes a loss function given by:

    Σ_{i=1}^n ( y_i − β₀ − Σ_{j=1}^p β_j x_{i,j} )² + λ Σ_{k=1}^p β_k²,   λ > 0,   (50)

where the first term is the least squares loss and the second the RR penalty, and:

The shrinkage penalty reduces the value of the estimated β̂_j's.
For a tuning parameter λ → 0 the penalty has no effect, and for λ → ∞ the penalty has all the control.

102
Ridge Regression

Remark 7.1
Note that each β̂_j depends on the scaling of all predictors x₁, ..., x_p. The usual recommendation is then to transform the variables x_j to

    x̃_{i,j} = x_{i,j} / sqrt( Σ_{i=1}^n (x_{i,j} − x̄_j)² ).   (51)

Remark 7.2
Note that RR performs better than LS if the gain from variance reduction surpasses the loss from the bias increase. This is the case when the LS variance is particularly high, which happens when p → n.

103
LASSO (Least Absolute Shrinkage and Selection Operator)

LASSO minimizes a loss function given by

    Σ_{i=1}^n ( y_i − β₀ − Σ_{j=1}^p β_j x_{i,j} )² + λ Σ_{k=1}^p |β_k|,   λ > 0,   (52)

where the first term is the least squares loss and the second the LASSO penalty.

This is very similar to RR, but LASSO

Penalizes a different norm of β.
Can set some β̂_j's to zero, thus variable selection is automatic.

104
LASSO

Remark 7.3
Recall that the ℓ_q-norm of a vector x ∈ R^n is defined as

    ||x||_q = ( Σ_{i=1}^n |x_i|^q )^{1/q},   (53)

thus RR and LASSO both penalize a norm of the vector β ∈ R^p, but RR penalizes its ℓ₂ norm while LASSO penalizes its ℓ₁ norm.

105
LASSO

Unit spheres in R²:

[Figure: unit balls in R² for the ℓ2, ℓ1, ℓ∞ and general ℓp norms; plots omitted.]

106
LASSO

Remark 7.4
Note that finding β̂ in LASSO can be written equivalently as

    β̂₁, ..., β̂_p = argmin_{β₁,...,β_p} Σ_{i=1}^n ( y_i − β₀ − Σ_{j=1}^p β_j x_{i,j} )² + λ Σ_{k=1}^p |β_k|,

or as

    β̂₁, ..., β̂_p = argmin_{β₁,...,β_p} { Σ_{i=1}^n ( y_i − β₀ − Σ_{j=1}^p β_j x_{i,j} )² :  Σ_{k=1}^p |β_k| ≤ s }.

Since the RSS is a convex function, its contour sets in R^p are ellipsoids; the constraint region creates the possibility of a corner solution, which means that some β̂_j's are zero.
107
Source: [JO13]

108
Selecting the Tuning Parameters

In RR and LASSO, the tuning parameter λ has been considered as given. In practice this parameter needs to be estimated from the data by cross-validation, e.g. LOOCV or k-fold CV; a popular choice is LOOCV. A sketch is given below.
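
A minimal R sketch of RR and LASSO with λ chosen by cross-validation, assuming the glmnet package is available (alpha = 0 gives ridge, alpha = 1 gives LASSO); data are simulated:

library(glmnet)
set.seed(1)
x <- matrix(rnorm(100 * 20), ncol = 20)
y <- 1 + 2 * x[, 1] - 3 * x[, 2] + rnorm(100)

ridge <- glmnet(x, y, alpha = 0)          # RR path over a grid of lambdas
lasso <- glmnet(x, y, alpha = 1)          # LASSO path

cv_lasso <- cv.glmnet(x, y, alpha = 1)    # k-fold CV (10-fold by default)
cv_lasso$lambda.min                       # lambda with the smallest CV error
coef(cv_lasso, s = "lambda.min")          # note: some coefficients are exactly zero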

109
Dimension Reduction
Overview

Idea: to reduce the variance, don't use X₁, ..., X_p directly. Instead use Z₁, ..., Z_M, M < p, such that Z_m ⊥ Z_r for m ≠ r, where

    Z_m ∈ span{X₁, ..., X_p},   m = 1, ..., M.

Here we discuss:

1 Principal Component Analysis (PCA)


2 Partial Least Squares (PLS)

110
Overview

Remark 8.1
Note that the change of variables from {X_j}_{j=1}^p to {Z_m}_{m=1}^M can in itself reduce the variance of the overall estimator f̂(x₀). In particular, consider

    y_i = θ₀ + Σ_{m=1}^M θ_m z_{i,m} + ε_i,   θ₀, ..., θ_M ∈ R,
    z_{i,m} = Σ_{j=1}^p φ_{j,m} x_{i,j},   φ_{1,m}, ..., φ_{p,m} ∈ R.

Rearranging the terms we arrive at

    y_i = θ₀ + Σ_{j=1}^p ( Σ_{m=1}^M θ_m φ_{j,m} ) x_{i,j} + ε_i,

where the term in parentheses plays the role of β_j, i.e. the β_j's are constrained and thus vary less.


111
Principal Component Analysis

First Principal Component:

    φ̂_{1,1}, ..., φ̂_{p,1} = argmax_{φ_{1,1},...,φ_{p,1}} { Var[ Σ_{j=1}^p φ_{j,1}(x_j − x̄_j) ] :  Σ_{j=1}^p φ_{j,1}² = 1 },

    z₁ = Σ_{j=1}^p φ̂_{j,1}(x_j − x̄_j),

where φ̂_{1,1}, ..., φ̂_{p,1} are the "loadings" of z₁, and the z_{i,1} are its "scores".

Remark 8.2
The first principal component is the line that minimizes the sum of the squared perpendicular distances between each point and the line.

112
Principal Component Analysis

Second Principal Component:

    φ̂_{1,2}, ..., φ̂_{p,2} = argmax_{φ_{1,2},...,φ_{p,2}} { Var[ Σ_{j=1}^p φ_{j,2}(x_j − x̄_j) ] :  Cov(z₂, z₁) = 0,  Σ_{j=1}^p φ_{j,2}² = 1 },

    z₂ = Σ_{j=1}^p φ̂_{j,2}(x_j − x̄_j),

and we can proceed in the same manner up to the p-th PC.

113
Principal Component Regression

Consider LS using the PCs:

    y_i = θ₀ + Σ_{m=1}^M θ_m z_{i,m} + ε_i,

where ε_i is the noise under the usual assumptions, and M can be selected via cross-validation.

Remark 8.3
PCR assumes the directions in which X₁, ..., X_p show the most variation are the directions associated with Y.

Remark 8.4
Note that PCR uses all features in each component, thus it is a dimension reduction method, but not a feature selection method.

114
Partial Least Squares

The computation is initialized by setting e⁽¹⁾ = y.

For k = 1, ..., M do:

1 Regress e⁽ᵏ⁾ on x_j, j = 1, ..., p, and use the coefficients φ̂_{j,k} to compute

    z_k = Σ_{j=1}^p φ̂_{j,k} x_j.

2 Regress e⁽ᵏ⁾ on z_k and save the residuals e⁽ᵏ⁺¹⁾.

The selection of M can also be done via cross-validation; see the sketch below.

Remark 8.5
PLS is a supervised version of PCR, in that the information in Y is considered when computing the z_k.
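
A minimal R sketch of PCR and PLS, assuming the pls package is available; M is chosen by cross-validation as described above and the data are simulated:

library(pls)
set.seed(1)
x <- matrix(rnorm(100 * 10), ncol = 10)
y <- drop(1 + x %*% rnorm(10, sd = 0.5) + rnorm(100))
df <- data.frame(y = y, x)

pcr_fit <- pcr(y ~ ., data = df, scale = TRUE, validation = "CV")   # unsupervised components
pls_fit <- plsr(y ~ ., data = df, scale = TRUE, validation = "CV")  # supervised components
validationplot(pcr_fit, val.type = "MSEP")   # pick M with the smallest CV error
validationplot(pls_fit, val.type = "MSEP")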

115
Homework

Due 5.9.19 - 19:30.

1 Chapter 6: Exercises 2, 3, 8 (a,b,c,d), 11(a,b,c) from [JO13].

116
Lab: Forecasting Soccer Matches
Machine Learning, Big Data and other Fancy Words

Worldwide hype

Effective forecast and classification methods.


Tech: Google, Spotify, Waze, etc.
Finance: high frequency asset pricing, insurance policies

and in Peru...

RIMAC data science challenge¹: predict car insurance default for good
customers. The three best forecasts get a data scientist job!
Fintech: LatinFintech, Bitinka, Culqui, Kambista, etc.

1 Hackathon organized by Hackspace to recruit data scientists.


117
Statistical Oracles in Soccer

Popular Events

Google² → FIFA 2014 World Cup → 14/16 (≈ 88%) accuracy
Goldman Sachs³ → Euro 2016 → got bookmakers' odds
Mister Chip⁴ → Argentina–Peru → 53% chance of going to Russia 2018.

Google⁵ says it can beat "Paul the Octopus".

Beating "Paul the Octopus" is not trivial. Actually, it means getting an accuracy above 11/13 (≈ 85%).

2 https://github.com/GoogleCloudPlatform/ipython-soccer-predictions
3 http://www.goldmansachs.com/our-thinking/macroeconomic-insights/

euro-cup-2016/
4 From twitter account: @2010misterchip.
5 Claim made by J. Tigani (Google I/O) at a big data conference (Strata) in 2014.

118
Objective of this Application

Replicate Google’s idea:

1 Got the code → GitHub repository
2 Bought some data → OPTA⁶
3 Got the momentum → Peru vs. Argentina

Forecast Argentina–Peru:

1 Update data: Copa América Centenario & Qualifiers (so far).


2 Tweak parameters.
3 Larger effort: Survey of classification methods in finance.

6 www.optasportspro.com

119
Data Structure

Three leagues:

1 United States: Major League Soccer.


2 England: Premier League.
3 Spain: La Liga.

Features × 2:

1 Correct passes
2 Incorrect passes
3 Ratio correct/incorrect passes
4 Good passes at 80% top field
5 Bad passes at 70% top field
6 Shots
7 Corners
8 Cards
9 Fouls
10 Expected goals⁷

7 Variable generated by OPTA.


120
Idea:

Assess how the teams arrive at the game, i.e. build a matrix with teams
on one side and features on the other side⁸.

Details:

We keep only matches that resulted in a win/loss.


We drop matches without complete information: new teams.
We split the dataset into a test set (30%) and a training set (70%).

8 Features are computed as moving averages over the last 6 games.


121
Lasso Regularization

We aim to maximize
L(\beta; \lambda) := \log \left\{ \prod_{i=1}^{n} p_i(\beta)^{y_i} \big( 1 - p_i(\beta) \big)^{1 - y_i} \right\} - \lambda \|\beta\|_1,

where \|x\|_1 = \sum_{i=1}^{n} |x_i|, x ∈ R^n, and λ is the regularization parameter.

Details:

Regularized logit regression problem with Manhattan norm.


β is the coefficient vector corresponding to the features.
Regularization avoids overfitting; λ is selected via cross-validation.
Automatic feature selection, i.e. some coefficients are zero.
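A minimal sketch of this fit with the glmnet package (alpha = 1 gives the lasso penalty); the feature matrix x and the 0/1 win indicator y are simulated here only so the code is runnable:

  library(glmnet)
  set.seed(1)
  x  <- matrix(rnorm(200 * 10), ncol = 10)               # 200 matches, 10 features (hypothetical)
  y  <- rbinom(200, 1, plogis(x[, 1] - 0.5 * x[, 2]))    # simulated win indicator
  cv <- cv.glmnet(x, y, family = "binomial", alpha = 1)  # lambda chosen by cross-validation
  coef(cv, s = "lambda.min")                             # some coefficients are exactly zero
  pred <- predict(cv, newx = x, s = "lambda.min", type = "class")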

122
Classification Accuracy:
The measure of accuracy of the model is given by
ACC = \frac{TP + TN}{TP + TN + FP + FN},
where TP: true positive, TN: true negative, FP: false positive, and FN:
false negative.
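A minimal sketch of the computation in R, for hypothetical prediction and truth vectors coded as 0/1:

  truth <- c(1, 0, 1, 1, 0, 0, 1, 0)                 # hypothetical labels
  pred  <- c(1, 0, 0, 1, 0, 1, 1, 0)                 # hypothetical predictions
  tab   <- table(Predicted = pred, True = truth)     # confusion matrix
  acc   <- sum(diag(tab)) / sum(tab)                 # (TP + TN) / (TP + TN + FP + FN)
  acc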

123
Results: 2014 FIFA World Cup

               Quarter Finals   Matches (no draws)   All Matches

Reported       88%              77%                  –
Replicated     X                X                    53%

Highlights:

Google reported the quarter final’s accuracy (only) at Strata.


and the accuracy excluding draws (only) at its GitHub repository.
The overall accuracy is not reported, but it is very relevant.

124
Results: 2018 FIFA World Cup (Quals)

Highlights:

The accuracy excluding draws is 85%.


The overall accuracy is 68%.

125
Details: Last 10 Games:

Team A   Team B   P(A > B)   Expected   True


Uruguay Argentina 66% Uruguay draw
Peru Bolivia 72% Peru Peru
Ecuador Brasil 7% Brasil Brasil
Paraguay Chile 13% Chile Paraguay
Venezuela Colombia 31% Colombia draw
Venezuela Argentina 31% Argentina draw
Colombia Brasil 46% Brasil draw
Chile Bolivia 47% Bolivia Bolivia
Peru Ecuador 58% Peru Peru
Uruguay Paraguay 53% Uruguay Uruguay

126
Details: Coming 10 Games:

Team A   Team B   P(A > B)   Expected


Peru Argentina 39% Argentina
Brasil Bolivia 78% Brasil
Ecuador Chile 30% Chile
Paraguay Colombia 8% Colombia
Venezuela Uruguay 63% Venezuela
Ecuador Argentina 63% Ecuador
Uruguay Bolivia 63% Uruguay
Chile Brasil 37% Brasil
Peru Colombia 63% Peru
Venezuela Paraguay 37% Paraguay

But that didn’t happen now. Did it?

127
Concluding Remarks

On Google’s exercise:

Transparency is key. Lots of public GitHub repositories.


Google’s statement is not accurate.
Results better than pure luck.

On Argentina–Perú:

We could forecast Ecuador–Peru.


We obtain a 39% probability of Peru beating Argentina.
Put your money where your mouth is?

128
Simple Non-Linear Methods
Simple Non-Linear Methods

Idea: Improve predictive power by relaxing the linearity assumption in

y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \qquad \operatorname{Cov}(\epsilon_i, \epsilon_j) = \sigma^2 \mathbf{1}_{\{i=j\}}.

Examples:

1 Polynomial regression
2 Step Functions
3 Regression Splines
4 Smoothing Splines
5 Local Regression

129
Polynomial Regression

Model:
y_i = \beta_0 + \sum_{j=1}^{d} \beta_j x_i^j + \epsilon_i, \qquad \operatorname{Cov}(\epsilon_i, \epsilon_j) = \sigma^2 \mathbf{1}_{\{i=j\}}. \qquad (54)

Estimation: direct least squares


Tuning parameter d: selected via CV / ANOVA testing
Problem: boundary effects if d > 4.

Remark 10.1
The boundary effect creates large uncertainty at the borders. Note that

\operatorname{Var}[\hat f(x_0)] = \ell_0^\top \hat C \ell_0, \qquad \hat C_{i,j} = \operatorname{Cov}(\hat\beta_i, \hat\beta_j), \qquad \ell_0 = (1, x_0, \ldots, x_0^d)^\top,

where x_0 is an observation in the test set.
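A minimal sketch in R using the Wage data from the ISLR package (assumed installed), as in [JO13]; the widening confidence band near the borders illustrates Remark 10.1:

  library(ISLR)
  fit  <- lm(wage ~ poly(age, 4), data = Wage)      # degree-4 orthogonal polynomial
  grid <- data.frame(age = seq(min(Wage$age), max(Wage$age), length.out = 100))
  pr   <- predict(fit, newdata = grid, se.fit = TRUE)
  plot(Wage$age, Wage$wage, col = "grey")
  lines(grid$age, pr$fit, lwd = 2)
  lines(grid$age, pr$fit + 2 * pr$se.fit, lty = 2)  # approximate 95% band
  lines(grid$age, pr$fit - 2 * pr$se.fit, lty = 2)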

130
Polynomial Regression

Source: [JO13]
131
Step Functions

Model:
y_i = \sum_{k=0}^{K} \beta_k C_k(x_i) + \epsilon_i, \qquad \operatorname{Cov}(\epsilon_i, \epsilon_j) = \sigma^2 \mathbf{1}_{\{i=j\}}, \qquad (55)

where C_K(x_i) = I_{\{c_K \le x_i\}} at k = K, and C_k(x_i) = I_{\{c_k \le x_i < c_{k+1}\}} otherwise.

Estimation: direct least squares


Tuning parameter K : selected via cross validation
Problem: location of ck ’s is important, and not easy to determine.
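A minimal sketch with cut(), again on the Wage data (ISLR package assumed):

  library(ISLR)
  fit <- lm(wage ~ cut(age, breaks = 4), data = Wage)  # 4 intervals -> piecewise-constant fit
  coef(summary(fit))                                   # one coefficient per interval (plus intercept)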

132
Step Functions

Source: [JO13]
133
Regression Splines

Model:
q
ÿ K
ÿ
y i “ β0 ` βk xik ` βk`q bpxi , εk q ` i , Covpi , j q “ σ 2 1ti“ju , (56)
k“1 k“1

where
#
pxi ´ εqq if xi ą ε
bpxi , εq “
0 if otherwise

Estimation: direct least squares


Tuning parameters K and q: K can be selected via CV; q = 3 is usual.
Problem 1: the location of the ε_k's is important, and not easy to determine.
Problem 2: boundary effects if q > 4.
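A minimal sketch of a cubic regression spline with the splines package, placing the knots ε_k at the quartiles of age (ISLR's Wage data assumed):

  library(ISLR); library(splines)
  knots <- quantile(Wage$age, probs = c(0.25, 0.5, 0.75))
  fit   <- lm(wage ~ bs(age, knots = knots, degree = 3), data = Wage)
  length(coef(fit))   # K + q + 1 = 7 coefficients (including the intercept)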

134
Regression Splines

Source: [JO13]
135
Regression Splines + Boundary Conditions = Natural splines

Remark 10.2
The boundary problems of regression splines can be solved using
additional boundary conditions that force the fit to be linear at the
boundaries. Regression splines for which such conditions hold are called
“natural splines”.
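A minimal sketch of the natural-spline counterpart with ns() from the splines package; the fit is forced to be linear beyond the boundary knots:

  library(ISLR); library(splines)
  fit_ns <- lm(wage ~ ns(age, df = 4), data = Wage)  # ns() places knots automatically for a given df
  summary(fit_ns)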

136
Regression Splines + Boundary Conditions = Natural splines

Source: [JO13]

137
Regression Splines + Boundary Conditions = Natural splines

Source: [JO13]
138
Regression Splines + Boundary Conditions = Natural splines

Source: [JO13]

139
Smoothing Splines

Idea: Set K = n and q = 3, and add a roughness penalty λ.

The optimal smoothing spline g(·) is found as

\hat g = \arg\min_{g} \Bigg\{ \underbrace{\sum_{i=1}^{n} \big( y_i - g(x_i) \big)^2}_{\text{Loss}} + \underbrace{\lambda \int g''(t)^2 \, dt}_{\text{Penalty term}} \Bigg\}, \qquad \lambda > 0, \qquad (57)

where g(t) exactly projects the data using a spline basis.

Estimation: penalized least squares


Tuning parameter λ: estimated using CV.
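A minimal sketch with base R's smooth.spline(); the data are simulated here purely for illustration:

  set.seed(1)
  x   <- sort(runif(200, 0, 10))
  y   <- sin(x) + rnorm(200, sd = 0.3)          # hypothetical noisy curve
  fit <- smooth.spline(x, y, cv = TRUE)         # lambda chosen by leave-one-out CV (cf. Remark 10.5 below)
  fit$df                                        # effective degrees of freedom tr{S_lambda}
  fit$lambda                                    # selected roughness penalty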

140
Smoothing Splines

Remark 10.3
Note that in (57):

λ → 0 leads to interpolating the data, i.e. df_{λ=0} = n.
λ → ∞ leads to simple least squares, i.e. df_{λ=∞} = 2. In general, it
can be shown that

df_\lambda = \operatorname{tr}\{S_\lambda\}, \qquad \text{where} \quad \hat g_\lambda = S_\lambda y. \qquad (58)

As λ ↑, the bias ↑ and the variance ↓.

Remark 10.4
It can be shown that given q “ 3, the function g pxq that minimizes (57)
is a natural cubic spline. Hence, cubic smoothing splines do not have
boundary effects.

141
Smoothing Splines

Remark 10.5
It turns out that the computation of LOOCV is particularly fast for
smoothing splines. In fact
RSS_{CV}(\lambda) = \sum_{i=1}^{n} \Big( y_i - \hat g_\lambda^{(-i)}(x_i) \Big)^2 = \sum_{i=1}^{n} \left[ \frac{y_i - \hat g_\lambda(x_i)}{1 - [S_\lambda]_{i,i}} \right]^2, \qquad (59)

is a quantity that can be minimized with respect to λ.

142
Smoothing Splines

Source: [JO13]
143
Local Linear Regression

Algorithm 4 (Local Regression at X = x_0)

1 Gather the fraction s = k/n of training points whose x_i are closest to x_0.
2 Assign a weight K_{i0} = K(x_i, x_0) to each point in this neighborhood, so that
the point furthest from x_0 has weight zero and the closest has the
highest weight. All but these k nearest neighbors get weight zero.
3 Fit a weighted least squares regression of the y_i on the x_i using the
aforementioned weights, i.e. find β̂_0 and β̂_1 such that

(\hat\beta_0, \hat\beta_1) = \arg\min_{\beta_0, \beta_1} \sum_{i=1}^{n} K_{i0} \, (y_i - \beta_0 - \beta_1 x_i)^2.

4 The fitted value at x_0 is given by \hat f(x_0) = \hat\beta_0 + \hat\beta_1 x_0.

144
Local Linear Regression

Estimation: least squares


Tuning parameter s: selected via CV.
Problem: a weighting (kernel) function K must be chosen.
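A minimal sketch with base R's loess(), where span plays the role of s = k/n and degree = 1 gives local linear fits (simulated data again):

  set.seed(1)
  x   <- sort(runif(200, 0, 10))
  y   <- sin(x) + rnorm(200, sd = 0.3)
  fit <- loess(y ~ x, span = 0.3, degree = 1)   # degree = 1 -> local linear regression
  plot(x, y, col = "grey")
  lines(x, predict(fit), lwd = 2)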

145
Local Linear Regression

Source: [JO13]
146
Local Linear Regression

Source: [JO13]

147
GAMs
Generalized Additive Models

It is an extension to the case of p regressors

y_i = \beta_0 + \beta_1 x_{i,1} + \cdots + \beta_p x_{i,p} + \epsilon_i, \qquad \operatorname{Cov}(\epsilon_i, \epsilon_j) = \sigma^2 \mathbf{1}_{\{i=j\}}.

It can be applied to discrete or continuous responses, as

y_i = \beta_0 + \sum_{j=1}^{p} f_j(x_{i,j}) + \epsilon_i,

where the contribution of each x_j enters additively via f_j(x_j).
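A minimal sketch with the mgcv package (assumed installed): smooth terms s(·) for the continuous regressors and a linear term for the factor, on ISLR's Wage data:

  library(ISLR); library(mgcv)
  fit <- gam(wage ~ s(age) + s(year, k = 5) + education, data = Wage)
  summary(fit)          # significance of each additive component
  plot(fit, pages = 1)  # individual effect of each x_j on y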

148
Generalized Additive Models

Source: [JO13]

149
Advantages and Disadvantages of GAMs

Advantages:

Can fit a non-linear f_j(·) to each x_j.
More accurate predictions.
Can look at the individual effect of each x_j on y.

Disadvantages:

The model is additive only.


If interactions exist, they should be explicitly added.

150
Local Linear Regression

Source: [JO13]

151
Homework

Due 5.23.19 - 19:30.

1 Chapter 7: Exercises 2, 3, 6, 8 from [JO13].

152
Tree Based Methods
Regression Trees

Idea: Segment predictor space into rectangular regions and predict based
on the mean/mode of the response within the region.

Called “trees” because the set of splitting rules can be summarized


in a “tree”.
Typically useful for interpretation, but not as accurate as other
tree-based methods: bagging, random forests, boosting.
We start with the continuous-response case (regression trees), and
then consider the discrete-response case (classification trees).

153
Regression Trees

Source: [JO13]
154
Binary Splitting

Idea: given a region X of the feature space, select a variable j and
a cut-off point s to make the split

R_1(j, s) = \{X \mid X_j \le s\} \quad \text{and} \quad R_2(j, s) = \{X \mid X_j > s\}. \qquad (60)

The selection of j and s corresponds to

(\hat j, \hat s) = \arg\min_{j, s} \left\{ \sum_{i : x_i \in R_1(j,s)} \big( y_i - \hat y_{R_1(j,s)} \big)^2 + \sum_{i : x_i \in R_2(j,s)} \big( y_i - \hat y_{R_2(j,s)} \big)^2 \right\}, \qquad (61)

which can be applied iteratively to the resulting R_1 and R_2 regions,
until some stopping criterion is reached, e.g. # obs ≥ 5.

155
Binary Splitting

Source: [JO13]

156
Complexity Cost & Tree Pruning

To avoid the overfitting of large trees, grow a large tree T_0, and
then prune it.
Given α, one can select the optimal tree T ⊂ T_0 using the cost
complexity criterion

C_\alpha(T) = \underbrace{\sum_{m=1}^{|T|} N_m Q_m(T)}_{\text{Weighted Impurity Measure}} + \underbrace{\alpha |T|}_{\text{Complexity Penalty}}, \qquad (62)

where |T| denotes the number of regions in a given subtree, and

Q_m(T) = \frac{1}{N_m} \sum_{x_i \in R_m} (y_i - \hat c_m)^2, \qquad \hat c_m = \frac{1}{N_m} \sum_{x_i \in R_m} y_i

is the so-called impurity cost, for N_m = \#\{x_i \in R_m\}.

157
Building a Regression Tree

Algorithm 5
1 Use binary splitting to grow a large tree T_0 on the training data,
stopping when each terminal node has fewer than a minimum number of
observations.
2 Apply cost complexity pruning to the large tree and obtain a
sequence of best subtrees T ⊂ T_0 as a function of α.
3 Use k-fold cross-validation and pick α̂ = arg min_α of the test MSE.
4 Return the subtree from step 2 that corresponds to α̂.
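A minimal sketch of Algorithm 5 with the rpart package; rpart's complexity parameter cp plays the role of α, and the cross-validated error is stored in the cp table (ISLR's Hitters data assumed, as in [JO13]):

  library(ISLR); library(rpart)
  set.seed(1)
  dat <- na.omit(Hitters)
  dat$Salary <- log(dat$Salary)
  fit <- rpart(Salary ~ ., data = dat, method = "anova",
               control = rpart.control(cp = 0.001, minsplit = 5))   # large tree T0
  printcp(fit)                                                      # CV error (xerror) per cp value
  best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
  pruned  <- prune(fit, cp = best_cp)                               # subtree corresponding to alpha-hat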

158
Classification Trees

Consider a discrete response y_i ∈ {1, 2, ..., K} and denote by

\hat p_{mk} = \frac{1}{N_m} \sum_{i : x_i \in R_m} I_{\{y_i = k\}}

the proportion of class-k observations in node m. Thus, the classification
k for region R_m follows k(m) = arg max_k \hat p_{mk}.
Measures of node impurity for classification problems:
1 Classification error rate: E_m = 1 - max_k(\hat p_{mk}).
2 Gini index: G = \sum_{k=1}^{K} \hat p_{mk} (1 - \hat p_{mk}), which measures the variance
across the K classes.
3 Entropy: D = -\sum_{k=1}^{K} \hat p_{mk} \log(\hat p_{mk}), with a similar interpretation to
the Gini index.
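A minimal sketch of the three impurity measures in R, for a hypothetical vector of class proportions in a node:

  impurity <- function(p) {
    p <- p[p > 0]                               # drop empty classes to avoid log(0)
    c(class_error = 1 - max(p),
      gini        = sum(p * (1 - p)),
      entropy     = -sum(p * log(p)))
  }
  impurity(c(0.7, 0.2, 0.1))                    # e.g. a node with three classes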

159
Trees vs. Linear Models

Linear regression assumes a model of the form

f(X) = \beta_0 + \sum_{j=1}^{p} X_j \beta_j.

Regression trees assume a model of the form

f(X) = \beta_0 + \sum_{m=1}^{M} c_m \cdot \mathbf{1}_{\{X \in R_m\}}.

160
Trees vs. Linear Models

Source: [JO13]
161
Advantages and Disadvantages of Trees

Advantages:

Easy to explain to people, can be displayed graphically, and are


easily interpreted.
Some people believe that decision trees more closely mirror human
decision-making.
Trees can easily handle qualitative predictors without the need to
create dummy variables.

Disadvantages:

High variance because of hierarchical modeling structure.


Bad predictive accuracy.
Non-robust.
Lack of smoothness of prediction surface.
Difficulty in capturing additive structure.
162
Bagging

Definition 12.1 (Bagging)


Bagging is a general purpose procedure for reducing the variance of a
statistical learning method, e.g. regression/classification trees.

Remark 12.1
Recall that averaging a set of independent random variables reduces the
variance. In fact, given independent r.v.s z_1, ..., z_n, each with variance
σ², the variance of z̄ is σ²/n.

163
Bagging

Idea of Bagging: take many training sets from the population, build
different trees for each set and average the predictions.
\hat f_{avg}(X) = \frac{1}{B} \sum_{b=1}^{B} \hat f^{\,b}(X),

where \hat f^{\,b}(X) is the prediction with training set b. However, since we
do not have access to different training sets, one can use the bootstrap,

\hat f_{avg}(X) = \frac{1}{B} \sum_{b=1}^{B} \hat f^{\,*b}(X),

where \hat f^{\,*b}(X) is one bootstrap estimation.

164
Random Forests

Idea: bagging for de-correlated trees.


Build decision trees on bootstrapped training samples.
Each time a split is considered, select m predictors out of p as
candidates (not all, e.g. m = √p).

Remark 12.2
Note that if m “ p, all predictors are considered at every split, thus we
are in the bagging case.
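A minimal sketch with the randomForest package: mtry = p reproduces bagging, while mtry ≈ √p gives a random forest (the log-salary setup from the regression-tree example is assumed, for illustration only):

  library(ISLR); library(randomForest)
  set.seed(1)
  dat <- na.omit(Hitters); dat$Salary <- log(dat$Salary)
  p   <- ncol(dat) - 1
  bag <- randomForest(Salary ~ ., data = dat, mtry = p, ntree = 500)                # bagging
  rf  <- randomForest(Salary ~ ., data = dat, mtry = floor(sqrt(p)), ntree = 500)   # random forest
  c(bagging = bag$mse[500], random_forest = rf$mse[500])   # out-of-bag MSE estimates
  importance(rf)                                           # variable importance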

165
Boosting

Idea: Take many small modified versions of the data set, and grow trees
sequentially, i.e. each tree grows using info. from previously grown trees.

166
Boosting

Algorithm 6 (Boosting)

1 Set \hat f(x) = 0 and r_i = y_i for all i in the training set.
2 For b = 1, 2, ..., B do:
(a) Fit a tree \hat f^{(b)} with d splits (d + 1 terminal nodes) to the data (X, r).
(b) Update \hat f by adding a shrunken version of the new tree:

\hat f(x) ← \hat f(x) + λ \hat f^{(b)}(x)

(c) Update the residuals:

r_i ← r_i - λ \hat f^{(b)}(x_i)

3 Output the boosted model

\hat f(x) = \sum_{b=1}^{B} λ \hat f^{(b)}(x).

167
Boosting

Tuning parameters:

1 B: number of trees. If too large can lead to overfitting.


2 λ: shrinkage parameter. Controls the learning speed of the model.
3 d: number of splits. Controls the complexity of boosted ensemble.
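A minimal sketch with the gbm package, where B, d and λ map to n.trees, interaction.depth and shrinkage (same illustrative log-salary data):

  library(ISLR); library(gbm)
  set.seed(1)
  dat <- na.omit(Hitters); dat$Salary <- log(dat$Salary)
  fit <- gbm(Salary ~ ., data = dat, distribution = "gaussian",
             n.trees = 5000, interaction.depth = 2, shrinkage = 0.01, cv.folds = 5)
  best_B <- gbm.perf(fit, method = "cv")   # number of trees B chosen by cross-validation
  summary(fit, n.trees = best_B)           # relative influence of each feature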

168
Support Vector Machines
Hyperplanes

Definition 13.1 (Hyperplane)


Let S denote a p-dimensional space. H is a hyperplane in S if it is a flat
affine subspace of S of dimension p − 1; it need not contain the null
element. In fact, for some β_0, β_1, ..., β_p ∈ R,

\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p = 0, \qquad (63)

defines a hyperplane in R^p in the sense that any x ∈ R^p that fulfills (63)
is a point in the hyperplane.
Example 13.1 (Hyperplane)
Note that

1 any line in R2 is a hyperplane in R2 , but only the ones crossing the


origin are subspaces of R2 .
2 any plane in R3 is a hyperplane in R3 , but only the ones crossing the
origin are subspaces of R3 .
169
Classification Using a Separating Hyperplane

Definition 13.2 (Separating Hyperplane)


Consider observations x_i = (x_{i,1}, x_{i,2}, ..., x_{i,p})^T ∈ R^p, and a binary
response y_i ∈ {−1, 1}. A separating hyperplane is a hyperplane for which

\beta_0 + \beta_1 x_{i,1} + \cdots + \beta_p x_{i,p} > 0 \quad \text{if } y_i = +1
\beta_0 + \beta_1 x_{i,1} + \cdots + \beta_p x_{i,p} < 0 \quad \text{if } y_i = -1,

or equivalently

y_i (\beta_0 + \beta_1 x_{i,1} + \cdots + \beta_p x_{i,p}) > 0.

If a separating hyperplane exists, then it is the natural classifier.
Given x*, the confidence of its classification depends on how far the
quantity β_0 + β_1 x*_1 + ... + β_p x*_p is from zero.
A separating hyperplane need not be unique.

170
Classification Using a Separating Hyperplane

Source: [JO13]
171
Maximal Margin Classifier

Definition 13.3 (Margin)


The margin is the minimal distance from any observation of the training
set to a given separating hyperplane.

Definition 13.4 (Maximal Margin Hyperplane)


The maximal margin hyperplane is a separating hyperplane that has the
largest margin.

Definition 13.5 (Support Vectors)


All the observations equidistant to the maximal margin hyperplane are
called support vectors.

The maximum margin hyperplane depends directly on the support


vectors, but not on the other observations.
The classification technique that uses the maximal margin
hyperplane is called maximal margin classifier.

172
Maximal Margin Classifier

Source: [JO13]
173
Maximal Margin Classifier

Source: [JO13]

174
Maximal Margin Classifier

The maximal margin hyperplane is the solution of


(\hat\beta_0, \ldots, \hat\beta_p, \hat M) = \arg\max_{\beta_0, \ldots, \beta_p, M} \left\{ M \; : \; \sum_{j=1}^{p} \beta_j^2 = 1, \;\; y_i \Big( \beta_0 + \sum_{j=1}^{p} \beta_j x_{i,j} \Big) \ge M, \; i = 1, \ldots, n \right\},

A separating hyperplane may not exist. This can be fixed using a


“soft margin”.
The generalization of the maximal margin classifier to the
non-separable case is known as the support vector classifier.

175
Support Vector Hyperplane

The support vector hyperplane is the solution of

(\hat\beta_0, \ldots, \hat\beta_p, \hat\epsilon_1, \ldots, \hat\epsilon_n, \hat M) =

\arg\max_{\beta_0, \ldots, \beta_p, \epsilon_1, \ldots, \epsilon_n, M} \left\{ M \; : \; \sum_{j=1}^{p} \beta_j^2 = 1, \;\; y_i \Big( \beta_0 + \sum_{j=1}^{p} \beta_j x_{i,j} \Big) \ge M (1 - \epsilon_i), \;\; \epsilon_i \ge 0, \;\; \sum_{i=1}^{n} \epsilon_i \le C, \; i = 1, \ldots, n \right\},

where ε_i, i = 1, ..., n, are called "slack variables".

If ε_i > 0, the i-th observation is on the wrong side of the margin.
If ε_i > 1, the i-th observation is on the wrong side of the hyperplane.
All observations with ε_i > 0 (in particular those with ε_i > 1) are support vectors.
The parameter C controls the severity of the accumulated violations.
If C ↓, then bias ↓ and variance ↑; if C ↑, then bias ↑ and variance ↓.
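A minimal sketch of a support vector classifier with the e1071 package, tuning the cost by cross-validation on simulated two-class data. Note that e1071's cost argument is the penalty of the soft-margin problem, so a large cost corresponds to a small budget C in the formulation above (the direction is inverted):

  library(e1071)
  set.seed(1)
  x <- matrix(rnorm(40 * 2), ncol = 2)
  y <- factor(rep(c(-1, 1), each = 20))
  x[y == 1, ] <- x[y == 1, ] + 1.5              # shift one class to make it (almost) separable
  dat <- data.frame(x = x, y = y)
  tun <- tune(svm, y ~ ., data = dat, kernel = "linear",
              ranges = list(cost = c(0.01, 0.1, 1, 10, 100)))   # CV over the cost parameter
  summary(tun$best.model)                       # chosen cost and number of support vectors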
176
Support Vector Hyperplane

Source: [JO13]
177
Support Vector Hyperplane

178
Support Vector Hyperplane

Source: [JO13]

179
Support Vector Hyperplane

Proposition 3 (Support Vector Hyperplane)


The support vector hyperplane can be represented as

f(x) = \beta_0 + \sum_{i \in S} \alpha_i \langle x, x_i \rangle, \qquad \langle x, x_i \rangle = x^\top x_i, \qquad (64)

where S is the set of indices of the support vectors, and α_i ∈ R.

Proof. From the FOC of the support vector classifier problem it follows
that β = \sum_{i=1}^{n} \delta_i y_i x_i. Thus

f(x) = \beta_0 + x^\top \sum_{i=1}^{n} \delta_i y_i x_i = \beta_0 + \sum_{i=1}^{n} \delta_i y_i \langle x, x_i \rangle = \beta_0 + \sum_{i \in S} \alpha_i \langle x, x_i \rangle,

where the last expression follows from α_i = δ_i y_i, and the fact that α_i ≠ 0
only for the support vectors. □

180
Support Vector Machines

Definition 13.6 (Support Vector Machines)


The support vector machine is an extension to the support vector
classifier that results from enlarging the feature space using kernels,
f(x) = \beta_0 + \sum_{i \in S} \alpha_i \langle x, x_i \rangle \quad \to \quad f(x) = \beta_0 + \sum_{i \in S} \alpha_i K(x, x_i),

where K(x, x_i) = \langle x, x_i \rangle is called the linear kernel. Other examples are

1 Polynomial kernel

K(x, x_i) = \big( 1 + x^\top x_i \big)^d, \qquad d > 0

2 Radial kernel

K(x, x_i) = \exp\big( -\gamma \|x - x_i\|_2^2 \big), \qquad \gamma > 0.
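A minimal sketch of an SVM with a radial kernel in e1071, tuning cost and γ jointly by cross-validation on data with a non-linear class boundary (simulated for illustration):

  library(e1071)
  set.seed(1)
  x <- matrix(rnorm(200 * 2), ncol = 2)
  y <- factor(ifelse(x[, 1]^2 + x[, 2]^2 > 1.5, 1, -1))   # circular boundary
  dat <- data.frame(x = x, y = y)
  tun <- tune(svm, y ~ ., data = dat, kernel = "radial",
              ranges = list(cost = c(0.1, 1, 10), gamma = c(0.5, 1, 2)))
  plot(tun$best.model, dat)                               # decision regions of the fitted SVM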

181
Support Vector Hyperplane

Source: [JO13]

182
SVMs: More than Two Classes

Consider a response with k possible classes. We can use two approaches:

1 One vs. One. Fit an SVM for each of the \binom{k}{2} possible pairs of classes. Do voting
at the observation level and classify accordingly.
2 One vs. All.
For all observations in class 1, code the response as +1, and as −1 for all
the other classes. Fit an SVM and compute β_{0,1}, β_{1,1}, ..., β_{p,1}.
Repeat the previous step for classes 2, 3, ..., k and collect
β_{0,i}, β_{1,i}, ..., β_{p,i}, i = 1, ..., k, in each case.
For an out-of-sample observation x*, compute

f(x^*; \beta_i) = \beta_{0,i} + x^{*\top} \beta_i, \qquad i = 1, \ldots, k,

and select the class i for which f(x^*; β_i) is the largest.
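A minimal sketch with e1071, which handles the multi-class case internally via the one-vs-one scheme (here on the built-in iris data, with 3 classes):

  library(e1071)
  fit <- svm(Species ~ ., data = iris, kernel = "radial")   # 3 classes -> 3 pairwise SVMs
  table(Predicted = predict(fit, iris), True = iris$Species)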

183
Homework

Due 5.30.19 - 19:30.

1 Chapter 8: Exercises 2, 11 from [JO13].


2 Chapter 9: Exercises 3 (a,b,c), 7 (a,b,c,d) from [JO13].

184
