04.04.19 – 30.05.19
ESAN Graduate School of Business
General Course Information
1
Course Summary
2
Course Objectives
3
Content Schedule
1 Basic Statistics:
Session 1: Introduction
Session 2: R Software
Session 3: Linear Regression
2 Linear Methods:
Session 4: Classification Models
Session 5: Resampling Methods
Session 6: Regularization
Session 7: Dimension Reduction
Session 8: Workshop
3 Non-Linear Methods:
Session 9: Splines
Session 10: GAMs
Sessions 11, 12: Decision Trees
Sessions 13, 14: Support Vector Machines
Session 15: Final Evaluation
4
Methodology
Participate in class.
Read the bibliography listed in the syllabus.
Do the homework assignments.
Take the scheduled evaluations.
5
Evaluation
6
Information Sources
7
Instructor
Education:
Experience:
8
Teaching Assistants
9
Materials
From a mobile phone:
From a web browser:
https://github.com/LFRM/Lectures
10
Introduction
Overview
11
Basics
$Y = f(X) + \varepsilon$, $\quad X = \{X_1, \dots, X_p\}$,
where
12
Basics
Usual Steps:
$$Y = f(X) + \varepsilon. \quad (1)$$
3 We estimate $\hat f$ "somehow".
4 We use $\hat f$ on $X$ to make the prediction $\hat Y$:
$$\hat Y = \hat f(X). \quad (2)$$
13
Estimation Error
Proposition 1
Interpretation of proposition 1:
14
Estimation of f : Parametric vs. Non-Parametric
Parametric Methods:
Non-Parametric Methods:
15
Estimation of f : Parametric vs. Non-Parametric
Source: [JO13]
16
Estimation of f : Regression vs. Classification
17
Estimation of f : Assessing Model Accuracy
18
Estimation of f : Assessing Model Accuracy
Left: data (circles), true function (black), linear fit (orange), spline fit 1 (blue),
spline fit 2 (green). Right: train MSE (gray), test MSE (red).
Source: [JO13]
19
Estimation of f : Bias - Variance Trade-off
Proposition 2
Given x0 , the expected test MSE can be decomposed into the sum of
three quantities: the variance of fˆpx0 q, the squared bias of fˆpx0 q and the
variance of the error term.
Proof. From Proposition 1 we know that
$$E[(y_0 - \hat f(x_0))^2] = E[(f(x_0) - \hat f(x_0))^2] + \mathrm{Var}[\varepsilon]$$
$$= E\big[\big(\{f(x_0) - E[\hat f(x_0)]\} - \{\hat f(x_0) - E[\hat f(x_0)]\}\big)^2\big] + \mathrm{Var}[\varepsilon].$$
Thus, since the cross term vanishes,
$$E[(y_0 - \hat f(x_0))^2] = \underbrace{E[(f(x_0) - E[\hat f(x_0)])^2]}_{\text{squared bias of } \hat f(x_0)} + \underbrace{E[(\hat f(x_0) - E[\hat f(x_0)])^2]}_{\text{variance of } \hat f(x_0)} + \mathrm{Var}[\varepsilon].$$
Interpretation of proposition 2:
Methods:
21
Estimation of f : Classification Setting
22
Estimation of f : Classification Setting
23
Estimation of f : Classification Setting
24
Estimation of f : Classification Setting
[Figure: two panels of simulated observations plotted over (x, y); graphic residue from the original slides removed.]
25
R Software
The R Project for Statistical Computing
26
Basic Commands
27
Basic Commands
28
Basic Commands
29
Basic Commands
30
Graphics
31
Graphics
Let $f : \mathbb{R}^2 \to \mathbb{R}$,
32
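A minimal R sketch of the kind of surface graphic this slide refers to; the function $f(x, y) = \cos(y)/(1 + x^2)$ below is purely illustrative, since the function used on the original slide is not shown here.

# Evaluate a function f: R^2 -> R on a grid and draw contour / perspective plots
x <- seq(-pi, pi, length.out = 50)
y <- x
f <- outer(x, y, function(x, y) cos(y) / (1 + x^2))   # illustrative choice of f
contour(x, y, f)                                      # contour plot of f over the grid
persp(x, y, f, theta = 30, phi = 20)                  # 3-D perspective plot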
Indexing Matrix Data
Compute
1 A[2,3]
2 A[1,]
3 A[1:3, c(2,4)]
4 A[,-4]
5 dim(A)
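A small runnable sketch to try the five expressions above; the matrix A below is an arbitrary example, not data from the course.

A <- matrix(1:16, nrow = 4, ncol = 4)   # a 4 x 4 example matrix
A[2, 3]           # single element: row 2, column 3
A[1, ]            # entire first row
A[1:3, c(2, 4)]   # rows 1-3, columns 2 and 4
A[, -4]           # all rows, dropping column 4
dim(A)            # dimensions: 4 4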
33
Loading Data
34
Additional Graphical and Numerical Summaries
35
Additional Graphical and Numerical Summaries
36
Linear Regression
Motivation
What?
Why?
37
Simple Linear Regression
Model:
$$Y = f(X) + \varepsilon, \qquad f(X) = \beta_0 + \beta_1 X, \quad (7)$$
where $\varepsilon$ is a centered random noise, uncorrelated with $X$.
Example 1 and Example 2. [Figure: two scatter plots of simulated (x, y) data; graphic residue from the original slides removed.]
38
Estimation
e.g. identifying $\hat\beta_0$, $\hat\beta_1$ that fulfill the least squares criterion.
Least Squares Criterion:
39
Estimation
FOC:
$$\frac{\partial\,\mathrm{RSS}}{\partial \hat\beta_0} = \hat\beta_0 + \hat\beta_1 \bar x - \bar y = 0, \quad (11)$$
$$\frac{\partial\,\mathrm{RSS}}{\partial \hat\beta_1} = \hat\beta_0 \sum_{i=1}^n x_i + \hat\beta_1 \sum_{i=1}^n x_i^2 - \sum_{i=1}^n x_i y_i = 0. \quad (12)$$
with solution
40
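A minimal sketch of least squares estimation in R on simulated data; the variable names and the simulated coefficients are illustrative only. summary() reports the standard errors and t-statistics used on the following slides.

set.seed(1)
x <- runif(100)
y <- 2 + 3 * x + rnorm(100, sd = 0.5)   # simulate y = beta0 + beta1 * x + noise
fit <- lm(y ~ x)                        # least squares fit
coef(fit)                               # estimates beta0-hat, beta1-hat
summary(fit)                            # standard errors, t-statistics, R^2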
Example: Sample Mean Estimator
Recall:
Let $Z \sim (\mu, \sigma)$, where the mean $\mu$ and s.d. $\sigma$ are unknown.
Sample mean estimator: collect $n$ random samples of $Z$, and estimate $\mu$ by $\hat\mu = \frac{1}{n}\sum_{i=1}^n z_i$. Repeat the previous $m$ times with different samples of the same size: $\{\hat\mu^{(1)}, \hat\mu^{(2)}, \dots, \hat\mu^{(m)}\}$.
Unbiasedness: we say $\hat\mu$ is an unbiased estimator of $\mu$, since
$$\frac{1}{m}\sum_{k=1}^m \hat\mu^{(k)} \to \mu \quad \text{as } m \to \infty.$$
Variance: defined as
$$\mathrm{SE}(\hat\mu)^2 = \frac{\sigma^2}{n},$$
it measures the precision of the sample mean estimator.
41
Bias in Simple Linear Regression
The estimators $\hat\beta_0$ and $\hat\beta_1$ are r.v.s: given some data we estimate $\beta_0$ and $\beta_1$ by $\hat\beta_0$ and $\hat\beta_1$, but the estimates vary according to the random sample.
Unbiasedness: if we consider $m$ random samples, we obtain $\hat\beta_0^{(k)}$ and $\hat\beta_1^{(k)}$, for $k = 1, \dots, m$. It can be shown that
$$\frac{1}{m}\sum_{k=1}^m \hat\beta_0^{(k)} \to \beta_0 \quad\text{and}\quad \frac{1}{m}\sum_{k=1}^m \hat\beta_1^{(k)} \to \beta_1 \quad \text{as } m \to \infty,$$
42
Variance
$$\mathrm{SE}[\hat\beta_0]^2 = \sigma^2\left[\frac{1}{n} + \frac{\bar x^2}{\sum_{i=1}^n (x_i - \bar x)^2}\right], \quad (15)$$
$$\mathrm{SE}[\hat\beta_1]^2 = \frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar x)^2}, \quad (16)$$
43
Confidence Bands
44
Hypothesis testing
$H_0: \beta_1 = 0$
$H_a: \beta_1 \neq 0$
$$t = \frac{\hat\beta_1 - 0}{\mathrm{SE}[\hat\beta_1]}, \quad (19)$$
45
Hypothesis Testing
Remark 3.1
To see that $t \sim t_{n-2}$ in (19), recall that a r.v. $X = \frac{Z}{\sqrt{V/\nu}}$ is distributed $t_\nu$ if $Z \sim \mathcal{N}(0, 1)$ and $V \sim \chi^2_\nu$, where $\chi^2_\nu$ denotes the chi-squared distribution with $\nu$ degrees of freedom. The denominator of (19) reads
$$\mathrm{SE}[\hat\beta_1] = \sqrt{\frac{\mathrm{RSS}/\sigma^2}{n - 2}}\,\sqrt{\frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar x)^2}}, \quad (20)$$
where $\mathrm{RSS}/\sigma^2 \sim \chi^2_{n-2}$, while the numerator follows
$$(\hat\beta_1 - 0)\Bigg/\sqrt{\frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar x)^2}} \sim \mathcal{N}(0, 1). \quad (21)$$
46
Model Accuracy
47
Multiple Linear Regression
Model:
$$Y = f(X) + \varepsilon, \qquad f(X) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p, \quad (23)$$
48
Multiple Linear Regression
[Figure: 3-D plots of the response over predictors x1 and x2; graphic residue from the original slides removed.]
49
Hypothesis Testing
$H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0$
$H_a$: at least one $\beta_j$ is non-zero
$$F = \frac{(\mathrm{TSS} - \mathrm{RSS})/p}{\mathrm{RSS}/(n - p - 1)}$$
50
Variable Selection
52
Predictors
53
Predictors: Boston Example
[Figure: Boston example, scatter plots of y against x; graphic residue from the original slides removed.]
54
Other Considerations
55
Potential Problems
56
Potential Problems
$$h_i = \frac{1}{n} + \frac{(x_i - \bar x)^2}{\sum_{j=1}^n (x_j - \bar x)^2}, \qquad h_i \in \left[\tfrac{1}{n}, 1\right]. \quad (24)$$
n j“1 pxj ´ x̄q n
57
Comparison with KNN
1D, 2D, 3D. [Figure: panels of sample points in one-, two-, and three-dimensional feature spaces (axes x, y, z); graphic residue from the original slides removed.]
58
Homework
59
Classification Models
Linear Regression
Source: [JO13]
60
Linear Regression
Remark 4.1
Note that linear regression fails for classification problems. Consider $Y$ coded as $0/1$ and $p = 1$. In linear regression we estimate $\hat f : \mathbb{R}^p \to \mathbb{R}$, meaning that we are mapping onto the whole real line, and not only onto $\{0, 1\}$. This means we can predict a value such as 0.7, which is not a valid class label.
61
Linear Regression
Source: [JO13]
62
Simple Logistic Regression
63
Estimation & Prediction
$$\hat p(X) = \frac{e^{\hat\beta_0 + \hat\beta_1 X}}{1 + e^{\hat\beta_0 + \hat\beta_1 X}},$$
and the classification follows a rule of the form:
$$\hat y_i = \begin{cases} 1 & \text{if } \hat p(X) > H \\ 0 & \text{if } \hat p(X) \le H, \end{cases}$$
64
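A minimal sketch of logistic regression in R on simulated data; the threshold H = 0.5 and all variable names are illustrative.

set.seed(1)
x <- rnorm(200)
p <- 1 / (1 + exp(-(-1 + 2 * x)))          # true probabilities
y <- rbinom(200, size = 1, prob = p)       # simulated 0/1 response
fit  <- glm(y ~ x, family = binomial)      # maximum likelihood fit of beta0, beta1
phat <- predict(fit, type = "response")    # fitted probabilities p-hat(x)
yhat <- ifelse(phat > 0.5, 1, 0)           # classification rule with H = 0.5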
Multiple Logistic Regression
where:
$$p(X) = \frac{e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p}}.$$
65
Bayesian Classifier
$$p_k(X) = \frac{\pi_k f_k(x)}{\sum_{\ell=1}^K \pi_\ell f_\ell(x)}, \quad (31)$$
where:
66
Bayesian Classifier (example for p=1)
$$f_k(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left\{-\frac{(x - \mu_k)^2}{2\sigma^2}\right\} \quad (32)$$
$$p_k(X) = \frac{\pi_k \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left\{-\frac{(x - \mu_k)^2}{2\sigma^2}\right\}}{\sum_{\ell=1}^K \pi_\ell \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left\{-\frac{(x - \mu_\ell)^2}{2\sigma^2}\right\}}. \quad (33)$$
Taking logs and dropping the terms that do not depend on $k$, maximizing $p_k(X)$ over $k$ is equivalent to maximizing
$$\frac{\mu_k}{\sigma^2} X - \frac{\mu_k^2}{2\sigma^2} + \log(\pi_k),$$
thus the Bayes classifier selects the class $k$ that maximizes
$$\delta_k(X) = \frac{\mu_k}{\sigma^2} X - \frac{\mu_k^2}{2\sigma^2} + \log(\pi_k). \quad (34)$$
67
Bayesian Classifier (example for p=1)
Assuming equal priors $\pi_1 = \pi_2$ and $\mu_1 > \mu_2$, we assign class 1 when
$$\delta_1(X) > \delta_2(X)$$
$$\frac{\mu_1}{\sigma^2} X - \frac{\mu_1^2}{2\sigma^2} + \log(\pi_1) > \frac{\mu_2}{\sigma^2} X - \frac{\mu_2^2}{2\sigma^2} + \log(\pi_2)$$
$$X > \frac{\mu_1 + \mu_2}{2},$$
and class 2 otherwise.
68
Linear Discriminant Analysis
$$\hat\delta_k(X) = \frac{\hat\mu_k}{\hat\sigma^2} X - \frac{\hat\mu_k^2}{2\hat\sigma^2} + \log(\hat\pi_k). \quad (35)$$
In particular, the required sample estimates follow
$$\hat\pi_k = \frac{n_k}{n}, \quad (36)$$
$$\hat\mu_k = \frac{1}{n_k}\sum_{i:\, y_i = k} x_i, \quad (37)$$
$$\hat\sigma^2 = \frac{1}{n - K}\sum_{k=1}^K \sum_{i:\, y_i = k} (x_i - \hat\mu_k)^2. \quad (38)$$
69
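A minimal sketch of LDA in R using the MASS package (shipped with R); the data are simulated and illustrative.

library(MASS)
set.seed(1)
x <- c(rnorm(50, mean = -1), rnorm(50, mean = 1))   # one predictor, two classes
y <- factor(rep(c(1, 2), each = 50))
fit  <- lda(y ~ x)                 # estimates pi-hat_k, mu-hat_k and the common sigma-hat
pred <- predict(fit)               # $class: assigned class; $posterior: estimated p_k(x)
table(Predicted = pred$class, True = y)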
Linear Discriminant Analysis
Source: [JO13]
70
Linear Discriminant Analysis
71
Linear Discriminant Analysis
Source: [JO13] 72
Assessment of Binary Classifiers
Confusion matrix:

                         True condition
                         Condition positive        Condition negative
Predicted positive       True positive             False positive (Type I error)
Predicted negative       False negative            True negative
                         (Type II error)

where:
True positive rate = (sum of true positives) / (sum of condition positives)
False positive rate = (sum of false positives) / (sum of condition negatives)
73
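A short sketch of computing the confusion matrix and the two rates in R; y and yhat are assumed to be 0/1 vectors of true labels and predictions, e.g. from the logistic-regression sketch above.

cm  <- table(Predicted = yhat, True = y)        # confusion matrix
cm
TPR <- sum(yhat == 1 & y == 1) / sum(y == 1)    # true positive rate
FPR <- sum(yhat == 1 & y == 0) / sum(y == 0)    # false positive rate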
Assessment of Binary Classifiers
74
Assessment of Binary Classifiers
Source: [JO13]
75
Assessment of Binary Classifiers
Source: [JO13]
76
Quadratic Discriminant Analysis
Same as LDA, but each class has its own covariance matrix.
Thus, we are interested in finding the $k$ that maximizes
$$\delta_k(x) = -\frac{1}{2}(x - \mu_k)^\top \Sigma_k^{-1}(x - \mu_k) - \frac{1}{2}\log|\Sigma_k| + \log(\pi_k)$$
$$= -\frac{1}{2} x^\top \Sigma_k^{-1} x - \frac{1}{2}\mu_k^\top \Sigma_k^{-1}\mu_k + \mu_k^\top \Sigma_k^{-1} x - \frac{1}{2}\log|\Sigma_k| + \log(\pi_k), \quad (41)$$
where the first term is a quadratic form and thus the decision boundaries have curvature.
Equation (41) means that we need to estimate $\Sigma_k$ for each $k$.
77
Quadratic Discriminant Analysis
Remark 4.3
QDA is more flexible than LDA. Intuitively this means that when
compared to each other:
78
Linear Discriminant Analysis
Source: [JO13]
79
Comparison of Classification Methods
$$\log\left(\frac{p_1(x)}{1 - p_1(x)}\right) = \underbrace{\log(\pi_1) - \log(\pi_2) + \frac{\mu_2^2 - \mu_1^2}{2\sigma^2}}_{c_0} + \underbrace{\frac{\mu_1 - \mu_2}{\sigma^2}}_{c_1}\, X = c_0 + c_1 X,$$
80
Comparison of Classification Methods
Remark 4.4
KNN is more flexible than QDA. Intuitively this means that when
compared to each other:
81
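A minimal sketch of KNN classification in R with the class package (shipped with R); the simulated data, the train/test split and k = 5 are all illustrative.

library(class)
set.seed(1)
X <- matrix(rnorm(200 * 2), ncol = 2)            # two predictors
y <- factor(ifelse(X[, 1] + X[, 2] > 0, 1, 0))   # illustrative class labels
train <- sample(200, 100)
yhat  <- knn(train = X[train, ], test = X[-train, ],
             cl = y[train], k = 5)               # KNN with k = 5 neighbours
mean(yhat != y[-train])                          # test error rate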
Resampling Methods
Overview
Problem: want to select the model with the best test error
performance, but usually do not have available test sets.
Idea: create artificial test sets by sampling.
Limitation: sampling from a population a repeated number of times
demands some computation power.
Sampling methods:
1 Cross validation: Validation set, LOOCV, and k-fold
2 Bootstrap
82
Cross Validation: Validation Set
Randomly split the observations into a training set and a validation set, fit the model using "training" (70% obs.), and then compute the MSE using "validation" (30% obs.). Select the model with the smallest MSE.
Drawbacks
1 Test MSEs are too variable because samples can be very dissimilar.
2 Test MSEs are too high because we are only using part of the data.
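A minimal sketch of the validation-set approach in R with a 70/30 split on simulated data; the variable names and the polynomial degrees being compared are illustrative.

set.seed(1)
n <- 200
x <- runif(n)
y <- 1 + 2 * x - 3 * x^2 + rnorm(n, sd = 0.3)
dat   <- data.frame(x, y)
train <- sample(n, round(0.7 * n))               # 70% training observations
val.mse <- sapply(1:4, function(d) {             # compare polynomial degrees 1..4
  fit <- lm(y ~ poly(x, d), data = dat, subset = train)
  mean((dat$y[-train] - predict(fit, dat[-train, ]))^2)   # validation MSE
})
which.min(val.mse)                               # model with the smallest validation MSE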
83
Cross Validation: Validation Set
Source: [JO13]
84
Cross Validation: LOOCV
85
Cross Validation: LOOCV
Remark 5.1
Consider simple linear regression. We stress that $\hat y_i^{(-i)} = \hat\beta_0^{(-i)} + \hat\beta_1^{(-i)} x_i$ in expression (42) is computed using
$$\hat\beta_0^{(-i)} = \frac{\sum_{j \neq i} y_j}{n - 1} - \hat\beta_1^{(-i)} \frac{\sum_{j \neq i} x_j}{n - 1},$$
$$\hat\beta_1^{(-i)} = \frac{\sum_{j \neq i}\left(x_j - \frac{\sum_{j \neq i} x_j}{n - 1}\right)\left(y_j - \frac{\sum_{j \neq i} y_j}{n - 1}\right)}{\sum_{j \neq i}\left(x_j - \frac{\sum_{j \neq i} x_j}{n - 1}\right)^2},$$
Source: [JO13]
87
Cross Validation: k-Fold
88
Cross Validation: k-Fold
Source: [JO13]
89
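A minimal sketch of LOOCV and k-fold CV in R using cv.glm() from the boot package; a glm() fit with the default gaussian family is just least squares, so this reproduces the linear-regression case. The simulated data and K = 10 are illustrative.

library(boot)
set.seed(1)
dat <- data.frame(x = runif(200))
dat$y <- 1 + 2 * dat$x + rnorm(200, sd = 0.3)
fit <- glm(y ~ x, data = dat)          # gaussian glm = least squares
cv.glm(dat, fit)$delta[1]              # LOOCV estimate of the test MSE
cv.glm(dat, fit, K = 10)$delta[1]      # 10-fold CV estimate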
Cross-Validation on Classification Problems
Remark 5.2
Instead of the MSE, we use the ER. We write it once again here for
clarity, for the case of n observations
$$\mathrm{ER} = \frac{1}{n}\sum_{i=1}^n I_{\{y_i \neq \hat y_i\}},$$
90
Bootstrap
Idea: sample data with replacement to obtain a data set of the same size.
Used to quantify the uncertainty associated with a given estimator.
Example 5.1
Consider the problem of selecting the optimal investment allocation $\alpha^*$ when choosing between assets $X$ and $Y$, so as to minimize the variance of the portfolio return $\alpha X + (1 - \alpha) Y$. Assume the sample estimates $\hat\sigma_X^2$, $\hat\sigma_Y^2$, $\hat\sigma_{XY}$ are available.
Solution. It is easy to see that
$$\hat\alpha = \frac{\hat\sigma_Y^2 - \hat\sigma_{XY}}{\hat\sigma_X^2 + \hat\sigma_Y^2 - 2\hat\sigma_{XY}}.$$
Source: [JO13] 92
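A minimal sketch of the bootstrap for Example 5.1 using the boot package; the returns X and Y are simulated here, so the numbers are purely illustrative.

library(boot)
set.seed(1)
dat <- data.frame(X = rnorm(100, sd = 1), Y = rnorm(100, sd = 2))   # simulated returns
alpha.fn <- function(data, index) {             # alpha-hat computed on a bootstrap sample
  X <- data$X[index]; Y <- data$Y[index]
  (var(Y) - cov(X, Y)) / (var(X) + var(Y) - 2 * cov(X, Y))
}
boot(dat, alpha.fn, R = 1000)                   # bootstrap estimate of SE(alpha-hat)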
Homework
93
Model Selection
Subset Selection
Recall:
Remark 6.1
Note that, summing over all possible model sizes, the total number of possible models is:
$$\sum_{i=0}^{p} \binom{p}{i} = \binom{p}{0} + \binom{p}{1} + \cdots + \binom{p}{p} = 2^p,$$
94
Best Subset Selection
95
Forward Stepwise Selection
Remark 6.3
Note that the number of models being tested in Algorithm 2 is
$$1 + p + (p - 1) + \cdots + 1 = 1 + \frac{p(p+1)}{2} < 2^p,$$
i.e. much smaller than that of Algorithm 1.
96
Backward Stepwise Selection
Remark 6.4
1 The number of models tested in Algorithms 3 and 2 is the same.
2 The best model resulting from Algorithms 1, 2 and 3 does not need
to coincide.
97
Choosing the Optimal Model
Two ways:
98
Cp , AIC, BIC, Adjusted R 2
Main idea: Provide selection criteria after controlling for model size d.
$$\text{Adj. } R^2 = 1 - \frac{\mathrm{RSS}/(n - d - 1)}{\mathrm{TSS}/(n - 1)} \quad (49)$$
99
Cp , AIC, BIC, Adjusted R 2
Remark 6.5
100
Regularization
Regularization Methods
Idea: Don't do variable selection. Instead run the model with all variables, but shrink the betas so that:
$$\uparrow \mathrm{Bias}[\hat f(x_0)], \qquad \Downarrow \mathrm{Var}[\hat f(x_0)],$$
where the big arrow $\Downarrow$ means that we expect the benefits from reducing the variance to compensate for the increase in bias.
Alternatives:
1 Ridge Regression
2 LASSO
101
Ridge Regression
Where:
The shrinkage penalty reduces the value of the estimated $\hat\beta_j$'s.
For the tuning parameter $\lambda \to 0$ the penalty has no effect, and for $\lambda \to \infty$ the penalty dominates.
102
Ridge Regression
Remark 7.1
Note that each $\hat\beta_j$ does depend on the scaling of all predictors $x_1, \dots, x_p$. The usual recommendation is then to transform the variables $x_j$ to
$$\tilde x_{i,j} = \frac{x_{i,j}}{\sqrt{\frac{1}{n}\sum_{i=1}^n (x_{i,j} - \bar x_j)^2}}. \quad (51)$$
Remark 7.2
Note that RR performs better than LS if the gain from the variance reduction surpasses the loss from the bias increase. This is the case when the LS variance is particularly high, which happens as $p$ approaches $n$.
103
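A minimal sketch of ridge regression in R with the glmnet package (alpha = 0 gives the ridge penalty; alpha = 1 would give the LASSO of the next slides). The data are simulated, and glmnet standardizes the predictors by default, in line with Remark 7.1.

library(glmnet)
set.seed(1)
X <- matrix(rnorm(100 * 10), ncol = 10)                     # n = 100, p = 10 predictors
y <- as.numeric(X %*% c(3, -2, rep(0, 8)) + rnorm(100))     # only two informative predictors
grid  <- 10^seq(4, -2, length.out = 100)                    # grid of lambda values
ridge <- glmnet(X, y, alpha = 0, lambda = grid)             # ridge: all coefficients shrink
coef(ridge, s = 1)                                          # coefficients at lambda = 1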
LASSO (Least Absolute Shrinkage and Selection Operator)
104
LASSO
Remark 7.3
Recall that the $\ell_p$-norm of a vector $x \in \mathbb{R}^n$ is defined as
$$\|x\|_p = \left(\sum_{i=1}^n |x_i|^p\right)^{1/p}, \quad (53)$$
105
LASSO
Unit spheres in $\mathbb{R}^2$: panels show the $L_2$, $L_1$, $L_\infty$, and $L_p$ unit balls. [Graphic residue from the original slides removed.]
106
LASSO
Remark 7.4
Note that finding $\hat\beta$ in LASSO can be written equivalently as
$$\hat\beta_1, \dots, \hat\beta_p = \arg\min_{\beta_1,\dots,\beta_p}\; \sum_{i=1}^n \left(y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{i,j}\right)^2 + \lambda \sum_{k=1}^p |\beta_k|,$$
or as
$$\hat\beta_1, \dots, \hat\beta_p = \arg\min_{\beta_1,\dots,\beta_p}\left\{\sum_{i=1}^n \left(y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{i,j}\right)^2 \;:\; \sum_{k=1}^p |\beta_k| \le s\right\}.$$
107
Source: [JO13]
108
Selecting the Tuning Parameters
109
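A minimal sketch of choosing lambda by cross-validation with cv.glmnet(); here alpha = 1 (LASSO), and X, y are assumed to be the simulated predictor matrix and response from the ridge sketch above.

library(glmnet)
set.seed(1)
cv.out <- cv.glmnet(X, y, alpha = 1)    # 10-fold CV over a default lambda grid
plot(cv.out)                            # CV error curve against log(lambda)
cv.out$lambda.min                       # lambda with the smallest CV error
coef(cv.out, s = "lambda.min")          # sparse coefficient vector at that lambda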
Dimension Reduction
Overview
$Z_m \in \mathrm{span}\{X_1, \dots, X_p\}$, $\quad m = 1, \dots, M$.
Here we discuss:
110
Overview
Remark 8.1
Note that the change of variables from $\{X_j\}_{j=1}^p$ to $\{Z_m\}_{m=1}^M$ can in itself reduce the variance of the overall estimator $\hat f(x_0)$. In particular, consider
$$y_i = \theta_0 + \sum_{m=1}^M \theta_m z_{i,m} + \varepsilon_i, \qquad \theta_0, \dots, \theta_M \in \mathbb{R},$$
$$z_{i,m} = \sum_{j=1}^p \phi_{j,m} x_{i,j}, \qquad \phi_{1,m}, \dots, \phi_{p,m} \in \mathbb{R},$$
where the implied coefficients on the original predictors are $\beta_j = \sum_{m=1}^M \theta_m \phi_{j,m}$.
where $\hat\phi_{1,1}$, $\hat\phi_{2,1}$ are the "loadings" for $z_1$, and the $z_{i,1}$ are its "scores".
Remark 8.2
The first principal component is the line that minimizes the sum of the squared perpendicular distances between each point and the line.
112
Principal Component Analysis
113
Principal Component Regression
where $\varepsilon_i$ is the noise under the usual assumptions, and $M$ can be selected via cross-validation.
Remark 8.3
PCR assumes the directions where X1 , . . . , Xp show the most variation
are directions associated with Y .
Remark 8.4
Note that PCR uses all features in each component, thus it is a
dimension reduction method, but not a feature selection method.
114
Partial Least Squares
For k “ 1, . . . , M do:
115
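A minimal sketch of PCR and PLS in R with the pls package; X and y are assumed to be a numeric predictor matrix and response (e.g. from the earlier glmnet simulation), and the number of components M is chosen by cross-validation.

library(pls)
set.seed(1)
dat <- data.frame(y = y, X)                                           # response plus predictor columns
pcr.fit <- pcr(y ~ ., data = dat, scale = TRUE, validation = "CV")    # principal component regression
pls.fit <- plsr(y ~ ., data = dat, scale = TRUE, validation = "CV")   # partial least squares
validationplot(pcr.fit, val.type = "MSEP")                            # CV error vs. number of components M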
Homework
116
Lab: Forecasting Soccer Matches
Machine Learning, Big Data and other Fancy Words
Worldwide hype
and in Peru...
RIMAC data science challenge^1: predict car insurance default for good customers. The three best forecasts get a data scientist job!
Fintech: LatinFintech, Bitinka, Culqui, Kambista, etc.
Popular Events
Google says it can beat "Paul the Octopus".^5
2 https://github.com/GoogleCloudPlatform/ipython-soccer-predictions
3 http://www.goldmansachs.com/our-thinking/macroeconomic-insights/euro-cup-2016/
4 From the Twitter account @2010misterchip.
5 Claim made by J. Tigani (Google I/O) at a big data conference (Strata) in 2014.
118
Objective of this Application
Forecast Argentina–Peru:
6 www.optasportspro.com
119
Data Structure
Three leagues:
Features × 2:
Make an assessment of how the teams are coming into the game, i.e. write a matrix with teams on one side and features on the other side.^8
Details:
We aim to maximize
$$L(\beta; \lambda) := \log\left\{\prod_{i=1}^n p_i(\beta)^{y_i}\,\bigl(1 - p_i(\beta)\bigr)^{1 - y_i}\right\} - \lambda\,\|\beta\|_1,$$
where $\|x\|_1 = \sum_{i=1}^n |x_i|$, $x \in \mathbb{R}^n$, and $\lambda$ is the regularization parameter.
Details:
122
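A minimal sketch of maximizing this L1-penalized log-likelihood in R via glmnet with family = "binomial"; the Opta match features are not reproduced here, so feat is a simulated placeholder matrix and win a simulated 0/1 outcome.

library(glmnet)
set.seed(1)
feat <- matrix(rnorm(300 * 20), ncol = 20)     # placeholder for the team/feature matrix
win  <- rbinom(300, 1, 0.5)                    # placeholder 0/1 match outcome y_i
fit  <- cv.glmnet(feat, win, family = "binomial", alpha = 1)    # L1-penalized logistic regression
p.hat <- predict(fit, newx = feat, s = "lambda.min", type = "response")   # p_i(beta-hat)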
Classification Accuracy:
The measure of accuracy of the model is given by
$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN},$$
where TP: true positive, TN: true negative, FP: false positive, and FN:
false negative.
123
Results: 2014 FIFA World Cup
Highlights:
124
Results: 2018 FIFA World Cup (Quals)
Highlights:
125
Details: Last 10 Games:
126
Details: Coming 10 Games:
127
Concluding Remarks
On Google’s exercise:
On Argentina–Perú:
128
Simple Non-Linear Methods
Simple Non-Linear Methods
$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad \mathrm{Cov}(\varepsilon_i, \varepsilon_j) = \sigma^2 \mathbf{1}_{\{i=j\}}.$$
Examples:
1 Polynomial regression
2 Step Functions
3 Regression Splines
4 Smoothing Splines
5 Local Regression
129
Polynomial Regression
Model:
$$y_i = \beta_0 + \sum_{j=1}^d \beta_j x_i^j + \varepsilon_i, \qquad \mathrm{Cov}(\varepsilon_i, \varepsilon_j) = \sigma^2 \mathbf{1}_{\{i=j\}}. \quad (54)$$
Remark 10.1
The boundary effect creates large uncertainty at the borders. Note that
$$\mathrm{Var}[\hat f(x_0)] = \ell_0^\top \hat C\, \ell_0, \qquad \hat C_{i,j} = \mathrm{Cov}(\hat\beta_i, \hat\beta_j), \qquad \ell_0 = (1, x_0, \dots, x_0^d)^\top$$
130
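A minimal sketch of polynomial regression in R; the data are simulated and the degree d = 4 is illustrative. se.fit = TRUE returns the pointwise standard errors behind the wide bands at the borders.

set.seed(1)
x <- runif(100, 0, 1)
y <- sin(2 * pi * x) + rnorm(100, sd = 0.2)                  # illustrative non-linear signal
fit  <- lm(y ~ poly(x, 4))                                   # degree-4 polynomial regression
grid <- seq(0, 1, length.out = 100)
pr   <- predict(fit, newdata = data.frame(x = grid), se.fit = TRUE)
plot(x, y); lines(grid, pr$fit)                              # fitted curve
lines(grid, pr$fit + 2 * pr$se.fit, lty = 2)                 # approx. 95% band (wider at the borders)
lines(grid, pr$fit - 2 * pr$se.fit, lty = 2)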
Polynomial Regression
Source: [JO13]
131
Step Functions
Model:
$$y_i = \sum_{k=0}^K \beta_k C_k(x_i) + \varepsilon_i, \qquad \mathrm{Cov}(\varepsilon_i, \varepsilon_j) = \sigma^2 \mathbf{1}_{\{i=j\}}, \quad (55)$$
where $C_K(x_i) = I_{\{c_K \le x_i\}}$ at $k = K$, and $C_k(x_i) = I_{\{c_k \le x_i < c_{k+1}\}}$ otherwise.
132
Step Functions
Source: [JO13]
133
Regression Splines
Model:
$$y_i = \beta_0 + \sum_{k=1}^q \beta_k x_i^k + \sum_{k=1}^K \beta_{k+q}\, b(x_i, \varepsilon_k) + \epsilon_i, \qquad \mathrm{Cov}(\epsilon_i, \epsilon_j) = \sigma^2 \mathbf{1}_{\{i=j\}}, \quad (56)$$
where $\varepsilon_1, \dots, \varepsilon_K$ are the knots and
$$b(x_i, \varepsilon) = \begin{cases} (x_i - \varepsilon)^q & \text{if } x_i > \varepsilon \\ 0 & \text{otherwise.} \end{cases}$$
134
Regression Splines
Source: [JO13]
135
Regression Splines + Boundary Conditions = Natural splines
Remark 10.2
The boundary problems of regression splines can be solved using
additional boundary conditions that force the fit to be linear at the
boundaries. Regression splines for which such conditions hold are called
“natural splines”.
136
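A minimal sketch of regression splines and natural splines in R with the splines package (shipped with R); x and y are assumed to be the simulated vectors from the polynomial sketch above, and the knot placement is illustrative.

library(splines)
fit.bs <- lm(y ~ bs(x, knots = c(0.25, 0.5, 0.75)))   # cubic regression spline with 3 knots
fit.ns <- lm(y ~ ns(x, df = 4))                       # natural cubic spline (linear at the boundaries)
fit.ss <- smooth.spline(x, y, cv = TRUE)              # smoothing spline, lambda chosen by LOOCV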
Regression Splines + Boundary Conditions = Natural splines
Source: [JO13]
137
Regression Splines + Boundary Conditions = Natural splines
Source: [JO13]
138
Regression Splines + Boundary Conditions = Natural splines
Source: [JO13]
139
Smoothing Splines
140
Smoothing Splines
Remark 10.3
Note that in (57):
Remark 10.4
It can be shown that given q “ 3, the function g pxq that minimizes (57)
is a natural cubic spline. Hence, cubic smoothing splines do not have
boundary effects.
141
Smoothing Splines
Remark 10.5
It turns out that the computation of LOOCV is particularly fast for
smoothing splines. In fact
$$\mathrm{RSS}_{CV}(\lambda) = \sum_{i=1}^n \left(y_i - \hat g_\lambda^{(-i)}(x_i)\right)^2 = \sum_{i=1}^n \left[\frac{y_i - \hat g_\lambda(x_i)}{1 - [S_\lambda]_{i,i}}\right]^2, \quad (59)$$
142
Smoothing Splines
Source: [JO13]
143
Local Linear Regression
144
Local Linear Regression
145
Local Linear Regression
Source: [JO13]
146
Local Linear Regression
Source: [JO13]
147
GAMs
Generalized Additive Models
148
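A minimal sketch of a GAM in R using the mgcv package (shipped with R); the data are simulated and the choice of smooth terms s(x1), s(x2) is illustrative.

library(mgcv)
set.seed(1)
dat <- data.frame(x1 = runif(300), x2 = runif(300), x3 = rnorm(300))
dat$y <- sin(2 * pi * dat$x1) + dat$x2^2 + 0.5 * dat$x3 + rnorm(300, sd = 0.2)
fit <- gam(y ~ s(x1) + s(x2) + x3, data = dat)   # additive model: smooth in x1, x2, linear in x3
plot(fit, pages = 1)                             # estimated smooth functions f_j
summary(fit)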
Generalized Additive Models
Source: [JO13]
149
Advantages and Disadvantages of GAMs
Advantages:
Disadvantages:
150
Local Linear Regression
Source: [JO13]
151
Homework
152
Tree Based Methods
Regression Trees
Idea: Segment predictor space into rectangular regions and predict based
on the mean/mode of the response within the region.
153
Regression Trees
Source: [JO13]
154
Binary Splitting
Idea: given a region X of the feature space, select a variable j and a cut-off point s to make the split
155
Binary Splitting
Source: [JO13]
156
Complexity Cost & Tree Pruning
$$Q_m(T) = \frac{1}{N_m}\sum_{x_i \in R_m} (y_i - \hat c_m)^2, \qquad \hat c_m = \frac{1}{N_m}\sum_{x_i \in R_m} y_i$$
157
Building a Regression Tree
Algorithm 5
1 Use binary splitting to grow a large tree $T_0$ on the training data, stopping when each terminal node has fewer than a minimum number of observations.
2 Apply cost-complexity pruning to the large tree and obtain a sequence of best subtrees $T \subset T_0$ as functions of $\alpha$.
3 Use k-fold cross-validation and pick $\hat\alpha = \arg\min_\alpha$ MSE on the test folds.
4 Return the subtree from step 2 that corresponds to $\hat\alpha$.
158
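A minimal sketch of Algorithm 5 in R with the rpart package (shipped with R), which grows the tree and computes the cross-validated cost-complexity sequence internally; the data and the pruning choice are illustrative.

library(rpart)
set.seed(1)
dat <- data.frame(x1 = runif(400), x2 = runif(400))
dat$y <- ifelse(dat$x1 > 0.5, 2, 0) + dat$x2 + rnorm(400, sd = 0.2)
tree0 <- rpart(y ~ x1 + x2, data = dat, method = "anova")   # step 1: grow a large tree
printcp(tree0)                                              # steps 2-3: cp table with k-fold CV errors
best.cp <- tree0$cptable[which.min(tree0$cptable[, "xerror"]), "CP"]
pruned  <- prune(tree0, cp = best.cp)                       # step 4: subtree at the CV-selected penalty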
Classification Trees
$$\hat p_{km} = \frac{1}{N_m}\sum_{i:\, x_i \in R_m} I_{\{y_i = k\}}$$
159
Trees vs. Linear Models
160
Trees vs. Linear Models
Source: [JO13]
161
Advantages and Disadvantages of Trees
Advantages:
Disadvantages:
Remark 12.1
Recall that averaging a set of independent random variables reduces the variance. In fact, given independent r.v.s $z_1, \dots, z_n$, each with variance $\sigma^2$, the variance of $\bar z$ is $\sigma^2/n$.
163
Bagging
Idea of Bagging: take many training sets from the population, build
different trees for each set and average the predictions.
$$\hat f_{\mathrm{avg}}(X) = \frac{1}{B}\sum_{b=1}^B \hat f^{\,b}(X),$$
where $\hat f^{\,b}(X)$ is the prediction with training set $b$. However, since we do not have access to different training sets, one can use the bootstrap:
$$\hat f_{\mathrm{avg}}(X) = \frac{1}{B}\sum_{b=1}^B \hat f^{\,*b}(X),$$
164
Random Forests
Remark 12.2
Note that if m “ p, all predictors are considered at every split, thus we
are in the bagging case.
165
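A minimal sketch of bagging and random forests in R using the randomForest package; dat is assumed to be the simulated data frame from the tree sketch above, and the mtry values are illustrative (mtry = p reproduces bagging, per Remark 12.2).

library(randomForest)
set.seed(1)
bag <- randomForest(y ~ x1 + x2, data = dat, mtry = 2, ntree = 500)   # mtry = p = 2: bagging
rf  <- randomForest(y ~ x1 + x2, data = dat, mtry = 1, ntree = 500)   # mtry < p: random forest
importance(rf)                                                        # variable importance measures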
Boosting
Idea: Take many small modified versions of the data set, and grow trees
sequentially, i.e. each tree grows using info. from previously grown trees.
166
Boosting
Algorithm 6 (Boosting)
3 Update the residuals: $r_i \leftarrow r_i - \lambda \hat f^{(b)}(x_i)$
167
Boosting
Tuning parameters:
168
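A minimal sketch of boosting in R with the gbm package; dat is again the simulated data frame above, and the tuning parameters (number of trees B, depth d, shrinkage lambda) are illustrative.

library(gbm)
set.seed(1)
boost <- gbm(y ~ x1 + x2, data = dat, distribution = "gaussian",
             n.trees = 1000,           # B: number of trees
             interaction.depth = 2,    # d: depth of each tree
             shrinkage = 0.01)         # lambda: learning rate
summary(boost)                         # relative influence of each predictor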
Support Vector Machines
Hyperplanes
$$\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p = 0, \quad (63)$$
$$\beta_0 + \beta_1 x_{i,1} + \cdots + \beta_p x_{i,p} > 0 \quad \text{if } y_i = +1,$$
$$\beta_0 + \beta_1 x_{i,1} + \cdots + \beta_p x_{i,p} < 0 \quad \text{if } y_i = -1,$$
or equivalently
170
Classification Using a Separating Hyperplane
Source: [JO13]
171
Maximal Margin Classifier
172
Maximal Margin Classifier
Source: [JO13]
173
Maximal Margin Classifier
Source: [JO13]
174
Maximal Margin Classifier
175
Support Vector Hyperplane
Source: [JO13]
177
Support Vector Hyperplane
178
Support Vector Hyperplane
Source: [JO13]
179
Support Vector Hyperplane
where the last expression follows from $\alpha_i = \delta_i y_i$ and the fact that $\alpha_i \neq 0$ only for the support vectors.
180
Support Vector Machines
1 Polynomial kernel: $K(x, x_i) = \left(1 + x^\top x_i\right)^d$, $\quad d > 0$
2 Radial kernel: $K(x, x_i) = \exp\left(-\gamma\,\|x - x_i\|_2^2\right)$, $\quad \gamma > 0$.
181
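A minimal sketch of SVMs in R with the e1071 package; the simulated data, the radial kernel, and the grids of cost/gamma values are illustrative.

library(e1071)
set.seed(1)
X <- matrix(rnorm(200 * 2), ncol = 2)
y <- factor(ifelse(X[, 1]^2 + X[, 2]^2 > 1.5, 1, -1))       # non-linearly separable classes
dat <- data.frame(X, y)
fit <- svm(y ~ ., data = dat, kernel = "radial", gamma = 1, cost = 1)
tuned <- tune(svm, y ~ ., data = dat, kernel = "radial",
              ranges = list(cost = c(0.1, 1, 10), gamma = c(0.5, 1, 2)))   # CV over the parameters
summary(tuned)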
Support Vector Hyperplane
Source: [JO13]
182
SVMs: More than Two Classes
Consider there are $k$ possible classes for the response. We can use two approaches:
1 One vs. One. Do SVM by pairs in the $\binom{k}{2}$ possible cases. Do voting at the observation level and classify accordingly.
2 One vs. All.
For all observations in class 1, write $+1$ as the response, and $-1$ for all the other classes. Do SVM and compute $\beta_{0,1}, \beta_{1,1}, \dots, \beta_{p,1}$.
Repeat the previous step for classes $2, 3, \dots, k$ and collect $\beta_{0,i}, \beta_{1,i}, \dots, \beta_{p,i}$, $i = 1, \dots, k$, in each case.
For an out-of-sample observation $x^*$, compute
$$f(x^*; \beta_i) = \beta_{0,i} + {x^*}^\top \beta_i, \qquad i = 1, \dots, k,$$
and select the case $i$ for which $f(x^*; \beta_i)$ is the largest.
183
Homework
184