
BIG DATA

From Linear regression models to Deep Learning

Lecturer: Lucrezia Noli


Lesson 4
Classification

2 questions:
• An object is A or B → binary classification
  • 1 output variable that can take on two values
• An object is A, B, C, …, N → multi-class classification
  • N output variables, each representing a class
Remember…

Many of the problems we face can be expressed in different ways:

• CHURNERS:
  • The client will be a churner (T/F)
  • The client will leave for competitor A, B, or C (multi-class)

We need to choose which kind of scenario we are in beforehand, to then set up the model accordingly.
Logistic Regression

[Diagram: inputs x1, x2, x3, x4 with weights β1, β2, β3, β4 and a bias β0 feed a single node producing 1 binary output y1: is the output y1 or not?]
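As a reminder of what the diagram encodes (the standard logistic regression formula, not written out on the slide), the output probability is:

```latex
\[
P(y_1 = 1 \mid \mathbf{x}) = \sigma\!\left(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4\right),
\qquad \sigma(z) = \frac{1}{1 + e^{-z}}
\]
```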
Logistic Regression with SoftMax

[Diagram: inputs x1, x2, x3, x4 connected through weights w1,1, w1,2, w1,3, … and biases β1, β2, β3 to three output nodes, giving a 3-class output: y1 "Is it class y1?", y2 "Is it class y2?", y3 "Is it class y3?".]
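A minimal sketch of this setup, assuming scikit-learn is available; the feature matrix X and labels y are made-up placeholders standing in for real data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up data: 4 input features (x1..x4), 3 possible classes (y1, y2, y3)
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))
y = rng.integers(0, 3, size=150)

# With the default lbfgs solver, a multiclass problem is fit with a
# multinomial (softmax) loss: one weight vector and bias per class.
model = LogisticRegression(max_iter=1000)
model.fit(X, y)

print(model.coef_.shape)            # (3, 4): one weight per class and feature
print(model.predict_proba(X[:2]))   # softmax probabilities, each row sums to 1
print(model.predict(X[:2]))         # the class with the highest probability
```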
Model evaluation

• A supervised model's performance can be evaluated by comparing its predictions with the true labels, or true values.
• This evaluation must be done on the test set, which is kept separate from the training set…
• Remember why??
Test Set and Training Set

[Diagram: the dataset is split into a TRAINING SET, on which the algorithm is trained, and a TEST SET, held out for evaluation.]
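A minimal sketch of that split using scikit-learn's train_test_split; the data and the 70/30 proportion are arbitrary placeholders, not from the slides:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the full labelled dataset
X = np.arange(20).reshape(10, 2)   # 10 examples, 2 features
y = np.array([0, 1] * 5)           # binary labels

# Hold out 30% of the data as the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Fit on the training set only; evaluate on the untouched test set
print(X_train.shape, X_test.shape)   # (7, 2) (3, 2)
```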
Confusion Matrix

[Figure: a 2×2 confusion matrix, with the true label on one axis and the predicted label on the other.]
How to evaluate binary class models
• By comparing the predicted labels with the true ones, we derive 4 different types of counts:
  • TP: true positives, instances predicted as belonging to the positive class, which actually belong to that class
  • TN: true negatives, instances predicted as belonging to the negative class, which actually belong to that class
  • FP: false positives (or Type I Errors), instances predicted as belonging to the positive class, which actually belong to the negative class
  • FN: false negatives (or Type II Errors), instances predicted as belonging to the negative class, which actually belong to the positive class
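A minimal sketch of how these four counts can be read off a confusion matrix, assuming scikit-learn and some made-up true/predicted labels:

```python
from sklearn.metrics import confusion_matrix

# Made-up true and predicted binary labels (1 = positive class)
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

# With labels=[0, 1] the matrix is [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")   # TP=4, TN=4, FP=1, FN=1
```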
Evaluation metrics
• From the counts calculated before, we obtain:
• Accuracy = (TP + TN) / (P + N)
  - Be careful: it's a very general metric, which doesn't make any distinction between TP and TN
  - This is a problem if you are more interested in one of the two classes (as is often the case)
• Precision (Positive Predictive Value) = TP / (TP + FP)
• Sensitivity (or Recall, or True Positive Rate) = TP / (TP + FN) = TP / P
• Specificity (or True Negative Rate) = TN / (TN + FP) = TN / N
• F1 = 2 · Precision · Recall / (Precision + Recall)
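A minimal sketch computing these metrics from the TP/TN/FP/FN counts of the hypothetical confusion matrix above:

```python
# Counts from the made-up confusion matrix above
tp, tn, fp, fn = 4, 4, 1, 1

accuracy    = (tp + tn) / (tp + tn + fp + fn)
precision   = tp / (tp + fp)
recall      = tp / (tp + fn)          # sensitivity / true positive rate
specificity = tn / (tn + fp)          # true negative rate
f1          = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, specificity, f1)
```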
Metrics other than accuracy
• Why have I said many times that «accuracy» is too general a metric, and that we should base our model evaluation on other metrics too?
• Think about this case:
  • I'm trying to predict fraud. In my dataset, I have 99 cases of normal transactions and 1 case of fraud.
  • I run my model, where fraud is the target variable.
  • The model accuracy is 99%.

• Is this a good model?


Metrics other than accuracy
• Actually, no. The model correctly classified all 99 instances I was NOT interested in, but failed to recognize the single fraud I wanted to catch.

ACCURACY = (TRUE POSITIVES + TRUE NEGATIVES) / ALL

Accuracy doesn't make any distinction between how good the model is at predicting the positive class (in this case the fraud) and the negative class (in this case the non-fraud).
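A minimal sketch of this failure mode, using made-up labels in which a trivial "always predict normal" model reaches 99% accuracy while catching zero frauds:

```python
from sklearn.metrics import accuracy_score, recall_score

# 99 normal transactions (0) and 1 fraud (1)
y_true = [0] * 99 + [1]
# A useless model that always predicts "normal"
y_pred = [0] * 100

print(accuracy_score(y_true, y_pred))             # 0.99
print(recall_score(y_true, y_pred, pos_label=1))  # 0.0 - no fraud caught
```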
Exercise
Decision Trees
Decision Trees
• Decision trees are very broadly used in predictive analytics projects
  • Mainly for classification problems
• Based on a flexible technique, which makes them work efficiently in many different situations
• The output is very readable, since it's visually represented as a tree-like structure
• 3 types of trees:
  • classification trees, for classification problems
  • regression trees, used for regression problems
  • classification & regression trees, a mix of both, not very often used in real life
Decision Trees

• Algorithms:
• ID3 (Iterative Dichotomiser 3)
• C4.5 (successor of ID3)
• C5.0 (successor of C4.5)
• CART (Classification And Regression Tree)
• CHAID (CHi-squared Automatic Interaction Detector).
• MARS (Multivariate Adaptive Regression Splines): facilitates the use of numerical variables
• Conditional Inference Trees: based on statistical methods, they choose splits with non-parametric tests corrected for multiple testing. This makes them effective at limiting overfitting, generally unbiased, and they typically don't require pruning.

ID3 and CART were both invented independently (between 1970 and 1980), but work in similar ways.
Will John play tennis?

New data: Outlook: Rain, Humidity: High, Wind: Weak

Will John play?
Decision tree split

• Overcast is a pure split: maximum information
  → We are 100% sure he'll play
• Strong (wind) is an impure split: minimum information (a 50-50 split)
  → We are totally uncertain
Entropy / Information

In information theory, entropy measures the degree of information in terms of the «purity» of a group of instances.

Knowledge, or information, and entropy are opposites. Entropy is a proxy of «chaos»: high entropy means a lot of chaos in the data.
Entropy / Information
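As a hedged reconstruction of the standard definitions that the worked example below relies on (the slide's own numbers may have been computed slightly differently):

```latex
\[
\mathrm{Info}(T) = -\sum_{i} \frac{\mathrm{freq}(C_i)}{|T|}\,\log_2\frac{\mathrm{freq}(C_i)}{|T|},
\qquad
\mathrm{Info}_X(T) = \sum_{j} \frac{|T_j|}{|T|}\,\mathrm{Info}(T_j),
\qquad
\mathrm{Gain}(X) = \mathrm{Info}(T) - \mathrm{Info}_X(T)
\]
```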
Example

ID   Marital status   Gender   Gross annual income   BUYER
1    Married          M        €35,000               YES
2    Single           F        €47,000               NO
3    Married          F        €58,000               YES
4    Single           M        €31,000               NO
5    Separated        M        €70,000               YES
6    Married          F        €27,000               NO
7    Single           F        €36,000               YES
8    Separated        M        €50,000               NO
9    Married          F        €65,000               YES
10   Single           M        €18,000               NO
11   Married          M        €40,000               ??

Freq(C_YES) = 5, Freq(C_NO) = 5, Count(T) = 10
Info(T) = −(5/10 · log2(5/10) + 5/10 · log2(5/10)) = 1

We now calculate for the subsets induced by:
• income (2 subsets: ≤ 35,000 or > 35,000)
• marital status (3 subsets: married, single, separated)
• gender (2 subsets: M and F)

Info_Income(T) = 1.69          Gain(Income) = 1 − 1.69 = −0.69   ← highest value
Info_MaritalStatus(T) = 2.74   Gain(Marital status) = 1 − 2.74 = −1.74
Info_Gender(T) = 1.94          Gain(Gender) = 1 − 1.94 = −0.94
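A minimal sketch that recomputes Info(T) and the gains for this table using the standard weighted definitions; the dataset is transcribed from the slide, and the resulting gains may not match the slide's figures, which appear to sum the subset entropies without weighting:

```python
from math import log2

# (marital_status, gender, income, buyer) - records 1-10 from the slide
records = [
    ("Married",   "M", 35000, "YES"), ("Single",    "F", 47000, "NO"),
    ("Married",   "F", 58000, "YES"), ("Single",    "M", 31000, "NO"),
    ("Separated", "M", 70000, "YES"), ("Married",   "F", 27000, "NO"),
    ("Single",    "F", 36000, "YES"), ("Separated", "M", 50000, "NO"),
    ("Married",   "F", 65000, "YES"), ("Single",    "M", 18000, "NO"),
]

def entropy(labels):
    """Info(T) = -sum p_i * log2(p_i) over the class proportions."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * log2(p) for p in probs)

def gain(records, key):
    """Gain = Info(T) minus the weighted entropy of the subsets induced by `key`."""
    labels = [r[-1] for r in records]
    groups = {}
    for r in records:
        groups.setdefault(key(r), []).append(r[-1])
    weighted = sum(len(g) / len(records) * entropy(g) for g in groups.values())
    return entropy(labels) - weighted

print("Info(T) =", entropy([r[-1] for r in records]))      # 1.0
print("income  ", gain(records, lambda r: r[2] <= 35000))
print("marital ", gain(records, lambda r: r[0]))
print("gender  ", gain(records, lambda r: r[1]))
```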
Decision Tree: algorithm

• We need to find which variable among those included in the model gives the purest splits. The split that maximizes the gain becomes the first level of the tree, below the full set (the root).
• After each split we apply the same reasoning again, starting from all available variables (including the ones we already used). The process is repeated for each of the subsets that still contain elements belonging to different classes.
• The process stops when the subsets contain elements of only one class, or when further splitting brings no improvement in accuracy.
Pruning

• Letting the tree grow indefinitely can lead to overfitting problems
• Pruning = the process of removing nodes from the tree to limit its complexity
• Pre-pruning: the limitation is applied while the tree is being built, by imposing a threshold on its growth
  • e.g. a minimum number of elements required to make a further split
• It is not easy to find a threshold that avoids overfitting while still modelling the data adequately
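A minimal sketch of pre-pruning with scikit-learn's DecisionTreeClassifier, where max_depth and min_samples_split act as the growth thresholds mentioned above; the dataset and parameter values are arbitrary examples:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pre-pruning: stop growing when a node is too deep or too small
tree = DecisionTreeClassifier(
    max_depth=3,            # limit on tree depth
    min_samples_split=10,   # minimum elements required to attempt a further split
    random_state=0,
)
tree.fit(X_train, y_train)
print("test accuracy:", tree.score(X_test, y_test))
```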
Decision Trees PROs & CONs

✓ Flexible technique, suited to many different situations
✓ Very clear output, since it is represented (also visually) in the form of a tree

✗ Instability: small changes in the input can cause large variations in the tree
✗ Tree complexity: sometimes the resulting trees are particularly complex (the problem can be mitigated with pruning techniques)
✗ Precision and recall are not optimal (they can be improved with boosting or random forest techniques)
Exercise: classification with a decision tree
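A minimal sketch of what such an exercise could look like, fitting a classification tree on the buyer table from the earlier example and classifying record 11; the one-hot encoding and parameter choices are my own assumptions, not from the slides:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Training records 1-10 from the buyer example
df = pd.DataFrame({
    "marital": ["Married", "Single", "Married", "Single", "Separated",
                "Married", "Single", "Separated", "Married", "Single"],
    "gender":  ["M", "F", "F", "M", "M", "F", "F", "M", "F", "M"],
    "income":  [35000, 47000, 58000, 31000, 70000, 27000, 36000, 50000, 65000, 18000],
    "buyer":   ["YES", "NO", "YES", "NO", "YES", "NO", "YES", "NO", "YES", "NO"],
})

# One-hot encode the categorical attributes so the tree can use them
X = pd.get_dummies(df[["marital", "gender", "income"]])
y = df["buyer"]

tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)

# Record 11: Married, M, €40,000 - the "??" row of the example
new = pd.get_dummies(pd.DataFrame(
    {"marital": ["Married"], "gender": ["M"], "income": [40000]}
)).reindex(columns=X.columns, fill_value=0)
print(tree.predict(new))
```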
