
BIG DATA

From Linear regression models to Deep Learning

Lecturer: Lucrezia Noli


Lesson 4
Classification

2 questions:
• An object is A or B → binary classification
  • 1 output variable that can take on two values
• An object is A, B, C, …, N → multi-class classification
  • N output variables, each representing a class
Remember…

Many of the problems we face can be expressed in different ways:

• CHURNERS:
  • The client will be a churner (T/F)
  • The client will leave for competitor A, B, or C (multi-class)

We need to choose which kind of scenario we are in beforehand, to then set up the model accordingly.
Logistic Regression

[Diagram: inputs x1, x2, x3, x4 with weights β1, β2, β3, β4 and a bias β0 feed a single node producing 1 binary output y1: is the output y1 or not?]
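As a reminder of what the diagram encodes (the standard logistic regression formula, not written out on the slide), the output probability is:

```latex
\[
P(y_1 = 1 \mid \mathbf{x}) = \sigma\!\left(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4\right),
\qquad \sigma(z) = \frac{1}{1 + e^{-z}}
\]
```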
Logistic Regression with SoftMax

[Diagram: inputs x1, x2, x3, x4 connected through weights w1,1, w1,2, w1,3, … and biases β1, β2, β3 to three output nodes, giving a 3-class output: y1 "Is it class y1?", y2 "Is it class y2?", y3 "Is it class y3?".]
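A minimal sketch of this setup, assuming scikit-learn is available; the feature matrix X and labels y are made-up placeholders standing in for real data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up data: 4 input features (x1..x4), 3 possible classes (y1, y2, y3)
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))
y = rng.integers(0, 3, size=150)

# With the default lbfgs solver, a multiclass problem is fit with a
# multinomial (softmax) loss: one weight vector and bias per class.
model = LogisticRegression(max_iter=1000)
model.fit(X, y)

print(model.coef_.shape)            # (3, 4): one weight per class and feature
print(model.predict_proba(X[:2]))   # softmax probabilities, each row sums to 1
print(model.predict(X[:2]))         # the class with the highest probability
```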
Model evaluation

• A supervised model's performance can be evaluated by comparing its predictions with the true labels, or true values.
• This evaluation must be done on the test set, which is kept separate from the training set…
• Remember why??
Test Set and Training Set

[Diagram: the dataset is split into a TRAINING SET, on which the algorithm is trained, and a TEST SET, held out for evaluation.]
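A minimal sketch of that split using scikit-learn's train_test_split; the data and the 70/30 proportion are arbitrary placeholders, not from the slides:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the full labelled dataset
X = np.arange(20).reshape(10, 2)   # 10 examples, 2 features
y = np.array([0, 1] * 5)           # binary labels

# Hold out 30% of the data as the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Fit on the training set only; evaluate on the untouched test set
print(X_train.shape, X_test.shape)   # (7, 2) (3, 2)
```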
Confusion Matrix

[Figure: a 2×2 confusion matrix, with the true label on one axis and the predicted label on the other.]
How to evaluate binary class models
• By comparing the predicted labels with the true ones, we derive 4 different types of counts:
  • TP: true positives, instances predicted as belonging to the positive class, which actually belong to that class
  • TN: true negatives, instances predicted as belonging to the negative class, which actually belong to that class
  • FP: false positives (or Type I Errors), instances predicted as belonging to the positive class, which actually belong to the negative class
  • FN: false negatives (or Type II Errors), instances predicted as belonging to the negative class, which actually belong to the positive class
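A minimal sketch of how these four counts can be read off a confusion matrix, assuming scikit-learn and some made-up true/predicted labels:

```python
from sklearn.metrics import confusion_matrix

# Made-up true and predicted binary labels (1 = positive class)
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

# With labels=[0, 1] the matrix is [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")   # TP=4, TN=4, FP=1, FN=1
```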
Evaluation metrics
• From the counts calculated before, we obtain:
• Accuracy = (TP + TN) / (P + N)
  - Be careful: it's a very general metric, which doesn't make any distinction between TP and TN
  - This is a problem if you are more interested in one of the two classes (as is often the case)
• Precision (Positive Predictive Value) = TP / (TP + FP)
• Sensitivity (or Recall, or True Positive Rate) = TP / (TP + FN) = TP / P
• Specificity (or True Negative Rate) = TN / (TN + FP) = TN / N
• F1 = 2 · Precision · Recall / (Precision + Recall)
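A minimal sketch computing these metrics from the TP/TN/FP/FN counts of the hypothetical confusion matrix above:

```python
# Counts from the made-up confusion matrix above
tp, tn, fp, fn = 4, 4, 1, 1

accuracy    = (tp + tn) / (tp + tn + fp + fn)
precision   = tp / (tp + fp)
recall      = tp / (tp + fn)          # sensitivity / true positive rate
specificity = tn / (tn + fp)          # true negative rate
f1          = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, specificity, f1)
```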
Metrics other than accuracy
• Why have I said many times that «accuracy» is too general a metric, and that we should base our model evaluation on other metrics too?
• Think about this case:
  • I'm trying to predict fraud. In my dataset, I have 99 cases of normal transactions and 1 case of fraud.
  • I run my model, where fraud is the target variable.
  • The model accuracy is 99%.

• Is this a good model?


Metrics other than accuracy
• Actually, no. The model correctly classified all 99 instances I was NOT interested in, but failed to recognize the single fraud I wanted to catch.

ACCURACY = (TRUE POSITIVES + TRUE NEGATIVES) / ALL

Accuracy doesn't make any distinction between how good the model is at predicting the positive class (in this case the fraud) and the negative class (in this case the non-fraud).
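A minimal sketch of this failure mode, using made-up labels in which a trivial "always predict normal" model reaches 99% accuracy while catching zero frauds:

```python
from sklearn.metrics import accuracy_score, recall_score

# 99 normal transactions (0) and 1 fraud (1)
y_true = [0] * 99 + [1]
# A useless model that always predicts "normal"
y_pred = [0] * 100

print(accuracy_score(y_true, y_pred))             # 0.99
print(recall_score(y_true, y_pred, pos_label=1))  # 0.0 - no fraud caught
```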
Exercise
Decision Trees
Decision Trees
• Decision trees are very broadly used in predictive analytics projects
  • Mainly for classification problems
• Based on a flexible technique, which makes them work efficiently in many different situations
• The output is very readable, since it's visually represented as a tree-like structure
• 3 types of trees:
  • classification trees, for classification problems
  • regression trees, used for regression problems
  • classification & regression trees, a mix of both, not very often used in real life
Decision Trees

• Algorithms:
• ID3 (Iterative Dichotomiser 3)
• C4.5 (successor of ID3)
• C5.0 (successor of C4.5)
• CART (Classification And Regression Tree)
• CHAID (CHi-squared Automatic Interaction Detector).
• MARS (Multivariate Adaptive Regression Splines): facilitates the use of numerical variables
• Conditional Inference Trees: based on statistical methods, they choose splits with non-parametric tests corrected for multiple testing. This makes them effective at limiting overfitting, generally unbiased, and they typically don't require pruning.

ID3 and CART were both invented independently (between 1970 and 1980), but work in similar ways.
Will John play tennis?

New data: Outlook: Rain, Humidity: High, Wind: Weak

Will John play?
Decision tree split

• Overcast is a pure split: maximum information
  → We are 100% sure he'll play
• Strong (wind) is an impure split: minimum information (a 50-50 split)
  → We are totally uncertain
Entropy / Information

In information theory, entropy measures the degree of information in terms of the «purity» of a group of instances.

Knowledge, or information, and entropy are opposites. Entropy is a proxy of «chaos»: high entropy means a lot of chaos in the data.
Entropy / Information
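As a hedged reconstruction of the standard definitions that the worked example below relies on (the slide's own numbers may have been computed slightly differently):

```latex
\[
\mathrm{Info}(T) = -\sum_{i} \frac{\mathrm{freq}(C_i)}{|T|}\,\log_2\frac{\mathrm{freq}(C_i)}{|T|},
\qquad
\mathrm{Info}_X(T) = \sum_{j} \frac{|T_j|}{|T|}\,\mathrm{Info}(T_j),
\qquad
\mathrm{Gain}(X) = \mathrm{Info}(T) - \mathrm{Info}_X(T)
\]
```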
Example

ID   Marital status   Gender   Gross annual income   BUYER
1    Married          M        €35,000               YES
2    Single           F        €47,000               NO
3    Married          F        €58,000               YES
4    Single           M        €31,000               NO
5    Separated        M        €70,000               YES
6    Married          F        €27,000               NO
7    Single           F        €36,000               YES
8    Separated        M        €50,000               NO
9    Married          F        €65,000               YES
10   Single           M        €18,000               NO
11   Married          M        €40,000               ??

Freq(C_YES) = 5, Freq(C_NO) = 5, Count(T) = 10
Info(T) = −(5/10 · log2(5/10) + 5/10 · log2(5/10)) = 1

We now calculate for the subsets induced by:
• income (2 subsets: ≤ 35,000 or > 35,000)
• marital status (3 subsets: married, single, separated)
• gender (2 subsets: M and F)

Info_Income(T) = 1.69          Gain(Income) = 1 − 1.69 = −0.69   ← highest value
Info_MaritalStatus(T) = 2.74   Gain(Marital status) = 1 − 2.74 = −1.74
Info_Gender(T) = 1.94          Gain(Gender) = 1 − 1.94 = −0.94
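A minimal sketch that recomputes Info(T) and the gains for this table using the standard weighted definitions; the dataset is transcribed from the slide, and the resulting gains may not match the slide's figures, which appear to sum the subset entropies without weighting:

```python
from math import log2

# (marital_status, gender, income, buyer) - records 1-10 from the slide
records = [
    ("Married",   "M", 35000, "YES"), ("Single",    "F", 47000, "NO"),
    ("Married",   "F", 58000, "YES"), ("Single",    "M", 31000, "NO"),
    ("Separated", "M", 70000, "YES"), ("Married",   "F", 27000, "NO"),
    ("Single",    "F", 36000, "YES"), ("Separated", "M", 50000, "NO"),
    ("Married",   "F", 65000, "YES"), ("Single",    "M", 18000, "NO"),
]

def entropy(labels):
    """Info(T) = -sum p_i * log2(p_i) over the class proportions."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * log2(p) for p in probs)

def gain(records, key):
    """Gain = Info(T) minus the weighted entropy of the subsets induced by `key`."""
    labels = [r[-1] for r in records]
    groups = {}
    for r in records:
        groups.setdefault(key(r), []).append(r[-1])
    weighted = sum(len(g) / len(records) * entropy(g) for g in groups.values())
    return entropy(labels) - weighted

print("Info(T) =", entropy([r[-1] for r in records]))      # 1.0
print("income  ", gain(records, lambda r: r[2] <= 35000))
print("marital ", gain(records, lambda r: r[0]))
print("gender  ", gain(records, lambda r: r[1]))
```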
Decision Tree: algorithm

• We need to find which variable among those included in the model gives the purest splits. The split that maximizes the gain becomes the first level of the tree, below the full set (the root).
• After each split we apply the same reasoning again, starting from all available variables (including the ones we already used). The process is repeated for each of the subsets that still contain elements belonging to different classes.
• The process stops when the subsets contain elements of only one class, or when further splitting brings no improvement in accuracy.
Pruning

• Letting the tree grow indefinitely can lead to overfitting problems
• Pruning = the process of removing nodes from the tree to limit its complexity
• Pre-pruning: the limitation is applied while the tree is being built, by imposing a threshold on its growth
  • e.g. a minimum number of elements required to make a further split
• It is not easy to find a threshold that avoids overfitting while still modelling the data adequately
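A minimal sketch of pre-pruning with scikit-learn's DecisionTreeClassifier, where max_depth and min_samples_split act as the growth thresholds mentioned above; the dataset and parameter values are arbitrary examples:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pre-pruning: stop growing when a node is too deep or too small
tree = DecisionTreeClassifier(
    max_depth=3,            # limit on tree depth
    min_samples_split=10,   # minimum elements required to attempt a further split
    random_state=0,
)
tree.fit(X_train, y_train)
print("test accuracy:", tree.score(X_test, y_test))
```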
Decision Trees PROs & CONs

✓ Flexible technique, suited to many different situations
✓ Very clear output, since it is represented (also visually) in the form of a tree

✗ Instability: small changes in the input can cause large variations in the tree
✗ Tree complexity: sometimes the resulting trees are particularly complex (the problem can be mitigated with pruning techniques)
✗ Precision and recall are not optimal (they can be improved with boosting or random forest techniques)
Exercise: classification with a decision tree
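A minimal sketch of what such an exercise could look like, fitting a classification tree on the buyer table from the earlier example and classifying record 11; the one-hot encoding and parameter choices are my own assumptions, not from the slides:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Training records 1-10 from the buyer example
df = pd.DataFrame({
    "marital": ["Married", "Single", "Married", "Single", "Separated",
                "Married", "Single", "Separated", "Married", "Single"],
    "gender":  ["M", "F", "F", "M", "M", "F", "F", "M", "F", "M"],
    "income":  [35000, 47000, 58000, 31000, 70000, 27000, 36000, 50000, 65000, 18000],
    "buyer":   ["YES", "NO", "YES", "NO", "YES", "NO", "YES", "NO", "YES", "NO"],
})

# One-hot encode the categorical attributes so the tree can use them
X = pd.get_dummies(df[["marital", "gender", "income"]])
y = df["buyer"]

tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)

# Record 11: Married, M, €40,000 - the "??" row of the example
new = pd.get_dummies(pd.DataFrame(
    {"marital": ["Married"], "gender": ["M"], "income": [40000]}
)).reindex(columns=X.columns, fill_value=0)
print(tree.predict(new))
```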
