
SEP

Solvay Entraide & Publications

Thierry VAN DE MERCKT GEST-S430

Business intelligence & data science
Lecture notes
Ysaline Hermans

INGE 4

Université libre de Bruxelles 2022-2023


Cercle Solvay
BE Solvay

Solvay Entraide & Publications

SEP welcomes you every day at lunchtime, between 12:00 and 14:00, with a smile in our splendid room located in the Pint'House. Our goal is to provide you with lecture notes, summaries, solved exercises and past exams.

At the heart of SEP's values are mutual aid and sharing within the student community. Our summaries are made by students, for students. It is therefore important for us to promote our values of mutual aid, commitment and sharing. Yes, we are here for you, but we are nothing without you. So play an active part and share your own summaries, lecture notes and solved exercises with your year and the following ones! Your SEP delegates are at your disposal to collect anything you think should be published.

The SEP committee is a great team that works all year long to offer you the best summaries, order the necessary copies and manage the stock of syllabi, which are just waiting for you. It also runs the lunchtime desk every day. If you want to be part of this super team, do not hesitate to come and see us!

For any question, send us a message on our Facebook page: SEP ~ Solvay Entraide & Publications (feel free to like it as well, to be the first to hear all the news).

See you soon,

Victoria, Lucia & Amélie


Your SEP delegates 2022-2023

BI & Data Science

1. Introduction to AI, data science & machine learning

Data has great value today. Companies such as Google and Facebook monetise huge amounts of data.
Examples: 5 billion searches/day (Google), 1 million transactions/hour (Walmart).
Nowadays, however, the use of data is strictly regulated, for example through the GDPR.
Many transactions and data do not come from online (internet) sources but from physical stores.
HTML is a declarative language (like SQL) that tells the browser how you want the page to be displayed. It is difficult to extract data from HTML because you need to understand the code used.

Big Data is a term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process within a tolerable elapsed time. Big data is unstructured, really big (>1000 terabytes) and comes from real-time streams. Ex: web logs at Google, Twitter data streams, etc.
Google developed distributed systems to manage and store big data because existing systems could not.

The big change: the "data-fication" of our world (through digitalization). Nowadays, data no longer comes only from companies but also from consumers (smartphones).

The 3 data paradigms:


● Data in our IT systems: data is structured, reflects what happens in the real world, is relational (transaction based), optimised to support our business processes (sales, supply, ...) and easily accessible (SQL). Main software: Oracle, Microsoft SQL Server, IBM DB2 and MySQL (OS = open source). We don't want the same information twice, and we don't want to change something in several places when a piece of data changes —> 3rd normal form. Why? Because updates are frequent, so they need to be quick.
● Data in our warehouses: data is structured, a copy of the operational data, dimensional (analytically based), creates the memory of our companies, optimised to support business decisions (queries, aggregations, etc.), and extremely easy to use (query-tool friendly). It contains historical information and the semantics are clear (easy to understand what you have). Main software: Teradata, Microsoft SQL DWH, Amazon Redshift, Apache HBase (OS).
● Big data (lake): data might not be structured, new types of data, flat files at the bottom, optimised for fast storage, distributed computing, data is processed where it is, many cheap CPUs, fault tolerant, open source, accessed mainly through programming. We want to process data where it is because it is too heavy to be moved. Main software: Hadoop HDFS (OS), Google BigTable, Amazon Dynamo, MongoDB (OS).
Open initiative: new software had to be built to manage big data. Google decided that the software developed would be open to everyone —> fast pace of development because a lot of people began to work on it —> change of paradigm: Google does not ask you to pay, but it generates a lot of revenue since a lot of people use it.
Git: a way of storing and sharing code collectively.

AI & Analytics
Analytics is the “mind set” and processes of making sense out of data, in order to get
non-trivial & actionable business insights to use in daily decisions.
Data Science designates the process of solving business issues using data. It starts from the
business problem formulation and creates a process from scratch to solve it.
Machine Learning is a branch of Artificial Intelligence that builds algorithms able to "learn" predictive models from facts. Typical problems: recognition, classification based on labels.
Labelled examples: look at what happened before, understand the signals, and model them to do better in the future. For example, based on a set of labelled images, I want to recognise the known faces in unlabelled images.
Deep Learning designates algorithms efficient to learn to recognize images, sounds and
words. It needs a huge number of observations.
Stories: telematics data used to identify a driver signature; Tesla was among the first to use AI in real life with its automated cars.

What is AI?
The ultimate goal of AI is to build robots that are able to do what humans can do (Atlas robot).
AI (perception flow) ≠ machine learning ≠ big data (flow of unstructured data that you have
to manage but it doesn’t mean that you do AI)

Perception: the step of understanding what is happening (cameras, listening, loading from the internet… perceiving things from the internet).

Evolution of AI:


Since 2016 + machine learning: training a neural network consumes a lot of energy (problem: if AI spreads everywhere, energy will really become an issue).
General AI: being able to train computers very quickly with only a few examples —> we need to move towards that to consume less energy (but we are still far from it).
A neural network can be very efficient but never has consciousness/meaning of life (no notion of what a table is, etc.).
Google Assistant: a lady on the phone with it does not know that it is a computer… this needs to be resolved before we use it (from a legal point of view).

What about AI & BI?


AI is more than data science.

BI is very close to the mindset of analytics; it uses data like analytics does, but adds data management as an explicit task.
BI designates the processes and technologies that organize data into meaningful business information used to empower strategic, tactical, and operational decision-making.
It is mainly concerned with data management, data aggregation and organisation, and
efficient insights reporting, including data visualisation.

Data & analytics: intimate relationship!


There is a big dependency between AI and data; data is more important than technology because without data, you don't get any result! Analytics and AI feed on data.
Example: promise - to make the best cup of tea
1) Need to have data (need to find tea leaves); is the right type of data available?
2) Collect and store the data (collect the leaves).
3) Transport the data from all devices to servers (logistics; make the tea available) & data governance = manage devices and data (what data corresponds to what?). Ex: if a truck shows a high probability of breakdown —> send it to maintenance.
(Big) data investment depends on analytics to get a return. Do not invest in data without use cases!
Analytics: the moment when you transform cost into results.
To do prediction, I need historical data (memory) to be able to look back in time. We can look at critical values of the data we have collected. That is why data warehouses are really important.

Data sources
Transactional data (company activity, mostly structured data) is the most important source of data today, and the most powerful one. This data is highly accessible but not necessarily easily usable.


Each company collects data on individual consumers' activities (on their website, etc., mostly unstructured data). It is a very important source of unbiased data (data not influenced by the company-client relationship). But today it is forbidden to sell it to other companies without the consumers' consent. The data is then difficult to access because it is owned by the app provider.
! Privacy issues today (GDPR) —> no major privacy issues for transactional data. The GDPR only covers data related to individuals; it does not concern data related to companies (legal persons).
Machine-to-machine (M2M) activity - Internet of Things (a mix of structured and unstructured data formats): put some kind of intelligence in small devices so they can communicate. Not widely available yet, but it will come. There are also major privacy issues here.

So what?
AI doesn’t mean big data.

You have benefits above costs only when you do something with the data. You must understand that before doing AI —> it changes the way management operates = dealing with people (complicated).
Data science supports data-driven decision making. Business decisions are increasingly made automatically by computer systems.

First, understand the data organisation (what data we have and how to access it). Then, understand the data and finally its value.


2.1 Data science as a business formulation problem

Data-driven decision making refers to the practice of basing decisions on the analysis of data
rather than purely on intuition.
Data-driven marketing – Customer insight: about capturing actionable information on our customers. It mainly serves communication purposes: channel, tone of voice, etc. Ex: who is my customer? What does he buy from me?
Data-driven marketing – Behaviour prediction (the goal of the first assignment): about using past data to predict the chances that a specific behaviour will occur, given that we can act on the context of the communication: channel, proposition, tone of voice, moment of communication, etc. Ex: what will she buy next? What is the best moment to communicate?
Price optimization: use data to adapt prices depending on which type of people you have around. Flight tickets: prices adapted to consumers.
Data lifetime: a lot of signals are only valid for a short period of time, thus turning big data into value requires real-time processing. Act/react on events when they happen or even before they happen!

Doing data science


Data scientist: understands the potential, can translate from business to execution, has the ability to evaluate proposals and execution, can do the actual modelling; an applied statistician and computer scientist.
Hacking skills: Python, use of packages.
Maths and statistics knowledge: understand what you are doing with the packages.
Substantive expertise: you need to understand the business you are working for.
What is data science?
Data science is a process that, starting with a business briefing, can extract non-trivial,
actionable, business friendly knowledge from large bodies of data. Knowledge can be
descriptive or predictive but should focus on delivering a business advantage for decision
making. Possibly the end of the process is to make a decision and automatically act according
to it. Ex: I need to reduce churn by 5% this year, how can I do this?
What is the difference with data mining? Data science is just data mining with a starting point at the raw data and the raw business problem formulation. It is expected that the data scientist will:
1. Understand the business issue and be able to reformulate it from a data science perspective.
2. Formulate a small number of precise steps (a process) based on data to solve the business issue.
3. Prepare the data – or even search for new data! – so that the solution is optimal.
CRISP-DM (Cross-Industry Standard Process for Data Mining): the data science methodology —> reliable data? What is the variance of my model (from one sample to another)?
CRISP-DM is a common approach used by data mining experts:


Data science is a process: science + craft (= savoir-faire: acquired by doing) + creativity (really need to think outside the box) + common sense.

Predictive modelling/supervised data mining


Key – part 1: is there a specific, quantifiable target that we are interested in or trying to predict? Ex: will this prospect buy a loan? Will this truck have an engine breakdown soon?
Example - business problem: Telco, a telecommunication firm, wants to investigate its problem with customers that churn. What is the targeted behaviour?
Churn = stopping the contract (end date) —> a quantifiable thing that you want to predict.
Why do customers churn? They move from a private contract to a business one (students finish their studies), the contract ends, or they don't pay their bills (differentiate people who want to leave from people whose contract is terminated by the company —> churner vs. bad payer).
Step 1 = model design (to know what to predict)
Key – part 2: do we have data on this target (= label)? To define it and predict it. Which data is available to use? Etc.
Put "1" for people that churn and "0" for people that don't. You need data on both to have a benchmark and to look for discriminative patterns.
We need to be really careful, especially about the definition of the target because we are in
the first step of the business problem, and we need to understand the data we have.
We look at what people did in the past during a certain period of time. For that past period, we also define the target (1 = churner, 0 = non-churner) and get the data. Then, I extract a table with a certain number of features (#calls, etc.) over that period (which helps us understand why people churn).
It is interesting to look at the evolution of transactions, because a massive decrease could be a good sign of an intention to close (e.g., cut the period in two and look at the ratio between the two halves).
If a person in a network (ex: family, friends) decides to churn, there is a high chance that others in the network churn as well, because that person will tell them what he found better.
! Any kind of information you can get could be useful for prediction. We don't say up front whether a feature is good or not, because we don't know which ones will be helpful. Thus, we try to find as much information as possible, because everything could help.


Case: market life insurance – supervised segmentation


We have a particular life insurance product we would like to sell to new prospects (non-clients). We have a nice offer, but we incur a cost to target it.
How do we proceed to contact as many prospects as possible without sending the letter to the whole Belgian population?
In Belgium, you used to be able to buy data (from INS for example) as a mailing list with demographic information, but this is not possible anymore (since the GDPR).
Good process: testing —> take a random sample of the population (ex: buy a mailing list of 100,000 people) and send the letter. From that, we will have the ability to build a model.
"1" = people who took an action, showing interest
"0" = the others
Then, we move from a table to a vector space (a line of the table = a vector). Next, we create a classification tree (widely used in BI to build models) —> all algorithms are searching for such a partition. Finally, we split the vector space based on the classification tree.
Green: reacted but did not buy (false positives)
+: reacted and bought (true positives) —> the most interesting target
Best section: top right (because in the bottom right we spend money on a lot of false positives)

Or… we can do a logistic regression (also a good segmentation).


Key – part 3: the result of supervised data mining is a MODEL that, given data, predicts some
quantity.
A classification tree can be translated into SQL statements (or simple rules):
Example:
if income < 50000:
    prediction = "no life insurance"
else:
    prediction = "life insurance"
This result/rule can be applied to any customer, and it gives you a prediction.
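As an illustrative sketch (not from the slides; the data, column name and threshold are invented), scikit-learn can print a fitted tree as if/else rules that map directly to SQL CASE statements:

# Hypothetical sketch: turn a fitted decision tree into readable if/else rules.
# The data, column name and threshold below are invented for illustration.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

data = pd.DataFrame({
    "income":         [30_000, 45_000, 60_000, 80_000, 25_000, 90_000],
    "life_insurance": [0,      0,      1,      1,      0,      1],
})

tree = DecisionTreeClassifier(max_depth=1).fit(data[["income"]], data["life_insurance"])

# Prints rules such as "|--- income <= 52500.00  |--- class: 0", which map
# directly to SQL: CASE WHEN income <= 52500 THEN 0 ELSE 1 END.
print(export_text(tree, feature_names=["income"]))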
There are different models: tree/rule (supervised segmentation - SQL), numeric function as
P(x) = f(x) (neural network), etc.
The difference is the type of target variable:
● If the label is categorical, we use classification (multi-class prediction (> 2 categorical values) or binary prediction). Ex: churn, truck problems = classification —> a lot of management decisions can be mapped as classification problems. We rarely use a pure classifier model (even with a tree) —> in general, we try to compute probabilities.
Classification problem: in general, the target takes discrete values that are NOT ordered. The most common case is binary classification, where the target is either 0 or 1.
3 different solutions to classification:
o Classifier model: the model predicts the discrete value (class) for each
example. There is no information on the likelihood that prediction is true.
Mostly never used.
o Ranking (binary case): the model predicts a score allowing to sort all
examples according to the likelihood of being in one class - from most likely to least likely.
Used when the cost/benefit is unknown/difficult to calculate or when the
business context determines “how far down the list”.
o Probability estimation: the model predicts, for each class, a score between 0
and 1 that is meant to be the probability of being in that class. For mutually
exclusive classes, the predicted probabilities should add up to 1.
When the cost/benefit is known relatively precisely (value-based
framework) or when you want to choose among several decisions/models.
You can always rank/classify if you have probabilities.


● If the label is a numerical value, we use regression. But it is quite rare that we use it, because it is difficult to do a good regression (a lot of variance). For example, nobody has succeeded in predicting the stock market.
! With decision tree algorithms, we must be careful when approximating probabilities like this: small sample size in the leaves, local sample (does not "look" around). In general, the probability estimation is quite poor.
Key – part 4: a data-driven model can either be used to predict or to understand (explanatory modelling can be quite complex).
Ex: the model tells you that the truck has a high probability of breakdown tomorrow —> you can thus use another truck (fully automated model).
Predictive model: data mining example

Terminology
Target = dependent variable = what we need to predict.
Attributes = independent variables = what describes the target.
Data science: there are always thousands of variables (>< statistics: few variables).
You can observe a lot of patterns but not all are models (examples slide 47).


Dimensionality of a dataset = the sum of the number of numeric features and, for each categorical feature, its number of possible values (modalities) minus 1 (e.g., 3 numeric features plus one categorical feature with 4 modalities —> dimensionality = 3 + (4 − 1) = 6).

Feature types:
● Numeric: anything that has some order (numbers, dates, dimension of 1).
● Categorical: stuff that does not have an order (binary, text, dimension = number of
possible values minus 1).
Model induction, learning, inductive learning, model learning: a process by which a
pattern/model is extracted from factual data.
What is a model? A simplified representation of reality created for a specific purpose
(classification/regression question).
Not all patterns observable in data are good. For example, "age is inversely proportional to alphabetic order" does not seem to be a good pattern.


2.2 Generalization & performance assessment

Generalization
From tables to vector spaces:
Table (denormalised): each line is a unique client, and each column is a characteristic of each
client. We expect that columns are well filled for each client.
Vector space: each client is a vector, and it is possible to compute the Euclidean distance between each pair of clients.
For supervised learning, we have labels for each client (ex: payment ok/defaulted). These
labels are acquired from data and are the ones we want to predict (for unlabelled ones).

Vector space generalization: extend the pattern —> where do I stop? What is the optimal partition of the vector space that maximizes the likelihood that any prediction will be correct? I cannot find it by computing all partitions, so I perform a search in the space of possible solutions; each machine learning algorithm proposes a search strategy and a criterion to evaluate any candidate solution.
Generalization is the extension of the decision-making surface beyond the observed points.
It is the assignment of a predicted value where no observations have (yet) been made. All
algorithms, in one way or another, partition the vector space into several segments whose
value is predicted on the basis of past observations. Eventually, this prediction is
accompanied by a confidence estimator, often represented by the probability of occurrence
of the predicted value.
To generalize, you have to try to not include negative points in generalized positive areas.
Sometimes false positives are necessarily included because they are in a space where the
density is already high.
All algorithms put thresholds.
Generalization:


Decision trees: search for cuts of the vector space (along the x- and y-axes) into hyper-boxes, minimising the entropy of each resulting segment (stop cutting when there are no errors anymore) VS logistic regression: searches for a single hyperplane that maximises the likelihood of the probability distribution (using just a straight line, some errors are unavoidable).

Neural network: could potentially fit any decision space if trained enough. The professor said that this one is a bad generalization.

When different algorithms are trained on the same data, parts of the vector space will be
linked to the same prediction, but many others will not.


Which algorithm is right? It depends on the bias of the algorithm (orthogonal, linear, nonlinear) and its search strategy: are they adapted to the true pattern to be found? Fortunately, there are empirical techniques to evaluate the quality of a model's generalisation.
The choice of algorithm has little correlation with the success of the project (there is no "right" algorithm; it depends on bias and search strategy). Defining the target, etc. is more important.
The most important thing in a project is data!
Empirical inference: start from an example (sample) and infer to generalize (and have a
model that predicts on a larger universe). Everything is classifiable.
We never expect 100% accurate results. The models predict on a much larger universe, and
hence will make errors on fresh cases.
Factors influencing the performance: algorithms, expertise.
The importance of variables:
In a 2D space (pca-two, pca-three pca = principal component analysis), some points are not
separable. With the 3rd dimension added (pca-one), the points are clearly separable;
perception of distance between points changes completely when you add dimensions. The
choice of dimensions (variables) has a dramatic impact on any solution!!
Some machine learning algorithms include variables selection, some others (like deep
learning) build a new vector space to improve the ability to separate labels. The construction
of the right vector space is one of the most important tasks of a data scientist.

Measuring the performance of a model


In general, it is very important to look at the shape of the curves, because it can tell me something about the model.


Lift curve:

Example: we want to understand a model where the label is high/medium/low —> not binary. But in general, we transform the problem into binary ones —> low vs. others, medium vs. others, high vs. others. We don't have an immediate classification; we only have probabilities.
You could build a decision tree, but it would be less good, because then you don't have the probability distribution over each class you want to predict —> it is important for the business to have that so that it can focus on a certain class (it needs probabilities).
Some profiles are easier to predict: buyers of manga have a very specific profile, but buyers of bestsellers are not predictable since everyone reads bestsellers.
What kind of algorithm can we use to focus on important features? We try to remove features that have nothing to say about the target (remove sparsity) and we keep features with a high density of positives vs. negatives.
Graph: blue curve = lift curve —> I build the model and sort my table according to the probabilities, so that the most likely positives are on the left.
If I take the whole population (100%), I have a probability of 8.41% (natural density). If I take roughly the top 10% (index: 1000), I have a probability of 84.07% = ten times the natural density.
At 12%, if I add a person interested in manga, I know that the probability will be lower than the average of my selection, because I have already identified all the good candidates (following the lift curve) —> it is thus not relevant to show manga to people further to the right. I put 1 for people on the left side and 0 for people on the right side.
Everybody to the left of the natural density point has a higher probability than the a priori probability, and everybody to the right has a lower probability.
Cumulative response curve:
After sorting the clients using the scores of the model, and selecting the top best 10%, we
capture 70% of the target.
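A possible sketch (assumed code, not from the slides; labels and scores are invented) of how such a cumulative-response point and the corresponding lift can be computed from model scores:

# Hypothetical sketch: cumulative response and lift from model scores.
import numpy as np

y_true = np.array([1, 0, 0, 1, 0, 1, 0, 0, 0, 0])              # invented labels
scores = np.array([.9, .8, .7, .6, .5, .4, .3, .2, .1, .05])    # invented model scores

order = np.argsort(-scores)          # sort the "table" by descending probability
y_sorted = y_true[order]

top = int(0.30 * len(y_true))        # select the top 30% of the sorted list
captured = y_sorted[:top].sum() / y_true.sum()   # % of all targets captured
hit_rate = y_sorted[:top].mean()                 # density within the selection
lift = hit_rate / y_true.mean()                  # compared to the natural density

print(f"captured {captured:.0%} of targets, lift = {lift:.1f}")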


We don't want to select beyond the top 10% because the remaining points are less good and closer to random. By selecting only 10% of the population, I can reach 70% of my targets.
Beginning of the curve: no false positives, but there are more afterwards. If a model looks like an Irma model over a long selection, something is probably wrong.
Irma model: perfect information, thus no false positives or false negatives (she doesn't make errors). Perfect information (the ideal line) is never a curve but a straight line, reflecting the "a priori probability" (what is known).
Example - Excel file: there are a lot of false positives where the probability is low. There are also false negatives (ex: a grandma buys a manga for her son even if she doesn't have the profile of a manga buyer). The signal is stronger when the profile is more specific.
Negative points surrounded by positive ones become false positives —> they cannot be avoided (for good generalization).
If the cumulative curve floats around the random line, it means we don't have the right data. Ex: we will not find any pattern if we look for a profile of bestseller buyers.
! Formulation problem —> you need to segment the problem to find something in your data. Ex: who buys Renault? No —> who buys each model of Renault (Renault Espace, etc.) = segmentation.
Law of parsimony = Occam's razor:
"The principle states that among competing hypotheses that predict equally well, the one with the fewest assumptions should be selected. Other, more complicated solutions may ultimately prove to provide better predictions, but—in the absence of differences in predictive ability—the fewer assumptions that are made, the better."
If a model with fewer assumptions is as good as another model with more assumptions, it is not necessary to add complexity that brings no value. You should choose the simplest model.
Confusion matrix for binary class (only for binary targets):


Metrics:
● False positive: we predict positive, but it is actually negative.
● False negative: we predict negative, but it is actually positive.
● Recall (sensitivity): of all actual positives, the % of positives correctly predicted = probability of detection. E.g., if there are 1,000 actual positives and the model correctly identifies 900 of them (missing 100, the false negatives) —> recall = 90%.
● Specificity >< recall: of all actual negatives, the % of negatives correctly predicted.
● Precision: of all positive predictions, the % of true positives.
● Accuracy: (TP + TN)/n
● Base rate / a priori probability: % of actual positives = (TP + FN)/n
How to increase recall? You can do that, but precision will decrease —> a choice to make.
How to increase precision? You can do that, but it will degrade recall —> a choice to make (see the sketch below).
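A small sketch (assumed, with made-up predictions) of how these metrics can be computed with scikit-learn:

# Hypothetical sketch: recall, precision, accuracy and base rate from a confusion matrix.
from sklearn.metrics import confusion_matrix, recall_score, precision_score, accuracy_score

y_true = [1, 1, 1, 0, 0, 0, 0, 1]   # invented ground truth
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]   # invented predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp / (tp + fn), recall_score(y_true, y_pred))       # recall: detected positives / all positives
print(tp / (tp + fp), precision_score(y_true, y_pred))    # precision: true positives / predicted positives
print((tp + tn) / len(y_true), accuracy_score(y_true, y_pred))
print((tp + fn) / len(y_true))                             # base rate / a priori probability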

If you build an Irma model, what would the recall and the precision be? Irma says whether the person will buy or not (crystal ball) —> perfect information, thus no false positives or false negatives (she doesn't make errors), so recall, precision and accuracy = 100%.
Example of confusion matrix:

How good is a model?


We can ask ourselves whether a model predicts the outcome better than a random guess. For Cohen's kappa, it is important to compare several models and prefer the one with the highest kappa (a higher kappa means being closer to Irma than to randomness).


Kappa = (Pr(a) − Pr(e)) / (1 − Pr(e)), where Pr(a) is the relative agreement between the prediction and the true pattern and Pr(e) is the probability of chance agreement, if we assume that the predictions are independent of the true pattern.
If the prediction is perfect, then kappa = 1
If the prediction agreement with true pattern is only due to chance, kappa = 0.
Kappa computes a measure of improvement of the model on random guess.
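As a hedged illustration (not in the slides; the labels and the always-majority "model" are invented), kappa can be computed directly with scikit-learn and compared against accuracy:

# Hypothetical sketch: Cohen's kappa vs. plain accuracy on invented predictions.
from sklearn.metrics import cohen_kappa_score, accuracy_score

y_true = [0] * 90 + [1] * 10            # unbalanced classes (10% positives)
y_pred = [0] * 100                      # a model that always predicts the majority class

print(accuracy_score(y_true, y_pred))     # 0.90 -> looks good...
print(cohen_kappa_score(y_true, y_pred))  # 0.0 -> no improvement over chance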
Reminder:

For unbalanced class modalities, a classifier might show excellent accuracy while not learning anything useful! For example, if we want to predict the minority class (yes) but the model only produces true negatives, it will have high accuracy, yet we learn nothing useful: we want to predict positives, and the model only says that everything is negative. Be careful with accuracy.
Confusion matrix & model usage scope:
Metrics based on the confusion matrix give an evaluation of the models as a whole. When a
probability is available, it is quite rare that the whole model classification scope is used: we
might use more complex decision processes only in a subset of the scope of the model.


Scope usage = P(x) > 50%, else other processes are used. This means that a model is used in a
smaller decision space than the original vector space.
An updated confusion matrix could be computed to reflect the performance in the decision
space only! False positives and false negatives outside the usage scope are meaningless...
This can also be applied to other evaluation measures such as AUC, the value-based framework, etc.

ROC curve = Receiver Operating Characteristic curve:

We have a binary model M that returns a score for each instance. Instances are sorted
according to the score (descending), and we compute the distribution of the score within
each class: + and -. The question of interest here is to go from a score to a classification.
Left graph: where do we put the threshold here? At 50, because of the classification rule —> no error with that model = Irma model (cannot be found in real life).
Right graph: if I use 50 as a threshold, there are a lot of false positives. If I move the threshold to the right, there are fewer false positives but more false negatives (specificity increases but sensitivity/recall decreases). At 55, I minimize the sum of the two.
Recall and specificity are linked to where you put the threshold on the probability used for the decision.
! We want to have the probabilities and not just the classification (otherwise we cannot adapt the decision to the business problem).
A Receiver Operating Characteristic or ROC curve is a graph that illustrates the performance
of a binary classifier system as its discrimination threshold is varied to produce the
classification. The curve is created by plotting the true positive rate (Sensitivity/Recall =
probability of detection) against the false positive rate (Fall-out = probability of false
detection = 1- Specificity) at various threshold settings.


AUC = Area Under the Curve:


The Area Under the Curve is a single measure in [0, 1] measuring how good the classification is compared to a random guess.
The AUC can be computed on the ROC curve or on the cumulative response rate curve.
If AUC = 1 —> as good as Irma = as good as perfect information.
If AUC = 0 —> as good as random.
It removes the difficulty of choosing between recall and precision by maximising the area between the curve and the random line.
AUC = B/A
Characteristics: scale invariant (only ranking is important), classification threshold invariant.
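A minimal sketch (assumed data) of how the ROC curve and AUC can be obtained from model scores with scikit-learn; note that roc_auc_score uses the convention where 0.5 means random, unlike the B/A normalisation above:

# Hypothetical sketch: ROC curve and AUC from scores, over all thresholds.
from sklearn.metrics import roc_curve, roc_auc_score

y_true = [1, 1, 0, 1, 0, 0, 1, 0]          # invented labels
scores = [.9, .8, .7, .6, .5, .4, .3, .2]  # invented model scores

fpr, tpr, thresholds = roc_curve(y_true, scores)   # TPR (recall) and FPR (fall-out) at every threshold
print(roc_auc_score(y_true, scores))               # 0.5 = random, 1.0 = perfect (Irma)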


Confidence intervals & variance estimation:


Confidence intervals for AUC, lift or any other measure can be obtained through multiple evaluations of the measure on samples created using bootstrapping. Bootstrapping is a statistical procedure that resamples a single dataset to create many simulated samples. This process allows us to calculate standard errors, compute confidence intervals, etc. Bootstrapping resamples the original dataset with replacement many times, to create simulated datasets of the same size as the original. With replacement: you randomly draw copies from the original "bag" (which gives different distributions) without changing the original bag.
The model tells you what you will get on average, but there is a certain spread; sometimes the performance will be better, sometimes worse.
Variance is an important measure of the quality of a model in general. Variance tells us how
much chance influences the performance of any prediction. The best model is the one with
a high accuracy and a low variance. The chart here shows that the confidence interval of the
gain curve is quite large, and hence that the performance of the model might vary from one
application to another, when used to classify unseen cases.
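A possible sketch (assumed; labels and scores are invented) of bootstrapping a confidence interval for the AUC of a fixed model:

# Hypothetical sketch: bootstrap confidence interval for the AUC of a fixed model.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)              # invented labels
scores = y_true * 0.3 + rng.random(1000) * 0.7      # invented, mildly informative scores

aucs = []
for _ in range(1000):
    idx = rng.integers(0, len(y_true), size=len(y_true))   # resample with replacement
    if y_true[idx].min() == y_true[idx].max():              # skip degenerate resamples
        continue
    aucs.append(roc_auc_score(y_true[idx], scores[idx]))

print(np.percentile(aucs, [5, 95]))   # 90% confidence interval for the AUC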

Data overfitting: the nightmare of any model


Overfitting
We have three models of the same two-dimensional vector space. The first one is very simple and doesn't make a lot of errors. The second one is a polynomial model; it generalizes better than the first one. The third model fits the extreme cases. We don't know which model is the best; we can discuss it, but there is no definitive answer.


With most algorithms, it is always possible to build a "lookup table" that just links any past observation to its class. All algorithms search for a solution, and most of them are based on areas/decision spaces. The more dimensions you have, the more precisely the model can fit the training data.
Overfitting occurs when the model classifies seen cases much better than unseen cases, which means that the generalisation of the model is poor!

For the first and the second model, the 2 classes are not separable in that 2-dimensional space: some variables are missing to completely account for the phenomenon. The third model is the most accurate on the training data here. This is a model that tries to avoid as many errors as possible. By doing this, the model generates other errors —> bad generalization = overfitting; we lose precision here by over-searching and because of a too low representation bias (non-linearity).
When the model is more complete, we have more complexity.
What are the main differences between these two classification problems?

For the first problem, the 2 classes are separable, even though the pattern is quite complex. Training a decision tree would require building a very large and deep tree. For the second one, the 4 classes are not separable. Hence, over-searching would lead to a complex and deep decision tree. But it is difficult to know whether the depth of the decision tree reveals overfitting.
In high-dimensional spaces, it might be difficult to say whether the complexity of a model produces accuracy or overfitting. Special techniques must be used to spot those problems: holdout data and cross-validation.


A model is trained on a random sample of the data, and the unseen data is used for testing/validation.
This is how we decide when to stop training in machine learning. If we have enough data, the curve is clear. If we don't have enough data to draw the curve, we need another approach. "Sweet spot" = the ideal point (best performance on the holdout data).

Cross-validation consists in dividing the training data into n folds. A model is built using n-1 folds and tested against the remaining fold. It is used to measure the average performance of a modelling process as well as its variance. A systematic drop of performance on the test sets means that the model overfits the data and must be re-worked.
I train on these folds and test on that one —> I learn 5 times and get a view on the variability of the curves we obtain. For each test set, we check the ratio between 1s and 0s. If a fold has mostly 0s, maybe I need to reconsider that test split.

The best fold here is fold 4 (highest AUC).


Accuracy is a bad measure —> use another metric.

The cumulative gain chart shows the variance of models based on n-fold cross-validation. Cross-validation is generally used to tune the algorithm's free parameters in order to build the final model. For example, cross-validation can be used to choose among the available variables to be included in a regression, or to decide on the maximum depth of a decision tree.
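A hedged sketch (invented data and parameter grid) of using cross-validation to tune such a free parameter, here the maximum depth of a decision tree scored by AUC:

# Hypothetical sketch: n-fold cross-validation to choose a tree's max_depth by AUC.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)  # stand-in data

for depth in [2, 4, 6, 8, None]:
    aucs = cross_val_score(DecisionTreeClassifier(max_depth=depth, random_state=0),
                           X, y, cv=5, scoring="roc_auc")
    print(depth, aucs.mean().round(3), aucs.std().round(3))   # average performance and its variance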


We can use the bootstrap to estimate the performance on the population.

Bootstrap - the AUC of a single model is computed on many different data samples. Here we show the performance of a model on the 5% worst and best bootstrapped samples. It shows the AUC confidence interval.
Cross-validation - the model's AUC is computed on its training data and compared to the average over the cross-validated test data. The difference of 6% might be due to a sample effect and not to overfitting, since both curves are located in the 90% confidence interval.
If the AUC computed on the training data is largely higher than the one computed on the cross-validation test data, it means that the model still overfits.


2.3 Supervised techniques

Machine learning
● Supervised: We are given input samples (X) and output samples (y) of a function y =
f(X). We would like to “learn” f, and evaluate it on new data. Types:
o Classification: y is discrete (class labels).
o Regression: y is continuous, e.g., linear regression.
Supervised: Is this image a cat, dog, car, house? How would this user score that
restaurant? Is this email spam?
Supervised algorithms: kNN, linear regression, decision trees, naïve Bayes, logistic
regression, support vector machines, random forests.
● Unsupervised: Given only samples X of the data, we compute a function f such that y
= f(X) is “simpler”.
o Clustering: y is discrete
o Y is continuous: Matrix factorization, Kalman filtering, unsupervised neural
networks.
Unsupervised: Cluster some hand-written digit data into 10 classes. What are the
top 20 topics in Twitter right now? Find and cluster distinct accents of people at
Berkeley.

Instance-based learning: K-NEAREST NEIGHBOURS


(Binary problem; we want to classify this point.)
kNN is a lazy learning algorithm: the "learning" does not occur until a test example is given —> there is no training phase, just memorisation of the sample. It remembers all training examples. Given a new example x, find its closest training example <xi, yi> and predict yi.

Decision boundaries: Voronoi Diagram


Given a set of points, a Voronoi diagram describes the areas that are nearest to any given
points. These areas can be viewed as zones of control. Decision boundaries are formed by a
subset of the Voronoï diagram of the training data (not linear). Each line segment is
equidistant between two points of opposite class. The more examples that are stored, the
more fragmented and complex the decision boundaries can become. With large number of
examples and possible noise in the labels, the decision boundary can become nasty! We end
up overfitting the data.


Find the k nearest neighbours and have them vote. This has a smoothing effect, which is especially good when there is noise in the class labels. A larger k produces a smoother boundary and can reduce the impact of class-label noise. To find k, we vary its value and cross-validate. kNN is fast to "learn" but slow at classification time.
If I have enough data, I take a test set and don't touch it. Then, I use cross-validation to see the variance: I learn on folds {1,2,3,4} and test on fold 5 (the data is divided into 5 folds and the holdout test set is never touched). Then I repeat the process: learn on {2,3,4,5} and test on fold 1, etc.

What happens when k = N? It no longer makes sense, because we don't look at the closest neighbours but at all the data in the sample. There is a risk of underfitting (not good on the training data and also unable to generalize to predict new data).
The simplest model is the best model. We don't want to add complexity. We search for the minimum number of neighbours that gives the minimum error rate.
A small value of k increases the noise (risk of overfitting), but a large value makes it computationally expensive and may lead to underfitting.
How to choose k? Can we choose k to minimize the mistakes that we make on training
examples (training error)? As the size of k increases, the error on the training set increases,


but decreases on the test set. K can be chosen so as to maximise other measures: recall,
precision, Cohen’s Kappa, AUC, etc.

If I try to maximise recall, what is the number of neighbours? The one at the level of the red star. But this model is not generalised enough: I lack information in the dataset and there are some outliers.
Scaling!
Scaling is always important for every machine learning algorithm you use. Feature scaling impacts the "similarity".

Distance weighted nearest neighbour


This method is simple and powerful, but the distance measure is not always the best one; it depends on the problem. It works well for classification, but it doesn't give probabilities based on the vector space, as we would like.
It makes sense to weight the contribution of each example according to the distance to the
new query example. Weight varies inversely with the distance, such that examples closer to
the query points get higher weight. Instead of only k examples, we could allow all training
examples to contribute (Shepard’s method).
From K-NN to probas
There may be many ways to give an estimation of the probability of each class using k-NN.
The simplest approximation is to estimate the probability through the relative frequency of
votes from the k neighbours: k = 4 → 1 (4/4), 0.75 (3/4), 0.5 (2/4), 0.25 (1/4), 0 (0/4). This


method is a weak approximation because it depends only on the number of neighbours and does not use the topology of the vector space.

K-NN metrics
● Euclidean Distance: simplest, fast to compute.

● Cosine Distance: good for documents, images, etc. It computes the cosine of the angle between the two vectors.

● Manhattan Distance: coordinate-wise distance

● Edit Distance: for strings, especially genetic data. It gives the minimum number of edits needed to transform one string into the other.
● Mahalanobis Distance: normalized by the sample covariance matrix – unaffected by
coordinate transformations.
You can use other distance metrics, but you have to choose carefully, depending on which notion of distance is relevant to your problem.

K-NN: keep in mind


Model = set of training points + number of neighbours + distance
● Simple: no training needed.
● Non-linear decision boundaries.
● Accuracy generally improves with more data.
● Irrelevant or correlated features have high impact (weights will be biased): must be
eliminated.
● Highly influenced by noise on features and labels.
● Computational costs: memory and classification-time computation.
● Requires all features being numerical.
● Parameters: distance metric, k (number of neighbours), weighting of neighbours (e.g.
inverse distance), variables selection.
Scikit-learn: slide 20.
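Below is a small illustrative sketch (not the slide's code; data is invented) combining the points above: feature scaling, distance-weighted neighbours, and choosing k by cross-validated AUC.

# Hypothetical sketch: kNN with feature scaling, choosing k by cross-validated AUC.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)  # stand-in data

for k in [1, 5, 15, 51]:
    model = make_pipeline(StandardScaler(),            # scaling impacts the "similarity"
                          KNeighborsClassifier(n_neighbors=k, weights="distance"))
    aucs = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(k, aucs.mean().round(3))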


LINEAR REGRESSION
Model = the vector β (beta)
We want to find the best line (linear function y=f(X)) to explain the data.

Assumptions on data
We have a lot of data —> test these assumptions.
● The relationship between X and Y is linear.
● Y is distributed normally at each value of X.
● The variance of Y at every value of X is the same (homogeneity of variances =
“homoscedasticity”).
● The observations are independent (and there is no multicollinearity between the features).
● There are no outliers.
Linear correlation
Regression strives to find a line that fits the points.
● If the points are perfectly aligned —> perfect relation between x and y (Irma).
● The relation is weaker if, for a given x value, you have a large span of y values (more spread).
● No information if the relation between x and y forms a circle; for each value of x, you have all possible values of y, thus there is no solution in a perfect circle.
● If the relation is a horizontal line, x has nothing to say about y, but we can still build a predictive model because y is a constant.

The correlation coefficient r measures the strength of the linear relation.


R-squared:
Let ŷ = Xβ̂ be the predicted value and ȳ be the sample mean. Then R² = 1 − Σ(yᵢ − ŷᵢ)² / Σ(yᵢ − ȳ)². A linear regression maximises the R² measure. The optimal solution is found analytically (no search → quick).

Quality of fit: Residual Sum of Squares (RSS)


Maximising R² is equivalent to minimising the residual sum of squares statistic, RSS = Σ(yᵢ − ŷᵢ)², the numerator of the unexplained-variance fraction.

RSS = 0 if the regression perfectly predicts the observed values, and it grows as the fit gets worse.
RSS/TSS can be described as the fraction of the total variance not explained by the model.
RSS and linearity
We can always fit a linear model to any dataset, but how do we know if there is a real linear relationship? You can test it, and if you see a drop in fit, it is a signal that the relationship may not be linear.


Or you can plot the residuals to check the linearity (visualize residuals).

Regression for binary classification


Example of Iris data set: slides 30 – 34.


Anything that lies on the regression line has a 50/50 probability of being + or −. The further you are from the line, the higher the probability of being + (above) or − (below) = a gradient of probabilities.

Regression: Keep in mind


● Solution is analytic if Residual Sum of Squares (RSS) is used as an objective function.
● Requires all features being numerical.
● Does not accept missing values (need to replace them).
● Computes a global decision surface and so is less in danger of overfitting (unless the feature space is sparse).
● Overfit comes from the number of variables used (dimensionality of the vector
space), not from the possible over training.
● No features selection: need to add regularisation to reduce dimensionality.
● Objective function is not best for generalisation.
All these solutions could be generated. The best model is the line in the middle, because it has the best chance of being right as more data arrives, and it is better to have the same space on the left and on the right so as not to have a biased generalization.

Scikit-learn: slide 38
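A minimal sketch (not the slide's code; the linear pattern and noise are invented) of fitting a linear regression and inspecting R², RSS and the residuals discussed above:

# Hypothetical sketch: fit a linear regression, look at R2, RSS and the residuals.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))                # invented single feature
y = 3 * X[:, 0] + 5 + rng.normal(0, 2, size=200)     # linear pattern + noise

reg = LinearRegression().fit(X, y)                   # analytic solution, no search
y_hat = reg.predict(X)

print(reg.coef_, reg.intercept_)                     # the beta vector and beta0
print(reg.score(X, y))                               # R2
print(((y - y_hat) ** 2).sum())                      # RSS
residuals = y - y_hat                                # plot these against X to check linearity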

LOGISTIC REGRESSION


It applies only to a binary class: 1/0, +/−. Instead of computing the regression of Y on X, we would like the regression to directly give P(1|x). So, we would like to formulate the problem like this: compute the likelihood of P(1|x) = β·x.

However, the probability function is not linear: going from 0.10 to 0.20 doubles the probability, but going from 0.80 to 0.90 barely increases it. Also, the odds of being + (P+/P−) are an exponential function. Hence, estimating the probability distribution with a linear regression will not give a good fit.

However, the logit (ln) of the odds is linear. The formulation becomes: compute the likelihood of ln(P(1|x) / (1 − P(1|x))) = β·x.

For a continuous outcome variable Y, the value of Y is given for each value of X. For a binary outcome variable Y (+, −), the proportion of + cases is given for each value of X.
However, the decision surface is still a linear hyperplane.


Logistic regression: keep in mind


● Solution is not analytic: finding the solution that maximises the likelihood is done
through optimization algorithms → CPU intensive.
● Requires all features being numerical.
● Does not accept missing values (need to replace them).
● Computes a global decision surface and so is less in danger of overfitting (unless the feature space is sparse).
● Overfit comes from the number of variables used (dimensionality of the vector
space), not from the possible over training.
● Objective function is better for generalisation than Linear Regression due to the fact
that the correct prediction likelihood is maximized. However, if the vector space is
sparse, problems of choice of best solution still occurs. No features selection: need to
add regularisation to reduce dimensionality.
Good because: it is global, so it takes all distances into account, and it maximises the likelihood of the probabilities, which can be computed on the training data. Downside: it is a linear method.
Scikit-learn: slide 44.
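A short sketch (not the slide's code; data is invented) showing the point above: the model is fitted by iterative optimisation and returns P(1|x) via predict_proba, which is what we use for ranking and lift curves.

# Hypothetical sketch: logistic regression returning P(1|x) instead of a hard class.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=8, random_state=0)  # stand-in data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)   # iterative optimisation, not analytic
proba = clf.predict_proba(X_test)[:, 1]    # P(1|x): used for ranking / lift curves
print(proba[:5])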

SVM = SUPPORT VECTOR MACHINES


A support vector machine is a classifier that tries to maximise the margin between the training data and the classification boundary (the plane defined by Xβ = 0). Margin = the distance between the decision boundary and the closest training points on each side. Objective: maximize this distance.


The idea is that maximizing the margin maximizes the chance that the model will correctly
classify new data.

We try to put the frontier at the place that maximizes the margin, because we don't know what will happen with unseen cases (so it increases generalization).
Sometimes we cannot avoid misclassification. We can only influence the way errors are taken into account (different parameters); the algorithm penalizes training points for being on the wrong side of the decision boundary —> it balances the margin size against the severity of misclassification. 2 loss functions:
● Zero-one: penalization is uniform; = 1 if misclassified, else = 0.
● Hinge: penalization is linearly proportional to the distance from the closest support vector. The cost of a misclassified point increases the further it lies past the decision boundary, on the wrong side of the margin between positives and negatives.
SVM vs. logistic regression
Regression solutions can be dramatically influenced by outliers or noise (especially in sparse vector spaces). SVM reduces this influence by explicitly trying to maximise the margin, hence preserving the generalisation ability of the solution.
High-dimensional feature spaces might induce model instability. Regularization or, better, feature selection algorithms are key to avoiding model instability and variance.
Why would regression methods change results radically? Residual errors: the distance between predicted points and the real data —> they optimize (minimize) those errors, so outliers pull the solution. There may also be a lack of information in the (2D) vector space to differentiate a point from the others.


SVM: optimize the margin even if it means accepting some errors (the optimal solution depends on the two parameters: margin and errors) —> the best solution for generalization.

SVM: keep in mind


● Solution is not analytic (like logistic regression, unlike ordinary regression): finding the solution that maximises the margin objective is done through optimization algorithms → CPU intensive (costly with a lot of data).
● Requires all features being numerical.
● Does not accept missing values (need to replace them with average one or with
nearest neighbours for example).
● Compute a global decision surface and so is less in danger of overfitting (unless
feature space is very sparse).
● Overfit comes from the number of variables used (dimensionality of the vector
space), not from the possible overtraining. It is less in danger of overfitting than
nearest neighbours.
● Objective function is designed to maximise generalisation and hence is better than
Linear or Logistic Regression.
● No features selection: regularisation can be used to reduce dimensionality.
● The output of SVM is a score and not a probability. Probability can be estimated by
mapping a Logistic Regression on the SVM scores (Platt, 1999). Results are reliable
since SVM is a global method.
Scikit-learn: slide 53 —> don't accept the default parameters, understand what their influence is! If you don't know what the influence is, just test it (by trying different values).

In reality, we always want to have a view on recall and precision. Thus, if you have a multi-class problem (e.g. 3 classes), you create 3 binary models and then decide based on the probabilities (rather than just building a single multi-class SVM on it).
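An illustrative sketch (not the slide's code; data is invented) of a linear SVM whose scores are turned into probabilities via Platt scaling, as mentioned above (probability=True in scikit-learn fits a logistic mapping on the SVM scores):

# Hypothetical sketch: linear SVM; probability=True applies Platt scaling
# (a logistic regression mapped on the SVM scores) to obtain probabilities.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)  # stand-in data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

svm = make_pipeline(StandardScaler(),
                    SVC(kernel="linear", C=1.0, probability=True))  # C balances margin vs. errors
svm.fit(X_train, y_train)

print(svm.decision_function(X_test)[:3])      # raw margin scores
print(svm.predict_proba(X_test)[:3, 1])       # Platt-scaled probabilities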

NON-LINEAR REGRESSIONS
It is possible to use linear algorithms to fit non-linear patterns by using the "kernel function trick". The original input space is transformed (by applying a function such as raising to the power of 2) into another vector space (the transformed space) that explicitly contains additional, non-linear variables. Although the classifier is a linear solution in the transformed feature space, it will be non-linear in the original input space. A kernel can be applied to the original input space to create a new vector space, or it can be applied "on the fly" when training or applying a model (SVM in general comes with some of those kernels).
The more polynomials I add, the easier it will be to find a linear solution without errors. But if I go too far, I have too many solutions (too much sparsity) and it will be impossible to find the best one —> you can add features, but not too many.


The new variables can be combinations of original ones (x·y), powers (x²) or any polynomial function of the original variables. The log is usually applied to variables representing money, to move from a non-normal to a normal distribution. It has the effect of creating a non-linear projection in the original input space.
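A small sketch (assumed code, invented circular data) of this explicit kernel trick: a linear model trained on polynomial features becomes non-linear in the original input space.

# Hypothetical sketch: the "kernel trick" done explicitly with polynomial features,
# so a linear model becomes non-linear in the original input space.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))                         # invented 2D input space
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)     # circular (non-linear) pattern

model = make_pipeline(PolynomialFeatures(degree=2),   # adds x1^2, x2^2, x1*x2, ...
                      LogisticRegression(max_iter=1000))
print(model.fit(X, y).score(X, y))                    # linear in the transformed space only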
Weight of evidence recoding
Recoding depends on the target you want to predict. You have to be sure that the pattern
will be linear after recoding.

This function is used to recode categorical variables into continuous variables that reflect the strength of the relationship between the modalities and the target (« Good »): in its usual form, WoE(modality) = ln(share of Goods in the modality / share of Bads in the modality). It is linearly correlated with the target.
WoE - Univariate linearisation of input space
Continuous variables: must first be transformed into a number of bins.
Categorical variables: nothing to do.
WOE is computed for each modality and can be normalized.

A new variable with the same number of modalities is created. The modalities' values become the (normalised) WoE instead of the original values. A new vector space is built, where each variable is linearly correlated with the target – and hence where linear separation is easier to perform.

Regularisation
We know we can overfit because we have too many features. Regularisation is used to
reduce the possible overfit of regression models when there are many variables.
We can add variables using the kernel trick (x², x³, etc.). Then, the dataset no longer has one single x variable as input but all the kernels (x², x³, etc.). We can gradually use more variables, by starting with x alone and adding a new variable each time, increasing the polynomial degree of the expression. By increasing the number of variables, we can see that the regression fits the data better and better, until reaching a point where there is clearly overfit.
Here we do it for regression, but it is the same for classification.

Regularisation adds a term to the regression function that penalises the magnitude of the
value of the regression coefficients β. Regularisation term = alpha*R(β).

If alpha = 0 —> we are back to a simple regression.

If alpha is really high —> we get a big penalty. Why do you want this? If you have 2 highly correlated features, without regularisation their weights can take arbitrarily large opposite values because together their impacts cancel out; the penalty controlled by alpha keeps the coefficients in check.
Two main regularisation functions are used: Ridge (L2), with R(β) = Σ βⱼ², and Lasso (L1), with R(β) = Σ |βⱼ|.

Scikit-learn: regularisation parameter: L1 (Lasso), L2 (Ridge) or 0 (no regularisation).


Ridge regularisation:
As the value of alpha increases, the model complexity reduces. Significantly high values can cause underfitting as well. Thus, alpha should be chosen wisely (alpha is part of the model; you need to know its value to reproduce the model). In general, alpha is found by a step-wise search using cross-validation (CPU intensive!).
You can remove features with very low weights and then redo the process (loop) until you begin to underfit (test to know when to stop). Regularisation is very important to give meaning to the weights.


Lasso regularisation:


Last graph = intercept = β0 (average value).


A too high value of alpha might remove the influence of all variables in the regression, which then returns a constant for all x (underfitting).

We get very small coefficient values, but we never reach a point where β = 0 – otherwise it would mean that the feature is removed from the model.
A careful evaluation of the best α is mandatory. The RSS of the model is by no means a good way of selecting it, because it does not measure the generalization ability of the model.
For a given alpha, the Lasso objective is minimized.
As alpha is a parameter, we do a loop over candidate values (e.g. start with alpha = 0.001), fit the regularised regression (minimising RSS + alpha·Σ|β|), test the AUC, then increase alpha (e.g. by 0.005) and redo the loop, etc. We can then decide which value is best.
Only cross-validation, or, if much data is available, a single large hold-out validation set can be used. Which of the measures we have seen are good measures of model quality? RSS (computed on the test data) only increases with alpha and says nothing about quality. AUC (ROC curve) is a good measure (together with recall, etc.).
A regression is optimizing the RSS.
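A minimal sketch of the alpha search described above, assuming a binary target and a logistic regression (in scikit-learn the regularisation strength is C = 1/alpha); the dataset and the alpha grid are illustrative.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=40, n_informative=5, random_state=0)

best_auc, best_alpha = 0.0, None
for alpha in np.logspace(-3, 2, 11):                     # the loop over candidate alpha values
    model = LogisticRegression(penalty="l2", C=1.0 / alpha, max_iter=1000)
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    if auc > best_auc:
        best_auc, best_alpha = auc, alpha
print(best_alpha, best_auc)                              # alpha is part of the final model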
Example:


We have a training set of 20 000 records. Train the model and choose it. How to evaluate its variance on top of cross-validation? Bootstrapping on the test set: resample it with duplicates; the AUC will vary a lot for unstable models, as opposed to robust models.
What are the parameters to reduce overfitting for a regression model? Pruning variables.
If you have a high-variance model, most of the time it means that you don't have enough targets (enough positive data).
Choose the best model according to the measures and to your feeling of which model is the best given its variables.
Lasso vs. ridge:
In most cases, we prefer Ridge even when we do not have that many features, because some of them are always correlated. Why? Because if there is correlation, the β values can become large and the features compensate each other ("one to the sky, the other to the ground"). We need regularisation to be able to interpret the β values. Use Ridge for the final model: it does not eliminate correlated features, but it keeps the coefficients under control.
Sometimes Lasso works really well (but it prunes a lot): among correlated features, Lasso picks only one (and sets the rest to 0).
So, in general, we use both models.
● Ridge includes all (or none of the) features in the model: the major advantage of ridge regression is coefficient shrinkage and reduced model complexity. Lasso, along with shrinking the coefficients, performs feature selection as well by moving some coefficients to zero, which is equivalent to excluding those features from the model.
● Ridge is mainly used to prevent overfitting. Since it includes all the features, it is not very useful with an exorbitantly high number of features (say, in the thousands), as this poses computational challenges. Lasso, since it provides sparse solutions, is generally a good choice when there are many features (thousands or more).
● Ridge generally works well even in the presence of highly correlated features, as it includes all of them in the model, but the coefficients are distributed among them depending on the correlation. Lasso arbitrarily selects one feature among the highly correlated ones and reduces the coefficients of the rest to zero; also, the chosen variables change randomly with changes in the model parameters.

Elastic net regularisation


We have to find 2 alphas for the regularisation term instead of one.

It is a compromise between Ridge and Lasso regularisation. Finding the optimal solution now requires finding αR and αL. This method has been built to overcome some limitations of the LASSO regularisation:
● The fact that it performs poorly when the ratio (number of cases / number of variables) is very small.
● The fact that it tends to select one single variable from a group of correlated variables.
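A minimal sketch with scikit-learn, where the two regularisation weights are expressed as an overall strength alpha and a mix l1_ratio between the L1 and L2 terms; the grids below are illustrative.

from sklearn.linear_model import ElasticNetCV

# Searches both parameters by cross-validation.
enet = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.7, 0.9, 0.99], n_alphas=50, cv=5)
# enet.fit(X, y)
# enet.alpha_, enet.l1_ratio_    -> the retained regularisation parameters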


Scikit-learn: slide 72

DECISION TREES
= an iterative & divisive (split node in 2 other nodes) algorithm to create piecewise
non-linear classifier over multi-class problem.

Search space: all possible sequences of all possible tests. The size of the search space is exponential in the number of attributes: too big to search exhaustively, and an exhaustive search would probably overfit the data (too many models).

A node is pure if it contains only one class: with classes high/medium/low, a pure node contains only high, or only low, or only medium cases. Making a decision on 10 cases is probably not reliable: at the beginning, decisions are well informed, but as we go deeper and deeper, we start having problems.
Top-down induction of decision trees (recursive partitioning): find the "best" attribute test to install at the root, split the data on the root test, find the "best" attribute tests to install at each new node, split the data on the new tests. Repeat until: all nodes are pure, there are no more attributes to test, or some stopping criteria are met.
Greedy Search: once a node is expanded, no return possible. It does not guarantee that the
optimal tree is produced (in fact often it will not be the case…).
Iterative search splitting effect: after each split, the sample size for making a decision is
divided, quickly reaching small numbers. Hence, the statistical reliability of the decisions
decreases in the same way!
The algorithm is local in the sense that it only « looks » at data points in the current node,
and hence in the current portion of vector space to make any decision.
Decision trees can take more time than a regression.

Growing a DT – How to split records: how to specify the attribute test condition?
How to specify a test condition? It depends on attribute types (nominal, ordinal (integer),
continuous) & on number of ways to split (2-way split, multi-way split).
● Splitting nominal attributes:
o Multi-way split = use as many partitions as distinct values.


o Binary split = divides values into two subsets (requires computing the optimal partitioning → CPU intensive, but it has the advantage of reducing the data-splitting effect).

● Splitting continuous attributes


o Discretization: to form an ordinal categorical attribute (with bins of age, for example). It is static (discretize once at the beginning) or dynamic (ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering).
o Binary decision: (A < v) or (A >= v). It considers all possible split values and
finds the best one (can be computationally intensive).

Growing a DT – How to split records: how to determine the best split?


● Impurity measure: entropy (state of disorder). Entropy is computed to judge how mixed the class distribution in a node is, i.e. how much information is still needed to classify a case.


Entropy splitting rule:

The information gain measures the reduction in entropy achieved because of the
split. Choice: choose the split that achieves highest reduction i.e., that maximizes the
information GAIN.
Problem with information gain: there is a chance of obtaining purity just by luck. It favours attributes with many values. Extreme cases: social security numbers, patients' IDs, dates. If you are alone in your node, entropy = 0 (purity) → bad idea to use IDs, for example.
Corrected splitting rule: a proxy to cope with that problem.

● GINI splitting rule:


When a node p is split into k partitions (child nodes), the quality of the split is computed as the weighted average GINIsplit = Σ (ni/n)·GINI(i), where ni is the number of records at child i and n the number of records at node p.
Choice: choose the split that achieves the lowest resulting Gini, i.e. that minimizes the Gini split.

● Splitting on classification error:

Probability to flow to a certain node = frequency at that node (ex: 70 cases / 100 cases, where 100 = the number of cases at the previous node). Error rate for a node in a tree: if I have 2 positives against 25 negatives and I label the node "–", the error rate = 2/(2+25) = 2/27.


To choose the best split, the decision is based on the weighted average of a criterion, which can be entropy, Gini or the error rate. These are parameters we have to choose among: we need to understand them.
There can be differences among decision trees built with the different rules. However, even though the trees may differ, their classification quality should not show major differences (AUC, Kappa, ROC). Only tests performed on unseen cases could designate the best criterion.
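A minimal sketch of the three criteria on a node's class counts, and of the weighted average used to score a split; the counts in the example are illustrative.

import numpy as np

def entropy(counts):
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def gini(counts):
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    return 1.0 - np.sum(p ** 2)

def classification_error(counts):
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    return 1.0 - np.max(p)

def split_score(children, criterion):
    # Weighted average impurity of the child nodes after a split.
    total = sum(sum(c) for c in children)
    return sum(sum(c) / total * criterion(c) for c in children)

# Example: a parent node [50+, 50-] split into [40+, 5-] and [10+, 45-]
children = [[40, 5], [10, 45]]
print(split_score(children, gini), split_score(children, entropy))
# Information gain = entropy([50, 50]) - split_score(children, entropy)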

Growing a DT – determine when to stop splitting?


We can stop splitting when all the records belong to the same class, when there are no more attributes to test, or by early termination.
We can measure the complexity of a DT by its number of nodes (and no longer by the number of features). With a decision tree, we keep splitting on the features to obtain pure nodes, but by doing so we destroy the possibility of generalization → overfitting.

How to avoid overfitting in DT?


Pruning is a mechanism that automatically reduces the size & complexity of a DT in the hope
of producing a tree with the best generalization capacity, given the training data.
Two strategies:
● Pre-pruning: stops growing the tree before it reaches perfect classification on the
training data. Pre-pruning = early stopping (stop the growing process prematurely).
Typical stopping conditions for node expansion: stop if all instances belong to the
same class - stop if there is no attribute left to split the cases (they were all used
before) - stop if the number of instances is less than some user-specified threshold -
stop if maximum depth is achieved (parameter). It can also stop based on some statistical significance test:
o χ2 test – Stop if no attribute splits the data in a way that the class distribution
is significantly improved than without it.
o Gini or Information Gain – Stop if expanding the current node does not
improve impurity measures above a certain threshold.
o Generalization Error – Stop if expected error does not decrease above a
certain threshold (see later for estimation measures).
Example: the XOR/parity problem (really rare in practice): no individual attribute exhibits any significant association with the class; the structure is only visible in the fully expanded tree. If you have strong interactions between features (ex: woman and pregnancy), even if it is not really a XOR, it will be really difficult for the decision tree to choose which one to go for.
● Post-Pruning: fully grows the tree (possibly overfitting the data), and then post-prune
the tree from bottom to top.
Steps:
o Grow decision tree to its entirety: fully-grown tree shows all selected
attributes interactions. However, some subtrees might be due to sample
chance.
o Prune the nodes of the decision tree in a bottom-up fashion.
o If, after pruning, the error reduces more than a defined threshold, replace
sub-tree by a terminal node.
Sub-tree replacement bottom-up: consider replacing a tree only after
considering all its subtrees.
o The class label of terminal node is given by the majority class.
In all cases, Occam’s Razor principle is applied: given two models of similar generalization
errors, the simpler model is preferred over the more complex one.
Estimating the error rate (examples slides 102-104): you should certainly do that if you have
enough data.


Two strategies can be used to estimate errors in a tree: error is computed on the training
data (resubstitution error), or error is computed on a holdout validation dataset in order to
better measure the true error rate on unseen cases (“reduced error pruning” = REP or
“best-pruned subtree”).

We don’t want to add complexity without doing something relevant for the problem. You
have a lot of parameters to handle to decide how the pruning will occur.
Minimal cost-complexity pruning:

R(t) = error rate at that node if you prune there; Rα(t) = cost-complexity measure of the node; Rα(Tt) = error rate of the subtree Tt, computed as the average over its sub-nodes weighted by the number of cases in each, plus the complexity penalty.
Last step of the tree above: 10–15, and then 8–5 and 2–10 → you are not learning anything. Alpha is too high, but it is a parameter, so let's change it → alpha = 0.06 does not seem to be the best either → test to know if we need to prune.
To find the best alpha: use a test set to compute R(t) and Rα(Tt) and search for the best alpha, or use cross-validation to compute R(t) and Rα(Tt) as the average over the validation sets.
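A minimal sketch with scikit-learn, where minimal cost-complexity pruning is exposed through the ccp_alpha parameter; the data, the scoring choice and the candidate grid are illustrative (and CPU intensive: one cross-validation per candidate alpha).

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X_train, y_train = make_classification(n_samples=1000, n_features=20, random_state=0)

# Candidate alpha values suggested by the fully grown tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

best_alpha = max(
    path.ccp_alphas,
    key=lambda a: cross_val_score(
        DecisionTreeClassifier(random_state=0, ccp_alpha=a),
        X_train, y_train, cv=5, scoring="roc_auc").mean(),
)
pruned_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha).fit(X_train, y_train)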

Decision trees: to keep in mind


● Piecewise non-linear multi-class.
● Can cope with categorical & numerical features → no need to do data normalization or to transform numerical data; you can just use the data as they are.
● Compute a decision surface by orthogonal pieces (hyper-cubes) and so is more in
danger of overfitting. Decision trees are supposed to use all the features.
● Overfit comes mainly from over training and hence overly complex decision surfaces,
and many variables selected.
● Includes embedded features selection and missing value treatment (different
techniques to cope with missing values depending on the algorithm but none of them
remove the record (line) where a value is missing thus algorithms are robust for
missing values).
● DT show an unstable topology in their decision surfaces when many variables are available (they change with small changes in the sample). There is no notion of distance, so a new sample could change the counting → the algorithm is not really searching for the best generalization (optimal model). That is why it is quite unstable (because it does not use distances).
● They are by nature more sensitive to small changes in sampling because, instead of looking at the geometric topology of the feature space, they just base their decisions on counting class distributions.
● When simple, they can be understood by human beings.
● No reliable class probability: it is approximated by the class frequency in the leaf node. (Use the Laplace correction = (nc + 1) / (N + m), where nc = number of cases of class c in the terminal node, N = number of cases in the terminal node, m = number of possible classes (= 2 for a binary class)). Probabilities are based on the class frequency in the node (number of positives vs negatives in the node) → not reliable to say "20% to be negative" if there are 2 negatives vs. 8 positives, because we come from a dataset with thousands of cases.
● Are very sensitive to unbalanced classes: if the data set contains too few + cases, a
decision tree could possibly have all leaves labelled with the majority class. Ex:
500.000 negatives, 500 positives → A priori of 0.1%.
● How to solve this problem? Data undersampling: reduce the number of negative
cases and data oversampling: make n copies of the positive cases →500 * 100 =
50.000 → A priori of 10%. → I see a lot of data scientists going to 50% positive.
● Problem? Computed probabilities are wrong and must be corrected... Probabilities
are different from the reality because of data over/undersampling thus you cannot
use them. You have to correct them. How? Keep another dataset apart (but generally
difficult because if you do oversampling it is because you don’t have enough data).
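One common analytic way to perform this correction (a sketch, not something detailed in the notes, which suggest recalibrating on a held-out set) is to rescale the model's probabilities from the resampled prior back to the natural prior:

def correct_probability(p_model, pi_sample, pi_true):
    # p_model: probability estimated on the over/undersampled training set
    # pi_sample: positive rate in that training set, pi_true: natural positive rate
    ratio_pos = pi_true / pi_sample
    ratio_neg = (1 - pi_true) / (1 - pi_sample)
    return p_model * ratio_pos / (p_model * ratio_pos + (1 - p_model) * ratio_neg)

# Example: model trained at 10% positives, natural a priori of 0.1%:
# correct_probability(0.5, 0.10, 0.001)  ->  about 0.009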
DT are good to get insights about the data but bad to use probabilities and manage things
based on it (thresholds, precision, etc.).
Scikit-learn: slide 113

ENSEMBLE METHODS
They are like crowdsourced machine learning algorithms: take a collection of simple or weak
learners and combine their results to make a single, better learner. They are more suited to
algorithms that produce high variance or prediction instability like DT (less efficient for
logistic regression).
Types of ensemble methods:
● Bagging: train learners in parallel on different samples of the data, then combine by
voting (discrete output) or by averaging (continuous output).
● Stacking: combine model outputs using a second-stage learner like linear regression. It means that you create many DT, take the data and make each DT vote → the votes become features and, based on these, you build a second-stage learner (like a linear or logistic regression).
● Boosting: train learners on the filtered output of other learners (learning from the errors of the other models). It determines which instances the DT misclassifies and gives a second chance to these. You force the algorithms to focus more and more on the errors.

Random forests: forest of decision trees that are created adding randomness to each DT
build. Grow K trees on datasets sampled from the original dataset with replacement
(bootstrap samples), p = number of features:
● Draw K bootstrap samples of size N.


● Grow each DT, by selecting a random set of m out of p features at each node and
choose the best feature to split on. Typically, m is a parameter with default value m =
√𝑝.
● Aggregate the predictions of the trees (most popular vote) to produce the final class.
Principle of RF: take votes from different learners = look at intersections of many different
decision surfaces in our vector space.
How do Random Forests ensure diversity among the individual trees? Draw K bootstrap samples of size N: each tree is trained on different data. Grow each DT by selecting a random set of m out of p features at each node, and choose the best feature to split on: corresponding nodes in different trees (usually) can't use the same feature set to split.
RF are probably the most popular classifier for dense data (<= a few thousand features). They
are easy to implement (just train a lot of trees), and parallel computing is easily done.
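A minimal sketch of this recipe with scikit-learn (the parameter values are illustrative defaults, not prescriptions from the course):

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=500,        # K trees, each grown on a bootstrap sample of size N
    max_features="sqrt",     # m = sqrt(p) features tried at each node
    n_jobs=-1,               # trees are independent, so training parallelises easily
    random_state=0,
)
# rf.fit(X_train, y_train)
# rf.predict_proba(X_test)[:, 1]   # probability approximated from the trees' votes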

Random forests: keep in mind


● Piecewise non-linear multi-class.
● Can cope with categorical & numerical features.
● Compute a decision surface by orthogonal pieces (hyper-cubes) and so is more in
danger of overfitting, however reduced by the fact that each tree could overfit but
on different samples of the data.
● Overfit comes mainly from over training and hence overly complex decision surfaces.
● Includes embedded features selection.
● Will typically be more stable in their decision-surface topology and less sensitive to small changes in sampling than DT, because many different trees are used (smoothing factor).
● Will, in general, reduce variance of the model due to the voting process (based on DT
build on different data). Variance reduction is the main advantage of Random Forests
over DT.
● No understanding possible (of how it is built), hence difficult to make a business
validation.
● Probability of classes can be approximated by class distribution in terminal nodes of
contributing DT (best case) or by the percentage of votes for a given class (worst
case). Make sure you understand how probabilities are computed (could be the
number of votes/total votes, the vote could be built on the probability or other
things).
Scikit-learn: slide 120.


2.4 Dimensionality reduction & parameter estimation

Dimensionality reduction
= Selecting some subset of a learning algorithm’s input variables upon which it should focus
attention, while ignoring the rest. It is deeply related to separating relevant from irrelevant
features, and identifying spurious relationships between features and target, due to sample
effects. (Humans/animals do that constantly)!

Why dimensionality reduction?


● Noise removal – Some variables might introduce overfitting just because some
artifact relationship with the target is present in the sample but not in the true
distribution of the population. We remove features that have nothing to say about
the prediction.
● Business understanding – Knowing which features are most related to the event to
predict is key for business understanding, insights, and validation (remember the
Income variable for bank loan!).
● The intrinsic dimension may be small – For example, the number of genes responsible
for a certain type of disease may be small, and irrelevant and redundant features can
“confuse“ the algorithms!
● Curse of Dimensionality – Most machine learning techniques may not be effective for
high-dimensional data (the vector space becomes sparse), and hence model accuracy
and efficiency can degrade rapidly as the dimension increases.
1. Assumption: random uniform distribution. If you take a threshold in the middle,
you will capture 50% of the points.
2. Same with an additional dimension. When you add dimensions, the % of points
you capture decreases.
Look at the % of data captured as a function of the number of dimensions (so much empty space that, even with regression methods, it is possible to overfit) → CURSE OF DIMENSIONALITY —> reduce the number of dimensions to reduce the sparsity of the space.


The required size of training sample (to achieve the same accuracy) grows exponentially with
the number of variables! In practice, the number of training examples is fixed (the classifier’s
performance usually degrades for a large number of features)! The loss of accuracy is mainly
due to the fact that most of the "events" to predict just depend on a few variables. Most of the other variables just add irrelevant or correlated and redundant information. The real problem is to identify the subset of variables for which the relationship with the target is due to the true underlying law, and not to artefacts of the sample or to marginal contributions to the prediction.
In real life, there is a lot of correlation and variance increases with cross validation (new
solutions appear).

Applications of dimensionality reduction


Very common problems have too many features. Examples: customer relationship management, text mining (every word becomes a feature), image recognition, etc.

Techniques for dimensionality reduction


● Features reduction: refers to the mapping of the original high-dimensional data onto
a new, constructed, lower-dimensional space. Given a set of data points with d variables, compute a lower-dimensional representation:

you want p dimensions that are combinations of the d available features (p <<< d) → create new features based on the original ones. Ex: reducing image/sound quality —> try to keep the most important signal.
The criterion for feature reduction can be different based on different problem
settings:
o Unsupervised setting: minimize the information loss (don’t use the target).
o Supervised setting: maximize the class discrimination (use the target).
● Feature selection: process that chooses an optimal subset of features according to an
objective function. You select several features and remove the irrelevant ones (not
creating new features).
Techniques: decision trees or decision forests.
● Features reduction VS selection:
o Features reduction: all original features are used; the transformed features are (non-)linear combinations of the original features.
o Features selection: only a subset of the original features is selected, without feature alteration.

Features reduction
Algorithms:
● Linear:
o Unsupervised: LSI, ICA and Principal Component Analysis (PCA: heavily used,
computed very quickly and has nice properties)
o Supervised: LDA, CCA, PLS
● Nonlinear:
o Nonlinear feature reduction using kernels


Principal Component Analysis (PCA):


It reduces the dimensionality of a dataset by computing a new set of variables, smaller than
the original size and it retains most of the sample's information. By information we mean the
variation present in the sample, measured by the correlations between the original variables.
The new variables, called principal components (or Eigenvectors), are uncorrelated, and are
ordered by the fraction of the total information they capture.

Eigenvectors: the first vector allows to capture the maximum variance and the second vector
perpendicular to the first one allows to capture most of the remaining variance... Linear
combinations of the original axes (thus difficult to say something individually on a single
feature because eigenvectors mix different features). No correlation problem anymore. Good
case? Need to test.
Non-linear PCA using kernels:
Traditional PCA applies a linear transformation, but linear projections will not detect a non-linear pattern, thus it is not effective for non-linear data → kernel trick: PCA is applied on the transformed vector space.
Scikit-learn: slide 17
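A minimal sketch of both variants with scikit-learn (standardising first, since PCA is driven by variance; the parameter values are illustrative):

from sklearn.decomposition import PCA, KernelPCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

linear_pca = make_pipeline(StandardScaler(), PCA(n_components=0.95))           # keep 95% of the variance
kernel_pca = make_pipeline(StandardScaler(), KernelPCA(n_components=10, kernel="rbf"))
# X_reduced = linear_pca.fit_transform(X)
# linear_pca[-1].explained_variance_ratio_   # information captured by each component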

PCA: keep in mind


● Fast to compute
● Analytic solution (one solution)
● No search
● No correlation between the eigenvectors.

Feature selection
Goal: to find the optimal feature subset (or at least a “good one”).
Methods:
● Filters:
We need a measure for assessing the goodness of a feature subset (scoring function)
and a strategy to search the space of possible feature subsets. Finding the best minimal feature set for an arbitrary target concept is NP-hard (= as hard as an NP problem = Nondeterministic Polynomial time problem) → good heuristics (approaches to problem solving) are needed.


Select subsets of variables as a pre-processing step, independently of the classifier that will be used!! Filtering is independent of the algorithm that will be used on the output of the filter; the selection function is not tied to that machine learning algorithm. Ex: linear filtering followed by a non-linear method —> is it the best thing to do?

Key characteristics: usually fast (CPU light), provides a generic selection of features, not tuned for a given learner. It is often used as a pre-processing step for other methods.
Disadvantage: the feature set is not optimized for the classifier that will be used (linear selection criteria for non-linear algorithms and vice-versa)!
Filtering uses variable ranking: given a set of features F, variable ranking is the process
of ordering the features by the value of some scoring function (which usually
measures feature-relevance).

The resulting set is a sorted list of features. The score S(fi) is computed from the
training data. By convention, a high score is indicative for a valuable (relevant)
feature. A simple method for feature selection using variable ranking is to select the k
highest ranked features.
Ranking criteria – linear correlation:
Pearson correlation criterion for numerical attributes: r(Xi, Y) = cov(Xi, Y) / (σXi · σY).

r² measures the goodness of the linear fit between Xi and Y (it can only detect linear dependencies between a variable and the target). r = 0 does not mean independence: the relationship could, for example, be a circle.
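A minimal sketch of variable ranking by absolute Pearson correlation (a numpy/pandas version; the names are illustrative). Remember that this only detects linear dependencies.

import numpy as np
import pandas as pd

def rank_by_pearson(X: pd.DataFrame, y: pd.Series) -> pd.Series:
    scores = X.apply(lambda col: abs(np.corrcoef(col, y)[0, 1]))
    return scores.sort_values(ascending=False)      # high score = strong linear relation

# top_k_features = rank_by_pearson(X, y).head(20).index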


2 questions about Pearson correlation:


o Can variables with small score be automatically discarded? NO! Even variables
with small score can improve class separability! Each variable has a small
linear correlation with the target, but together they bring separability.
Counterexample (graph on the right): no gain in separation ability by using
two variables instead of just one!

o Can two variables that are useless by themselves be useful together? YES!
Correlation between variables and target are not enough to assess relevance!
→ Correlation of target variable with n-tuples of variables could be
considered too! But it is difficult to compute all possible n-way interactions.


Another ranking criterion – Information theory:


Information theory applied to categorical attributes. This approach uses (empirical estimates of) the mutual information between the features and the target:
I(X;Y) = Σx Σy p(x, y) · log( p(x, y) / (p(x)·p(y)) )

It assumes that all variables are categorical! Hence, numerical variables must be transformed into categories first (for example using discretization techniques). Probabilities are based on frequency counts. If X and Y are independent, then the mutual information = 0 and X can say nothing about Y.
Mutual information (MI): quantifies the "amount of information" (in bits) obtained about one random variable through observing the other random variable: knowing X, what do we know about Y? MI is linked to the concept of entropy: I(X;Y) = H(Y) − H(Y|X).

Intuitively, if the entropy H(Y) is a measure of uncertainty about the target variable Y, then H(Y|X) is a measure of what X does not say about Y: it is "the amount of uncertainty remaining about Y after X is known".

Example of MI: suppose X represents the roll of a fair 6-sided die (dé), and Y
represents whether the roll is even (0 if even, 1 if odd). Clearly, the value of Y tells us
something about the value of X and vice versa. That is, these variables share mutual
information.
P(die=6) = 1/6 – P(even) = 1/2 – P(odd) = 1/2
P(die=6 | even) = 1/3 – P(die=6 | odd) = 0
P(even | die=6) = 1 – P(odd | die=6) = 0
Knowing about even/odd result reduces a lot the marginal entropy about the
outcome of rolling the die.
Scikit-learn: slide 35
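For the die example, H(X) = log2(6) ≈ 2.58 bits and H(X|Y) = log2(3) ≈ 1.58 bits, so the mutual information is 1 bit: knowing even/odd removes exactly one bit of uncertainty. A minimal sketch of MI-based ranking with scikit-learn (k is an illustrative choice):

from sklearn.feature_selection import SelectKBest, mutual_info_classif

selector = SelectKBest(score_func=mutual_info_classif, k=20)
# X_top = selector.fit_transform(X, y)     # keeps the 20 features with the highest MI
# selector.scores_                         # estimated MI of each feature with the target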
Another ranking criterion – SVC:


Single Variable Classifiers (not the same as Support Vector Classifiers). Idea: select variables according to their individual predictive power. Criterion: performance of a
classifier built with that single variable. The predictive power is usually measured in
terms of error rate (or criteria using false positive rate, false negative rate or AUC).
Also: combination of SVCs using ensemble methods (boosting, ...).
Example: logistic regression on a single feature —> very quick VS on 1000 features,
one by one —> still quick, get the prediction potential of each feature, sort features
according to their AUC, accuracy, ...
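A minimal sketch of this idea (with illustrative names): fit a logistic regression on each feature separately and rank the features by their cross-validated AUC.

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def rank_by_single_variable_auc(X: pd.DataFrame, y) -> pd.Series:
    scores = {
        col: cross_val_score(LogisticRegression(max_iter=1000),
                             X[[col]], y, cv=5, scoring="roc_auc").mean()
        for col in X.columns
    }
    return pd.Series(scores).sort_values(ascending=False)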
● Wrappers:
Filters are useful, but they do not search for interactions between features → test if you are going too far.
Learner is seen as a black box. ML algorithm itself is used to evaluate the model
performance as the subset of features is modified (using a defined search strategy).
Cross-validation is used to evaluate the model performance. We need to define:
o How to search the space of all possible variable subsets?
o How to measure the predictive performance of the learner?

Wrapper characteristics:
o The problem of finding the optimal subset is NP-hard!
o A wide range of search strategies can be used. Two different classes of search
strategies: Forward selection (start with empty feature set and add features
at each step) - Backward elimination (start with full feature set and discard
features at each step).
o Generated models are measured using a validation set or by cross-validation
(to make sure that they do not overfit).
o By using the learner as a black box, wrappers are universal and simple!
o Price to pay: highly CPU intensive!


Compute whether the model is better than the previous one or not.

Huge CPU consumption: it is better to filter first and then test (not damaging).

Ex: RFECV = Recursive Feature Elimination with Cross-Validation.
Example of a backward algorithm:

Scikit-learn: slides 43-44
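A minimal sketch with scikit-learn's RFECV (the estimator and the scoring function are illustrative choices):

from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

rfecv = RFECV(estimator=LogisticRegression(max_iter=1000), step=1, cv=5, scoring="roc_auc")
# rfecv.fit(X, y)
# rfecv.n_features_    # size of the selected subset
# rfecv.support_       # boolean mask of the kept features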


● Embedded methods (decision trees, Lasso regularisation):


It is specific to a given learning algorithm and performs variable selection (implicitly)
in the process of training the model.
The wrapper is the better (best) method but it is CPU intensive (costly). A filter can be used to make a first selection when a lot of variables are available; use a wrapper to fine-tune.
Typically, start with a filter method (to get rid of what does not seem relevant), then fine-tune with a wrapper. Decision forests: you can gain something by removing features, but it does not change much since they do it by themselves.


3.1 Designing protocols for business decision making (Nothing in the book – the most
important lecture)

3.1.1 Model design & business question

Model design
We have different classes/segments separating objects.
It is the “model design” that allows to specify the type of questions that will be answered by
the model:
● To which class belongs this object? → Classification model
● If we take an action, who will react the way we want? → Response model
● This object will have this X behaviour in Y days/weeks/months? → Predictive model

Classification models: we want to compute the probability of an object to be part of the


target. Examples: NASA pictures of celestial bodies, interest for a specific book, income level.

Discussion: Nasa pictures of celestial bodies


Imagine that you have trillions of star images, you really need a classification model to class
them in the different types. How do you do that? How do we transform an image in
dataframe?
We create a vector space that represents the images: a point in the vector space represents an image. This is in contrast with a deep network (see the big data course), which learns to recognize images based on pixel similarities.
Here: target = 7 kinds of stars → 7 models: one against all the others, etc.
How can we label the images? We want to build a function (model) that says what the label of a star is based on the vector space. We can ask humans to give the labels of some images until we have a sample that is good enough to start training the model (the only way of doing it). After that, we use decision trees (really successful).
Methodology:
Assign a class value to white (unknown) cases:
Take the sample population with known targets and then build the model to compute a
probability to be part of that class (multiclass problems). For binary targets, prediction is
always for the minority class! The algorithm searches for a model that best assign each case
to a class (-1 since the last class is the default) or computes best estimation of probability to
be part of a given class. By applying the model on unknown cases, it gives a prediction of the
class (its modality).
Points of attention:
● The sample selection must be unbiased: the sample of image that you have should be
taken randomly because if you apply any process, you will have some bias.

● The sample features values must be a “good enough” sample of possible values to
ensure a good generalization in “out of sample” universe space (if not we might
expect to have generalization problems). When you apply a model on fresh data, you
always have to check if you have out of sample data because if it happened a lot, it
could mean that you have to re-train your model.
Discussion: improving a bank credit scoring
You are responsible to improve short term loans acceptance policy of your bank. The bank
has many records of past acceptance and refusal for such loans for the last 5 years. How
would you proceed?
● How do you define the target? Probability of default given some customer. Target = somebody who defaulted (target = 1) VS somebody who took a loan but did not default (target = 0). ! That only concerns people who took a loan; that is important, because otherwise everybody in the bank would be scored, even without a loan. People who took a loan VS no loan (+ vs -).
● Which data do you want to use? Demographic data (age, education, etc.): people in the same area (suburb-level aggregation of the data) will be at the same point in the vector space (because you cannot differentiate them, due to GDPR). You can use the data of your own clients (bank accounts, deposits, investments, etc. → a lot of things are available).
One model for unknown people (only data: the questions we ask, and info based on where you live) and one for known people.
● Is there any bias that affects your sample? The bank has a policy which means that
not everybody is accepted for the loan and thus not everybody is entering in our
scope.
Among these people, there are false positives (we said it would be okay, but it is not) and false negatives (we said these guys are not good, but they are). What are the possible improvements you can make? In other words, which types of errors can be improved? We cannot improve the false negatives because we do not see these guys: they were left out by the policy → opportunity cost that creates a bias. We can improve the false positives but not the false negatives (we can improve only one type of error). We can increase the precision of the policy with a classification model (or even change the policy). Advice: drop the policy (or at least lower the threshold to exclude only the obvious people) for a given time → costly (defaults will come in), but it will give you information on the false negatives to see how to improve the policy.
Model design is really important: it is all about how I will set up my data. We don't really care about the algorithm.
Discussion: new shops
E6Mode has 12 shops in the north of Belgium, all clients use a fidelity card allowing to
receive special offers by post. They want to open 4 new shops in the south. But where? How
to do help them?
● What is the target? Target = 1 for the places where I have a store
● Which data can be used? Suburb (INS9): features = income, etc. Wallonia is not part
of the scope for the training of the model because there are no stores in Wallonia.
● Which points should receive attention? Can we try to bias the model based on
successful stores? Ask the clients which stores work better.
That is the job of the data scientist (and it is not about having a deep understanding of the algorithm).


Response models

Example: responder selection for a cross-selling campaign on Assistance Insurance.


The question here is "who should I target to get the best responders"? On which clients should I spend some money?
● Pink are customers that “naturally” subscribe to Assistance. There is always a
“natural” conversion rate, without doing any campaign.
● Red are customers that took an Assistance after a campaign: personalization &
message & package makes a difference on response rate.
● Yellow are non-clients of Assistance and were never addressed on this.
● Green are those that did not react to an Assistance campaign (= refused).
Methodology:
“Probability to respond”: Assign probability/class to respond to all clients:
Run a campaign on a sample and get the response from it (1/0). Then, build the model to
compute a probability to respond positively (minority). The algorithm searches a model that
best separate the responders (Target = 1) from non-responders (Target = 0). By applying the
model on existing clients with no Assistance, we can compute the best selection for
responders.
Points of attention:
● Unbiased: the sample used to run the first campaign must be representative of the whole client population (do not select FR clients only and then apply the model on the whole of BE!).
● Response models must be used with caution: the cycle selection-model-selection
will lead to a shrinkage of the explored universe. It creates a strong bias in favour of
certain “types” of clients.

We can reinforce the model with iterations of new responders (used to re-train the model).
! Selecting only customers or prospects that have the "profile" to respond leads to proposing this product only to those customers. Recycling the positive answers into updated models each time will, after a number of cycles, lead to missing changes in behaviour from other types of clients. How do you overcome this issue? Always add a group of people that is selected randomly in your campaign (in the first selection step: add 50% random, for example).


And what about social networks? Confirmation bias: Facebook uses response models → it sends you random info at the beginning, but depending on the responses you give, a reinforcement loop starts (you will see what you want to see).

Predictive models
To predict something that is ahead of us in terms of timeline.

Methodology:
Classification is used to create predictive model by design:
We build a predictive model by looking at some event occurrences in the event period (target period). The algorithm searches for a model in the observation period that best discriminates the event to be predicted (target = 1). By applying the model on fresh observation data, it gives a prediction for the future event period.
The length of the observation period depends on the business problem: 3 months for telecom churn, but years for bank loans, for example.
The latency period = time-to-action period. Ex: 2 weeks → I try to predict who will churn in 2 weeks.
Points of attention:
● How can you measure the event you want to capture? This is a very important
question to ask yourself. Think about churn for a Telco? Model design!!
● The target should not be related to some business process bias (previous campaign selections with strong biases: "we targeted only some regions") → be aware of bias.
● Be careful with variables computed in the observation period that are artifacts of the
target itself! Think about “Income” for credit risk estimation. ! Where are the
artifacts? Otherwise, you find “perfect” models, but it is nonsense.
● Variables must be stable in time: the way variables are computed must not change
between periods Build & Apply. Think about a “country segments” changing.
● Latency is the time-to-action period, it can also be used to make sure to remove
artifacts.
● The Apply data must always be computed in the same way as the build data! No
difference between the observation and the fresh observation periods.
● Be careful with seasonal event behaviour: predict the probability of a client buying gardening products. Example: look at the product sales distribution: there is a peak from early April until the end of June, but the product is sold all year long. Think: what will be the effect on the model design?
Exercise:
You work for a food retailer. You want to assign a long-term value class to any new client 5
weeks after the client acquisition (High, Medium, Low). Define:


● Target: how to compute it and what will be the sample


● Observation period
● Latency period
● How to use the model

For rare target events, we might want to accumulate larger target sets by extending the
event period. Problem: if the pattern is highly dynamic, we lose precision for events in the
end of the Event period → “dynamic model design”.
Points of attention: if targets are not distributed evenly in each event period, the algorithm could spot some variables that give information on the time slice used (e.g. the second slice contains the "Auto Salon" and the event is "Auto Insurance"). → In that case the model will encode an artifact linked to the model design and not to the real predictive pattern.
Points of attention:
● Predictive models can be computed on the whole customer population or on a
sample used to compute response models.
o Proba(Target=1) → Simple predictive model
o Proba(Target=1 | targeted) → Response model
● In fact, all response models are predictive by nature (a campaign has been done
previously) but not all predictive models are response models.
Think: is a churn model predictive? Is it a response model? → It is a predictive model (but not a response model).
In this setting, some variables can highly damage your model performance when you move
from one observation period to another one, due to spurious timely relationship (not
repeating themselves), or due to hidden indirect relationship between the variable and the
target. Ex: prediction of charcoal (charbon de bois) buying. Period 1 had fantastic weather: sun & 30°. People having a garden are over-represented in the target. In the Apply period, the weather is heavily cloudy. People having a garden have more money and, when it is heavily cloudy, they go on city trips instead. No charcoal at all! Owning a garden has the inverse effect from what is expected (you miss the weather variable!).

Exercise - slide 22:


You work at UPS, a delivery company which owns 50.000 small delivery trucks worldwide. Every day more than 80 of them break down on their delivery route, and this causes enormous problems of re-assigning the parcels and delivering them on time. UPS collects a lot of data on their trucks in real time: engine temperature, pressure, kms done, and many other technical measures on the engine and the drive. When a breakdown occurs, data on the breakdown problem and its resolution is captured as well. The COO wants you to reduce on-the-road breakdowns using the data collected.
I want to keep in at night the trucks that have a higher chance of breaking down. How do I do that?
● Is it possible? Yes.


● Predictive or classification model? Predictive model —> we want to predict when a


truck breaks down.
● Observation period: collect the data (columns: variables, rows: trucks) —> months (measures over the last few months: temperatures, etc.) —> to know how long the observation period should be, try different model designs and test.
● Target period (define the targets: a truck becomes a target when it breaks down): prediction relative to the observations —> 1 day (latency period = 12h: the night). Same model design for each target (dynamic model design).


3.2 The business Value Framework

Comparing & tuning models – general approach

Unbalanced densities: skewed distributions

You oversample. You build a model, and you want to have an idea of its accuracy. How? With oversampling, the model seems to learn something (fewer errors), but the generalization you forced it to have is nonsense.
Always use natural density for the test set because unseen cases come from the true
population (not from oversampling).

Comparing possible models:


● Accuracy is misleading.
● The best model to choose depends on what we want to achieve with the model.
● Using the confusion matrix to balance the tpr/fpr is useful (ROC).
● Many different models built on the same training sample could produce different
confusion matrixes.
● A common framework for comparison can be based on the expected benefit one
might achieve using a specific model.
● The same approach can be used to optimise the expected value when deciding about
threshold for a specific model (ROC curve).
Depending on the business, we will have a different cost/benefits matrix.


Model business value


We want to run a campaign for a new perfume. Cost of contact = 1€/contact - Margin if we
sell = 100€. What is the cost/benefit matrix? (Opportunity cost not taken into account: fn).
On the right: model tested on holdout data with this confusion matrix.

Comparing models through value

1) Launching a campaign


2.52€ = expected benefit per person reached by the campaign. 130 persons will be
selected for the campaign (predicted yes).
Confusion matrix: if you overfit, you don't expect to have the same performance on fresh data. But use the confusion matrix to compare with a similar perfume or with small samples (launch a small campaign to gather data).
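A minimal sketch of this computation, combining the cost/benefit matrix with a confusion matrix estimated on holdout data. The 1€ contact cost and 100€ margin come from the example above; the confusion-matrix counts below are hypothetical placeholders, not the slide's figures.

import numpy as np

# Rows = predicted (yes, no), columns = actual (buyer, non-buyer).
benefit = np.array([[99.0, -1.0],     # predicted yes: TP = +100 - 1, FP = -1
                    [ 0.0,  0.0]])    # predicted no: no contact -> no cost, no margin (fn opportunity cost ignored)
counts  = np.array([[ 40., 160.],     # hypothetical holdout confusion matrix
                    [ 60., 740.]])

expected_value_per_person = (benefit * counts).sum() / counts.sum()
profit_per_targeted = (benefit[0] * counts[0]).sum() / counts[0].sum()
print(expected_value_per_person, profit_per_targeted)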

2) Comparing 2 campaigns & applying on new data

Campaign 1: number of people selected = 1140 (predicted yes), profit/person = 2.52€, ROI
= 25175/877 = 28.7.
Campaign 2: number of people selected = 4737 (predicted yes), profit/person = 3.04 €,
ROI = 30351/4386 = 6.92.
Is the second model better than the first one? Yes, if the CEO wants maximum cash whatever the investment. Otherwise, the first model is better (better ROI). The business
has the power to make the decision (not the data scientist).

3) Impact of adding a sample (decision of the CMO)


With these new campaign settings, the first model generates a profit whilst the second one loses money...

! Data is the most important asset for us; the algorithm is really the last point. Then, model design is important because it is the way you build the process to answer a problem. If your model is not good enough, the only thing you can do is search for new data.

Model design & selection protocol – local approach

Expected value problem decomposition


The Expected Value framework allows us to reformulate a business problem into a well-formulated set of data-driven decisions. Example: we have a capitalization insurance (Branch 23 = fiscal advantage for older people) we would like to sell. We have a nice offer, but we incur a cost to target it to our customers. How should we proceed? What is the business question? Maximize the total amount of subscriptions? → Then we can just send the letter to all clients... (but why send it to clients that will not buy?) Maximize the total net earnings of this campaign? → Then we need to evaluate the cost/benefit ratio...
Always a cost/benefit problem.
How to compute the expected profit of somebody?
It is really important to have good probabilities → how much am I ready to invest in the campaign? It is based on the probability of positive and negative responses from the customers.

Reformulation: an example


What if we do not have enough past respondents on this product? How can we compute LTV(x)? Think!
● Buy the information: run the campaign without LTV and wait to get enough
conversions to build the model.
● Approximate the LTV on all customer having the product, without looking specifically
to respondent customers.
● If the product is new and hence no customer has it, approximate the LTV on a similar
product (branch 21 for example) for respondents or not
● If none of above is possible, make an approximation based on business knowledge:
Private segment → Value = 1500EUR - Mass Affluent → Value = 800EUR - Retail+ →
Value = 150EUR.
There is a good chance that the different segments will react differently to a solicitation. Hence, it might be interesting to check whether we could improve the prediction models by segmenting them on these populations...
Final protocol to solve the case according to our objective:


1. Build a model (from a random sample) for computing probability to respond positively,
given the customer profile:
● Test the following designs:
o Build a general model for probability to respond positively, given the customer
profile.
o Build 3 segmented models : Private Customers, Mass Affluent, Retails+
(better)
● Compare the 2 model designs and choose the best one (prefer global if segmented
models do not bring more than 5% expected net earnings for the campaign).
2. Build a model to approximate the LTV of respondents, which can be different from the LTV
of all customers.
● Use all customers for whom we have at least 5 years of observation in the period.
● Compute LTV by Net Present Value of all benefits of this population discounted at 5%
interest rate (already a heavy job! Each starting date is different for each customer...).
● Compute LTV per segment Private, Mass Affluent, and Retail+.
3. Compute the EP for each customer within each segment.
4. Select, in each segment, those for which EP > c (cost of acquisition).
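A minimal sketch of steps 3 and 4 above. The scored file, the acquisition cost and the column names are hypothetical; the LTV values per segment are the business approximations mentioned earlier.

import pandas as pd

scores = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "segment":     ["Private", "Mass Affluent", "Retail+", "Retail+"],
    "p_respond":   [0.02, 0.05, 0.10, 0.01],   # unbiased probability to respond positively
    "ltv":         [1500, 800, 150, 150],      # approximated LTV of respondents per segment (EUR)
})
c = 10.0                                       # hypothetical cost of acquisition, in EUR

scores["expected_profit"] = scores["p_respond"] * scores["ltv"]
selected = scores[scores["expected_profit"] > c]
# With the CEO's 15% EBITA objective, the rule becomes EP > c * 1.15:
# selected = scores[scores["expected_profit"] > c * 1.15]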

What could be wrong with this approach, given that this protocol should be applied every 3
months on fresh data?
Best prospects for campaign today will be also the best ones in 3 months... →add a business
rule: no one can be selected twice within a period of 6 months.
What if the CEO plans for this year to get at least 15% EBITA?
You should ensure that any 1€ cost will generate 1.15€ in return, and hence the selection rule becomes P(p|x) · LTVp(x) > c · 1.15.
What if there are other products to cross-sell to your customers that have even greater LTV?
You should understand if the best populations for those other products are the same as for
the B23? In fact, you need to set a global framework to balance the choices and to schedule
the campaigns. It’s a business issue...
What if the CMO plans for this year to increase the penetration ratio of capitalisation
products from 7% to 9% “at all costs” (his bonus depends on it!)?
How much are we prepared to invest (negative expected value) on all campaigns for this product to increase its penetration rate? To answer this question, you need to look at how many customers must be converted to get to that 9% and balance it with the total cost of acquisition...
Number of clients = 1.000.000 - Actual clients of B23 = 70.000 - Number of clients to convert: 20.000 – Process: sort the scored client file of clients with no B23 by P, descending, and choose from the top.

→ Note: importance of having an unbiased P. Think: what if this sum of P over all clients is < 20.000? Maybe you should schedule 12 smaller campaigns and tune each one based on the learnings of the previous ones...

Conclusion
● Think carefully at the business question.


● Include a global vision in your plan: a model is supposed to go in production and run
frequently. Thinking on how it should be used on the long term is key (do not spam
your clients!).
● Rely on the strategy of the company to understand how to use your models: financial
objectives, market penetration, societal effects, etc.
● Always think first how Data Science can help achieving the strategy of the company
and only after that, look at tactical improvements.
● Be aware that predictive modelling is just one tool to be used to improve some
aspects of the company process. Other discipline values like product excellence &
operational excellence, must be improved as well to maximize the impact of data
science contribution.


4.1 Introduction to DWH

The transactional DB nightmare


Pet store chain database
Context: 25 stores in BE, 20 animal breeds, the data model is the same for all shops and
centralized, all info necessary (price, suppliers, etc.)
Question from the COO: « Give me a clear view on the shipping costs per animal breed and
supplier, compared to our margin after discount on these animals...»

● Discount & Margin:


SELECT cost FROM animalOrderItem INNER JOIN ON animal (retrieve listPrice) INNER
JOIN ON saleAnimal (retrieve salePrice)
● Shipping cost:
SELECT SupplierId FROM AnimalOrder…


● Second path: not more complex, but not correct, because the salesperson and the ordering person are not the same → employeeID in sale and employeeID in animalOrder are not the same.

● Why is this not correct? 2 problems: the listPrice is used instead of the salePrice to compute the margin & the shippingCost is for the delivery of several animals → if the animals belong to different breeds, I count the shippingCost twice.
● Possible final answer:


Transactional DB are difficult to read with large databases because of the understanding of
the database and the business processes. It can be a real nightmare.
● I can’t find the data I need: data are scattered over many systems - many versions,
subtle differences.
● I can’t get the data I need: I do not understand ER: which tables to join and how to do
that? - I do not find any historical data to see the trends...
● I can’t understand the data I found: field names are not explicit: semantic? - Available
data poorly documented
● I can’t use the data I found: results are unexpected: there are so many errors! -
Depending on the source, I get different answers!? - Data needs to be transformed:
generalization, derived values.
● My query is refused by the computer: it takes ages to compute: how should I optimize
the query?
● Joins are very costly for the database (decrease performances).

Data science: 80% of the time is used to cope with data to make them usable (business
modelling).

Data warehouse approach


We want to create a copy of the operational data that contains highly consistent, qualitative historical information. Data store: centralises the info extracted from the different departments. Metadata = documentation (which info we have, the meaning of tables, fields, etc.).

DWH - The challenges of decision systems

● Challenge 1 – Data diversity: issue= how to integrate all these sources for decision
making? Solution = DWH: integrated, business oriented, accessible, query optimised,
stable & consistent.
● Challenge 2 – Data volume: issue = how to extract relevant info from our data?
Solution = throw data away without using it (when disk full for instance), query and
OLAP tools, data science: data mining, text mining, network mining.

DWH & Business Intelligence


Data warehouse = a copy of transactional data, specifically structured for query and analysis.
Data warehousing = architectural construct and the associated processes. It allows to collect
and clean business relevant data from a variety of internal and external systems, to
transform and integrate this data into a consistent, subject oriented information model with
a common understanding for the whole corporation, to access and explore this information,
developing insights and understanding, leading to improved decision making and control.
We don't invent data in data warehouses (but we can add more info: probabilities, etc.). We make a copy of the transactional data, specifically structured for query and analysis → understandable data: replace the country code by the country name, for example.
BI evolution:
● 60-70’s – IT service reports: hand coded, single system data, summary metrics, batch
reports (inflexible, expensive to modify).
● 80-90’s – Decision support systems: report writers, joined operating data, detailed
metrics, query tools on OLTP only, spreadsheets.
● 90-2000 – Business intelligence: OLAP/graphs, DWH, statistical metrics, query tools
with user semantic on DWH, user decides what to see.
● Future technology? Massive data analysis (Big Data – Hadoop ecosystem)
o Many different sources for customers, suppliers, services, etc.
TripAdvisor/Opendata/Weblogs/Callcenters/operations/...
o New Mined data: text: create structure from unstructured data (OCR’ed
letters, emails, blogs, etc.) - Sentiment: based on text mining and other
techniques allows to get a “feeling” about what people say about your brand


on blogs - Networks: understand how people interact (GSM, electronic social


networks) so as to best interact with them (viral marketing) – IoT Internet of
Things.
o Artificial Intelligence: replacing or improving human acts & decisions by
sophisticated & learning systems (see Google assistant).
● Future processes?
o Data/Text/Network mining integration in day-to-day.
o Save time to treat information to the advantage of decision making and lesson
learning.
o Full automation of the treatment of data towards the decision, especially
through web channels: Amazon recommendations, stock exchange automated
sell/buy systems, start-up: Kreditech, skeeled.

OLTP VS DWH
OLTP = OnLine Transaction Processing DWH = Data WareHouse
Application oriented Subject oriented (focus, don’t keep useless
things (le grenier de la grand-mère)
Used to run business Used to analyse business
Clerical user (employé de bureau) Manager/analyst
Detailed data Details to be aggregated
Current up to date Snapshot data
Isolated data Integrated data
Std repeated small transactions Ad-hoc access using large queries
Read/update access Mostly read access (batch update)
No time stamps necessary Historical data is a must

OLTP = OnLine Transaction Processing vs. OLAP = OnLine Analytical Processing:
● Real-time data vs. query of records (deferred)
● Requires fast response time vs. requires less fast response time
● Modifies small amounts of data frequently (read and write) vs. does not modify data at all (read only)
● Indexed data vs. columnar data
● Requires frequent backups vs. requires fewer backups
● No need for large storage space vs. needs significant storage space
● Simple queries vs. complex queries

Why creating a DWH?


● Save OLTP systems from CPU load of large queries.
● Create a memory of the enterprise.
● One question, one answer: solve the consistency issue.
● Queries require a different data organization: optimize the data structure for fast answers to queries.
● Remove the burden of exploiting normalized data schema again and again...
● Deliver data to the business user that fits his needs: which level of granularity do YOU
need?


● Build a comprehensive data dictionary for business users: understand what you look
at...
● Let the user define and implement his needs: use your query or OLAP tools.

From DWH to Data Marts:

Details = granularity. It takes years to build a data warehouse, but once you have it,
achieving a higher (more aggregated) level is easy. Analysis starts from the top (use cases) but building
starts from the bottom. You need to clean the data before integrating it into the DWH.

A data mart is a simple form of data warehouse focused on a single subject or line of
business. With a data mart, teams can access data and gain insights faster, because they
don't have to spend time searching within a more complex data warehouse or manually
aggregating data from different sources.

Characteristics of DWH
W.H Inmon and Ralph Kimball are fathers of DWH approach.
A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection
of data in support of management’s decision-making process.
● Subject-oriented:
o Organized around major subjects, such as customer, product, sales.
o Focusing on the modelling and analysis of data for decision makers, not on
daily operations or transaction processing.
o Provide a simple and concise view around particular subject issues by
excluding data that are not useful in the decision support process.


o Ex: sales process (the full cost also includes logistic costs), logistic process
(customer distance from the central warehouse).
● Integrated: data in a DWH is ALWAYS integrated.
o Constructed by integrating multiple, heterogeneous data sources: relational
db, flat files, on-line transaction records.
o Data cleaning and data integration techniques are applied: ensure consistency
in naming conventions, encoding structures, attribute measures, etc. among
different data sources. E.g., hotel price: currency, tax, breakfast covered, etc.
o When data is moved to the warehouse, it is converted
- Missing data: decision support requires historical data which
operational DBs do not typically maintain.
- Data consolidation: decision support requires consolidation
(aggregation, summarization) of data from heterogeneous sources.
- Data quality: different sources typically use inconsistent data
representations, codes and formats which have to be reconciled.
● Time-variant: there is always a time element in everything that we load in a
warehouse.
o The time horizon for the DWH will be significantly longer than what we would
expect to see in operational systems. Operational DB: current value data. DWH:
provides info from a historical perspective; in most cases, 5 to 10 years of data.
o Every key structure in the data warehouse contains an element of time,
mostly explicitly. Operational data may or may not contain “time element”.
● Non-volatile:
o A physically separate store of data from the operational environment.
o Operational update of data does not occur in the data warehouse
environment. It does not require transaction processing, recovery, and
concurrency control mechanisms. It requires only two operations in data
accessing: initial loading of data and access of data. A data warehouse is
NEVER updated, it is just loaded (SELECT).
o A query done 3 months ago will return the exact same value if it is run today.
+ 4 levels of data in DWH: old detail, current detail, lightly summarized data (= data distilled
from current detail data. It is summarized according to some unit of time and always resides
on disk) and highly summarized data (=data distilled from lightly summarized data. It is
always compact and easily accessible and resides on disk). Metadata (= the data providing
information about one or more aspects of the data; it is used to summarize basic information
about data that can make tracking and working with specific data easier) is also an important
part of the DWH environment.

DWH insight tools


Direct queries = SQL: allow you to access any DB you want. Reporting tools: allow us to define
what is measured, the nature of the measure depending on the context.
DWH size: quite big (billions of lines), you will never have that in operational DB. Typical
transactional DB = 100 GB vs. typical DWH = 100 TB.

OLAP
It is the idea of using multi-dimensional models with hierarchies. It is the fast and interactive
answer to large, aggregated queries.
“I have sold for 1 thousand.” Are you happy with that, or do you have questions? What did
you sell? To whom? When? (→ dimensions).
Measure = what is additive (cost, number of clients, sales, etc.).
Dimension: give the context of the measure. Dimensions have natural hierarchies. E.g., for
the region: country, city, office. You could see that as a cube with 3 dimensions: region,
product, month for example.

Typical OLAP operations:


● Roll-up (drill-up): summarize data by climbing up hierarchy or by dimension
reduction.
● Drill-down (roll down): reverse of roll-up (detailed data) from higher level summary
to lower-level summary or detailed data, or introducing new dimensions.
● Slice and dice: project and select.


● Pivot (rotate): reorient the cube, visualization, 3D to series of 2D planes.


● Others: drill across (involving more than one fact table), drill through (go through
the bottom level of the cube to its back-end relational tables, using SQL).
Example of OLAP tool: in Excel, you have the ability to create pivot tables.
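To make these operations concrete, here is a minimal SQL sketch. The table and column names (sales_fact, dim_date, dim_product, sales_amount, etc.) are assumptions for illustration, not the course schema:

-- Roll-up: climb the time hierarchy from day to month
SELECT d.month, SUM(f.sales_amount) AS total_sales
FROM sales_fact f
JOIN dim_date d ON f.date_key = d.date_key
GROUP BY d.month;

-- Drill-down: go back to the daily level
SELECT d.day, SUM(f.sales_amount) AS total_sales
FROM sales_fact f
JOIN dim_date d ON f.date_key = d.date_key
GROUP BY d.day;

-- Slice: fix one dimension value (project and select)
SELECT d.month, SUM(f.sales_amount) AS total_sales
FROM sales_fact f
JOIN dim_date d ON f.date_key = d.date_key
JOIN dim_product p ON f.product_key = p.product_key
WHERE p.brand = 'BrandX'
GROUP BY d.month;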

Star query model: you can answer any question with this star key map.
Customer has no granularity here. Could it have some? Yes, with a household level for
example. You could create a segmentation of customers, but that is a slice, not a hierarchy.
Highest granularity level for time here: daily (the points closest to the middle of the map).
When you want to design a DWH, the first thing to do is to meet all people and write those
star schemas and that will be the basic info that you will use to start modelling the
warehouse.

OLAP characteristics:
● Navigation tool: you typically explore the data to look at some hypotheses that you
have in mind. Excel can play this role today (pivot tables).
● Hypothesis-driven search: OLAP is dedicated to validating some hypothesis, for
example, factors affecting defaulters. Completely different from what we do in data
science. Ex: is the defaulting rate related to age?
● Need interactive responses to aggregate queries while exploring: must be fast, limits
the number of dimensions that can be used.
OLAP Server Architectures:
● Relational OLAP (ROLAP):
o Use relational or extended-relational DBMS to store and manage warehouse
data and OLAP middle ware to support missing pieces.
o Include optimization of DBMS backend, implementation of aggregation
navigation logic, and additional tools and services.
o Greater scalability.
● Multidimensional OLAP (MOLAP)
o Array-based multidimensional storage engine (sparse matrix techniques).
o Fast indexing to pre-computed summarized data.


● Hybrid OLAP (HOLAP)


o User flexibility, e.g., low level: relational, high-level: array.
● In-Memory
o All cubes are stored in memory (Qlik Sense, Tableau).
● Specialized SQL servers
o Specialized support for SQL queries over star/snowflake schemas.
To keep in mind: adding dimensions is costly because you split the data, you get huge tables,
and then you suffer when you want to use that kind of technology on the data.
OLAP products: Tableau, Qlik and Microsoft SQL Server are the main ones.


4.2 Dimensional modelling

Data organization in DWH: star schema


Example of star schema
Star schema = a fact table in the middle connected to a set of dimension tables.

2 types of tables:
● Central table = fact table that contains factual information, i.e., the measures that we
want to have (and to be able to manipulate).
● Other tables = dimension tables (they do not take a lot of space: tables with 50,000
lines are just tiny “Mickey Mouse” tables).
Surrogate key = technical key (different from the keys in the operational DB); all together these
keys constitute the PK of each record in the fact table.
When you do a sum in the SELECT, you need to do a GROUP BY. Example of query: SELECT
sum(sales), dayofweek FROM… GROUP BY dayofweek.
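As an illustration only (the table and column names below are assumptions, not the course example), a minimal star schema could be declared like this, with the surrogate keys of the dimensions together forming the composite primary key of the fact table:

CREATE TABLE dim_date (
  date_key INT PRIMARY KEY,        -- surrogate key, not the business date
  full_date DATE,                  -- business key
  dayofweek VARCHAR(10),
  month VARCHAR(10),
  quarter VARCHAR(2)
);

CREATE TABLE dim_product (
  product_key INT PRIMARY KEY,     -- surrogate key
  product_code VARCHAR(20),        -- operational (business) key, kept only for updates
  brand VARCHAR(50),
  category VARCHAR(50)
);

CREATE TABLE sales_fact (
  date_key INT REFERENCES dim_date(date_key),
  product_key INT REFERENCES dim_product(product_key),
  sales_amount DECIMAL(12,2),      -- additive measure
  sales_units INT,                 -- additive measure
  PRIMARY KEY (date_key, product_key)  -- all surrogate FKs together form the PK of the fact
);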

Example of snowflake schema


= a refinement of star schema where some dimensional hierarchy is normalized into a set of
smaller dimension tables, forming a shape similar to snowflake.


You should only do that if you have some info that are updated independently from others
extract table with this info and kept it apart.

Example of fact constellation:


= multiple fact tables share dimension tables, viewed as a collection of stars, therefore called
galaxy schema or fact constellation. Conformed dimension: its key is used in different fact
tables.

Star schema
= single way to represent everything we need.


Modelling dates:
Fact tables contain time-period data → date dimensions are important.

Issues of star schema:


● Dimension table keys must be surrogate (non-business related), and in most cases
auto incremental, because keys will change over time when history on dimensions is
built. The surrogate key is never the primary key used in the transactional systems!
● Need to define the level of granularity of the fact tables → major choice; take the
highest level of granularity possible. Transactional grain – finest level, aggregated
grain – more summarized.
Granularity & fact size:
● Grain increases with the number of dimension tables and with the granularity of the
dimension tables themselves (week = 1 record vs. day = 7 records).
● The number of records in the fact depends on: the granularity of all connected
dimensions and on the co-occurrence of dimensions for a single fact transaction
(Client_dim & add Shop_dim: a client buys from a single shop each time so it will not
increase the number of records in Fact table vs. Client_dim & add Item_dim: if a
client buys in general 6 items per visit, it will multiply the number of records by 6).


Star schema: SQL for OLAP implementation


Ex – Star for vehicle sales: slides 14-15-16
Slice = anything that is not an aggregation. Roll-up and drill-down are implemented by
“Group By” and aggregate functions. When you have your fact table, any query is easy.
Example (slide 17):
● Measures I want to have: sum(sales $), sum(sales units), sum(cost $).
● What I want is: SELECT store.city, store.region, promotion.priceReductionTime,
product.brand FROM… WHERE…, JOIN…, GROUP BY anything that is not a measure, ORDER BY…
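A hedged sketch of what such a query could look like; the fact and dimension names below are assumptions based on the text above, not the exact slide schema:

SELECT store.city, store.region, promotion.priceReductionTime, product.brand,
       SUM(f.sales_dollars) AS sales_dollars,
       SUM(f.sales_units)   AS sales_units,
       SUM(f.cost_dollars)  AS cost_dollars
FROM sales_fact f
JOIN store     ON f.store_key     = store.store_key
JOIN promotion ON f.promotion_key = promotion.promotion_key
JOIN product   ON f.product_key   = product.product_key
GROUP BY store.city, store.region, promotion.priceReductionTime, product.brand  -- everything that is not a measure
ORDER BY store.region, store.city;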

Building star schema: dimensional modelling


Steps in dimensional modelling:
1. Analyse the normalized entity-relationship & identify business processes to model
each separately.
2. Select n-m associative entities (bridge tables) containing numeric and additive facts:
these will be the measure of the fact tables. ManyToMany relationships are there
because there are some events between 2 tables in general.
3. Denormalize the remaining tables into dimension tables.
4. Determine granularity you want to have for the measures: which entities are linked,
which attributes for each of them, and the time granularity. The finest granularity for
a transactional database is the transaction itself.
5. Identify hierarchies in dimensions & decide on how they will be implemented. It can
be attributes: date (week/month), product (category/brand). It can be dimensions
(customer/segment).
6. Split all compound attributes to derive more attributes. Address can be a string
(good for a transactional DB), but it can be useful to have separate fields (zip/street/etc.).
7. Create the necessary categorical attributes: reference tables! Decoding: you have a
table with all codes and for each code, you will have a translation of it. Ex: age table
from 0 to 3 years old = baby, etc.
8. Add the date dimension with the right granularity.
9. Facts and dimensions are never connected with the operational keys! We use
surrogate keys.
Example - Sales process:


There are M2M relationships between sales and product, and between promotion and
product → bridge table: sales-line (coding an event between 2 entities in the system).
What if the date (= time dimension) is missing? You could use the download date: if you
download the data every night, you can add the date of the day to all new transactions.

The name of the key in the sales_fact table does not matter (clerk_key shows the link with the
employee table). In general, we keep the operational keys in the dimensions, but we do not use
them to connect to the fact table. Why do we have them then? Because, at some moment in time,
we will have to update the dimensions because of changes coming from the operational DB.
The design process
1. Choosing the business subject (data mart):
o A specific set of business questions (belongs to a BL): finance, marketing, etc.
o A set of related fact and dimension tables: costs, sales, clients, etc.
o Single source or multiple source: ERP, legacy, Excel, etc.
o Conformed dimensions: dimensions shared by some facts of the
constellation. Identifying which dimensions are conformed is the key. A dimension is
conformed only if the same PK (surrogate key) is used by the different fact tables it
connects to, i.e., the dimension is shared by the 2 fact tables. We need these
conformed dimensions, otherwise we still have silos of information.
o Typically have a fact table for each process (joins will be done through
conformed dimensions) → a data mart will contain several fact tables. Ex:
campaign selection.
2. Choosing the grain (unit of analysis): the granularity determines the level of detail of
each fact record. Better to focus on the smallest grain in general: moving from high
granularity to a lower one is always possible (it is just a group by), but the opposite
action is not possible.
3. Choosing the dimensions: a dimension table is a table connected to the fact table
with a surrogate FK (and not operational keys). Dimension tables contain text or
numeric attributes that gives the context of the measures available in a fact table
(time, shop, client, etc.). Dimension’s attributes will be the source of query
constraints. Surrogate keys will be used to store the history of the dimensions. They
are the central point of the whole dimension architecture. In general, dimensions are
1ToM relationships with fact tables, but it is not always the case. Then, a bridge table
is built. With a bridge table, a single record of the fact contains a unique key, but it
links to many records of the bridge table and the primary key of the bridge table is
the two keys of the tables joined with it. However, only one of the keys is used to join
with the fact table.
(Example slides 30-31).
4. Choosing the facts: the facts represent a process or reporting environment
interesting for some business users. The fact tables contain only measures (fact
attributes) and a primary key composed of all foreign keys connecting to dimensions.
It is important to determine exactly what it represents. Typically, it corresponds to an
associative entity (bridge table) in the ER model (→ measures). It tends to have huge
numbers of records (the max number of records is the combination of all records of
each attribute (cartesian product)).
5. Defining the measures: measures are the measurements associated with fact records
at the fact table granularity. Normally, numeric and additive. Measures that are not
additive on time dimension but additive along all other dimensions are semi-additive.
Finally, there are non-additive measures (“Value-per-unit” measures). Ex: exchange
rate. Attributes in dimension tables are constants. Facts attributes (the measures)
vary with the granularity of the fact table: when aggregations are done, they should
smoothly aggregate as well.
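For the bridge table mentioned in step 3, here is a minimal sketch with hypothetical names (balance_fact, account_group_bridge): the fact keeps a single group key, and the bridge expands that key into the many related dimension records, its PK being the two joined keys.

-- Fact references a group of customers through one key
CREATE TABLE balance_fact (
  date_key INT,
  account_group_key INT,           -- single key stored in the fact
  balance DECIMAL(12,2),           -- semi-additive measure
  PRIMARY KEY (date_key, account_group_key)
);

-- Bridge: one group key -> many customers (M:N resolved outside the fact)
CREATE TABLE account_group_bridge (
  account_group_key INT,           -- only this key joins to the fact
  customer_key INT,
  PRIMARY KEY (account_group_key, customer_key)  -- the two joined keys form the PK of the bridge
);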
Good dimension attributes:
● Clear & precise: gives a rich context to the measure, use crystal clear naming and
semantics.
● Complete: all attributes within a dimension completely describe the entity
represented by the dimension. In case of conformed dimensions, completeness can
be dictated by many business processes: color is important for marketing but not for
supply.
● Quality assured: no trash values like 99 for AGE... NAME is standardized and has no
duplicates.
● Equally available: attribute must be available for most of the objects it represents –
‘age is available for only 1% of our clients’.
● Documented: from a business point: semantic and point in time (‘risk profile attribute
of clients is updated every night and is computed based on ...’). From a technical
point (‘risk profile is loaded from Table X with process Y, etc.’).

Hierarchies in star schema: dimensional modelling


Hierarchies are the key relationship for roll-up and drill-down in OLAP environment. Ex: Sales
per Day→Week→Month→Quarter→... Hierarchies can be represented in 2 ways in
dimensions:
● Explicit Attribute sets: time dimension: date, DayofWeek, Week, Month, Quarter, ...
or product dimension: cheese, milk product, food.
● Snowflake: create a table per level and link tables by 1:N relationships. It normalizes
the attributes within a hierarchy and hence reduces attribute redundancy in a
dimension table. The user must join tables to get the right level of aggregation.

It is pointless to use it if you are not obliged to: we usually prefer to simply repeat the
values in the dimension. It could be necessary for holidays for an international business,
for example (bank holiday dates differ across countries).
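A small sketch of the difference at query time (table names are assumptions): with explicit attribute sets the month is just a column of the date dimension, while with a snowflake an extra join is needed to reach the right level of aggregation.

-- Explicit attribute set: month is an attribute of dim_date
SELECT d.month, SUM(f.sales_amount)
FROM sales_fact f
JOIN dim_date d ON f.date_key = d.date_key
GROUP BY d.month;

-- Snowflake: month lives in its own normalized table
SELECT m.month_name, SUM(f.sales_amount)
FROM sales_fact f
JOIN dim_date d  ON f.date_key  = d.date_key
JOIN dim_month m ON d.month_key = m.month_key
GROUP BY m.month_name;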

Building the “memory” of the company: dimensional modelling


The time dimension is always there and connected to the fact table (star schema =
time-variant).
Slowly changing dimensions
● Type 1: store only the current value. No history is created: all facts are like the value
was always like the last one.

● Type 2: create a new dimension record for each new value of the attribute. This will
create a new surrogate key to be used as from this moment in all new records of the
fact table, creating the historical context of the measures.

● Type 3: create an attribute in the dimension record for previous value. This will not
build a history at the level of the facts, but only in dimension.
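A minimal sketch of a Type 2 update in SQL, assuming a hypothetical dim_client table with date_in/date_out columns and an auto-incremented surrogate key (exact date syntax varies per DBMS):

-- Close the currently valid record of the client whose attribute changed
UPDATE dim_client
SET date_out = CURRENT_DATE - 1          -- date-out = today-1
WHERE client_business_key = 'C-1234'
  AND date_out IS NULL;                  -- only the currently valid record

-- Insert a new record: a new surrogate key is generated automatically
INSERT INTO dim_client (client_business_key, address, date_in, date_out)
VALUES ('C-1234', 'new address', CURRENT_DATE, NULL);
-- From now on, new fact records use the new surrogate key, which builds the history.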


Modifying a dimensional model: dimensional modelling


If a new measure must be added to a fact, given that the historical data is still available in
operational systems, adding this measure in existing facts will be easy and reveal all historical
context as well. Ex: adding the “credit_line_level” semi-additive measure in a fact can easily
be done by looking at the level associated to each account at different moment in time and
add it to the fact.
Vs. adding new attributes to an existing dimension might be more difficult. If the added
attribute does not modify the granularity of the fact table, it can be added in the dimension.
Ex: Add date of birth in Dim_Client: ok. If the added attribute increases the granularity of
the fact table, the dimension table cannot be updated since it means that each record of the
fact must be split into n records and hence the actual surrogate keys are not valid anymore.
The whole table must be re-computed. Ex: Product_family → product, Household → client,
Day → transaction, etc.

Dimensional model strengths & weaknesses: dimensional modelling


Strengths:
● Predictable and understandable, standard framework. The semantic of table links is
not hidden from the user as it is the case for relational models. Users can understand
it easily.
● Respond well to changes in user reporting needs. Most of the reports will be created
through aggregation of low-level facts. Hence any reports, given that the model uses
conformed dimensions, can be built on top.
● Multidimensional models are more efficient for queries & analysis. Fact tables have
many records (millions), and dimensions will normally be relatively small (product =
10.000, shops = 500). Joining small tables on large ones can be efficient with the right
indexes.
● There exists a number of products supporting the dimensional model. Because the
semantic of the dimensional modelling is fixed, software can leverage this knowledge
and propose tools that provide automatic mechanism for Type1, Type2 and Type3
updates, that accept explicit description of hierarchies, etc. OLAP functionalities
creation can be fully automated from a constellation of fact tables. The programming
and business development associate work is largely reduced.
Weaknesses:
● Some dimensions might be too large. Client with address→10% change/year. Large
dimensions must be cut.
● A relation between two dimensional objects only exists through their link in a fact
(you don’t get the new info if the client doesn’t come to the shop: a client with no sales
has no geo info attached).
● Normalized and indexed relational models are more flexible. Granularity choice can
be avoided. Relationships among objects do not depend on some events stored in
facts. The cost is: lose all advantages of dimensional modelling!
Overcoming weaknesses of dimensional modelling:
● Granularity: in practice, we often use a combination of both relational & dimensional
models at low levels (high granularity) and build departmental and personal data
marts using dimensional models only.
● Link across dimensional objects: a link can be created between Client and Geo as
soon as a client record is added in the dimension. The client record will eventually be
closed (date-in/out) even before it is used in the fact. A factless fact table, containing
no measures, can be used to record the link between a client and Geo. It is triggered
by a change of address.

Dimensional model example – Campaign Datamart

Design: really heavily users and business-oriented.


Entities? Dimensions? Measures? Which fact table I would like to have?

Entities (table):
● Fact table
Credit amount, TAEG: we could put them there, but in general we will not, because the
information may have changed between the selection process and the customer’s answer.
● Factless fact table: with as many lines as selected customers.
If there is a selection process and we wait for the returns, in general we end up with
2 fact tables (1 factless) → 2 separate schemas with conformed dimensions to link the two.
● Time dimension (granularity: day): time-key, date (business key), dayofweek, etc.
The date will always stay the same, but we could have a problem with bank holidays,
for example: 01/05 is a bank holiday in one country but not in another. Thus, we
also need a time-key (surrogate key).
● POS dimension: POS-key, zip, etc.
● Customer dimension: customer-key
Do we want to put a portfolio string (with 1, 0, 1, 0, 0, etc.)? Not the best solution
(queries are a bit harder). If the portfolio changes, we need to update often: the string
will often change and multiply the records in the customer dimension.
● Portfolio dimension: customer, mortgage (hypothèque), etc. → bridge table between
products and the fact table.
● Geo info dimension: kept apart from customers for the same reason as the portfolio.
● Segmentation of client dimension: idem (don’t want to increase dimension of
customers)

● Campaign, Package, Wave, Channel, etc. dimensions.


When measures or dimensions (granularity) change, you have to build another fact table
(new schema). Conformed dimensions: to join the fact tables. Inner join between the two
fact tables gives you the customers selected that have answered.

Problems with dimensional models: not adaptive to relationships discovered later
(it implies knowing everything up front).


4.3 The logistics of data

The processes:

Data transformation (ETL) – DWH processes


ETL process = Extract, Transform, and Load → capture, scrub (data cleaning), transform, load
and index.


Steps in data reconciliation:

Static extract = capturing a snapshot of the whole source data at a point in time.
Incremental extract = capturing changes in OLTP that have occurred since the last extract.
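A minimal sketch of an incremental extract, assuming the OLTP table carries a last_modified timestamp and the staging area keeps a log of previous extracts (all names here are hypothetical):

-- Pull only the rows changed since the last extract into the staging area
INSERT INTO staging_orders
SELECT *
FROM oltp_orders
WHERE last_modified > (SELECT MAX(extract_ts) FROM etl_extract_log);

-- Record this run so the next incremental extract starts from here
INSERT INTO etl_extract_log (extract_ts) VALUES (CURRENT_TIMESTAMP);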


Fixing errors: misspellings, erroneous dates, incorrect field usage, mismatched addresses,
missing data, duplicate data (happens a lot as soon as data are entered manually),
inconsistencies.
Also: decoding, reformatting, time stamping (can add a time stamp with the day of the load
(if loaded daily) for example), conversion, key generation, merging, error detection/logging,
locating missing data.
A lot of reference (decoding) tables = a lot of work
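For instance, a decoding step with a reference table could look like this (a sketch with assumed names): the code coming from the source is replaced by its business label, and codes that cannot be decoded are flagged.

-- ref_country is a reference (decoding) table: code -> readable name
SELECT s.customer_id,
       COALESCE(c.country_name, 'UNKNOWN') AS country   -- flag codes that cannot be decoded
FROM staging_customers s
LEFT JOIN ref_country c ON s.country_code = c.country_code;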

Record-level: selection (data partitioning), joining (data combining), aggregation (data


summarization).
Field-level: single-field (from one field), multi-field (from many fields to one, or one field to
many).
Example – single field transformation:


Example – multifield transformation:


Data is never physically altered or deleted once it has been added to the store.
The 4 steps in loading the DWH from the staging area:
1. Update reference tables: make sure all reference tables are updated before loading
dimensions (what if Promotion 10 “Win a Bongo WE” is not there...). Each Reference
table MUST have an “owner” to make sure it is properly managed.
2. Update all dimensions: prepare each dimension record as a vector representation
based on the copy of the operational data and compare it with the dimension. If the
record is unchanged, do nothing; else close the current record (date-out = today-1)
and use the current vector to create a new one (with date-in = today). Before the
update, many cleaning steps are done.
3. Update lower-level facts (with the highest granularity): same as for dimensions, a
vector is created from the operational data. It is loaded based on date of operation or
extraction. Rollback: capacity of de-loading erroneous data of a certain day.
4. Update highest level datamarts: all other data marts are updated by simple
aggregation based on removing some dimensions, or removing some attributes part
of a hierarchy (generalized dimensions).
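Two of these steps can be sketched in SQL (table and column names are assumptions): the rollback of an erroneous load of a given day (step 3) and the aggregation into a higher-level data mart (step 4).

-- Step 3 rollback: de-load the erroneous data of a given day, assuming the fact carries a load_date
DELETE FROM sales_fact
WHERE load_date = DATE '2023-03-01';

-- Step 4: refresh a weekly data mart by simple aggregation (the shop dimension is dropped here)
INSERT INTO sales_week_mart (week_key, product_key, sales_amount, sales_units)
SELECT d.week_key, f.product_key, SUM(f.sales_amount), SUM(f.sales_units)
FROM sales_fact f
JOIN dim_date d ON f.date_key = d.date_key
GROUP BY d.week_key, f.product_key;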

Architecture – DWH processes


1. Independent Data Mart (chaotic development):


2. Dependent Data Mart and Operational Data Store (the right architecture):

Right structure: single load and aggregation in the storage area.

Example – Bank:


(SAS = The statistical tool for data science (years ago))

Project flow – DWH processes


The business dimensional lifecycle:


QlikView story (video)


Ad for QlikView (OLAP technology to build data cubes)
What is associative search? Data cube just by organization of measures in fact tables and
using hierarchies.
You cannot get at the data just by having the right technologies like that (she is lying); you need
to have a DWH (centralizing the info, building dimensions and cubes of data: all this work has to
be done, but she is omitting it).
People who build software say that technologies solve our problems with data, but it is not true.
Technology is not the problem (it could be: if you have a lot of data, you should have the right
technology). The biggest work we have to do is data appropriation (preparing the data).
Think for yourself; vendors are selling dreams.
QlikView is a very good tool, but only once the data are prepared.
4.4 The value of the 3 data storage paradigms

3 data paradigms

1. Data in our IT systems


● Data is structured
● It reflects what happens in the real world
● Relational model supporting operations in the company
● Optimised for business processes (updates)
● Easily accessible (SQL)
● Tools: MySQL, Oracle, Microsoft SQL Server, IBM DB2
Open source community produces fantastic software (MySQL better than the others).

2. Data in DWH
● Data is structured
● Dimensional modelling
● Creates memory of our companies
● Done to support business decisions (select statements and not updates)
● Extremely easy to use (query tools friendly)
● Tools: Teradata, Microsoft SQL DWH, Amazon redshift, Apache Hbase (open source)

3. Big data (lake)


● Not very organized, data might not be structured
● New type of data
● Flat files at bottom: most of the files are just long strings
● Optimised for fast storage
● Distributed computing: not one single computer but a big system that manages the
info across hundreds of small computers (redundant copies of data in case one
computer crashes)
● Data is processed where it is
● Many cheap CPU’s
● Fault tolerant
● Open source
● Accessed mainly through programming
● Tools: Hadoop HDFS (open source), Google BigQuery, Amazon Dynamo,
MongoDB (open source)
Data lake = place where you just pour in the information as strings; you don’t even know what
you put in it (just 1s and 0s) because you don’t have the time for that.
Purpose: receive data without any constraints at all, just have the time to store it.

Characteristics of systems:
● Non distributed RDBMS (Relational DataBase Management System): ACID
transactions
o Atomicity: if an update cannot be completed, the engine rolls everything back →
either it does the whole process, or it does nothing (you never stay in between).
o Consistency: only valid data are written.
o Isolation: one operation at a time → operations behave as if executed in sequence
(no simultaneous update conflicts).
o Durability: once committed, it stays that way (persistence).
● Distributed data systems: CAP theorem
o Consistency: all clusters have the same copies of data (can take some time).
o Availability: clusters always accept reads and writes.
o Partition tolerance: at any moment in time, you guarantee that the system
will work even when there are some network failures.
Distributed systems have some constraints.
Note that:
● CAP theorem is related to distributed implementations where data and computation
processes are done in parallel in an asynchronous way.
● ACID is implemented using sequential operations, possible with some parallelisation
(multithreading) using the same shared memory, in a synchronised way.
● Non distributed system = all the parts of the system are in the same physical location
vs. distributed system = parts of the system exist in separate locations.
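Atomicity, for example, is what a transaction block gives you in a non-distributed RDBMS. A minimal sketch (the account table is an assumption; transaction syntax varies slightly per DBMS):

-- Either both updates are committed, or neither is (all-or-nothing)
BEGIN;
UPDATE account SET balance = balance - 100 WHERE account_id = 1;
UPDATE account SET balance = balance + 100 WHERE account_id = 2;
COMMIT;   -- on any error before this point, a ROLLBACK restores the initial state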
What & where?


Distributed systems are not there for operations (because you cannot have one store showing
different stock levels than another just because updates have not been propagated yet) but
only for management and decisions.

Data governance

Data governance is a collection of practices and processes which help to ensure the formal
management of data within an organization. It deals with:
● Compliance, security and privacy: GDPR compliance, security of the data and access
to it, including anonymization of personal data.
● Integrity & quality: ensuring that data is correct and unambiguous, definition and
implementation of a Master Data Management (MDM) process.
● Availability & usability: easy access of the data for all business needs, in a format that
is usable for all.
● Roles and responsibilities.
● Overall management of the internal and external data flows within an organization.
You need to have a team dedicated to the governance of the data, to make sure that people
have the right permissions to access the right data and that data transactions are done properly.

Data lake & DWH


Traditional conversation:
● Kimball vs. Inmon
● Dimensional vs. 3rd normal form
● What hardware do we need?

● Which RDMS should we use? Oracle, SQL Server, MySQL?


● Which ETL tool should we use?
● Which BI tool should we put on top? Tableau, Qlik?
New conversation:
● Do we need a data warehouse at all?
● If we do, does it need to be relational?
● Should we leverage Hadoop & NoSQL?
● Which platform and language are we going to code in?
● Can we use a complete open software platform? Free of license fees? What is TCO
(Total Cost of Ownership)?
So why change?
New technologies are great, but what drives our adoption of new technologies and
techniques? Data has changed (semi-structured, unstructured, sparse and evolving schema)
and volumes have changed (MB to TB to PB workloads) AND MOST IMPORTANTLY:
companies that innovate and leverage their data win! So, we want to use all of them...
Complaints about DWH approaches: “Onboarding new data is difficult!”, “Rigidity and data
governance”, too slow to follow evolving business requirements: “I need to analyse some
new source” → apply all the data governance processes? Conform and analyse the data? Load it
into dimensional models? Build a semantic layer nobody is going to use? Create a dashboard
we hope someone will notice?
... and then you can have at it 3-6 months later to see if it has value!
+ Surveys show that the majority of data analytics projects fail… Data is just really hard,
especially without the right strategy. In fact, you need to do some work for businesspeople
to understand what they have in their hands.
Traditional warehousing all wrong? No!
The concept of a data warehouse is sound:
● consolidating data from disparate source systems
● clean and conformed reference data
● clean and integrated business facts
● data governance
However, we should recognize that the EDW can’t solve all data requests & issues in the
organization!
People are fooled by technologies.
Issues left unsolved by a DWH:
● Extremely high data size (PB)
● High throughput & real time data flows
● Real time decision processes
● Data type not supported by traditional facts and dimensions: network analysis, text
analytics, voice analytics, image analytics, raw device data (IoT), etc.
So, what is missing?
The data lake: a storage and processing layer for all data. It stores anything (source data,
semi-structured, unstructured, structured), keeps it as long as needed and supports a
number of processing workloads.
It’s where Hadoop and distributed systems help!
The Hadoop data lake: the data lake should be seen as a holistic data management
strategy. Data governance is tunable. Some analytics are suited for the data warehouse,
while many are not. The Hadoop data lake has different governance demands at each tier.


Only top tier of the pyramid is fully governed. We refer to this as the trusted tier of the big
data warehouse.

Big data warehouse: data is fully governed, all data is structured, partitioned/tuned for data
access, governance includes a guarantee of completeness, accuracy, and privacy, consumers:
data scientists, ETL processes, applications, data analysts, and business users.

The refinery: the feedback loop between data science and data warehouse is critical.

An example for a utility company:


When to use Hadoop & when to use the DWH?


4.5 Data governance and GDPR

We really need to comply with GDPR principles today.

GDPR = General Data Protection Regulation – May 2018


GDPR is a new set of rules that unifies the data privacy laws across EU countries and
strengthens the rights of European citizens to protect their information. The regulation came
into effect on May 25, 2018 and is applicable to all EU countries without national ratification.
Complying with this European regulation on data protection means ensuring data is collected
legally, informing users of how it is treated, and keeping data secure (i.e., protected from
breaches). GDPR is in the B2C world for European citizens. You are not concerned if your
clients are other businesses (B2B). It will soon be joined by a new law, the ePrivacy
Regulation, that will regulate the electronic communication with reference to GDPR
principles (cookies, how you can track people on the Internet, etc.).
PII = Personally Identifiable Information = piece of info you can use to identify someone.
Think of it like a puzzle – even if you can’t make out the picture with one piece, that piece
can be used along with others to form the complete image. The same concept applies to
personal information. It could be: names, driver’s license or ID numbers, phone numbers,
medical records, biometric data, birth locations, social security numbers, addresses, emails,
birthdays, DNA, license plate numbers. Special consideration should be taken when
collecting PII that the GDPR defines as “sensitive” – such as an individual’s race, ethnicity,
sexuality, political beliefs, biometric or genetic data, and trade union membership.
Who does GDPR apply to?
GDPR applies to businesses that target EU data subjects. The law is applicable not only to
organizations operating within the EU, but also to those worldwide that target individuals in
the EU. Any European citizen that has their data collected by a company is a data subject
under the GDPR, and the company that processes their data is known as the data controller.
If a third party is employed to handle data processing (such as a payroll company), they are
the data processor (with the same responsibilities as the data controller).
What are the consequences of violating the GDPR regulation?
Companies that violate the EU General Data Protection Regulation face a maximum fine of
€20 million ($23 million) or 4% of their annual global turnover (whichever is higher). The first
significant penalty was issued in January 2019, when Google received a GDPR fine of €50
million for not fully informing users how their data would be used when they set up its
Android operating system. In 2019, the UK Information Commissioner’s Office (ICO) issued
penalties against British Airways and Marriott ($230 million and $123 million, respectively)
for allowing user data to be compromised in data breaches.
7 key principles of GDPR
They dictate how businesses should process data in order to conform the new EU data
protection standards.
● Lawfulness, fairness, and transparency: data processing must be legal, and the
information collected and used fairly. Users must not be misled about how their data
is used.
● Purpose limitation: you can use the data only for the purpose defined at the
beginning. The purpose must be clear from the start, recorded, and changed only if
there is user consent.


● Data minimization: we would like to collect as much info as possible because we don’t know
what will be useful, but we can only collect the data required for the stated processing
purpose.
● Accuracy: reasonable steps must be taken to ensure the collected data is accurate
and up to date.
● Storage limitation: data can be kept for a while but not forever (not longer than
necessary) → approximately 6 years in Belgium.
● Integrity and confidentiality: appropriate cybersecurity measures must be put in
place to protect personal data being stored.
● Accountability: organizations are accountable for how they handle data and comply
with the GDPR.
GDPR compliance requirements
Depending on the type of data you collect and whether you are a processor or controller, you
may have to comply with some or all these changes.
● Data breach notifications: the controller should communicate the breach to the
supervisor authority asap in case of personal data breach (where feasible, not later
than 72 hours after having become aware of it).
● Data Protection Impact Assessments (DPIAs): a DPIA is an evaluation of the effect of
a data processing activity on the security of personal data. Article 35 requires
controllers to conduct DPIAs in the event that one of their data processing activities is
“likely to result in a high risk to the rights and freedoms of natural persons.” Note this
case: “automated processing for purposes of profiling intended to evaluate personal
aspects of data subjects”
● Privacy by Design (PbD): the controller shall implement appropriate technical and
organisational measures for ensuring that, by default, only personal data which are
necessary for each specific purpose of the processing are processed. This practice
should ultimately minimize data collection. It argues for privacy and security to be
fully integrated into the design processes, procedures, protocols, and policies of a
business. There are seven major principles that guide this concept: privacy should be
the default setting, privacy should be proactive, not reactive, privacy and design
should go hand in hand, privacy shouldn’t be sacrificed for functionality, PbD should
be implemented for the full life cycle of the data, data collection operations should be
fully visible and transparent, user protection must be prioritized.
● Consent acquisition: consent should be given by a clear affirmative act establishing a
freely given, specific, informed and unambiguous indication of the data subject’s
agreement to the processing of personal data relating to him or her, such as by a
written statement, including by electronic means, or an oral statement. Controllers
are no longer able to use opt-out or implied methods of consent — such as pre-ticked
boxes, silence, or not stating that they do not want ... Specificity and unambiguous
means that the usage of data collected is clearly stated and the consent is given only
for that usage.
● Data Subject Access Requests (DSAR): EU citizens have 8 rights over data collected
from them:
1. The right to be informed: data subjects should be able to easily learn how their
data is collected and processed.
2. The right of access: data subjects have the right to request to access any data that
has been collected from them.


3. The right of rectification: data subjects have the right to request to change
inaccurate or incomplete data that has been collected from them.
4. The right to erasure: individuals have the right to request the deletion of their data,
also referred to as the ‘right to be forgotten’.
5. The right to restrict processing: individuals have the right to request to block
specific data processing activities.
6. The right to data portability: individuals have the right to request to retain and
reuse their data for other services.
7. The right to object: data subjects have the right to object to the use of their data
for certain processing activities.
8. Rights in relation to automation: data subjects have the right not to be subject to a
decision based solely on automated processing, including profiling, which produces
legal effects concerning him or her or similarly significantly affects him or her.
To exercise these rights, data subjects can make direct requests to controllers,
whether it be through a phone call, email, or web form. These requests must be
addressed quickly as the GDPR only gives controllers 30 days to respond.
● Appointing a Data Protection Officer (DPO): a DPO must be appointed if the
processing is carried by a government entity, the controller/processor regularly
collects and processes a large amount of data, the controller/processor processes a
variety of sensitive personal information. A DPO plays several key roles in your GDPR
compliance plan. They are responsible for: educating controllers and processors on
how they must comply with the regulation, monitoring compliance efforts, offering
advice on data protection assessments, acting as the point of contact for the
supervisory authority.
What are the lawful bases for data processing?
Article 6 of the GDPR outlines six lawful bases for data processing:
1. Consent of the data subject
2. For fulfilment of a contract
3. Legal compliance
4. To protect the vital interests of the data subject
5. Necessity for carrying out a task that is in the public interest
6. Necessity for the purposes of legitimate interests of the data controller or third party
Legitimate Interest
Legitimate interest refers to any interest that provides a benefit to one or more parties
involved in the processing of data. Legitimate interests can be personal, commercial, or even
societal interests. For example, if you process data in the interest of your business
operations, your activities may fall under GDPR legitimate interests
● Fraud detection and crime prevention
● Network and information security
● Processing employee or client data
● Direct marketing, although some data processing activities for marketing, like sending
marketing emails, require user consent.
Legitimate interests can be invoked except where such interests are overridden by the
interests or fundamental rights and freedoms of the data subject.
GDPR worldwide
China is clearly not using GDPR (cameras everywhere).


Over 100 countries have now implemented new data protection laws to regulate the flow of
personal data, and there is more legislation to come. One such law is the California
Consumer Privacy Act (CCPA), in effect since January 1, 2020. This law is already controversial
and has forced many US companies to rethink their data collection strategies.
US companies had varying responses to the GDPR. Since the GDPR legislation came into
effect, over 1000 major US publications have blocked users who are EU citizens, rather than
risk noncompliance. Based on an Ovum report commissioned by Intralinks, 52% of US
companies think that they are likely to be fined for noncompliance.
GDPR: good or bad?
Companies such as Facebook have a lot of power. But when you use it, you consent to give
your data, so Facebook has legitimate interest.
GDPR has damaged the competitive position of European companies: costly process changes
around data collection and processing; data collector companies (such as Bisnode in Belgium)
had to dramatically reduce the scope of their data delivery, reducing the ability of all companies
to reach new clients, to the advantage of companies like Facebook or Google; and added
resistance within our companies to taking full advantage of their data to improve customer
intimacy. In the future, it might turn into a competitive advantage: if all customers around
the globe ask for more protection of their privacy, European companies would then be in
pole position...

Project presentations - takeaways:


● Could prefer to not use the continuous variables (binarize everything)
● Linear regression: fast computing!
● Don’t keep the other targets (except from C3 and C6) because it would give the
wrong information.
● Measures of quality have to be done on the natural density (not on an oversampled test
set) → oversample only the train set and not the whole dataset; otherwise, you will
be completely wrong (because the measures would be computed on oversampled data as well).
● If you inflate the number of positives for manga and pop rock, you will change the
comparison based on the probabilities, because the probabilities will be shifted by
the oversampling.

Questions exams BI

1) What is a wrapper? What is used for? What are the advantages in regards with
correlation coefficient?

A wrapper is a method of feature selection: a subset of the original features is selected
(objective function: here, model performance). The ML algorithm itself is used to evaluate
the model performance as the set of features is modified.
2 methods: backward (start with the full set of features and remove some at each step)
or forward (start with an empty set and add features at each step).
It is used to automatically reduce the dimensionality of the dataset before applying a
model. Why? It helps to reduce noise (artifact relationships inside the dataset →
overfitting), to counter the curse of dimensionality (some models will not be efficient
on too large a dataset → accuracy and efficiency can degrade), it increases business
understanding (by seeing which features are more related to the target) and it allows
removing irrelevant and redundant features that can confuse the algorithm.
Advantage over the correlation coefficient (a filter method): the wrapper looks at
interactions between features, whereas a filter evaluates each feature in isolation.

2) Explain this schema:

Learning curve
● What are training instance? Why does dt surpass logistic?
Training instance = data in the training set.
As the training size grows (x axis), generalization performance (y axis) improves.
Logistic regression has less flexibility, which allows it to overfit less with small data,
but keeps it from modeling the full complexity of the data. Tree induction is much
more flexible, leading it to overfit more with small data (overfitting because it grows
the tree until having pure leaf nodes), but to model more complex regularities with
larger training sets (more flexible: several decision boundaries).
● Linear, nonlinear? Why?
DT = nonlinear (splits the instance space with several decision boundaries), logistic =
linear (the ln of the odds of the nonlinear probability function is linear → linear decision boundary).
● Does the DT have pruning here? Why?

No, because we see it overfits at the beginning (it would not if pruned), since it grows
the tree until having pure nodes.

3) What are the 5 steps of designing a star schema?

- Choosing the business question/subject → fact tables (one for each process) and
dimension tables (and conformed dimensions).
- Choosing the grain: level of details of fact records. Focus on the smallest grain.
- Choosing the dimensions (attributes to give the context to measure):
granularity + time dimension + surrogate key (not operational keys) for history
of dimensions.
- Choosing the facts: measures + PK = all surrogate keys of dimensions.
Typically, bridge table in ER model.
- Defining measures: numeric and additive in general

4) How do you get overfitting on Decision Tree? How do you treat it? Explain the
various methods (errors measures)?

DTs overfit because they are highly flexible (complex): the algorithm grows the tree until it
has pure nodes (at the end, it learns things from the training set that are not generalizable
= noise). We should control the complexity of the model, by limiting the number of nodes
for example, or by pruning the tree.
When to stop splitting to avoid overfitting? Pruning is a mechanism that
automatically reduces the size and complexity of a DT to grow a tree with the best
generalization capacity.
- Pre-pruning (early stopping): χ² test, Gini or information gain threshold,
classification error.
- Post-pruning: fully grow the tree and then prune: replace a subtree by a terminal
node if, when pruning, the error reduces more than a defined threshold →
error-rate estimation: optimistic vs. pessimistic approach / minimal cost-complexity
pruning.

5) What are the interest points of DT? Compare them with the interest points of
Logistic Regression?

- DT: flexible, thus can represent complex models well; easily understandable
(business sense); handles categorical & numerical features; risk of overfitting (from
overtraining) + unstable; not searching for the optimal model; includes embedded
feature selection and missing-value treatment; class probabilities are not reliable;
sensitive to unbalanced classes; nonlinear; robust to outliers; computationally expensive.
- Logistic regression: less in danger of overfitting (global decision surface; overfitting
mainly comes from high dimensionality); only numerical features; does not accept
missing values; highly influenced by noise; need to add regularization/dimensionality
reduction; gives probabilities; linear; CPU intensive.

6) “What is data governance? How to combine Data Lake and DWH? I started talking
about the data lake, and he asked to describe + give example of the “pyramid”.

Data governance is the collection of practices and processes which help to ensure the
formal management of data within an organization. It deals with compliance, security
and privacy, integrity and quality, availability and usability, roles and responsibilities,
overall management of the internal and external data flows within an organization.
What is missing to a DWH is a Data Lake = storage and processing layer for all data.
The Hadoop Data Lake: different governance demand at each tier.
Pyramid: landing area: raw data collection – data lake: turn data into information –
DS: agile business insight – big data warehouse: user community queries and
reporting.
Why? When you need all types of data (which cannot all be supported by the DWH), and you
keep the DWH because of the memory of the company, the consistency issue, subject
orientation, large queries, etc.
How? Data lake (data scientists can use the data) → ETL (to transform unstructured data) → DWH.

7) What can cause overfitting with regression models (including the logistic)? What
are the solutions to address this?

For regression models, overfitting is caused by high dimensionality of data.


There are many solutions to address this:
- Regularization: Lasso or Ridge. Regularization adds a term α·R(β) to the objective
function that penalizes the magnitude of the regression coefficients β. Lasso: L1 norm
(explicit variable pruning), Ridge: L2 norm (no explicit variable pruning, but some
coefficients get low weights), elastic net regularization: a compromise between Lasso
and Ridge.
- Dimensionality reduction: feature selection (wrapper, filter, embedded) / feature
extraction (PCA).

8) You have a fact table with historical data from 3 years ago until now. This fact table
is connected to some dimension tables. Imagine you want to add a new attribute to
one dimension table? What will happen? Which elements do you have to be careful
with? (Explain the different situations possible)

You should pay attention to the granularity impact of the new element:
- If the element increases the granularity of the fact table, it is not possible
because it means that each record of the fact must be split into n records and
hence the actual surrogate keys are not valid anymore.
- If the element does not modify the granularity of the fact table, it can be
added.

9) It’s a case a bit similar to the capitalization insurance (the business value
framework). A university wants to send letters to their alumni to ask a donation
from them for a project (150€ donation) and with a cost of 10€ per letter. You have
to explain how you will target the people that you will send the letter to (you have
some data on them) and all the steps from the business value framework
(probability and profit matrixes, expected profit, etc.).

Business value framework (response model):
- Build a model (from a random sample) for computing the probability to respond
positively given the customer profile. Build a general model and a segmented
one → compare the 2 model designs and choose the best one.
- Cost/benefit matrix + expected value + probabilities.
- Build a model to approximate the LTV of respondents (with data of 5 years of
observation) + for each segment.
- Compute the EP for each customer within each segment.
- Select in each segment, those for which EP > cost of acquisition. (No one can
be selected twice). TARGET.
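As a simple illustration with the numbers of the case (ignoring the LTV refinement above): with a 150€ donation and a 10€ letter cost, the expected profit of mailing alumnus i is EP(i) = p(i) × 150 − 10, so an alumnus is worth targeting as soon as p(i) > 10/150 ≈ 6.7%.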

10) How do you compute the score on a regression model? What do you need to have
as classifier? (Decision rule)

Linear: y = β0 + β1·Xi → linear relation between X and Y. Fit: maximizing the R² measure =
minimizing the RSS (good fit if the error is small).
Logistic: no analytic solution (optimization algorithm).
You need an objective function to fit the model and a decision rule (e.g., a probability
threshold) to use it as a classifier.

11) What are the strengths and disadvantages of dimensional modelling Vs 3NF?

Strengths:
- Predictable and understandable, standard framework: semantic not hidden ><
3NF
- Respond well to changes in user reporting needs: aggregation of low-level
facts, hence any report can be built on this.
- Multidimensional models are more efficient for queries and analysis: joining
small tables (dimensions) on large ones (fact) can be efficient with right
indexes. >< 3NF: joins are costly (decrease performance).
- There exists a number of products supporting the dimensional model
(semantic fixed).
Weaknesses:
- Dimensions might be too large.
- Relation between dimensions exists only through fact
- Normalized and indexed relation models are more flexible (3NF)
Overcoming weaknesses:
- We use a combination of both relational and dimensional models at high
granularity.
- Link across dimensional objects: factless fact table.

12) What is variance and how is it computed? To which measures is it applicable? How
do you explain it to a random person (= someone not knowing DS)?

Variance is an important measure of the quality of a model in general. It tells us how
much chance (the particular data sample used) influences the measured performance of any
prediction. The best model is one with high accuracy and low variance. To a layperson:
a low-variance model would give you roughly the same quality of predictions if you ran
the analysis again on a slightly different sample, while a high-variance model's results
depend a lot on luck.
It is computed through cross-validation (look at variance of results between folds);
you separate your dataset into k-folds, train model on k-1 folds, test it on the last one,
and then redo it (for each combination of folds). Then you can compare the
performance between each combination (AUC, accuracy, etc.) and look at the
variance.
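A minimal sketch of this procedure, assuming scikit-learn (the dataset, the learner and the number of folds are arbitrary choices for illustration):

```python
# Sketch: estimating the variance of a model's performance with k-fold cross-validation.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# 10 folds: train on 9 of them, test on the remaining one, repeat for every fold.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y,
                         cv=10, scoring="roc_auc")

print("AUC per fold:", scores.round(3))
print("mean AUC:", round(scores.mean(), 3),
      "| std across folds (variance indicator):", round(scores.std(), 3))
```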

13) How do you use a classification model to make predictions and what are the points
of attention? I explained the observation period, latency period, ...

Classification model = want to compute the probability of an object to be part of the


target.
Methodology:
- Create a vector space of your dataset.
- Ask humans to label some of the data to have a sample to start training the
model.
- Then apply the model on unknown cases to make predictions.
Points of attention:
- Sample selection must be unbiased (take it randomly).
- The sample must be a good enough sample of possible values to ensure a
good generalization in “out of sample” universe space.

Response model: who will react the way we want to an action?


Methodology:
- Run a campaign on a sample and get the response from it.
- Build a model to compute the proba to respond positively.
- Apply it on new data.
Points of attention:
- Selection bias: the first sample used to run the campaign must be
representative of the whole client population.
- Caution: cycle selection-model-selection will lead to shrinkage of the explored
universe –> bias in favor of certain types of clients. We can reinforce the
model by adding new responders, randomly selected.

Predictive model: to predict something that is in front of us in term of timeline.


Methodology:
- Event period: looking at some event occurrences in the event period.
- Observation period: the algorithm searches for a model that best discriminates
the event to be predicted.
- Latency period = time to action
- Apply the model on fresh observation data to make predictions.
Points of attention:
- Model design really important: how to measure the event to capture?
- Selection bias: target should not be related to some business process bias.

- Beware (!) of variables that are artifacts of the target itself (otherwise you
find a nonsense "perfect" model).
- Variables must be stable in time (computed the same way between build and
apply).
- Latency can be used to remove artifacts.
- Apply data must always be computed in the same way as the build data.
- ! Seasonal event behaviour

14) What are the different ways to keep history in DWH?

Type 1: no history
Type 2: new dimension record for each new value (new surrogate key), valid_from
and valid_to attributes needed.
Type 3: an extra attribute in the dimension record stores the previous value (limited
history, kept only at the dimension level; see the illustrative sketch below).
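An illustrative sketch of a type 2 dimension, assuming pandas; the table and column names (surrogate_key, customer_bk, city, valid_from, valid_to) are invented for the example, not taken from the course.

```python
# Sketch: a type 2 slowly changing dimension for a customer who moves city.
import pandas as pd

dim_customer = pd.DataFrame([
    # Same business key (C042), two surrogate keys: one row per historical version.
    {"surrogate_key": 1, "customer_bk": "C042", "city": "Brussels",
     "valid_from": "2020-01-01", "valid_to": "2022-06-30"},
    {"surrogate_key": 7, "customer_bk": "C042", "city": "Antwerp",
     "valid_from": "2022-07-01", "valid_to": "9999-12-31"},  # current version
])
print(dim_customer)

# Fact rows recorded before 2022-07-01 reference surrogate_key 1, later ones reference 7,
# so historical reports keep showing the city that was valid at the time of the transaction.
```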

15) Data Science Case: You are a data scientist, and your client is a hospital, they
developed a new medicine for patients suffering from heart disease. When building
your model to classify potential patients that may have the disease, we ask you if
you would give more importance to precision, recall or none of them?

We want to minimize false negatives (people who are sick but not detected), which matter
more here than false positives (people who are not sick but receive the medicine, a less
severe error).
Recall = TP/(TP+FN)
Precision = TP/(TP+FP)
So we want to maximize recall: more importance is given to recall.
(Alternatively, focus on precision if we want the selected patients to be almost certainly
sick and minimize the risk, e.g. for insurers.)
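A minimal sketch with made-up labels, assuming scikit-learn, just to make the two ratios concrete:

```python
# Sketch: recall vs precision computed from a tiny set of invented predictions.
from sklearn.metrics import precision_score, recall_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]  # 1 = patient really has the disease
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]  # model predictions

# Recall = TP / (TP + FN): share of sick patients we actually detect -> 3/4
print("recall:", recall_score(y_true, y_pred))
# Precision = TP / (TP + FP): share of flagged patients who are really sick -> 3/4
print("precision:", precision_score(y_true, y_pred))
```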

16) Data Science: Why do we use surrogate keys in dimensional models?

A surrogate key uniquely identifies each entity in the dimension table, regardless of
its natural source key. It is typically a simple auto-incremented integer generated for
every new entity. Surrogate keys are necessary to handle changes in dimension table
attributes: they allow us to store the history of dimensions in the DWH (a new surrogate
key is added for each change), which we cannot do with the business keys of the
transactional model.

17) Overfitting: what is it? How to avoid it for decision tree, SVM and logistic
regression?

Overfitting occurs when the model classifies seen cases much better than unseen cases,
which means that the generalization of the model is poor. There is a trade-off to make
between fitting the training data (model complexity) and generalization.
DT and logistic: see above.
SVM: overfitting is implicitly reduced by explicitly trying to maximize the margin, hence
preserving the generalization ability; it can be further limited with dimensionality
reduction or regularization.

18) What are the methods of variable selection?

Goal: find the optimal feature subset.

Filter: pre-processing step (scoring function), independent of the classifier used.
Examples of ranking criteria: Pearson correlation, information theory (mutual
information), SVC.
Wrapper: see above.
Embedded: specific to a given learning algorithm; performs variable selection
(implicitly) in the process of training the model. Ex: Lasso, DT.
Best practice: start with a filter and then fine-tune with a wrapper (a minimal sketch
follows below).
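A minimal sketch of the "filter then wrapper" idea, assuming scikit-learn (the dataset, the mutual-information filter and the RFE wrapper are illustrative choices, not prescribed by the course):

```python
# Sketch: filter step (univariate scoring) followed by a wrapper step (recursive feature elimination).
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=50, n_informative=8, random_state=0)

# Filter: rank features with mutual information, independently of any classifier; keep the top 20.
X_filtered = SelectKBest(score_func=mutual_info_classif, k=20).fit_transform(X, y)

# Wrapper: recursively eliminate features using the classifier itself; keep the top 8.
wrapper = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=8)
X_selected = wrapper.fit_transform(X_filtered, y)

print("shape after filter:", X_filtered.shape, "| after wrapper:", X_selected.shape)
```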

19) DWH: why should we implement a DWH?

Because transactional DBs become a real nightmare when you want to analyse a lot of data:

- Data scattered over many systems: many versions, subtle differences.
- Business users do not understand the ER model; historical data is not always kept.
- Semantics not explicit, available data poorly documented.
- Unexpected results: errors, different answers to the same question.
- Refused or very slow queries (structures not optimised for analysis).
- Joins are costly.
A DWH is business oriented (+ consistency), supports large queries (saves the operational
systems from CPU load), keeps historical data (memory), has a data structure optimised
for fast answers to queries, stores data that fits the needs (at the granularity you
need) and provides a data dictionary for business users.

DWH: subject-oriented, integrated, time variant, non-volatile.

20) You work at UPS, a delivery company, which own 50.000 small delivery trucks
worldwide. Everyday more than 80 of them break down on their delivery way, and
this causes enormous problems of re-assigning the parcels, and delivering them on
time. UPS collects a lot of data on their trucks in real time: engine temperature,
pressure, kms done, and many other technical measures on the engine and the
drive. When a break down occurs, data on the break down problem and its
resolution is captured as well. The COO wants you to reduce on the road break
downs using the data collected. (Same as in the course)

It is possible with a predictive model. Target = a truck that breaks down. Observation
period: months (but try different model designs and test them). Event (target) period:
1 day, with a latency of 12h (the night).

21) Improving a bank credit scoring: You are responsible to improve short term loans
acceptance policy of your bank. The bank has many records of past acceptance and
refusal for such loans for the last 5 years.

● How would you proceed?

Build a classification model that computes Prob(Target = 1) for each loan applicant.
● How do you define the target?

Target = somebody who defaulted
● Which data do you want to use?
Demographic data (nothing else because of GDPR) as age, area, etc. people in the
same suburb will be in the same vector space. (Model for unknown clients)
Use data available of clients (model for known clients)
● How would you select the best creditors?
Expected value framework: sample (selection bias), model, compute EP for each
customer within each segment and choose all customers that have EP > c.
● Bonus question: how would you compute the profit of a future customer?
By doing a regression analysis to predict it (relation between X and Y).

22) How would you build a model to predict churn in your business knowing that the
model is dynamic and there is not a lot of churns in the number of data?

Predictive model: observation, latency and event periods (the length of each period
depends on the business problem; here, months for the observation period) -> build the
model (look at target occurrences in the event period, search for a discriminating model
in the observation period), then apply it on fresh data.
For rare target events, we might want to accumulate larger target sets by extending
the event period. But as the model is dynamic, we will lose precision.
Points of attention: see above.

23) What is a ROC curve and how is it computed?


A ROC curve is a graph that illustrates performance of a binary classifier system as its
discrimination threshold is varied to produce the classification. The curve is created
by plotting the TPR against the FPR at various threshold settings.
Example: for a medicine that cures cancer, you would move the threshold so as to accept
more false positives and fewer false negatives, in order to cure more people; but if you
are Maggie De Block (i.e. managing costs and insurance), you would prefer to do the
opposite. (A small sketch of how the curve is computed follows below.)
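A minimal sketch of how the curve is obtained, assuming scikit-learn and synthetic data:

```python
# Sketch: computing a ROC curve; each point is the (FPR, TPR) pair for one decision threshold.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]  # probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, scores)  # sweep over all possible thresholds
print("AUC:", round(roc_auc_score(y_test, scores), 3))
print("a few (threshold, FPR, TPR) points:",
      list(zip(thresholds[:3].round(2), fpr[:3].round(2), tpr[:3].round(2))))
```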
24) What are conformed dimensions?

Conformed dimensions are dimensions shared by several fact tables. We need to have them
and to identify them; otherwise, we still have silos of information.

25) Why do we need additive measures in a fact table?

To be able to manipulate them (sum, group by, etc.). Otherwise the fact table is useless:
we just have a lot of lines that cannot be aggregated into measures.

26) Will you have only one answer with regressions, logistic regression and SVM?
Please draw the answer or answers if there are several.

Several answers are possible (all the points on the regression line; for the SVM, the
margin).
● He asked me the formulas of the logistic regression and of the RSS:
RSS = Σi (yi - (B0 + B1xi))²; logistic regression: P(y = 1 | x) = 1 / (1 + e^-(B0 + B1x))
● What is the SVM? A classifier that tries to maximize the margin between the training
data and the classification boundary.

27) Cite 2 ways to build hierarchies in a star schema?

- Explicit attribute sets in the dimension table (ex: cheese -> milk product -> food).

- Snowflake: create a table per level and link them with 1:N relationships. It
normalizes attributes and reduces attribute redundancy (but you then have to join
tables to get the right level of aggregation) -> avoid it unless you are obliged to
(it could be useful, e.g., for holidays that differ between countries).

28) Sweet spot?

Graph: tree size (model complexity) against performance (accuracy), a line for the
training set and another for the test set.
When the size of the tree increases, both accuracies increase, but the training accuracy
is always higher since the model is built on the training data.
After the sweet spot, the training accuracy continues to increase until it reaches 1: the
tree can memorize the entire training set. The test accuracy decreases: the subsets of
data at the leaves get smaller and smaller, and the model generalizes from fewer and
fewer data (error-prone).

● What are the single points you see in the graph?


DT performance (on training or test set) for a given DT size.
● How do you evaluate when to stop?
By doing pre-pruning or post-pruning
● Where would you stop? And if we talk about Occam’s razor
At the sweet spot = the point where the tree starts to overfit, because it acquires
details of the training set that are not characteristic of the population in general.
Occam's razor: between models that give the same predictive performance, always prefer
the less complex one -> here, stop a bit before the sweet spot (performance does not
really increase just before it, but the size of the tree does). A small sketch of this
curve follows below.
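A minimal sketch of this curve, assuming scikit-learn; tree depth is used here as the measure of tree size, and the dataset is synthetic:

```python
# Sketch: locating the sweet spot by comparing training and test accuracy as the tree grows.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for depth in (1, 2, 4, 8, 16, None):  # None = grow until the leaves are pure (memorization)
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    print(f"max_depth={depth}: train acc={tree.score(X_train, y_train):.2f}, "
          f"test acc={tree.score(X_test, y_test):.2f}")

# Training accuracy keeps rising towards 1.0 while test accuracy peaks and then drops:
# the depth where test accuracy peaks is the sweet spot (stop a bit before it, per Occam's razor).
```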

29) What is big data? Example?

Big data is a term applied to datasets whose size is beyond the ability of commonly
used software tools to capture, manage, and process within a tolerable elapsed time.
Big data are non-structured, really big, and come from real-time streams. Example:
web logs at Google.

30) Explain random forest. Explain bagging ensemble method.

Bagging ensemble method: train learners in parallel on different samples of the data,
then combine by voting or by averaging.
RF = a forest of decision trees created by adding randomness to each DT build.
You grow K trees on datasets sampled from the original one with replacement
(bootstrap samples):
- Draw K bootstrap samples of size N.

- Grow each DT, by selecting a random set of m out of p features at each node
and choose the best feature to split on.
- Aggregate the prediction of the trees (voting from different learners) to
produce the final class.
How is diversity ensured among individual trees? Each tree is trained on different data
(bagging), and a random set of features is considered at each node, so corresponding
nodes in different trees cannot use the same feature set to split. (A minimal sketch
follows below.)

31) What is entropy? How to compute it?

Entropy = a state of disorder = an impurity measure. It is computed to select informative
attributes based on information gain, by looking at the disorder of the class labels at
each node (entropy = 0 when all records belong to one class = most informative).
Entropy: H(t) = -Σc p(c|t) * log2 p(c|t), where p(c|t) is the proportion of records of
class c at node t.
Information gain (the reduction in entropy achieved by the split, but it favors
attributes with many values): Gain(split) = H(parent) - Σi (ni/n) * H(child i).
Gain ratio (penalizes attributes that split the examples into many small subsets):
GainRatio(split) = Gain(split) / SplitInfo(split), with
SplitInfo(split) = -Σi (ni/n) * log2(ni/n).
(A small worked example follows below.)
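A small worked example of these formulas (the class counts are invented):

```python
# Sketch: computing entropy and information gain for a candidate split.
from math import log2

def entropy(class_counts):
    """H(t) = -sum over classes of p(c|t) * log2 p(c|t), for the records at node t."""
    total = sum(class_counts)
    return -sum((c / total) * log2(c / total) for c in class_counts if c > 0)

parent = [10, 10]            # 10 positives, 10 negatives -> maximum disorder, H = 1.0
children = [[9, 1], [1, 9]]  # a candidate split that separates the classes quite well

n = sum(parent)
gain = entropy(parent) - sum(sum(child) / n * entropy(child) for child in children)
print("H(parent) =", round(entropy(parent), 3))  # 1.0
print("Gain(split) =", round(gain, 3))           # 1.0 - 0.469 = 0.531
```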
