Data has great value today. Companies such as Google and Facebook extract a lot of value from data.
Examples: 5 billion searches/day (Google), 1 million transactions/hour (Walmart).
But nowadays, the use of data is strictly regulated, through the GDPR for example.
A lot of transactions and data come not from online (internet) activity but from physical stores.
HTML is a declarative language (like SQL) that tells the browser how you
want the page to be displayed. It is difficult to extract data from HTML because you need to
understand the code used.
Big Data is a term applied to data sets whose size is beyond the ability of commonly used
software tools to capture, manage, and process within a tolerable elapsed time. Big data are
non-structured, really big (>1000 Terabytes) and come from real-time streams. Ex: web logs
at Google, Twitter data streams, …
Google developed distributed systems to manage and store big data because the former
systems were not able to.
The big change: “data-fication” of our world (through digitalization). Nowadays, data come
not only from companies anymore but also from consumers (smartphone).
AI & Analytics
Analytics is the “mind set” and processes of making sense out of data, in order to get
non-trivial & actionable business insights to use in daily decisions.
Data Science designates the process of solving business issues using data. It starts from the
business problem formulation and creates a process from scratch to solve it.
Machine Learning is a branch of Artificial Intelligence that builds algorithms able to “learn”
predictive models out of facts. Typical problem: recognition, classification based on labels.
Label examples: look what happened before, understand the signs, … and model it to do
better in the future. For example, based on a set of labelled images, I want to recognize the
known faces in unlabelled images.
Deep Learning designates algorithms efficient to learn to recognize images, sounds and
words. It needs a huge number of observations.
Stories: Telematic data to identify a driver signature, Tesla is the 1st to use AI in real life with
their automated cars.
What is AI?
The ultimate goal of AI is to build robots that are able to do what humans can do (Atlas
robot).
AI (perception flow) ≠ machine learning ≠ big data (flow of unstructured data that you have
to manage but it doesn’t mean that you do AI)
Perception: the step of understanding what is happening (cameras, listening, loading data from the
internet… perceiving things from the internet).
Evolution of AI:
Since 2016 + machine learning: neural networks → it takes a lot of energy to train a neural
network (problem: if AI spreads everywhere, energy will really become a problem).
General AI: would be able to train computers very quickly with only a few examples → we need to
move to that to consume less energy (but we are far from that).
A neural network can be very efficient but never has consciousness or a sense of meaning (no notion
of what a table is, etc.).
Google assistant: a lady is on the phone with it and doesn't know that it is a computer… this needs to be
resolved before we use it (from a legal point of view).
BI is very close to the mindset of analytics; usage of data as analytics but add data
management as an explicit task.
BI designates the processes and technologies that organize data into meaningful business
information used to empower strategic, tactical, and operational decision-making.
It is mainly concerned with data management, data aggregation and organisation, and
efficient insights reporting, including data visualisation.
Data sources
Transactional data (company activity, mostly structured data) is the most important source of
data today; the most powerful one. These data are highly accessible but not necessarily
easily usable.
Each company collects data on individual consumers' activities (on their website, etc., mostly
unstructured data). It is a very important source to get unbiased data (data not influenced by
the company-client relationship). But today it is forbidden to sell them to other companies
without the agreement of consumers. Difficult to access data then because the data is
owned by the app provider.
! Privacy issues today (GDPR) → no major privacy issues for transactional data. GDPR covers
only data related to individuals; it does not concern data related to companies (personnes
morales).
Machine to machine (M2M) activity - Internet of things (mix of structured and unstructured
data formats): put some kind of intelligence in small devices so they can communicate.
Not widely available yet but it will come. There are also major privacy issues.
So what?
AI doesn’t mean big data.
You have benefits above costs only when you do something with it. You must understand
that before doing AI Change the way management is operating = dealing with people
(complicated).
We see that data science supports data-driven decision making. Business decisions are
increasingly made automatically by computer systems.
First, understand data organization (what data we have and how to access them). Then,
understand data and finally their value.
Data-driven decision making refers to the practice of basing decisions on the analysis of data
rather than purely on intuition.
Data-driven marketing – Customer insight: about capturing actionable information on our
customers. It mainly serves for communication purposes: channel, tone of voice, etc. Ex: who
is my customer? what does he buy from me?
Data-driven marketing – Behaviour prediction (goal of first assignment): about using past
data to predict the chances that a specific behaviour will occur given that we can act on the
context of the communication: channel, proposition, tone of voice, moment of
communication, etc. Ex: what will she buy next? What is the best moment to communicate?
Price optimization: use data to adapt prices depending on which type of people you have
around. Flight tickets: adapt prices to consumers.
Data lifetime: a lot of signals are only valid for a short period of time, so turning big data
into value requires them to be processed in real time. Act/react on events when they happen or even
before they happen!
Data science is a process: science + craft (= savoir-faire: acquiring by doing) + creativity (really
need to think out of the box) + common sense
Key – part 3: the result of supervised data mining is a MODEL that, given data, predicts some
quantity.
A classification tree can be translated into SQL-like statements.
Example:
if income < 50k:
    no life insurance
else:
    life insurance
This result/rule can be applied to any customer, and it gives you a prediction.
There are different models: tree/rule (supervised segmentation - SQL), numeric function as
P(x) = f(x) (neural network), etc.
The difference is the type of target variable:
● If the label is categorical, we use classification (multi-class prediction (> 2 categorical
values) or binary prediction). Ex: churn, truck problems = classification → a lot of
management decisions can be mapped to classification problems. We don't use plain
classification models much (even with trees) → in general, we try to compute probabilities.
Classification problem: in general, the target takes on discrete values that are NOT
ordered. The most common case is binary classification, where the target is
either 0 or 1.
3 different solutions to classification:
o Classifier model: the model predicts the discrete value (class) for each
example. There is no information on the likelihood that prediction is true.
Mostly never used.
o Ranking (binary case): the model predicts a score allowing to sort all
examples according to the likelihood of being in one class - from most likely to
least likely.
Used when the cost/benefit is unknown/difficult to calculate or when the
business context determines "how far down the list" to go.
o Probability estimation: the model predicts, for each class, a score between 0
and 1 that is meant to be the probability of being in that class. For mutually
exclusive classes, the predicted probabilities should add up to 1.
Used when the cost/benefit is known relatively precisely (value-based
framework) or when you want to choose among several decisions/models.
You can always rank/classify if you have probabilities.
● If the label is a numerical value, we use regression. But it is quite rare that we use
that because it is difficult to build a good regression (a lot of variance). For example,
nobody has succeeded at predicting the stock market.
! With decision tree algorithms, we must be careful when approximating probabilities like
this: small sample sizes in leaves, local samples (the algorithm does not "look" around). In general, the
probability estimation is quite poor.
Key – part 4: a data-driven model can either be used to predict or to understand (explanatory
modelling can be quite complex)
Ex: the model tells you that the truck has a high probability of breakdown tomorrow → you can
thus use another truck (fully automated model).
Predictive model: data mining example
Terminology
Target = dependent variable = what we need to predict.
Attributes = independent variables = what describes the target.
Data science: there are always thousands of variables (>< statistics: few variables).
You can observe a lot of patterns but not all are models (examples slide 47).
Dimensionality of a dataset = sum of the number of numeric features and the number of
possible values (modalities) of categorical features minus 1 for each one.
Feature types:
● Numeric: anything that has some order (numbers, dates, dimension of 1).
● Categorical: stuff that does not have an order (binary, text, dimension = number of
possible values minus 1).
Model induction, learning, inductive learning, model learning: a process by which a
pattern/model is extracted from factual data.
What is a model? A simplified representation of reality created for a specific purpose
(classification/regression question).
Patterns observable in data are not all good. For example, “age is inversely proportional to
alphabetic order” seems to not be a good pattern.
Generalization
From tables to vector spaces:
Table (denormalised): each line is a unique client, and each column is a characteristic of each
client. We expect that columns are well filled for each client.
Vector space: each client is a vector, and it is possible to compute the Euclidean distance
between each pair of clients.
For supervised learning, we have labels for each client (ex: payment ok/defaulted). These
labels are acquired from data and are the ones we want to predict (for unlabelled ones).
Vector space generalization: extend the pattern → Where do I stop? What is the optimum
partition of the vector space to maximize the likelihood that any prediction will be correct? I
cannot find it by computing them all, so I search in the space of possible
solutions; each machine learning algorithm proposes a search strategy and a criterion to
evaluate any candidate solution.
Generalization is the extension of the decision-making surface beyond the observed points.
It is the assignment of a predicted value where no observations have (yet) been made. All
algorithms, in one way or another, partition the vector space into several segments whose
value is predicted on the basis of past observations. Eventually, this prediction is
accompanied by a confidence estimator, often represented by the probability of occurrence
of the predicted value.
To generalize, you have to try not to include negative points in generalized positive areas.
Sometimes false positives are necessarily included because they are in a space where the
density of positives is already high.
All algorithms put thresholds.
Generalization:
Decision trees: search to cut the vector space (along the x- and y-axes) into hyper-boxes, minimising
the entropy of each resulting segment (stop cutting when there are no errors anymore) VS logistic
regression: searches for a single hyperplane that maximises the likelihood of the probability
distribution (using just a straight line, there are unavoidable errors).
Neural network: could potentially fit any decision space, if trained enough. The prof said
that it is a bad generalization.
When different algorithms are trained on the same data, parts of the vector space will be
linked to the same prediction, but many others will not.
Which algorithm is right? It depends on the bias of the algorithm (orthogonal, linear,
nonlinear) and its search strategy: are they adapted to the true pattern to be found? Fortunately, there
are empirical techniques to evaluate the quality of a model's generalisation.
The choice of algorithms has no correlation with the success of the project (no right
algorithm; depends on bias or search strategy). Defining the target, etc. is more important.
The most important thing in a project is data!
Empirical inference: start from an example (sample) and infer to generalize (and have a
model that predicts on a larger universe). Everything is classifiable.
We never expect 100% accurate results. The models predict on a much larger universe, and
hence will make errors on fresh cases.
Factors influencing the performance: algorithms, expertise.
The importance of variables:
In a 2D space (pca-two, pca-three; PCA = principal component analysis), some points are not
separable. With the 3rd dimension added (pca-one), the points are clearly separable;
perception of distance between points changes completely when you add dimensions. The
choice of dimensions (variables) has a dramatic impact on any solution!!
Some machine learning algorithms include variables selection, some others (like deep
learning) build a new vector space to improve the ability to separate labels. The construction
of the right vector space is one of the most important tasks of a data scientist.
Lift curve:
We don’t want to select behind the 10% because points are less good and random. By
selecting only 10% of the population, I can reach 70% of my targets.
Beginning of the curve: no false positive but there are more after. If we have an Irma model
for a long selection, it should have something wrong.
Irma model: perfect information thus no false positive or false negative (she doesn’t make
errors). Perfect information (ideal line) is never a curve but a straight line that gives the “a
priori proba” (what is known).
Example - Excel file: there are a lot of false positives because of low probabilities. There are also false
negatives (ex: a grandma buys a manga for her son even if she doesn't have the profile to buy
manga). The signal is stronger when the profile is more specific.
When negative points are surrounded by positive ones = false positives → they cannot be
avoided (for good generalization).
If the cumulative curve floats around the random line, it means we don't have the right data.
Ex: we will not find any pattern if we look for a profile that buys bestsellers.
! Problem formulation → You need to segment the problem to find something in your data.
Ex: who buys Renault? No → who buys each model of Renault (Renault Espace, etc.) =
segmentation.
Law of parsimony = Occam's razor:
“The principle states that among competing hypotheses that predict equally well, the one
with the fewest assumptions should be selected. Other, more complicated solutions may
ultimately prove to provide better predictions, but—in the absence of differences in
predictive ability—the fewer assumptions that are made, the better.”
If you have a model with fewer assumptions that is as good as another model with more
assumptions, it is not necessary to add complexity when it brings no value. You
should choose the simplest model.
Confusion matrix for binary class (only for binary targets):
Metrics:
● False positive: we predict it will be positive, but it is actually negative.
● False negative: we predict it will be negative, but it is actually positive.
● Recall (sensitivity): over all true +, the % of + correctly predicted = probability of detection.
E.g., if there are 1000 actual positives and the model correctly detects 900 of them
(100 false negatives) → recall = 90%.
● Specificity >< Recall: over all true -, the % of - correctly predicted.
● Precision: over all + predictions, the % of true +.
● Accuracy: (TP + TN)/n = proportion of correct predictions.
● Base rate/a priori probability: % of true + = (TP + FN)/n
How to increase recall? You can do that, but the precision will decrease → a choice to make.
How to increase precision? You can do that, but it will degrade the recall → a choice to make.
If you build an Irma model, what would be the recall and the precision? Irma says whether he will buy
or not (crystal ball) → perfect information, thus no false positives or false negatives (she doesn't
make errors), thus recall, precision and accuracy = 100%.
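A minimal sketch of how these metrics can be computed with scikit-learn (toy labels made up for illustration, not course data):

# Toy example: confusion-matrix metrics for a binary classifier.
from sklearn.metrics import confusion_matrix, recall_score, precision_score, accuracy_score

y_true = [1, 1, 1, 0, 0, 0, 0, 1, 0, 1]   # actual classes
y_pred = [1, 1, 0, 0, 0, 1, 0, 1, 0, 1]   # predicted classes

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP, FP, FN, TN:", tp, fp, fn, tn)
print("Recall    = TP / (TP + FN):", recall_score(y_true, y_pred))
print("Precision = TP / (TP + FP):", precision_score(y_true, y_pred))
print("Accuracy  = (TP + TN) / n:", accuracy_score(y_true, y_pred))
print("Base rate = (TP + FN) / n:", sum(y_true) / len(y_true))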
Example of confusion matrix:
Cohen's kappa = (Pr(a) − Pr(e)) / (1 − Pr(e)), where Pr(a) is the relative agreement between the prediction and the true pattern and Pr(e)
is the probability of chance agreement, if we assume that predictions are independent of the
true pattern.
If the prediction is perfect, then kappa = 1.
If the prediction agreement with the true pattern is only due to chance, kappa = 0.
Kappa computes a measure of the improvement of the model over random guessing.
Reminder:
For unbalanced class modalities, a classifier might show excellent accuracy while not
learning anything useful! For example, if we want to predict the minority class (yes) but the
model only ever predicts the majority class (all true negatives), the model will have a high accuracy, but we learn nothing
useful. We want to predict positives, and the model only says that all are negative. Be careful
with accuracy.
Confusion matrix & model usage scope:
Metrics based on the confusion matrix give an evaluation of the models as a whole. When a
probability is available, it is quite rare that the whole model classification scope is used: we
might use more complex decision processes only in a subset of the scope of the model.
Scope usage = P(x) > 50%, else other processes are used. This means that a model is used in a
smaller decision space than the original vector space.
An updated confusion matrix could be computed to reflect the performance in the decision
space only! False positives and false negatives outside the usage scope are meaningless...
This can also be applied to other evaluation measures such as AUC, the value-based framework,
etc.
We have a binary model M that returns a score for each instance. Instances are sorted
according to the score (descending), and we compute the distribution of the score within
each class: + and -. The question of interest here is to go from a score to a classification.
Left graph: where do we put the threshold here? 50, because of the classification rule → no error
with that model = Irma model (cannot be found in real life).
Right graph: if I use 50 as a threshold, there are a lot of false positives. If I move the threshold to the right, there are
fewer false positives but more false negatives (specificity increases but sensitivity/recall decreases).
But at 55, I minimize the sum of the two.
Recall and specificity are linked to where you put the threshold on the probability for
the decision.
! We want to have the probabilities and not the classification (otherwise business problem).
A Receiver Operating Characteristic or ROC curve is a graph that illustrates the performance
of a binary classifier system as its discrimination threshold is varied to produce the
classification. The curve is created by plotting the true positive rate (Sensitivity/Recall =
probability of detection) against the false positive rate (Fall-out = probability of false
detection = 1- Specificity) at various threshold settings.
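A hedged sketch of how a ROC curve and its AUC could be computed with scikit-learn; the scores are toy values standing in for the output of a trained model:

# Toy scores; in practice y_score = model.predict_proba(X)[:, 1].
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [0, 0, 1, 1, 0, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.7]

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # one (fall-out, recall) point per threshold
print("AUC:", roc_auc_score(y_true, y_score))
for f, t, th in zip(fpr, tpr, thresholds):
    print(f"threshold={th:.2f}  fall-out={f:.2f}  recall={t:.2f}")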
With most algorithms, it is always possible to build a "lookup table" that just links any past
observation to its class. All algorithms search for a solution and most of them are based on
areas/decision spaces. The more dimensions you have, the more precise your model can be made.
Overfitting occurs when the model classifies seen cases much better than unseen cases:
which means that the generalisation of the model is poor!
For the first and the second model, the 2 classes are not separable in that 2-dimensional
space: some variables are missing to completely account for the phenomenon. The third model
is the most accurate on the training data here. This is a model that tries to avoid a lot of
errors. By doing this, the model generates other errors → bad generalization = overfitting; we
lose precision here by over-searching and because of a too-low representation bias (non-
linearity).
A more complete model means more complexity.
What are the main differences between these two classification problems?
For the first model, the 2 classes are separable, although the pattern is quite complex.
Training a decision tree would require building a very large and deep tree. For the second
one, the 4 classes are not separable. Hence, over-searching would lead to a complex and deep
decision tree. But it is difficult to know whether the depth of the decision tree reveals
overfitting.
In high dimensional spaces, it might be difficult to say if the complexity of a model produces
accuracy or overfitting. Special techniques must be used to spot those problems: holdout
data and cross-validation.
A model is trained on a random sample of the data, and the unseen data is used for
testing/validation.
This is how we decide when to stop in machine learning. If we have enough
data, the curve is well defined. If we don't have enough data to draw the curve, we need another
model. "Sweet spot" = the ideal point.
Cross validation consists in dividing the training data into n folds. A model is built using n-1
fold and tested against the nth one. It is used to measure the average performance of a
modelling process as well as its variance. Systematic drop of performance on test sets means
that the model overfits data and must be re-worked.
I'm training on this, and I test on that → so I learn 5 times and we get a view on the
variability of the resulting curve. For each test set, we check the ratio between 1s
and 0s. If a fold has a lot of 0s, maybe I need to reconsider the test split.
The cumulative gain chart shows the variance of models based on an n-fold x-validation.
X-validation is generally used to tune an algorithm's free parameters in order to build the
final model. For example, x-validation can be used to make a choice among available variables
to be included in a regression, or to decide on the maximum depth of a decision tree.
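A minimal sketch of n-fold cross-validation with scikit-learn, assuming synthetic stand-in data and a logistic regression as the learner:

# 5-fold cross-validation; the spread of the fold scores shows the variance of the process.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)  # stand-in data
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5, scoring="roc_auc")
print("AUC per fold:", scores)
print("mean:", scores.mean(), "std:", scores.std())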
Machine learning
● Supervised: We are given input samples (X) and output samples (y) of a function y =
f(X). We would like to “learn” f, and evaluate it on new data. Types:
o Classification: y is discrete (class labels).
o Regression: y is continuous, e.g., linear regression.
Supervised: Is this image a cat, dog, car, house? How would this user score that
restaurant? Is this email spam?
Supervised algorithms: kNN, linear regression, decision trees, naïve Bayes, logistic
regression, support vector machines, random forests.
● Unsupervised: Given only samples X of the data, we compute a function f such that y
= f(X) is “simpler”.
o Clustering: y is discrete
o Y is continuous: Matrix factorization, Kalman filtering, unsupervised neural
networks.
Unsupervised: Cluster some hand-written digit data into 10 classes. What are the
top 20 topics in Twitter right now? Find and cluster distinct accents of people at
Berkeley.
Find the k nearest neighbours and have them vote. It has a smoothing effect. This is
especially good when there is noise in the class labels. A larger k produces a smoother boundary
and can reduce the impact of class-label noise. To find k, we vary its value and
cross-validate each choice of k. kNN is fast to learn but slow at classification time.
I have a set of data; if I have enough data, I take a test set and I don't touch it. Then, I
use cross-validation to see the variance: the remaining data is divided into 5 folds, I learn on folds
{1,2,3,4} and test on fold 5, then I learn on {2,3,4,5} and test on fold 1, and so on.
What happens when K = N? It no longer makes sense, because we don't look at the closest
neighbours but at all the data in the sample. There is a risk of underfitting (not good on the
training data and also unable to generalize to predict new data).
The simplest model is the best model. We don’t want to add complexity. We search for the
minimum neighbours that offers the minimum error rate.
A small value of K increases the noise (risk of overfitting) but a large value makes it
computationally expensive and may lead to underfitting.
How to choose k? Can we choose k to minimize the mistakes that we make on training
examples (training error)? As the size of k increases, the error on the training set increases,
but decreases on the test set. K can be chosen so as to maximise other measures: recall,
precision, Cohen’s Kappa, AUC, etc.
If I try to have maximum recall, what is the number of neighbours? The one at the level of the red
star. This model is not generalised enough. I lack information on the dataset; there
are some outliers.
Scaling!
Scaling is always important for every machine learning algorithm that you use. Feature scaling
impacts the "similarity" (the distances between points); see the sketch below.
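A hedged sketch (synthetic data) of a pipeline that scales the features and then tunes k by cross-validation, since both scaling and the choice of k drive the distance-based similarity:

# Scale first, then tune the number of neighbours on cross-validated AUC.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([("scale", StandardScaler()),        # scaling changes the distances, hence the neighbours
                 ("knn", KNeighborsClassifier())])
grid = GridSearchCV(pipe, {"knn__n_neighbors": range(1, 31, 2)}, cv=5, scoring="roc_auc")
grid.fit(X_train, y_train)
print("best k:", grid.best_params_, "held-out AUC:", grid.score(X_test, y_test))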
method is a weak approximation because it depends only on the number of neighbours and
does not use the topology of the vector space.
K-NN metrics
● Euclidean Distance: simplest, fast to compute.
● Cosine Distance: good for documents, images, etc. Compute the cosines of the angle
between the two vectors.
● Edit Distance: for strings, especially genetic data. It gives the minimum number of
edits needed to turn one string into the other.
● Mahalanobis Distance: normalized by the sample covariance matrix – unaffected by
coordinate transformations.
You can use other distance metrics, but you have to be careful when choosing one metric
depending on what is the distance relevant to you.
LINEAR REGRESSION
Model = the vector β (beta)
We want to find the best line (linear function y=f(X)) to explain the data.
Assumptions on data
We have a lot of data → test these assumptions.
● The relationship between X and Y is linear.
● Y is distributed normally at each value of X.
● The variance of Y at every value of X is the same (homogeneity of variances =
“homoscedasticity”).
● The observations are independent (no multicollinearity).
● There are no outliers.
Linear correlation
Regression strives to find a line to fit the points.
● If the points are perfectly aligned → perfect relation between x and y (Irma).
● Weaker relation if for an x value, you have a large span of y values (more spread).
● No information if the relation between x and y = circle; for each value of x, you have
all the values of y possible thus no solution in a perfect circle.
● If relation = horizontal line, that means that x has nothing to say about y but we can
build a predictive model because y is a constant.
R-squared:
Let ŷ = Xβ be a predicted value, and ȳ be the sample mean. A linear regression minimises
RSS = Σᵢ (yᵢ − ŷᵢ)². The optimal solution is found analytically (no search → quick).
RSS = 0 if the regression perfectly predicts the observed values, and it is high otherwise.
R² = 1 − RSS/TSS, where RSS/TSS is the fraction of the total variance not explained by the model.
RSS and linearity
We can always fit a linear model to any dataset, but how do we know if there is a real linear
relationship? You can test it and if you have a drop, it is a signal that maybe the regression is
not linear.
Or you can plot the residuals to check the linearity (visualize residuals).
Anything that is on the regression line has a 50/50 probability of being + or -. The further you are from
the line, the higher the probability of being + (above) or - (below) is = a gradient of probabilities.
Scikit-learn: slide 38
LOGISTIC REGRESSION
It applies only to binary classes: 1/0, +/-. Instead of computing the regression of Y on X, we
would like the regression to directly give P(1|x). So, we would like to formulate the
problem like this: compute the likelihood of P(1|x) = β₀ + β·x.
However, the probability function is not linear: going from 0.10 to 0.20 doubles the
probability, but going from 0.80 to 0.90 barely increases it. Also, the odds of being + (P+/P-) are
an exponential function. Hence estimating the probability distribution with a linear
regression will not give a good fit.
However, the logit (ln) of the odds is linear. The formulation becomes: compute the
likelihood of ln( P(1|x) / (1 − P(1|x)) ) = β₀ + β·x.
For a continuous outcome variable Y, the value of Y is given for each value of X. For a binary
outcome variable Y (+, -), the proportion of + cases is given for each value of X.
However, the decision surface is still a linear hyperplane.
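A minimal sketch (synthetic data) showing that a fitted logistic regression returns P(1|x), and that this probability is the sigmoid of the linear logit β₀ + β·x:

# predict_proba gives P(1|x); decision_function gives the linear logit.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
model = LogisticRegression().fit(X, y)

proba = model.predict_proba(X[:3])[:, 1]      # P(1|x) for the first three cases
logit = model.decision_function(X[:3])        # beta0 + beta . x
print(proba)
print(1 / (1 + np.exp(-logit)))               # same values: sigmoid of the logit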
SVM (SUPPORT VECTOR MACHINES)
The idea is that maximizing the margin maximizes the chance that the model will correctly
classify new data.
We try to put the frontier at the place where we maximize the margin because we don't
know what will happen with unseen cases (so it helps to increase generalization).
Sometimes we cannot avoid misclassification. We can only influence the way we want errors
to be treated (different parameters): penalize training points for being on the wrong
side of the decision boundary → balance the size of the margin against the severity of misclassification. Two
loss functions:
● Zero-one: the penalization is uniform: = 1 if misclassified, else = 0.
● Hinge: the penalization is linearly proportional to the distance from the margin defined by the
closest support vectors. The model wants to maximize the distance between positive and negative,
but the further a misclassified point sits on the wrong side of the decision boundary, the higher its
misclassification cost.
SVM vs. logistic regression
Regression solutions can be dramatically influenced by outliers or noise (especially in sparse
vector spaces). SVM reduces this influence by explicitly trying to maximise the margin, hence
preserving the generalisation ability of the solution.
High dimensional feature spaces might induce model instability. Regularization or better,
features selection algorithms are key to avoid model instability, and variance.
Why would regression methods radically change their results?
Residual errors: the distance between predicted points and real data → the regression wants to
optimize (minimize) that.
There may be a lack of information in the vector space (2D) to differentiate one point from the others.
SVM: optimize the margin even if it means accepting some errors (the optimal solution depends
on the two parameters: margin and errors) → best solution for generalization.
In reality, we always want to have a view on recall and precision. Thus, if you have a
multi-class problem, you create 3 binary models and then decide based on the
probabilities (rather than just building one SVM on it).
NON-LINEAR REGRESSIONS
It is possible to use linear algorithms to fit non-linear patterns by using the "kernel
trick". The original input space is transformed (by applying a function such as raising to the
power of 2) into another vector space (the transformed space) that explicitly contains
additional, non-linear variables. Although the classifier is a linear solution in the
transformed feature space, it will be non-linear in the original input space. A kernel can be
applied to the original input space to create a new vector space, or it can be applied "on the
fly" when training or applying a model (SVM in general comes with some of these kernels).
The more polynomial features I add, the easier it will be to find a linear solution without errors. But
if I go too far, I have too many solutions (too much sparsity) and it will be impossible to find
the best solution → you can add features, but not too many.
The new variables can be combinations of original ones (x·y), powers (x²) or any
polynomial function based on the original variables. Log is usually applied to variables
representing money to move from a non-normal to a normal distribution. It has the effect of
creating a non-linear projection in the original input space.
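A hedged sketch of the idea on a toy dataset that is not linearly separable: PolynomialFeatures builds the transformed space (x², x·y, …) and a linear classifier trained on it becomes non-linear in the original space:

# Linear model in the original space vs. in a degree-2 transformed space.
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X, y = make_circles(n_samples=400, noise=0.1, factor=0.4, random_state=0)  # two concentric circles

linear = LogisticRegression().fit(X, y)
poly = make_pipeline(PolynomialFeatures(degree=2), LogisticRegression()).fit(X, y)

print("accuracy in the original space:   ", linear.score(X, y))
print("accuracy in the transformed space:", poly.score(X, y))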
Weight of evidence recoding
Recoding depends on the target you want to predict. You have to be sure that the pattern
will be linear after recoding.
This function is used to recode categorical variables into continuous variables that reflect
the strength of the relationship between modalities and the target (« Good »). It is linearly
correlated to the target.
WoE - Univariate linearisation of input space
Continuous variables: must first be transformed into a number of bins.
Categorical variables: nothing to do.
WOE is computed for each modality and can be normalized.
A new variable with the same number of modalities is created. Modalities are the
{normalised} WoE and no more the original values. A new vector space is built, where each
variable is linearly correlated with the target – and hence where linear separation is easier to
perform.
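A minimal pandas sketch of WoE recoding; the column names ('channel' as the categorical variable, 'good' as the binary target) are hypothetical:

# WoE per modality = ln( share of goods in the modality / share of bads in the modality ).
import numpy as np
import pandas as pd

df = pd.DataFrame({"channel": ["web", "web", "shop", "shop", "shop", "phone", "web", "phone"],
                   "good":    [1,     0,     1,      1,      0,      0,       1,     1]})

stats = df.groupby("channel")["good"].agg(goods="sum", total="count")
stats["bads"] = stats["total"] - stats["goods"]
stats["woe"] = np.log((stats["goods"] / stats["goods"].sum()) /
                      (stats["bads"] / stats["bads"].sum()))
df["channel_woe"] = df["channel"].map(stats["woe"])   # continuous variable replacing the categories
print(stats)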
Regularisation
We know we can overfit because we have too many features. Regularisation is used to
reduce the possible overfit of regression models when there are many variables.
We can add variables using the kernel trick (x², x³, etc.). Then, the dataset no longer has one
single x variable as input but all the kernel features (x², x³, etc.). We can use gradually more
variables by starting with x alone and adding a new variable each time, increasing the
polynomial degree of the expression. By increasing the number of variables, we can see that
the regression fits the data better and better, until reaching a point where there is clearly overfit.
Here we do it for regression, but it is the same for classification.
Regularisation adds a term to the regression function that penalises the magnitude of the
value of the regression coefficients β. Regularisation term = alpha*R(β).
Lasso regularisation:
We get very small values, but we never reach a point where β = 0; otherwise, it would mean
that this feature is removed from the model.
A careful evaluation of the best α is mandatory. The RSS of the model is by no means a good way
of selecting it, because it says nothing about the generalization of the model.
For a given alpha, the Lasso objective is minimized.
As alpha is a parameter, we do a loop over candidate values (e.g., from 0.001 to 0.01) on the
regression objective (RSS + alpha·Σ|β|), test the AUC, then increase alpha (e.g., by 0.005) and redo the
loop, etc. Then, we can decide which value is best.
Only cross-validation, or if much data is available, a single large hold-out validation set can be
used. Which of the measures we have seen is a good measure of the quality of a
model? RSS (computed on the test data) only increases with alpha and says nothing about
quality. AUC (ROC curve) is a good measure (together with recall, etc.).
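A hedged sketch of that alpha search, written for the classification setting (L1-penalised logistic regression, where scikit-learn's C is the inverse of alpha) so that AUC can be used as the selection measure; the data is synthetic:

# Loop over candidate alphas and keep the one with the best cross-validated AUC.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=50, n_informative=5, random_state=0)

for alpha in np.arange(0.001, 0.011, 0.005):                     # small grid of alpha values
    model = LogisticRegression(penalty="l1", C=1.0 / alpha, solver="liblinear")
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(f"alpha={alpha:.3f}  mean AUC={auc:.3f}")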
A regression is optimizing the RSS.
Example:
We have a training set of 20 000 records. Train the data and choose the model. How to
evaluate its variance on top of cross-validation? Bootstrapping on test set: duplicate
samples. AUC will vary a lot for unstable models >< robust models.
What are the parameters to reduce overfitting for a regression model? Pruning variables.
If you have a high variance model, most of the time, it means that you don’t have enough
targets (enough positive data).
Choose the best model according to the measures and the feeling you have on which model
is the best with variables.
Lasso vs. ridge:
In most cases, we prefer Ridge, even when we do not have many features, because
there are always some that are correlated. Why? Because if there is correlation, the β coefficients get
inflated and correlated features compensate each other ("if one goes to the sky, the other goes to the ground").
We need regularisation to keep the β values in check. Use Ridge for the final model: it does not
eliminate correlated features, but it keeps the model well behaved.
Sometimes, Lasso is really good (but it prunes a lot). Lasso picks only one feature among correlated ones (and the rest
are put to 0).
So, in general, we use both models.
Ridge:
● It includes all (or none of the) features in the model. Thus, the major advantage of ridge regression is coefficient shrinkage and reducing model complexity.
● It is majorly used to prevent overfitting. Since it includes all the features, it is not very useful in the case of an exorbitantly high number of features, say in the thousands, as it will pose computational challenges.
● It generally works well even in the presence of highly correlated features, as it will include all of them in the model, but the coefficients will be distributed among them depending on the correlation.
Lasso:
● Along with shrinking coefficients, Lasso performs feature selection as well by moving some coefficients to zero, which is equivalent to excluding features from the model.
● Since it provides sparse solutions, it is generally a good choice when there are many features (thousands or more).
● It arbitrarily selects any one feature among the highly correlated ones and reduces the coefficients of the rest to zero. Also, the chosen variables change randomly with changes in model parameters.
Elastic Net is a compromise between Ridge and Lasso regularisation. Finding the optimal solution now
requires finding both α_R and α_L. This method has been built to overcome some limitations of
the LASSO regularization:
● The fact that it poorly performs when the ratio (number of cases / number of
variables) is very small.
● The fact that it tends to select one single variable from a group of correlated
variables.
Scikit-learn: slide 72
DECISION TREES
= an iterative & divisive (split node in 2 other nodes) algorithm to create piecewise
non-linear classifier over multi-class problem.
Search space: all possible sequences of all possible tests. The size of the search space is
exponential in number of attributes: too big to search exhaustively, exhaustive search
probably would overfit data (too many models).
A node is pure if it contains only one class: with high/medium/low labels, a pure node only says high, or only low, or only medium.
Making a decision on 10 cases is probably not reliable. At the beginning, decisions are well
informed, but as we go deeper and deeper, we start having problems.
Top-down induction of decision trees: recursive partitioning → find the "best" attribute test to
install at the root, split the data on the root test, find the "best" attribute tests to install at each new node,
split the data on the new tests. Repeat until: all nodes are pure, there are no more attributes to test,
or some stopping criteria are met.
Greedy Search: once a node is expanded, no return possible. It does not guarantee that the
optimal tree is produced (in fact often it will not be the case…).
Iterative search splitting effect: after each split, the sample size for making a decision is
divided, quickly reaching small numbers. Hence, the statistical reliability of the decisions
decreases in the same way!
The algorithm is local in the sense that it only « looks » at data points in the current node,
and hence in the current portion of vector space to make any decision.
Decision trees can take more time than a regression.
Growing a DT – How to split records: how to specify the attribute test condition?
How to specify a test condition? It depends on attribute types (nominal, ordinal (integer),
continuous) & on number of ways to split (2-way split, multi-way split).
● Splitting nominal attributes:
o Multi-way split = use as many partitions as distinct values.
o Binary split = divides values into two subsets (requires computing the optimal
partitioning → CPU intensive, but it has the advantage of reducing the data-
splitting effect).
The information gain measures the reduction in entropy achieved because of the
split. Choice: choose the split that achieves highest reduction i.e., that maximizes the
information GAIN.
Problem with information gain: there is a chance of getting purity just by luck. It favours
attributes with many values. Extreme cases: social security numbers, patients' IDs,
dates. If you are alone in your node: entropy = 0 (purity) → it is a bad idea to use IDs, for
example.
Corrected splitting rule: a proxy to cope with that problem.
When a node p is split into k partitions (child nodes), the quality of split is computed
as
Choice: choose the split that achieves lowest resulting Gini i.e., that minimizes the
Gini split.
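A small sketch of the two impurity measures and of the weighted quality of a split (the class counts here are made up):

# Entropy and Gini impurity of a node, and the weighted quality of a candidate split.
import numpy as np

def entropy(counts):
    p = np.array(counts, dtype=float) / sum(counts)
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def gini(counts):
    p = np.array(counts, dtype=float) / sum(counts)
    return 1.0 - (p ** 2).sum()

parent = [50, 50]                      # class counts before the split
children = [[40, 5], [10, 45]]         # class counts in the two child nodes
n = sum(parent)

print("information gain:", entropy(parent) - sum(sum(c) / n * entropy(c) for c in children))  # maximise
print("Gini split:", sum(sum(c) / n * gini(c) for c in children))                              # minimise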
To choose the best splitting rule, we make a decision based on the weighted average of a
criterion, which can be entropy, Gini or the error rate. We have to choose among parameters: we need to
understand them.
There can be differences among decision trees built with the different rules. However,
although the DTs could differ, their classification quality should not show major differences
(AUC, Kappa, ROC). Only tests performed on unseen cases can designate the best criterion.
● Pre-Pruning: stop the algorithm before the tree is fully grown - stop if the number of
instances is less than some user-specified threshold - stop if the maximum depth is
reached (parameter). It can also stop based on some statistical significance test:
o χ2 test – Stop if no attribute splits the data in a way that the class distribution
is significantly improved than without it.
o Gini or Information Gain – Stop if expanding the current node does not
improve impurity measures above a certain threshold.
o Generalization Error – Stop if expected error does not decrease above a
certain threshold (see later for estimation measures).
Example: the XOR/parity problem (really rare in practice) → no individual attribute exhibits any
significant association with the class; the structure is only visible in the fully expanded tree. If
you have strong interactions between features (ex: woman and pregnancy), even if it is
not really a XOR, it will be really difficult for the decision tree to choose which one
to go for.
● Post-Pruning: fully grows the tree (possibly overfitting the data), and then post-prune
the tree from bottom to top.
Steps:
o Grow decision tree to its entirety: fully-grown tree shows all selected
attributes interactions. However, some subtrees might be due to sample
chance.
o Prune the nodes of the decision tree in a bottom-up fashion.
o If, after pruning, the error reduces more than a defined threshold, replace
sub-tree by a terminal node.
Sub-tree replacement bottom-up: consider replacing a tree only after
considering all its subtrees.
o The class label of terminal node is given by the majority class.
In all cases, Occam’s Razor principle is applied: given two models of similar generalization
errors, the simpler model is preferred over the more complex one.
Estimating the error rate (examples slides 102-104): you should certainly do that if you have
enough data.
Two strategies can be used to estimate errors in a tree: error is computed on the training
data (resubstitution error), or error is computed on a holdout validation dataset in order to
better measure the true error rate on unseen cases (“reduced error pruning” = REP or
“best-pruned subtree”).
We don’t want to add complexity without doing something relevant for the problem. You
have a lot of parameters to handle to decide how the pruning will occur.
Minimal cost-complexity pruning:
R(t) = error rate at that node if you prune / R_α(t) = cost-complexity measure / R_α(T_t) = error
rate of the subtree, weighted by the number of cases in each subtree.
Last step of the tree above: 10 – 15 and then 8 – 5 and 2 – 10 → you are not learning anything.
Alpha is too high, but it is a parameter, so let's change it → alpha = 0.06 doesn't seem to be
the best either → test to know whether we need to prune.
To find the best alpha: use a test set to compute R(t) and Rα(Tt) and search for the best alpha
or use cross-validation to compute R(t) and Rα(Tt) as the average over the validation sets.
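A hedged sketch of cost-complexity pruning with scikit-learn's ccp_alpha parameter, selecting alpha on a held-out validation set (synthetic data):

# Try every alpha on the pruning path and keep the tree that scores best on validation data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
best = max(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train) for a in path.ccp_alphas),
    key=lambda tree: tree.score(X_val, y_val),   # alpha chosen on unseen data, not on the training error
)
print("leaves after pruning:", best.get_n_leaves(), "validation accuracy:", best.score(X_val, y_val))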
generalization (optimal model). That is why it is quite unstable (because it doesn't use
distances).
● They are by nature more sensitive to small changes in sampling because, instead of
looking at the geometric topology of the feature space, they just base their decisions
on counting class distributions.
● When simple, they can be understood by human beings.
● No reliable class probability: it is approximated by the class frequency in the leaf node. (Use the
Laplace correction = (nc + 1) / (N + m), where nc = number of cases of class c in the terminal
node, N = number of cases in the terminal node, m = number of possible classes (= 2 for
a binary class)). Probabilities are based on the frequencies in the node (number of
positives vs negatives in the node → not reliable to claim a 20% probability of being negative from 2
negatives vs. 8 positives, because we come from a dataset with thousands of records).
● Are very sensitive to unbalanced classes: if the data set contains too few + cases, a
decision tree could possibly have all leaves labelled with the majority class. Ex:
500.000 negatives, 500 positives → A priori of 0.1%.
● How to solve this problem? Data undersampling: reduce the number of negative
cases and data oversampling: make n copies of the positive cases →500 * 100 =
50.000 → A priori of 10%. → I see a lot of data scientists going to 50% positive.
● Problem? Computed probabilities are wrong and must be corrected... Probabilities
are different from the reality because of data over/undersampling thus you cannot
use them. You have to correct them. How? Keep another dataset apart (but generally
difficult because if you do oversampling it is because you don’t have enough data).
DTs are good for getting insights about the data but bad for using the probabilities and managing things
based on them (thresholds, precision, etc.).
Scikit-learn: slide 113
ENSEMBLE METHODS
They are like crowdsourced machine learning algorithms: take a collection of simple or weak
learners and combine their results to make a single, better learner. They are more suited to
algorithms that produce high variance or prediction instability like DT (less efficient for
logistic regression).
Types of ensemble methods:
● Bagging: train learners in parallel on different samples of the data, then combine by
voting (discrete output) or by averaging (continuous output).
● Stacking: combine model outputs using a second-stage learner like a linear regression.
It means that you create many DTs, take the data and make each DT vote →
use these votes as features and, based on them, build a second-stage learner (like a linear or
logistic regression).
● Boosting: train learners on the filtered output of other learners (learning from the errors
of other models). It determines which instances the DT misclassified and
gives these a second chance. You force the algorithm to focus more and more on
the errors.
Random forests: a forest of decision trees that are created by adding randomness to each DT
built. Grow K trees on datasets sampled from the original dataset with replacement
(bootstrap samples), p = number of features:
● Draw K bootstrap samples of size N.
● Grow each DT, by selecting a random set of m out of p features at each node and
choose the best feature to split on. Typically, m is a parameter with default value m =
√𝑝.
● Aggregate the predictions of the trees (most popular vote) to produce the final class.
Principle of RF: take votes from different learners = look at intersections of many different
decision surfaces in our vector space.
How do Random Forests ensure diversity among the individual trees? Draw K bootstrap samples
of size N: each tree is trained on different data. Grow each DT by selecting a random set of m
out of p features at each node, and choose the best feature to split on. Corresponding nodes
in different trees (usually) can't use the same feature set to split.
RF are probably the most popular classifier for dense data (<= a few thousand features). They
are easy to implement (just train a lot of trees), and parallel computing is easily done.
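A minimal sketch of a random forest with the usual m = √p default, on synthetic data:

# K trees on bootstrap samples, a random subset of features tried at each node.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=30, random_state=0)

rf = RandomForestClassifier(
    n_estimators=200,      # K trees, each grown on a bootstrap sample of size N
    max_features="sqrt",   # m = sqrt(p) features considered at each split
    n_jobs=-1,             # trees are independent, so training parallelises easily
    random_state=0,
)
print("cross-validated AUC:", cross_val_score(rf, X, y, cv=5, scoring="roc_auc").mean())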
Dimensionality reduction
= Selecting some subset of a learning algorithm’s input variables upon which it should focus
attention, while ignoring the rest. It is deeply related to separating relevant from irrelevant
features, and identifying spurious relationships between features and target, due to sample
effects. (Humans/animals do that constantly)!
The required size of the training sample (to achieve the same accuracy) grows exponentially with
the number of variables! In practice, the number of training examples is fixed (so the classifier's
performance usually degrades for a large number of features)! The loss of accuracy is mainly
due to the fact that most of the "events" to predict depend on only a few variables. Most of
the other variables just add irrelevant, correlated or redundant information. The real
problem is to identify the subset of variables for which the relationship with the target is due
to the true underlying law, and not to artefacts of the sample or to marginal contributions to
the prediction.
In real life, there is a lot of correlation and variance increases with cross validation (new
solutions appear).
● Feature reduction: you want p dimensions that are a combination of the d available features (p <<< d)
→ create new features based on the original ones. Ex: reduce image/sound quality →
try to keep the most important signal.
The criterion for feature reduction can be different based on different problem
settings:
o Unsupervised setting: minimize the information loss (don't use the target).
o Supervised setting: maximize the class discrimination (use the target).
● Feature selection: process that chooses an optimal subset of features according to an
objective function. You select several features and remove the irrelevant ones (not
creating new features).
Techniques: decision trees or decision forests.
● Feature reduction VS selection:
Feature reduction: all original features are used; the transformed features are (non-)linear combinations of the original features.
Feature selection: only a subset of the original features is selected, without feature alteration.
Features reduction
Algorithms:
● Linear:
o Unsupervised: LSI, ICA and Principal Component Analysis (PCA: heavily used,
computed very quickly and has nice properties)
o Supervised: LDA, CCA, PLS
● Nonlinear:
o Nonlinear feature reduction using kernels
Eigenvectors: the first vector allows to capture the maximum variance and the second vector
perpendicular to the first one allows to capture most of the remaining variance... Linear
combinations of the original axes (thus difficult to say something individually on a single
feature because eigenvectors mix different features). No correlation problem anymore. Good
case? Need to test.
Non-linear PCA using kernels:
Traditional PCA applies a linear transformation, but linear projections will not detect a non-linear
pattern, so it is not effective for non-linear data → kernel trick: PCA is applied on the transformed
vector space.
Scikit-learn: slide 17
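The course refers to a scikit-learn slide; a hedged stand-in sketch of linear PCA and of a kernel (non-linear) variant on toy data:

# Linear PCA vs. kernel PCA on data with a non-linear structure.
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

X, _ = make_circles(n_samples=400, noise=0.05, factor=0.4, random_state=0)

pca = PCA(n_components=2).fit(X)
print("variance captured by each eigenvector:", pca.explained_variance_ratio_)

kpca = KernelPCA(n_components=2, kernel="rbf", gamma=5.0)   # PCA applied in a transformed vector space
X_kpca = kpca.fit_transform(X)
print("transformed shape:", X_kpca.shape)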
Feature selection
Goal: to find the optimal feature subset (or at least a “good one”).
Methods:
● Filters:
We need a measure for assessing the goodness of a feature subset (a scoring function)
and a strategy to search the space of possible feature subsets. Finding the best
minimal feature set for an arbitrary target concept is NP-hard (= as hard as an
NP problem = Nondeterministic Polynomial time problem) → good heuristics
(approaches to problem solving) are needed.
Key characteristics: usually fast (CPU light), provide a generic selection of features, not tuned
for a given learner. Filtering is often used as a pre-processing step for other methods.
Disadvantage: the feature set is not optimized for the classifier used (linear selection criteria for
nonlinear algorithms and vice-versa)!
Filtering uses variable ranking: given a set of features F, variable ranking is the process
of ordering the features by the value of some scoring function (which usually
measures feature-relevance).
The resulting set is a sorted list of features. The score S(fi) is computed from the
training data. By convention, a high score is indicative for a valuable (relevant)
feature. A simple method for feature selection using variable ranking is to select the k
highest ranked features.
Ranking criteria – linear correlation:
Pearson correlation criterion for numerical attributes:
r measures the goodness of the linear fit between Xi and Y (it can only detect linear
dependencies between a variable and the target). r = 0 could also be a circle.
o Can two variables that are useless by themselves be useful together? YES!
Correlation between individual variables and the target is not enough to assess relevance!
→ Correlation of the target variable with n-tuples of variables could be
considered too! But it is difficult to compute all possible n-way interactions.
It assumes that all variables are categorical! Hence, numerical variables must be
transformed into categories first (for example using discretization techniques).
Probabilities are based on frequency counts. If X and Y are independent, then mutual
information = 0 and X can say nothing about Y.
Mutual information (MI): quantifies the "amount of information" (in bits) obtained
about one random variable through observing the other random variable: knowing X,
what do we know from Y? MI is linked to the concept of entropy.
Intuitively, if entropy H(Y) is a measure of uncertainty about the target variable Y, then
H(Y|X) is a measure of what X does not say about Y. It is "the amount of uncertainty
remaining about Y after X is known”.
Example of MI: suppose X represents the roll of a fair 6-sided die, and Y
represents whether the roll is even (0 if even, 1 if odd). Clearly, the value of Y tells us
something about the value of X and vice versa. That is, these variables share mutual
information.
P(die=6) = 1/6 – P(even) = 1/2 – P(odd) = 1/2
P(die=6 | even) = 1/3 – P(die=6 | odd) = 0
P(even | die=6) = 1 – P(odd | die=6) = 0
Knowing about even/odd result reduces a lot the marginal entropy about the
outcome of rolling the die.
Scikit-learn: slide 35
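A hedged stand-in for the scikit-learn slide: ranking features by their mutual information with the target (scikit-learn's estimator also handles continuous features, so no explicit discretization is shown here); the data is synthetic:

# Rank features by estimated mutual information with the class.
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

X, y = make_classification(n_samples=1000, n_features=8, n_informative=3, random_state=0)

mi = mutual_info_classif(X, y, random_state=0)
for idx, score in sorted(enumerate(mi), key=lambda t: t[1], reverse=True):
    print(f"feature {idx}: MI = {score:.3f}")   # higher = more informative about y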
Another ranking criterion – SVC:
Single Variable Classifiers (not the same as a Support Vector Classifier). Idea: select
variables according to their individual predictive power. Criterion: the performance of a
classifier built with that single variable. The predictive power is usually measured in
terms of error rate (or criteria using false positive rate, false negative rate or AUC).
Also: combination of SVCs using ensemble methods (boosting, ...).
Example: logistic regression on a single feature —> very quick VS on 1000 features,
one by one —> still quick, get the prediction potential of each feature, sort features
according to their AUC, accuracy, ...
● Wrappers:
Filters are useful, but they do not search for interactions between features → test whether you are
going too far.
Learner is seen as a black box. ML algorithm itself is used to evaluate the model
performance as the subset of features is modified (using a defined search strategy).
Cross-validation is used to evaluate the model performance. We need to define:
o How to search the space of all possible variable subsets?
o How to measure the predictive performance of the learner?
Wrapper characteristics:
o The problem of finding the optimal subset is NP-hard!
o A wide range of search strategies can be used. Two different classes of search
strategies: Forward selection (start with empty feature set and add features
at each step) - Backward elimination (start with full feature set and discard
features at each step).
o Generated models are measured using a validation set or by cross-validation
(to make sure that they do not overfit).
o By using the learner as a black box, wrappers are universal and simple!
o Price to pay: highly CPU intensive!
At each step, compute whether the new model is better than the previous one or not.
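A minimal sketch of a wrapper using forward selection with cross-validation; it assumes a recent scikit-learn version that provides SequentialFeatureSelector, and the data is synthetic:

# Forward selection: the learner is treated as a black box and evaluated by cross-validation.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=5,        # stop once 5 features are kept
    direction="forward",           # "backward" would start from the full set instead
    scoring="roc_auc",
    cv=5,
)
selector.fit(X, y)
print(selector.get_support(indices=True))   # indices of the selected features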
3.1 Designing protocols for business decision making (Nothing in the book – the most
important lecture)
Model design
We have different classes/segments separating objects.
It is the "model design" that allows us to specify the type of questions that will be answered by the model:
● To which class belongs this object? → Classification model
● If we take an action, who will react the way we want? → Response model
● Will this object show behaviour X in Y days/weeks/months? → Predictive model
● The sample feature values must be a "good enough" sample of the possible values to ensure good generalization in the "out of sample" universe (otherwise we can expect generalization problems). When you apply a model on fresh data, always check whether you are receiving out-of-sample data: if that happens a lot, it may mean you have to re-train your model.
Discussion: improving a bank credit scoring
You are responsible for improving the short-term loan acceptance policy of your bank. The bank has many records of past acceptances and refusals for such loans over the last 5 years. How would you proceed?
● How do you define the target? Probability of default for a given customer. Target = somebody who defaulted (target = 1) VS somebody who took a loan but did not default (target = 0). ! This concerns only people who took a loan; that is important, because otherwise everybody in the bank would be scored even without a loan. People who took a loan VS no loan (+ vs -).
● Which data do you want to use? Demographic data (age, education, etc.) → people in the same areas (suburb level of aggregation of the data) will be in the same vector space (because you cannot differentiate them, due to GDPR). You can also use the data of your own clients (bank accounts, deposits, investments, etc.; a lot of things are available).
One model for unknown people (only data: the questions we ask, and info based on where you live) and one for known people.
● Is there any bias that affects your sample? The bank has a policy, which means that not everybody is accepted for the loan and thus not everybody enters our scope.
Among the scored people, there are false positives (we said it would be okay, but it is not) and false negatives (we said these people are not good, but they are). What are the possible improvements you can make? In other words, which types of errors can be improved? We cannot improve the false negatives because we never see these people: they were left out by the policy → opportunity cost that creates a bias. We can improve the false positives but not the false negatives (we can improve only one type of error). We can increase the precision of the policy with a classification model (or even change the policy). Advice: drop the policy (or at least lower the threshold so that only the obvious cases are excluded) for a given time → costly (defaults will come in), but it will give you information on the false negatives and show how to improve the policy.
Model design is really important: it is all about how I will set up my data. We don't really care about the algorithm.
Discussion: new shops
E6Mode has 12 shops in the north of Belgium; all clients use a fidelity card allowing them to receive special offers by post. The company wants to open 4 new shops in the south. But where? How do you help them?
● What is the target? Target = 1 for the places where I have a store
● Which data can be used? Suburb (INS9): features = income, etc. Wallonia is not part
of the scope for the training of the model because there are no stores in Wallonia.
● Which points should receive attention? Can we try to bias the model based on
successful stores? Ask the clients which stores work better.
That is the job of the data scientist (rather than having a deep understanding of the algorithm).
Response models
We can reinforce the model with iterations of new responders (used to re-train the model).
! Selecting only customers or prospects that have the "profile" to respond leads to proposing this product only to those customers. Recycling the positive answers into updated models each time will, after a number of cycles, lead to missing changes in behaviour from other types of clients. How do you overcome this issue? Always add a group of people selected randomly to your campaign (in the first selection step: add 50% random, for example).
And what about social networks? Confirmation bias: Facebook uses response models → it sends you random info at the beginning but, depending on the responses you give, a reinforcement loop starts (you will see what you want to see).
Predictive models
To predict something that is in front of us in terms of timeline.
Methodology:
Classification is used to create predictive models by design:
We build a predictive model by looking at event occurrences in the event period (target period). The algorithm searches for a model in the observation period that best discriminates the event to be predicted (target = 1). By applying the model on fresh observation data, it gives a prediction for the future event period.
The length of the observation period depends on the business problem: 3 months for telecom churn but years for bank loans, for example.
The latency period = the time-to-action period. Ex: 2 weeks of latency → I try to predict who will churn in 2 weeks.
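A minimal pandas sketch of this window-based design, with hypothetical table and column names and illustrative dates; the point is that features come only from the observation period and the target only from the event period after the latency gap:

import pandas as pd

# Illustrative periods: 3 months of observation, 2 weeks of latency, then the event period.
obs_start, obs_end = pd.Timestamp("2023-04-01"), pd.Timestamp("2023-06-30")
latency_end        = obs_end + pd.Timedelta(weeks=2)
event_end          = pd.Timestamp("2023-08-15")

usage = pd.DataFrame({                      # hypothetical behavioural data
    "customer_id": [1, 1, 2, 2, 3],
    "date": pd.to_datetime(["2023-05-10", "2023-07-20", "2023-04-02", "2023-06-15", "2023-05-05"]),
    "minutes": [120, 30, 40, 55, 300],
})
churn_events = pd.DataFrame({               # hypothetical churn events
    "customer_id": [2],
    "churn_date": pd.to_datetime(["2023-08-01"]),
})

# Features are computed ONLY from the observation period...
obs = usage[(usage["date"] >= obs_start) & (usage["date"] <= obs_end)]
features = obs.groupby("customer_id")["minutes"].agg(["sum", "count"])

# ...and the target ONLY from the event period, after the latency gap.
in_event = churn_events[(churn_events["churn_date"] > latency_end) &
                        (churn_events["churn_date"] <= event_end)]
features["target"] = features.index.isin(in_event["customer_id"]).astype(int)
print(features)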
Points of attention:
● How can you measure the event you want to capture? This is a very important
question to ask yourself. Think about churn for a Telco? Model design!!
● The target should not be related to some business process bias (previous campaign selections with strong biases: "we targeted only some regions") → be aware of bias.
● Be careful with variables computed in the observation period that are artifacts of the
target itself! Think about “Income” for credit risk estimation. ! Where are the
artifacts? Otherwise, you find “perfect” models, but it is nonsense.
● Variables must be stable in time: the way variables are computed must not change between the Build and Apply periods. Think about a "country segments" variable changing.
● Latency is the time-to-action period, it can also be used to make sure to remove
artifacts.
● The Apply data must always be computed in the same way as the build data! No
difference between the observation and the fresh observation periods.
● Be careful with seasonal event behaviour: predicting the probability of a client buying gardening products. Example: look at the product sales distribution: you see a peak from early April until the end of June, but the product is sold all year long. Think: what will be the effect on the model design?
Exercise:
You work for a food retailer. You want to assign a long-term value class to any new client 5
weeks after the client acquisition (High, Medium, Low). Define:
For rare target events, we might want to accumulate larger target sets by extending the event period. Problem: if the pattern is highly dynamic, we lose precision for events at the end of the event period → "dynamic model design".
Points of attention: if targets are not distributed evenly in each event period, the algorithm could spot variables that give information on the time slice used (e.g., the second slice event is the "Auto Salon" and the event is "Auto Insurance"). → In that case the model will encode an artifact linked to the model design and not to the real predictive pattern.
Points of attention:
● Predictive models can be computed on the whole customer population or on a
sample used to compute response models.
o Proba(Target = 1) → Simple predictive model
o Proba(Target = 1 | targeted) → Response model
● In fact, all response models are predictive by nature (a campaign has been done
previously) but not all predictive models are response models.
Think: is a churn model predictive? Is it a response model? → It is a predictive model (but not a response model).
In this setting, some variables can highly damage your model performance when you move from one observation period to another, due to spurious temporal relationships (which do not repeat themselves), or due to a hidden indirect relationship between the variable and the target. Ex: prediction of charcoal buying. Period 1 had fantastic weather: sun and 30°C. People having a garden are over-represented in the target. In the Apply period the weather is heavily cloudy. People having a garden have more money, and when it is heavily cloudy they go out on city trips. No charcoal at all! Owning a garden has the inverse effect from what is expected (you are missing the weather variable!).
You oversample. You build a model, and you want to have an idea of its accuracy. How?
Oversampling: the model seems to learn something (fewer errors), but if you also evaluate on oversampled data, the generalization estimate you get is nonsense.
Always use the natural density for the test set, because unseen cases come from the true population (not from the oversampling).
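A minimal sketch of the right way to do this: oversample the minority class in the training set only, and evaluate on a test set kept at its natural density (synthetic data, illustrative oversampling ratio):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Oversample positives in the TRAINING set by resampling them with replacement.
pos = np.where(y_tr == 1)[0]
extra = np.random.default_rng(0).choice(pos, size=4 * len(pos), replace=True)
X_bal = np.vstack([X_tr, X_tr[extra]])
y_bal = np.concatenate([y_tr, y_tr[extra]])

model = RandomForestClassifier(random_state=0).fit(X_bal, y_bal)
# The test set keeps the natural class density: the score below reflects the true population.
print(roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))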
1) Launching a campaign
2.52€ = expected benefit per person reached by the campaign. 130 persons will be
selected for the campaign (predicted yes).
Confusion matrix: if you overfit, you cannot expect to have the same performance on fresh data. But use the confusion matrix to compare with a similar perfume or with small samples (launch a small campaign to gather data).
Campaign 1: number of people selected = 1140 (predicted yes), profit/person = 2.52€, ROI
= 25175/877 = 28.7.
Campaign 2: number of people selected = 4737 (predicted yes), profit/person = 3.04 €,
ROI = 30351/4386 = 6.92.
Is the second model better than the first one? Yes, if the CEO wants a maximum cash
whatever the investment. Otherwise, the first model is better (better ROI). The business
has the power to make the decision (not the data scientist).
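A tiny sketch reproducing the comparison above from the quoted totals (the profit and cost figures come from the example; the point is the two possible decision rules):

# Compare the two campaigns on absolute profit vs. ROI; the business picks the criterion.
campaigns = {
    "campaign 1": {"selected": 1140, "profit": 25175, "cost": 877},
    "campaign 2": {"selected": 4737, "profit": 30351, "cost": 4386},
}
for name, c in campaigns.items():
    roi = c["profit"] / c["cost"]
    print(f"{name}: selected = {c['selected']}, total profit = {c['profit']}€, ROI = {roi:.1f}")
# Campaign 2 yields more absolute profit, campaign 1 a better ROI: the business decides.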
With these new campaign settings, the first model generates a profit whilst the second one loses money...
! Data is the most important asset for us; the algorithm is really the last point. Then, model design is important because it is the way you build the process to answer a problem. If your model is not good enough, the only thing you can do is search for new data.
Reformulation: an example
What if we do not have enough past respondents on this product? How can we compute LTV(x)? Think!
● Buy the information: run the campaign without LTV and wait to get enough
conversions to build the model.
● Approximate the LTV on all customers having the product, without looking specifically at respondent customers.
● If the product is new and hence no customer has it, approximate the LTV on a similar product (branch 21 for example), for respondents or not.
● If none of above is possible, make an approximation based on business knowledge:
Private segment → Value = 1500EUR - Mass Affluent → Value = 800EUR - Retail+ →
Value = 150EUR.
There is a good chance that the different segments will react differently to a solicitation. Hence, it might be interesting to check whether we could improve the prediction models by segmenting them on these populations...
Final protocol to solve the case according to our objective:
1. Build a model (from a random sample) for computing probability to respond positively,
given the customer profile:
● Test the following designs:
o Build a general model for probability to respond positively, given the customer
profile.
o Build 3 segmented models: Private Customers, Mass Affluent, Retail+ (better).
● Compare the 2 model designs and choose the best one (prefer global if segmented
models do not bring more than 5% expected net earnings for the campaign).
2. Build a model to approximate the LTV of respondents, which can be different from the LTV
of all customers.
● Use all customers for whom we have at least 5 years of observation in the period.
● Compute LTV by Net Present Value of all benefits of this population discounted at 5%
interest rate (already a heavy job! Each starting date is different for each customer...).
● Compute LTV per segment Private, Mass Affluent, and Retail+.
3. Compute the EP for each customer within each segment.
4. Select, in each segment, those for which EP > c (cost of acquisition).
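A minimal sketch of steps 3 and 4, using the per-segment LTV approximations mentioned earlier (1500 / 800 / 150 EUR); the customer response probabilities and the acquisition cost c are hypothetical:

# EP(x) = P(respond | x) * LTV(segment); select customers whose EP exceeds the cost c.
import pandas as pd

customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "segment": ["Private", "Mass Affluent", "Retail+", "Retail+"],
    "p_respond": [0.04, 0.10, 0.30, 0.02],        # output of the response model
})
ltv_per_segment = {"Private": 1500, "Mass Affluent": 800, "Retail+": 150}
cost_of_acquisition = 20                           # hypothetical cost c per contact

customers["EP"] = customers["p_respond"] * customers["segment"].map(ltv_per_segment)
selected = customers[customers["EP"] > cost_of_acquisition]
print(selected[["customer_id", "segment", "EP"]])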
What could be wrong with this approach, given that this protocol should be applied every 3
months on fresh data?
The best prospects for the campaign today will also be the best ones in 3 months... → add a business rule: no one can be selected twice within a period of 6 months.
What if the CEO plans for this year to get at least 15% EBITA?
You should ensure that any 1€ of cost will generate 1.15€ in return, and hence the selection rule becomes P(p|x) · LTV_p(x) > 1.15 · c.
What if there are other products to cross-sell to your customers that have even greater LTV?
You should understand if the best populations for those other products are the same as for
the B23? In fact, you need to set a global framework to balance the choices and to schedule
the campaigns. It’s a business issue...
What if the CMO plans for this year to increase the penetration ratio of capitalisation
products from 7% to 9% “at all costs” (his bonus depends on it!)?
How much are we prepared to invest (negative expected value) on all campaigns for this
product to increase its penetration rate? To answer this question, you need to look at how many customers must be converted to reach that 9% and balance it with the total cost of acquisition...
Number of clients = 1,000,000 – Actual clients of B23 = 70,000 (7%) – Number of clients to convert to reach 9%: 20,000 – Process: sort the scored file of clients without B23 in descending order and choose from the top.
→ Note: importance of having an unbiased P. Think: what if the SUM of P over all clients is < 20,000? Maybe you should schedule 12 smaller campaigns and tune each one based on the learnings of the previous ones...
Conclusion
● Think carefully about the business question.
● Include a global vision in your plan: a model is supposed to go in production and run
frequently. Thinking on how it should be used on the long term is key (do not spam
your clients!).
● Rely on the strategy of the company to understand how to use your models: financial
objectives, market penetration, societal effects, etc.
● Always think first about how Data Science can help achieve the strategy of the company, and only after that look at tactical improvements.
● Be aware that predictive modelling is just one tool to be used to improve some aspects of the company's processes. Other disciplines, like product excellence & operational excellence, must be improved as well to maximize the impact of the data science contribution.
● Second path: not more complex, but not correct, because the salesperson and the ordering person are not the same → the employeeID in sale and the employeeID in animalOrder are not the same.
● Why is this not correct? 2 problems: the Listprice is used instead of the saleprice to compute the margin & the ShippingCost is for the delivery of several animals → if the animals belong to different breeds, then I count the shippingCost twice.
● Possible final answer:
Transactional DBs are difficult to read when databases are large, because of the effort needed to understand the database and the business processes. It can be a real nightmare.
● I can’t find the data I need: data are scattered over many systems - many versions,
subtle differences.
● I can’t get the data I need: I do not understand ER: which tables to join and how to do
that? - I do not find any historical data to see the trends...
● I can’t understand the data I found: field names are not explicit: semantic? - Available
data poorly documented
● I can’t use the data I found: results are unexpected: there are so many errors! -
Depending on the source, I get different answers!? - Data needs to be transformed:
generalization, derived values.
● My query is refused by the computer: it takes ages to compute: how should I optimize
the query?
● Joins are very costly for the database (they decrease performance).
Data science: 80% of the time is spent coping with the data to make it usable (business modelling).
We want to create a copy of the operational data that will contain highly consistent, qualitative historical information. Data store: centralises the info extracted from the different departments.
Metadata = documentation (the info that we have, the meaning of tables, fields, etc.).
● Challenge 1 – Data diversity: issue= how to integrate all these sources for decision
making? Solution = DWH: integrated, business oriented, accessible, query optimised,
stable & consistent.
● Challenge 2 – Data volume: issue = how to extract relevant info from our data?
Solution = throw data away without using it (when disk full for instance), query and
OLAP tools, data science: data mining, text mining, network mining.
OLTP VS DWH
OLTP = OnLine Transaction Processing; DWH = Data WareHouse
● Application oriented VS subject oriented (focused: don't keep useless things, no "grandma's attic")
● Used to run the business VS used to analyse the business
● Clerical user (office worker) VS manager/analyst
● Detailed data VS details to be aggregated
● Current, up-to-date data VS snapshot data
● Isolated data VS integrated data
● Standard repeated small transactions VS ad-hoc access using large queries
● Read/update access VS mostly read access (batch update)
● No time stamps necessary VS historical data is a must
● Build a comprehensive data dictionary for business users: understand what you look
at...
● Let the user define and implement his needs: use your query or OLAP tools.
Details = granularity. It takes years to build a data warehouse, but once you have it, reaching a higher level of aggregation is easy. Analysis starts from the top (use cases) but building starts from the bottom. You need to clean the data before integrating it into the DWH.
A data mart is a simple form of data warehouse focused on a single subject or line of
business. With a data mart, teams can access data and gain insights faster, because they
don't have to spend time searching within a more complex data warehouse or manually
aggregating data from different sources.
Characteristics of DWH
W. H. Inmon and Ralph Kimball are the fathers of the DWH approach.
A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection
of data in support of management’s decision-making process.
● Subject-oriented:
o Organized around major subjects, such as customer, product, sales.
o Focusing on the modelling and analysis of data for decision makers, not on
daily operations or transaction processing.
o Provide a simple and concise view around particular subject issues by
excluding data that are not useful in the decision support process.
o Ex: the sales process (full costs also include logistics costs), the logistics process (customer distance from the central warehouse).
● Integrated: data in a DWH is ALWAYS integrated.
o Constructed by integrating multiple, heterogeneous data sources: relational
db, flat files, on-line transaction records.
o Data cleaning and data integration techniques are applied: ensure consistency
in naming conventions, encoding structures, attribute measures, etc. among
different data sources. E.g., hotel price: currency, tax, breakfast covered, etc.
o When data is moved to the warehouse, it is converted
- Missing data: decision support requires historical data which
operational DBs do not typically maintain.
- Data consolidation: decision support requires consolidation
(aggregation, summarization) of data from heterogeneous sources.
- Data quality: different sources typically use inconsistent data
representations, codes and formats which have to be reconciled.
● Time-variant: there is always a time element in everything that we load in a
warehouse.
o The time horizon of a DWH is significantly longer than what we would expect to see in operational systems. Operational DB: current-value data. DWH: provides info from a historical perspective → in most cases, 5 to 10 years of data.
o Every key structure in the data warehouse contains an element of time,
mostly explicitly. Operational data may or may not contain “time element”.
● Non-volatile:
o A physically separate store of data from the operational environment.
o Operational update of data does not occur in the data warehouse
environment. It does not require transaction processing, recovery, and
concurrency control mechanisms. It requires only two operations in data
accessing: initial loading of data and access of data. A data warehouse is
NEVER updated, it is just loaded (SELECT).
o A query done 3 months ago will return the exact same value if it is run today.
+ 4 levels of data in DWH: old detail, current detail, lightly summarized data (= data distilled
from current detail data. It is summarized according to some unit of time and always resides
on disk) and highly summarized data (=data distilled from lightly summarized data. It is
always compact and easily accessible and resides on disk). Metadata (= the data providing
information about one or more aspects of the data; it is used to summarize basic information
about data that can make tracking and working with specific data easier) is also an important
part of the DWH environment.
Direct queries = SQL: allows access to any DB you want. Reporting tools: allow us to define what is measured and the nature of the measure depending on the context.
DWH size: quite big (billions of lines); you will never have that in an operational DB. Typical transactional DB = 100 GB vs. typical DWH = 100 TB.
OLAP
It is the idea of using multi-dimensional models with hierarchies. It is the fast and interactive
answer to large, aggregated queries.
“I have sold for 1 thousand” Are you happy with that or do you have questions? What did
you sell? To whom? When? (dimensions).
Measure = what is additive (cost, number of clients, sales, etc.).
Dimension: gives the context of the measure. Dimensions have natural hierarchies. E.g., for
the region: country, city, office. You could see that as a cube with 3 dimensions: region,
product, month for example.
Star query model: you can answer any question with this star key map.
Customer has no granularity here. Could it have some? Yes, at the household level for example. You could create a segmentation of customers, but that = a slice, not a hierarchy.
Highest granularity level for time here: daily → all the points closest to the middle.
When you want to design a DWH, the first thing to do is to meet all the people involved and write those star schemas; that will be the basic info you use to start modelling the warehouse.
OLAP characteristics:
● Navigation tool: you typically explore the data to look at some hypotheses that you
have in mind. Excel can play this role today (pivot tables).
● Hypothesis-driven search: OLAP is dedicated to validating hypotheses, for example, factors affecting defaulters. Completely different from what we do in data science. Ex: is the defaulting rate related to age?
● Need interactive responses to aggregate queries while exploring: it must be fast, which limits the number of dimensions that can be used.
OLAP Server Architectures:
● Relational OLAP (ROLAP):
o Use relational or extended-relational DBMS to store and manage warehouse
data and OLAP middle ware to support missing pieces.
o Include optimization of DBMS backend, implementation of aggregation
navigation logic, and additional tools and services.
o Greater scalability.
● Multidimensional OLAP (MOLAP)
o Array-based multidimensional storage engine (sparse matrix techniques).
o Fast indexing to pre-computed summarized data.
2 types of tables:
● Central table = fact table that contains factual information, i.e., measures that we
want to have (to be able to manipulate these ones)
● Other tables = dimension tables (they do not take a lot of space: tables with 50,000 lines are just tiny tables, "tiny Mickey Mouse tables").
Surrogate keys = technical keys (different from the keys in the operational DB); all together these keys constitute the PK of each record in the fact table.
When you do a sum in the select, you need to do a group by. Example of query: SELECT
sum(sales), dayofweek FROM… GROUP BY dayofweek.
You should only do that if you have some info that is updated independently from the rest → extract a table with this info and keep it apart.
Star schema
= single way to represent everything we need.
Modelling dates:
Fact tables contain time-period data → date dimensions are important.
There are M2M relationships between sales and product and between promotion and product → bridge table: sales-line (coding an event between 2 entities in the system).
What if the date (= time dimension) is missing? You could use the downloading date; if you
download the data every night, you can add the date of the day to all new transactions.
The name of the key in the sales_fact table does not matter (clerk_key shows the link with the
employee table). In general, we have the operational keys in the dimensions. But we are
not using them to connect to the fact table. Why do we have them then? Because, at some
moment in time, we will have to update dimensions because of changes issued from the
operational DB.
The design process
1. Choosing the business subject (data mart):
o A specific set of business questions (belongs to a BL): finance, marketing, etc.
o A set of related fact and dimension tables: costs, sales, clients, etc.
o Single source or multiple source: ERP, legacy, Excel, etc.
o Conformed dimensions: dimensions shared by some facts of the
constellation. Identifying the conformed dimensions is the key. A dimension is conformed only if the PK of the table is the same PK as in the other table connected to the fact table; the dimension is then shared by 2 fact tables. We need to have these conformed dimensions, otherwise we still have silos of information.
o Typically have a fact table for each process (joins will be done through
conformed dimensions) → a data mart will contain several fact tables. Ex:
campaign selection.
2. Choosing the grain (unit of analysis): the granularity determines the level of detail of each fact record. It is better to focus on the smallest grain in general: moving from a high granularity to a lower one is always possible (it is just a group by; see the roll-up sketch after this list), but the opposite is not possible.
3. Choosing the dimensions: a dimension table is a table connected to the fact table
with a surrogate FK (and not operational keys). Dimension tables contain text or
numeric attributes that give the context of the measures available in a fact table
(time, shop, client, etc.). Dimension’s attributes will be the source of query
constraints. Surrogate keys will be used to store the history of the dimensions. They
are the central point of the whole dimension architecture. In general, dimensions are
1ToM relationships with fact tables, but it is not always the case. Then, a bridge table
is built. With a bridge table, a single record of the fact contains a unique key, but it
links to many records of the bridge table and the primary key of the bridge table is
the two keys of the tables joined with it. However, only one of the keys is used to join
with the fact table.
(Example slides 30-31).
4. Choosing the facts: the facts represent a process or reporting environment
interesting for some business users. The fact tables contain only measures (fact
attributes) and a primary key composed of all foreign keys connecting to dimensions.
It is important to determine exactly what it represents. Typically, it corresponds to an
associative entity (bridge table) in the ER model (→ measures). It tends to have huge
numbers of records (the max number of records is the combination of all records of
each attribute (cartesian product)).
5. Defining the measures: measures are the measurements associated with fact records
at the fact table granularity. Normally, they are numeric and additive. Measures that are not additive on the time dimension but additive along all other dimensions are semi-additive. Finally, there are non-additive measures ("value-per-unit" measures). Ex: exchange rate. Attributes in dimension tables are constants. Fact attributes (the measures) vary with the granularity of the fact table: when aggregations are done, they should smoothly aggregate as well.
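As referenced in step 2 above, a minimal pandas sketch showing that rolling a day-grain fact table up to a coarser month grain is just a group-by (hypothetical fact table):

import pandas as pd

fact_daily = pd.DataFrame({
    "date":      pd.to_datetime(["2023-01-03", "2023-01-17", "2023-02-05", "2023-02-06"]),
    "store_key": [1, 1, 1, 2],
    "sales":     [100.0, 50.0, 75.0, 30.0],
})
# Day grain -> month grain: drop the day detail and sum the additive measure.
fact_monthly = (fact_daily
                .assign(month=fact_daily["date"].dt.to_period("M"))
                .groupby(["month", "store_key"], as_index=False)["sales"].sum())
print(fact_monthly)
# The opposite direction (month -> day) is impossible: the detail is gone.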
Good dimension attributes:
● Clear & precise: gives a rich context to the measure, use crystal clear naming and
semantics.
● Complete: all attributes within a dimension completely describe the entity
represented by the dimension. In case of conformed dimensions, completeness can
be dictated by many business processes: color is important for marketing but not for
supply.
● Quality assured: no trash values like 99 for AGE... NAME is standardized and has no duplicates.
● Equally available: attribute must be available for most of the objects it represents –
‘age is available for only 1% of our clients’.
● Documented: from a business point: semantic and point in time (‘risk profile attribute
of clients is updated every night and is computed based on ...’). From a technical
point (‘risk profile is loaded from Table X with process Y, etc.’).
It is stupid to use it if you are not obliged to. We don't want to do it because we would just repeat things. It could be necessary for holidays for an international business, for example (dates differ across countries).
● Type 2: create a new dimension record for each new value of the attribute. This creates a new surrogate key to be used, from that moment on, in all new records of the fact table, building the historical context of the measures (see the sketch after this list).
● Type 3: create an attribute in the dimension record for previous value. This will not
build a history at the level of the facts, but only in dimension.
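A minimal pandas sketch of the Type 2 mechanism (hypothetical dimension columns): a changed attribute value closes the current record and creates a new one with a new surrogate key.

import pandas as pd

dim = pd.DataFrame({
    "customer_key": [10],                 # surrogate key used by the fact table
    "customer_id":  ["C001"],             # business (operational) key
    "segment":      ["Retail+"],
    "date_in":      [pd.Timestamp("2020-01-01")],
    "date_out":     [pd.NaT],             # NaT = still the current record
})

def scd_type2_update(dim, customer_id, new_segment, today):
    current = dim[(dim["customer_id"] == customer_id) & (dim["date_out"].isna())]
    if current.empty or current.iloc[0]["segment"] == new_segment:
        return dim                                                       # unchanged: do nothing
    dim.loc[current.index, "date_out"] = today - pd.Timedelta(days=1)    # close current record
    new_row = {"customer_key": dim["customer_key"].max() + 1,            # new surrogate key
               "customer_id": customer_id, "segment": new_segment,
               "date_in": today, "date_out": pd.NaT}
    return pd.concat([dim, pd.DataFrame([new_row])], ignore_index=True)

dim = scd_type2_update(dim, "C001", "Mass Affluent", pd.Timestamp("2023-06-01"))
print(dim)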
closed (date-in/out) even before it is used in the fact table. A factless fact table, containing no measures, can be used to record the link between a client and Geo. It is triggered by a change of address.
Entities (table):
● Fact table
Credit amount, TAEG can do that but in general we will not do that because maybe
the information has changed between the selection process and the answer of the
customer.
● Factless factable: with as many lines as customers
If selection process and we wait for the return in general, 2 fact tables (1 factless)
2 separate schemas with conformed dimensions to link the two.
● Time dimension (granularity: day): time-key, date (business key), dayofweek, etc.
The date will always stay the same, but we could have a problem with bank holidays, for example: 01/05 is a bank holiday in one country but not in another. Thus, we also need a time-key (surrogate key).
● POS dimension: POS-key, zip, etc.
● Customer dimension: customer-key
Do we want to put a portfolio string (with 1, 0, 1, 0, 0, etc.)? Not the best solution (queries are a bit harder). If the portfolio changes, it needs frequent updates → the string will often change and multiply the records in the customer dimension.
● Portfolio dimension: customer, hypothèque (mortgage), etc. → bridge table between the products and the fact table.
● Geo info dimension: kept apart from customers for the same reason as the portfolio.
● Segmentation-of-client dimension: idem (we don't want to inflate the customer dimension).
Problems with dimensional models: not adaptive to new relationships discovered later (it implies knowing everything now).
The processes:
Static extract = capturing a snapshot of the whole source data at a point in time.
Incremental extract = capturing changes in OLTP that have occurred since the last extract.
Fixing errors: misspellings, erroneous dates, incorrect field usage, mismatched addresses,
missing data, duplicate data (happens a lot as soon as data are entered manually),
inconsistencies.
Also: decoding, reformatting, time stamping (can add a time stamp with the day of the load
(if loaded daily) for example), conversion, key generation, merging, error detection/logging,
locating missing data.
A lot of reference (decoding) tables = a lot of work
Data is never physically altered or deleted once it has been added to the store.
The 4 steps in loading the DWH from the staging area:
1. Update reference tables: make sure all reference tables are updated before loading
dimensions (what if Promotion 10 “Win a Bongo WE” is not there...). Each Reference
table MUST have an “owner” to make sure it is properly managed.
2. Update all dimensions: prepare each dimension record as a vector built from the copy of the operational data and compare it with the dimension → if the record is unchanged, do nothing; else close the current record (date-out = today - 1) and use the current vector to create a new one (with date-in = today). Before the update, many cleaning steps are done.
3. Update lower-level facts (with the highest granularity): same as for dimensions, a
vector is created from the operational data. It is loaded based on date of operation or
extraction. Rollback: capacity of de-loading erroneous data of a certain day.
4. Update highest level datamarts: all other data marts are updated by simple
aggregation based on removing some dimensions, or removing some attributes part
of a hierarchy (generalized dimensions).
2. Dependent Data Mart and Operational Data Store (the right architecture):
Example – Bank:
3 data paradigms
2. Data in DWH
● Data is structured
● Dimensional modelling
● Creates the memory of our companies
● Done to support business decisions (select statements and not updates)
● Extremely easy to use (query tools friendly)
● Tools: Teradata, Microsoft SQL DWH, Amazon Redshift, Apache HBase (open source)
Characteristics of systems:
● Non distributed RDBMS (Relational DataBase Management System): ACID
transactions
o Atomicity: if an update cannot be completed, the engine rolls everything back → either it does the whole process, or it does nothing (you never stay in between).
o Consistency: only valid data are written.
o Isolation: one operation at a time → operations always run in sequence (no synchronous update conflicts).
o Durability: once committed, it stays that way (persistence).
● Distributed data systems: CAP theorem
o Consistency: all clusters have the same copies of data (can take some time).
o Availability: clusters always accept reads and writes.
o Partition tolerance: at any moment in time, you guarantee that the system
will work even when there are some network failures.
Distributed systems have some constraints.
Note that:
● CAP theorem is related to distributed implementations where data and computation
processes are done in parallel in an asynchronous way.
● ACID is implemented using sequential operations, possible with some parallelisation
(multithreading) using the same shared memory, in a synchronised way.
● Non distributed system = all the parts of the system are in the same physical location
vs. distributed system = parts of the system exist in separate locations.
What & where?
Distributed systems are not there for operations (because you cannot have one store showing different stock levels than another store just because updates have not yet propagated) but only for management and decisions.
Data governance
Data governance is a collection of practices and processes which help to ensure the formal
management of data within an organization. It deals with:
● Compliance, security and privacy: GDPR compliance, security of the data and access
to it, including anonymization of personal data.
● Integrity & quality: ensuring that data is correct and unambiguous, definition and
implementation of a Master Data Management (MDM) process.
● Availability & usability: easy access of the data for all business needs, in a format that
is usable for all.
● Roles and responsibilities.
● Overall management of the internal and external data flows within an organization.
You need a team dedicated to data governance to make sure that people have the right access rights to the right data and that data transactions are done properly.
Only the top tier of the pyramid is fully governed. We refer to this as the trusted tier of the big data warehouse.
Big data warehouse: data is fully governed, all data is structured, partitioned/tuned for data
access, governance includes a guarantee of completeness, accuracy, and privacy, consumers:
data scientists, ETL processes, applications, data analysts, and business users.
The refinery: the feedback loop between data science and data warehouse is critical.
● Data minimization: we would like to collect as much info as possible because we don't know what will be useful, but you can only collect the data required for the stated processing purpose.
● Accuracy: reasonable steps must be taken to ensure the collected data is accurate
and up to date.
● Storage limitation: data can be kept for a while but not forever (not longer than necessary) → approximately 6 years in Belgium.
● Integrity and confidentiality: appropriate cybersecurity measures must be put in
place to protect personal data being stored.
● Accountability: organizations are accountable for how they handle data and comply
with the GDPR.
GDPR compliance requirements
Depending on the type of data you collect and whether you are a processor or controller, you
may have to comply with some or all these changes.
● Data breach notifications: the controller should communicate the breach to the supervisory authority asap in case of a personal data breach (where feasible, not later than 72 hours after having become aware of it).
● Data Protection Impact Assessments (DPIAs): a DPIA is an evaluation of the effect of
a data processing activity on the security of personal data. Article 35 requires
controllers to conduct DPIAs in the event that one of their data processing activities is
“likely to result in a high risk to the rights and freedoms of natural persons.” Note this
case: “automated processing for purposes of profiling intended to evaluate personal
aspects of data subjects”
● Privacy by Design (PbD): the controller shall implement appropriate technical and
organisational measures for ensuring that, by default, only personal data which are
necessary for each specific purpose of the processing are processed. This practice
should ultimately minimize data collection. It argues for privacy and security to be
fully integrated into the design processes, procedures, protocols, and policies of a
business. There are seven major principles that guide this concept: privacy should be
the default setting, privacy should be proactive, not reactive, privacy and design
should go hand in hand, privacy shouldn’t be sacrificed for functionality, PbD should
be implemented for the full life cycle of the data, data collection operations should be
fully visible and transparent, user protection must be prioritized.
● Consent acquisition: consent should be given by a clear affirmative act establishing a
freely given, specific, informed and unambiguous indication of the data subject’s
agreement to the processing of personal data relating to him or her, such as by a
written statement, including by electronic means, or an oral statement. Controllers
are no longer able to use opt-out or implied methods of consent — such as pre-ticked
boxes, silence, or not stating that they do not want ... Specificity and unambiguity mean that the usage of the collected data is clearly stated and that consent is given only for that usage.
● Data Subject Access Requests (DSAR): EU citizens have 8 rights over data collected
from them:
1. The right to be informed: data subjects should be able to easily learn how their
data is collected and processed.
2. The right of access: data subjects have the right to request to access any data that
has been collected from them.
3. The right of rectification: data subjects have the right to request to change
inaccurate or incomplete data that has been collected from them.
4. The right to erasure: individuals have the right to request the deletion of their data,
also referred to as the ‘right to be forgotten’.
5. The right to restrict processing: individuals have the right to request to block
specific data processing activities.
6. The right to data portability: individuals have the right to request to retain and
reuse their data for other services.
7. The right to object: data subjects have the right to object to the use of their data
for certain processing activities.
8. Rights in relation to automation: data subjects have the right not to be subject to a
decision based solely on automated processing, including profiling, which produces
legal effects concerning him or her or similarly significantly affects him or her.
To exercise these rights, data subjects can make direct requests to controllers,
whether it be through a phone call, email, or web form. These requests must be
addressed quickly as the GDPR only gives controllers 30 days to respond.
● Appointing a Data Protection Officer (DPO): a DPO must be appointed if the
processing is carried by a government entity, the controller/processor regularly
collects and processes a large amount of data, the controller/processor processes a
variety of sensitive personal information. A DPO plays several key roles in your GDPR
compliance plan. They are responsible for: educating controllers and processors on
how they must comply with the regulation, monitoring compliance efforts, offering
advice on data protection assessments, acting as the point of contact for the
supervisory authority.
What are the lawful bases for data processing?
Article 6 of the GDPR outlines six lawful bases for data processing:
1. Consent of the data subject
2. For fulfilment of a contract
3. Legal compliance
4. To protect the vital interests of the data subject
5. Necessity for carrying out a task that is in the public interest
6. Necessity for the purposes of legitimate interests of the data controller or third party
Legitimate Interest
Legitimate interest refers to any interest that provides a benefit to one or more parties
involved in the processing of data. Legitimate interests can be personal, commercial, or even
societal interests. For example, if you process data in the interest of your business operations, your activities may fall under GDPR legitimate interests:
● Fraud detection and crime prevention
● Network and information security
● Processing employee or client data
● Direct marketing, although some data processing activities for marketing, like sending marketing emails, require user consent.
Legitimate interests can be invoked except where such interests are overridden by the
interests or fundamental rights and freedoms of the data subject.
GDPR worldwide
China is clearly not using GDPR (cameras everywhere).
Over 100 countries have now implemented new data protection laws to regulate the flow of
personal data, and there is more legislation to come. One such law is the California
Consumer Privacy Act (CCPA), in effect since January 1, 2020. This law is already controversial
and has forced many US companies to rethink their data collection strategies.
US companies had varying responses to the GDPR. Since the GDPR legislation came into
effect, over 1000 major US publications have blocked users who are EU citizens, rather than
risk noncompliance. Based on an Ovum report commissioned by Intralinks, 52% of US
companies think that they are likely to be fined for noncompliance.
GDPR: good or bad?
Companies such as Facebook have a lot of power. But when you use it, you consent to give
your data, so Facebook has legitimate interest.
GDPR has damaged the competitive position of European companies: costly process changes around data collection and processing; data collector companies (such as Bisnode in Belgium) had to dramatically reduce the scope of their data delivery, reducing the ability of all companies to reach new clients, to the advantage of companies like Facebook or Google; and added resistance within our companies to taking full advantage of their data to improve customer intimacy. In the future, it might turn into a competitive advantage: if all customers around the globe ask for more protection of their privacy, European companies would be in pole position...
BI exam questions
1) What is a wrapper? What is it used for? What are its advantages compared with the correlation coefficient?
Learning curve
● What are training instances? Why does the decision tree surpass logistic regression?
Training instances = the data in the training set.
As the training size grows (x axis), generalization performance (y axis) improves.
Logistic regression has less flexibility, which allows it to overfit less with small data,
but keeps it from modeling the full complexity of the data. Tree induction is much
more flexible, leading it to overfit more with small data (overfitting because it grows
the tree until having pure leaf nodes), but to model more complex regularities with
larger training sets (more flexible: several decision boundaries).
● Linear, nonlinear? Why?
DT = nonlinear (splits the instance space with several decision boundaries); logistic = linear (the log-odds, ln(p/(1-p)), are a linear function of the features, even though the probability function itself is nonlinear).
● Does the DT have pruning here? Why?
No, because we see it overfits at the beginning (it would not if pruned), since it grows the tree until having pure nodes.
4) How do you get overfitting on Decision Tree? How do you treat it? Explain the
various methods (errors measures)?
DTs overfit because they are highly flexible (complex): the tree is grown until it has pure nodes (at the end, it learns things on the training set that are not generalizable = noise). We should control the complexity of the model, by limiting the number of nodes for example, or by pruning the tree.
When to stop splitting to avoid overfitting? Pruning is a mechanism that
automatically reduces the size and complexity of a DT to grow a tree with the best
generalization capacity.
- Pre-pruning (early stopping): χ² test, Gini or information gain, classification error.
- Post-pruning: fully grow the tree and then post-prune; replace a subtree by a terminal node if, when pruning, the error is reduced by more than a defined threshold → error rate: optimistic vs pessimistic approach / minimal cost-complexity pruning.
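A hedged sketch of post-pruning via minimal cost-complexity pruning in scikit-learn (the ccp_alpha values are illustrative; pre-pruning would instead constrain max_depth or min_samples_leaf):

# Larger ccp_alpha values prune the tree more aggressively (smaller, simpler trees).
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for alpha in [0.0, 0.005, 0.02]:
    tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0)
    cv_acc = cross_val_score(tree, X_tr, y_tr, cv=5).mean()
    test_acc = tree.fit(X_tr, y_tr).score(X_te, y_te)
    print(f"alpha={alpha}: cv accuracy={cv_acc:.3f}, test accuracy={test_acc:.3f}")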
5) What are the interest points of DT? Compare them with the interest points of
Logistic Regression?
- DT: flexible, thus can represent complex models well; easily understandable (business sense); categorical & numerical features; risk of overfitting (from overtraining) + unstable; not searching for the optimal model; includes embedded feature selection and missing value treatment; class probas are not reliable; sensitive to unbalanced classes; nonlinear; robust to outliers; computationally expensive.
- Logistic regression: less in danger of overfitting (global decision surface; overfitting comes mainly from dimensionality); only numerical features; does not accept missing values; highly influenced by noise; need to add regularization/dimensionality reduction; gives probas; linear; CPU intensive.
6) “What is data governance? How to combine Data Lake and DWH? I started talking
about the data lake, and he asked to describe + give example of the “pyramid”.
Data governance is the collection of practices and processes which help to ensure the
formal management of data within an organization. It deals with compliance, security
and privacy, integrity and quality, availability and usability, roles and responsibilities,
overall management of the internal and external data flows within an organization.
What is missing from a DWH is a Data Lake = a storage and processing layer for all data.
The Hadoop Data Lake: a different governance demand at each tier.
Pyramid: landing area: raw data collection – data lake: turn data into information – DS: agile business insight – big data warehouse: user community queries and reporting.
Why? When you need all types of data (which cannot be supported by the DWH) and you keep the DWH because of memory, consistency issues, subject orientation, large queries, etc.
How? Data Lake (DS can use the data) → ETL (to transform unstructured data) → DWH.
7) What can cause overfitting with regression models (including the logistic)? What
are the solutions to address this?
8) You have a fact table with historical data from 3 years ago until now. This fact table
is connected to some dimension tables. Imagine you want to add a new attribute to
one dimension table? What will happen? Which elements do you have to be careful
with? (Explain the different situations possible)
You should pay attention to the granularity impact of the new element:
- If the element increases the granularity of the fact table, it is not possible
because it means that each record of the fact must be split into n records and
hence the actual surrogate keys are not valid anymore.
- If the element does not modify the granularity of the fact table, it can be
added.
9) It’s a case a bit similar to the capitalization insurance (the business value
framework). A university wants to send letters to their alumni to ask a donation
from them for a project (150€ donation) and with a cost of 10€ per letter. You have
to explain how you will target the people that you will send the letter to (you have
some data on them) and all the steps from the business value framework
(probability and profit matrices, expected profit, etc.).
Business value framework (response model):
- Build a model (from a random sample) for computing the proba to respond positively given the customer profile. Build a general model and a segmented one → compare the 2 model designs and choose the best one.
Cost/benefit matrix + expected value + probas
- Build a model to approximate the LTV of respondents (with data of 5 years of
observation) + for each segment.
- Compute the EP for each customer within each segment.
- Select in each segment, those for which EP > cost of acquisition. (No one can
be selected twice). TARGET.
10) How do you compute the score on a regression model? What do you need to have
as classifier? (Decision rule)
11) What are the strengths and disadvantages of dimensional modelling Vs 3NF?
Strengths:
- Predictable and understandable, standard framework: semantic not hidden ><
3NF
- Respond well to changes in user reporting needs: aggregation of low-level
facts, hence any report can be built on this.
- Multidimensional models are more efficient for queries and analysis: joining
small tables (dimensions) on large ones (fact) can be efficient with right
indexes. >< 3NF: joins are costly (decrease performance).
- There exists a number of products supporting the dimensional model
(semantic fixed).
Weaknesses:
- Dimensions might be too large.
- Relations between dimensions exist only through the fact table.
- Normalized and indexed relational models are more flexible (3NF).
Overcoming weaknesses:
- We use a combination of both relational and dimensional models at high
granularity.
- Link across dimensional objects: factless fact table.
12) What is variance and how is it computed? To which measures is it applicable? How
do you explain it to a random person (= someone not knowing DS)?
Variance is an important measure of the quality of a model in general. Variance tells
us how much chance influences the performance of any prediction. The best model
is the one with a high accuracy and a low variance.
It is computed through cross-validation (look at the variance of the results between folds): you separate your dataset into k folds, train the model on k-1 folds, test it on the remaining one, and then redo this for each combination of folds. Then you can compare the performance between the combinations (AUC, accuracy, etc.) and look at the variance.
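A minimal sketch of this procedure (synthetic data; any scorer such as AUC or accuracy can be used):

# Estimate performance variance across cross-validation folds.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=800, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=10, scoring="roc_auc")
print(f"mean AUC = {scores.mean():.3f}, std across folds = {scores.std():.3f}")
# A good model has both a high mean AND a low spread between folds.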
13) How do you use a classification model to make predictions and what are the points
of attention? I explained the observation period, latency period, ...
- ! to variables that are artifacts of the target itself (find nonsense perfect
model).
- Variables must be stable in time (computed the same way between build and
apply).
- Latency can be used to remove artifacts.
- Apply data must always be computed in the same way as the build data.
- ! Seasonal event behaviour
Type 1: no history
Type 2: new dimension record for each new value (new surrogate key), valid_from
and valid_to attributes needed.
Type 3: attribute in the dimension record for previous values (history only at the
dimension level)
15) Data Science Case: You are a data scientist, and your client is a hospital, they
developed a new medicine for patients suffering from heart disease. When building
your model to classify potential patients that may have the disease, we ask you if
you would give more importance to precision, recall or none of them?
Minimize false negatives (people who are sick but not detected) as opposed to false positives (not sick but receiving the medicine, which is less severe).
Recall = TP / (TP + FN)
Precision = TP / (TP + FP)
We want to maximize recall, so more importance is given to recall.
(Or focus on precision if we want to select only sick people and minimize the risk, e.g., for insurances.)
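A tiny sketch of the two metrics on a toy set of predictions (hypothetical labels):

from sklearn.metrics import precision_score, recall_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]   # 3 TP, 1 FN, 2 FP, 4 TN
print("recall    =", recall_score(y_true, y_pred))      # TP / (TP + FN) = 3/4
print("precision =", precision_score(y_true, y_pred))   # TP / (TP + FP) = 3/5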
A surrogate key uniquely identifies each entity in the dimension table, regardless of
its natural source key. This is primarily because a surrogate key generates a simple
integer value for every new entity (auto-increment). Surrogate keys are necessary to
handle changes in dimension table attributes.
To store the history of dimensions in the DWH (add a new surrogate key for each
new change). We cannot do that with the business keys of the transactional model.
17) Overfitting: what is it? How to avoid it for decision tree, SVM and logistic
regression?
Overfitting occurs when the model classifies seen cases much better than unseen
cases: which means that the generalization of the model is poor. Trade-off to make
between overfitting and model complexity.
DT and logistic: see above
SVM: overfitting is implicitly reduced by explicitly trying to maximize the margin, hence preserving the generalization ability; also dimensionality reduction or regularization.
18) What are the methods of variable selection?
20) You work at UPS, a delivery company, which own 50.000 small delivery trucks
worldwide. Everyday more than 80 of them break down on their delivery way, and
this causes enormous problems of re-assigning the parcels, and delivering them on
time. UPS collects a lot of data on their trucks in real time: engine temperature,
pressure, kms done, and many other technical measures on the engine and the
drive. When a break down occurs, data on the break down problem and its
resolution is captured as well. The COO wants you to reduce on the road break
downs using the data collected. (Same as in the course)
It is possible with a predictive model. Target = a truck that breaks down. Observation period: months (but try different model designs and test). Target period: 1 day (latency = 12h, the night).
21) Improving a bank credit scoring: You are responsible to improve short term loans
acceptance policy of your bank. The bank has many records of past acceptance and
refusal for such loans for the last 5 years.
Target = somebody who defaulted
● Which data do you want to use?
Demographic data (nothing else because of GDPR) such as age, area, etc. → people in the same suburb will be in the same vector space. (Model for unknown clients.)
Use the data available on clients (model for known clients).
● How would you select the best creditors?
Expected value framework: sample (selection bias), model, compute EP for each
customer within each segment and choose all customers that have EP > c.
● Bonus question: how would you compute the profit of a future customer?
By doing a regression analysis to predict it (relation between X and Y).
22) How would you build a model to predict churn in your business knowing that the
model is dynamic and there is not a lot of churns in the number of data?
Predictive model: observation, latency, event periods (the length of the periods depends on the business problem; here: months for the observation) → build the model (looking at target occurrences in the event period, searching for a model in the observation period). Then apply it on fresh data.
For rare target events, we might want to accumulate larger target sets by extending
the event period. But as the model is dynamic, we will lose precision.
Points of attention: see above.
Conformed dimensions are dimensions shared by several fact tables. We need to have them and to identify them, otherwise we still have silos of information.
To be able to manipulate them (sum, group by, etc.). Otherwise, the fact table is useless: we just have a lot of lines that cannot be grouped into measures.
26) Will you have only one answer with regressions, logistic regression and SVM?
Please draw the answer or answers if there are several.
Several answers (all the points on the regression line, margin for SVM)
● He asked me the formulas of the logistic regression and RSS
RSS = Σ_i (y_i − ŷ_i)². Logistic regression: P(y=1|x) = 1 / (1 + e^−(β0 + β1·x1 + ... + βp·xp)).
● What is the SVM? Classifier that tries to maximize the margin between training data
and the classification boundary.
27) Cite 2 ways to build hierarchies in a star schema?
Graph: tree size (model complexity) against performance (accuracy), a line for the
training set and another for the test set.
When the size of the tree increases, both accuracies increase, but the training one is always higher as the model is built on the training data.
After the sweet spot, the training accuracy continues to increase until it reaches 1. The tree can memorize the entire set. The test accuracy decreases. The subsets of data at the leaves
get smaller and smaller, and the model generalizes from fewer and fewer data
(error-prone).
Big data is a term applied to datasets whose size is beyond the ability of commonly
used software tools to capture, manage, and process within a tolerable elapsed time.
Big data are non-structured, really big, and come from real-time streams. Example:
web logs at Google.
Bagging ensemble method: train learners in parallel on different samples of the data,
then combine by voting or by averaging.
RF = a forest of decision trees created by adding randomness to each DT build. You grow K trees on datasets sampled from the original one with replacement (bootstrap samples):
- Draw K bootstrap samples of size N.
- Grow each DT, by selecting a random set of m out of p features at each node
and choose the best feature to split on.
- Aggregate the prediction of the trees (voting from different learners) to
produce the final class.
How is diversity ensured among individual trees? Each tree is trained on different data (bagging) + a random set of features, so corresponding nodes in different trees generally cannot use the same feature set to split.