DATA SCIENCE
An introduction to the field
ALEX LEMM
PM PREDICTIVE ANALYTICS
© 2018 Software AG. All rights reserved. For internal use only
MOTIVATION
COURSE DESCRIPTION
COURSE GOALS
After this course you will
• have a solid understanding of the entire data science process and common data science terms
• have used data science to solve a real-world problem (in finance) using R
• know two classical classification methods: decision trees and random forests
• understand the difference between model building and model execution
• understand which parts of the data science process are covered by Zementis
• be able to qualify AI/ML/Predictive Analytics opportunities
• be able to handle initial talks with customer analytics teams successfully on your own
After this course you won’t
• be a data scientist or an R expert (but you will have a good feel for, and understanding of, how data scientists work)
MACHINE LEARNING
WHAT IS MACHINE LEARNING?
Machine Learning is using data to answer questions.
Two phases: Training and Prediction
THE 5 QUESTIONS MACHINE LEARNING CAN ANSWER
THE DATA SCIENCE PROCESS
[Figure: data science process pipeline. Data sources: devices, customers, processes, events/transactions, IT operations, back-end/internal systems, reference data, MES. Steps: Connect, Integrate, Enrich, Store, Predict, Deploy, Decide, Alerts, Visualize, Workflow. Models (PMML) are tracked and re-trained. Data science is done by the analytics team.]
[Figure: the data science process. A question is posed in the beginning or comes from the data; the analytics team works on it; the process ends with an export to PMML.]
FUNCTIONAL PROGRAMMING VS OOP
FP OOP
f(x)
fireTorpedo() startEngine()
TYPICAL DATA STRUCTURE
Example
EXERCISE LOGISTICS
1 Start RStudio
Time to Practice
R Introduction – Exercises 1 - 9
PART 1:
PROBLEM INTRODUCTION
OUR PROBLEM DOMAIN: CREDIT RISK MODELING
WHAT IS LOAN DEFAULT?
COMPONENTS OF EXPECTED LOSS (EL)
HOW DOES A BANK CALCULATE THE RISK?
EL = PD x EAD x LGD
(PD = probability of default, EAD = exposure at default, LGD = loss given default)
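The formula can be tried out with a few lines of code. The following is a minimal sketch in Python (the course exercises themselves use R); the function name and the loan figures are made up for illustration and are not taken from the course data.

```python
def expected_loss(pd: float, ead: float, lgd: float) -> float:
    """Expected loss EL = PD x EAD x LGD.

    pd  : probability of default, between 0 and 1
    ead : exposure at default, in currency units
    lgd : loss given default, as a fraction of the exposure (0 to 1)
    """
    if not 0.0 <= pd <= 1.0:
        raise ValueError("pd must be between 0 and 1")
    if not 0.0 <= lgd <= 1.0:
        raise ValueError("lgd must be between 0 and 1")
    return pd * ead * lgd


# Hypothetical loan: 5% probability of default, 10,000 EUR exposure,
# 40% of the exposure lost if the borrower defaults.
el = expected_loss(pd=0.05, ead=10_000, lgd=0.40)
print(f"Expected loss: {el:.2f} EUR")
```

In this course the focus is on the PD component: the models built later estimate whether a borrower will default.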
RESEARCH QUESTION
• Application information:
– Income
– Age
– …
• Behavioral information:
– Payment arrears in account history
– Spending history
– …
INTRODUCING THE LOAN DATA SET
Time to Practice
Problem Introduction – Exercise 1
PART 2:
IMPORTING DATA
Import Explore Transform Model Evaluate Communicate
POSSIBLE DATA SOURCES
FIRST EXPLORATION RIGHT AT THE BEGINNING
Time to Practice
Importing Data – Exercises 1 - 4
PART 3:
EXPLORING DATA
Import Explore Transform Model Evaluate Communicate
EXPLORING DATA OVERVIEW
Introduction
• The first step in data analysis is exploring the data using simple descriptive
statistics and graphical techniques
• This task is also called exploratory data analysis (EDA for short)
Purpose
• Develop a general understanding and "feeling" of your data
• Identify potential problems such as missing values and outliers
• Identify possible new features (feature engineering)
• Identify possible transformations
HOW TO DEVELOP AN UNDERSTANDING OF YOUR DATA
WHAT TYPE OF VARIABLES ARE THERE?
A ROUGH CLASSIFICATION
Variable types
EXPLORING CATEGORICAL VARIABLES
Visual
• Barplot
Statistical
• Frequency table
• Percentage frequency table
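The two statistical summaries can be sketched in a few lines. This is an illustrative Python example (the course exercises use R); the sample values for home_ownership are made up and the function names are hypothetical.

```python
from collections import Counter

# Made-up sample of a categorical variable such as home_ownership.
home_ownership = ["RENT", "MORTGAGE", "RENT", "OWN",
                  "MORTGAGE", "RENT", "OTHER", "RENT"]


def frequency_table(values):
    """Count how often each category occurs."""
    return dict(Counter(values))


def percentage_table(values):
    """Share of each category in percent of all observations."""
    counts = Counter(values)
    total = sum(counts.values())
    return {level: 100.0 * n / total for level, n in counts.items()}


freq = frequency_table(home_ownership)
pct = percentage_table(home_ownership)

# Print both tables side by side, most frequent category first.
for level in sorted(freq, key=freq.get, reverse=True):
    print(f"{level:<10}{freq[level]:>3}{pct[level]:>8.1f}%")
```

A barplot shows the same counts graphically, one bar per category.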
EXPLORING THE HOME_OWNERSHIP VARIABLE (1/2)
EXPLORING THE HOME_OWNERSHIP VARIABLE (2/2)
Time to Practice
Exploring Data – Exercises 1 - 5
EXPLORING THE LOAN_STATUS VARIABLE
EXPLORING NUMERICAL VARIABLES
Visual
• Histogram
• Density plot
• Scatter plot
• Boxplot
Statistical
• Five-number summary (minimum, 1st quartile, median, 3rd quartile, maximum)
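The statistical summary of a numerical variable can be computed as follows. This is a minimal Python sketch (the course exercises use R); the interest rates are made up, and the quartile method chosen here is one of several common conventions, so other tools may report slightly different quartiles.

```python
import statistics


def five_number_summary(values):
    """Minimum, 1st quartile, median, 3rd quartile and maximum.

    Missing values must be removed before calling this function.
    """
    data = sorted(values)
    # 'inclusive' interpolates between observed values; this choice of
    # quartile method is an assumption made for this sketch.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {"min": data[0], "q1": q1, "median": median,
            "q3": q3, "max": data[-1]}


# Made-up interest rates in percent.
int_rate = [5.4, 7.9, 10.0, 11.1, 12.7, 13.5, 15.2]
summary = five_number_summary(int_rate)
for name, value in summary.items():
    print(f"{name:<7}{value:>6.2f}")
```

A boxplot draws exactly these five numbers, which makes outliers easy to spot.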
EXPLORING THE INT_RATE VARIABLE (1/2)
EXPLORING THE INT_RATE VARIABLE (2/2)
EXPLORING THE ANNUAL_INC VARIABLE
[Figure: (1) histogram and (2) scatter plot of annual_inc]
Time to Practice
Exploring Data - Exercises 6 - 12
OUR FINDINGS
• The variables int_rate and emp_length have many missing values
• There is one clear outlier in the income variable
• There is one clear outlier in the age variable
• loan_status should be recoded from numeric to categorical (important for many ML algorithms)
PART 4:
TRANSFORMING DATA
Import Explore Transform Model Evaluate Communicate
TRANSFORMING DATA OVERVIEW
Introduction
• The primary purpose of transforming the data is to address all issues identified during EDA
• The data must be prepared so that it can be consumed directly by any ML algorithm further downstream
• IMPORTANT: All steps conducted in this phase must be properly recorded, because new data must later go through the same pre-processing steps
Purpose
• Deal with missing values (Slides)
• Get rid of outliers if necessary (Exercise)
• Engineer new features and add them to the data
• Fully pre-process data so that it can go into modeling
DEALING WITH MISSING VALUES
• Delete rows/columns
• Replace values
• Keep values
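The three options can be compared on a small example. This is a Python sketch for illustration only (the course exercises use R); the records are made up, the median is just one possible replacement value, and the indicator column is one assumed way of keeping the information that a value was missing.

```python
import statistics

# Made-up records; None marks a missing value.
loans = [
    {"age": 25, "int_rate": 10.5},
    {"age": 31, "int_rate": None},
    {"age": 47, "int_rate": 13.0},
    {"age": 38, "int_rate": None},
    {"age": 52, "int_rate": 7.5},
]


def delete_rows(rows, column):
    """Option 1: drop every row where the column is missing."""
    return [row for row in rows if row[column] is not None]


def replace_values(rows, column):
    """Option 2: replace missing values, here with the median."""
    known = [row[column] for row in rows if row[column] is not None]
    fill = statistics.median(known)
    return [{**row, column: fill if row[column] is None else row[column]}
            for row in rows]


def keep_values(rows, column):
    """Option 3: keep the rows and record that the value was missing."""
    return [{**row, column + "_missing": row[column] is None}
            for row in rows]


print(len(delete_rows(loans, "int_rate")))        # rows left after deletion
print(replace_values(loans, "int_rate")[1])       # second row, value filled in
print(keep_values(loans, "int_rate")[1])          # second row, flagged as missing
```

Deleting is the simplest option but loses observations; replacing and keeping retain all rows.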
WHAT SHOULD WE DO IN OUR CASE?
KEEP MISSING INFORMATION IN A NUMERIC VARIABLE
BINNING THE INT_RATE VARIABLE
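Binning turns the numeric variable into a categorical one, so that "missing" can become a category of its own. The following Python sketch illustrates the idea (the course exercises use R); the bin boundaries and labels are hypothetical and are not the ones used in the course.

```python
from collections import Counter


def bin_int_rate(rate):
    """Map an interest rate (in percent) to a category.

    Missing values (None) get their own category instead of being
    deleted or replaced. The bin boundaries are assumptions.
    """
    if rate is None:
        return "Missing"
    if rate < 8.0:
        return "0-8"
    if rate < 11.0:
        return "8-11"
    if rate < 13.5:
        return "11-13.5"
    return "13.5+"


# Made-up interest rates with two missing values.
int_rate = [5.4, None, 9.2, 12.7, None, 15.2, 10.9]
int_rate_cat = [bin_int_rate(rate) for rate in int_rate]
print(Counter(int_rate_cat))
```

After the transformation, every observation has a valid category and no rows have to be dropped.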
INT_RATE AFTER THE TRANSFORMATION
Time to Practice
Transforming Data - Exercises 1 - 2
RESULTS
• We removed the single observation that produced the outlier in both the age and the annual_inc variables
• We kept the information about missing values in the emp_length and int_rate variables using binning
• We transformed the loan_status variable from numeric to categorical
• The transformed data is stored in a new variable, loan_data
PART 5:
MODELING DATA
Import Explore Transform Model Evaluate Communicate
SOME MACHINE LEARNING THEORY
This is what we want in the end:
SUPERVISED LEARNING
[Figure: the training data consists of input (independent variables, predictors) and output (dependent variable, response); an ML algorithm turns the training data into an ML model.]
Supervised learning is the machine learning task of inferring a model from labeled training data.
GETTING STARTED
loan_data
TRAINING AND TEST SET
training_set test_set
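Splitting the data into a training set and a test set can be sketched as follows. This is an illustrative Python example (the course exercises use R); the split ratio of 2/3 to 1/3, the seed, and the placeholder records are assumptions, not values taken from the course.

```python
import random


def train_test_split(rows, train_fraction=2 / 3, seed=42):
    """Randomly assign rows to a training set and a test set.

    A fixed seed makes the split reproducible.
    """
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    cut = round(train_fraction * len(rows))
    training = [rows[i] for i in indices[:cut]]
    test = [rows[i] for i in indices[cut:]]
    return training, test


# Placeholder records standing in for the rows of loan_data.
loan_data = [{"id": i} for i in range(9)]
training_set, test_set = train_test_split(loan_data)
print(len(training_set), len(test_set))
```

The model is built on training_set only; test_set is held back so that the model can later be evaluated on data it has not seen.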
YOU WILL BUILD 2 ML MODELS IN THIS COURSE
What’s next?
• Complicated math
• Building can take a lot of time
• In most cases a simple one-liner
WHAT IS A DECISION TREE?
[Figure: example decision tree with the yes/no splits "Income > 64K€?", "Age < 27?" and "Age > 42?" and the leaves Non-default and Default. The ML model is applied to a new applicant (age: 29 years, income: 32K euros).]
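A decision tree is a sequence of yes/no questions, which can be written as nested conditions. The Python sketch below is illustrative only (the course exercises use R): the three questions come from the slide, but which branch leads to which leaf is an assumption, so the predicted class is not necessarily the one shown in the course.

```python
def classify(age: int, income_k_eur: float) -> str:
    """Walk down a small decision tree and return the predicted class.

    ASSUMPTION: the arrangement of branches and leaves below is
    hypothetical; only the three questions are taken from the slide.
    """
    if income_k_eur > 64:      # Income > 64K€?
        return "Non-default"
    if age < 27:               # Age < 27?
        return "Non-default"
    if age > 42:               # Age > 42?
        return "Non-default"
    return "Default"


# The applicant from the slide: 29 years old, income 32K euros.
print(classify(age=29, income_k_eur=32))
```

Prediction is cheap: each new applicant only has to answer a handful of questions from the root to a leaf.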
HOW TO BUILD A CLASSIFICATION TREE
[Figure: scatter plot of income (in 1000s of €) against age, with default and non-default observations, partitioned by splits 1, 2 and 3. Next to it, the resulting classification tree with the splits "Income > 64K€?", "Age < 27?" and "Age > 42?". The ML algorithm produces the ML model.]
RANDOM FOREST THEORY: ALGORITHM TO BUILD RF
Variables in columns
Time to Practice
Modeling Data - Exercises 1 - 4
PART 6:
EVALUATING MODELS
Import Explore Transform Model Evaluate Communicate
HOW TO EVALUATE THE MODELS
Test data -> ML Model -> Predictions
A SYSTEMATIC WAY TO COMPARE
THE CONFUSION MATRIX
                               Model prediction
                               Non-default (0)   Default (1)
Actual loan   Non-default (0)         9               4
status        Default (1)             1               2
MEASURES DERIVED FROM THE CONFUSION MATRIX
                               Model prediction
                               Non-default (0)   Default (1)
Actual loan   Non-default (0)       9 (TN)          4 (FP)
status        Default (1)           1 (FN)          2 (TP)

Accuracy = (TN + TP) / (Table Sum) = (9 + 2) / 16 = 68.75%
Specificity (TN Rate) = TN / (TN + FP) = 9 / (9 + 4) = 69.23%
Sensitivity (TP Rate) = TP / (FN + TP) = 2 / (1 + 2) = 66.67%
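The three measures can be recomputed from the four cells of the confusion matrix. This is a small Python sketch for illustration (the course exercises use R); the function name is hypothetical, and the counts are the ones from the slide.

```python
def confusion_measures(tn: int, fp: int, fn: int, tp: int) -> dict:
    """Accuracy, specificity and sensitivity from a 2x2 confusion matrix."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,    # share of correct predictions
        "specificity": tn / (tn + fp),    # true negative rate
        "sensitivity": tp / (fn + tp),    # true positive rate
    }


# Counts from the confusion matrix: TN = 9, FP = 4, FN = 1, TP = 2.
measures = confusion_measures(tn=9, fp=4, fn=1, tp=2)
for name, value in measures.items():
    print(f"{name:<12}{value:.2%}")
```

Sensitivity matters most when missed defaults are costly; specificity matters most when rejecting good borrowers is costly.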
HOW TO MEASURE OUR SUCCESS
BASELINE MODEL
The baseline model always predicts the majority group
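Actual loan status: 13 x non-default (0), 3 x default (1)
The majority group is non-default (0), so the baseline model predicts non-default for every loan.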
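The accuracy of the baseline model follows directly from the class counts. This is a minimal Python sketch (the course exercises use R), assuming the counts above are the actual loan statuses in the test data: 13 non-defaults (0) and 3 defaults (1).

```python
from collections import Counter

# Actual loan status: 13 non-defaults (0) and 3 defaults (1).
actual = [0] * 13 + [1] * 3


def baseline_accuracy(labels):
    """Return the majority class and the accuracy of always predicting it."""
    majority, count = Counter(labels).most_common(1)[0]
    return majority, count / len(labels)


majority, accuracy = baseline_accuracy(actual)
print(f"Baseline predicts {majority} with accuracy {accuracy:.2%}")
```

A model is only useful if it improves on this baseline in the measures that matter for the problem; note that the baseline never detects a single default.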
RESULTS
PART 7:
EXPORTING THE MODEL
Export to
PMML
NOW WHAT? HOW TO USE YOUR MODEL IN PRODUCTION
POSSIBLE PRODUCTION SETUPS
[Figure: three possible production setups (1, 2, 3), including Zementis]
OUR PRODUCTION SETUP
TRANSFORM / DEPLOY / PREDICT NEW DATA
1 Transform the final ML model to PMML
2 Deploy the PMML file to Zementis
Time to Practice
Exporting the Model - Exercises 1 - 3
CONGRATS! YOU MADE IT THROUGH THE ENTIRE COURSE