
DEMYSTIFYING DATA SCIENCE
An introduction to the field

ZEMENTIS TRAINING DAY #1


BANGALORE | JUNE 07, 2018

ALEX LEMM
PM PREDICTIVE ANALYTICS

© 2018 Software AG. All rights reserved. For internal use only
MOTIVATION

“The customer gave me this R script which represents a neural net. Can you have a look and export it to PMML?”

“The Zementis acquisition is great. We can offer completely self-learning algorithms to our clients!”

“Do we now power self-driving cars?”

COURSE DESCRIPTION

Course Overview
• This data science introduction course is the first part of the 2-day Zementis ramp-up training
• The course introduces the data science process and common analytics terminology hands-on using R
• Students will solve a real-world data science problem on their own
• Their understanding of the entire process will lay the foundation for the subsequent parts of the training

Prerequisites
• No prior analytics or R knowledge is required
• The course is completely self-contained
• Students should just bring a general interest in and curiosity about analytics topics

COURSE GOALS

After this course you will
• have a solid understanding of the entire data science process and common data science terms
• have used data science to solve a real-world problem (in finance) using R
• know two classical classification methods: decision trees and random forests
• understand the difference between model building and model execution
• understand which parts of the data science process are covered by Zementis
• be able to qualify AI/ML/Predictive Analytics opportunities
• be able to handle initial talks with customer analytics teams successfully on your own

After this course you won’t
• be a data scientist or an R expert (but you will have a pretty good feel for how data scientists work)
MACHINE LEARNING

WHAT IS MACHINE LEARNING?

Machine Learning is using data to answer questions

Training → Prediction

THE 5 QUESTIONS MACHINE LEARNING CAN ANSWER

1. Is this A or B? → Classification algorithms
2. How much or how many? → Regression algorithms
3. How is this organized? → Clustering algorithms
4. Is this weird? → Anomaly detection algorithms
5. What should I do next? → Reinforcement learning algorithms

THE DATA SCIENCE PROCESS

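[Figure: the data science process. A question (?) is posed in the beginning or comes from the data. The analytics team works through the steps Import → Explore → Transform → Model → Evaluate → Communicate. Diagram label: "70% of the time". Final results are put into production if the business value is evident.]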



CLOSED-LOOP PREDICTIVE ANALYTICS
[Figure: closed-loop predictive analytics, drawn as four stages: Data Sources, Integration & Enrichment, Prediction & Analytics, and Actions.
Labels under Data Sources: devices/customers/etc., events/transactions, back-end/internal systems (MES, ERP), reference data.
Pipeline steps: Connect, Enrich, Predict, Decide, and below them Integrate, Store, Deploy, Alerts, Visualize.
Labels under Actions: processes, IT operations, alerts, workflow.
Feedback loop: Track, outcomes/results, Re-train, models (PMML).
Data science analytics team: Import → Explore → Transform → Model → Evaluate → Review.]



FOCUS OF THIS COURSE

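[Figure: the data science process with the steps covered in this course. A question (?) is posed in the beginning or comes from the data. The analytics team works through Import → Explore → Transform → Model → Evaluate → Communicate, followed by an export to PMML.]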



PART 0:
R INTRODUCTION

FUNCTIONAL PROGRAMMING VS OOP

FP OOP

f(x)

fireTorpedo() startEngine()

TYPICAL DATA STRUCTURE

Example

Variables in columns, observations in rows:

Name             height   mass   gender   homeworld
Luke Skywalker   172      77     Male     Tatooine
Boba Fett        183      78.2   Male     Kamino
Chewbacca        228      112    Male     Kashyyyk
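
As a small illustration of this layout, the example can be sketched as an R data frame. The object name people is made up for this sketch, and heights are assumed to be in centimetres.

```r
# Rebuild the slide's example as a data frame: one row per observation,
# one column per variable. Heights are assumed to be in centimetres.
people <- data.frame(
  name      = c("Luke Skywalker", "Boba Fett", "Chewbacca"),
  height    = c(172, 183, 228),
  mass      = c(77, 78.2, 112),
  gender    = c("Male", "Male", "Male"),
  homeworld = c("Tatooine", "Kamino", "Kashyyyk"),
  stringsAsFactors = FALSE
)

people$height      # one variable (column) as a vector
people[2, ]        # one observation (row)
mean(people$mass)  # a function applied to a column
```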

EXERCISE LOGISTICS

1 Start RStudio

2 Execute the following command in the RStudio console:


learnr::run_tutorial("data_science_introduction", package = "zementisTutorials")

Time to Practice
R Introduction – Exercises 1 - 9

PART 1:
PROBLEM INTRODUCTION

Import Explore Transform Model Evaluate Communicate

OUR PROBLEM DOMAIN: CREDIT RISK MODELING
WHAT IS LOAN DEFAULT?


COMPONENTS OF EXPECTED LOSS (EL)
HOW DOES A BANK CALCULATE THE RISK?

• Probability of default (PD)


• Exposure at default (EAD)
• Loss given default (LGD)

EL = PD x EAD x LGD
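
A minimal worked example of the formula in R. All three input values are hypothetical and chosen only to make the arithmetic easy to follow.

```r
# Hypothetical loan: all three inputs are illustrative, not course data.
pd  <- 0.05    # probability of default (5%)
ead <- 10000   # exposure at default, in euros
lgd <- 0.40    # loss given default (share of the exposure that is lost)

expected_loss <- pd * ead * lgd
expected_loss  # 200 euros
```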

RESEARCH QUESTION

Can we predict default for new clients based on the default behavior of past clients and their respective data?

Which of the five questions is this?
1. Is this A or B?
2. How much or how many?
3. How is this organized?
4. Is this weird?
5. What should I do next?

Answer: Is this A or B? This is a classification problem.
INFORMATION USED BY BANKS

• Application information:
– Income
– Age
– …
• Behavioral information
– Payment arrears in account history
– Spending history
– …

INTRODUCING THE LOAN DATA SET

Time to Practice
Problem Introduction – Exercise 1

PART 2:
IMPORTING DATA
Import Explore Transform Model Evaluate Communicate

POSSIBLE DATA SOURCES

FIRST EXPLORATION RIGHT AT THE BEGINNING

• Was the import successful?


• Did you import the expected data?

class(), dim(), str(), head(), tail()
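
The five functions above can be tried on a tiny stand-in data set. The inline CSV text and its column names are assumptions made for this sketch and are not the course's loan file.

```r
# Illustrative stand-in for an imported file; values and columns are made up.
csv_text <- "loan_status,loan_amnt,int_rate,home_ownership,age
0,5000,10.65,RENT,33
0,2400,NA,RENT,31
1,10000,13.49,OWN,24
0,5000,NA,MORTGAGE,39"

loans <- read.csv(text = csv_text, stringsAsFactors = FALSE)

class(loans)    # type of the object, here "data.frame"
dim(loans)      # number of rows and columns
str(loans)      # column names, types, and first values
head(loans, 2)  # first rows
tail(loans, 2)  # last rows
```

If dim() or str() does not show the expected number of rows, columns, or column types, the import most likely needs different settings (separator, header, decimal mark).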

Time to Practice
Importing Data – Exercises 1 - 4

PART 3:
EXPLORING DATA
Import Explore Transform Model Evaluate Communicate

EXPLORING DATA OVERVIEW

Introduction
• The first step in data analysis is exploring the data using simple descriptive
statistics and graphical techniques
• This task is also called exploratory data analysis (EDA for short)

Purpose
• Develop a general understanding and "feeling" of your data
• Identify potential problems such as missing values and outliers
• Identify possible new features (feature engineering)
• Identify possible transformations

HOW TO DEVELOP AN UNDERSTANDING OF YOUR DATA

• The easiest way to develop a "feeling" for your data is to use questions to lead your investigation
• Feel free to investigate every idea and every question that comes to your mind. Each additional question that you ask will give you new insights
• The following types of questions are always a good starter for making discoveries within your data:
– What type of variation occurs within my variables?
– What is the relationship/behavior between my variables (covariation)?

WHAT TYPE OF VARIABLES ARE THERE?
A ROUGH CLASSIFICATION

Variable types

Numerical variables Categorical variables

EXPLORING CATEGORICAL VARIABLES

Visual
• Barplot

Statistical
• Frequency table
• Percentage frequency table
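
The three techniques listed above can be sketched in base R as follows. The home_ownership values are invented for this sketch.

```r
# Made-up values of a categorical variable
home_ownership <- c("RENT", "MORTGAGE", "RENT", "OWN",
                    "MORTGAGE", "RENT", "OTHER", "RENT")

freq <- table(home_ownership)              # frequency table
pct  <- round(100 * prop.table(freq), 1)   # percentage frequency table

freq
pct
barplot(freq, main = "home_ownership", ylab = "Count")  # barplot
```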

EXPLORING THE HOME_OWNERSHIP VARIABLE (1/2)

[Figure panels: barplot and frequency tables of home_ownership]

EXPLORING THE HOME_OWNERSHIP VARIABLE (2/2)

[Figure panels: barplot and frequency table of home_ownership]

Time to Practice
Exploring Data – Exercises 1 - 5

EXPLORING THE LOAN_STATUS VARIABLE

EXPLORING NUMERICAL VARIABLES

Visual
• Histogram
• Density plot
• Scatter plot
• Boxplot

Statistical
• Five-number summary (minimum, 1st quartile, median, 3rd quartile, maximum)
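
A short base R sketch of these techniques. The int_rate and age values are simulated for illustration and do not come from the loan data.

```r
set.seed(1)
int_rate <- round(rnorm(200, mean = 11, sd = 3), 2)  # simulated interest rates
age      <- sample(20:70, 200, replace = TRUE)       # simulated ages

fivenum(int_rate)  # minimum, lower hinge, median, upper hinge, maximum
summary(int_rate)  # quartile-based summary plus the mean

hist(int_rate, main = "Histogram")
plot(density(int_rate), main = "Density plot")
plot(age, int_rate, main = "Scatter plot")
boxplot(int_rate, main = "Boxplot")
```

Note that fivenum() reports hinges, which can differ slightly from the quartiles that summary() prints.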

EXPLORING THE INT_RATE VARIABLE (1/2)

[Figure panels: histogram and density plot of int_rate]

EXPLORING THE INT_RATE VARIABLE (2/2)

[Figure panels: histogram and density plot of int_rate]

EXPLORING THE ANNUAL_INC VARIABLE

[Figure panels: histogram and scatterplot of annual_inc. Plot note: 110 observations removed.]

Time to Practice
Exploring Data - Exercises 6 - 12

OUR FINDINGS

• The variables int_rate and emp_length have a lot of missing values
• There is one clear outlier in the income variable
• There is one clear outlier in the age variable
• loan_status should be recoded from numeric to categorical (important for many ML algorithms)

PART 4:
TRANSFORMING DATA
Import Explore Transform Model Evaluate Communicate

TRANSFORMING DATA OVERVIEW

Introduction
• The primary purpose of transforming the data is to address all issues identified during EDA
• The data needs to be prepared in such a way that it can be consumed directly by any ML algorithm further downstream
• IMPORTANT: All steps conducted in this phase must be properly recorded, because new data must go through the same pre-processing steps later

Purpose
• Deal with missing values (Slides)
• Get rid of outliers if necessary (Exercise)
• Engineer new features and add them to the data
• Fully pre-process data so that it can go into modeling
DEALING WITH MISSING VALUES

• Delete rows/columns
• Replace values
• Keep values
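
The three options above can be sketched in base R. The data frame and the choice of the median as replacement value are assumptions made for illustration only.

```r
# Small made-up data set with missing values in both columns
loans <- data.frame(
  int_rate   = c(10.65, NA, 13.49, NA, 7.9),
  emp_length = c(10, 3, NA, 1, 5)
)

# 1. Delete: drop every row that has at least one missing value
deleted <- na.omit(loans)

# 2. Replace: fill missing int_rate values with the median of the observed ones
replaced <- loans
replaced$int_rate[is.na(replaced$int_rate)] <- median(loans$int_rate, na.rm = TRUE)

# 3. Keep: record the missingness in an extra flag column
kept <- loans
kept$int_rate_missing <- is.na(kept$int_rate)
```

Deleting is the simplest option but discards three of the five rows here, which is why keeping or replacing is often preferred when many values are missing.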

WHAT SHOULD WE DO IN OUR CASE?

KEEP MISSING INFORMATION IN A NUMERIC VARIABLE
BINNING THE INT_RATE VARIABLE
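
One way to keep the missing information is to bin the numeric variable and give the missing values a category of their own. The sketch below uses base R; the bin edges and labels are illustrative assumptions and not the course's actual cut points.

```r
int_rate <- c(10.65, NA, 13.49, NA, 7.9, 15.2, 5.4)  # made-up values

# Bin edges are illustrative assumptions.
int_bin <- cut(int_rate, breaks = c(0, 8, 11, 13.5, Inf),
               labels = c("0-8", "8-11", "11-13.5", "13.5+"))

# Turn NA into its own category so the information "missing" is kept.
int_bin <- addNA(int_bin)
levels(int_bin)[is.na(levels(int_bin))] <- "Missing"

table(int_bin)
```

After this step the variable is categorical and contains no NA values, so algorithms that cannot handle missing data can still use it.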

INT_RATE AFTER THE TRANSFORMATION

Time to Practice
Transforming Data - Exercises 1 - 2

RESULTS

• We removed the single observation that caused the outlier in both the age and the annual_inc variable
• We kept the information about missing values in the emp_length and the int_rate variable using binning
• We transformed the loan_status variable from numeric to categorical
• The transformed data is stored in a new variable loan_data
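
A minimal sketch of the outlier removal and the recoding in base R. The values in raw, the age cutoff of 100, and the factor labels are assumptions for illustration; the exercises may use different ones.

```r
# Made-up raw data; row 3 is an outlier in both age and annual_inc
raw <- data.frame(
  loan_status = c(0, 0, 1, 0, 1),
  age         = c(33, 31, 144, 39, 24),
  annual_inc  = c(24000, 12252, 6000000, 59000, 80000)
)

# Drop the single outlier observation (the cutoff of 100 years is an assumption)
loan_data <- raw[raw$age <= 100, ]

# Recode the response from numeric to categorical
loan_data$loan_status <- factor(loan_data$loan_status, levels = c(0, 1),
                                labels = c("non_default", "default"))
str(loan_data)
```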

PART 5:
MODELING DATA
Import Explore Transform Model Evaluate Communicate

SOME MACHINE LEARNING THEORY
This is what we want in the end:

input (independent variables, predictors) → ML model → output (dependent variable, response)

This is how we achieve it: supervised learning

Supervised learning is the machine learning task of inferring an ML model from labeled training data.

Training data (input: independent variables, predictors; output: dependent variable, response) → ML algorithm → ML model

SUPERVISED LEARNING
Training data (input: independent variables, predictors; output: dependent variable, response) → ML algorithm → ML model

Supervised learning is the machine learning task of inferring a model from labeled training data.
GETTING STARTED

[Figure: loan_data is used both to create the model and to evaluate the result]

TRAINING AND TEST SET

[Figure: loan_data is split into training_set and test_set. The training_set is used to create the model, the test_set to evaluate the result]
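
A minimal sketch of such a split in base R. The simulated loan_data and the two-thirds ratio are assumptions for illustration; only the names training_set and test_set come from the slide.

```r
set.seed(42)  # makes the random split reproducible

# Simulated stand-in for the prepared loan data
loan_data <- data.frame(
  age         = sample(20:70, 100, replace = TRUE),
  loan_status = factor(sample(c(0, 1), 100, replace = TRUE, prob = c(0.9, 0.1)))
)

# Two thirds of the rows for training, the rest for testing (assumed ratio)
n         <- nrow(loan_data)
train_idx <- sample(n, size = round(2 / 3 * n))

training_set <- loan_data[train_idx, ]
test_set     <- loan_data[-train_idx, ]
```

Evaluating on rows the model has never seen gives a more honest estimate of how it will behave on new clients than evaluating on the training rows.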

YOU WILL BUILD 2 ML MODELS IN THIS COURSE

What’s next?

• Complicated math
• Building can take A LOT of time
• In most cases, a simple one-liner
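WHAT IS A DECISION TREE?

[Figure: a decision tree with the split questions "Income > 64K€?", "Age < 27?", and "Age > 42?". Each question has a yes and a no branch, and every path ends in a leaf labeled Non-default or Default. A new client (age: 29 years, income: 32K euros) is passed through the ML model to obtain a prediction.]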

HOW TO BUILD A CLASSIFICATION TREE
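[Figure: left, a scatter plot of income (in 1000s of €, y-axis) against age (x-axis) with points marked Default or Non-default and numbered split regions 1, 2, 3; right, the resulting tree with the split questions "Income > 64K€?", "Age < 27?", and "Age > 42?" and leaves labeled Non-default or Default. The ML algorithm turns the data on the left into the ML model on the right.]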
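
Building such a tree can be sketched with the rpart package, which ships with standard R installations. The data below is simulated: the labelling rule is invented for this sketch and only borrows the thresholds shown on the slide, so the fitted tree is not the course's model.

```r
library(rpart)  # recommended package bundled with standard R installations

set.seed(7)
n <- 500
clients <- data.frame(
  age    = sample(20:70, n, replace = TRUE),
  income = round(runif(n, 10, 120))   # in 1000s of euros
)

# Synthetic labelling rule, invented for this sketch only
is_default <- clients$income <= 64 & clients$age >= 27 & clients$age <= 42
clients$status <- factor(ifelse(is_default, "default", "non_default"))

# The ML algorithm: learn yes/no split questions from labeled data
tree <- rpart(status ~ age + income, data = clients, method = "class")
print(tree)  # shows the learned split questions and leaves

# The ML model: classify a new client
new_client <- data.frame(age = 29, income = 32)
pred <- predict(tree, newdata = new_client, type = "class")
pred
```

The call to rpart() is the "simple one-liner"; everything before it only prepares example data.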
RANDOM FOREST THEORY: ALGORITHM TO BUILD RF
[Figure: the original data set (variables in columns, observations in rows) and how it is used by three methods: classification tree, bagging, and random forest]
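
The step from a single tree to bagging can be sketched with rpart as below. This is bagging only, on simulated data with an invented labelling rule; a random forest additionally considers only a random subset of the variables at each split, which this sketch does not do.

```r
library(rpart)

set.seed(11)
n <- 300
d <- data.frame(
  age    = sample(20:70, n, replace = TRUE),
  income = round(runif(n, 10, 120))   # in 1000s of euros
)
# Synthetic labelling rule, invented for this sketch only
d$status <- factor(ifelse(d$income <= 64 & d$age >= 27 & d$age <= 42,
                          "default", "non_default"))

# Bagging: fit each tree on a bootstrap sample (rows drawn with replacement)
n_trees <- 25
trees <- lapply(seq_len(n_trees), function(i) {
  boot <- d[sample(n, n, replace = TRUE), ]
  rpart(status ~ age + income, data = boot, method = "class")
})

# Each tree votes; the majority class is the ensemble prediction
votes <- sapply(trees, function(tr) {
  as.character(predict(tr, newdata = d, type = "class"))
})
bagged <- apply(votes, 1, function(v) names(which.max(table(v))))

mean(bagged == d$status)  # share of training rows the ensemble gets right
```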

Time to Practice
Modeling Data - Exercises 1 - 4

PART 6:
EVALUATING MODELS
Import Explore Transform Model Evaluate Communicate

HOW TO EVALUATE THE MODELS

Test data → (1) ML model → (2) Predictions

A SYSTEMATIC WAY TO COMPARE
THE CONFUSION MATRIX

                                Model prediction
                                Non-default (0)   Default (1)
Actual loan   Non-default (0)          9               4
status        Default (1)              1               2

MEASURES DERIVED FROM THE CONFUSION MATRIX

                                Model prediction
                                Non-default (0)   Default (1)
Actual loan   Non-default (0)       9 (TN)          4 (FP)
status        Default (1)           1 (FN)          2 (TP)

Accuracy = (TN + TP) / (table sum) = (9 + 2) / 16 = 68.75%
Specificity (TN rate) = TN / (TN + FP) = 9 / (9 + 4) = 69.23%
Sensitivity (TP rate) = TP / (FN + TP) = 2 / (1 + 2) = 66.67%
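
The three measures can be recomputed in base R from the four cells of the matrix. The object names are chosen for this sketch.

```r
# Rows = actual loan status, columns = model prediction (values from the slide)
cm <- matrix(c(9, 4,
               1, 2),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

TN <- cm["0", "0"]; FP <- cm["0", "1"]
FN <- cm["1", "0"]; TP <- cm["1", "1"]

accuracy    <- (TN + TP) / sum(cm)  # 11 / 16
specificity <- TN / (TN + FP)       # 9 / 13
sensitivity <- TP / (FN + TP)       # 2 / 3

round(100 * c(accuracy = accuracy, specificity = specificity,
              sensitivity = sensitivity), 2)
```

With real predictions, a matrix in the same orientation can be obtained with table(actual, predicted).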

HOW TO MEASURE OUR SUCCESS
BASELINE MODEL
The baseline model always predicts the majority group

Frequency table of the actual loan status:

0 (non-default)   1 (default)
13                3
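
A short base R sketch of the baseline model on these 16 observations. The variable names are chosen for this sketch.

```r
actual <- c(rep(0, 13), rep(1, 3))   # 13 non-defaults, 3 defaults

counts   <- table(actual)
majority <- names(which.max(counts))  # the majority group, here "0"

# The baseline predicts the majority group for every client
baseline_pred <- rep(as.numeric(majority), length(actual))

baseline_accuracy <- mean(baseline_pred == actual)
baseline_accuracy  # 13 / 16 = 0.8125
```

Any model worth deploying should beat this number, or at least trade some accuracy for the ability to find defaults at all, which the baseline never does.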

RESULTS

Confusion matrix cells: TN = actual non-default (0), predicted non-default (0); FP = actual non-default (0), predicted default (1); FN = actual default (1), predicted non-default (0); TP = actual default (1), predicted default (1).

Model            TN     FP     FN    TP    Accuracy   Specificity   Sensitivity
Baseline Model   7759   0      968   0     88.91%     100%          0%
Tree Model*      6707   1052   709   259   79.82%     86.44%        26.76%
Forest Model     5904   1855   609   359   71.77%     76.09%        37.09%

* Used threshold t > 0.4
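
The threshold in the footnote turns predicted default probabilities into class labels. A minimal sketch in base R, with hypothetical probabilities:

```r
# Hypothetical predicted default probabilities for five test clients
prob_default <- c(0.05, 0.38, 0.41, 0.72, 0.40)

threshold  <- 0.4
pred_class <- ifelse(prob_default > threshold, 1, 0)  # 1 = default
pred_class
```

Lowering the threshold labels more clients as default, which tends to raise sensitivity at the cost of specificity.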


Time to Practice
Evaluating Models – Exercises 1 - 2

PART 7:
EXPORTING THE MODEL
Export to
PMML

Import Explore Transform Model Evaluate Communicate

NOW WHAT? HOW TO USE YOUR MODEL IN PRODUCTION
POSSIBLE PRODUCTION SETUPS

[Figure: three possible production setups (1, 2, 3), one of them using Zementis]

OUR PRODUCTION SETUP
TRANSFORM / DEPLOY / PREDICT NEW DATA

1. Transform the final model to PMML: ML model → <ML Model> PMML file
2. Deploy the PMML file to Zementis
3. Predict new data: new data → Zementis (<ML Model>) → predictions
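
Step 1 can be sketched with the pmml package for R. This is only an illustration under assumptions: the tree is a stand-in fitted on R's built-in iris data rather than the loan model, the file name is made up, and the export is skipped if the optional pmml and XML packages are not installed. Deployment to Zementis (steps 2 and 3) is not shown.

```r
library(rpart)

# Stand-in model on built-in data; the course exports its loan model instead
tree <- rpart(Species ~ ., data = iris, method = "class")

# The pmml and XML packages are optional extras; skip the export if absent
if (requireNamespace("pmml", quietly = TRUE) &&
    requireNamespace("XML", quietly = TRUE)) {
  model_pmml <- pmml::pmml(tree)                        # convert the model to PMML
  out_file   <- file.path(tempdir(), "tree_model.pmml") # hypothetical file name
  XML::saveXML(model_pmml, file = out_file)             # write the PMML document
  print(out_file)
}
```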

Time to Practice
Exporting the Model - Exercises 1 - 3

CONGRATS! YOU MADE IT THROUGH THE ENTIRE COURSE

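[Figure: the complete data science process covered in this course. A question (?) is posed in the beginning or comes from the data. The analytics team works through Import → Explore → Transform → Model → Evaluate → Communicate, followed by an export to PMML.]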
