
openSAP

Getting Started with Data Science


Week 2 Unit 1

00:00:12 Hi, welcome to the second week of the openSAP course "Getting Started with Data Science".
This week we will cover the topic "data preparation" and we will start in this unit with an
overview of the data preparation phase.
00:00:32 Data preparation is also referred to as data wrangling, data munging, or data janitor work.
This includes everything from list verification to removing commas and debugging databases.
00:00:50 Messy data is by far the most time-consuming aspect of the typical data scientist's workflow.
This New York Times article reported that data scientists spend from 50% to 80% of their time

00:01:06 mired in the more mundane task of collecting and preparing unruly data before it can be
explored for useful nuggets. The chart shows that 3 out of every 5 data scientists spend the
most time during their working day cleaning and organizing data
00:01:27 while only 9% spend most of their time mining the data and building the models. Phase 3 of the
CRISP methodology is data preparation.
00:01:42 This phase covers all of the activities to construct the final analytical dataset from the initial raw
data. Data preparation tasks are likely to be performed multiple times and not in any
prescribed order.
00:01:59 Tasks include table, record, and attribute selection, as well as transformation and cleaning of
data for the modeling tools. The select data task decides on the data to be used for the
analysis:
00:02:15 The criteria to choose data include relevance to the data mining goals, data quality, and the
technical constraints. This covers the selection of attributes as well as the selection of records
in a table.
00:02:31 The clean data task raises the data quality to the level required by the selected analysis
techniques. The construct data task includes the data preparation operations, such as the
production of derived attributes,
00:02:49 entire new records, or the transformed values for existing attributes. The integrate data task
combines information from multiple tables or records to create new records or values.
00:03:05 Finally, the format data task produces the transformations that are primarily syntactic
modifications made to the data that do not change its meaning, but might be required by the
modeling tool.
00:03:24 There are two separate outputs in this phase, not related to a task: Firstly the dataset (or
datasets) produced by the data preparation phase,
00:03:37 which will be used for modeling or the major analysis work of the project. Secondly, a dataset
description that describes the dataset (or datasets) that will be used for the modeling
00:03:50 or the major analysis work of the project. This task decides on the data to be used for the
analysis.
00:03:59 The criteria will include the relevance of the data to the data mining goals, data quality, and
any technical constraints such as limits on data volume or data types.
00:04:11 Data selection covers the selection of attributes (the columns) in a table as well as selection of
records (the rows) in a table. The output is a list of the data to be included or excluded and the
reasons for these decisions.
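As an illustration of this selection task, here is a minimal, hedged pandas sketch; the table, column names, and filter criteria are hypothetical and not taken from the course.

```python
import pandas as pd

# Hypothetical customer table; all names and values are illustrative only.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "age": [34, 51, None, 29],
    "region": ["NORTH", "SOUTH", "NORTH", "NORTH"],
    "fax_number": ["n/a", "n/a", "n/a", "n/a"],   # judged irrelevant to the goal
    "churn_flag": [0, 1, 0, 1],
})

# Attribute (column) selection: keep only fields relevant to the data mining goal.
customers = customers[["customer_id", "age", "region", "churn_flag"]]

# Record (row) selection: keep only records meeting the analysis criteria,
# for example customers in the region in scope with a known age.
customers = customers[(customers["region"] == "NORTH") & (customers["age"].notna())]
print(customers)
```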
00:04:31 The next task raises the data quality to the level required by the selected analysis techniques.
This may involve selection of clean subsets of the data, the insertion of suitable defaults,

00:04:45 or more ambitious techniques such as the estimation of missing data by modeling. The output
is a data cleaning report that describes what decisions and actions were taken to address the
data quality problems.
00:05:01 It also describes any transformations of the data required for cleaning purposes and any
possible impact on the analysis results. This next task includes constructive data preparation
operations,
00:05:19 such as the production of derived attributes, entire new records, or transformed values of the
existing attributes. The outputs include a list of derived attributes that are constructed from one
or more existing attributes in the same record.
00:05:37 For example, calculating area by multiplying together length and width. Also a description of
the completely new variables that are created is required.
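To make the construct data task concrete, here is a small, hedged pandas sketch of such derived attributes: the area calculation above and a flag for customers with no purchases in the past year, which is mentioned next. The column names are invented for illustration.

```python
import pandas as pd

# Hypothetical input tables; column names are illustrative only.
rooms = pd.DataFrame({"length_m": [4.0, 5.5], "width_m": [3.0, 4.2]})
customers = pd.DataFrame({"customer_id": [1, 2, 3],
                          "purchases_last_year": [5, 0, 2]})

# Derived attribute: area calculated by multiplying length and width.
rooms["area_m2"] = rooms["length_m"] * rooms["width_m"]

# New variable: flag customers who made no purchases during the past year.
customers["no_purchase_flag"] = (customers["purchases_last_year"] == 0).astype(int)
print(rooms)
print(customers)
```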
00:05:51 For example, a new record flagging customers who made no purchases during the past year.
This next task combines information from multiple tables or records to create new records or
values.
00:06:09 The output is the merged data. Merging tables refers to joining together two or more tables
that have different information about the same objects.
00:06:20 For example, a retail chain has one table with information about each store's general
characteristics, for example its floor space or the type of mall.
00:06:31 It will have another table with summarized sales data, for example the profit, the percent
change in sales from previous years, and another table with information about the
demographics of the surrounding area.
00:06:46 Each of these tables contains one record for each store. These tables can be merged together
into a new table with one record for each store, combining fields from the source tables.
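A hedged sketch of that store-level merge in pandas; the tables and fields are invented for illustration, with one record per store in each source table.

```python
import pandas as pd

# Hypothetical store-level source tables, one record per store in each.
characteristics = pd.DataFrame({"store_id": [1, 2],
                                "floor_space_m2": [1200, 800],
                                "mall_type": ["regional", "strip"]})
sales = pd.DataFrame({"store_id": [1, 2],
                      "profit": [150000, 90000],
                      "sales_change_pct": [3.2, -1.1]})
demographics = pd.DataFrame({"store_id": [1, 2],
                             "median_income": [52000, 47000]})

# Merge into a new table with one record per store, combining fields from all sources.
store_view = (characteristics
              .merge(sales, on="store_id", how="left")
              .merge(demographics, on="store_id", how="left"))
print(store_view)
```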

00:07:00 Merged data also covers aggregations. Aggregation refers to operations where new values are
computed by summarizing together information from multiple records and tables.
00:07:16 For example, converting a table of customer purchases, where there is one record for each
purchase, into a new table where there is one record for each customer, with fields such as
number of purchases, average purchase amount,
00:07:34 percent of orders charged to credit card, percent of items under promotion, for example. This
next task produces the formatting transformations that refer to primarily syntactic modifications

00:07:53 made to the data that do not change its meaning, but might be required by the modeling tool.
The output is the reformatted data.
00:08:03 Some tools have requirements on the order of the attributes, such as the first field being a
unique identifier for each record or the last field being the outcome field the model is trying to
predict.
00:08:18 It might be important to change the order of the records in the dataset. Perhaps the modeling
tool requires that the records be sorted according to the value of the outcome attribute.
00:08:31 A common situation is that the records of the dataset are initially ordered in some way but the
modeling algorithm needs them to be in a fairly random order.
00:08:42 For example, when using neural networks, it is generally best for the records to be presented
in a random order although some tools handle this automatically without explicit user
intervention.
00:08:57 Additionally, there are purely syntactic changes made to satisfy the requirements of the
specific modeling tool. For example, removing commas from within text fields in
comma-delimited data files.
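A short, hedged pandas sketch of such purely syntactic formatting steps; the column names and ordering requirements are illustrative assumptions, not those of any particular modeling tool.

```python
import pandas as pd

df = pd.DataFrame({"comment": ["good, fast", "too slow"],
                   "outcome": [1, 0],
                   "customer_id": [101, 102]})

# Reorder attributes: unique identifier first, outcome field last.
df = df[["customer_id", "comment", "outcome"]]

# Put records into a fairly random order (helpful for some tools, e.g. neural networks).
df = df.sample(frac=1.0, random_state=42).reset_index(drop=True)

# Remove commas from within text fields before writing a comma-delimited file.
df["comment"] = df["comment"].str.replace(",", " ", regex=False)
df.to_csv("model_input.csv", index=False)
```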
00:09:13 So, that's it for Unit 1 of the second week. In the next unit we will look at the "predictive
modeling methodology".
00:09:21 See you.
Week 2 Unit 2

00:00:12 Welcome back to Week 2, Unit 2: An Overview of the Predictive Modeling Methodology.
Predictive modeling encompasses a variety of statistical techniques from modeling, machine
learning,
00:00:28 and data mining that analyze current and historical facts to make predictions about future or
otherwise unknown events. The output of a predictive model is a score or probability of the
targeted event occurring in the specified time frame in the future.
00:00:51 Although most often the unknown event of interest is in the future, predictive analytics can be
applied to any type of unknown, whether it is in the past, present, or future.
00:01:03 For example, identifying suspects after a crime has been committed, or credit card fraud as it
occurs. "Descriptive Analytics" uses data visualization to provide insight into the past and
answer the question: What has happened?
00:01:24 Descriptive statistics are useful for company reports, giving total stock inventory, average
dollars spent per customer, and year-over-year change in sales.
00:01:36 "Predictive Analytics" uses statistical models and forecasts to understand the future and
answer the question: What could happen? Predictive analytics analyzes patterns in historical
data and develops models that predict the probability of events happening in the future.

00:01:58 "Prescriptive Analytics", which uses optimization and simulation algorithms, enables us to
answer the question: What should we do?
00:02:11 Prescriptive analytics predicts not only what will happen, but also why it will happen, providing
recommendations regarding actions that will take advantage of the predictions.
00:02:24 Prescriptive analytics uses a combination of techniques and tools such as business rules,
algorithms, optimization, machine learning, and mathematical modeling processes.
00:02:40 Many lines of business are looking at technologies like predictive analytics to mine the masses
of data they are collecting and use the data to help them make decisions,
00:02:52 understand customers, recommend products, improve business performance, predict asset
maintenance, and, in a nutshell, use data to get a forward-looking perspective.
00:03:09 There are two phases to the predictive modeling process: Model Build (the learning phase)...
00:03:17 Predictive models are built or trained on historic data with a known outcome. The input
variables are called explanatory or independent variables.
00:03:31 For model building, the target or dependent variable is known. It can be coded, so if the
model is to predict the probability of response for a marketing campaign,
00:03:44 the responders could be coded as 1's and the non-responders as 0's, or as yes and no, for
example. The model is trained to differentiate between the characteristics of the customers
who are 1's and those who are 0's.
00:04:02 Model Apply (the applying phase)... Once the model has been built, it is applied onto new,
more recent data which has an unknown outcome because the outcome is in the future.
00:04:18 The model calculates the score or probability of the target category occurring. So in our
example, it's the probability of a customer responding to the marketing campaign.
00:04:34 This example represents the model training of a churn model. We are trying to predict if a
customer is going to switch to another supplier.
00:04:45 We train the model (in the learning phase) using historical data where we know if customers
churned or not, so we have a known target. The target variable flags churners as "yes" and
non-churners as "no".
00:05:02 This type of model, with a binary target, is called a classification model. This slide shows a
simple representation where we only have two explanatory characteristics: age and city.
00:05:16 In a real predictive model, we might have hundreds or even thousands of these characteristics.
During the learning phase, the data is split into two or three sub-samples.
00:05:29 This is often a random split. Models are trained on the "Estimation" sub-sample and tested on
the "Validation" sub-sample,
00:05:40 and this process is described in more detail in this course. The predictive model identifies the
difference in the characteristics of a churner and a non-churner.
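To make the learning phase concrete, here is a minimal, hedged scikit-learn sketch of training a binary churn classifier on an estimation sub-sample and checking it on a validation sub-sample; the data and column names are invented, and the course itself uses SAP Predictive Analytics rather than this library.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Hypothetical historical data with a known target (1 = churner, 0 = non-churner).
history = pd.DataFrame({"age": [25, 40, 31, 58, 45, 29, 62, 37],
                        "city_miami": [1, 0, 1, 0, 1, 0, 0, 1],   # city dummy-coded
                        "churn": [1, 0, 1, 0, 1, 0, 0, 1]})
X = history[["age", "city_miami"]]
y = history["churn"]

# Random split into estimation (training) and validation sub-samples.
X_est, X_val, y_est, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

model = LogisticRegression().fit(X_est, y_est)
print("validation accuracy:", model.score(X_val, y_val))

# Applying the trained model yields a churn score (probability) per customer.
print(model.predict_proba(X_val)[:, 1])
```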
00:05:53 This can be represented as a mathematical equation, or in a scorecard format. In this
example, the scorecard gives a partial score depending on the city the customer lives in;
for example, if the customer lives in Miami, they have a partial score of +0.7. Then another
partial score is given depending on their age.
00:06:20 In this example, depending on the customer's age, the partial score is calculated using a
simple linear model. The partial scores are added up to give an overall score for each
individual customer.
00:06:35 The higher the total score, the more likely the customer is to be a "target = yes", and the lower
the overall score the more likely the customer is to be "target = no".
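A tiny illustrative sketch of how such a scorecard adds up partial scores; only the +0.7 for Miami comes from the slide example, and the other city values and age coefficients are invented.

```python
# Partial score per city; +0.7 for Miami as in the slide, other values invented.
CITY_SCORE = {"Miami": 0.7, "Boston": -0.2, "Chicago": 0.1}

def score_customer(city: str, age: float) -> float:
    """Total score = city partial score + a simple linear partial score on age."""
    city_part = CITY_SCORE.get(city, 0.0)
    age_part = 0.01 * age - 0.3          # hypothetical linear term on age
    return city_part + age_part

# The higher the total score, the more likely "target = yes".
print(score_customer("Miami", 45))
print(score_customer("Boston", 30))
```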
00:06:52 When we apply the model onto new data, we do not know the target because of course it is
in the future. We are trying to calculate the probability of customers churning in the future;
this is the score or the probability output of the model.
00:07:09 The higher the score for each customer in the new data, the more likely they are to churn
(target = yes), and the lower the score, the more likely they are to remain (target = no).
00:07:27 Every time the model is used, the apply data has to be updated to the most recent time frame.
The same dataset is required at different points of time, for example the first of every month,

00:07:41 so that models can be applied to generate updated scores. Depending on the business
requirements, models may need to be applied each month, week, day, minute, or second.
00:07:55 Therefore, we need to automate the dataset preparation so that it can be quickly updated to
the required timeframe. When the dataset timeframe changes, the definition of each variable
does not change, only the reference date changes.
00:08:14 Any derived variables, such as a customer's age, will be updated relative to the new reference date:
we could calculate the difference in days between the reference date and the customer's birth
date,
00:08:28 as would each customer's tenure as a customer, for example the difference in days between the
reference date and the date the customer made their first purchase or joined a loyalty scheme.

00:08:42 Also, any event data, or transactional data in the last month, two months, three months, for
example, will also need to be updated relative to the moving reference date.
00:08:54 For example, we might want to calculate the number of transactions in the month prior to our
reference date, and if the reference date moves forward, then the number of transactions will
need to be recalculated for the new month
00:09:09 that will become the prior month to the reference date. So, the solution to automate this
process is to define the different data operations in a relative way: relative to a reference date.
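A hedged pandas sketch of defining derived variables relative to a moving reference date, in the spirit just described; the reference date, tables, and column names are assumptions made for illustration.

```python
import pandas as pd

reference_date = pd.Timestamp("2016-06-01")   # moves forward at each scoring run

customers = pd.DataFrame({
    "customer_id": [1, 2],
    "birth_date": pd.to_datetime(["1980-03-15", "1975-11-02"]),
    "first_purchase": pd.to_datetime(["2014-01-10", "2015-07-20"]),
})
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "txn_date": pd.to_datetime(["2016-05-03", "2016-05-20", "2016-04-12"]),
})

# Age and tenure in days, always computed against the reference date.
customers["age_days"] = (reference_date - customers["birth_date"]).dt.days
customers["tenure_days"] = (reference_date - customers["first_purchase"]).dt.days

# Number of transactions in the month prior to the reference date.
prior_month = transactions[
    (transactions["txn_date"] >= reference_date - pd.DateOffset(months=1))
    & (transactions["txn_date"] < reference_date)]
txn_counts = (prior_month.groupby("customer_id").size()
              .rename("txn_last_month").reset_index())
customers = (customers.merge(txn_counts, on="customer_id", how="left")
             .fillna({"txn_last_month": 0}))
print(customers)
```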

00:09:30 For predictive modeling, most datasets have the following structure: There is "historical" data,
which is data in the past, compared to the reference date,
with dynamic data computed in relation to the reference date. These can be short-term,
mid-term, and long-term indicators.
00:09:50 There is a "latency" period that starts after the reference date and is a period where no data is
collected. This period is used to represent the time required by the business to collect new
data, apply the model, produce the scores,
00:10:08 define the campaign, and deploy the call center or mailing house. In some businesses, this
may be a few days and in others a few months.
00:10:20 Then there is a "target" period, starting after the reference date and latency period where the
targeted behavior we are predicting is observed.
00:10:31 For example, did the customer churn or not in this period? Durations for each of these periods
depend on the customer business, but it could be days, weeks, or months.
00:10:46 The model can be defined multiple times on data in different time frames, simply by moving the
reference date. Is a good model the one that reproduces exactly the data distribution on the
training data
00:11:04 or the one providing the same level of error both on the training dataset and any new data?
This question helps us define what we mean by model overfitting, underfitting, and the concept
of model robustness.
00:11:22 Look at the three models on the slide, all trained on the same data, but with very different
accuracy when compared to the known training data and the new unseen data.
00:11:35 The training data is shown in red. And you can see that the model at the top left is very
accurate, the model at the top right is very inaccurate,
00:11:45 and the model at the bottom is fairly accurate. However, when we apply these models onto
new data, shown in green,
00:11:54 you can see that the model at the top left is very inaccurate. The model at the top right is as
accurate as it was previously when we built the model.
00:12:04 However, the model at the bottom is still very accurate. The model at the top left is overfitted, it
has low robustness.
00:12:15 So you can see that there is no training error, but very high test error. The model at top right
has very high training error and also very high test error.
00:12:28 This model has been underfitted. It has high robustness, however, because the magnitude of
the training error is the same as the magnitude of the test error.
00:12:40 Of course what we are trying to achieve is a robust model where we have low training error
and low test error, which is shown at the bottom there.
00:12:52 When building a predictive model, care must be taken not to overfit the model. An overfitted
model is very accurate when tested on the model training data, but has high inaccuracy when
applied onto new data.
00:13:09 The model does not generalize. Our goal is to build robust models that generalize.
00:13:16 In overfitting, the model describes random error or noise instead of the underlying relationship.
Overfitting occurs when a model is excessively complex.
00:13:30 A model that has been overfitted has poor predictive performance as it overreacts to minor
fluctuations in the training data. The possibility of overfitting exists because the criterion used
for training the model
00:13:48 is not the same as the criterion used to judge the efficacy of a model. In particular, a model is
typically trained by maximizing its performance on some set of training data.
00:14:03 However, its efficacy is determined not by its performance on the training data but by its ability
to perform well on unseen data, on new data.
00:14:16 Overfitting occurs when a model begins to "memorize" training data rather than "learning" to
generalize from a trend. In order to avoid overfitting, it is necessary to use additional techniques,

00:14:32 for example, cross-validation, regularization, early stopping, or pruning of decision trees.
These can indicate when further training does not result in better generalization.
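One of these techniques, cross-validation, can be sketched briefly; a hedged scikit-learn example on synthetic data (not the SAP tooling), comparing an overly deep decision tree with a shallower, pruned one.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# An unrestricted tree can memorize the training data; cross-validated scores on
# held-out folds reveal how well each model actually generalizes.
deep_tree = DecisionTreeClassifier(max_depth=None, random_state=0)
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=0)

print("deep tree CV accuracy:   ", cross_val_score(deep_tree, X, y, cv=5).mean())
print("shallow tree CV accuracy:", cross_val_score(shallow_tree, X, y, cv=5).mean())
```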
00:14:47 These techniques can either penalize overly complex models or test the model's ability to
generalize by evaluating its performance on a set of data not used for training,
00:14:59 which is assumed to approximate the typical unseen data that a model will encounter when it
is used. Okay, that's it for this unit.
00:15:11 In the next unit, we will cover the topic "Data Manipulation". See you.
Week 2 Unit 3

00:00:12 Hi, now in this unit, let's manipulate the data. Most data mining activities will require the data to
be prepared before the analysis is undertaken.
00:00:26 Data manipulation is often driven by domain knowledge. This is a process where database
tables are merged and aggregated,
00:00:35 new variables and transformations created in order to try and improve model quality, IF/THEN
conditions are created, and filters applied. Data manipulation is part of the CRISP data
preparation phase 3.
00:00:51 Phase 3.3: Construct Data. This task includes constructive data preparation operations such
as the production of derived attributes,
entire new records, or transformed values for existing attributes. Phase 3.4: Integrate Data.
00:01:13 These are methods whereby information is combined from multiple tables or records to create
new records or values. Phase 3.5: Format Data.
00:01:26 Formatting transformations refer to primarily syntactic modifications made to the data that do
not change its meaning, but might be required by the modeling tool.
00:01:41 The first step is to identify the entity for the analysis. An entity is the object targeted by the
planned analytical task.
00:01:53 It may be a customer, a product, or a store, for example, and is usually identified by a unique
identifier. The entity defines the granularity of the analysis.
00:02:08 To define an entity, you must take into account the fact that the entity makes business sense,
that it can be characterized in terms of attributes,
00:02:18 and that it can be associated with predictive metrics in relation to the tasks you want to
perform. Defining an entity is not a minor challenge.
00:02:30 Entities may be used in many projects and cannot be changed without an impact analysis on
all the deployed processes using this entity. For example, you have to determine, together with
everyone involved in this project,
00:02:47 if the entity for a project is the account or the customer, or the association between the
account and the customer, describing the role of the person with the account.
00:02:59 This can sometimes be very difficult to agree on. To give you some examples...
00:03:06 A predictive model that is designed to predict if a customer of a utility company is going to
respond to an up-sell offer, will have the CustomerID as the entity.
00:03:20 A churn model, designed to predict if a postpaid telco customer is not going to extend their
subscription when their 12-month contract expires, but switch to a competitor,
00:03:34 could have the CustomerID or AccountID as entities, depending on the appropriate level of
analysis. The next step is to create the analytical record.
00:03:50 This is a 360-degree view of each entity, collecting all of the static and dynamic data together
that can be used to define the entity. This data will be the explanatory variables in the analysis.

00:04:09 This step requires data manipulation: merging tables, aggregating data, creating new data
transformations and derived variables. The analytical record is an overall view of the entities.
00:04:25 When this entity is a customer, it is sometimes called the "customer analytic record", or the
"360-degree view of the customer", or sometimes it is even referred to as "customer DNA".
00:04:40 This view characterizes entities by a large number of attributes (the more, the better), which
can be extracted from the database or even computed from events that occurred for each of
them.
00:04:54 The list of all attributes corresponds to what is called an "analytical record" of the entity's
"disposition". This analytical record can be decomposed into domains such as:
00:05:09 demographic, geo-demographic, complaints history, contacts history, products history, loan
history, purchase history (coming from all the different transactional events), segments, and
model scores.
00:05:30 SAP have made it very easy to create data manipulations and write the correct SQL
statements. Here you see the Data Manager expression editor that contains a wide range of
functions
00:05:45 that can be applied on any variable to create transformations. Data Manager can also be used
to create aggregated variables.
00:05:59 Aggregation refers to operations where new values are computed by summarizing together
information from multiple records or tables. For example, converting a table of customer
purchases, where there is one record for each purchase,
00:06:17 into a new table where there is one record for each customer, with fields such as number of
purchases, average purchase amount, percent of orders charged to credit card, and percent of
items under promotion.
00:06:34 Aggregates are required when you need to join an events table where you have a
one-to-many join which cannot be undertaken by a simple merge.
00:06:45 Based on these aggregated variables, you can then build new variables, for example the
difference or ratios between time periods, using the expression editor.
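As a hedged illustration of this kind of aggregation and ratio building (Data Manager generates the equivalent SQL for you), here is a small pandas sketch; the purchases table and its columns are hypothetical.

```python
import pandas as pd

# Hypothetical purchases table: one record per purchase.
purchases = pd.DataFrame({"customer_id": [1, 1, 2, 2, 2],
                          "amount": [20.0, 35.0, 10.0, 15.0, 40.0],
                          "paid_by_card": [1, 0, 1, 1, 0],
                          "on_promotion": [0, 1, 0, 0, 1]})

# Aggregate the one-to-many events to one record per customer.
per_customer = purchases.groupby("customer_id").agg(
    n_purchases=("amount", "size"),
    avg_amount=("amount", "mean"),
    pct_card=("paid_by_card", "mean"),
    pct_promo=("on_promotion", "mean"),
).reset_index()

# A derived ratio variable built on top of the aggregates.
per_customer["amount_vs_overall_avg"] = (
    per_customer["avg_amount"] / per_customer["avg_amount"].mean())
print(per_customer)
```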
00:07:01 There are a range of data processing functions available in the SAP expert tool as well.
Sometimes you want to convert data types, for example numeric to string.
00:07:13 Certain decision tree algorithms only accept string data types for the target variable. You may
want to rename a variable for better understanding in the analysis.
00:07:29 You may need to create new variables. The ratio of two variables may be more valuable in an
analysis than the numerator and denominator separately.
00:07:40 For example, weekday phone calls divided by total number of phone calls; similar ratios
and percentages are often very predictive when developing models for telco organizations.
00:07:58 Sometimes you want to scale all numeric variables, for example to the range 0 to 1 for equal
input to a model, by calculating a Z score, or by using decimal scaling.
00:08:11 For example, this is often used when training cluster models so that all of the explanatory
characteristics are rescaled so they have equal importance.
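A hedged sketch of these rescaling options with scikit-learn; the input values are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

df = pd.DataFrame({"income": [20000, 45000, 120000],
                   "calls_per_month": [5, 40, 12]})

# Scale every variable to the range 0 to 1 so each has equal weight (e.g. for clustering).
df_minmax = pd.DataFrame(MinMaxScaler().fit_transform(df), columns=df.columns)

# Alternative: Z scores (mean 0, standard deviation 1).
df_zscore = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns)

print(df_minmax)
print(df_zscore)
```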
00:08:23 You might want to convert categorical data to numeric. In this example we are creating two
new variables, which are binary flags, meaning 1's and 0's,
00:08:34 indicating "Apples" and "Bananas" from the single variable called "Fruit". This is sometimes
referred to as disjunctive or dummy coding.
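A short pandas sketch of this disjunctive (dummy) coding of the hypothetical "Fruit" variable:

```python
import pandas as pd

df = pd.DataFrame({"Fruit": ["Apples", "Bananas", "Apples"]})

# Create binary 1/0 flag columns, one per category of "Fruit".
flags = pd.get_dummies(df["Fruit"], prefix="Fruit", dtype=int)
print(pd.concat([df, flags], axis=1))
```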
00:08:48 We may take samples of the data to test our models prior to running them on all the data, or
we may take samples when there is too much data to analyze.
00:09:00 There are a wide variety of sampling approaches available in Data Manager and Expert. For
example, "First N" and "Last N" that selects a number or percentage of records,
00:09:13 "Simple Random" sampling, "Systematic Random" that takes "buckets" and randomly samples
from each bucket. This is an example of a simple condition created in Data Manager.

00:09:30 If <Condition>, Then <Value>, Else <Value> rules can be created very easily. All of the
manipulations must be fully documented.
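As a hedged, purely illustrative sketch (the Data Manager expresses these graphically and as SQL), sampling and an If/Then/Else derivation could look like this in pandas and NumPy:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"customer_id": range(1, 11),
                   "spend": [10, 250, 40, 0, 75, 300, 5, 60, 120, 90]})

first_n = df.head(3)                                 # "First N" style sample
random_sample = df.sample(frac=0.5, random_state=1)  # simple random sample

# If spend > 100 Then "high" Else "low"  (threshold is invented for illustration)
df["spend_band"] = np.where(df["spend"] > 100, "high", "low")
print(df)
```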
00:09:46 Data Manager documents this automatically for you, creating a graphical summary, a list of visible
and invisible fields, any user prompts,
00:09:58 details of all of the expressions created, and details of the filters. That's the end of this unit.
00:10:09 In the next unit, we will handle the topic "Selecting Data: Variable and Feature Selection".
Week 2 Unit 4

00:00:13 Welcome to the unit "Selecting Data: Variable and Feature Selection". Feature or variable
selection is the process of selecting a subset of relevant explanatory variables or predictors
00:00:30 for use in data science model construction. It is also known as variable selection, attribute
selection, or variable subset selection.
00:00:42 Often, data contains many features that are either redundant or irrelevant, and can be
removed without incurring too much loss of information.
00:00:53 A feature selection algorithm can be seen as the combination of a search technique for
proposing new feature subsets, along with an evaluation measure which scores the different
feature subsets.
00:01:09 Feature selection techniques are used for three reasons: Simplification of models to make
them easier to interpret by users:
00:01:19 we want to explain the data in the simplest way, so redundant predictors should be removed.
Shorter model training times and reduced cost:
00:01:31 if the model is to be used for prediction, we can save time and money by not measuring
redundant predictors. Enhanced generalization by reducing overfitting: in some cases,
unnecessary predictors
00:01:46 will add noise to the estimation of other quantities that we are interested in. Remember that
domain knowledge can be the best selection criterion of all.
00:02:02 Traditional approaches to selecting the variables to go into a model can be very
time-consuming, especially when there are thousands of variables to analyze.
00:02:13 Typical approaches are: for continuous variables, performing a univariate logistic regression
on each variable; and for categorical variables, performing a significance test using
Chi-Square.
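A hedged sketch of this kind of univariate screening with scikit-learn: a per-variable logistic regression for the continuous features and a chi-square test for a categorical flag. The data is synthetic, and this is an illustration rather than the course's tooling.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import chi2

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
X = pd.DataFrame({"income": rng.normal(50, 10, 200) + 5 * y,   # related to the target
                  "age": rng.normal(40, 12, 200),              # unrelated noise
                  "owns_car": rng.integers(0, 2, size=200)})   # categorical 0/1 flag

# Univariate logistic regression per continuous variable (accuracy as a rough score).
for col in ["income", "age"]:
    acc = LogisticRegression().fit(X[[col]], y).score(X[[col]], y)
    print(col, "univariate accuracy:", round(acc, 3))

# Chi-square significance test for the categorical (non-negative) variable.
chi2_stat, p_value = chi2(X[["owns_car"]], y)
print("owns_car chi-square p-value:", p_value[0])
```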
00:02:29 One of the most popular forms of automated feature selection is called stepwise regression.
This is an algorithm that adds the best feature or deletes the worst feature in a series of
iterative steps.
00:02:45 The main control issue is deciding when to stop the algorithm. Other automated selection
processes are called backward elimination and forward selection.
00:02:59 For automated selection, the exclusion or inclusion of features at each step depends on a
comparison criterion, such as an F-test or t-test, although other metrics can be used.
00:03:16 Backward elimination starts with all candidate features. Step 1 is a test on the deletion of each
feature using the chosen model comparison criterion,
deleting the feature whose removal improves the model the most. Step 2 repeats this
process until no further improvement is possible.
00:03:42 This is the simplest of all variable selection procedures. One benefit of backward elimination
over forward selection is that it allows features of lower significance to be considered
00:03:55 in combinations that might never enter the model in a forward selection. Therefore, the
resulting model may depend on more equal contributions of many features
00:04:07 instead of the dominance of one or two powerful features. Forward selection just reverses the
backward method.
00:04:18 It starts with no features in the model. Step 1 is a test on the addition of each feature using the
chosen model comparison criterion.
00:04:30 Step 2 is to add the feature (if any) that improves the model the most. Step 3 repeats this
process until no other feature additions improve the model.
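Both directions are available, for example, in scikit-learn's SequentialFeatureSelector; a hedged sketch on synthetic data (the SAP tools implement their own selection, so this is illustrative only):

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 6))
y = 2 * X[:, 0] - 3 * X[:, 2] + rng.normal(scale=0.3, size=150)  # only features 0 and 2 matter

estimator = LinearRegression()
forward = SequentialFeatureSelector(estimator, n_features_to_select=2,
                                    direction="forward", cv=5).fit(X, y)
backward = SequentialFeatureSelector(estimator, n_features_to_select=2,
                                     direction="backward", cv=5).fit(X, y)

print("forward selection kept features:   ", forward.get_support(indices=True))
print("backward elimination kept features:", backward.get_support(indices=True))
```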
00:04:46 Stepwise regression is a combination of the backward elimination and forward selection
processes. At each stage in the process, after a new variable is added, a test is made to check
if some variables can be deleted
00:05:02 without appreciably increasing the error. The procedure terminates when the measure is
maximized, or when the available improvement falls below some critical value.
00:05:15 One of the main issues with stepwise regression is that it is prone to overfitting the data.
However, this problem can be mitigated if the criterion for adding or deleting a variable is stiff
enough.
00:05:31 Widespread incorrect usage of this process, and the availability of more modern approaches,
which we will discuss next, or using expert judgement to identify relevant variables, have led
to calls to totally avoid stepwise model selection.
00:05:51 Modern approaches use subset selection that evaluates a subset of features as a group for
suitability. Subset selection algorithms can be broken up into "Filters", "Wrappers", and
"Embedded" methods.
00:06:11 For filters: filter selection methods apply a statistical measure to assign a score to each
explanatory variable, which we refer to as a feature.
00:06:23 The features are ranked by the score and either selected to be kept or removed from the
dataset. The methods are often univariate and consider the feature independently, or with
regard to the dependent variable.
00:06:40 These methods are particularly effective in computation time and robust to overfitting. Filters
are usually less computationally intensive than wrappers, but they produce a feature set
00:06:55 which is not tuned to a specific type of predictive model. Wrappers use a search algorithm to
search through possible features and evaluate each subset
00:07:10 by running a predictive model to score them. Each new subset is used to train a model, which
is tested on a hold-out sample.
00:07:19 Counting the number of mistakes made on that hold-out set, for example the error rate of the
model, gives the score for that subset. Because wrapper methods evaluate subsets of
variables, they allow the detection of possible interactions between the variables.
00:07:41 The two main disadvantages of this method are firstly, the increasing overfitting risk when the
number of observations is insufficient,
00:07:53 and secondly, the significant computation time when the number of variables is large.
Embedded techniques are embedded in, and specific to, a model.
00:08:07 They try to combine the advantages of both filter and wrapper methods and perform feature
selection as part of the model construction process.
00:08:18 The most common type of embedded feature selection methods are called regularization
methods. These are penalization methods that introduce additional constraints into the
optimization of a predictive algorithm,
00:08:35 such as a regression algorithm, that bias the model toward lower complexity, with fewer
coefficients. An example of a regularization algorithm is ridge regression, which we deploy in
the SAP automated modeling toolset.
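A hedged sketch of ridge regression as a regularization method, using scikit-learn rather than the SAP implementation; the synthetic data is for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
y = 3 * X[:, 0] + rng.normal(scale=0.5, size=100)   # only the first feature matters

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)   # the penalty biases the model toward lower complexity

# The ridge coefficients for the irrelevant features are shrunk toward zero.
print("OLS coefficients:  ", np.round(ols.coef_, 3))
print("Ridge coefficients:", np.round(ridge.coef_, 3))
```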
00:08:58 So, with the next unit, which is called "Data Encoding", we will finalize Week 2. See you there.
Week 2 Unit 5

00:00:12 Hi and welcome back. In the last unit of the second week we will now take a look at data
encoding.
00:00:22 Data encoding is an essential part of the data preparation process. It handles missing values
in the data, deals with outliers, and creates data bins or bands
00:00:35 to transform raw data into a mineable source of information. Different encoding strategies are
deployed depending on the variable value type: nominal, ordinal, or continuous,
00:00:50 and we will look at this in more detail in this presentation. Encoding each explanatory
variable can take a large amount of time, but must not be ignored.
00:01:03 SAP Predictive Analytics Automated deploys some very powerful functionality that encodes
data automatically. There are three different types of variable: nominal, ordinal, and
continuous.
00:01:22 A nominal variable is a discrete (categorical), qualitative variable that characterizes, describes,
or names an element of a population. For example, hair color: you can be brown, blond, or
ginger.
00:01:42 Make of car: Mercedes or Ford. Your gender: male or female.
00:01:49 Post or ZIP code. Residence city: London, New York, or Paris.
00:01:55 These are all categories, but for nominal variables the order of the categories does not matter.
It is important to remember these definitions because they affect the way data is interpreted.

00:02:13 An ordinal variable is a discrete (categorical) qualitative variable. However, importantly, the
categories have an order.
00:02:24 I'll give you some examples. Gold, silver, and bronze.
00:02:28 Satisfaction level: very dissatisfied, dissatisfied, neutral, satisfied, and very satisfied. Pain
level can be mild, moderate, or severe.
00:02:41 Again these are all categories, but for ordinal variables the order of the categories does matter.
Let's look at an ordinal variable example.
00:02:53 A bank assigns the values 1 through 10 to denote financial risk. The value 1 characterizes no
late payments, with low risk;
00:03:05 the value 10 denotes a bankruptcy, high risk; values from 2 through 9 denote previous
delinquencies.
00:03:16 A customer with a 10 rating is definitely riskier than one with a 1 rating. However, the customer
is not 10 times riskier, and the difference in the ranks (10-1=9) has absolutely no meaning.

00:03:35 With ordinal variables, the order matters, but not the difference between the values. For
example, patients are asked to express the amount of pain they are feeling on a scale of 1 to
10.
00:03:48 A score of 7 means more pain than a score of 5, and that is more than a score of 3. But the
difference between the 7 and 5 might not be the same as the difference between the 5 and 3.
00:04:03 The values simply express an order. A continuous variable is a quantitative variable, not a
qualitative variable.
00:04:15 It is a real number that can take any value with fractions or decimal places between two
specific numbers. It accommodates all basic arithmetic operations, meaning addition,
subtraction, multiplication, and division.
00:04:33 Examples are: income, age (in years), running time (in minutes), bank account balance (in
dollars), distance (in miles), any ratio or calculated value; this includes most business data.
00:04:52 A missing value is an empty cell in your dataset. Missing values can occur due to an error, for
example a data input error, or because they are simply not available.
00:05:04 Whatever the reason, it is worth investigating why there are missing values. They can be
removed from the dataset, estimated, or kept as they are.
00:05:16 The analysis could also be postponed so that further investigation of the reason for missing
values can be undertaken. There are many approaches to estimating missing values.
00:05:29 For example, for numeric data, the mean value may be used or an estimate made through
interpolation between the values. Like outliers, they should not be ignored.
00:05:42 Some algorithms handle missing values as part of the analysis, for example Apriori association
analysis. Others cannot, for example linear regression.
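A hedged pandas sketch of two of the simple estimation strategies mentioned here, mean substitution and interpolation; the column names and values are invented.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [42000, np.nan, 51000, 48000],
                   "daily_sales": [100, 110, np.nan, 130]})

# Mean substitution for a numeric variable.
df["income"] = df["income"].fillna(df["income"].mean())

# Linear interpolation between neighbouring values.
df["daily_sales"] = df["daily_sales"].interpolate()
print(df)
```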
00:05:56 SAP Predictive Analytics Automated handles missing values automatically for you. For a
continuous variable, an outlier is a single or low-frequency occurrence of the value of a
variable
00:06:15 that is far from the mean as well as the majority of other values for that variable. For a
categorical variable (nominal or ordinal), an outlier is a single or very low-frequency
occurrence of a category of a variable.
00:06:33 Determining whether a value is an outlier or a data error can be difficult. If outliers exist in a
dataset, they can significantly affect the analysis.
00:06:45 Therefore, the initial data analysis (IDA) should include a search for outliers. Often, outliers are
removed from the analysis using trimming or Winsorizing techniques.
00:07:00 A typical strategy is to set all outliers to a specified percentile of the data. For example, a 90%
Winsorization would see all data below the 5th percentile set to the 5th percentile,
00:07:19 and data above the 95th percentile set to the 95th percentile value. However, outliers should
not necessarily be omitted from the analysis as they may be genuine observations in the data,

00:07:36 and they may indicate that a robust model solution is not possible. They can be searched for
visually and by using various algorithms which we will be looking at in more detail in this
course.
00:07:51 Popular plots for outlier detection are scatter plots and box plots. As data volumes increase,
then data visualization to identify outliers does get harder.
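A hedged pandas sketch of a 90% Winsorization as described above, clipping values below the 5th and above the 95th percentile; the data is synthetic.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
values = pd.Series(rng.normal(100, 15, 500))
values.iloc[0] = 900                                  # inject an obvious outlier

lower, upper = values.quantile(0.05), values.quantile(0.95)
winsorized = values.clip(lower=lower, upper=upper)    # 90% Winsorization
print(values.max(), "->", winsorized.max())
```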
00:08:05 Again, the SAP Predictive Analytics Automated system handles outlier values automatically for you. A
numeric variable may have many different values, and for some algorithms this may lead to
very complex models
00:08:25 which are very hard to interpret. Data scientists like simplicity in models; they call it
parsimony.
00:08:35 Simplicity also helps explainability, which is important when seeking decision-maker support. A
continuous numeric variable can be "binned" or "grouped" to lessen the possible outcomes.

00:08:50 Binning is subjective, so various binning strategies should be considered. On the slide you will
see different ways to create bins of an age variable from the raw data.
00:09:02 Binning helps to improve model performance. It captures non-linear behavior of continuous
variables.
00:09:11 It minimizes the impact of outliers. It removes noise from large numbers of distinct values.
00:09:20 It makes the models more explainable: grouped values are easier to display and understand.
It improves model build speed: predictive algorithms build much faster as the number of
distinct values decreases.
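A hedged pandas sketch of binning an age variable with pd.cut; the bin edges and labels are one subjective choice among many, as noted above.

```python
import pandas as pd

ages = pd.Series([18, 23, 35, 41, 52, 67, 74])

# One possible binning strategy; edges and labels are illustrative.
age_bands = pd.cut(ages,
                   bins=[0, 25, 40, 55, 70, 120],
                   labels=["<=25", "26-40", "41-55", "56-70", "70+"])
print(pd.concat([ages.rename("age"), age_bands.rename("age_band")], axis=1))
```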
00:09:39 With this, I'd like to close the second week. I hope you enjoyed these units, and I am happy to
get in touch with you in our Discussion Forum if you have any content-related questions.
00:09:52 Now, I wish you all the best for the weekly assignment, and see you next week when we will
start the topic of modeling.