Data Mining with SPSS Modeler
Theory, Exercises and Solutions
Tilo Wendler • Sören Gröttrup
Data Analytics, Data Mining and Big Data are terms often used in everyday
business. Companies collect more and more data and store it in databases, with
the hope of finding helpful patterns that can improve business. Shortly after
deciding to make more use of such data, managers often confess that analysing these
datasets is resource-consuming and anything but easy. Involving the firm’s
IT-experts leads to a discussion regarding which tools to use. Very few applications
are available in the marketplace that are appropriate for handling large datasets in a
professional way. Two commercial products worth mentioning are ‘Enterprise
Miner’ by SAS and ‘SPSS Modeler’ by IBM.
At first glance, these applications are easy to use. After a while, however, many
users realize that more difficult questions require deeper statistical knowledge.
Many people are interested in gaining such statistical skills and applying them,
using one of the data mining tools offered by the industry.
This book will help users to become familiar with a wide range of statistical
concepts or algorithms and apply them to concrete datasets. After a short statistical
overview of how the procedures work and what assumptions to keep in mind, step-
by-step procedures show how to find the solutions with the SPSS Modeler.
The authors of the book, Tilo Wendler and Sören Gröttrup, want to thank all the
people who supported the writing process. These include IBM support experts who
dealt with some of the more difficult tasks, discovering more efficient ways to
handle the data. Furthermore, the authors want to express gratitude to Jeni
Ringland, Katrin Minor and Maria Sabottke for their outstanding efforts and their
help in professionalising the text, layout, figures and tables.
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 The Concept of the SPSS Modeler . . . . . . . . . . . . . . . . . . . . . 2
1.2 Structure and Features of This Book . . . . . . . . . . . . . . . . . . . . 5
1.2.1 Prerequisites for Using This Book . . . . . . . . . . . . . . 5
1.2.2 Structure of the Book and the Exercise/Solution Concept . . . . . 6
1.2.3 Using the Data and Streams Provided with the Book . . . . . . . 8
1.2.4 Datasets Provided with This Book . . . . . . . . . . . . . . 9
1.2.5 Template Concept of This Book . . . . . . . . . . . . . . . 10
1.3 Introducing the Modeling Process . . . . . . . . . . . . . . . . . . . . . 13
1.3.1 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.3.2 Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
Literature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2 Basic Functions of the SPSS Modeler . . . . . . . . . . . . . . . . . . . . . . 25
2.1 Defining Streams and Scrolling Through a Dataset . . . . . . . . . 25
2.2 Switching Between Different Streams . . . . . . . . . . . . . . . . . . 32
2.3 Defining or Modifying Value Labels . . . . . . . . . . . . . . . . . . . 34
2.4 Adding Comments to a Stream . . . . . . . . . . . . . . . . . . . . . . . . 40
2.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.6 Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.7 Data Handling and Sampling Methods . . . . . . . . . . . . . . . . . . 49
2.7.1 Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
2.7.2 Calculations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
2.7.3 String Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
2.7.4 Extracting/Selecting Records . . . . . . . . . . . . . . . . . . 61
2.7.5 Filtering Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
2.7.6 Data Standardization: Z-Transformation . . . . . . . . . . 73
2.7.7 Partitioning Datasets . . . . . . . . . . . . . . . . . . . . . . . . 82
2.7.8 Sampling Methods . . . . . . . . . . . . . . . . . . . . . . . . . 88
2.7.9 Merge Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
The amount of collected data has risen exponentially over the last decade, as
companies worldwide store more and more data on customer interactions, sales,
logistics, and production processes. For example, Walmart handles up to ten million
transactions and 5000 items per second (see Walmart 2012). According to The
Economist (2010a), the company feeds databases of around 2.5 petabytes. To put that in
context, that is the same as the total volume of letters sent within the US over
6 months.
In recent years, companies have discovered that understanding their collected
data is a very powerful tool and can reduce overheads and optimize work-flows,
giving their firms a big advantage in the market.
The challenge in this area of data analytics is to consolidate data from different
sources and analyze the data, to find new structures or patterns and rules that can
help predict the future of the business more precisely, or create new fields of
business activity. The job of a data scientist is predicted to become one of the
most interesting jobs in the twenty-first century (see Davenport and Patil 2012), and
there is currently strong competition for the brightest and most talented analysts in
this field.
A wide range of applications, such as R, SAS, MATLAB, and SPSS Statistics,
provide a huge toolbox of methods to analyze large data and can be used by experts
to find patterns and interesting structures in the data. Many of these tools are mainly
programming languages, which assume that the analyst has solid programming skills
and an advanced background in IT and mathematics. Since this field is becoming
more important, graphic user-interfaced data analysis software is starting to enter
the market, providing “drag and drop” mechanisms for career changers and people
who are not experts in programming or statistics.
One of these easy to handle, data analytics applications is the IBM SPSS
Modeler. This book is dedicated to the introduction and explanation of its data
analysis power, delivered in the form of a teaching course.
IBM’s SPSS Modeler offers more than a single application. In fact it is a family of
software tools based on the same concept. These are:
" SPSS Modeler is a family of software tools provided for use with a wide
range of different operating systems/platforms. Besides the SPSS
Modeler Premium used here in this book, other versions of the
Modeler exist. Most of the algorithms discussed here are available in
all versions.
[Figure: the Modeler workspace, with its main areas labeled: Toolbar, Modeler Manager, Stream, Nodes Palette]
For now, these are the most important details of the IBM SPSS Modeler editions
and the workspace. We will go into more detail in the following chapters. To
conclude this introduction to the Modeler, we want to present a list of the
advantages and challenges of this application, by way of a summary of our findings
while working with it.
– The Modeler supports data analysis and the model building process, with its
accessible graphical user-interface and its elaborate stream concept. Creating a
stream allows the data scientist to very efficiently model and implement the
several steps necessary for data transformation, analysis, and model building.
The result is very comprehensible; even statisticians who were not involved
in the process can understand the streams and the models very easily.
– Due to its powerful statistics engine, the Modeler can be used to handle and
analyze huge datasets, even on computers with restricted performance. The
application is very stable.
– The Modeler offers a connection to the statistics program R. We can pass the
dataset to R, perform several calculations there, and send the results back to the
Modeler. Users wishing to retain the functionality of R thus have the chance to
embed it in their Modeler streams. We will show how to install R and how to
use it in Chap. 9.
– Finally, we have to mention the very good IBM support that comes as part of the
professional implementation of the Modeler in a firm’s analytics environment.
Even the best application sometimes raises queries that must be discussed with
experts. IBM support is very helpful and reliable.
All in all the IBM SPSS Modeler, like any other application, has its advantages
and its drawbacks. Users can analyze immense datasets in a very efficient way.
Statisticians with a deep understanding of “goodness of fit” measures and sophisti-
cated methods for creating models will not always be satisfied, because of the
limited amount of output detail available. However, the IBM SPSS Modeler is
justifiably one of the leading data mining applications on the market.
This book can be read and used with minimal mathematical and statistical knowl-
edge. Besides an interest in statistical topics, the reader is only required to have
a general understanding of statistics and its basic terms and measures, such as fre-
quency, frequency distribution, mean, and standard deviation. Deeper statistics and
mathematics are briefly explained in the book when needed, or references to the
relevant literature are given, where the theory is explained in an understandable
way. The following books, focusing on more basic statistical analyses, are
recommended to the reader: Herkenhoff and Fogli (2013) and Weiers et al. (2011).
Since the main purpose of this book is the introduction of the IBM SPSS
Modeler and how it can be used for data mining, the reader needs a valid IBM
SPSS Modeler licence in order to properly work with this book and solve the
provided exercises.
For readers that are not completely familiar with the theoretical background, the
authors explain at the beginning of each chapter the most relevant and needed
statistical and mathematical fundamentals and give examples where the outlined
methods can be applied in practice. Furthermore, many exercises are related to the
explanations in the chapters and are intended for the reader to recapitulate the
theory and gain more advanced statistical knowledge. The detailed explanations of
and comments to these exercises clarify more essential terms and procedures used
in the field of statistics.
" It is recommended that the reader of this book has a valid IBM SPSS
Modeler licence and should be interested in dealing with current
issues in statistical and data analysis applications. This book facilitates
the easy familiarization with statistical and data mining concepts
because . . .
" The following books focusing on more basic statistical analyses are
recommended to the reader:
The goal of this book is to help users of the Modeler to become familiar with the
wide range of data analysis and modeling methods offered by this application. To
this end, the book has the structure of a course or teaching book. Figure 1.2 shows
the topics discussed, allowing users easy access to different focus points and the
information needed for answering their own particular questions.
Each section has the following structure:
Table 1.2 Overview of stream and data structure used in this book
Stream name: distribution_analysis_using_data_audit_node
Based on dataset: tree_credit.sav
Stream structure: [figure]
Table 1.3 Example of a solution
Name of the solution streams: File name of the solution
Theory discussed in section: Section XXX
The names of the dataset and stream are listed before the final stream is depicted,
followed by additional important details. At the end, the exercise numbers are
shown, where the reader can test and practice what he/she has learned.
1.2.3 Using the Data and Streams Provided with the Book
The SPSS Modeler streams need access to datasets that can then be analyzed. In the
so-called “Source” nodes, the path to the dataset folder must be defined. To work
more comfortably the streams provided with this book are based on the following
logic:
– All datasets are copied into one folder in the “C:” drive
– There is just one file “registry_add_key.bat” in the folder “C:
\SPSS_MODELER_BOOK\”
– The name of the dataset folder is “C:\SPSS_MODELER_BOOK\001_Datasets”
– The streams are normally copied to “C:\SPSS_MODELER_BOOK
\002_Streams”, but the folder names can be different.
If other folders are to be used, then the procedure described here, in particular the
Batch file, must be modified slightly.
" All datasets, IBM SPSS Modeler streams, R scripts, and Microsoft Excel
files discussed in this book are provided as downloads on the website:
" http://www.statistical-analytics.net
" For ease, the user can add a key to the registry of Microsoft Windows.
This is done using the script “registry_add_key.bat” provided with
this book.
The key can also be added to the Windows Registry manually by using the
command:
REG ADD "HKLM\Software\IBM\IBM SPSS Modeler\17.0\Environment" /v
"BOOKDATA" /t REG_SZ /d "C:\SPSS_MODELER_BOOK\001_Datasets"
Instead of using the original Windows folder name, e.g., “C:
\SPSS_MODELER_BOOK\001_Datasets”, to address a dataset, the shortcut
“$BOOKDATA” defined in the Windows registry should now be used (see also
IBM Website 2015b).
We should pay attention to the fact that the backslash in the path must be
substituted with a slash. So the path
“C:\SPSS_MODELER_BOOK\001_Datasets\car_sales_modified.sav”
equals
“$BOOKDATA/car_sales_modified.sav”.
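To illustrate the substitution rule, here is a minimal Python sketch; the helper name `to_bookdata_path` is our own invention, not part of the Modeler:

```python
def to_bookdata_path(windows_path, prefix="C:\\SPSS_MODELER_BOOK\\001_Datasets"):
    """Replace the dataset folder prefix with the $BOOKDATA shortcut
    and convert the remaining backslashes to slashes."""
    if not windows_path.startswith(prefix):
        raise ValueError("path is not inside the dataset folder")
    rest = windows_path[len(prefix):].replace("\\", "/")
    return "$BOOKDATA" + rest

print(to_bookdata_path("C:\\SPSS_MODELER_BOOK\\001_Datasets\\car_sales_modified.sav"))
# → $BOOKDATA/car_sales_modified.sav
```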
With this book, the reader has access to more than 30 datasets which can be downloaded
from the author’s website using the given password (see Sect. 1.3.1). The data are
available in different file formats, such as:
" All datasets discussed in this book can be downloaded from the authors’
website using the password provided. The datasets have different file
formats, so the user learns to deal with different nodes to load the
sets for data analysis purposes.
" Since version 17, the SPSS Modeler no longer supports the Excel
1997–2003 file formats. It is necessary to use the Excel 2007 format
or later.
In this book, we show how to build different streams and how to use them for the
analysis of datasets. We will now explain how to create a stream from scratch.
Although we outline this process, it is tedious and time-consuming to always have to
add the data source and some standard nodes, e.g., the Type node, to each stream.
To work more efficiently, we introduce the concept of so-called template streams.
Here, the nodes necessary for loading the dataset and defining the scale types of the
variables are already implemented, so users don't have to repeat these steps in
each exercise. Instead, they can focus on the most important
steps and learn the new features of the IBM Modeler. The template streams can be
extended easily by adding new nodes. Figure 1.3 shows a template stream.
We should mention the difference between the template streams and the solution
streams provided with this book. In the solution streams, all the necessary features
and nodes are implemented, whereas in the template streams only the first steps,
e.g., the definition of the data source, are incorporated. This is depicted in Fig. 1.4.
So the solution is not only a modification of the template stream. This can be seen
by comparing Fig. 1.5 with the template stream in Fig. 1.3.
" Template streams access the data by using the “$BOOKDATA” short-
cut defined in the registry. Otherwise the folder in the Source nodes
would need to be modified manually before running the stream.
The details of the datasets and the meaning of the variables included are
described in Sect. 10.1.
Before the streams named above can be used, it is necessary to take into account
the following aspects of data directories. The streams are created based on the
concept presented in Sect. 1.2. As shown in Fig. 1.6, the Windows registry shortcut
“$BOOKDATA” is being used in the Source node. Before running a stream,
the location of the data files should be verified and, if necessary, adjusted. To do
this, double-click on the Source node and modify the file path (see
Fig. 1.6).
Before we dive into statistical procedures and models, we want to address some
aspects relevant to the world of data from a statistical point of view. Data analytics
is used in different ways, such as for (see Ville 2001, pp. 12–13):
This broad range of applications lets us assume that there are an infinite number
of opportunities for the collection of data and for creating different models. So we
have to focus on the main aspects that all these processes have in common. Here we
want to first outline the characteristics of the data collection process and the main
steps in data processing.
The results of a statistical analysis should be correct and reliable, but this always
depends on the quality of the data the analysis is based upon. In practice, we have to
deal with the effects that dramatically influence the volume of data, the quality, and
the data analysis requirements. The following list, based on text from the IBM
Website (2015c), gives an overview:
1. Scale of data:
New automated discovery techniques allow the collection of huge datasets.
There has been an increase in the number of devices that are able to generate data
and send it to a central source.
2. Velocity of data:
Due to increased performance in all business processes, data must be analyzed
faster and faster. Managers and consumers expect results in minutes or seconds.
3. Variety of data:
The time has long passed since data were collected in a structured form,
delivered, and used more or less directly for data analysis purposes. Data are
produced in different and often unstructured or less structured forms, such as
through social networks comment, information on websites, or through stream-
ing platforms.
4. Veracity of data (data in doubt):
Consolidated data from different sources enable statisticians to draw a more
accurate picture of which entities to analyze. The data volume increases dramat-
ically, but improved IT performance allows the combination of many datasets
and the use of a broader range of sophisticated models.
The source of data determines the quality of the research or analysis. Figure 1.7
shows a scheme to characterize datasets by source and size. If we are to describe a
dataset, we have to use two terms, one from either side of the scheme. For instance,
collected data that relate to consumer behavior, based on a survey, are typically a
sample. If the data are collected by the researchers themselves, they are called
primary data, because the researchers are responsible for the quality.
Once the data are collected, the process of building a statistical model can start,
as shown in Fig. 1.8. As a first step, the different data sources must be consolidated,
using a characteristic that is unique for each object. Once the data can be combined
in a table, the information must be verified and cleaned. This means removing
duplicates and correcting spelling mistakes or semantic errors.
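The deduplication and cleaning step can be sketched in plain Python; the records and field names below are invented for illustration:

```python
# Hypothetical raw records after consolidation; names and fields are invented.
records = [
    {"customer_id": 1, "city": "Berlin "},
    {"customer_id": 1, "city": "Berlin"},   # duplicate of the first record
    {"customer_id": 2, "city": "  hamburg"},
]

# Step 1: normalize obvious formatting issues (whitespace, capitalization).
for r in records:
    r["city"] = r["city"].strip().title()

# Step 2: remove duplicates, keeping the first record per customer_id.
seen, cleaned = set(), []
for r in records:
    if r["customer_id"] not in seen:
        seen.add(r["customer_id"])
        cleaned.append(r)

print(cleaned)
# → [{'customer_id': 1, 'city': 'Berlin'}, {'customer_id': 2, 'city': 'Hamburg'}]
```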
At the end of the cleaning process, the data are prepared for statistical analysis.
Typically, further steps such as normalization or re-scaling are used and outliers are
detected. This is to meet the assumptions of the statistical methods for prediction or
pattern identification. Unfortunately, a lot of methods have particular requirements
that are hard to meet, such as normally distributed values. Here the statistician has
to know the consequences of deviation from theory to practice. Otherwise, the
“goodness of fit” measures or the “confidence intervals” determined, based on the
assumptions, are often biased or questionable.
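As a small illustration of the z-transformation and a rough outlier rule mentioned above (the sample values and the 2-standard-deviation threshold are our own choices):

```python
import statistics

values = [4.0, 5.0, 6.0, 5.5, 4.5, 30.0]  # invented sample with one extreme value

mean = statistics.mean(values)
sd = statistics.stdev(values)  # sample standard deviation

# z-transformation: each value expressed in standard deviations from the mean
z_scores = [(v - mean) / sd for v in values]

# a common (rough) rule: flag values more than 2 standard deviations from the mean
outliers = [v for v, z in zip(values, z_scores) if abs(z) > 2]
print(outliers)
# → [30.0]
```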
A lot of further details regarding challenges in the data analysis process could be
mentioned here. Instead of stressing theoretical facts however, we would like to
dive in to the data handling and model building process with the SPSS Modeler. We
recommend the following exercises to the reader.
1.3.1 Exercises
1. Write all the data measurement units in the correct order and explain how to
convert them from one to another.
2. After a terabyte of data come the units petabyte, exabyte, and zettabyte. How do
they relate to the units previously mentioned?
3. Using the Internet, find examples that help you to imagine what each unit means
in terms of daily life data volumes.
1. Name and briefly explain at least three advantages and three drawbacks of a
personal interview, a mail survey, or an internet survey from a data collection
perspective.
2. Name various possible sources of secondary data.
The advertising industry obtains its data in two ways. “First-Party” data are collected by
firms with which the user has a direct relationship. Advertisers and publishers can compile
them by requiring users to register online. This enables the companies to recognize
consumers across multiple devices and see what they read and buy on their site.
“Third-party” data are gathered by thousands of specialist firms across the web [. . .] To
gather information about users and help serve appropriate ads, sites often host a slew of
third parties that observe who comes to the site and build up digital dossiers about them.
Using the classification scheme of data sources shown in Fig. 1.7, list the correct
terms for describing the different data sources named here. Explain your decision,
as well as the advantages and disadvantages of the different source categories.
customers. They do this so that stores, such as Target and Walmart, can generate
significantly higher margins from selling their baby products. So if the consumer
realizes, through reading personalized advertisements, that these firms offer inter-
esting products, the firm can strengthen its relationship with these consumers. The
earlier a company can reach out to this target group, the more profit can be
generated later. Here we want to discuss the general procedure for such a data
analysis process.
The hypothesis is that consumer habits of women and their male friends change
in the event of a pregnancy. By the way, the article also mentioned that according to
studies, consumers also change their habits when they marry. They become more
likely to buy a new type of coffee. If they divorce they start buying different brands
of beer and if they move into a new house there is an increased probability they will
buy a new kind of breakfast cereal.
Consider you have access to the following data:
• Primary data
– Unique consumer ID generated by the firm
– Consumers’ credit card details
– Purchased items from the last 12 months, linked to credit card details or
customer loyalty card
– An internal firm registry, including the parents' personal details collected in a
customer loyalty program connected with the card
• Secondary data:
– Demographic information (age, number of kids, marital status, . . .)
– Details, e.g., ingredients in lotions, beverages, types of breads, etc. bought
from external data suppliers
Fig. 1.9 IT system with central database in a firm [Figure adapted from Abts and Mülder (2009,
p. 68)]
1. Describe the main characteristics of the firm’s IT system in your own words.
2. List some advantages of data consolidation realized by the firm.
3. Discussing the actual status, describe the drawbacks and risks that the firm faces
from centralizing the data.
4. Summarize your findings and make a suggestion of how to implement data
warehousing within a firm’s IT landscape.
1.3.2 Solutions
The details of a possible solution can be found in Table 1.4. This table is based on
The Economist (2010b).
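The relationship between the units can be sketched as follows; the `convert` helper is our own, and we use the binary factor of 1024 (the decimal convention uses 1000 instead):

```python
# Each unit is 1024 (2**10) times the previous one.
units = ["byte", "kilobyte", "megabyte", "gigabyte",
         "terabyte", "petabyte", "exabyte", "zettabyte"]

def convert(value, from_unit, to_unit, factor=1024):
    """Convert between data measurement units; helper name is our own."""
    steps = units.index(from_unit) - units.index(to_unit)
    return value * factor ** steps

print(convert(2.5, "petabyte", "terabyte"))
# → 2560.0
```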
Table 1.6 shows the solution. More details can be found, e.g., in Ghauri and
Grønhaug (2005).
1. It is necessary to clean up and consolidate the data from the different sources.
The consumer ID serves as the primary key in the resulting database table. Under
these keys, the other information must be merged. In the end, e.g., demographic
information including the pregnancy status calculated based on the baby register
is linked to consumer habits in terms of purchased products. The key to
consolidating the purchased items of a consumer is the credit card or the
consumer loyalty card details. As long as the consumer shows one of these
cards and doesn’t pay cash, the purchase history becomes more complete
each time.
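The consolidation idea described above can be sketched in plain Python; all records and field names are invented for illustration:

```python
# Invented illustrative records; in practice these come from different systems.
demographics = {101: {"age": 29, "kids": 0}, 102: {"age": 35, "kids": 2}}
purchases = [
    {"consumer_id": 101, "item": "unscented lotion"},
    {"consumer_id": 101, "item": "cotton balls"},
    {"consumer_id": 102, "item": "coffee"},
]

# Merge the purchase histories onto the demographic records,
# using the consumer ID as the primary key.
merged = {}
for p in purchases:
    cid = p["consumer_id"]
    entry = merged.setdefault(cid, dict(demographics.get(cid, {}), items=[]))
    entry["items"].append(p["item"])

print(merged[101]["items"])
# → ['unscented lotion', 'cotton balls']
```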
2. Assuming we have clean personal consumer data linked to the products bought,
we can now determine the relative frequency of purchases over time. Moreover,
the product itself may change, and then the percentage of specific ingredients in
each of the products may be more relevant. Pregnant consumers buy more
unscented lotion at the beginning of their second trimester and sometimes in
the first 20 weeks they buy more scent-free soap, extra-big bags of cotton balls
and supplements such as calcium, magnesium, and zinc, according to The
New York Times Magazine (2012) article.
3. We have to analyze and check the purchased items per consumer. If the pattern
changes are somehow similar to the characteristics we found from the data
analysis for pregnancy, we can reach out to the consumer, e.g., by sending
personalized advertisements to the woman or her family.
4. There is another interesting aspect to the data analytics process: analyzing the
business risk is perhaps even more important than not missing the chance to do
the correct analysis at the correct point in time. There are at least the following
risks: how do
consumers react when they realize a firm can determine their pregnancy status?
Is contacting the consumer by mail a good idea? One interesting issue the firm
Target had to deal with was the complaints received from fathers who didn’t
even know that their daughters were pregnant. They only discovered upon
receipt of the personalized mail promotion. So it is necessary to assess the
business risk before implementing, e.g., a pregnancy-prediction model.
1. All data is stored in one central database. External and internal data are
consolidated here. The incoming orders from a website, as well as the manage-
ment of relevant information, are saved in the database. A database management
system allows for restriction of user access. By accessing and transforming the
database content, complex management reports can be created.
2. If data is stored in a central database, then redundant information can be reduced.
Also each user can see the same data at a single point in time. Additionally, the
database can be managed centrally, and the IT staff can focus on keeping this
system up and running. Furthermore, the backup procedure is simpler and
cheaper.
Literature
Abts, D., & Mülder, W. (2009). Grundkurs Wirtschaftsinformatik: Eine kompakte und
praxisorientierte Einführung, STUDIUM (6th ed.). Wiesbaden: Vieweg + Teubner.
Davenport, T., & Patil, D. J. (2012). Data scientist: The sexiest job of the 21st century. Harvard
Business Review, 90(10), 70–76.
de Ville, B. (2001). Microsoft data mining: Integrated business intelligence for e-Commerce and
knowledge management. Boston: Digital Press.
Ghauri, P. N., & Grønhaug, K. (2005). Research methods in business studies: A practical guide
(3rd ed.). New York: Financial Times Prentice Hall.
Herkenhoff, L., & Fogli, J. (2013). Applied statistics for business and management using Microsoft
Excel. New York: Springer.
IBM Website. (2015a). SPSS modeler edition comparison. Accessed June 16, 2015, from http://
www-01.ibm.com/software/analytics/spss/products/modeler/edition-comparison.html
IBM Website. (2015b). Why does $CLEO_DEMOS/DRUG1n find the file when there is no
$CLEO_DEMOS directory? Accessed June 22, 2015, from http://www-01.ibm.com/support/
docview.wss?uid=swg21478922
IBM Website. (2015c). Big data in the cloud. Accessed June 24, 2015, from http://www.ibm.com/
developerworks/library/bd-bigdatacloud/
Lavrakas, P. (2008). Encyclopedia of survey research methods. London: Sage Publications.
The Economist. (2010a). Data, data everywhere: A special report on managing information (Vol.
2010, No. 3).
The Economist. (2010b). All too much. Accessed June 26, 2015, from http://www.economist.com/
node/15557421
The Economist. (2014). Data – Getting to know. Economist, 2014(13), 5–6.
The New York Times Magazine. (2012). How companies learn your secrets. Accessed June
26, 2015, from http://www.nytimes.com/2012/02/19/magazine/shopping-habits.html?_r=0
Walmart. (2012). Company thanks and rewards associates for serving millions of customers.
Accessed July 1, 2015, from http://news.walmart.com/news-archive/2012/11/23/walmart-us-
reports-best-ever-black-friday-events
Weiers, R. M., Gray, J. B., & Peters, L. H. (2011). Introduction to business statistics (7th ed.).
Mason, OH: South-Western Cengage Learning.
2 Basic Functions of the SPSS Modeler
So the successful reader will gain proficiency in the use of the computer and the
IBM SPSS Modeler to prepare data and handle even more complex datasets.
Related exercises: 1, 2
Theoretical Background
The first step in an analytical process is getting access to the data. We assume that
we have access to the data files provided with this book and we know the folder
where the data files are stored. Here we would like to give a short outline of how to
import the data into a stream. Later in Sect. 3.1.1, we will learn how to distinguish
between discrete and continuous variables. In Sect. 3.1.2, we will refer to the
procedure for determining the correct scale type in more detail.
1. We open a new and empty stream by using the shortcut “Ctrl+N” or the toolbar
item “File/New”.
2. Now we save the file to an appropriate directory. To do this, we use “File/Save
Stream”, as shown in Fig. 2.1.
3. The source file used here is an SPSS-Data file with the extension “SAV”. So, we
add a “Statistics File” node from the Modeler tab “Sources”. We then double-
click on the node to open the settings dialog window. We define the folder, as
shown in Fig. 2.2. Here the prefix “$BOOKDATA/” represents the placeholder
for the data directory. We have already explained that option in Sect. 1.2.3. If we
choose not to add such a parameter to the Windows registry, this part of the file
path should be replaced with the correct Windows path, e.g., “C:\DATA\” or
similar paths. The filename is “tree_credit.sav” in every case, and we can
confirm the parameters with the “OK” button.
4. Until now, the stream has only included a Statistics File node. In the next step,
we add a Type node from the Modeler toolbar tab “Field Ops”, by dragging and
dropping it into the stream. Using this method, the nodes are unconnected (see
Fig. 2.3).
5. Next, we have to connect the Statistics File node with the Type node. To do this,
we simply left-click on the Statistics file node once and press F2. Then left-click
on the Type node. The result is shown in Fig. 2.4.
There is also another method for connecting two nodes: In step four, we added
a Statistics File node to the stream. Normally, the next node has to be connected
to this Statistics File node. For this reason, we would left-click once on the
current node in the stream, e.g., the Statistics File node. This would mark the
node. If we now left-click twice on the new Type node in “Field Ops”, the Type
node will be added and automatically connected to the Statistics File node.
6. So far, we have added the data to the stream and connected the Statistics File
node to the Type node. If we want to see and possibly modify the settings, we
double-click on the Type node. As shown in Fig. 2.5, the variable names appear,
along with their scale type and their role in the stream. We will discuss the details
of the different scales in Sects. 3.1.1 and 3.1.2. For now, we can summarize that a
Type node can be used to define the scale of measurement per variable, as well as
its role, e.g., as an input or target variable. If we use nodes other than a Statistics
File node, the scale types are not predefined and must be determined by the user.
2.1 Defining Streams and Scrolling Through a Dataset 29
7. To be able to scroll through the records, we should also add a
Table node. We can find it in the Modeler toolbar tab “Output”, and add it to
the stream. Finally, we should connect the Type node to the Table node, as
outlined above. Figure 2.6 shows the three nodes in the stream.
" To extend a stream with a new node, we can use two different
procedures: we can drag and drop the new node into the stream and
connect it manually (e.g., with the F2 key), or we can mark the
predecessor node and then double-click the new node in the toolbar,
so that it is added and connected automatically.
" A Type node must be included right after a Source node. Here the
scale of measurement can be determined, as well as the role of a
variable.
" We have to keep in mind that it is not possible to show the results of
two sub-streams in one Table node. To do that, we would have to use a
merge or append operation to consolidate the results. See Sects. 2.7.9
and 2.7.10.
8. If we now double-click on the Table node, the dialog window in Fig. 2.7 appears.
To show the table with the records, we click on “Run”.
9. Now we can scroll through the records as shown in Fig. 2.8. After that, we can
close the window with “OK”.
The aim of this section is to show how to handle different streams at the
same time. We will briefly explain the functions of both streams used here in
Sect. 2.7.5.
1. We start with a clean SPSS Modeler environment, so please make sure each
stream is closed. We may need to close the Modeler and open it again.
2. We open the stream “Filter_processor_dataset”. In this stream several records
from a dataset will be selected and others are hidden in the nodes at the end of the
stream.
3. We also open the stream “Filter_processor_dataset modified”.
4. At this time, two streams are open in the Modeler. We can see this in the Modeler
Managers sidebar on the right. Here, at the top, we can find all the open streams. By
clicking on a stream name, we can switch between the streams (Fig. 2.10).
5. Additionally, it is possible to execute other commands, e.g., open a new stream
or close a stream here. To do so we click with the right mouse button inside the
“Streams” section of the sidebar (see Fig. 2.11).
6. Switching between the streams sometimes helps us find out which different
nodes are being used. As depicted in Fig. 2.12, in the stream
“Filter_processor_dataset” the Filter node and the Table node in the middle have
been added, in comparison with the stream “Filter_processor_dataset modified”.
We will explain the functions of both streams in Sect. 2.7.5.
" Switching between streams is helpful for finding the different nodes
used or for showing some important information in data analysis,
without modifying the actual stream. If the streams are open, then in
the Modeler sidebar on the right we can find them and can switch to
another stream by clicking it once.
" Using the right mouse button in the Modeler’s sidebar on the right, a
new dialog appears that allows us, for example, to close the active
stream.
" If there are SuperNodes, they would also appear in the Modeler’s
sidebar. For details see Sect. 3.2.5.
Related exercises: 3
2.3 Defining or Modifying Value Labels 35
Theoretical Background
Value labels are useful descriptions of coded values that allow convenient
interpretation. By defining good value labels, we can also determine the axis
annotations in diagrams later on.
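The idea behind value labels can be sketched in a few lines of Python. This is only an illustration of the concept, not Modeler syntax; the labels follow the “Credit_cards” example used later in this section.

```python
# Minimal sketch of the idea behind value labels (not Modeler syntax):
# a mapping from stored codes to human-readable descriptions.
value_labels = {1.0: "Less than 5", 2.0: "5 or more"}

def label(value, labels=value_labels):
    """Return the label for a coded value, or the raw value if no label exists."""
    return labels.get(value, value)
```

The Modeler maintains such mappings internally; the Type node dialog shown below is where they can be inspected or overwritten.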
1. Here we want to use the stream we discussed in the previous section. To extend
this stream, let’s open “Read SAV file data.str” (Fig. 2.13).
2. Save the stream under another name.
3. To open the settings dialog window shown in Fig. 2.14, we double-click on the
Type node. In the second column of the settings in this node, we can see that the
Modeler automatically inspects the source file and tries to determine the correct
scale of measurement. In Sects. 3.1.1 and 3.1.2, we will show how to determine
the correct scale of measurement and how to modify the settings. Here, we accept
the parameters.
Fig. 2.13 Stream for reading the data and modifying the value labels
4. Let’s discuss here the so-called “value labels”. If the number of different values
is restricted, it makes sense to give certain values a name or a label. In this
example we will focus on the variable “Credit_cards”. In column 3 of Fig. 2.14,
we can see that at least two different values, 1.0 and 2.0, are possible. If we click
on the Values field that is marked with an arrow in Fig. 2.14, a drop-down list
appears. Figure 2.15 shows step 1 of how to specify a value label.
5. First, click the button “Read Values” (see Fig. 2.15). To show and to modify the
value labels, let’s click on “Specify”. Figure 2.16 shows that in the SPSS data
file, two labels are predefined: 1 = “Less than 5” and 2 = “5 or more” credit
cards. Normally, these labels are passed through the Type node without any
modification (see arrow in Fig. 2.16).
6. If we wish to define another label, we set the option to “Specify values and
labels”. We can now modify our own new label. In this stream, we use the label
“1, 2, 3, or 4” instead of “Less than 5”. Figure 2.17 shows the final step for
defining a new label for the value 1.0 within the variable “Credit_cards”.
7. We can confirm the new label with “OK”. The “&lt;Read+&gt;” text in the
column “Values” shows us that we successfully modified the value labels
(Fig. 2.18).
Usually, the Modeler will determine the correct scale of measurement of the
variables. However, an additional Type node will give us the chance to actually
change the scale of measurement for each variable. We therefore suggest adding
this node to each stream.
The dataset “Credit_cards.sav” includes a definition of the scale of measure-
ment, but this definition is incorrect. So, we should adjust the settings in the
“Measurement” column, as shown in Fig. 2.19. It is especially important to
check variables that should be analyzed with a “Distribution node”. They have
to be defined as discrete! This is the case with the “Credit_cards” variable.
8. We can close the dialog window without any other modification and click on
“OK”.
9. If we want to scroll through the records and wish to have an overview of the
new value labels, we should right-click on the Table node. Then we can use
“Run” to show the records. Figure 2.20 shows the result.
10. To show the value labels defined in step 6, we use the button in the middle of
the toolbar of the dialog window. It is marked with an arrow in Fig. 2.20.
11. Now we can see the value labels in Fig. 2.21. We can close the window with
“OK”.
Good and self-explanatory documentation of a stream can help reduce the time
needed to understand the stream or at least to identify the nodes that have to be
modified to find the correct information. We want to start, therefore, with a simple
example of how to add a comment to a template stream.
5. Now we move and resize the comment so that it is in the background of both
nodes (see Fig. 2.24).
6. If we want to assign a comment to more than one node, we have to first mark the
nodes with the mouse by clicking them once. We will probably have to use the
Shift key to mark more than one node.
7. Now once more, we can add a comment by using the icon in the toolbar (see
Fig. 2.22). Alternatively, we can right-click and choose “New Comment . . .”.
(see Fig. 2.25).
8. We define the comment, e.g., as shown in Fig. 2.26, with “source node to load
dataset”. If there is no additional comment in the background, a dotted line
appears to connect the comment to the assigned node.
" Commenting on a stream helps us work more efficiently and can describe
the functionalities of several nodes. There are two types of comments:
" Comments can be added by using the “Add new comment . . .” toolbar
icon or by right-clicking the mouse.
" Due to print-related restrictions with this book, we did not use comments
in the streams. Nevertheless, we strongly recommend this procedure.
2.5 Exercises
1. Name the nodes that should, at a minimum, be included in each stream. Explain
their role in the stream.
2. Explain especially the different roles of a Type node in a stream.
3. In Fig. 2.6 we connected the Table node to the Type node.
(a) Open the stream “Read SAV file data.str”.
(b) Remove the connection of this node with a right click on it. Then connect it
instead directly to the Source/Statistics File node.
2.6 Solutions
1. The final stream can be found in “using_excel_data.str”. Figure 2.29 shows the
stream and its two nodes.
2. To get access to the Excel data, the parameters of the Excel node should be
modified, as shown in Fig. 2.30.
3. The path to the file can be different, depending on the folder where the file is
being stored. Here we used the placeholder “$BOOKDATA” (see Sect. 1.2.3).
4. In particular, we would like to draw attention to the options “Choose worksheet”
and “On blank rows”.
" The Modeler does not always import calculations included in an Excel
worksheet correctly. Therefore, the new values should be inspected,
with a Table node for example. If NULL values occur in the Excel
worksheet, all cells should be marked, copied, and pasted using the
function “Paste/Values Only” (see Fig. 2.31).
In this exercise, we want to show how to define value labels in a stream, so that
the numbers are more self-explanatory.
Fig. 2.32 User-defined value labels are specified in a Type node (step 1)
Fig. 2.33 User-defined value labels are specified in a Type node (step 2)
2.7.1 Theory
In the following sections, we discuss methods that help us to generate specific
subsets of the original dataset. In this way, further methods, e.g., cluster analysis,
can be applied. This is necessary because of the complexity of these multivariate
techniques. If we can separate representative subsets, then these methods become
applicable despite any limitations of time or hardware restrictions.
Figure 2.37 outlines the big picture for the methods discussed in the following
sections. The relevant section is listed alongside each method. We will use
various datasets to discuss the different procedures. To be able to focus on
particular sets and use different methods to deal with the values or records included,
we have reordered the methods applied.
2.7.2 Calculations
Theory
Normally, we want to analyze all the variables that are included in our datasets.
Nevertheless, it is often not enough to calculate measures of central tendency, etc.,
or to determine the frequency of several values. In addition, we also have to
calculate other measures, e.g., performance indicators. We will explain here how
this is usually done with the Modeler.
" With a Derive node, new variables can be calculated. We suggest using
self-explanatory names for those variables. Short names may be easier
to handle in a stream, but it is often hard to figure out what they
stand for.
4. To show the results, we add a Table node behind the Derive node. We connect
both nodes and run the Table node (see Fig. 2.41).
The last column of Fig. 2.40 shows the result of the calculation.
5. To interpret the results more easily we can use a frequency distribution, so we
add a Distribution node behind the Derive node. We select the new variable
“training_expected_total_1” in the Distribution node to show the results. Fig-
ure 2.41 depicts the actual status of the stream. As we can see in Fig. 2.42, more
than 30 % of the users expect to have more than 3 days sponsored by the firm to
become familiar with the IT system.
6. Now we want to explain another option for calculating the result, because the
formula “training_days_actual+training_days_to_add” used in the first Derive
node shown in Fig. 2.39 can be substituted with a more sophisticated version.
Using the predefined function “sum_n” is simpler in this case, and we can also
learn how to deal with a list of variables.
The new variable name is “training_expected_total_2”. Figure 2.43 shows the
formula.
7. To define the formula, we double-click on the Derive node. A dialog window
appears, as shown in Fig. 2.43. With the little calculator symbol on the left-hand
side (marked in Fig. 2.43), we can start the “expression builder”. It helps us
to select a function and to understand the parameters each function expects.
Figure 2.44 shows the expression builder. In the middle of the window, we can
select the type of function we want to use. Here, we choose the category
“numeric”. In the list below we select the function “sum_n”, so that we can
find out the parameters this function expects. An explanatory note below the
table tells us that “sum_n(list)” expects a list. The most important details are the
brackets [. . .] used to create the list of variables. The final formula used here is:
sum_n([training_days_actual,training_days_to_add])
2.7 Data Handling and Sampling Methods 53
Fig. 2.42 Distribution of the total training days expected by the user
" The expression builder in the Modeler can be used to define formulas.
It offers a wide range of predefined functions.
" In functions that expect lists of variables, we must use brackets [].
8. To show the result, we add a new Table node behind the second Derive node.
Figure 2.45 shows the final stream. The results of the calculation are the same, as
shown in Fig. 2.40.
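To make the computation behind `sum_n` concrete, here is a minimal Python sketch (not CLEM syntax) of what the formula `sum_n([training_days_actual, training_days_to_add])` computes per record. The sample record is hypothetical.

```python
# Python sketch of the Modeler's sum_n(list) function: sum the values
# of the listed fields for one record. The record below is hypothetical.
def sum_n(values):
    """Sum the values in a list, mirroring sum_n(list) in the Modeler."""
    return sum(values)

record = {"training_days_actual": 2, "training_days_to_add": 3}
total = sum_n([record["training_days_actual"], record["training_days_to_add"]])
```

For two fields this is equivalent to the simple formula “training_days_actual+training_days_to_add”; the list form pays off as soon as more fields are involved.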
We want to add another important remark here: we defined two new variables
“training_expected_total_1” and “training_expected_total_2” in the Derive nodes.
Both have unique names. Even so, they cannot be shown in the same Table node,
so connecting one Table node with both Derive nodes would not work here. We
would instead have to merge the sub-streams first (see Sect. 2.7.9).
Related exercises: 6
Theory
Until now, we have used the Derive node to deal with numbers, but this node type
can also deal with strings. If we start the expression builder in a Derive node
(Fig. 2.43 shows how to do this), we find the category “string” on the left-hand side
(see Fig. 2.46). In this section, we want to explain how to use these string functions
in general.
Separating Substrings
The dataset “england_payment_fulltime_2014_reduced” includes the median of
the weekly payments in different UK regions. The source of the data is the UK
Office for National Statistics and its website NOMIS UK (2014). The data are based
on an annual workplace analysis coming from the Annual Survey of Hours and
Earnings (ASHE). For more details see Sect. 10.1.12. Figure 2.47 shows some
records, and Table 2.3 shows the different region or area codes.
We do not want to examine the payment data, however. Instead, we want to
extract the different area types from the first column. As shown in Fig. 2.47, in the
first column the type of the region and the name are separated by a colon “:”. We now
use different string functions of the Modeler to extract the type. We will explain
three “calculations” to get the region type. Later, in the exercise, we want to extract
the region names.
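The underlying string operation can be sketched in Python (not with the Modeler’s own string functions): split the field at the first colon to separate the area type from the area name. The sample value is modeled on the described data, not copied from it.

```python
# Python sketch of the extraction task: split a value like "type:name"
# at the first colon. The example value is hypothetical, modeled on the
# described UK region data.
def split_area(area_field):
    """Return (area_type, area_name) from a value like 'type:name'."""
    area_type, _, area_name = area_field.partition(":")
    return area_type.strip(), area_name.strip()
```

In the Modeler, the same result is achieved with string functions in a Derive node, as shown in the following steps.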
Fig. 2.50 Table node with the extracted area type in the last column
Related exercises: 4
Theoretical Background
Datasets with a large number of records are common. Often not all of the records
are useful for data mining purposes, so there should be a way to determine records
that meet a specific condition. Therefore, we want to have a look at the Modeler’s
Select node.
4. To modify the parameters of the Select node, we double-click on it. In the dialog
window we can start the expression builder using the button on the right-hand
side. It is marked with an arrow in Fig. 2.55.
In the expression builder (Fig. 2.56), we first double-click on the variable “firm”
and add it to the command window. Then we can extend the statement manually
by defining the complete condition as: firm = “Intel”.
Figure 2.56 shows the complete statement in the expression builder. We can
confirm the changes with “OK” and close all Select node dialog windows.
5. Finally, we should add a Table node behind the Select node to inspect the
selected records. Figure 2.57 shows the final stream.
Running the Table node at the end of the stream, we can find the selected
12 records of processors from Intel (see Fig. 2.58).
" A Select node can be used to identify records that meet a specific
condition. The condition can be defined by using the expression
builder.
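What the Select node does can be sketched in Python as a simple filter over records: keep only those meeting the condition. The records below are hypothetical; only the field name “firm” and the condition come from the example.

```python
# Python sketch of the Select node with the condition firm = "Intel":
# keep only the records that meet the condition. Records are hypothetical.
records = [
    {"firm": "Intel", "model": "A"},
    {"firm": "AMD", "model": "B"},
    {"firm": "Intel", "model": "C"},
]

selected = [r for r in records if r["firm"] == "Intel"]
```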
Related exercises: 7
Theoretical Background
In data mining, we often have to deal with many variables, but usually not all of
them should be used in the modeling process. The record ID or the names of objects,
which are often included in each dataset, are good examples of this. With the ID we
can specify certain objects or records, but neither the ID nor the name is useful for
the statistical modeling process itself.
" A Filter node can be used (1) to reduce the number of variables in a
stream and (2) to rename variables. The same functionalities are
indeed available in the Modeler’s Source nodes, but an extra Filter
node should be used for transparency reasons, or if the number of
variables should be reduced in a specific part of the stream (and not
in the whole one).
" The Filter node does not filter the records! It only reduces the number
of variables.
5. Finally, we connect the new nodes in the right order: we connect the Excel
node with the Filter node and the Filter node with the Type node. Figure 2.60
shows the result.
6. To understand the functionality of the Filter node, we should
add another Table node and connect it with the Filter node (see Fig. 2.60).
7. To exclude some variables, we now double-click on the Filter node. Figure 2.61
shows the dialog window with all four variables.
8. To exclude the variable “processor type” from the part of the stream behind the
Filter node and the analysis process, we click on the second arrow. Figure 2.62
shows the result.
9. We now can click “OK” to close the window.
10. To check the functionality of the Filter node, we should
now inspect the dataset before and after usage of the Filter node. To do this,
we double-click on the left as well as on the right Table node, as Fig. 2.63
visualizes. Figures 2.64 and 2.65 show the results.
The names of the variables are sometimes hard to figure out. If we would like to
improve the description and modify a name, we can use the Filter node too. We
open the Filter node and overwrite the old names. Figure 2.66 shows an example.
Fig. 2.63 Table nodes to compare the variables in the original dataset and those behind the
Filter node
Unfortunately, this stream now won’t work correctly anymore! We also have to
adjust the variable name in the following nodes or formulas of the stream. In the
end, we can summarize that renaming a variable makes sense, and it should be done
in the Filter node.
Here, we have explained how to use the Filter node in general, but there are also
other options for reducing the number of variables. If we would like to reduce them
for the whole stream, we can also use options in the Sources node itself, e.g., in an
Excel Input node or a Variable File node. Figure 2.67 shows how to disable the
variable “processor type” directly in the Excel File node.
" The parameters of the Source nodes can be used to reduce the
number of variables in a stream. We suggest, however, adding a
separate Filter node right behind a Source node, to create easy to
understand streams.
Related exercises: 9
Theory
In Chap. 4, we will discuss tasks in multivariate statistics. In such an analysis, more
than one variable is used to assess or explain a specific effect. Here, and also in
the following chapter, we are interested in determining the strength with which a
variable contributes to the result or effect. For this reason, and also to interpret the
variables themselves more easily, we should rescale the variables to a specific range.
In statistics we can distinguish between normalization and standardization. To
normalize a variable, all values will be transformed with

x_norm = (x_i − x_min) / (x_max − x_min)

to an interval of [0, 1].
In statistics, however, we more often use the so-called standardization or z-
transformation to equalize the range of each variable. After the transformation, all of
them spread around zero. First, we have to determine the mean x̄ and the standard
deviation s of each variable. Then we use the formula

z_i = (x_i − x̄) / s

to standardize each value x_i. The result is z_i. These values z_i have interesting
characteristics. First, we can compare them in terms of standard deviations. Second,
the mean of the z_i is always zero and their standard deviation is always one.
In addition, we can identify outliers easily: standardizing the values means that
we can interpret the z-values in terms of multiples of the standard deviations they
are away from the mean. The sign tells us the direction of the deviation from the
mean to the left or to the right.
" Standardized values (z values) are calculated for variables with differ-
ent dimensions/units and make it possible to compare them.
Standardized values represent the original values in terms of the
distance from the mean in standard deviations.
" A standardized value of, e.g., −2.18 means that it is 2.18 standard
deviations away from the mean, to the left. The standardized values
themselves always have a mean of zero and a standard deviation
of one.
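Both transformations described above can be sketched with the Python standard library. This is only an illustration of the formulas; the Modeler computes these internally, e.g., in the Auto Data Preparation node discussed below.

```python
# Sketch of normalization and standardization (z-transformation),
# using only the Python standard library.
from statistics import mean, stdev

def normalize(values):
    """Rescale values to the interval [0, 1]: (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]

def standardize(values):
    """Return z-values: each value's distance from the mean in standard deviations."""
    m, s = mean(values), stdev(values)
    return [(x - m) / s for x in values]
```

The characteristic properties follow directly: the z-values of any variable have a mean of zero and a standard deviation of one.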
Standardizing Values
In this chapter, we want to explain the procedure for “manually” calculating the z
values. We also want to explain how the Modeler can be used to do the calculation
automatically. This will lead us to some functionalities of the Auto Data Prep
(aration) node.
We use the dataset “test_scores”, which represents test results in a specific
school. See also Sect. 10.1.31. These results should be standardized and the
calculated values should be interpreted. We use the template stream “Template-
Stream test_scores” to build the stream. Figure 2.68 shows the initial status of the
stream.
4. We first want to standardize the values of the variable “pretest” manually, so that
we can understand how the standardization procedure works.
As shown in Sect. 2.7.2, we therefore add a Derive node and connect it
with the Type node. Figure 2.71 shows the name of the new variable in the Derive
node, which is “pretest_manual_standardized”. As explained in the first paragraph
of this section, we standardize values by subtracting the mean and then
dividing the difference by the standard deviation. The formula is “(pretest-
54.956)/13.563”.
We should keep in mind that this procedure is for demonstration purposes
only! It is generally not appropriate to use fixed values in a Derive node, since
the values in the dataset can be different each time; the fixed values would then
no longer fit and the results would be wrong. Unfortunately, in this case, we
cannot substitute the mean of 54.956 with a computed value: as shown in
Sect. 2.7.9, the predefined function “mean_n” calculates the average of values
in a row, using a list of variables, whereas here we would need the mean of a
column, i.e., of a variable.
Figure 2.72 shows the actual status of the stream.
5. To show the results we add a Table node behind the Derive node.
Fig. 2.71 Parameters of the Derive node to standardize the pre-test values
Figure 2.73 shows some results. We find that the pretest result of 84 points
equals a standardized value of (84 − 54.956)/13.563 = +2.141. That means that
84 points is 2.141 standard deviations away from the mean, to the right. It is
outside the 2s-interval (see also Sect. 3.2.7) and is therefore a very good test result!
6. Finally, to check the results we add a Data Audit node to that sub-stream (see
Fig. 2.74).
7. As explained, the standardized values have a mean of zero and a standard
deviation of one. Apart from small deviations of the mean, which is not exactly
zero, Fig. 2.75 shows the correct results.
8. To use a specific procedure to calculate the z-values, we have to make sure that
the variable “pretest” is defined as continuous and as an input variable. To check
this, we double-click on the Type node. Figure 2.76 shows that in the template
stream, the correct options are being used.
9. As explained above, it is error-prone to use fixed values (e.g., the mean and the
standard deviation of the pretest results) in the Derive node. So there should be
a possibility to standardize values automatically in the Modeler. Here we want
to discuss one of the functionalities of the Auto Data Prep(aration) node for this
purpose.
Let’s select such an Auto Data Preparation node from the Field Ops tab of
the Modeler toolbar. We add it to the stream and connect it with the Type node
(see Fig. 2.77).
Fig. 2.78 Auto Data Preparation node parameters to standardize continuous input variables
10. If we double-click on the Auto Data Preparation node, we can find an
overwhelming number of different options (see Fig. 2.78). Here, we want to
focus on the preparation of the input variables. These variables have to be
continuous. We checked both assumptions by inspecting the Type node
parameters in step 8.
11. Now we activate the tab “Settings” in the dialog window of the Auto Data
Preparation node (see Fig. 2.78). Additionally, we choose the category “Pre-
pare Inputs and Target” on the left-hand side.
At the bottom of the dialog window, we can activate the option “Put all
continuous fields on a common scale (highly recommended if feature construc-
tion will be performed)”.
12. We can now close the dialog window of the Auto Data Preparation node.
Finally, we should add a Table node as well as a Data Audit node to this part
of the stream. This is to show the results of the standardization process.
Figure 2.79 shows the final stream.
13. Figure 2.80 shows the results of the standardization procedure using the Auto
Data Preparation node. We can see that the variable name is
“pretest_transformed”. We can define the name extension by using the Field
Names settings on the left-hand side in Fig. 2.78.
Fig. 2.80 Table node with the results of the Auto Data Preparation procedure
The standardized values are the same as those calculated with the Derive node
before. We can find the value 2.141 in Fig. 2.80. It is the same as that
shown in Fig. 2.73.
Scrolling through the table, we can see that the variable “pretest” is no longer
presented here. It has been replaced by the standardized variable
“pretest_transformed”.
" The Auto Data Preparation node offers a lot of options. This node can
also be used to standardize input variables. Therefore, the variables
should be defined as input variables, using a Type node in the stream.
Related exercises: 9
Theoretical Background
In data mining, the concept of cross-validation is often used to test the model for its
applicability to unknown and independent data and to determine the optimal
parameters for which the model best fits the data. For this purpose, the dataset
has to be split into several parts: a training dataset, a test dataset, and a validation
dataset. The training dataset will be used to find the correct parameters, and the
smaller validation and test datasets will be used for finding the optimal parameters.
3. Let’s have a first look at the data by double-clicking the Data Audit node. After
that we use the Run button to see the details in Fig. 2.84. In the column for the
valid number of values per variable, we find the sample size of 2133 records.
We will compare this value with the sample size of the subsets after the
procedure to divide the original dataset. For now we can close the dialog window
in Fig. 2.84 with “OK”.
4. After the Type node, we must add a “Partition” node. To do so, we first activate
the Type node by clicking on it once. Then double-click on a Partition node in
the Modeler tab “Field Ops”. The new Partition node is now automatically
connected with the Type node (Fig. 2.85).
5. Now we should adjust the parameters of the Partition node. We double-click on
the node and modify the settings as shown in Fig. 2.86. We can define the name
of the new field that consists of the “labels” for the new partition. Here we can
use the name “Partition”.
In addition, we would like to create two subsets, for training and testing. We use
the option with the name “Train and test”.
" All of them must be representative. The Modeler defines which record
belongs to which subset and assigns a name of the subset.
" Normally, the partitioning procedure will assign the records randomly
to the subset. If we would like to have the same result in each trial, we
should use the option “Repeatable partition assignment”.
Fig. 2.89 Table with the new variable “Partition” (RANDOM VALUE!)
Fig. 2.90 Frequency distribution for the new variable “Partition” (RANDOM VALUES)
In Sect. 5.3.7, the procedure to divide a set into three subsets (training, valida-
tion, and test) will be discussed.
Theory
In the previous section, we explained methods to divide the data into several
subsets or partitions. Every record belongs to one of the subsets, and no record
(except outliers) is removed from the sample.
" The key point of sampling is to allow the usage of complex methods,
by reducing the number of records in a dataset. The subsample
must be unbiased and representative, to come to conclusions
that correctly describe the objects and their relationship to the
population.
Related exercises:
Here we want to use the dataset “test_scores.sav”. The values represent the test
scores of students in several schools with different teaching methods. For more
details see Sect. 10.1.31.
5. To be able to sample the dataset, we add a Sample node from the Record Ops tab
and connect it with the Type node.
In addition, we extend the stream by adding a Table node and a Data Audit
node behind the Sample node. This gives us the chance to show the results of the
sampling procedure. Figure 2.95 shows the actual status of the stream.
So far we have not selected any sampling procedure in the Sample node, but
what we can see is that the Sample node reduces the number of variables by
one. The reason is that in the Type node the role of the “student_id” is defined
as “None” for the modeling process (see Fig. 2.96). It is a unique identifier and
so we do not need it in the subsamples for modeling purposes.
6. Now we can analyze and use the options for sampling provided by the Sample
node. Therefore, double-click on the Sample node. Figure 2.97 shows the
dialog window.
7. The most important option in the dialog window in Fig. 2.97 is the “Sample
method”. Here we have activated the option “Simple”, so that the other
parameters will indeed be simple to interpret.
The “Mode” option determines whether the records selected by the other
parameters are included in or excluded from the sample. Here, we should
normally use “Include sample”.
The option “Sample” has three parameters to determine what happens in the
selection process. “First” just cuts off the sample after the number of records
specified. This option is only useful if we can be sure there is definitely no
pattern in the dataset, so that the first n records are also representative of the
whole sample.
The option “1-in-n” selects each n-th record, and the option “Random %”
selects randomly n % of the records.
8. Here we want to reduce the dataset by 50 %. So we have two choices: either we
select every second record or we choose randomly 50 % of the records. We start
with the “1-in-n” option as shown in Fig. 2.98. We can also restrict the number
of records by using the “Maximum sample size”, but this is not necessary here.
We confirm the settings with “OK”.
9. The Table node at the end of the stream tells us that the number of records
selected is 1066 (see Fig. 2.99).
94 2 Basic Functions of the SPSS Modeler
10. The corresponding Data Audit node in Fig. 2.100 shows us that the mean and
the standard deviation of “pre test” and “post test” differ slightly in comparison
with the original values in Fig. 2.94.
11. To check the usage of the option “Random %” in the Sample node we add
another Sample node as well as a new Data Audit node to the stream (see
Fig. 2.101).
12. If we run the Data Audit node at the bottom in Fig. 2.101, we can see that the number of records differs each time. Note that half of the original 2133 records, i.e., 2133/2 = 1066.5 records, should be selected. Sometimes the actual sample size, e.g., 1034 records, deviates noticeably from this target.
" The Sample node can be used to reduce the number of records in a
sample. To create a representative sub-sample, the simple sampling
methods “1-in-n” or “Random %” can be used. The number of records
selected can be restricted. Variables whose roles are defined as
“None” in a Source or Type node are excluded from the sampling
process. Using the option “Random %”, the sample size differs from
the defined percentage.
Fig. 2.100 Details of the sampled records in the Data Audit node
Related exercises:
Random sampling can avoid unintentional patterns appearing in the sample, but very often random sampling also destroys patterns that are useful and
necessary in data mining. Recalling the definition of the term “representative
sample”, which we discussed at the beginning of this section, we have to make
sure that “The frequency distribution of the variables of interest is the same in the
population and in the sample”.
Obviously, we cannot be sure that this is the case if we select each object randomly. Consider the random sampling of houses on sale from an internet database; the regions in the sample may not be distributed as they are in the whole database.
The idea is to add constraints to the sampling process, to ensure the representativeness of the sample. The concept of stratified sampling is based on the following
pre-conditions:
Now we can simply copy and paste them. We get a stream as shown in
Fig. 2.110.
10. Now we have to connect the Type node and the new sub-stream, so we click on the Type node once and press the F2 key. Then we click on the target node, which in this case is the Sample node. Figure 2.111 shows the result.
" Nodes with connections between them can be copied and pasted. To
do this the node or a sub-stream must be marked with the mouse,
then simply copy and paste. Finally, the new components have to be
connected to the rest of the stream.
11. We now want to modify the parameters of the new Sample node. We double-
click on it. In the “Cluster and Stratify . . .” option, we defined the gender as the
variable for the relevant strata (see Fig. 2.105). If we think that records with
14. Finally, we can modify the description of the node as shown in Fig. 2.116. We
can then close this dialog window too.
15. Running the Table node in this sub-stream we get the result as shown in
Fig. 2.117. We can see that the number of female customers increased in
comparison with the first dataset shown in Fig. 2.109.
Now we want to use another option to ensure we get a representative sample.
In the procedure used above, we focused on gender. In the end we found a
representative sample regarding the variable “gender”. If we want to analyze
products that are often sold together, however, we can consider reducing the
sample size, especially in the case of an analysis of huge datasets. Here a
stratified sample related to gender is useless. We have to make sure that all the
products sold together are also in the result.
In this scenario, it is important to understand the characteristics of a flat table
database scheme: as shown once more in Fig. 2.118, the first customer bought
three products, but the purchase is represented by three records in the table. So
it is not appropriate to sample the records randomly based on the customer_ID.
If one record is selected then all the records of the same purchase must also be
assigned to the new subset or partition.
Here we can use the customer_ID as a unique identifier. Sometimes it can be
necessary to define another primary key first for this operation.
This is a typical example where the clustering option of the Sample node can
be used. We want to add a new sub-stream by using the same original dataset.
Figure 2.119 shows the actual status of our stream.
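The cluster idea can be sketched in Python (the purchase records below are hypothetical, mirroring the flat-table scheme of Fig. 2.118): we randomly draw customer IDs rather than records, and then keep every record of a drawn ID.

```python
import random

def cluster_sample(records, cluster_key, percent, seed=None):
    """Sample whole clusters: pick cluster IDs at random, then keep all
    records whose cluster_key value belongs to a selected ID."""
    rng = random.Random(seed)
    ids = sorted({r[cluster_key] for r in records})
    chosen = {i for i in ids if rng.random() < percent / 100.0}
    return [r for r in records if r[cluster_key] in chosen]

# Hypothetical purchases; customer 1 bought three products (three records)
purchases = [
    {"customer_ID": 1, "product": "bread"},
    {"customer_ID": 1, "product": "milk"},
    {"customer_ID": 1, "product": "butter"},
    {"customer_ID": 2, "product": "tea"},
]
sample = cluster_sample(purchases, "customer_ID", 50, seed=1)
# A customer's records are either all in the sample or all excluded.
```

This is exactly the guarantee the "Clusters" option gives: a purchase is never split between sample and remainder.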
16. Now we add another Sample node and connect it with the Type node (see
Fig. 2.122). In the parameters of the Sample node, we activate complex
sampling (see Fig. 2.112) and click on “Cluster and Stratify . . .”. Here, we
select the variable “customer_ID” in the drop-down list of the “Clusters” option
(see Fig. 2.120).
By using this option, we make sure that if a record of purchase X is selected,
all other records that belong to purchase X will be added to the sample.
We can close this dialog window with “OK”.
17. Within the parameters of the Sample node, we can define a name for the node as
shown in Fig. 2.121. After that we can close the dialog box for the Sample node
parameters.
18. Now we can add a Table node at the end of the new sub-stream. Figure 2.122
shows the final stream.
19. Double-clicking the Table node marked with an arrow in Fig. 2.122, we get a result similar to that shown in Fig. 2.123. Each run gives a new dataset because of the random sampling, but the complete purchase of a specific customer is always included in the new sample, as we can see by comparing the original dataset in Fig. 2.118 and the result in Fig. 2.123. The specific structure of the purchases can be analysed by using the new dataset.
" The Sample node allows stratified sampling that produces represen-
tative samples. In general, variables can be used to define strata.
Additionally, cluster variables can be defined to make sure that all
the objects belonging to a cluster will be assigned to the new dataset,
in case only one of them is randomly chosen. Furthermore, the
sample sizes or proportions of the strata can be individually defined.
In practice, we often have to deal with datasets that come from different sources
or that are divided into different partitions, e.g., by years. As we would like to
analyze all the information, we have to consolidate the different sources first. To do
so, we need a primary key in each source, which we can use for that consolidation
process.
Considering the case of two datasets, with a primary key in each subset, we can
imagine different scenarios:
1. Merging the datasets to combine the relevant rows and to get a table with more
columns, or
2. Adding rows to a subset by appending the other one.
In this section, we would like to show how to merge datasets. Figure 2.124
depicts the procedure. Two datasets should be combined by using one variable as a
primary key. The SPSS Modeler uses this key to determine the rows that should be
“combined”.
Figure 2.124 shows an operation called inner join. Dataset 1 will be extended by dataset 2, using a primary key that both have in common; however, there are also two keys (3831 and 6887) that each appear in only one of the datasets. The join types differ in how they handle these unmatched keys.
Table 2.6 shows the join types that can be found in the Modeler.
In the following scenario, we want to merge two datasets coming from the UK.
For more details see Sect. 10.1.12. Figure 2.125 shows the given information for
2013, and Fig. 2.126 shows some records for 2014. In both sets, a primary key
“area_code” can be identified that is provided by the official database. This primary key is unique because no two areas share the same official code.
The relatively complicated variable “admin_description” (administrative description) stands for a combination of the type of the region with its name. We will separate both parts in Exercise 5, “Extract area names”, at the end of this section. Here we want to deal with the original values.
Looking at Figs. 2.125 and 2.126, it is obvious that besides the weekly
payment, the confidence value (CV) for these variables also exists in both subsets,
but there is no variable for the year. We will solve that issue by renaming the
variables in each subset to make clear which variable represents the values for
which year.
The aim of the stream is to extend the values for 2013 with the additional
information from 2014. In the end, we want to create a table with the area code,
the administrative description of the area, the weekly gross payment in 2013 and
2014 as well as the confidence values for 2013 and 2014.
4. To create this table, we have to exclude all the other variables; additionally, we must rename the remaining variables so that each gets a correct and unique name for 2013 or 2014.
Figure 2.129 shows the parameters of the Filter node behind the Source node for 2013. In rows three and four, we changed the names of the variables “weekly_payment_gross” and “weekly_payment_gross_CV” by adding the year. Additionally, we excluded all the other variables. To do so, we click once on the arrow in the middle column.
5. In the Filter node for the data of 2014 we must exclude the “admin_description”
and the “area_name”. Figure 2.130 shows the parameters of the second Filter
node. The variable names should also be modified here.
6. Now the subsets should be ready for the merge procedure. First we add from the
“Record Ops” tab a Merge node to the stream. Then we connect both Filter nodes
with this new Merge node (see Fig. 2.131).
Fig. 2.129 Filter node for data in 2013 to rename and exclude variables
7. We double-click on the Merge node. In the dialog window we can find some
interesting options. We suggest taking control of the merge process by using the
method “Keys”. As shown in Fig. 2.132, we use the variable “area_code” as a
primary key. With the four options at the bottom of the dialog window we can
determine the type of the join-operation that will be used. Table 2.6 shows the
different join-types and a description.
8. To check the result we add a Table node behind the Merge node. Figure 2.133
shows the actual stream that will be extended later. In Fig. 2.134 we can find
some of the records. The new variable names and the extended records are
shown for the first four UK areas.
In general, we can also rename variables and exclude several of them in the “Filter” dialog of the Merge node. We do not recommend this, however, as we wish to keep the stream’s functionality transparent (Fig. 2.135).
Fig. 2.130 Filter node for data in 2014 to exclude and rename variables
" A Merge node should be used to combine different data sources. The
Modeler offers five different join types: inner join, full outer join,
partial outer join left/right, and anti-join. There is no option to select
the leading subset for the anti-join operation.
" The names of the variables in both datasets must be unique. Therefore
Filter nodes should be implemented before the Merge node is applied.
Renaming the variables in the Merge node is also possible, but to
ensure a more transparent stream, we do not recommend this option.
" In the case of a full outer join, we strongly recommend checking the
result. If there are records in both datasets that have the same
primary key but different values in another variable the result will
be an inconsistent dataset. For example, an employee with the ID “1000” might have a different “street_name” in each dataset.
" If two datasets are to be combined row by row then the Append node
should be used (see Sect. 2.7.10).
9. From the tab “Field Ops” we add a Derive node to the stream and connect it with
the Merge node (see Fig. 2.136).
10. With a double-click on the Derive node we can now define the name of the new variable with the average income. We use “weekly_payment_gross_2013_2014_MEAN” as shown in Fig. 2.137.
11. Finally, we have to define the correct formula using the Modeler’s expression builder. To start this tool, we click on the calculator symbol on the right-hand side, as depicted in Fig. 2.137.
12. A new dialog window pops up as shown in Fig. 2.138.
As explained in Sect. 2.7.2, we can use the formula category list to determine
the appropriate function. The category “Numeric” is marked with an arrow in
Fig. 2.138. The correct formula to determine the average weekly income for
2013 and 2014 per UK region is:
mean_n([weekly_payment_gross_2013,weekly_payment_gross_2014])
We can click “OK” and close the Derive node dialog (Fig. 2.139).
13. Finally we add another Table node to show the results. The predicted result of
£468.75 per week for Hartlepool is the last value in the first row of Fig. 2.140.
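As a minimal Python sketch of this averaging step, assuming mean_n skips missing values, and with hypothetical payment figures whose mean happens to be 468.75:

```python
def mean_n(values):
    """Average a list of values, skipping missing entries (None), similar in
    spirit to the Modeler's mean_n function (the null handling here is an
    assumption, not a documented guarantee)."""
    valid = [v for v in values if v is not None]
    return sum(valid) / len(valid) if valid else None

# Hypothetical payments; the real 2013/2014 figures are in the dataset.
print(mean_n([450.00, 487.50]))  # 468.75
print(mean_n([500.00, None]))    # 500.0, the missing year is ignored
```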
Fig. 2.138 Using the expression builder to find the correct formula
Related exercises: 13
Fig. 2.141 Filter node for data in 2014 to exclude and rename variables
2. If we double-click on the Table nodes on the top and the bottom we get the
values as shown in Figs. 2.144 and 2.145.
3. Now we need to add a new variable that represents the year in each subset. We add a Derive node and connect it with the Variable File node. We name the new variable “year” and define the formula as the constant value 2013, as shown in Fig. 2.146.
4. We add a second Derive node and connect it with the Excel file node. We use
the name “year” here also, but for the formula the constant value is 2014
(Fig. 2.147).
5. To show the results of both operations, we add two Table nodes and connect
them with the Derive nodes. Figure 2.148 shows the actual status of the stream.
6. If we use the Table node at the bottom of Fig. 2.148, we get the records shown
in Fig. 2.149.
7. Here it is not necessary to exclude variables in the dataset for 2014. Nevertheless, as shown in Fig. 2.144, we can remove some variables from the dataset for 2013, because they do not match the ones from 2014 (Fig. 2.145).
Fig. 2.149 Values for 2014 and the new variable “year”
Fig. 2.150 Filter node is added behind the Variable File node
To enable us to exclude these variables from the subset 2013, we add a Filter
node behind the Variable File node. We can find this node in the Field Ops tab
of the Modeler. Figure 2.150 shows the actual status of the stream.
8. Now we can exclude several variables from the result. Figure 2.151 shows us
the parameters of the Filter node.
9. Now we can append the modified datasets. We use an Append node from the
Records Ops tab. Figure 2.152 shows the actual status of the stream.
10. In the Append node, we must state that we want to have all the records from
both datasets in the result. Figure 2.153 shows the parameters of the node.
11. Finally, we want to scroll through the records using a Table node. We add a
Table node at the end of the stream (see Fig. 2.154).
" The Append node can be used to combine two datasets row by row. It
is absolutely vital to ensure that the objects represented in the
datasets are unique. A Merge node with an inner join can help here.
For details see the Exercise 13 “Append vs. Merge Datasets”.
" The option “Tag records by including source dataset in field” can be used to mark the records with the number of the dataset they come from. To differentiate between the two sets using user-defined values, e.g., years, we suggest using Derive nodes to add a new variable with a constant value.
12. Running the Table node, we find the records partially shown in Fig. 2.155. We can see the variable with the year and the expected sample size of 2048. As explained in Fig. 2.142, the rows of variables that were not present in both datasets are filled with $null$ values. This can be seen in the last column of the table in Fig. 2.155.
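The append logic, including the constant “year” variable from the Derive nodes and the $null$ filling, can be sketched as follows (field names and values are hypothetical):

```python
def append_datasets(rows_a, rows_b):
    """Stack two record lists row by row; fields missing in one dataset
    are filled with None (the Modeler shows these as $null$)."""
    fields = list(dict.fromkeys(f for r in rows_a + rows_b for f in r))
    return [{f: r.get(f) for f in fields} for r in rows_a + rows_b]

# The "year" field plays the role of the constant added by the Derive nodes.
data_2013 = [{"area_code": "AREA_1", "payment": 450.0, "year": 2013}]
data_2014 = [{"area_code": "AREA_1", "payment": 487.5, "cv": 2.1, "year": 2014}]

combined = append_datasets(data_2013, data_2014)
# The 2013 row gets cv = None, because "cv" only exists in the 2014 subset.
```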
2.7.11 Exercises
would like to have. To do so, the function “sum_n” needs a list of variables. In this
case it was simply “([training_days_actual,training_days_to_add])”.
It can become cumbersome to add more variables, however, because we have to select all the variable names in the expression builder. The predefined function “@FIELDS_MATCHING()” can help us here.
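The idea, selecting fields by a name pattern instead of listing them by hand, can be loosely imitated in Python with fnmatch (the pattern, the record values, and the null handling are assumptions; the CLEM original returns matching field names):

```python
from fnmatch import fnmatch

def fields_matching(record, pattern):
    """Return the values of all fields whose name matches the wildcard
    pattern, loosely imitating the Modeler's @FIELDS_MATCHING function."""
    return [v for k, v in record.items() if fnmatch(k, pattern)]

def sum_n(values):
    """Sum a list of values, skipping missing entries (None)."""
    return sum(v for v in values if v is not None)

row = {"name": "Smith", "training_days_actual": 12, "training_days_to_add": 3}
total = sum_n(fields_matching(row, "training_days_*"))
print(total)  # 15
```

Adding further "training_days_…" variables would not require touching the formula, which is the point of the pattern-based selection.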
1. You can find a description of this procedure in the Modeler help files. Please
explain how this function works.
2. Open the stream “simple_calculations”. By adding a new Derive and
Table node, calculate the sum of the training days in the last year and the days
to add by using the function “@FIELDS_MATCHING()”.
The source of the data is the UK Office for National Statistics and its website
NOMIS UK (2014). The data represent the Annual Survey of Hours and Earnings
(ASHE).
We should note that the coefficient of variation for the median of the sum of the
payments is described on the website NOMIS UK (2014) in the following
statement:
“The quality of an estimate can be assessed by referring to its coefficient of
variation (CV), which is shown next to the earnings estimate. The CV is the ratio of
the standard error of an estimate to the estimate. Estimates with larger CVs will be
less reliable than those with smaller CVs.
In their published spreadsheets, ONS use the following CV values to give an
indication of the quality of an estimate . . .” (see Table 2.8).
Therefore, we should pay attention to the records with a CV value above 10 %.
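As a worked example of the quoted definition (the payment and standard error figures are hypothetical):

```python
def coefficient_of_variation(standard_error, estimate):
    """CV in percent: the ratio of the standard error of an estimate
    to the estimate itself."""
    return 100.0 * standard_error / estimate

# A weekly payment estimate of 450 with a standard error of 9 gives CV = 2 %,
# which would count as a precise estimate in Table 2.8.
print(coefficient_of_variation(9.0, 450.0))  # 2.0
```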
Please do the following:
1. Explain why we should use the median as a measure of central tendency for
payments or salaries.
2. Open the stream with the name “Template-Stream England payment 2014
reduced”.
3. Save the stream using another name.
4. Extend the stream to show in a new Table node the CV values in descending
order.
5. In another Table node, show only the records that have a confidence value
(CV) of the weekly payments below (and not equal to) “reasonably precise”.
Determine the sample size.
6. In addition, add nodes to show in a table those records with a CV that indicates
they are at least “reasonably precise”. Determine the sample size.
1. Explain the functionality that each of these nodes can help to realize.
2. Outline finally the difference between both node types.
1. In Sect. 2.7.4, the stream “selecting_records” was used to select records that
represent processor data for the manufacturer “Intel”. Please now open the
stream “selecting_records” and save it under another name.
2. Make sure or modify the stream so that the “Intel” processors will now be
selected.
3. Add appropriate nodes to standardize the price (variable “EUR”) and the
Cinebench benchmark results (variable “CB”).
4. Show the standardized values.
6. To have a chance of analyzing the partitions, add two Data Audit nodes. For the
final stream, see Fig. 2.158.
7. Now please run the Data Audit node behind the Select node for the training
records TWICE. Compare the results for the variables and explain why they are
different.
8. Open the dialog window of the Partition node once more. Activate the option
“Repeatable partition assignment” (see Fig. 2.157).
9. Check the results from two runs of the Data Audit node once more. Are the
results different? Try to explain why it could be useful in modeling processes to
have a random but repeatable selection.
Fig. 2.162 Final stream to check the effects of different merge operations
4. As shown in Fig. 2.163, Merge nodes in general offer four methods for merging
data. By changing the type of merge operation (inner join, full outer join, . . .),
produce a screenshot of the results. Explain in your own words the effect of the
join type. See, e.g., Fig. 2.164.
Fig. 2.167 Final stream to check the functionalities of the Append node
3. Add a Table node to show the results produced by the Append node. For the final
stream, see Fig. 2.167.
4. Modify the parameters of the Append node (see Fig. 2.168), using the options
– “Match fields by . . .”,
– “Include fields from . . .”, as well as
– “Tag records by including source dataset in field”.
Find out what happens and describe the functionality of these options.
records of these datasets, with a sample size of three records each, are shown in
Figs. 2.170 and 2.171.
The aim of this exercise is to understand the difference between the Merge and
the Append node. Therefore, both nodes should be implemented into the stream.
After that, the mismatch of results should be explained. You will also become
aware of the challenges faced when using the Append node.
9. Describe the problems you find with the results of the Append operation. Try to suggest how we could become aware of such problems before appending two datasets.
2.7.12 Solutions
1. We extended the template stream by first adding a Data Audit node from the
Output tab and then connecting that node with the Source node. We do this to
have a chance of analyzing the original dataset. Figure 2.174 shows that the
number of values in row “weekly_payment_gross” is 1021.
Fig. 2.174 Data Audit node analysis for the original dataset
necessary but it helps us to have a better overview, both in the expression builder
of the Derive node, and when displaying the results in the Table node.
In the Derive node, we use the function
count_equal(5,[starttime, system_availability, performance])
as shown also in Fig. 2.182. Figure 2.183 shows the results.
3. The second sub-stream is also connected to the Filter node. That’s because we
also want to analyze the three variables mentioned above. The formula for the
Derive node is (Fig. 2.184):
count_greater_than(3,[starttime,system_availability,performance])
To count the number of answers that represent satisfaction of at least “5 = good”, we used the function “count_greater_than”, with the first parameter “3”.
Figure 2.185 shows the result.
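Python analogues of the two counting functions may clarify what they do (a sketch; the null handling of the CLEM originals is not modelled):

```python
def count_equal(value, values):
    """Count how many entries equal the given value."""
    return sum(1 for v in values if v == value)

def count_greater_than(value, values):
    """Count how many entries are strictly greater than the given value."""
    return sum(1 for v in values if v > value)

# Hypothetical satisfaction codes for starttime, system_availability,
# performance:
answers = [5, 3, 7]
print(count_equal(5, answers))         # 1
print(count_greater_than(3, answers))  # 2 (the answers coded 5 and 7)
```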
The variables “start-time”, “system_availability”, . . ., “slimness” should be
analyzed here. We are interested in the number of values that are 5 or 7. To get a
list of the variable names, we can use the function “@FIELDS_BETWEEN”.
We have to make sure that all the variables between “start-time” and “slimness”
have the same coding. The formula
count_greater_than(3,@FIELDS_BETWEEN(start-time, slimness))
can be found in the Derive node in Fig. 2.186. Figure 2.187 shows the results.
4. The function “@FIELDS_BETWEEN(start-time, slimness)” produces a list of
all the variables between “start-time” and “slimness”. If we want to exclude a
variable we can filter or reorder the variables. Here we want to show how to use
the Field Reorder node.
If we add a Field Reorder node to the stream and double-click on it, the list of
the fields is empty (see Fig. 2.188). To add the field names, we click on the
button to the right of the dialog window that is marked with an arrow in
Fig. 2.188. Now we can add all the variables (see Fig. 2.189).
After adding the variable names, we should reorder the variables. To exclude
“system_availability” from the analysis, we make it the first variable in the list.
To do so, we select the variable name by clicking on it once and then we use the
reorder buttons to the right of the dialog window (see Fig. 2.190).
In the Derive node, we only have to modify the name of the variable that is
calculated. We use “count_5_and_7_for_all_except_system_availabilty”. The
formula is the same as in the third sub-stream. As shown in Fig. 2.191, it is:
count_greater_than(3,@FIELDS_BETWEEN(start-time, slimness))
Running the last Table node, we get slightly different results than in the third
sub-stream. Comparing Figs. 2.187 and 2.192, we can see that the number of
answers with code 5 or 7 is sometimes smaller. This is because we excluded the
variable “system_availability” from the calculation, by moving it to first place
and starting with the second variable “start-time”.
1 See also the solution of Exercise 7 in Sect. 3.2.8, and here especially Table 3.15.
5. To extract only the records that have a confidence value (CV) of the weekly
payments below (and not equal to) “reasonably precise”, we can use a Select
node with the parameters shown in Fig. 2.196 from the Record Ops tab and a
Table node to show the results.
To determine the sample size, we run the Table node. Figure 2.197 shows a
sample size of 983 records.
Figure 2.199 shows the extended template stream. We added a Derive node and a
Table node.
There are a lot of possible formulas for extracting the region names. Figure 2.200
shows the formula:
substring(locchar(":", 1, admin_description) + 1, length(admin_description) - locchar(":", 1, admin_description), admin_description)
The function substring helps us to separate parts of a string. The first parameter, “locchar(":", 1, admin_description) + 1”, determines the position of the character “:” and adds one, so the extraction begins right after it, where the region name starts.
The second parameter, “length(admin_description) - locchar(":", 1, admin_description)”, calculates the length of the string to be extracted: “length(admin_description)” gives us the total length of the string, from which we subtract the leading part that should be ignored.
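The same extraction can be written directly in Python; the admin_description format used below is an assumption based on the description above:

```python
def extract_region_name(admin_description):
    """Return everything after the first ':', mirroring the
    substring/locchar construction used in the Derive node."""
    pos = admin_description.find(":")  # locchar is 1-based; find() is 0-based
    return admin_description[pos + 1:]

# Hypothetical format: region type, then ':', then the region name.
print(extract_region_name("ua:Hartlepool"))  # Hartlepool
```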
Fig. 2.200 Derive node with the formula to extract the region names
variables. Consider a table with 100 rows and 10 columns. We can minimize the number of columns shown in the modeling process if we use a Filter node, whereas the Select node helps us to cut down the number of records, or rows. We can also use a Filter node to rename the variables.
If we compare the results in the Type node before (see Fig. 2.203) and after (see
Fig. 2.204) the modification of the Excel File node, we can see that the variable
“processor type” no longer appears.
Fig. 2.205 Auto Data Preparation node parameters to standardize continuous input variables
Fig. 2.207 Standardized values of the price and the benchmark test result
If we run the Data Audit node, we get the results shown in Fig. 2.208. Using the 3s-rule, explained in more detail in Sect. 3.2.7, we can identify at least one outlier. The maximum standardized price is 3.057, which means the largest price lies more than three standard deviations above the mean. The value is outside the 3s interval, and therefore this processor is exceptionally expensive. Scrolling through the records in the Table node, we can see that the processor “Core 2 Quad QX9770” is the outlier. It also has the maximum CPU performance.
In practice, it can be helpful to filter out the records that have an absolute standardized value larger than 3. This can be done by using another Select node. For an explanation, see also Exercise 3.
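The 3s-rule check can be reproduced with a short sketch (the sample prices are made up; the sample standard deviation is used, which may differ slightly from the Modeler's calculation):

```python
import statistics

def standardize(values):
    """z-transform: subtract the mean and divide by the standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    return [(v - mean) / sd for v in values]

def outliers_3s(values):
    """Return the original values whose standardized value lies outside +/-3."""
    return [v for v, z in zip(values, standardize(values)) if abs(z) > 3]

prices = [100.0] * 19 + [1000.0]  # nineteen typical prices, one extreme one
print(outliers_3s(prices))  # [1000.0]
```

Note that with very small samples no value can exceed |z| = 3 when the sample standard deviation is used, so the rule is only meaningful for reasonably large datasets.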
6. Data on the AMD processors can be analyzed by modifying the condition in the
Selection node. No outliers are found in the results from the standardized values.
Figure 2.209 shows the final stream once more. Additionally, Fig. 2.210 shows
the option of the Partition node.
If the option “Repeatable partition assignment” in the Partition node (see
Fig. 2.210) is not activated, the records will be selected randomly each time and
will be assigned to one of the partitions. Each of the trials produced different results
in the Data Audit node.
If we want to assign the same records to the partitions and thereby get the same
results in each trial, we can activate the option “Repeatable partition assignment”
and set a seed value. This gives us the opportunity to reproduce the results each time
we run the stream.
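The effect of a seed can be demonstrated in Python (the partition share and the seed value are arbitrary):

```python
import random

def assign_partitions(n_records, train_share=0.5, seed=None):
    """Randomly assign each record to 'train' or 'test'. With a fixed seed,
    the assignment is repeatable; without one, every run differs."""
    rng = random.Random(seed)
    return ["train" if rng.random() < train_share else "test"
            for _ in range(n_records)]

run1 = assign_partitions(2133, seed=42)
run2 = assign_partitions(2133, seed=42)
print(run1 == run2)  # True: repeatable partition assignment
```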
Fig. 2.208 Maximum values of the price and the test results
Fig. 2.210 The Select nodes are generated using a Partition node
1. First we have to make sure we access the correct source files. In the Excel File nodes, we have to define the filenames “england_payment_fulltime_female_2014” in the upper node and “england_payment_fulltime_male_2014” in the node below. Figure 2.212 shows the parameters of the first Excel File node.
2. We can check the parameters by using the Table nodes connected with the
Source nodes. Figure 2.213 shows some records from female employee data,
2014.
3. We can disable a lot of variables in each of the following Filter nodes behind the
Source nodes. Apart from the weekly gross payment, we do not need the other
payment related variables (see Fig. 2.214).
4. Because the variable names have to be unique, we have to disable the variable
“area”. Figure 2.215 shows the parameters for the payments to male employees
dataset.
5. After filtering the variables needed, we can now merge the two subsets (see
Fig. 2.216).
Figures 2.219 and 2.220 show the records of the sample datasets with their small
sample size of three records each. This gives us the chance to understand the effect
of the different join operations. The final stream can be found in Fig. 2.221.
Figure 2.222 shows again the other option of a Merge node and in particular the
different merge operations.
We choose and describe the merge operations step-by-step and show the results.
Inner join
Fig. 2.214 Filter node parameters for the payments of female employees dataset
Fig. 2.215 Filter node parameters for the payments to male employees dataset
Rows are only matched where the customer_ID in both datasets is the same (see
Fig. 2.223).
Full outer join
All rows from both datasets are in the joined table, but a lot of values are not
available ($null$) (see also Fig. 2.124). Figure 2.224 shows that all the variables
that are not in one of the subsets are filled with “non-available” or “$null$” values.
We get a sample of size four because the customer_IDs 1711 and 9001 appear in both datasets, while the IDs 3831 and 6887 are each included in only one of the sets. So the number of unique keys is four.
Partial outer join
Records from the first-named dataset are in the joined table. From the second
dataset, only those with a key that matches the first dataset key will be copied.
In the Merge node, we select the partial outer join option and select the first
dataset, as shown in Fig. 2.225.
Figure 2.226 shows the result. ID 3831 appears in the result because it is in the
first subset, selected as the leading subset. If we set dataset 2 as the leading subset
with the option of the merge node (see Fig. 2.227), then ID 6887 appears in the
result instead of 3831 (see Fig. 2.228).
It does not make sense to activate both datasets in the options when using a partial
outer join, because the result then equals the full outer join (see also the explanation
in the middle of Fig. 2.225).
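The partial outer join corresponds to pandas' left (or right) join; a sketch on the same toy data (illustrative non-key columns):

```python
import pandas as pd

df1 = pd.DataFrame({"customer_ID": [1711, 9001, 3831],
                    "monthly_salary": [3200, 2800, 3500]})
df2 = pd.DataFrame({"customer_ID": [1711, 9001, 6887],
                    "gender": ["f", "m", "m"]})

# Partial outer join: every row of the leading dataset is kept; only
# matching rows are copied from the other one.
left = pd.merge(df1, df2, on="customer_ID", how="left")    # df1 leads: 3831 kept
right = pd.merge(df1, df2, on="customer_ID", how="right")  # df2 leads: 6887 kept
```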
Anti-join
The joined table only shows the records with a key that does not appear in the
second dataset. Here, the leading subset is “employee_dataset_001.xls”, as shown
in Fig. 2.219. The only unique customer ID not included in the second dataset is
3831. The result is therefore a dataset with just one record, as shown in Fig. 2.229.
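pandas has no dedicated anti-join, but the `indicator` option of `merge` allows the same result to be sketched on the toy data (illustrative non-key columns):

```python
import pandas as pd

df1 = pd.DataFrame({"customer_ID": [1711, 9001, 3831],
                    "monthly_salary": [3200, 2800, 3500]})
df2 = pd.DataFrame({"customer_ID": [1711, 9001, 6887],
                    "gender": ["f", "m", "m"]})

# Anti-join: keep only the rows of the leading dataset whose key does
# not appear in the second dataset.
marked = pd.merge(df1, df2, on="customer_ID", how="left", indicator=True)
anti = marked[marked["_merge"] == "left_only"].drop(columns="_merge")
print(anti)  # only ID 3831 has no partner in the second dataset
```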
Figure 2.230 shows once more the different parameters of an Append node.
If we modify the field “Match fields by . . .”, we can determine how the append
operation will select the variables to append. In the case of the datasets used here,
there is no difference in the results. The option “Match case” enables case sensitivity
for the names of the variables, so that variables such as “employer” and “Employer”
would be treated as different. This is sometimes useful; in our datasets, however,
we cannot see any changes that depend on this option.
With the option “Include fields from . . .”, we can define which variables will
appear in the result. Note, however, that the result also depends on the option
“Match fields by”: this option only has an effect if “Name” is activated (see
Fig. 2.230).
If we choose “Main dataset only”, then the variables “customer_ID”,
“monthly_salary”, and “employer” will be in the result. Only if we enable “All
datasets” will “gender” also be in the result. The difference is shown in Figs. 2.231
and 2.232.
The last option, “Tag records by including source dataset in field”, is easy to
understand. As outlined in Sect. 2.7.10, here we can determine whether a new
variable appears in the results as an indicator of the number of the dataset the
records come from. If we enable the option, we get the result shown in Fig. 2.233.
We can also determine the name of the new variable, but unfortunately it is not
possible to redefine the values that are used to mark the subsets. Often we may
instead need, e.g., the year that the data represent; for this we have to use an extra
Derive node to determine the value of a new variable and then append the sets
afterwards. This is shown in Sect. 2.7.10.
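The workaround described above, deriving a tag variable first and then appending, can be sketched in pandas (year values and column names are illustrative):

```python
import pandas as pd

# Hypothetical yearly subsets; the "year" column plays the role of the
# Derive node's indicator variable before the datasets are appended.
d2014 = pd.DataFrame({"customer_ID": [1711, 3831],
                      "monthly_salary": [3200, 3500]})
d2015 = pd.DataFrame({"customer_ID": [1711, 6887],
                      "monthly_salary": [3300, 2900]})

d2014["year"] = 2014   # Derive-node step: tag each subset first ...
d2015["year"] = 2015
appended = pd.concat([d2014, d2015], ignore_index=True)  # ... then append
print(appended)
```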
It is important to realize that the customer IDs 1711 and 9001 are included in
both datasets. As we outlined in Sect. 2.7.10, the objects represented in each dataset
should be unique if we want to use the Append node. Otherwise we get inconsistent
data, as we will show here.
If we run the Merge node, we get the results shown in Fig. 2.236. As expected,
all unique customer IDs are present. Variables for ID 3831 that are not present in
dataset 2 are filled with $null$, and variables for ID 6887 that are not present in
dataset 1 are likewise filled with $null$ for unavailable data. So far, no
inconsistencies can be found here; the data are consistent.
Now we run the Table node behind the Append node. Figure 2.237 shows the
result. As we can see, this result is useless. That’s because we have two records per
Fig. 2.238 Merge node is modified to find duplicated primary keys in two datasets
Fig. 2.239 Result of the Merge node to find duplicated primary keys in two datasets
3 Univariate Statistics

After working through this chapter, the successful reader will be familiar with the
statistical theory of assigning the correct scale of measurement to variables and
will then be able to select and apply the correct methods to display, determine, and
assess frequency distributions.
3.1 Theory
Measuring a variable means assigning values to it. How we choose the method to
examine the variables highly depends upon the scale of measurement or so-called
“scale type” of each variable. Therefore, it is important to assign the correct scale to
the variables in the first place, before starting the analysis.
As an example, Fig. 3.1 shows a small stream called “car_sales_modified”.
Before we go into the details and explain how to use the SPSS Modeler, it is
necessary to understand the theory behind how we describe the variables used in
each sample data file. Each analysis in data mining is based on this information. In
other words: if the description of the type of information a variable represents is
wrong, the variable cannot be handled correctly and the results of the analysis may
be incorrect. So it is essential to at least check the scale of measurement.
The dataset inspected in Fig. 3.1 includes details of different cars. The dialog
window on the right shows several details from the included variables. For more
details see also Sect. 10.1.5.
First, we are interested in the manufacturer of each car, represented by the
variable “manufact”. Its possible values “Acura”, “Audi”, etc. in the column
marked “values” give us some initial rough information about the car. Variables
that have such a restricted, i.e., finite, or countable number of values are called
discrete, categorical, or qualitative. A variable that can only take on two different
values is called dichotomous, for example, male or female.
Discrete Variables
Discrete variables can only take on a limited number of values. It is theoretically
impossible to find a third value between two very close values of the variable.
Often, however, discrete variables are unsuitable for data mining because of
their limited information. Let us therefore discuss, as a contrast, the variable “fuel
capacity”. It is denoted “fuel_cap” and is the third in the list of variables, as shown
in Fig. 3.1.
In comparison with “manufact”, “fuel_cap” values can be measured with utmost
precision. The number of digits after the decimal point is theoretically infinite, and
the precision of the values is therefore unrestricted. These types of variables are
called continuous or quantitative.
Continuous Variables
Continuous variables have an infinite number of values between two points.
Furthermore, a variable can be called continuous if there is always a theoretical
chance that between two very close values another third value can exist. Variables
that represent currencies will also always be called continuous.
In the case of the SPSS modeler, an additional explanation for the term “contin-
uous” is necessary. This is because the Modeler also uses this term for so-called
“absolute scaled variables”. In the dataset “car_sales_modified”, for example, the
variable “sales” is included. We can easily see that between 1,000 and 1,001 sold
cars no other car can be sold, but we can do more than just order the sales numbers:
we can calculate with them and, more importantly, we can interpret ratios of the
values. We therefore assign the scale type “continuous” to such variables in the
Modeler. These are variables that include the most detailed information we can
expect.
The discrete and the continuous variables represent two extremes of informa-
tion: one gives us a rough overview, the other, very detailed information.
There are three other important terms for distinguishing the different types of
variable measurements. They are used in the SPSS Modeler. Let’s focus once
more on the example shown in Fig. 3.1. As well as the continuous variable
“fuel_cap”, in the column “Measurement” we can find the terms nominal and
ordinal. To distinguish between both types let’s define these terms.
Nominal Scale
Discrete variables with values that have no natural order are called “nominally
scaled”. The variable values are mostly strings/text or numbers; if numbers are
used, they serve only to assign the object to a particular group of objects. The
variable values can at most be ordered alphabetically, but there is no implicit order.
The SPSS Modeler uses three circles to
symbolize the scale type “nominal” (see Fig. 3.1).
Ordinal Scale
An ordinal variable is similar to a nominally scaled variable, but the values
additionally have a natural order. The “type” of a car can be “small” or “large”, and
these two values can be ordered, ascending or descending. In the SPSS Modeler, this
scale type is represented by a column chart (see Fig. 3.1).
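The distinction can be illustrated with pandas' categorical type, where an ordinal variable is a categorical with a fixed order (the values below are illustrative):

```python
import pandas as pd

# Nominal: the values only name groups; any order would be arbitrary.
manufact = pd.Categorical(["Acura", "Audi", "BMW", "Audi"], ordered=False)

# Ordinal: the category list fixes the natural order small < large.
size = pd.Categorical(["small", "large", "small"],
                      categories=["small", "large"], ordered=True)

# Order-based operations are only meaningful for the ordinal variable.
print(size.min(), size.max())
```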
Table 3.1 Variables and their scale types used in the “car_sales_modified” dataset
manufact: Represents the manufacturer of a car. Values are “Acura”, “Audi”, etc.
The values have no implicit order, and between two values we cannot find a third.
The variable is therefore discrete.
type: Possible values are “Automobile” (small) or “Truck” (large), so there is an
implicit order, but only a limited number of possible values. The variable is
ordinally scaled.
fuel_cap: The fuel capacity of a car can theoretically be determined with infinite
precision. The variable is metrically/continuously scaled.
Metrical Scale
A variable is called metrical if it can take on an infinite number of values. So all
continuous variables and currency variables are metrical. In the SPSS Modeler,
metrically scaled variables are called “continuous” and represented with a ruler
symbol (see Fig. 3.1).
Coming back to the variables shown in Fig. 3.1, Table 3.1 gives a more detailed
description of the variable scale types used in the dataset “car_sales_modified”. All
the other variables included in this dataset are discussed in Exercise 4 of Sect. 3.1.3.
There are many more terms to describe the characteristics of a variable in more
detail, but we do not need more sophisticated descriptions to use the functionalities
of the SPSS Modeler. Interested readers are referred to Anderson et al. (2014).
3.1.3 Exercises
1. Explain in your own words the difference between a discrete and a continuous
variable. Give two examples for each term.
2. We also discussed the scale types nominal, ordinal, and metrical. Can you
remember which of these scale types can also be named “discrete”?
3. There are also variables that are called dichotomous. Explain in your own words
the meaning of this term and give an example.
4. Give two examples for each of these categories: Nominally, ordinally, and
metrically scaled variables.
3.1.4 Solutions
(a) Continuous
(b) Continuous
(c) Continuous
(d) Discrete
(e) Quasi-continuous (see the example “ticket price for local transport in different
countries” discussed above)
(f) Dichotomous and therefore discrete
1. The metrical scale defines an order and also a reference point. The “number of
rooms” in an apartment is an example of such a variable. It has an order because
the number of rooms can be ordered ascending or descending. With the value
zero, there is also a (theoretical) reference point. It is necessary to mention that
the values of a nominally scaled variable can never be ordered, whereas an
ordinally scaled variable, e.g., the classification of hotels, can be used to order
the records of a dataset. With hotel classification, there is also a theoretical
reference point of zero, but in comparison to the number of rooms, there is no
measurable distance between a three- and a four-star hotel. The only thing we
know is that the four-star hotel is, hopefully, of better quality.
2. The nominal and ordinal scales are used to label qualitative statements about
statistical units.
3. The ratio scale and the absolute scale are metrical scales.
4. To avoid information loss, the inherent order has to be preserved and the
function has to be unique for a transformation function.
5. Minimum and maximum can be calculated without additional assumptions.
6. Yes, they are used to record the values of the variables of interest qualitatively
or quantitatively.
7. No, it’s wrong.
8. Yes, this is a dichotomous variable.
9. Yes, variables with currency are always quasi-continuous.
10. Yes, dichotomous variables are discrete.
11. No, it’s wrong; it is ordinally scaled.
12. No, it’s wrong.
13. Yes, this variable is continuous.
Figure 3.3 shows the solution in the column “Measurement”. We also explain
the correct scale type in Table 3.2.
Table 3.2 Scale type of the variables included in the “car_sales_modified” dataset
Name of the variable Scale type Explanation
manufact Nominal This variable represents the name of the manufacturer.
The values represent “groups” of cars that can have no
implicit order. Additionally, the values are discrete.
type Ordinal This variable is also discrete but the values can be
ordered.
fuel_cap Continuous Between two very close values for the fuel capacity of a
car, it is always possible to find a theoretical value that
can exist. Additionally, it is possible to measure the
fuel capacity with infinite precision.
sales Continuous The variable is discrete, because between two values,
e.g., 1,000 and 1,001, there can be no third value.
Nevertheless, ratios make sense: sales of 2,000 are
double sales of 1,000. In this case, we assign the type
“continuous” to the variable. See also the detailed
explanation in Sect. 3.1.1.
model Nominal These values are once more just names that cannot be
ordered.
resale, price, horsepower, Continuous The values of these variables can also be
width, length measured with infinite precision.
3.2.1 Theory
Using the correct data analysis tool and method depends on determining the
scale of measurement of the variables included in the dataset. We discussed the
different scales of measurement and their characteristics in Sect. 3.1. Figure 3.4
now shows the different steps in a univariate analysis, depending on whether the
variable is discrete or continuous. In the case of a continuous variable, the
researcher must define classes and bin the values before creating a frequency table.
In the following sections, we will show example datasets and the appropriate
methods to apply.
Related exercises: 2, 3, 4, 6, 7
Theoretical Background
In Sect. 3.1.1, we learned to distinguish between discrete and continuous variables.
In Sect. 3.1.2, we discussed a procedure for determining the correct scale type. It is
easy to understand that nominally and ordinally scaled variables are always
discrete. The reverse, however, is not always true, as discrete values, e.g., the
number of car accidents, can be metrically scaled.
A suitable example of a discrete variable is included in the dataset “tree_credit”
and is analyzed here: the variable, “number of credit cards”, describes the number
of credit cards owned per person and is coded as follows:
Begin to Create A Stream: Adding a Data Source and Defining the Scale Type
Now we describe how to add a data source as well as how to assign the correct scale
type to the variables. We will start with an empty stream.
1. We open a new stream by using the shortcut “Ctrl+N” or the toolbar item “File/
New”.
2. We save the file to an appropriate directory.
3. The source file used here is an SPSS data file with the extension “SAV”. From
the Modeler tab “Sources”, we add a node of type “Statistics File” and open its
settings by double-clicking the node. We define the folder and the filename
“tree_credit.sav” and confirm the modification with the “OK” button.
4. Next, we add a Type node from the tab “Field Ops”. As outlined in Sect. 2.1, we
activate the Source node by clicking on it once and then we click on the new
Type node. Both nodes are now connected in the correct order.
5. To open the settings of the Type node as shown in Fig. 3.5, we double click on
this node. In the second column of the settings of this node, we can see that the
Modeler automatically inspects the source file and tries to determine the correct
scale type.
Usually, the Modeler will determine the correct scale types of the variables,
but nonetheless, we have the chance to change this scale type with an additional
Type node. We suggest adding this node to each stream.
The dataset “Credit_cards.sav” includes the definition of the scale types, but
this definition is incorrect. So, we should adjust the settings in the column
“Measurement”, as shown in Fig. 3.5. It is especially important to check the
variables that are to be analyzed with a “Distribution” node: they have to be
defined as discrete. This is correct for the variable “Credit_cards”.
6. We can close the dialog window without any other modification and click “OK”.
" We strongly suggest adding a Type node to each stream right after
the Source node. The Type node has several functionalities, such as:
1. To inspect and to modify the scale type of a variable (see Sect. 3.1).
2. To define or to modify the value labels (see Sect. 2.3).
3. To disable a variable by using the option “none” in the column.
1. We add a node of type “Distribution” from the “Graphs” section to the stream.
2. We connect this node with the Type node. Figure 3.6 shows the current stream.
Up to now, a question mark has appeared beneath the Distribution node.
Obviously, we still have to define the variable that should be visualized with the
Distribution node.
3. To define the target variable, we double-click on the Distribution node and select
“Credit_cards” in the first dialog box as the variable of interest (see Fig. 3.7). We
can also modify the other parameters that influence the diagram.
Fig. 3.6 Stream “Distribution discrete values” before the selection of the target variable
4. Finally, we click “OK” and we will get the stream shown in Fig. 3.8.
5. We use the button “Run” in the upper toolbar. A new window will appear that
shows the frequency distribution of the variable “Credit_cards” (see Fig. 3.9).
6. The number of credit cards is a discrete variable. In the dataset, a name for each
category is defined (see the description in Fig. 3.9). If we want to see the data
labels instead of the values, we should use the button “Display field and value
labels”, which is included in the middle of the window’s toolbar and marked with
an arrow in Fig. 3.9.
7. Furthermore, in the tab “Graph” of the dialog window, we can find a more
detailed diagram of the frequency distribution as shown in Fig. 3.10.
8. This is the result of the graphical analysis using a bar plot, and we can close the
window.
In general, discrete variables are used to reduce the amount of information and to
get a better overview. Variables with continuous values are used in order to obtain
more information (see Sect. 3.1.1 for more details).
In the analysis of values, two types of charts need to be distinguished. A bar
chart can be used when we have discrete or categorical values to analyze, e.g., the
number of credit cards, as shown in Sect. 3.2.1.
If a variable is continuously scaled, however, this approach won’t work. This is
because of the huge/infinite number of values the variable can take on. The results
would then normally show a frequency of one for each value, and so the bar chart
would give unsatisfactory results, since each bar would have the same height, and
the distribution of the variable would be unidentifiable by the graph.
To handle this problem and to create a chart that we can interpret, the typically
used method is to split the co-domain of the continuous variable into intervals and
to determine the frequency of the values in each of these subintervals. This transfers
the continuous variable into a discrete one, and then a bar chart can be drawn as
described in the previous subsection. The SPSS Modeler offers different methods
for this procedure. At the end of the classification process, using equidistant classes
(intervals with the same length), a special kind of bar chart called a histogram will
be produced. Based on the stream template “credit_cards”, we will demonstrate
how to use the SPSS Modeler for this kind of graphical analysis. Interested readers
are referred to Anderson et al. (2014) for more details in using histograms.
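The classification step can be sketched with NumPy, which splits a continuous variable into equidistant classes before counting (the sample values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(seed=1)
age = rng.normal(35, 10, size=500)   # illustrative continuous "Age" values

# Split the co-domain into ten equidistant classes and count the values
# per class: the discretisation a histogram performs before drawing bars.
counts, edges = np.histogram(age, bins=10)
widths = np.diff(edges)              # all classes share the same length
print(counts, widths[0])
```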
1. We open the template stream “credit_cards” and save it under another name.
Alternatively, a Statistics File node can be added to an empty stream, and the
“tree_credit.sav” file should be defined as the data source. Finally, we add a
Type node and connect it with the Statistics File node.
In both cases, we get a stream as shown in Fig. 3.11 with the settings in the
Type node as shown in Fig. 3.12.
2. We focus on the scale type of the variable “Age”. The settings of the Type node
in Fig. 3.12 show us that the variable is continuous.
3. Now, we add a “Histogram” node to the stream, from the graph section of the
toolbar.
4. We connect the Type node with the Histogram node.
5. Figure 3.13 shows the Histogram node at the end of the stream on the right-hand
side. Within it, the variable that will be graphically analyzed must be defined.
For this purpose, we double-click on this node and select the variable “Age”, as
shown in Fig. 3.14.
We have to pay attention to the fact that in this node only metrical values can
be selected! If the scale type is not appropriately defined in the Type node, then
the variable will not appear in the drop-down list of the Histogram node.
Fig. 3.13 Stream “Distribution continuous values” before selection of the target variable
Related exercises: 5, 9
With the Distribution node and the Histogram node, only one variable can be
analyzed at a time. As shown in Sect. 3.2.3, the definition of the scale type is
important for getting correct results. To sum up the findings on these nodes: a lot of
stream parameters have to be defined just to inspect the shape of a distribution. For
a powerful data-mining tool, this is a particularly inefficient way of condensing the
information included in a more or less large dataset. The Data Audit node helps us
to reduce this overhead and to condense the information in a much faster process.
Here, we present how this node can be used.
1. Open the template stream “credit_cards” and save it under another name, e.g.,
“Distribution analysis using Data Audit node”. Alternatively, create a stream
based on the dataset “tree_credit.sav”, as demonstrated in Sect. 2.1.
Either way, we get a stream as shown in Fig. 3.17 with the settings in the Type
node as shown in Fig. 3.12.
2. Now, we add a Data-Audit node from the output section of the toolbar. We
connect the Data Audit node to the existing Type node. Immediately, the
Modeler scans the data source for the number of included variables that can be
inspected by this node. Here, there are six variables. The number of variables is
displayed under the Data Audit node (see Fig. 3.18).
3. To become familiar with this type of node, we will discuss the different options.
To open the dialog window we double-click on the Data Audit node. In the
dialog window, we enable the option “Advanced statistics” as shown in
Fig. 3.19. This option gives us the chance to select certain more detailed
measures. They are then shown in the final analysis (see Fig. 3.21). Additionally,
we suggest enabling the option to calculate the median and mode. We confirm
and finish the modifications with “Run”.
4. Figure 3.20 shows the result of the analysis for dataset “tree_credit”. Besides the
shape of the distribution, a lot of useful measures can be inspected and
interpreted. To reduce the number of measures, we can use the button “Display
statistics”, marked with an arrow in Fig. 3.20.
Figure 3.21 shows a comparison of the offered measures with and without the
option “Advanced statistics” enabled in Fig. 3.19. It will be part of Exercise 7 to explain
the calculated measures in your own words. Finally, we can close the dialog
window with “OK”.
5. We can see another interesting feature of the Data Audit node by defining a
variable as a target variable in the Type node. The variable “Credit_rating” can
take on two possible values (0 = bad and 1 = good), as shown also in Fig. 3.20.
For demonstration purposes, we define this variable as the target variable in the
Type node (see Fig. 3.22). If we now open the Data Audit node once again, we
can find stacked bar charts, which allow us to determine the proportion of bad and
good ratings (see Fig. 3.23).
Fig. 3.21 Comparison of the measures offered by the Data Audit node, with advanced statistics on
the right-hand side
Fig. 3.23 Data Audit node with defined target variable in Type node
" The Data Audit node offers a chart of the frequency distribution of
each variable as well as a wide range of statistical measures of central
tendency, volatility, and skewness.
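The kind of summary the Data Audit node produces can be approximated in pandas; the small dataset below is an illustrative placeholder, not the "tree_credit" data:

```python
import pandas as pd

# Illustrative values; column names mimic the example variables.
df = pd.DataFrame({"Age": [22, 35, 41, 35, 58, 29],
                   "Credit_cards": [1, 2, 2, 3, 2, 1]})

# Central tendency, volatility and skewness in one table, similar in
# spirit to the statistics the Data Audit node reports per variable.
summary = df.agg(["mean", "median", "std", "skew"])
mode_cards = df["Credit_cards"].mode()[0]   # the most frequent value
print(summary)
```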
Related exercises: 8, 9, 11
Theory
Many algorithms for creating statistical models need normally distributed variables
to produce reliable results. Otherwise, either the algorithms cannot determine the
correct solution, or the goodness-of-fit statistics are imprecise. For instance, we can
fit a linear regression model on non-normally distributed variables, but the
hypothesis tests for the parameters, as well as their confidence intervals, are then
inaccurate. See Aczel and Sounderpandian (2009, p. 448) for the regression model
and Zimmerman (1998) for the influence of non-normality on parametric and
nonparametric tests. Dealing with data from real practical backgrounds often means
having to transform it to normality, or at least towards more normally distributed
variables.
In this section, we will show how to analyze data and how to determine the best
transformation.
Normal distributions are represented by a typical bell-shaped curve. On the one
hand, Fig. 3.24 shows two examples with different expected values of 40 and
65, but the same standard deviation of 3. Because the standard deviations are equal,
the range of the values, or the width of the distributions, is equal as well. See also
the 3σ-rule mentioned later in Sect. 3.2.7.
On the other hand, Fig. 3.25 shows an example of two distributions with the same
expected value of 55 but different standard deviations of 3 and 13. It is important to
note that the y-axis represents the probability density. The area under a curve
between minus infinity and a specific value x represents the probability of observing
a value of at most x.
Each normal distribution can be described by its expected value μ and its
standard deviation σ. The skewness of the curves is always zero because the curves
are completely symmetrical. The curves are called left-skewed or negative-skewed
if they have outliers or a longer tail on the left and vice versa.
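The link between the density curve and probabilities can be illustrated with SciPy, using the first example's parameters (expected value 40, standard deviation 3):

```python
from scipy.stats import norm

# The area under the density curve up to a point x is the probability of
# observing a value of at most x. At x = mean + 1 standard deviation,
# this probability is about 0.841 for every normal distribution.
p = norm.cdf(43, loc=40, scale=3)
print(round(p, 3))
```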
The link between the frequency distribution of variables and the probability
density curve is given by the law of large numbers. We want to approximate a
frequency distribution by a normal distribution whose expected value equals the
mean and whose standard deviation equals the standard deviation of the data; but
this only works if the other characteristics of the distribution, e.g., a skewness of
zero, meet the assumptions.
If a curve has a skewness other than zero, however, we can try to transform the
curve to a more symmetrical shape. Table 3.3 shows common transformations such
as inverse, log, square root, and other power transformations. The different
exponents in a transformation have a massive impact on the result. For instance,
the log and the square root behave differently between 0 and 1, in comparison with
what happens for independent variables larger than 1. Neither function is
defined for values smaller than zero.
Box–Cox Transformations
Finding the correct transformation can sometimes be challenging. Tukey (1957)
introduced a family of power transformation functions, later improved by Box and
Cox (1964), which covers all these cases. We want to outline some details here, so
that we are familiar with the bigger picture of statistical theory.
The Box–Cox transformation as a family of functions can be denoted as:
$$
T(x) = \begin{cases} \dfrac{x^{\lambda} - 1}{\lambda} & \text{if } \lambda \neq 0 \\ \log_e x & \text{if } \lambda = 0 \end{cases}
$$
As we can see, this transformation also covers the functions presented in Table 3.3,
and as summarized also in Table 3.4.
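A minimal sketch of the transformation family, verifying that the logarithm case is the limit of the power case as λ approaches 0:

```python
import numpy as np

def boxcox_transform(x, lam):
    """Box-Cox family: (x**lam - 1)/lam for lam != 0, log(x) for lam = 0."""
    x = np.asarray(x, dtype=float)
    if lam == 0:
        return np.log(x)
    return (x**lam - 1.0) / lam

x = np.array([0.5, 1.0, 2.0, 4.0])

# lam = 1 is a pure shift (x - 1); as lam -> 0 the power form approaches log(x).
print(boxcox_transform(x, 1))
print(boxcox_transform(x, 0))
```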
The SPSS Modeler offers the Transform node for visualizing the original
variables and the variables after different transformations. To find the optimal
value for the Box–Cox transformation, other statistical software packages offer
specific functions. The interested reader is referred to the R package “car”; in Fox
and Weisberg (2015, p. 106), documentation for the function “powerTransform”
can be found. In the following example, we will show how this procedure works
and how it selects the best λ, based on the log-likelihood profile.
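In Python, SciPy offers a comparable maximum-likelihood search for λ; a sketch on simulated right-skewed data, not on the book's dataset:

```python
import numpy as np
from scipy import stats

# For log-normally distributed data, the optimal Box-Cox lambda is close
# to 0, i.e. the log transform.
rng = np.random.default_rng(seed=7)
data = rng.lognormal(mean=0.0, sigma=0.5, size=1000)

# scipy estimates lambda by maximising the log-likelihood profile,
# analogous to "powerTransform" from the R package "car".
transformed, best_lambda = stats.boxcox(data)
print(round(best_lambda, 2))  # close to 0 for this sample
```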
4. We double-click the new Transform node to open the dialog window. In the
Fields tab, we add the variables “glucose_concentration”, “blood_pressure”,
“serum_insulin”, “BMI”, and “diabetes_pedigree” as also shown in Fig. 3.30
with an arrow. Figure 3.31 shows the result.
5. We activate the Options tab. Here, we have the chance to run all possible
transformations or to just select and modify some of them (see Fig. 3.32).
The Modeler offers:
– Inverse: 1/x
– Log n: the natural logarithm log_e(x)
– Log 10: log_10(x) (not a Box–Cox power transformation function, but with
similar results to log n)
– Exponential: e^x (not a Box–Cox power transformation function)
– Square root: √x
Fig. 3.31 Variables are added in the Fields tab of the Transform node
Adding an offset would mean substituting the original variable x with x + offset.
As we saw in the theoretical introduction to this chapter, this option can be helpful
for moving the values out of the interval from 0 to 1 before transforming them. For
now, we do not modify the options. The Modeler should
apply all possible transformations to show us the results.
6. In the Output tab, we can define a name for the output; in the annotations tab, we
can determine a name for the Transform node itself. We do not want to modify
these options here. We click on “Run” to see the results, as shown in Fig. 3.33.
7. In the second column of Fig. 3.33, the Transform node shows the current
distribution of the variable. Under the curve, we can find the mean and the
standard deviation. If we double-click on one of these charts, a new window
appears that shows us the details of the frequency distribution selected and an
added normal curve (see Fig. 3.34).
For the curves, both the mean and the standard deviation of the selected
variable are calculated. The SPSS Modeler determines the expected frequencies
and draws the bell-shaped curve. By assessing the deviation of both curves, we
have the chance to decide whether the distribution is normal or not. We can close
this window.
" The Transform node helps assess the frequency distribution of the
variables. The user can decide if the original variable is approximately
normally distributed. The histogram as well as the mean and the
standard deviation will be calculated and shown.
– Inverse: 1/x
– Log n: the natural logarithm log_e(x)
– Log 10: log_10(x)
– Exponential: e^x (not a Box–Cox power transformation function)
– Square root: √x
" Transformed data are also not normally distributed, but in general the
logarithm of the original values helps to move the distribution
towards normality. So if an algorithm assumes normally distributed
values, the user is recommended to test the logarithm of the values
instead of the original variable itself.
8. Starting at the third column in Fig. 3.33, we can see the charts of the frequency
distributions, depending on the transformations mentioned above. Here, we can
decide which transformation to use, to shift the distribution of the original variable
more towards a normal curve. To see more details, we can also double-click on
one of those diagrams. A window similar to that in Fig. 3.34 appears.
9. Assessing the distributions, we find that in particular “serum_insulin”, “BMI”,
and “diabetes_pedigree” do not have a bell-shaped form. The second column of
Table 3.5 shows the result of the Shapiro–Wilk test.
The null hypothesis of the test is that the values come from a population with
normally distributed values. The stated significance (p-value) is the probability
of obtaining such a sample, assuming that this null hypothesis is true. According
to the results, none of the original variables are normally distributed.
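The same test is available outside the Modeler; a minimal sketch in Python, assuming SciPy is installed and using a simulated right-skewed sample instead of the diabetes data:

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(7)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=200)  # right-skewed sample

# H0: the sample stems from a normally distributed population.
stat_raw, p_raw = shapiro(skewed)          # tiny p-value: reject normality
stat_log, p_log = shapiro(np.log(skewed))  # log-transform moves it towards normality
```

A p-value below the chosen significance level (e.g. 0.05) leads us to reject the normality hypothesis.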
For all variables except “blood_pressure”, however, we can see that the
log-transformed values look much better. As the log 10 transformation leads
to a smaller standard deviation, we prefer it to the natural logarithm.
We select the transformations, identified by clicking once on the distributions.
As we see in Fig. 3.35, the name of the transformation, as well as the histogram
of the transformed values, appear in the first column.
The fourth column of Table 3.5 shows us that not all of these transformations will
result in normally distributed variables, but no better transformations can be
determined, even with the automated Box–Cox transformation implemented in R.
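The authors used R for the Box–Cox estimate; a comparable sketch in Python uses scipy.stats.boxcox, which chooses lambda by maximum likelihood. The data here are simulated, not the Pima dataset:

```python
import numpy as np
from scipy.stats import boxcox

rng = np.random.default_rng(1)
x = rng.lognormal(size=300)  # strictly positive, right-skewed values

# boxcox returns the transformed values and the maximum-likelihood lambda.
# For log-normal data the estimated lambda is close to 0, i.e. a log transform.
x_bc, lmbda = boxcox(x)
```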
Table 3.5 Shapiro–Wilk normality tests of the original, Transform node-transformed, and Box–Cox-transformed values (p-values in parentheses; transformation suggested by the Transform node, Box–Cox lambda estimated with R)

Variable name          Original sample      Suggested        Transformed sample   Box–Cox     Box–Cox-transformed
                                            transformation                        lambda      sample (p-value)
glucose_concentration  Non-normal (0.000)   Not transformed  Non-normal (0.000)   0.06295239  0.002
blood_pressure         Non-normal (0.000)   log n or log 10  Non-normal (0.009;   1.181447    0.014
                                                             K–S test 0.053)
serum_insulin          Non-normal (0.000)   log n or log 10  Normal (0.259)       0.04507211  0.377
BMI                    Non-normal (0.000)   log n or log 10  Normal (0.063)       0.1490053   0.089
diabetes_pedigree      Non-normal (0.000)   log n or log 10  Normal (0.212)       0.044299    0.266
1. Our stream looks as depicted in Fig. 3.29. Additionally, as shown in Fig. 3.35,
a dialog window appears on the screen and the different transformations for the
variables are determined. Now we can use the toolbar item “Generate” and
select “Derive Node” (Fig. 3.36).
2. A dialog window appears that lets us choose either to transform the values or to
transform and standardize the values at the same time. Here, we use the first
option, as shown in Fig. 3.37. For an explanation of the standardization
procedure, see Sect. 2.7.7. We click on “OK”.
3. The dialog window in Fig. 3.37 disappears, but a SuperNode is added automat-
ically to the stream by the SPSS Modeler (Fig. 3.38).
4. To evaluate the SuperNode, we activate it by clicking it once.
5. We can now have a look at the details by clicking on “Zoom into SuperNode”
in the toolbar, shown in Fig. 3.39.
6. The Modeler shows the different nodes encapsulated in the SuperNode (see
Fig. 3.40). Furthermore, we can navigate between the SuperNode and the
stream by activating them in the Streams tree, to the right of the Modeler’s
window. This is illustrated in Fig. 3.41.
7. In the Transform node, we defined four transformations. They are now
implemented in the SuperNode, using one Derive node for each transformation.
If we want to verify the formula used in the Derive nodes, we can double-click
on them (see Fig. 3.42).
8. We can close the dialog window in Fig. 3.42 and navigate back to the stream by
activating the stream in Fig. 3.41.
" SuperNodes help the user to encapsulate several nodes into one.
SuperNodes can be created using the option “Create SuperNode”,
which is available when more than one node is selected; the option
appears in the context menu after a right-click with the mouse.
Fig. 3.46 Histogram of the original “Serum Insulin” and the transformed “Serum Insulin”
Fig. 3.47 Q–Q plot of the original “Serum Insulin” and the transformed “Serum Insulin”
Related exercises: 10
Theory
Figure 3.48 shows the differences between procedures used to analyze discrete or
continuous values in general. In this section we will show how to reduce the number
of values on the scale of a discrete variable.
When talking about data and its analysis, we have to distinguish between
discrete and continuous variables, as described in detail in Sect. 3.1.1. The number
of different discrete values can generally be infinite. Nevertheless, a small gap can
always be identified between two, even very close, values. Typical examples of
these types of data are variables based on questionnaires. Respondents are often
asked about their satisfaction with a specific product or the characteristics of an
object. We often observe that respondents tend towards the values in the middle
of a scale and avoid giving extremely positive or negative answers. As a consequence,
the high and low ends of the scales are less used and underrepresented. Therefore, it makes
sense to combine the frequencies of two directly adjacent values, e.g., excellent and
good. In theory, we could simply add the frequencies, but if we would like to use
this information in a stream and in different ways, we should understand how to
transform the values using a so-called “reclassification”.
Reclassify Values
The dataset “IT_user_satisfaction.sav” represents the opinions of IT users in a
particular firm. 180 users were asked to assess the quality of a specific IT system.
For details, see the dataset description in Sect. 10.1.20. As an example of the
reclassification/recoding procedure, we would like to analyze the frequency distri-
bution of a variable. After that we want to reclassify the values.
window from the values to the labels of the values. In Fig. 3.51, the button is
marked with an arrow (see also Sect. 2.3).
2. Obviously, the variables are discrete. In order to have the opportunity to
reclassify the values, the type “discrete” has to be assigned to the variables.
To do this, we should use the Type node shown in the template stream in
Fig. 3.49. We double-click on the Type node, and the scale types as shown in
Fig. 3.52 should be assigned in the second column “measurement”.
" Often the SPSS Modeler does not determine the scale type correctly
using its internal procedures. Therefore, a Type node should be
included in the stream, right after the Source node. The user should
check and possibly modify the scale type settings!
3. The first column in Fig. 3.51 shows some values for the variable “starttime” that
represent the satisfaction of respondents with the time the IT system needs to
start up and be ready for login. To get a better overview, we can, for example,
use a Data Analysis node and connect it with the Type node (see Fig. 3.53).
4. We double-click on the Data Analysis node. In Fig. 3.54, we get a rough
overview of the distributions.
5. To have the chance to assess the distribution in detail, we double-click on the
first small diagram in Fig. 3.54. In the new window, we can assess the details of
the distribution.
6. As we can see in Fig. 3.55, the number of respondents that use the option
“poor” to characterize the start time of the IT-system is very small. To reduce
the number of categories, we can combine the categories “good” and “poor”.
Normally, we could simply add the frequencies: 3 + 109 = 112. However, we want to
show how to modify the values in the stream itself, to obtain in the end a
completely new variable with the transformed values. We therefore have to
implement the transformation as summarized in Table 3.6.
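In essence, the Reclassify node applies a value mapping. A sketch with pandas follows; the category names and the merge below are illustrative, not the exact content of Table 3.6:

```python
import pandas as pd

# Illustrative answers for the variable "starttime"
starttime = pd.Series(["excellent", "good", "poor", "good", "excellent"])

# Old value -> new value; "good" and "poor" are merged into one category.
mapping = {"excellent": "excellent", "good": "good/poor", "poor": "good/poor"}
starttime_recoded = starttime.map(mapping)
```

Counting the recoded categories then reproduces the added frequencies of the merged classes.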
7. To do so we add a Reclassify node to the stream from the “Field Ops” tab of the
Modeler and connect it with the Type node as shown in Fig. 3.56.
8. To implement the reclassification procedure shown in Table 3.6, we double-
click on the new Reclassify node. A dialog window as shown in Fig. 3.57
appears.
9. We now adjust all the parameters as follows:
(a) Mode:
Here, we would like to reclassify just one variable. Therefore, the first
“Mode” option is “Single”;
procedure, then we have the opportunity to select more than one variable.
We select the field “starttime”.
Once more we have to realize that in this dialog box only discrete
variables appear! If a variable is missing, we have to adjust the settings
in the Type node.
(d) New field name:
The SPSS Modeler displays a generic name with a number. It would be
better to select a name that is self-explanatory, so the name for our new
reclassified variable will be “starttime_recoded” (see Fig. 3.58).
" The term “reclassify” does not mean “binning”. A better description
would be “Recode node”.
" It is critically important to make sure that the last row of the reclassify
definition has absolutely no value, that it is blank!
" To define value labels for a reclassified variable, a Type node must be
added to the stream after the Reclassify node.
13. We can confirm the settings with “OK” and then close the dialog window of the
Type node with “OK”.
14. Finally, we add a Data Audit node at the end of the stream, as shown in Fig. 3.64.
This gives us the chance to analyze the new variable “starttime_recoded”.
Fig. 3.64 Final stream to reclassify values and add value labels
15. To see the frequency distribution, we double-click on the Data Audit node, run
the analysis, and scroll down to the new variable “starttime_recoded” (see
Fig. 3.65).
16. Finally, if we double-click on the small diagram marked with an arrow in
Fig. 3.65, we can see the new value labels and the frequency distribution as
shown in Fig. 3.66.
Fig. 3.65 Details of the new variable in the Data Audit node
Fig. 3.66 New codes for the variable “starttime_recode” and their frequencies
1. We open the template stream “tree_credit” and save it under another name, e.g.,
“binning_continuous_data”. Alternatively, we create a stream based on the
dataset “tree_credit.sav” as demonstrated in Sect. 3.2.2. Either way, we get a
stream as shown in Fig. 3.67 with the settings in the Type node, as shown in
Fig. 3.12.
2. Now, we add a node called “Binning” from the “Field Ops” panel, and we
connect the Binning node with the existing Type node (see Fig. 3.68).
3. Before we bin the values, it is always a good idea to analyze the original dataset
(see Sect. 3.2.4). For this reason, we add a Data Audit node to the stream and
connect it with the Statistics File node (see Fig. 3.69).
4. Starting the analysis with this Data Audit node, we realize that only “age” is a
continuous variable. The second row in Fig. 3.70 shows a minimum age of
20 years and a maximum age of 63.350 years. Furthermore, the average age is
33.816 years and the standard deviation is 8.539 years. We will come back to
those results when we discuss the different binning procedures.
5. The Binning node determines the values of a new variable that we would like to
analyze afterwards. To do so, we add finally an additional Data Audit node to the
stream and connect it with the Binning node (see Fig. 3.71).
At the moment, the Modeler tells us that there are just six variables. Normally,
we would expect seven—the six original variables and the new additional binned
variable—but so far we have not defined any parameter in the Binning node.
After the next step, there will be seven variables including the new binned age.
Fig. 3.69 Stream with the Data Audit node added to the Statistics File node
Fig. 3.70 Results of the analysis using the Data Audit node
6. First double click on the Binning node and choose “age”, the only variable in the
dataset that is continuous (see Fig. 3.72).
In any case, the user has to determine the binning method. Figure 3.72
shows the available methods. To become more familiar with these options, we now
want to use and explain them.
(2014, pp. 125–130). If we would like to specify cut points by ourselves to define
a flag variable (status indicator), we should use a Derive node (see Exercise 12
in Sect. 3.2.8). Additionally, we have to mention that sometimes we would like to
reclassify the categories of a variable. The Reclassify node should be used for this
(see Sect. 3.2.6).
same (see Fig. 3.76). It is essential to understand all the terms, “quartile”,
“quintile”, etc., that we find in the middle of the same dialog box shown in
Fig. 3.76. In general, a p-quantile x_p is the value where p · 100 % of all the
values can be found on the left-hand side of x_p. By definition, the median is
x_0.5, i.e., the 50 % quantile or the 50th percentile. All other terms are
explained in Table 3.7.
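The quantile definition above can be checked with NumPy; the sample values are illustrative:

```python
import numpy as np

values = np.array([2, 4, 4, 5, 6, 7, 9])

median = np.quantile(values, 0.50)  # the 50 % quantile x_0.5, i.e. the median
q1 = np.quantile(values, 0.25)      # the first quartile, x_0.25
```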
The Binning node also offers the option of putting a certain number of values in
each class. This option is similar to the cut point definition for a predefined number
of categories, using the “Fixed-width” option as mentioned above. The advantage
here, however, is that we can determine what happens with the values at the cut
points themselves. If there is a value that is equal to a cut point, then the options
“Add to next”, “Keep in current”, and “Assign randomly” determine how this value is assigned.
In the case of a continuous variable, the probability that this will happen is
approximately zero, but we can imagine it, e.g., in the case of a variable “age
in years”. If a respondent is 25 years old and a cut point is defined at exactly
25, then these options determine the class of ages the person is assigned to.
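Equal-frequency binning of this kind can be sketched with pandas.qcut; the ages below are illustrative, not the tree_credit data:

```python
import numpy as np
import pandas as pd

age = pd.Series(np.arange(20, 70))  # 50 illustrative ages from 20 to 69

# Five bins with (roughly) the same number of values each,
# analogous to the quintile ("Tile") option of the Binning node.
age_tile5 = pd.qcut(age, q=5, labels=[1, 2, 3, 4, 5])
```

With 50 evenly spread ages, each of the five tiles receives exactly ten values.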
Fig. 3.76 Defining categories with the same frequency of original values
If we would like to define several new variables, we can activate several of the
options explained in Table 3.7. Figure 3.76 shows an example with the quintile and
the decile. To show the results, we click once more on the tab “Bin Values” and
choose the option “Tile”. The drop-down list is marked with an arrow in
Fig. 3.77. Let’s analyze the details of the generated variables “AGE_TILE5” and
“AGE_TILE10”. Both are also shown in the analysis of the Data Audit node
connected to the Binning node (see Fig. 3.78).
Fig. 3.77 Defining more than one variable to be generated with the “Tile” option
Method “Ranks”
This option should be used to rank the values in ascending or descending order.
Figure 3.79 shows a definition for three new variables. For an explanation, see
Table 3.8.
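The “Ranks” method can be mimicked with pandas; with method="min", ties share the lowest rank. The sales figures are illustrative:

```python
import pandas as pd

sales = pd.Series([120, 45, 300, 45])

rank_asc  = sales.rank(method="min")                   # ascending ranks
rank_desc = sales.rank(method="min", ascending=False)  # descending ranks
```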
Fig. 3.80 Data Audit node—results for dataset “tree_credit” (see Sect. 3.2.4)
The attentive reader will probably ask why we did not discuss the 1s-interval and
its boundaries.
As we can see in Table 3.9, the percentage of values falling within those boundaries
highly depends upon the shape of the distribution. Furthermore, even in the case of
a normally distributed variable, only about 68 % of the values can be found
in that interval, so the 1s-boundaries are of less importance to us.
We would like to use the 2s- and the 3s-interval to identify outliers. Looking
at the percentage of values in these intervals, we can say that values outside the
2s-interval are potentially outliers; they are suspicious. Values outside the
3s-interval are definitely outliers.
Coming back to the Binning node, we now use the 2s-interval option for the
variable “age”, as shown in Fig. 3.81. The “Bin values” tab of the Binning node
in Fig. 3.82 shows the results. If we then use the connected Data Analysis node,
we find the frequency distribution as shown in Fig. 3.83. When we double-click
on the distribution chart, we get the diagram in Fig. 3.84. Clearly, some values
are larger than the upper boundary of the 2s-interval: we find 78 potential
outlier values in this dataset. This is what we expected from the shape of the
distribution of “age” shown in Fig. 3.80, which has a long right tail. These are
the “outliers” we have now found once again with the more refined 2s-interval
method (Fig. 3.84).
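The 2s/3s outlier logic is easy to reproduce outside the Modeler; a sketch with simulated ages (not the tree_credit data):

```python
import numpy as np

rng = np.random.default_rng(42)
age = rng.normal(loc=33.8, scale=8.5, size=1000)  # simulated ages

mean, sd = age.mean(), age.std()
suspicious = age[np.abs(age - mean) > 2 * sd]  # outside the 2s-interval
definite   = age[np.abs(age - mean) > 3 * sd]  # outside the 3s-interval

share_inside_2s = 1 - len(suspicious) / len(age)  # roughly 95 % for normal data
```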
Fig. 3.81 Using the mean and the standard deviation to determine values of a new variable
3.2.8 Exercises
3. Now the records should be reordered. Therefore, add a Sort node to the end of
the stream on the right-hand side. Modify the node settings so that the records are
ordered in descending order in relation to the frequency of sales.
4. Finally, add a node that shows you the records in the modified order.
Fig. 3.84 Frequency distribution of the classified variable using the 2s-rule
1. Open a new empty stream and add a Variable File node from the sources section.
2. Then try to load the records from the data file “IT-projects.txt”. As you can see,
this is a simple text file with the variable names in the first line and tabs between
the numbers in the following lines. To read the file, we will use the following
Variable File node options: “Read field names from file”, “File delimiters”,
“Tab(s)”, and “Newline”. We do not have to modify any of the other options.
Close the options dialog of the Variable File node.
3. Now we would like to see the imported records. To do so, add a Table node to
the stream and connect it with the Variable File node. Try to show the records
in a table.
4. Finally, diagrams should be used to visualize the data. Before we do that, please
analyze the variables and determine the correct scale type using the theoretical
findings in this chapter.
5. Add a Type node and make sure to check the scale types so that a Histogram
node can be used.
6. Can you use another node to create a graph of the frequency distribution of
several variables?
1. Which options in a Data Audit node have to be activated to calculate the median
and the mode?
2. Explain the following measures, also shown in Fig. 3.87: mean, median, and
mode, range, and standard deviation. Create a table with the columns “Name of
measure” and “Explanation”.
1. Explain the meaning of the 2s- and the 3s-rule in your own words.
2. Using the 3s-rule, determine the number of people that are relatively old. In
other words, those who are outliers in the dataset in terms of their age.
2. Add a Derive node and connect it with the Type node. Using the Derive node,
calculate values of a new variable “Income_binned” which tells you if the
income is smaller than 2.5. This should be a flag variable as shown in Fig. 3.89.
3. By using the expression builder on the right in Fig. 3.89, define a condition for
the calculation of the new variable.
4. Finally, define labels “<2.5” and “≥2.5” for the new variable “Income_binned”.
5. Show the results in a Table as well as in a Data Audit node.
3.2.9 Solutions
on the right-hand side, to define the variable used to sort the values. This is
shown with an arrow in Fig. 3.94.
We use the variable “sales” and add it to the dialog box (Fig. 3.95).
Finally, we modify the order by clicking on the text “Ascending”. Alterna-
tively, we can first set the default option to “Descending” at the bottom of the
dialog window, to get the same result. Figure 3.96 shows the result.
If we would also like to use other variables to sort the records, we have the
opportunity here to add additional criteria.
4. To show the result, or better still the records, in the new order, the stream should
be expanded by another Table node on the right-hand side. This is shown in
Fig. 3.97. Clearly Fig. 3.98 shows the reordered records by applying the sales in
descending order. When we compare this figure with Fig. 3.92, we can verify the
result.
1. Practically useful examples for each scale type can be found in Table 3.12.
2. For an appropriate explanation of each scale type, see Sect. 3.1.2.
3. For the matching of scales of measurement with chart types, see Table 3.13.
Figure 3.99 shows the structure of the stream used to graphically analyze the
variable “campaign”. The stream can be found in the solutions, with the name
“pm_customer_train1.str”.
It is important to assign the scale type “ordinal” to this variable, in the Type node of
the stream. Figure 3.100 shows the frequency distribution in the form of a bar chart.
Fig. 3.100 Bar chart of the frequency distribution of the variable “campaign”
1. To open a new stream, use “File\New Stream” from the tool bar. Then add a
Variable File node from the sources section.
2. To load the records from the data file “IT-projects.txt”, the parameter of the node
has to be modified as shown in Fig. 3.101.
3. Figure 3.102 shows the results of the operation using a Table node.
4. There are four variables that should be discussed here. Table 3.14 shows the
scale type of each variable. For an explanation of several scale types see
Sect. 3.1.
5. Figure 3.103 shows the scale type definitions of the type node as described
above. Figure 3.104 also shows the final stream.
" The definition of the scale type often depends on how the variable will
be used in the stream. The scale type is particularly essential if
diagrams are to be created. The Distribution node needs discrete
(nominal, ordinal, or categorical) variables, whereas the Histogram
node only accepts continuous variables. Therefore, sometimes we
need to add a Type node to the stream and modify the scale types.
6. Figures 3.105 and 3.106 show diagrams of frequency distributions. The best way
to have a rough overview of the distributions is to use a Data Audit node, as
described in Sect. 3.2.4. The shape of the distributions depends heavily on the
scale of measurement of the variables, however. If, for example, variable KDSI
is defined as continuous in the Data Audit node, then the values will be binned. If
the scale type of the variable is discrete, a binning is impossible. We can verify
that by adding a Data Audit node and connecting it to the Variable File node. The
result is different when we modify the scale of measurement definition in the
Type node, e.g., for KDSI.
Fig. 3.101 Parameters of the Variable File node to load data records of “IT-projects.txt”
Fig. 3.103 Scale type definitions for “IT-projects.txt” in the Type node
1. The number of classes, the width of the classes and the start of the classing are
all relevant to the appearance of a distribution in a histogram.
2. The measures of skewness, dispersion, location, and kurtosis characterize a
distribution.
3. They are calculated to describe the maximum of the distribution, the location
of the distribution relative to the x-axis, and to label outliers.
4. It should be considered whether the data is classified or not.
5. The median is (5 + 6)/2 = 5.5.
6. The class midpoints, the absolute/relative frequency per class, and, if applica-
ble, the sample size are used to calculate the arithmetic mean of classified
values.
7. The arithmetic mean is 4. There are two modes, 4 and 5. The median is 4.
8. If n is odd or n = 1.
9. The distribution is skewed to the right and extremely skewed.
10. The median income is . . . EUR. The mode of income is . . . EUR.
11. No, it’s wrong. It’s skewed to the right.
12. Yes, the average mean is sensitive to outliers.
13. Yes, it lies on the right-hand side of the maximum of the distribution.
14. No, it’s wrong. That’s because the relative frequencies are calculated with the
sample size.
15. Yes, the arithmetic mean can be greater or less than most values.
16. No, it’s wrong.
17. No, it’s wrong. Also the median is sensitive to outliers.
1. Firstly, in the Data Analysis node the option “advanced statistics” has to be
activated (see Fig. 3.19). Secondly, the options for the measures to calculate
have to be activated, as shown in Fig. 3.21.
2. See Table 3.15.
For details see Sect. 2.7.6 and Sect. 3.2.5. For details see Table 3.16.
Fig. 3.109 Parameters of the Derive node to calculate the debt–income ratio
Fig. 3.111 Variables and their distribution in the Data Audit node
Fig. 3.113 Variables are added in the Fields tab of the Transform node
Fig. 3.116 Final stream with a SuperNode and a Data Audit node
Starting at the third column in Fig. 3.114, we can see the charts of the
frequency distributions, depending on different transformations. Here, we can
decide which transformation to use, to shift the distribution of the original
variable more towards a normal curve. To see more details, we can also
double-click on one of those diagrams.
Assessing the distributions, we find that none of the variables have a
bell-shaped form, even after transformation.
For “AGE” and “DEBTINCOMERATIO”, we can see that the square root-
transformed values look much better. We select this transformation for both
variables, identified by clicking once on the distributions.
As explained in Sect. 3.2.5, we now define a SuperNode and connect it to the
rest of the stream. Finally, we add a Data Audit node to evaluate the result (see
Figs. 3.116 and 3.117).
1. To show the original values, we added a Table and a Data Audit node to the
stream, as shown in Fig. 3.118.
2. Now we should add a Reclassify node and connect it with the Type node.
Figure 3.119 shows the parameters for reclassifying the variables “starttime”,
“system_availability”, . . ., “slimness”. In addition, we defined the recoding
procedure itself, as summarized in Table 3.11.
It is absolutely vital to make sure that the last row of the reclassify definition
has absolutely no value and is blank! This row is marked in Fig. 3.119 with
an arrow. To be certain this row is completely empty, we can also click
the red “delete selection” option on the right. This button is also marked with
an arrow.
3. As shown in the final stream in Fig. 3.122, we have to add another Type node to
define the value labels for the reclassified variables. If we double-click on the
second Type node, we can reorder the variables by clicking at the column head
of the first column “name” (see Fig. 3.120). For each of the first eleven variables,
we have defined a new reclassified variable with the name extension “_Reclassify”.
The extension is defined in the dialog window of the Reclassify node (see
Fig. 3.119 in the middle).
If the reclassification does not work, we should try to define the mapping for
just one variable first and then add the other variables in the dialog box
“reclassify fields”.
4. In the column “Values” of Fig. 3.120, we can now specify the value labels as
shown in Fig. 3.121. Don’t forget that we must click on “Specify values and labels”
before we define the new labels.
5. Unfortunately, we must implement this definition for all of the new variables
with the extension “_Reclassify” shown in Fig. 3.120.
6. Finally, we should add a Data Analysis node. As shown in Fig. 3.122, we can
now find 27 variables in the stream. This is because we added 11 new variables
with the implementation of the Reclassify node to the 16 variables defined in
the data source (see the Data Analysis node in Fig. 3.122).
7. Figure 3.123 shows the results of the reclassification procedure in the Data
analysis node.
Part 1
For an explanation of the 2s- and 3s-rule, the interested reader is referred to
Sect. 3.2.7 and Table 3.9. Here, the functionalities of the Binning node were
discussed, and the option to bin values using the mean and standard deviation.
Part 2
The solution can be found in the file “outlier analysis”. Figure 3.124 shows the
complete stream. So far, we have not discussed the Sort node on the right-hand side
of the dotted line in Fig. 3.124. In any case, this node is easy to use and there is an
exercise, “Sorting values”, with a detailed solution (see Exercise 1).
Here, we describe the most important steps.
1. First we should determine which variable should be used for the calculation. In
the binning node, we choose the variable “age”.
2. We also use the option “Mean/standard deviation” with “3 standard
deviations” (see Fig. 3.125). In Fig. 3.126, we can see that the 3s-threshold is
33.82 + 3 · 8.54 = 59.44 years. Persons who are older than this can be described
as “outliers” in the context of this dataset.
3. In the Data Analysis node, we can find the distribution of the binned values.
A double click on the frequency distribution of “Age_SDBIN” shows
the details in Fig. 3.127. Here, we can see there are seven persons that are
relatively old.
The interested user can try to find the concrete records of the persons whose
age is above the threshold mean + 3 · standard deviation. To sort the records,
the Sort node can be used; this is explained in the solution to Exercise 1 in
this chapter.
Figure 3.128 shows the concrete parameters in the stream discussed here. Further-
more, a Table node will then show the records (see Fig. 3.129).
Figure 3.130 shows the parameters of the Derive node. Figure 3.131 shows the
final stream.
To define the value labels, we double click the Type node and then double click
the new variable “Income_binned”. Here, we can define the value labels as shown
in Fig. 3.132. Figure 3.133 depicts the frequency distribution of the flag variable
“Income_binned”.
1. Only the mean and the standard deviation can be directly compared, based on
their calculation and unit. Looking at the formula for the standard deviation, we
find that it is being calculated using the mean/average. So we cannot compare
the result with the median.
2. The smallest value for standard deviation is zero. We will get this result if all
the values are the same.
3. The standard deviation is the square root of the variance. Both remain the same
in this example. That’s because the curve of the frequency distribution is
moved to the right, but the shape of the curve is not(!) affected in this scenario.
Therefore, the standard deviation also remains the same.
4. The variance rises, and so the standard deviation rises too. Consider an apart-
ment with a rental price of 3 Euro per square meter. The new price is then 3.30
Euro per square meter. If we look at the rental price of a more expensive
apartment, e.g., 8 Euro per square meter, it will increase by 0.80 Euro. These
examples show us that the frequency distribution of the variable will have longer
tails. Consequently, the deviation of the values from the mean will be larger
than before, so the standard deviation is higher.
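Answers 3 and 4 can be verified numerically: adding a constant leaves the standard deviation unchanged, while a proportional 10 % increase scales it by 1.1. The rents below are illustrative:

```python
import numpy as np

rent = np.array([3.0, 5.0, 8.0, 12.0])  # illustrative rents per square metre

shifted = rent + 2.0   # every rent raised by a fixed amount
scaled  = rent * 1.10  # every rent raised by 10 %

# Shifting does not change the dispersion; scaling by 1.1 scales it by 1.1.
```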
5. The sample with the uniform distribution has the larger variance. That’s because
in this case we can also find many more values at the interval boundaries. That
means that the number of large deviations from the mean is greater. As the
standard deviation measures this dispersion of the values, the standard
deviation is much larger in the case of uniformly distributed values.
6. For 1,1,9,9 we get the largest standard deviation. See questions 4 and 5 for an
explanation.
7. As a rule of thumb, approximately 88.89–95.45 % of the values can be found in
the so-called “2s-interval” from 7 to 13. For an explanation, see Sect. 3.2.7 and
particularly Table 3.9.
8. Yes.
9. Yes, additional knowledge of standard deviation as a measure of dispersion is
necessary to assess the accuracy of the mean.
10. Yes, it’s true. The variance measures the deviation of the values from the mean.
In the formula for the standard deviation, we find the squared difference
between each value and the mean. This difference is large for an outlier, and
squaring boosts the effect, so the standard deviation will definitely be
influenced by this outlier.
11. No, it’s wrong.
12. Yes, because all the differences between the values and their average are zero.
13. Yes, the more skewed a distribution, the bigger the variance and/or standard deviation. For an explanation see also question 10.
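Several of these answers can be checked numerically. A minimal sketch with NumPy; the prices in answer 4 come from the text, while the alternative samples in answer 6 and the simulation settings are invented for illustration:

```python
import numpy as np

# Answer 4: a 10 % increase multiplies every value, and therefore also
# the standard deviation, by 1.10; a constant shift would not change it
prices = np.array([3.0, 5.0, 8.0, 12.0])
print(np.std(prices * 1.10) / np.std(prices))   # ratio is 1.10

# Answer 6: of these candidate samples, 1,1,9,9 has the largest spread
# (the alternatives are invented for comparison)
for sample in ([1, 1, 9, 9], [1, 5, 5, 9], [4, 5, 5, 6]):
    print(sample, round(float(np.std(sample)), 2))

# Answer 7: share of simulated normal values (mean 10, s = 1.5)
# inside the "2s-interval" from 7 to 13
rng = np.random.default_rng(42)
x = rng.normal(10.0, 1.5, size=100_000)
print(round(float(np.mean((x >= 7.0) & (x <= 13.0))), 3))  # about 0.954
```

For normally distributed data the share is about 95.45 %; the lower bound of 88.89 % in the rule of thumb covers less well-behaved unimodal distributions (see Table 3.9).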
Literature
Aczel, A. D., & Sounderpandian, J. (2009). Complete business statistics (7th ed.). Boston:
McGraw-Hill Higher Education.
Anderson, D. R., Sweeney, D. J., & Williams, T. A. (2014). Essentials of statistics for business and economics (7th ed.).
Box, G. E. P., & Cox, D. R. (1964). An analysis of transformations. Journal of the Royal Statistical
Society Series B (Methodological), 26(2), 211–252.
Chapman, C., & Feit, E. M. (2015). R for marketing research and analytics, use R! Berlin:
Springer.
Fox, J., & Weisberg, S. (2015). Package ‘car’. Accessed June 5, 2015, from http://cran.r-project.
org/web/packages/car/car.pdf
IBM. (2014). SPSS modeler 16 source, process, and output nodes. Available at: ftp://public.dhe.
ibm.com/software/analytics/spss/documentation/modeler/16.0/en/modeler_nodes_general.pdf
Schulz, L. O., Bennett, P. H., Ravussin, E., Kidd, J. R., Kidd, K. K., Esparza, J., et al. (2006).
Effects of traditional and western environments on prevalence of type 2 diabetes in Pima
Indians in Mexico and the USA. Diabetes Care, 29(8), 1866–1871.
Tukey, J. W. (1957). On the comparative anatomy of transformations. The Annals of Mathematical
Statistics, 28(3), 602–632.
Zimmerman, D. W. (1998). Invalidation of parametric and nonparametric statistical tests by
concurrent violation of two assumptions. The Journal of Experimental Education, 67(1),
55–68.
4 Multivariate Statistics
After finishing this chapter, the reader is able to:
1. Explain in detail the tools and the process to determine and assess the dependency between variables
2. Explain the difference between a correlation and a contingency table
3. Create and analyze correlation matrices and contingency tables and finally
4. Analyze dependencies between variables, as well as explain why the statistical analysis is not intended to replace the analysis of practically relevant facts
So the successful reader is familiar with the process to analyze two or more
variables regarding their dependencies. Furthermore, she or he understands and can
describe the specific steps to produce reliable results and to exclude spurious
correlations.
4.1 Theory
First of all, we should understand the difference between univariate, bivariate, and
multivariate analysis. The univariate analysis we discussed in the previous chapter
enables us to analyze and characterize the frequency distribution of each variable in
a dataset separately. We therefore come to know a lot of measures, e.g., the mean
and the standard deviation. See Sect. 3.2, Exercise 6 and 13.
Dependencies between two or more variables are also of interest, however.
When considering the unquestionable correlation between the price and the sales
volume of a product, or between the income of an employee and the rental price of
his/her flat, we can also show those relationships in a graph.
For the last-mentioned correlation, the process of moving from a univariate to a bivariate statistic is shown in Fig. 4.1. Combining two frequency distributions, we
get a so-called scatterplot. This gives us the chance to measure the type (positive or
negative) as well as the strength of the dependency and to determine the parameter
of, e.g., a linear regression model.
We suggest the steps shown in Fig. 4.2 as a general approach to finding out if
there is a dependency between variables and for describing the strength of the
relationship. At the end of this analysis we have created a valid multivariate model.
The steps are explained in this section. Figure 4.3 shows the different topics and the
numbers of the sections.
At this point, we wish to add a word of warning. Often the concept of correlation, and the models based on it, are applied in scenarios where they do not belong, e.g., to time series. To understand the difficulties, we want to outline here the difference between a cross-sectional and a longitudinal study.
In a cross-sectional study, a researcher analyzes the objects of interest at only
one point in time. That means, for instance, she/he asks the respondents about their
number of cinema visits in the last year or about their constant monthly payback on
the debt for a recently bought new car. These are snapshots of the person’s behavior
(“statistical objects”) taken just at one point in time.
In a longitudinal study, the researcher conducts the observations many times,
over a period of time. At the end she/he gets a series of values—a time series. So
this is obviously a totally different approach. Table 4.1 shows a summary of both
types of studies.
Now the question is, when can the concept of correlation be applied? The answer
is that in general we can calculate and interpret the correlation coefficient in a cross-
sectional study. Otherwise the results cannot be interpreted very easily. Granger
and Newbold showed in many articles that time series tend to have an “apparently
high" correlation. See Granger and Newbold (1974). Despite a moderate or even strong correlation between two time series, this does not mean that these variables are actually "connected". This is our word of warning: even for completely independent series, we can normally find large correlation coefficients. Hence, the concept of correlation, along with methods built on it such as regression, should be used with caution when the variables are represented by time series.
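The Granger and Newbold effect is easy to reproduce by simulation. The following sketch (our own illustration, not from the Modeler) compares pairs of independent white-noise series with pairs of independent random walks:

```python
import numpy as np

rng = np.random.default_rng(7)

def mean_abs_corr(walks, n=300, reps=200):
    """Average |Pearson r| over many pairs of independent series."""
    out = []
    for _ in range(reps):
        a, b = rng.normal(size=n), rng.normal(size=n)
        if walks:  # cumulate the noise to get non-stationary random walks
            a, b = np.cumsum(a), np.cumsum(b)
        out.append(abs(np.corrcoef(a, b)[0, 1]))
    return float(np.mean(out))

print(mean_abs_corr(walks=False))  # independent noise: close to 0
print(mean_abs_corr(walks=True))   # independent walks: "apparently high"
```

Although every pair of series is generated independently, the random walks typically show a large absolute correlation, exactly the spurious effect described above.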
4.2 Scatterplot
Related exercises: 1
Theoretical Background
With Data Mining in particular, we have to deal with huge datasets and sample
sizes. Scrolling through the records is obviously not the best way to get an idea of
the characteristics of the data. Therefore, we suggest creating diagrams to get a
rough overview of the data structure as well as of the outliers. Here, we want to
demonstrate the possible steps for such an analysis with the SPSS Modeler.
As outlined in Fig. 4.1, multivariate analysis means dealing with at least two
variables. So we will start with a simple example. In the dataset “test_scores”, we
can find data from schools that have been using traditional as well as new/experimental
teaching methods. Furthermore, the dataset includes the pretest and the posttest results
of the students as well as some variables to characterize the students themselves. See
also Sect. 10.1.31. The dependencies between the variables should be examined.
Creating a Scatterplot
First of all a simple scatterplot should be used to get an idea if the pretest and the
posttest results of the students could be dependent. To get access to the data, we use
the template stream “Template-Stream test_scores”. See Fig. 4.4.
5. We select "pretest" and "posttest" on the left side using the Ctrl-Key and the
left mouse-button. On the right side, we scroll down to the simple “Scatterplot”.
We activate this plot by clicking it once. The arrow in Fig. 4.7 shows the actual
status of the dialog window.
6. We click “Run” to close the window and draw the graph. Figure 4.8 shows the
result. In this scatterplot, we can see that there is probably a strong dependency
between the pre- and the posttest.
7. It is also possible to create a more detailed scatterplot of two variables that may be dependent. Using the same dataset, we will show the
usage of the Plot node here. From the section “Graph” of the toolbar, we add a
Plot node and connect it with the Type node. See Fig. 4.9.
8. To define the correct dependency in the graph, we should think about the
possible direction of the dependency between both variables: pre- and posttest.
Here, we know the causality that the posttest depends on the pretest result.
Therefore, the variable posttest should be assigned to the vertical axis and the
pretest to the horizontal axis.
Now we double-click at the Plot node and assign the variables as outlined above.
See Fig. 4.10.
The difference from the scatterplot in Fig. 4.8 is that we can see here much more
clearly the concentration of the points in the middle of the plot. Of course, it is
possible to modify the setting of the Graphboard node to get a similar result. What
we would like to show here is that there are two different nodes that can be used to
create scatterplots for two variables. The Graphboard node is the easiest to handle,
but the Plot node shows more details without modifications.
The regression model based on that identified linear relationship between pre-
and posttest can be found in Sect. 5.2.2.
Related exercises: 2, 3
4.3 Scatterplot Matrix
Theoretical Background
Here we address the second step of the suggested approach to create valid bi-/
multivariate models. See Fig. 4.12.
When we get a dataset, we usually do not know the dependencies of the
variables. Additionally, datasets usually have more than two variables that we are
interested in. In general, we can use the simple scatterplots outlined in Sect. 4.2, but
to examine a lot of variables we would have to create a lot of scatterplots. Here we
want to show a more efficient procedure.
The idea of a scatterplot matrix (see, e.g., Fig. 4.17) is to create multiple
scatterplots at once. The user is not interested in any detail of the bivariate
distribution. Instead, the aim of such a diagram matrix is to get a “feeling” for the
dependencies and the type of the dependencies.
We use here once more the dataset “test_scores”. In this dataset, we can find
several variables, e.g., pretest, posttest results and teaching method, as well as the
number of students in the class. See also Sect. 10.1.31 for more details. Following
from the idea of the bivariate distribution discussed in Sect. 4.2, we can expect
dependencies between these variables.
The scatterplot matrix is a square matrix with diagrams in the cells. We will show how to create and how to interpret this diagram type.
5. We would like to examine the relationship between the variables, which are the
number of students in the classroom and the pretest and the posttest results. We
select all these variables on the left side of the dialog window.
Now, on the right side, we see that only diagrams that can visualize more than two variables are available. In particular, the remaining plot "Scatterplot
matrix (SPLOM)” can be used to examine the dependency between more than
two variables. We select this plot by clicking it. See also the arrow in Fig. 4.16.
6. We click “Run” to close the dialog window. The operation is often time-
consuming, so it takes a while until the plot appears. Figure 4.17 shows the result.
The order of the variables (from left to right: number of students, pretest and
posttest) on the horizontal and the vertical axis is the same. The diagonal represents
" A scatterplot matrix is a square matrix with diagrams in the cells. The
order of the variables on the horizontal and the vertical axis is the
same. Additionally, the scatterplots above and below the diagonal
from the left upper corner to the right bottom corner of the matrix are
identical (symmetrical matrix). The main diagonal of the matrix
consists of the frequency distribution of the variables.
We find the expected pattern within the diagrams of pre- and posttest
relationships, e.g., in the middle of the last row and the last column of the matrix.
This is identical with the scatterplot we created in Sect. 4.2.
In addition, we can find the relationship between the number of students in the
classroom and the pretest or the number of students in the classroom and the
posttest. Indeed, it makes sense that the correlation is negative. See the diagram
in the middle of the first column of Fig. 4.17 for pretest vs. number of students and
the diagram in the bottom left corner for posttest vs. number of students.
Fig. 4.17 Scatterplot matrix for number of students vs. pre- vs. posttest results
We found out that it is a good idea to create scatterplots for all variables, so that
we can identify (hidden or unknown) correlations. Of course it is also necessary to
inspect them using mathematical measures, e.g., the correlation coefficient. Never-
theless we strongly suggest creating such diagrams to get a first rough overview of
the data/information. You can find a linear regression model of the relationship
between pre- and posttest in Sect. 5.2.2.
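Outside the Modeler, the same kind of diagram can be produced in a single call. A sketch with pandas, using synthetic stand-in data; the variable names follow the "test_scores" dataset, but the numbers are invented:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # draw off-screen, no display needed
from pandas.plotting import scatter_matrix

rng = np.random.default_rng(0)

# Synthetic stand-in for "test_scores": posttest depends on pretest,
# both correlate negatively with the number of students in the class
n_student = rng.integers(15, 35, size=200)
pretest = 90 - 1.2 * n_student + rng.normal(0, 5, size=200)
posttest = pretest + rng.normal(5, 4, size=200)

df = pd.DataFrame({"n_student": n_student,
                   "pretest": pretest,
                   "posttest": posttest})

# 3x3 matrix: histograms on the diagonal, scatterplots elsewhere,
# mirroring the SPLOM produced by the Graphboard node
axes = scatter_matrix(df, diagonal="hist", figsize=(6, 6))
print(axes.shape)
```

As in Fig. 4.17, the matrix is symmetric and its main diagonal carries the univariate frequency distributions.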
4.4 Correlation
Related exercises: 1, 2, 3, 4
Theory
Here we address the second step of our suggested approach to analyze datasets and to create valid bi-/multivariate models. See Fig. 4.12.
In addition to a graphical analysis of the data using scatterplots, it is helpful and
necessary to calculate measures for correlations. Usually in statistical programs, a
correlation matrix would be used. As the SPSS Modeler focuses upon data mining
and therefore large datasets with many variables, these matrices would be hard to handle because of their size. Probably due to this fact, IBM chose to calculate the
correlations, but present them in the form of a list. We will show how to create and
interpret these results.
Let us also add some notes related to the statistics. In Sect. 3.1.2, we distin-
guished between nominal, ordinal, and metric variables. Here we should remember
that theory. It is easy to determine the difference between two values measured on a
metric scale (at least interval scaled). The dependency between two metric
variables can be measured by using the co-variance. This measure can be
standardized using the product of the standard deviations of each variable. At the
end of this calculation, we get the so-called “Pearson’s correlation coefficient”.
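This standardization can be traced in a few lines. A sketch with NumPy and invented numbers (not the "test_scores" data):

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([1.0, 3.0, 2.0, 5.0, 4.0])

# Covariance, standardized by the product of the standard deviations ...
cov_xy = np.cov(x, y, ddof=1)[0, 1]
r_manual = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))

# ... equals Pearson's correlation coefficient
r_builtin = np.corrcoef(x, y)[0, 1]
print(round(r_manual, 4), round(r_builtin, 4))  # -> 0.8 0.8
```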
Table 4.2 Value of the correlation coefficient and the strength of the correlation

Absolute value of the
correlation coefficient   Interpretation
1                         Perfect correlation
0.8                       Strong correlation
0.5                       Moderate correlation
0.3                       Weak or very low correlation, probably no dependency
                          between the variables
0                         No correlation
That is the measure determined by the Modeler. See IBM (2015a, p. 268). It
measures only the linear proportion of the relationship between both (metric)
variables. However, it is a standard measure for metric values and is also often
used when only one of the variables involved is metric; in that case, the other variable has to be at least ordinal. Ordinal input variables normally ask for
“Spearman’s rho” as an appropriate measurement. See for instance Scherbaum and
Shockley (2015, p. 92). Pearson's correlation coefficient, however, is an approximation based on the assumption that the distances between the scale items are equal. Details of how to deal with ordinal variables, and in particular how to use the so-called "polychoric correlations" for ordinal scaled variables, can be found in
Drasgow (2004). The algorithm tries to determine normally distributed continuous
variables (latent) behind the ordinal scaled variables and then to measure the
association between them.
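The difference between the two coefficients can be illustrated with invented ordinal ratings; a sketch requiring SciPy:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Invented ordinal ratings (1-5 scale) and a metric variable
rating = np.array([1, 2, 2, 3, 4, 4, 5, 5])
income = np.array([1.2, 1.8, 2.0, 2.6, 3.9, 5.5, 8.1, 12.0])  # thousand EUR

# Pearson assumes equal distances between the scale items
r_p, _ = pearsonr(rating, income)
# Spearman's rho only uses the ranks, so it suits ordinal data
r_s, _ = spearmanr(rating, income)
print(round(r_p, 2), round(r_s, 2))
```

Because the income values grow non-linearly but strictly monotonically with the rating, the rank-based Spearman coefficient comes out higher than Pearson's here.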
The correlation coefficient can take on values in the interval −1 to +1. If the correlation coefficient is zero, there is no (linear) correlation between both variables. In the case of a correlation coefficient of exactly −1 or +1, all the values
are on a straight line. Table 4.2 shows typical interpretations of the strength of the
correlation coefficient.
The SPSS Modeler calculates the Pearson’s correlation. The following example
demonstrates how to calculate this measure. We use the dataset “test_scores”. In
this dataset we can find several variables, e.g., pretest/posttest results and the
number of students in a class. Following the idea of bivariate dependency we can
expect other dependencies between these variables.
5. The correlations between the variables can be calculated in the second part of the
window at the bottom. Firstly, we suggest adding all the variables of interest into
the upper window. To do this we click the button marked with an arrow in
Fig. 4.20. Pressing the keyboard's Ctrl button at the same time, we select the
variables “n_student”, “pretest”, and “posttest”, and we click “OK”. We then get
the dialog window as shown in Fig. 4.21.
6. In the next step, we must define the variables with which the correlations are
calculated. We click the button on the right side of the dialog window as marked
in Fig. 4.21 with an arrow.
7. Then we add the same three variables “n_student”, “pretest”, and “posttest” to
the list and we click “OK”. Figure 4.22 shows the result of this operation.
8. We click “Run” to close the dialog window and to get the results. Figure 4.23
shows two of three correlation coefficients. If we scroll down we can find the
third.
In relation to Table 4.2 the question at hand is why the correlations in Fig. 4.23
are described as strong. There are two options offered in the Statistics node to
" The Modeler offers two options in the Statistics node for assessment
of correlations. By default the values are labeled based upon their
inverse significance. The larger the value, the more reliable the deter-
mined correlation. One can decide to label the correlations based on
their absolute value too. We recommend using the default option.
We should scroll through the Modeler results and identify all correlations that
are at least approximately moderate—that means their absolute value is equal or
larger than 0.5. See Table 4.2 for details. Here all the correlations are approximately
moderate, so we should have a look at all of them. The next step is to identify and
exclude spurious correlations.
Table 4.3 Labels assigned by default in the Modeler's Statistics node to the correlations

Inverse significance 1 − p    Label and interpretation
0 up to 0.9                   Weak: the correlation between both variables is questionable
Larger than 0.9 up to 0.95    Medium: the correlation between both variables can exist
Larger than 0.95              Strong: both variables are correlated
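The default labeling rule can be re-enacted outside the Modeler. A sketch with SciPy; the thresholds follow Table 4.3, but the helper function and the data are our own illustration, not IBM's code:

```python
from scipy.stats import pearsonr

def modeler_label(x, y):
    """Label a correlation via its 'inverse significance' 1 - p,
    approximating the default rule of the Statistics node."""
    r, p = pearsonr(x, y)
    inv = 1 - p
    if inv > 0.95:
        label = "Strong"
    elif inv > 0.9:
        label = "Medium"
    else:
        label = "Weak"
    return round(r, 3), label

# Invented, almost linear data -> tiny p-value -> labeled "Strong"
x = [12, 15, 17, 21, 25, 29, 33, 35]
y = [48, 55, 59, 70, 77, 90, 95, 99]
print(modeler_label(x, y))
```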
Related exercises: 3, 4

4.5 Correlation Matrix

In Sect. 4.4, we discussed how to calculate the correlations and to summarize
the findings in the form of a correlation matrix. The outlined procedure can be used
in all cases. However, calculating the correlation coefficients and arranging them
manually in the form of a matrix is cumbersome; we would therefore like to propose
another, more convenient method.
We would like to improve the correlation calculation stream described in
Sect. 4.4. Therefore we start by using the results of the previous section and the
stream we created there.
" The correlation matrix can be calculated by using a Sim Fit node, but
there are some restrictions, especially if many values are missing.
Furthermore, the approximation of the distributions as determined
can result in misleading correlation coefficients. The Sim Fit node is
therefore a good tool, but the user should verify the results by using
other functionalities, e.g., the Statistics node.
4. Now we double-click the Sim Fit node to open its parameter dialog, as shown in
Fig. 4.27.
The Sim Fit node tries to find a distribution that fits the values of each variable. If we have a dataset with a huge sample size, we can restrict the
number of values used to estimate the distribution. The option “number of cases
to sample” in the node parameters gives us the chance to avoid long sampling
procedures by setting a different sample size. See the option marked by an arrow
in Fig. 4.27.
Furthermore, we can determine the statistics being used to find the “correct”
type of distribution. Here, the Anderson–Darling and the Kolmogorov–Smirnov
criteria are offered by the Modeler. Both statistics can be used in the case of continuous variables. The Anderson–Darling statistic ensures the best
results, however, especially for the tails of the distribution. See also Vose (2008,
p. 292). We use this criterion here and close the dialog window with “OK”.
5. Now we can run the new part of the stream. We click with the right mouse button
on the Sim Fit node and choose “Run”.
6. The Modeler determines the distribution of values that fits best, in terms of the
criterion activated above. A Sim Gen node is added to the stream. See Fig. 4.28.
7. We can inspect the results by double-clicking on the Sim Gen node. Figure 4.29
shows the distributions determined by the Modeler in the fourth column.
The advantage of the Sim Fit node is that it easily calculates the correlation
matrix. Nevertheless, there are some pitfalls because of the approximation of the
frequency distributions, particularly when there are a lot of missing values. Interested readers are referred to IBM (2014, pp. 275–276). The approximation criteria of Anderson–Darling and of Kolmogorov–Smirnov are explained by
Vose (2008, pp. 292–295). This author also prefers the Anderson–Darling test
statistic.
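Both criteria are also available in SciPy, which allows a quick illustration of the fitting idea behind the Sim Fit node. A sketch with simulated data; plugging the estimated parameters into the KS test is a simplification for illustration only:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sample = rng.normal(loc=50.0, scale=5.0, size=2000)

# Kolmogorov-Smirnov: largest distance between empirical and fitted CDF
ks = stats.kstest(sample, "norm", args=(sample.mean(), sample.std(ddof=1)))
print(round(ks.statistic, 3))        # small for well-fitting data

# Anderson-Darling: a similar idea, but the tails are weighted more heavily
ad = stats.anderson(sample, dist="norm")
print(ad.statistic, ad.critical_values)
```

Comparing the statistic with the tabulated critical values tells us whether the assumed distribution should be rejected; the heavier tail weighting is why Vose (2008) prefers Anderson–Darling.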
Comparing the results of the scatterplot and the correlation matrix, we can identify pairs of variables that are possibly dependent. Unexpected dependencies will probably also become apparent here.
The methods used are based on values, but correlation does not imply causation. Therefore, we have to ask whether the identified pairs of variables could really be dependent. In other words: is there any chance of a logical dependency between the
variables? Or in this example: is there any chance that the number of students in the
classroom causes better test results? For all the correlations that are identified and
summarized as at least moderate in Table 4.4, the answer is yes. They are not
spurious. We do not have to exclude any dependency here.
We will show in the following exercises that this is very often not the case,
however. We will find examples for spurious, or at least questionable, correlations.
An example of such a dependency is the possible correlation between the number of people with cancer in a region, the presence of a nuclear power plant, and the number of technical failures at that plant.
Related exercises: 5
4.7 Contingency Tables
Theory
In Sect. 4.4, we discussed how to calculate Pearson’s correlation coefficient. This
measure gives us the chance to determine the strength of a linear relationship
between two variables, if they are either continuous or at least ordinal (Fig. 4.31).
In the case of discrete values, and especially nominal or ordinal variables, we
can count the values and calculate their pairwise frequency. Alternatively, we can
bin metric values as described in Sect. 3.2.7. At the end of the process we get a
contingency table. To determine if both variables are independent, the Chi-square
test can be used.
– The test cannot be used to determine the direction and the strength
of the relationship between the variables.
The test as described in the box tells us "only" whether both variables are independent. If we cannot reject the null hypothesis, we can say that the data do not let us expect that there is a relationship. Only if we reject it can we conclude that both variables are dependent, or in other words, that there is a contingency.
We cannot use the chi-square test of independence to determine the direction of
the contingency in every case, however, and additionally we cannot measure the
strength of the contingency. The Modeler also has limitations in this regard. If we
would like to determine Cramer’s V or the Phi coefficient as a measure of the
strength of contingency, we can for example use R to calculate them.
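Since the Modeler does not report such a strength measure, the computation can be done outside of it. Here is a sketch in Python rather than R; apart from the 1087 public schools with standard methods mentioned in the text, the cells of the table are invented:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Contingency table: school type (rows) x teaching method (columns);
# only the 1087 comes from the text, the other counts are invented
observed = np.array([[1087, 480],
                     [ 211, 355]])

chi2, p, dof, expected = chi2_contingency(observed)

# Cramer's V as a strength measure: chi2 scaled by sample size and
# the smaller table dimension minus one
n = observed.sum()
k = min(observed.shape) - 1
cramers_v = np.sqrt(chi2 / (n * k))

print(round(p, 4), round(cramers_v, 2))
```

A p-value close to zero lets us reject independence, and Cramér's V (between 0 and 1) then quantifies how strong the contingency is.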
As we can see, 1087 schools are public and use standard teaching methods.
Later we will explain a better method to assess the dependency. For now we
want to look at the test results for the chi-square test of independence.
Fig. 4.36 Absolute frequencies and the Chi-square test results in a Matrix node
7. At the bottom of the dialog window in Fig. 4.36, we can find the results of this
test. Let’s focus here on the interpretation of the probability.
(i) The null hypothesis of the test is that there is no dependency between the
school type and the teaching methods.
(ii) We must compare the determined probability of 0 % with a 5 % probabil-
ity value.
(iii) As the determined probability of 0 % is smaller than 5 %, we can reject the hypothesis. In other words, the probability of wrongly rejecting this hypothesis is approximately 0 %. At a confidence level of 95 %, we can therefore reject this hypothesis.
So the data tell us that there is reason to consider a dependency between both variables. Figure 4.37 once more summarizes the procedure
for using the Chi-square test of independence. The option to determine the
expected frequency is shown in Fig. 4.39.
8. To have the chance to look at the data from a different angle, we want to add
here another Matrix node to the stream. The variable settings are the same as
those shown in Fig. 4.35. Figure 4.38 shows the final structure of the stream.
9. In the Matrix node at the bottom, we make the following changes to the parameters, as shown in Fig. 4.39: we activate the tab "Appearance". Here, we choose the options "Percentage of column" and "Include row and column totals". After that we can run the Matrix node.
10. Of course the test result from the Chi-square test is the same in Fig. 4.40 as it is
in Fig. 4.36, but we can now find the relative frequencies of the teaching
method dependent on the school type in each column. The process is also
explained in Fig. 4.41.
If the teaching methods were equally distributed over both school types, we should find in each column approximately the same percentages in the same order. As we can easily see here, this is not the case. So we can guess, just from looking at this type of contingency table with relative frequencies, that there is a dependency between the teaching method and the school type. The Chi-square test gives us statistical justification for that conclusion.
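The column percentages produced by the second Matrix node correspond to a column-normalized cross table. A sketch with pandas and invented records:

```python
import pandas as pd

# Invented raw records: one row per school
df = pd.DataFrame({
    "school_type":     ["Public"] * 6 + ["Non-public"] * 4,
    "teaching_method": ["Standard", "Standard", "Standard", "Standard",
                        "Experimental", "Experimental",
                        "Standard", "Experimental", "Experimental",
                        "Experimental"],
})

# Relative frequencies per column, as in the second Matrix node
table = pd.crosstab(df["teaching_method"], df["school_type"],
                    normalize="columns")
print(table.round(2))
```

If the columns showed roughly the same percentages, the two variables would appear independent; clearly different column profiles, as here, point to a contingency.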
4.8 Exercises
1. Open the stream “Pearson’s Correlation EXERCISE”. Two variables x and y are
included in the dataset “pearson_correlation.sav”. Open the stream and make
sure you have access to the dataset. Use the Table node to check the connection.
Additionally examine the values. Both variables x and y are probably “depen-
dent”. This should be examined here step-by-step.
2. Following the procedure outlined in Fig. 4.2 in Sect. 4.1, one diagram should be
created, of the frequency distribution of x and y separately. Please use the
appropriate node and create these diagrams now. Describe your findings.
3. The next step in Fig. 4.2 is to create a diagram with bivariate distribution,
referred to as a scatterplot. Add a node to the stream and create the diagram.
Try to identify if there is a dependency between x and y and describe it.
4. Now in the last step, the Pearson’s Correlation Coefficient must be calculated.
Interpret your findings. Outline how the consequences of this analysis show the
necessity of the different steps suggested and summarized in Fig. 4.2.
1. Plan the steps of the analysis using the outlined process in Sect. 4.1.
2. Create a stream, import and examine the data. Outline their meaning.
N.B. Alternatively the template stream “Template-Stream IT_Projects.str”
can be used.
3. Based on the meaning of the variables try to find an important relationship that
can be used to determine the time for development. Then try to identify a
relationship to determine the number of person months.
4. Examine the relationship between the three variables. Create a correlation
matrix. Outline your findings.
4.9 Solutions
Figure 4.42 shows the final stream. It is the basis for answering the questions in this
exercise. We will outline the findings below.
1. Figure 4.43 shows the values of both variables. With a mathematical eye, we can probably spot the formula to calculate the y-values: obviously it is y = x². So there is a strong connection that can be expressed by a formula. Both variables are dependent.
2. In Fig. 4.44 we can see the histogram of variable y.
3. Using a Graphboard node we get the diagram in Fig. 4.45. We can also find a
strong relationship between x and y.
4. Using the Statistics node we can calculate the Pearson’s Correlation. Figure 4.46
shows the settings of the Statistics node parameter to calculate the correlation
and Fig. 4.47 shows the result. The correlation is zero. To interpret the result
correctly, we have to remember the definition of the coefficient that we discussed
in Sect. 4.4. It measures only the linear proportion of the relationship between
both variables. In this example, we have a very strong quadratic dependency
between x and y.
In order to use the correlation coefficient, it is obviously necessary to add a
graphical analysis into the research process. Hence, we have suggested the
combination of mathematical and graphical analysis in our approach. Using
this procedure, we make sure that we reveal “hidden” dependencies between
variables.
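The effect can be reproduced with any x-values that are symmetric around zero; a toy version of the exercise data:

```python
import numpy as np

x = np.arange(-5, 6, dtype=float)   # -5, -4, ..., 5 (symmetric around 0)
y = x ** 2                          # perfect quadratic dependency

r = np.corrcoef(x, y)[0, 1]
print(r)   # 0.0 -- Pearson measures only the *linear* proportion
```

Despite the deterministic relationship between x and y, the Pearson coefficient is exactly zero, which is why the graphical analysis must not be skipped.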
Fig. 4.46 Statistics node setting for Pearson’s Correlation between x and y
1. Figure 4.48 shows the stream to find all the solutions. On the left side we can find
the Excel File node that gives access to the data.
2. To inspect the different values of the variable “firm”, we can use the Table node
below the Excel File node in Fig. 4.48. Figure 4.49 shows the results. The
variable “firm” in the first column can have two values “Intel” and “AMD”.
We can also use the Type node. Figure 4.50 shows the result.
3. To create a scatterplot for "CB" vs. "price", we can use the scatterplot as outlined in Exercise 2 in Sect. 5.2.6. Here we use a functionality of the Graphboard node to determine the different colors of the symbols.
As a first step, in the dialog window of the Graphboard node we select the
variables “CB” and “EUR”, as well as scatterplot for the diagram type. See
Fig. 4.51. Additionally, we have to click on the “Detailed” tab, to determine the
different colors. Figure 4.52 shows the parameters. In the upper right corner of
the dialog window we have to select the variable “firm” to determine the colors
of the circles within the graph.
Fig. 4.49 Table node results, e.g., for the variable “firm”
Fig. 4.50 Type node and the different values for “firm”
same procedure and the second Statistics node behind a Select node in Fig. 4.48, we can find a correlation of 0.986 for the AMD processors.
5. As expected, because of the outlier for the Intel processors, the correlation for AMD is stronger. Finally, we can exclude the outlier and determine the values once more. This is part of Exercise 2 in Sect. 5.2.6, where we use the dependency in regression models.
1. Figure 4.2 provides the steps of an analytical process. We will follow these steps
here also. First we examine the data by using charts. Then we try to measure the
strength of the correlation.
2. To import the data that are given in the form of a simple text file we use the
Variable File node. See also Exercise 5 in Sect. 3.2 for details.
To show the variables and their values we add a Table node. After that we
typically use a Type node to determine the correct scale type of the variables.
Figure 4.58 shows the final stream. Figure 4.59 shows the variables and some
values.
Fig. 4.55 Result of the selection procedure shown with the Table node
Fig. 4.56 Parameter of the Statistics node to determine the correlation coefficient
process. As a result, we get the number of days the programmers and the project
manager will need to write the code. Based on this data, we can estimate
the delivery date. We can summarize the dependencies as follows:
KDSI → PM → TDEV.
4. Analysis of the dependencies between the variables:
To get a first impression of the data, we can use a Data Audit node. Figure 4.60
shows the univariate analysis of the variables. Here, we can also find some
measures of central tendency, volatility, and skewness. The longer right tails of
the distributions of PM and KDSI and their maximum are particularly interest-
ing. We can see that there are some outliers that we have to deal with. For details,
see also the regression models in Exercise 3 of Sect. 5.2.6.
5. We create a scatterplot with the Graphboard node. See Fig. 4.61. In the middle of
the last row, we can definitely see an almost linear relationship between KDSI
and PM. Additionally, we have concerns about the linearity of the relationship
between PM and TDEV (first diagram in row 2). As we will see in Exercise 3 of
Sect. 5.2.6, a polynomial relationship describes that dependency best.
This is because the possibility of writing parallel code depends on the complexity of
the software. The more complex the program, the more modules can be written
at the same time. Then the number of months to develop the software decreases.
Later we will use a logarithmic transformation of TDEV to create a valuable
model. Alongside these findings, we now can only measure the linear proportion
of the correlation. For details of the correlation coefficient see Sect. 4.4 and
Exercise 1. Figure 4.62 shows the result of the statistics node. Table 4.8 shows
the rearranged correlations in the form of a correlation matrix.
Fig. 4.64 Added Sim Fit node to stream “Correlation IT-project variables”
Fig. 4.65 Generated Sim Gen node in stream “correlation IT-project variables”
Figure 4.67 shows the final structure of the stream. We added a Matrix node to
show the results of the Chi-square tests of independence in detail. Table 4.9
explains the results.
Literature
Drasgow, F. (2004). Polychoric and polyserial correlations. In S. Kotz & N. Johnson (Eds.),
Encyclopedia of statistical sciences. New York: John Wiley & Sons, Inc.
Granger, C., & Newbold, P. (1974). Spurious regressions in econometrics. Journal of Economet-
rics, 2(2), 111–120.
IBM. (2014). SPSS modeler 16 source, process, and output nodes. Accessed September 18, 2015,
from ftp://public.dhe.ibm.com/software/analytics/spss/documentation/modeler/16.0/en/mod
eler_nodes_general.pdf
IBM. (2015a). SPSS modeler 17 modeling nodes. Accessed September 18, 2015, from ftp://public.
dhe.ibm.com/software/analytics/spss/documentation/modeler/17.0/en/
ModelerModelingNodes.pdf
IBM. (2015b). SPSS modeler 17 source, process, and output nodes. Accessed March 19, 2015,
from ftp://public.dhe.ibm.com/software/analytics/spss/documentation/modeler/17.0/en/
ModelerSPOnodes.pdf
Scherbaum, C., & Shockley, K. M. (2015). Analysing quantitative data: For business and
management students, Mastering business research methods. London: Sage.
Vose, D. (2008). Risk analysis: A quantitative guide (3rd ed.). Chichester: Wiley.
Regression Models
5
In the previous chapters, we introduced the SPSS Modeler and described how to use
it for basic data operations, descriptive analytics, and visualization of data. In the
remaining chapters, we will look at explorative and inductive models, as well as
data mining methods that allow us to identify hidden structures in the data. We start
in this chapter by introducing the first and one of the most popular classes of models in
data mining, the regression model. The main purpose of this model is to determine
the relationship between a target variable and some input variables. Regression
models are very intuitive, easy to handle, and to interpret. For these reasons, they
are popular and widely used in many different areas, e.g., medicine, finance,
physics, and web analytics, to mention a few.
Building regression models with the SPSS Modeler is quite simple. This chapter
introduces the different regression types and shows how these models are built
with the SPSS Modeler. After finishing this chapter, the reader . . .
1. Knows the differences between the various regression model types and is able to pick a
proper model type for his/her current problem.
2. Is able to build a regression model with the SPSS Modeler for different problems
and is able to apply it to new data for prediction.
3. Can evaluate the quality of the trained regression model and interpret its
outcome.
Before going into detail, we wish to outline some motivating examples, to give
an impression of the type of problems and data that suit the application of a
regression model.
Example 1
Determining the linear relationship between two variables (Gravity constant of the
earth).
Let’s think back to physics class in school. There, we learned that the gravitation
of the earth accelerates the speed of every falling object by a constant g of about
g = 9.801 m/s²
near the equator. This means that a freely falling object increases its velocity by
9.801 m/s² per second of fall. School was a long time ago for most of us, however,
so now we check this basic constant again with the help of regression analysis.
According to physics, there is a linear relationship, h = g·S, where h is the
height from which an object is falling, and S is the squared time in seconds that the
object needs to reach the ground. Figure 5.1 shows the data of such an experiment,
and Fig. 5.2 shows the corresponding scatterplot (see Sect. 4.2 for a description of
how to create a scatterplot). The height in meters is plotted on the abscissa, and the
measured time in seconds squared on the ordinate.
Despite some noise and measurement errors, we see that the data points lie in a
straight line, and thus, the presumption of a linear relationship is justifiable.
Estimation of the slope of this line can be easily done with a linear regression
model, and the result should be close to the real constant.
In this example, there is only a single input variable, and so these kinds of
models are called univariate or simple; but regression models can also be used to
estimate more complex linear variable connections. In these cases, the models are
called multiple linear regression models.
Example 2
Determination of variable relations and prediction of variables (Analysis of pretest
and final exam results).
Students often prepare for exams by taking tests in advance. This gives students
the opportunity to become familiar with the type and complexity of questions
asked, as well as a chance to check their degree of readiness for the final exam.
To find out if pretesting is helpful, we can inspect the relationship between the
performances in both exams.
As well as measuring with an appropriate correlation coefficient, we can model
the exact relationship using linear regression. In addition, we are interested in a
prediction of the final exam scores from future students, based on their pretest
scores. This can be done by applying the built regression model to this new dataset
of pretest results.
IBM provides a dataset called “test_scores.sav” (see Sect. 10.1.31), which
comprises data for the kind of analysis described here. In Sect. 5.2.2, a simple
linear regression modeling process will be discussed in detail, based on this
example.
Example 3
Determining the relationship between multiple variables and building a model for
variable prediction (Prediction of house prices).
We can also use regression analysis to model cases where the target variable
depends on more than just one input variable. These models are called multiple
350 5 Regression Models
• Industrial processes. The goal is to predict the quality of the produced product or
the waste/defective rate in an industrial process based on some process
parameters, such as temperature, acid concentration, and so on.
• Medical research. In a medical study, various blood pressure lowering drugs are
tested on patients with high blood pressure. With regression, the effectiveness of
each drug can be quantified.
• In econometrics, regression analysis is one of the major tools for estimating
important relationships, for example Okun’s law. Okun’s law describes the
relationship between the unemployment rate and losses in a country’s production
rate (Abel and Bernanke 2008).
• Demographic assessment. For many institutions, such as the government, moni-
toring of population size is important, in order to modify policy with a view to
influencing the birth rate.
Following data exploration and preparation, the actual model building process can
begin. When building a model using data mining, the model will only apply to the
5.1 Introduction to Regression Models 351
specific data being used in it. In particular this means that the model only knows this
dataset. A general assumption is that the built model fits unknown data, but often
this is not the case; often the model is unable to predict the values of new data
records. This phenomenon is known as overfitting (see the section below). To validate
the model and prevent overfitting, the built model is typically applied to unknown
data to verify its universality. This process is called cross-validation. In particular,
this method is used to evaluate classification models, such as Neural Networks or
Support Vector Machines (see Chap. 8). We refer to James et al. (2013) for further
information on the concept and variants of cross-validation.
The key idea behind cross-validation and creating a correct modeling process is
to split data into a training dataset and a test dataset, typically in the proportion
70–30 %. Whereas the training dataset is used to estimate the model parameters and
build the model, the smaller test dataset is used for testing the accuracy of these
calculated parameters. This gives us the chance to measure a model’s ability to
generalize.
There are some important aspects to keep in mind when using a partitioning
method to separate the different subsets:
1. The values in both the training and the test set should have the characteristics of
the original sample. Sampling bias has to be avoided. Therefore, a random
selection process should be used to separate these subsets. For sampling
methods, see Sect. 2.7.8.
2. There is a trade-off between the exactness of the training results and the test
results. When the data are separated into the training and the test sets, the number
of records that are used for training the model is reduced.
3. The records of the test dataset cannot be used in the training procedure. The
model has to be processed using only the training dataset.
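The random split described above can be sketched in a few lines of Python. This is an illustrative sketch only (in the Modeler, the Partition node performs this task); the function and variable names are our own:

```python
import random

def train_test_split(records, test_fraction=0.3, seed=42):
    """Randomly partition records into a training and a test set.

    Shuffling with a random generator avoids sampling bias; the fixed
    seed only makes the example reproducible.
    """
    rng = random.Random(seed)
    shuffled = records[:]              # copy, keep the original intact
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]   # (train, test)

data = list(range(100))
train, test = train_test_split(data)   # default 70-30 % split
print(len(train), len(test))           # 70 30
```

Every record ends up in exactly one of the two sets, so the test records never influence the training procedure.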
Often, the precise parameters of the model are unknown and have to be found
during the modeling process. In this case, the dataset is split into three parts: the
training, testing, and validation sets. After fitting the model for each of the
parameter selections using the training data, the validation dataset is then used to
find the model that will most precisely predict the target value of unknown data.
This so-called “best” model has to then be tested again with the testing dataset, to
verify that it is actually suitable for use with independent data. Here, another dataset
must be utilized, since the validation set is now biased, having already been used to
find the optimal parameters. A typical partitioning of the dataset is training data
60 %, validation data 20 %, and test data 20 %.
Figure 5.4 visualizes the modeling and cross-validation process. A dataset can be
split into different parts with the Partition node (see Sect. 2.7.7). In Sect. 5.4.4, we
will use this partitioning method when fitting a polynomial regression.
As mentioned above, cross-validation is also a very convenient way of
preventing overfitting and is therefore often used in the modeling process for data
mining tasks. For these reasons, we recommend always using a cross-validation
setting when building a prediction model.
Overfitting
In statistics, overfitting occurs when a statistical model describes errors or noise
instead of the underlying relationship. In other words, the model describes the
training data very well, but independent and unknown data cannot be modeled. This
results in poor prediction performance. Overfitting typically occurs when a model is
overly complex or has too many parameters. Figure 5.5 illustrates overfitting in
regression. Of course, there always exists a polynomial function on which all
data points are located. This is demonstrated in the left graphic. Although this is
very precise for the present data, this doesn’t represent the actual structure of the
data since measurement noise is modeled as well. So a linear relation describes the
data structure better and is therefore a more suitable model (see the right graphic in
Fig. 5.5). We refer to Kuhn and Johnson (2013) and James et al. (2013) for more
information on overfitting.
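The effect can be reproduced in plain Python (an illustrative sketch, not part of the Modeler workflow): we fit both a least-squares line and the exact interpolating polynomial through noisy, essentially linear data, and compare their errors on a new observation just outside the training range.

```python
def fit_line(xs, ys):
    """Least-squares line y = b0 + b1*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
         / sum((x - mx) ** 2 for x in xs)
    return my - b1 * mx, b1

def interpolate(xs, ys, x):
    """Evaluate the exact polynomial through all points (Lagrange form)."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

# essentially linear data (y = 2x) with small measurement noise
xs = [0, 1, 2, 3, 4, 5, 6]
ys = [0.3, 1.8, 4.4, 5.5, 8.2, 9.7, 12.4]

b0, b1 = fit_line(xs, ys)
x_new, y_new = 6.5, 13.0          # a new observation from the same law

err_line = abs((b0 + b1 * x_new) - y_new)
err_poly = abs(interpolate(xs, ys, x_new) - y_new)
print(err_line < err_poly)        # True: the "perfect" polynomial overfits
```

The interpolating polynomial reproduces every training point exactly, yet it models the noise and therefore generalizes far worse than the simple line.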
5.2 Simple Linear Regression 353
Fig. 5.5 Overfitting in regression. A complex regression function can tend to overfit and fail
to model the data structure
We start by introducing the (simple) linear regression model (SLR), which is the
easiest one in this class of regression models, since it comprises only two variables
and the linear connection between them. We need to mention that a linear regres-
sion model is only suitable if a linear relationship between the two variables exists.
To ensure this is the case, we suggest performing a correlation analysis first, to
verify if regression is suitable. This correlation procedure is described in Sect. 4.4.
5.2.1 Theory
(see, e.g., Figs. 5.2 and 5.6). This is due to noise or measurement errors within the
experiments.
SLR can now be used to quantify and estimate the linear dependency between
these two variables. In other words, the goal is to find a straight line
ŷ = β0 + β1·x
that best describes the data. For that, the unknown coefficients β0 and β1 have to be
estimated in an “optimal way”. The found estimators and the (linear) relationship
between these variables is called the (linear) regression model. This model can then
be used to predict the value of the target variable yₙ₊₁ for a new input xₙ₊₁.
To perform a linear regression, there are a couple of necessary requirements,
which should be checked before trusting the results of the regression.
Necessary Conditions
1. The random noise for each sample has zero mean and the same variance
(homoscedasticity). This means that the noise does not contain any additional
information, but is totally random.
2. The errors are pairwise uncorrelated. This guarantees no systematic effect within
the errors. Instead, the errors are purely random.
3. The number of samples and data points is large enough.
4. Another useful assumption is the Gaussian distribution of noise. In this case,
measures for the goodness of fit for the models can be calculated. This gives us
the opportunity to find out if the linear model fits the data well.
than for line B. In other words, the error made by predicting the measured values
with line A is much smaller than with line B. This is illustrated in Fig. 5.7.
The squared sum of the distances gives us a measure of the total distance from
each line to the observations. The smaller this value, the better the data are
approximated by the line.
The task of linear regression is now to find the parameters βᵢ of the straight line
that minimize the total distance between the data points, i.e.,

∑ᵢ₌₁ⁿ (yᵢ − (β0 + β1·xᵢ))² → min.
The standard method for this task is the method of least squares. A detailed
description of this method can be found in Fahrmeir (2013). The line with estimated
coefficients is called the regression line, and is line A in Fig. 5.6. The distances
from the observations to the values predicted by the regression line are called
residuals.
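The least-squares estimators have a closed form, which can be illustrated in a few lines of Python (a sketch using the textbook formulas, not a Modeler feature; names are our own):

```python
def simple_linear_regression(xs, ys):
    """Least-squares estimates for y-hat = b0 + b1*x (closed form)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope: covariance of x and y divided by the variance of x
    b1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
          / sum((x - mean_x) ** 2 for x in xs))
    b0 = mean_y - b1 * mean_x   # the regression line passes through the means
    return b0, b1

# the exact line y = 3 + 2x is recovered without error
b0, b1 = simple_linear_regression([1, 2, 3, 4], [5, 7, 9, 11])
print(round(b0, 6), round(b1, 6))   # 3.0 2.0
```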
Coefficient of Determination
The coefficient of determination, denoted by R², is a measure that indicates how
well the data fit the estimated regression model. It describes the proportion of the
total variation that can be explained by the regression model. R² is always a number
between 0 and 1, and the higher the R², the better the model represents the data. If
R² = 1, then all points lie exactly on a straight line with no variance. In this case, the
correlation coefficient is 1 or −1. In fact, R² equals the squared (Pearson’s) correlation
coefficient [see Fahrmeir (2013) for more information on the coefficient of
determination].
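The identity between R² and the squared Pearson correlation can be checked numerically with a short Python sketch (illustrative only; variable names are ours):

```python
import math

def r_squared(xs, ys):
    """R^2 of a simple linear regression, together with Pearson's r."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    b1 = sxy / sxx                          # least-squares slope
    b0 = my - b1 * mx                       # least-squares intercept
    ss_res = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    r2 = 1 - ss_res / syy                   # explained share of the variation
    r = sxy / math.sqrt(sxx * syy)          # Pearson's correlation coefficient
    return r2, r

r2, r = r_squared([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1])
print(abs(r2 - r ** 2) < 1e-9)   # True: R^2 equals the squared correlation
```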
There are two different nodes that can be used to build a stream for a simple linear
regression (SLR) in the SPSS Modeler. They are the Linear node and the Regres-
sion node, each of which has advantages and disadvantages. For the SLR, the
choice of node does not essentially matter. Here we will present how to build a
stream for SLR with the Linear node. The Regression node is described later in
Exercise 4, in Sect. 5.3.7, where we invite the reader to get to know this node through
self-study, while building a stream for a multiple linear regression.
We will set up a stream for a simple linear regression with the Linear node based
on the dataset “test_scores.sav” (see Sect. 10.1.31). The question we are trying to
answer is whether a good pre-exam result is an indicator for a good final-exam score
(see Example 2).
1. To build the stream, we first open a new empty stream, add a Source node, and
import the data file. Here, we will use the Statistics File node, as the imported file
is in SAV format (see Sect. 2.1 for how to import data files).
2. Now, we add a Type node and connect the data file node to it. If necessary, the
types of the relevant variables have to be modified. Furthermore, we can define
the role of the variables, e.g., the target and input variable (see Fig. 5.8). For an
SLR, there will be one target variable and one input variable. The roles can be
declared later in the model node if preferred.
" A linear regression model is only suitable for data that have a linear
relationship. Thus, before proceeding to the actual regression analy-
sis, we recommend verifying the linear assumption with the following
additional statistical calculations and plots:
3. Next, we add the Linear node to the stream and connect it with the Type node.
Figure 5.9 shows how the stream should look.
Fig. 5.10 Defining the input and target variables for the regression model
4. Next, we have to set the parameters of the regression model and so we open the
Linear node with a double-click.
5. In the Fields tab, we declare the target and input variables; in the SLR, there is
exactly one of each. This can be done by simply dragging and
dropping. If the target and input variables were already defined, as described
in Step 2, the SPSS Modeler will identify these variables automatically (see
Fig. 5.10).
6. In the Build Options tab, we can choose between “Building a new model” and
“Continue training existing model”. The former option creates a new model
based on the imported data, whereas the latter one updates an existing model. For
further information, see IBM (2015b). Here, we will use the first option, since we
are starting from scratch and want to build a new model.
In the “Basics” section, we can choose whether or not the data should be
automatically preprocessed and adjusted. This procedure is selected by default,
and we recommend leaving it enabled to improve the prediction power. We can
also define the confidence level for the estimated model parameters. The stan-
dard value is a 95 % confidence level (see Fig. 5.11).
7. All other tabs and options in this node are irrelevant for the SLR. They are
explained for multiple linear regression in Sect. 5.3.3.
8. We run the stream to build the model. The estimated model is displayed in a new
node that now appears, called the model nugget.
At the bottom right of Fig. 5.12, the model nugget containing the estimated
model parameter is displayed. The two nodes at the bottom left are the
recommended pre-analyses. They calculate the correlation coefficient and
draw a scatterplot of the data.
9. To get the estimated values of the regression model, we can add an output node,
such as the Table node (see Fig. 5.13).
The model built in the previous section is represented in the SPSS Modeler by the
model nugget (see Fig. 5.12).
The details of the model can be inspected by double-clicking on the model
nugget. Here, we will find the estimated coefficients, goodness of fit statistics, and
other useful information shown in the Model tab. Let’s take a closer look at the most
important ones. Firstly, we will show how to find the estimated coefficients of the
model and identify an equation for predicting the unknown values of the target
variable using these parameters.
In the graphic “Coefficients” (see Fig. 5.14), the SPSS Modeler provides the
information on the estimated coefficients for the regression model. In this simple
Fig. 5.14 Visualization of the estimated coefficients of the linear regression model
regression model, there are only two coefficients, the intercept β0 and the coeffi-
cient β1 for the input variable, in this case, the pretest results.
If the mouse cursor is moved onto the lines of a parameter, a new window pops
up showing the estimated coefficients. The color of the line indicates the sign of the
coefficient. In our example, both coefficients are positive, especially the pretest
results coefficient. This means that a high pretest score will predict a high final
exam score, and a low pretest score will predict a low final exam score.
We can get a more detailed overview of the coefficients (see Fig. 5.15), by
changing the style at the bottom of the dialog window shown in Fig. 5.14 to
“Table”. In the example of the pretest and final exam scores, the estimated regres-
sion equation is
ŷ = 13.212 + 0.981·x.
With this equation, we are now able to predict the outcome of the final exam if we
know the student’s pretest result.
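As a quick check, the prediction equation can be evaluated directly (an illustrative Python sketch; the coefficients are the estimates from Fig. 5.15):

```python
def predict_final_score(pretest):
    """Prediction from the estimated regression equation
    y-hat = 13.212 + 0.981 * x (coefficients from Fig. 5.15)."""
    return 13.212 + 0.981 * pretest

# a student scoring 60 in the pretest is predicted to score about 72 in the final
print(round(predict_final_score(60), 2))   # 72.07
```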
Once we have found the equation of the model, we can predict the final score of an
exam participant based on his/her pretest result, but we should first ensure the
quality of the model, its goodness of fit. We will now discuss several parameters that
quantify the model.
Predicted by Observed
In the scatterplot “Predicted by Observed” (see Fig. 5.17), the values of the target
variable are plotted against the values predicted by the model. The graphic does not
show every single point, but instead shows colored circles that represent a set of
points. The number of data points located in the area of the circle is indicated by the
intensity of the color.
If the model fits the data well, the circles should be arranged in a straight line
along the diagonal. Any other arrangement would indicate that the data do not have
a linear relationship.
Residuals
One of the requirements for linear regression is a Gaussian distribution of the error.
This is verified in the next graphic, which shows the distribution of normalized
residuals in a histogram and the density of a Gaussian distribution (see Fig. 5.18). If
the histogram is approximately the shape of the Gaussian density, the residuals can
be said to follow Gaussian law. This is another indicator of a good fit with the
model; it reflects the assumed distribution of the error. If the residuals are not
Gaussian distributed, then the model is inappropriate for the data, or the data does
not follow a linear structure.
Using the drop down list in Fig. 5.18, that is marked with an arrow, it is possible
to switch to another graphic, the so-called P–P Plot (see Fig. 5.19). This is another
plot used to verify the Gaussian distribution assumption. If the dots are located
around the diagonal, the residuals are distributed normally, if they are not then the
residuals follow a different law. For more detailed information on the P–P Plot, see
Thode (2002).
Fig. 5.20 Expanded table view of the coefficients with the significance level and additional
parameter
Fig. 5.21 Visualization of the test score data with the regression line
The estimated coefficient for the “pretest” variable in Fig. 5.20 is 0.981, which is
the slope of the regression line. Since this slope is close to 1, the improvement in the
final score does not depend on the pretest result: each student improves their score by
roughly the same constant: the intercept 13.212. Figure 5.21 shows the data with
the estimated regression line in the SPSS Modeler.
We have built a model that fits the test score data well and answers the question of the
motivating Example 2. Now, the determined regression equation can be used to
predict final exam scores from the pretest results of other students, e.g., in the next
semester, the final results can be estimated after the students have written the
pretest.
1. Copy the regression model nugget and paste it onto the stream canvas.
2. Then, select an appropriate Source node to import the data. Since in this case the
data are in a standard text file, we choose the Var. File node. Figure 5.22 shows
the table of the input values. The variable names have to be exactly the same as
those in the model, as otherwise, the stream does not work and running it will
generate an error. If the variable names in the new dataset do not correspond to
the names in the model, they can be modified later with an additional Filter node
(see Sect. 2.7.5).
3. Connect the data file node to the model nugget.
4. Now, add a Table node to the stream and connect the model nugget to that. The
stream should look like the one in Fig. 5.23.
5. Run the stream. The output for this example is displayed in Fig. 5.24. A new
column containing the predicted values is added, called $L-posttest. This is the
same value that can also be calculated with the regression equation.
Fig. 5.24 Output of the prediction. The column $L-posttest contains the predicted values
5.2.6 Exercises
5.2.7 Solutions
1. The data are given in a CSV file, so we use the Var. File node to import the data.
2. To calculate the correlation coefficient, we use the Statistics node, and for the
scatterplot, the easiest way is to use the Plot node. The correlation coefficient is
0.995 (see Fig. 5.26).
The data from the gravity example were already plotted at the beginning of
this chapter (see Fig. 5.2). Both the correlation coefficient and the scatterplot of
the data indicate a linear relationship. Hence, a linear regression is adequate for
this problem.
Fig. 5.27 Variable selection of the linear regression of the gravity example
Fig. 5.28 Coefficients of the linear regression model for the gravity data
3. We use the Linear node to build a regression model with the target variable
“seconds squared” and predictor “height”. The variable selection is shown in
Fig. 5.27.
4. After running the stream, the estimated coefficients can be looked up in the
Coefficients view of the model nugget (see Fig. 5.28).
Here the estimated coefficients are β0 = −0.680 and β1 = 0.103. Hence, the
regression equation is

ŷ = −0.680 + 0.103·x.
Fig. 5.29 The coefficients of determination of the linear regression model for the gravity data
Fig. 5.30 P–P Plot of the residuals of the linear regression model for the gravity data
You may recall from the motivating Example 1 at the beginning of this chapter
that the gravity constant is given by the equation h = g·S, where h is the height
from which an object is falling, and S is the squared time in seconds that it needs
to hit the ground. Thus, we get the gravity constant from our regression equation
by ignoring the intercept β0 and inverting the slope β1, hence,
g = 1/β1 = 1/0.103 = 9.709,
which is close to the real gravity constant (see motivating Example 1).
6. The coefficient β1 (the slope) is significant in this model (see Fig. 5.28). The
significance value of the intercept, however, is very high at 0.499; but the intercept is
not very important for the regression model and so can be ignored. The model still
represents the data adequately.
Since we have only a few data points, the distribution of the residuals cannot
be estimated very accurately. The P–P Plot of the residuals indicates a Gaussian
distribution, however (see Fig. 5.30).
7. To predict the time an object falling from 275 m needs to reach the ground,
we build a prediction stream by copying the model nugget and adding a
User Input node, where we define the new predictor value as 275 (see Fig. 5.31).
We connect these two nodes and add a Table node for the output. The stream
should look like Fig. 5.32.
The predicted value of the time an object needs to reach the ground from a
height of 275 m is 27.592 square seconds, hence, 5.253 s (see Fig. 5.33).
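These calculations can be retraced with a few lines of Python (using the rounded coefficients from Fig. 5.28, so the results differ slightly from the Modeler output):

```python
import math

# rounded coefficients estimated by the Linear node (Fig. 5.28)
b0, b1 = -0.680, 0.103

g = 1 / b1                       # invert the slope to recover g
s_squared = b0 + b1 * 275        # predicted squared seconds for h = 275 m
seconds = math.sqrt(s_squared)

print(round(g, 3))               # 9.709, close to the true constant
print(round(seconds, 2))         # about 5.26 s; the Modeler, using the
                                 # unrounded coefficients, reports 5.253 s
```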
Fig. 5.31 Input of the new height for the prediction of its falling time
Fig. 5.33 Predicted squared seconds of an object falling from a height of 275 m
The complete stream for this exercise looks like Fig. 5.34.
1. First, we import data from the xlsx File (see Exercise 2 in Sect. 2.5 for details).
To visualize the data, we use the Graphboard node and use the scatterplot
graphic. To distinguish the two companies by color and shape, we select the
variable “firm” for both (see Figs. 5.35 and 5.36 for the parameters and output of
the scatterplot).
2. The point to the top right of Fig. 5.36 is located far away from the other data.
Thus, we omit this outlier from further analysis. This is done with the Select
node, and the formula given is in Fig. 5.37. The scatterplot now looks like
Fig. 5.38, with the points located around an imaginary straight line.
After these preprocessing steps, the stream should look like Fig. 5.39.
3. To perform a regression for each company individually, we use the Select node
to split the dataset into two disjoint sets (see Fig. 5.40 for selection of the
AMD processor data). We proceed in the same way with the Intel data.
For each processor dataset, we add a Linear node to perform a linear regres-
sion. Select “CB” for the target and “EUR” for the predictor variable, as shown
in Fig. 5.41.
Now, run the stream. Afterwards, it should look like Fig. 5.42.
To visualize the data and the regression line, we add two Plot nodes and
connect them to the two model nuggets. As an overlay, we insert the regression
line formula in the function field (see Fig. 5.43 for a definition of the plot
parameter and Fig. 5.44 for the actual plot of the AMD data). The values of the
coefficients can be found in the coefficients view in the model nugget.
Fig. 5.36 Scatterplot of the benchmark data. The companies are displayed with different colors
and shapes
Fig. 5.37 Removal of the outlier from the top right corner
4. The estimated slopes of the two regression lines are 30.559 for the AMD
processors and 28.707 for the Intel processors. Both coefficients are significant.
Thus, we conclude that the AMD processors have a better cost-performance
ratio, based on this particular benchmark test, since performance levels grow
faster as the cost of a processor increases. To visualize this result, we need to
combine the estimated data from both regression models. This can be done with
the Append node. We add this node to the stream and connect it with both model
nuggets. Afterwards, we add another Plot node and connect it with the Append
node. We select “EUR” as the X and “$L-CB” as the Y variable. Furthermore, we
select “firm” as the color overlay, to plot the regression lines in a different color
(see Fig. 5.45). In the Options tab, we can further choose between a line or dot
plot.
Fig. 5.44 Visualization of the AMD data with the regression line
Fig. 5.45 Parameter selection of the comparison plot of the regression lines
The plot should look like Fig. 5.46, which shows that the cost-performance
ratio of the AMD processors is better than the ratio for Intel.
The final stream for this exercise looks like Fig. 5.47.
1. First, we import the data from “IT-projects.txt” (see Sect. 10.1.19), and we
define the types of the variables.
2. We have already examined the relationships between the three variables in
Exercise 3 in Sect. 4.8, using diverse scatterplots and correlation coefficients.
Refer to that for the first part of the stream. The Pearson’s correlation coefficient
is in all cases quite high, and thus, the correlation is strong (see Fig. 5.48). This
indicates linearity in all cases.
Fig. 5.48 Pearson’s correlation of the three variables KDSI, TDEV, and PM
It should be noted, however, that the scatterplot of the variables (see Fig. 5.49)
shows a linear relationship between KDSI and PM, but a rather polynomial
relationship between KDSI and TDEV and/or PM and TDEV. Thus, we assume
that the variables PM and TDEV follow a polynomial relationship, i.e.,
TDEV = b * PM^m,
with b and m being positive real numbers. This is also the assumption of the
COCOMO model. To build a suitable linear regression model, we transform this
equation into a linear one by taking logarithms, i.e.,
log(TDEV) = log(b) + m * log(PM).
Now, we can estimate log(b) and hence b and m with a linear regression.
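Outside the Modeler, this linearization trick can be sketched in a few lines of Python. The data values and the coefficients b = 2.5 and m = 0.38 below are made up for illustration and are not from the IT-projects dataset:

```python
import numpy as np

# Hypothetical effort data following the assumed power law TDEV = b * PM^m.
# The values are illustrative, not taken from the COCOMO/IT-projects data.
pm = np.array([10.0, 25.0, 60.0, 120.0, 300.0])
b_true, m_true = 2.5, 0.38
tdev = b_true * pm ** m_true

# Linearize via logarithms: log(TDEV) = log(b) + m * log(PM),
# then fit a straight line with ordinary least squares.
m_est, log_b_est = np.polyfit(np.log(pm), np.log(tdev), deg=1)
b_est = np.exp(log_b_est)

# Predictions must be transformed back to the original scale,
# here by applying the recovered power law directly.
tdev_pred = b_est * np.array([350.0, 420.0]) ** m_est
print(round(b_est, 3), round(m_est, 3))
```

Note how the prediction is transformed back from the logarithmic scale, exactly as required later when using the model nugget.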
3. Before building the actual model, we need to apply the logarithm to
the variables PM and TDEV. This is done with the Derive node. We add this
node to the stream after the Type node and select the Multiple mode (see
Fig. 5.50). Also select the two variables, TDEV and PM, which we want to
transform, choose a suitable Field name extension, and insert the formula
log(@FIELD) into the corresponding field. The Derive node now takes each of the
two variables, calculates its logarithm, and adds a new variable to the data with
the extension we selected, in this case, “_trans” (see Fig. 5.51).
Fig. 5.50 Setup of the Derive node to calculate the logarithm of the variables PM and TDEV
These new variables have a linear dependency, as evidenced by a scatterplot
(see Fig. 5.52).
We also have the option of omitting the one outlier that is located above the
line, to improve the regression. We can remove the outlier with the Select node
and the formula shown in Fig. 5.53. This increases the correlation coefficient to
almost one (see Fig. 5.54), which confirms linearity.
Fig. 5.51 Data with the new logarithmic variables PM_trans and TDEV_trans
To finish these data preparation steps, we add another Type node to the stream
and define the correct data types (continuous) for the variables PM_trans and
TDEV_trans.
Now, we are ready to perform the two linear regressions. We add two Linear
nodes to the stream and connect each of them with the last Type node. For each
regression, we select the appropriate variables, i.e., KDSI and PM for the first
one and PM_trans and TDEV_trans for the second (see Figs. 5.55 and 5.56).
Then run the stream and the two model nuggets will appear.
All estimated coefficients are significant and have the values displayed in
Table 5.1.
Fig. 5.55 Selection of the variables for the regression KDSI versus PM
" Warning: The coefficients for the second regression model describe
the linear relationship of the logarithmic variables. To predict the
original variable TDEV from PM, the predicted values have to be
retransformed via the exponential function.
4. Figure 5.57 shows the complete stream for predicting the TDEV and PM of
KDSI 350 and KDSI 420.
First, we use the User Input node to insert the variable KDSI with the new
values of 350 and 420, and we add the usual Type node. To predict PM, we
copy the model nugget for the regression model KDSI vs. PM and add it to the
stream. As in the stream for building the model, we need to add the new variable
PM_trans with the logarithmic values of the just-estimated values of PM. We use the
Derive node for that purpose (see previous step). We copy the second regression
model nugget and add it to the stream. This will predict the logarithmic value of
TDEV. To get the prediction of TDEV and not of log(TDEV), we have to add a
further Derive node, which will transform the estimated values back to the
original ones by calculating their exponentials (see Fig. 5.58 for the setup of
this Derive node).
Fig. 5.56 Selection of the variables for the regression PM versus TDEV
Fig. 5.58 Retransformation of the logarithmic values to the original variable TDEV
Fig. 5.59 Prediction of PM and TDEV for KDSI = 320 and 450
Finally, we add a Table node to output the predicted values. The predictions of
PM and TDEV for KDSI = 320 and 450, respectively, are displayed in Fig. 5.59.
1. Both a scatterplot and correlation coefficients (see Sects. 4.2 and 4.4) can be
used to indicate a linear relationship. A boxplot and variances can only give
insight into the distribution of the variables. Furthermore, two variables can
have the same variance, but have no relationship to each other.
2. All answers except the fourth are correct. A regression model does not repro-
duce the original data. Instead, it is used to predict values for new input values,
based on the training data. This means that even for the original input data, the
model outputs only an approximation.
3. No. A high significance value indicates that the coefficient is irrelevant to the
model fit, and thus, it is probably more suitable to remove this variable to
improve the fit. Hence, the significance levels of the coefficients are crucial
for the interpretation of the model.
4. Yes (see Sect. 5.2.1).
5. Yes. The method of least squares is the most common method used in this
context. There are other goodness of fit measures that can be considered
however, for example, the method of least absolute deviations.
6. No. The whole range of the input variable can be considered.
7. No. It is exactly the opposite. The coefficient of determination gives the
proportion of variability that can be explained by the model (see Sect. 5.2.1).
8. Yes (see Sect. 5.2.1).
9. Yes (see Sect. 5.2.4).
10. Yes.
5.3.1 Theory
This equation can also be described in the shorter vector form, i.e.,

h(x) = β^t · x,

where β and x are (p+1)-dimensional column vectors, with entries β0, . . ., βp and
1, x1, . . ., xp. Furthermore, β^t is the transpose of β, and · is the typical scalar
product for vectors.
Thus, instead of a regression line, here, we have to find a hyperplane in a high
dimensional space that fits the data in an optimal way. In other words, we have to
estimate the coefficients β0, . . ., βp such that h minimizes the squared distance to the
observations.
The estimation of these coefficients is done with the method of least squares, just
as in SLR. We recommend Fahrmeir (2013) for a precise description of how the
coefficients are estimated in cases of multiple regression.
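For readers who want to see the least-squares machinery itself, the following Python sketch estimates the coefficient vector β of an MLR directly. The data are synthetic and independent of any dataset in this book:

```python
import numpy as np

# Minimal sketch of multiple linear regression by least squares:
# h(x) = beta^t * x with a leading 1 in x for the intercept beta_0.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
beta_true = np.array([1.0, 2.0, -0.5, 3.0])      # beta_0, ..., beta_3
y = beta_true[0] + X @ beta_true[1:] + rng.normal(scale=0.01, size=100)

# Design matrix with an intercept column of ones.
A = np.column_stack([np.ones(len(X)), X])
beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
print(np.round(beta_hat, 2))
```

The hyperplane h that minimizes the squared distance to the observations is exactly the solution of this least-squares problem.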
Necessary Conditions
The conditions that are necessary for MLR are similar to the ones for SLR.
1. The random noise for each sample should have zero mean and the same variance
(homoscedasticity). This means that the noise does not contain any additional
information, but is totally random.
2. Errors should be pairwise uncorrelated. This guarantees no systematic effect
from the errors. Instead, the errors are purely random.
3. The number of samples should be large enough, so that the coefficients can be
calculated.
4. The last assumption is a Gaussian distribution of noise. In this case, measures for
the goodness of fit for the models can be calculated. This gives us an opportunity
to find out how well the linear model fits the data.
Adjusted R2
You may recall from SLR that R2, the coefficient of determination, gives a measure
of how well the model fits the data. R2 however has the drawback that it grows as
the number of input variables increases, even if these variables do not have any
effect on the output variable. Due to this fact, MLR often uses an adjusted R2
instead, which is a slight modification of R2 and takes the number of input variables
into account. In the case of one input variable, the adjusted R2 is exactly the
coefficient of determination, as described in the SLR section (see James
et al. (2013) and Fahrmeir (2013) for information on adjusted R2).
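One common definition of the adjusted R² (the formula used by a given tool may differ in detail) can be computed as follows; the small example values are invented:

```python
import numpy as np

def r_squared(y, y_hat):
    """Coefficient of determination R^2."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot

def adjusted_r_squared(y, y_hat, p):
    """Common adjusted R^2 for p input variables and n samples."""
    n = len(y)
    r2 = r_squared(y, y_hat)
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

y = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([3.1, 4.9, 7.2, 8.8])
print(round(r_squared(y, y_hat), 4), round(adjusted_r_squared(y, y_hat, 1), 4))
```

The penalty term (n − 1)/(n − p − 1) grows with the number of input variables p, which is exactly what counteracts the automatic growth of the plain R².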
Table 5.2 Variable selection methods for linear regression in the Linear and Regression nodes

– Include all predictors/Enter (Linear and Regression nodes): The naive
approach, where all input variables are included.
– Forwards (Regression node): An iterative method. Starting with no variables,
in each step the variable that improves the model the most is added, until no
further improvement is possible. The improvement is quantified by a selection
criterion.
– (Forward) Stepwise (Linear and Regression nodes): Like the Forwards method,
but variables can be both added and removed in each step.
– Best subset (Linear node): The subset of variables that gives the best
criterion value is selected as the input variables for the regression model.
– Backwards (Regression node): The opposite of the Forwards method. Starting
with the complete model, the input variables are removed stepwise, according
to the selection criterion, until the model can no longer be improved.
There are a number of algorithms and methods for selecting the variables for
inclusion in the final regression model. The SPSS Modeler provides the selection
methods that are listed in Table 5.2. There are many other techniques however,
which can be looked up in Fahrmeir (2013).
The main step in all of these selection methods is the validation of models with
different subsets of the input variables. There are a variety of validation criteria.
The following are implemented in the SPSS Modeler: Information criteria (AICC),
F-Statistics, Adjusted R2, and Averaged squared errors. From these validation
criteria, the AICC is the most common and thus typically consulted. For a detailed
description of these methods and further criteria, see Fahrmeir (2013).
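As an illustration of the selection logic, the following Python sketch implements a simple forward selection driven by an AICc-style criterion on synthetic data. It is a didactic simplification, not the exact algorithm of the SPSS Modeler:

```python
import numpy as np

def fit_sse(X, y, cols):
    """OLS fit on the selected columns (plus intercept); returns the SSE."""
    A = np.column_stack([np.ones(len(y))] + [X[:, c] for c in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)

def aicc(sse, n, k):
    """Corrected Akaike criterion; k = number of estimated parameters."""
    return n * np.log(sse / n) + 2 * k + 2 * k * (k + 1) / (n - k - 1)

def forward_select(X, y):
    n, p = X.shape
    chosen, remaining = [], list(range(p))
    best = aicc(fit_sse(X, y, chosen), n, 1)          # intercept-only model
    while remaining:
        scores = {c: aicc(fit_sse(X, y, chosen + [c]), n, len(chosen) + 2)
                  for c in remaining}
        c_best = min(scores, key=scores.get)
        if scores[c_best] >= best:                    # no further improvement
            break
        best = scores[c_best]
        chosen.append(c_best)
        remaining.remove(c_best)
    return chosen

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.1, size=200)
print(sorted(forward_select(X, y)))
```

The loop stops as soon as adding another variable no longer lowers the criterion, which is the shared skeleton of all stepwise methods in Table 5.2.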
Just as in SLR (see Sect. 5.1.2), there are two ways to build a Multiple Linear
Regression (MLR) stream, one of which is using the Linear node and the Regression
node. Both these nodes assume a Gaussian distributed target variable and no
interactions between the input variables. If one of these conditions is deemed
invalid, a GenLin or GLMM node may be more appropriate since they can be
used for more general models. They are capable of considering different
distributions of the target variable and correlations between the input parameters.
In this section, we will show how to perform an MLR with the Linear node, which is
much simpler and more intuitive than the GLMM node. For a description of the
GLMM and GenLin nodes, see Sect. 5.4.
We set up a stream for an MLR with the Linear node using the Boston housing
dataset (see Sect. 10.1.17). This dataset describes the house prices (MEDV) in
certain neighborhoods of Boston. The Regression node is described in Exercise 4 of
Sect. 5.3.7, and we recommend doing the exercise in order to get to know that node
too, and to identify differences with the Linear node, and the advantages and
disadvantages of its use.
1. We open a new empty stream, add a Source node, and import the data file. Here,
we use the Var. File node, since the imported file is in .txt format.
2. To perform an MLR, we need to assign a target variable and input variables. This
can be done in the Type node (see Fig. 5.60), or later in the Linear node, where
the model parameters are defined.
3. We add the Linear node to the stream and connect it to the Type node.
Figure 5.61 shows the stream.
4. Now, we open the Linear node with a double-click, to set the parameters of the
regression model.
Fig. 5.60 Detailed view of the Type node. The variable MEDV is the target variable and
describes the house prices in a neighborhood of Boston
5. In the Fields tab, the target and input variable, which should be considered when
building the model, are declared. This can be done by simply dragging and
dropping. If the target and input variables are defined as described in step 2, the
SPSS Modeler automatically identifies the correct roles of the variables
(Fig. 5.62).
6. As in SLR, we can choose between “Building a new model” or “Continue
training existing model” in the Build Options tab. Here, we chose to build a
new model.
Fig. 5.62 Selection of the target variable and input variables that should be considered when
building the model. The meaning of the variables can be looked up in Sect. 10.1.17
Moreover, we can also define the model selection method and the validation
criteria (see Fig. 5.63). Possible methods are “Include all predictors”, “Forward
stepwise”, and “Best subset”. The entry/removal criteria (validation criteria) for
the variables offered by the SPSS Modeler are “Information Criterion (AICC)”,
“F statistics”, “Adjusted R2”, and “Overfit Prevention Criterion (ASE)”. For
further information, see Sect. 5.3.1 and Table 5.2. Here, we decide to use the
selection method “Best subset” and the “Information criteria (AICC)”.
7. We run the stream to build the model. The model nugget appears and is included
in the stream (see Fig. 5.64).
Fig. 5.63 Options within the model selection method and validation criteria
As in SLR, we can get a more detailed view of the estimated coefficients, accuracy
values, and other useful information about the model by double-clicking on the
model nugget. In the following explanations of the model nugget, we only focus on
relevant and interesting results and information that differ from SLR, or were not
explained in Sect. 5.2.4 because they were unimportant there. The parameters and
graphics which are exactly the same as in the simple linear case are, for example,
“Summary of the model” and “Predicted by Observed”. We refer to Sect. 5.2.4 for a
description of these graphics and views. Please note that the accuracy in the model
summary is the adjusted R2.
We start by showing where the estimated coefficients can be found, and how to
determine the regression equation.
As before, the color of the line indicates the sign of the variables’ coefficient.
Here, a darker color means that the variable has a positive influence, and a lighter
line indicates a negative effect on the predicted variable. More precisely, an
increase in the crime rate (CRIM) will decrease the price of the house, whereas a
better accessibility to rapid highways (RAD) will increase the selling price of the
house.
This graphic only displays the variables included in the model by the selection
method. Hence, we can get a first impression and list of the final chosen variables of
the MLR model.
We can get a better overview of the coefficients by changing the style at the
bottom to “Table” (see Sect. 5.2.4). The table that appears shows a list of the
coefficients and their estimates (see Fig. 5.66).
We would like to point out the case when an ordinal/categorical variable is
included in the model. In this case, for each possible value of this variable, a
coefficient is estimated and displayed in the list (see the rectangle in Fig. 5.66 for
an example of the CHAS variable in the Boston housing dataset).
Predictor Importance
In the Predictor Importance view, the relative importance of the predictors in the
model is visualized (see Fig. 5.69). This is the relative effect of the various variables
on the prediction of the observation value in the estimated model. The length of the
bars indicates the importance of the variables in the regression model. The impor-
tance values of the variables are positive and sum up to 1.
The values of the predictor importance are not the relative proportions of the
variable coefficients, although the coefficients do describe the effect on the target
variable when the input variables change. The input variables are scaled differently
however, which means that increasing one variable by one unit has a different effect
than increasing another variable by one unit. Thus, to compare the effects of the
variables with each other, and to assign a meaningful (relative) importance to the
predictors, more effort has to be put into the calculation, which the Modeler
does automatically.
For a detailed description of the algorithm used here, we refer to IBM (2015a).
Fig. 5.68 Summary of the “best subset” variable selection process for the Boston housing data
As with the Coefficients view, we can switch to a Table view at the bottom of the
window (see arrow in Fig. 5.70), which then shows the significance levels and other
statistical values of the ANOVA, such as the F-statistic, of each variable effect and
of the whole model (Fig. 5.71). For information on ANOVA, see Kutner
et al. (2005) and Winer et al. (1991). The first small table can be expanded into a
detailed one, with all the above-mentioned statistics, by clicking on the “Corrected
Model” field.
Fig. 5.71 Table view of the coefficient effects of the regression model
Fig. 5.72 Overview of the coefficients and their significance level with additional statistical
parameters
Building a stream to predict unknown values with the MLR model is done in
exactly the same way as described for SLR. Thus, we omit a detailed description
here and refer to Sect. 5.2.5. Do recall, however, that the variable names in the new
dataset must coincide with the names of the model variables.
Up until now, the estimated model was tested as to how well it fits the data it was
built on. The model would already know these data records, although usually a
model is used to predict values from unknown data. So a model should only really
be deemed suitable if it gives a good approximation for general and unknown data,
and hence, overfitting of the model onto the training data should be avoided. The
necessary test for this suitability is called cross-validation. It is the process of
assessing how well the regression generalizes with an independent dataset (see
Sect. 5.1.2).
We can utilize the Partition node to perform a cross-validation in the SPSS
Modeler. With this node, we split the data into two disjointed datasets, the training
data and the testing data. Afterwards, the model is built with the training data and
validated with the unknown test dataset (see Sect. 5.1.2).
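The idea behind the Partition node can be mimicked in a few lines of Python on synthetic data: split the records into disjoint training and testing sets, fit on the training part, and compare the RMSE on both parts:

```python
import numpy as np

# Sketch of the Partition-node idea with made-up data (not Boston housing).
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 2))
y = 1.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=300)

idx = rng.permutation(300)
train, test = idx[:210], idx[210:]           # roughly a 70/30 partition

A = lambda rows: np.column_stack([np.ones(len(rows)), X[rows]])
beta, *_ = np.linalg.lstsq(A(train), y[train], rcond=None)

# RMSE on both partitions; similar values indicate no overfitting.
rmse = lambda rows: float(np.sqrt(np.mean((y[rows] - A(rows) @ beta) ** 2)))
print(round(rmse(train), 3), round(rmse(test), 3))
```

If the test RMSE is only slightly above the training RMSE, the model generalizes well; a large gap is the typical symptom of overfitting.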
The multiple regression model with cross-validation is presented for the Boston
Housing dataset. This stream can be found under the name “cross-validation MLR”.
Related exercise: 5, 6
Fig. 5.73 Partitioning the data into training and testing data
" The Select nodes can be automatically created by the SPSS Modeler.
To do this, we use the “Generate” field in the Partition node (see
upper arrow in Fig. 5.73 and Exercise 9 in Sect. 2.7.11).
3. To validate the estimated model, we copy the model nugget, paste it into the
stream canvas, and connect it to the test data.
4. We add an Analysis node to each of the model nuggets and run the stream. The
Analysis nodes compare the predicted values with the original values and
calculate various statistics for their difference or errors (see Figs. 5.75 and 5.76).
To decide if the regression model is applicable to unknown data and capable
of predicting the values, we compare the error statistics of both the training
and test data, particularly the standard deviation, which is also called RMSE and is
an indicator of the accuracy of the model (see Sect. 5.1.2). The value of the
standard deviation of the test set (5.432) is a bit higher, which is usual, but not
far from that of the training set (4.413). So, the model can be used as a
good predictor for independent datasets.
Theory
Instead of a standard model, we can also create ensemble models to improve
stability and accuracy. Ensemble models, or methods, combine multiple models,
each predicting the target variable, and these predictions are afterwards aggregated
into one prediction. Typical aggregation functions are the mean and the median.
This concept of ensemble model prediction is illustrated in Fig. 5.77. With this
procedure, ensemble models are more likely to reduce the bias or prevent
overfitting of the training data, thus obtaining better predictions. These models are
not only useful for regression, but are even more common in classification
problems, for example in decision trees (see Sect. 8.8). See Tuffery (2011) and
Zhou (2012) for more information on ensemble methods.
The SPSS Modeler provides us with two ensemble methods: bagging (bootstrap
aggregation) and boosting.
Boosting
This method generates a sequence of models to increase the accuracy of the
predictions. To build each successive model, the records are weighted based on the
previous model’s residuals: records with large residuals get a higher weight than
ones with small residuals. The next component model thus focuses on the records
with large residuals, in order to predict them well too. All component models
are built on the entire dataset. The models generated in this way form the ensemble
model. Since boosting is commonly used in classification, we refer to Sect. 8.8 for
further explanations, or see Tuffery (2011) for a more detailed description on
boosting and IBM (2015a) for the algorithm used by the SPSS Modeler.
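To make the idea tangible, the following Python sketch implements a simplified residual-fitting variant of boosting (L2 boosting with one-split regression stumps) on synthetic data; the record-weighting scheme of the SPSS Modeler differs in detail:

```python
import numpy as np

# Simplified boosting for regression: each successive component model is
# fitted to the residuals of the ensemble so far, so records with large
# residuals dominate the next fit. Synthetic sine data with noise.
rng = np.random.default_rng(7)
x = rng.uniform(0, 10, size=200)
y = np.sin(x) + rng.normal(scale=0.1, size=200)

def stump(x, y):
    """Weak learner: a one-split regression stump."""
    best = None
    for t in np.quantile(x, np.linspace(0.05, 0.95, 19)):
        left, right = y[x <= t], y[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, t, lo, hi = best
    return lambda q: np.where(q <= t, lo, hi)

pred = np.zeros_like(y)
ensemble = []
for _ in range(50):                      # 50 boosting stages, shrinkage 0.5
    h = stump(x, y - pred)               # fit the current residuals
    ensemble.append(h)
    pred += 0.5 * h(x)

print(round(float(np.sqrt(np.mean((y - pred) ** 2))), 3))
```

Each stump alone is a poor model, yet the sequence of residual fits drives the ensemble RMSE down towards the noise level, which is the core mechanism of boosting.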
Fig. 5.78 The concept of bagging. On each newly sampled dataset a model is built, which is then
integrated into the ensemble model
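The bagging concept of Fig. 5.78 can be sketched as follows: fit one model per bootstrap sample and aggregate via the mean (for linear models, averaging the coefficients is equivalent to averaging the predictions). The data are synthetic:

```python
import numpy as np

# Sketch of bagging (bootstrap aggregation) for linear regression: each
# component model is fitted on a bootstrap sample of the data, and the
# ensemble is the mean of the component models.
rng = np.random.default_rng(3)
X = rng.normal(size=(150, 2))
y = 0.5 + 1.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.3, size=150)
A = np.column_stack([np.ones(150), X])

betas = []
for _ in range(25):                           # 25 component models
    rows = rng.integers(0, 150, size=150)     # bootstrap sample with replacement
    beta, *_ = np.linalg.lstsq(A[rows], y[rows], rcond=None)
    betas.append(beta)

beta_bagged = np.mean(betas, axis=0)          # mean aggregation
print(np.round(beta_bagged, 2))
```

Averaging over the bootstrap fits reduces the variance of the estimate, which is why bagging typically yields more stable predictions than a single model.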
Related exercise: 6
Fig. 5.79 Selection option of bagging and boosting in the Linear node
5. In the Model Summary view, we see the quality of the ensemble model and the
reference model, measured by accuracy, i.e., the coefficient of determination.
The reference model is the one that would be created if the standard model
option were chosen (see Fig. 5.79). We observe that the ensemble
method, i.e., bagging, has increased the quality of the prediction (see Fig. 5.81).
6. The calculated predictor importance differs naturally from the importance of
the standard model, since these measures are now a combination of several
single models. This can even result in a different order of variable importance.
The predictor importance is displayed in the Predictor Importance view.
7. How often each input variable was picked in the selection process for the
different models is displayed in the Predictor Frequency view (see Fig. 5.82).
With the slider at the bottom, the number of displayed variables can be
changed.
8. The accuracy of each component model in the ensemble can be inspected in the
Component Model Accuracy view. These are visualized through the dots on the
left of Fig. 5.83. Furthermore, the accuracy of the ensemble model (aggregation
of each component model) and the reference model are plotted. We see that the
ensemble model fits the data better than the reference model and every compo-
nent model, which may seem surprising at first. But this is a great example of how
aggregating several models can increase the quality (accuracy), since the
errors of the single components compensate each other, which leads to better
and more robust predictions.
Fig. 5.81 Quality of the ensemble model compared to the reference model
Fig. 5.82 Visualization of how often each input variable was selected for a model
9. A more detailed list of each component model, including the accuracy and
number of input variables, is shown in the Component Model Details view (see
Fig. 5.84).
10. At the top of the model nugget window, we can declare the model that should
be used for prediction: the ensemble or the reference model. In the case of the
ensemble model, we can even choose the aggregation method. Here,
we select the mean aggregation method (see arrows in Fig. 5.84).
Fig. 5.83 Graphic of the accuracy of the model in comparison with the reference model
11. Figures 5.85 and 5.86 show the statistics of the ensemble model predictions
calculated by the Analysis node for the training and the testing set. Comparing
these with the errors (RMSE) of the standard model (Figs. 5.75 and 5.76), we
see that the ensemble model improves the fit of the model to the data and its
prediction of unknown data.
5.3.7 Exercises
1. Import the data and specify the variable types with the Type node.
2. Add a Regression node to the stream and select MEDV as the target variable and
all other variables as the input.
3. Choose the Backwards method to find the significant input variables and then run
the stream. This can be done in the Model tab.
4. Inspect the model nugget and identify the estimated coefficients and the regres-
sion equation. Which variables are included in the final model, and which
variable has a coefficient of 3.802?
5. What is the value of R2 and the adjusted R2?
6. Include a Partition node in the stream and divide the dataset into 70 % training
data and 30 % test data.
7. Select the partition field in the Fields tab of the Regression node, setting it to use
only the training data in the model building procedure.
8. Add an Analysis node to the model nugget and run the stream again. Is the model
suitable for processing unknown data?
where the degree p of the polynomial term is unknown and has to be determined by
cross-validation.
1. Import the data file and familiarize yourself with the data by performing
descriptive analysis. What relationship can you surmise between the variables
“hp” and “mpg”?
2. Divide the dataset into training, validation, and test datasets with the proportion
60–20–20, respectively.
3. Build a polynomial regression model of the training data for several degrees
p with the Regression node. These should include at least the degrees 1 and 2.
4. Compare the R2 and AIC values of the models. Which of our models fits the training
data best? Perform a cross-validation of the models by adding Analysis nodes to the
model nuggets and comparing their output to the validation set. Which of the models
is the best candidate to predict the “mpg”? Use the statistics of the test set to decide if
the model is appropriate for prediction of “mpg” based on the “hp”.
5. Reflect on the previous step. The cross-validation was performed on a dataset
independent of the training data, in order to find the best model. Why is the test
step on the selected model still important, in order to guarantee a proper
prediction model?
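The degree-selection workflow of this exercise can be sketched in Python; the “hp”/“mpg” data below are synthetic stand-ins, not the real car dataset:

```python
import numpy as np

# Fit polynomial models of several degrees on a training set and pick the
# degree with the smallest RMSE on a separate validation set.
rng = np.random.default_rng(5)
hp = rng.uniform(50, 250, size=200)
mpg = 50 - 0.25 * hp + 0.0004 * hp ** 2 + rng.normal(scale=1.0, size=200)

idx = rng.permutation(200)
tr, va, te = idx[:120], idx[120:160], idx[160:]   # 60-20-20 split

def rmse_for_degree(p):
    coef = np.polyfit(hp[tr], mpg[tr], deg=p)
    return float(np.sqrt(np.mean((mpg[va] - np.polyval(coef, hp[va])) ** 2)))

scores = {p: rmse_for_degree(p) for p in (1, 2, 3, 4)}
best_p = min(scores, key=scores.get)
print(best_p, round(scores[best_p], 3))
```

The held-back test set `te` is then used only once, on the selected degree, to obtain an unbiased estimate of the final model's prediction error.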
1. Import the dataset and divide it into training and testing sets with the proportion
70–30, respectively.
2. Build a boosting regression model with forward selection mechanism.
3. Compare the ensemble model with the standard model. Which one is more accurate?
5.3.8 Solutions
The final stream for this exercise is shown in Fig. 5.87. The stream can be found in
the solutions, under the name “LPGA tour MLR”.
1. We import the data with the Var. File node. Pay attention to the quotes on the
column names. We select the value “Pair and discard” for the Double quotes
variable in the source node. The data consist of 13 variables.
2. We divide the dataset into a training dataset (70 %) and a test dataset (30 %) with
the Partition node, as shown in Fig. 5.73. Afterwards, select the training data and
test data with Select nodes as described in Sect. 5.3.5.
Warning: the selection of included variables and the regression model itself
depend highly upon the generated training data. Therefore, the model and your
results may differ from the ones presented here. To get the same sampling of the
training data, and thus the same results as presented here, choose the same seed
in the Partition node as we did in this solution, which is 1234567.
3. To build an MLR model, we add a Linear node to the stream and connect it to the
Select node of the training set. We choose the options as required by the
exercise, i.e., the “prize money (log)” is the target variable and all other variables
are predictors, except for the “golfers id”, since this is irrelevant information for
an MLR. The variable selection method is Forward stepwise with AICC criteria
(see Figs. 5.88 and 5.89).
We run the stream to build the model. Afterward, the stream should look like
Fig. 5.90.
4. The variables that are included in the final regression model are listed in “Model
Building Summary”. Here, these are rounds completed, average percentile in
tournaments, percent of fairway hits, green in regular putts per hole, percent of
greens reached in regulation, average drive, average strokes per round. This
can be seen in detail in Fig. 5.91. The final selection for the model of the
predictor variables is displayed in the frame.
The AICC value of this model is
AICC = 157.252.
Fig. 5.88 Predictor and target variable definition of the MLR for the LPGS data
Fig. 5.91 Summary of the variable selection for the final model
5. The estimated coefficients can be viewed in the “Coefficients” field and are
listed in Fig. 5.92.
Note that the coefficient of the last variable (average strokes per round) is
displayed as 0. The true estimated coefficient is slightly smaller than 0,
as indicated by the negative sign. Since the Modeler automatically rounds
every real number to the third digit, the actual estimate is not visible to the user,
and we therefore interpret the last coefficient as 0.
Hence, the regression equation can be determined as
h(x) = β^t · x, with

β = (22.984, 0.03, 0.036, 0.037, 6.165, 0.073, 0.022, 0),
and the variables are numbered as listed from the top down in Fig. 5.92.
6. All coefficients except for average strokes per round are significant to a 1 %
level, as can be seen in Fig. 5.92. The coefficient of the mentioned variable is
still significant to a 10 % level. The residuals can be assumed as Gaussian
distributed by appeal to the graphs in the Residual field of the model nugget
(see Figs. 5.93 and 5.94). Finally, the coefficient of determination is 0.904 (see
Fig. 5.95).
7. To predict the prize money potential of the golfers from the test data, we copy
the model nugget, paste it into the stream, and connect it to the Select node of the
test data. We add an output node to it, such as a Table node, to display the
predicted values.
To compare the estimated prize money with the actual prize money, we use a
scatterplot and the Analysis node. This looks like Fig. 5.96, which shows the
second part, with the prediction and comparison of the results of the LPGA
stream.
In Fig. 5.97, we see the distribution statistics of the difference between the
actual prize money and the predicted prize money. As can be seen, the model
predicts the true values very well, since the difference varies only between −1
and 1 around 0, the mean difference is almost 0, and the variation (RMSE) is
very small. This can also be seen through a scatterplot, which is not shown here,
but is easy for interested readers to create themselves.
1. We use the Statistics File node to read the data from the “test_scores.sav” file
(see Sect. 10.1.31). Then, we add the usual Type node.
2. We connect the Type node to a Linear node, select appropriate predictors and the
target variable (posttest), and select the “Best subset” method for variable
selection (see Figs. 5.99 and 5.100). We choose the AICC for the variable
selection criteria. The reader can pick another one if he or she prefers.
Here, we excluded “school” and “classroom” from the set of predictors, as we
want to diminish a possible school- or classroom-related bias. We could, however,
include these variables in the model. As an exercise, we recommend building this
model too and comparing it with the model presented here. What are the differences?
We run the stream!
3. The predictors included in the final model are school_setting, teaching_method,
lunch, n_students, pretest, school_type. This can be inspected in the Model
Building Summary view of the model nugget (see Fig. 5.101).
426 5 Regression Models
Fig. 5.99 Variable selection for the MLR of the test scores
The estimated coefficients of the input variables, which influence the posttest
results, are displayed in Fig. 5.102. There, we can also read the relative impor-
tance of the variables for the prediction of the target variable (posttest score). We
can see that only the variables “pretest” and “teaching method” have a notice-
able importance for the prediction, while the other three predictors are quite
dispensable, as their importance factor is almost 0 (see Fig. 5.102).
4. The teaching method has a significant influence on the posttest score, and its
coefficient is approximately 6. This means that there is a difference in the
results of the posttest scores of about six between the two methods of teaching,
and thus, this has to be considered when picking an approach and explaining the
material to the pupils.
Fig. 5.101 Model building summary view. Variables included in the final model
Fig. 5.102 Coefficients and the relative variable importance of the MLR for test score
– Adjusted R2
– Overfit prevention criterion (ASE)
– F-statistics
– Information criteria (AICC)
5. Yes
6. Yes
7. Yes
8. Yes
9. No
10. Yes
Fig. 5.105 Selection of target and input variables in the Regression node
1. After importing the data, we add the Type node and the Regression node, as
shown in Fig. 5.104.
2. We open the Regression node and in the Fields tab choose the target variable
MEDV, and the input variables as in Fig. 5.105. Thereby, all variables except for
MEDV are input variables.
3. In the Model tab, we choose the Backwards variable selection method from the
drop-down menu (see the arrow in Fig. 5.106). Now, run the stream.
4. The estimated coefficients and the regression equation are located in the Sum-
mary tab of the model nugget under the Analysis header. This is shown in
Fig. 5.107. In summary, there are 11 variables included in the final model, and
RM is the variable with the coefficient 3.802.
5. In the Advanced tab, several tables are displayed, which summarize the model
selection process for each step. Here, we can retrace each step of the variable
selection process, see which variables are included or removed, and the
associated assessment values and test statistics. In particular, in the first table,
the variable selection process can be reviewed, with the removal and inclusion
of the variables in each step (see Fig. 5.108). Here, the AGE is removed in the
first step, and the INDUS variable in the second, as we are using backwards
selection.
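The backwards idea can be sketched as a loop. A hypothetical Python illustration, with a toy scoring function standing in for the F-test criterion the Regression node actually applies:

```python
# Sketch of backwards selection: starting from all predictors, repeatedly
# drop the variable whose removal most improves a score, until no removal
# improves the score any further. The score function is a stand-in for the
# statistical criterion used by the Regression node.
def backward_select(variables, score):
    """score(subset) -> higher is better. Returns the selected subset."""
    current = list(variables)
    while len(current) > 1:
        best_subset, best_score = None, score(current)
        for v in current:
            subset = [x for x in current if x != v]
            s = score(subset)
            if s > best_score:           # removal strictly improves the score
                best_subset, best_score = subset, s
        if best_subset is None:          # every removal makes things worse
            break
        current = best_subset
    return current

# toy score: only "RM" and "LSTAT" carry signal; every extra variable
# costs a small penalty, so the useless ones get eliminated first
def toy_score(subset):
    signal = sum(1.0 for v in subset if v in ("RM", "LSTAT"))
    return signal - 0.01 * len(subset)

selected = backward_select(["AGE", "INDUS", "RM", "LSTAT"], toy_score)
```

With this toy score, AGE is dropped first and INDUS second, mirroring the order observed in the Boston housing example above.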
The next table, labeled “ANOVA”, lists goodness of fit statistics for each
model in the variable selection step (see Fig. 5.109). We can see that the model
quality increases with every step, by considering the sum of squares, for
example.
In the “Model Summary” table in the Advanced tab of the model nugget, the
values of R2 and adjusted R2 can be viewed (see Fig. 5.110). R2 is 0.741 and the
adjusted R2 parameter has the value 0.735.
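These two values are consistent with the standard adjusted-R² formula. A quick plausibility check in Python, assuming n = 506 records in the Boston housing data and p = 11 predictors in the final model:

```python
# Standard adjusted R-squared formula: penalizes R2 for the number of
# predictors p relative to the sample size n.
def adjusted_r2(r2, n, p):
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

adj = adjusted_r2(r2=0.741, n=506, p=11)   # matches the reported 0.735
```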
In the last table, the “Coefficients” table, the estimated coefficients with their
level of significance are listed for every model in the variable selection process
(see Fig. 5.111 for the first part of this table).
In the Expert tab in the Regression node, further statistical outputs can be
chosen, which are then displayed here in the Advanced tab (see Fig. 5.121 for
possible, additional output statistics).
6. We include the Partition node in the stream before the Type node, and divide the
dataset into training data (70 %) and a test dataset (30 %), as described in Sect. 2.7.7.
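The behaviour of the Partition node can be sketched as follows (our own illustration, not IBM's code); fixing the random seed makes the split reproducible:

```python
# Sketch of a 70/30 partition: each record is assigned to the training or
# test set at random; a fixed seed makes the assignment reproducible.
import random

def partition(records, train_fraction=0.7, seed=1234567):
    rng = random.Random(seed)
    train, test = [], []
    for rec in records:
        (train if rng.random() < train_fraction else test).append(rec)
    return train, test

train, test = partition(list(range(1000)))
```

Running the function twice with the same seed yields identical subsets, which is exactly why the seed matters for cross-validation.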
Fig. 5.109 Goodness of fit measures for the models considered during the variables selection
process
Fig. 5.110 Table with goodness of fit values, including R2 and adjusted R2, for each of the model
selection steps
Fig. 5.112 The MLR stream with cross-validation and the Regression node
7. In the Fields tab, we select the partition field, generated in the previous step as
the Partition. This results in the model being built using only the training data
(see Fig. 5.113).
8. After adding the Analysis node to the model nugget, we run the stream. The
window in Fig. 5.114 pops up. There, we see probability statistics for the training
Fig. 5.114 Output of the Analysis node. Separate evaluation of the training data and testing data
data and test data separately. Hence, the two datasets are evaluated indepen-
dently of each other.
The mean error is near 0, and the standard deviation (RMSE) is, as usual, a bit
higher for the test data. The difference between the two standard deviations is
not optimal, but still okay. Thus, the model can be used to predict the MEDV of
unknown data (see Sect. 5.1.2 for the discussion of cross-validation in this chapter).
1. We import the mtcars data with the Var. File node. To get the relationship
between the variables “mpg” and “hp”, we connect a Plot node to the Source
node and plot the two variables in a scatterplot. The graph is displayed in
Fig. 5.116. In the graph, a slight curve can be observed and so we suppose that
a nonlinear relationship is more reasonable. Hereafter, we will therefore build
two regression models, with exponents 1 and 2, and compare them with each
other.
To perform a polynomial regression of degree 2, we need to calculate the
square of the “hp” variable. For that purpose, we add a Derive node to the stream
Fig. 5.115 Stream of the polynomial regression of the mtcars data via the Regression node
Fig. 5.117 Definition of the squared “hp” variable in the Derive node
and define a new variable “hp2”, which contains the squared values of “hp” (see
Fig. 5.117 for the setting of the Derive node).
2. To perform cross-validation, we partition the data into training (60 %), valida-
tion (20 %), and test (20 %) sets via the Partition node (see Sect. 2.7.7, for details
on the Partition node). Afterwards, we add a Type node to the stream and
draw another scatterplot of the variables “hp” and “mpg”, but this time we
color and shape the dots depending on their set partition (see Fig. 5.118). What
we can see once again is that the model barely depends on the explicit selection
of the training, validation, and test data. In order to get a robust model, a bagging
procedure should also be used.
3. Now, we add two Regression nodes to the stream and connect each of them with
the Type node. One node is used to build a standard linear regression while the
other fits a degree 2 polynomial function to the data. Here, we only present the
settings for the second model; the settings of the first model are analogous. For that
purpose, we define “mpg” as the target variable and “hp” and “hp2” as the input
variables in the Fields tab of the Regression node. The Partition field is once
again set as Partition (see Fig. 5.119).
In the Model tab, we now enable “Use partitioned data” to include cross-
validation in the process and set the variable selection method to “Enter”. This
guarantees that both input variables will be considered in the model (see
Fig. 5.120).
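Once “hp2” is available, the Regression node fits a model that is still linear in its coefficients. A pure-Python sketch of such a degree 2 least-squares fit via the normal equations (illustrative only; the toy data below are our own, not the mtcars values):

```python
# Ordinary least squares for y = b0 + b1*x + b2*x^2, solved through the
# normal equations X'X b = X'y with Gauss-Jordan elimination.
def fit_quadratic(xs, ys):
    rows = [[1.0, x, x * x] for x in xs]          # design matrix rows
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(3)]
    m = [xtx[i] + [xty[i]] for i in range(3)]     # augmented matrix
    for col in range(3):
        # partial pivoting, then eliminate the column from all other rows
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(3):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [a - f * b for a, b in zip(m[r], m[col])]
    return [m[i][3] / m[i][i] for i in range(3)]  # b0, b1, b2

# data generated from y = 2 + 3x - 0.5x^2 should be recovered exactly
xs = [0, 1, 2, 3, 4, 5]
ys = [2 + 3 * x - 0.5 * x * x for x in xs]
b0, b1, b2 = fit_quadratic(xs, ys)
```

The point of the Derive node above is precisely this: adding the squared column turns a nonlinear relationship in x into a linear estimation problem in the coefficients.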
Fig. 5.118 Scatterplot of “mpg” and “hp,” colored and shaped depending on how they belong to
the dataset
Fig. 5.121 Enabling of the selection criteria, which include the AIC
Fig. 5.122 Goodness of fit statistics and variable selection criteria for the degree 1 model
Fig. 5.123 Goodness of fit statistics and variable selection criteria for the degree 2 model
(4.651) compared to the other two sets (2.755 and 1.403) results in the very low
number of occurrences.
Now, even the test set has a mean error close to 0 and a low standard deviation
for the degree 2 model. This confirms the generalizability of the model, which is
thus capable of predicting the “mpg” of new data.
5. After training several models, the validation set is used to compare these models
with each other. The model that best predicts the validation data is then the best
candidate for describing the data. Thus, the validation dataset is part of the
model building and finding procedure. It is possible that the validation set
somehow favors a specific model, because of a bad partition sampling for
example. Hence, due to the potential for bias, another evaluation of the final
model should be done on an independent dataset, the test set, to confirm the
universality of the model. The final testing step is therefore an important part in
cross-validation and finding the most appropriate model.
1. We import the mtcars data with the Var. File node and add a Partition node to
split the dataset into a training (70 %) and a testing (30 %) set. For a description of
the Partition node and the splitting process, we refer to Sects. 2.7.7 and 5.3.5.
After adding the typical Type node, we add a Select node to restrict the
modeling to the training set. The Select node can be generated from the
Partition node. For details see Sect. 5.3.5 and Fig. 5.73.
2. To build the boosting model, we add a Linear node to the stream and connect it
with the Select node. Afterwards, we open it and select “mpg” as the target and
“disp” and “wt” as input variables in the Fields tab (see Fig. 5.127).
Fig. 5.126 Stream of the boosting regression of the mtcars data with the Linear node
Next, we choose the boosting option in the Building Options tab to ensure a
boosting model building process (see Fig. 5.128). In the Ensemble options, we
define 100 as the maximum components in the final ensemble model (see
Fig. 5.129).
Now, we run the stream and the model nugget appears.
3. To inspect the quality of the ensemble model, we open the model nugget and
notice that the boosting model is more precise (R2 = 0.734) than the reference
model (R2 = 0.715) (see Fig. 5.130). Thus, boosting increases the fit of the
model to the data, and this ensemble model is chosen for prediction of the “mpg”
(see arrow in Fig. 5.130).
In the Ensemble accuracy view, we can retrace the accuracy progress of the
modeling process. Here, we see that only 26 component models were trained,
since further improvement by additional models wasn’t possible, and thus, the
modeling process stopped at this point (Fig. 5.131).
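This early-stopping behaviour can be illustrated with a simplified boosting loop (L2 boosting with regression stumps; this is our own sketch, not IBM's ensemble algorithm):

```python
# Illustrative boosting for regression: each component model is fitted to
# the residuals of the current ensemble, and model building stops early
# once an extra component no longer reduces the squared error.
def fit_stump(xs, residuals):
    """Weak learner: best single split on x, predicting the mean per side."""
    best = None
    for split in xs:
        left = [r for x, r in zip(xs, residuals) if x <= split]
        right = [r for x, r in zip(xs, residuals) if x > split]
        if not left or not right:
            continue
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - ml) ** 2 for r in left) + sum((r - mr) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, split, ml, mr)
    _, split, ml, mr = best
    return lambda x: ml if x <= split else mr

def boost(xs, ys, max_components=100, rate=0.5, tol=1e-6):
    pred = [0.0] * len(xs)
    components = []
    err = sum((y - p) ** 2 for y, p in zip(ys, pred))
    for _ in range(max_components):
        stump = fit_stump(xs, [y - p for y, p in zip(ys, pred)])
        new_pred = [p + rate * stump(x) for p, x in zip(pred, xs)]
        new_err = sum((y - p) ** 2 for y, p in zip(ys, new_pred))
        if err - new_err < tol:          # no further improvement: stop early
            break
        components.append(stump)
        pred, err = new_pred, new_err
    return components, err

# toy step-shaped data; far fewer than 100 components are needed
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [3, 3, 3, 3, 9, 9, 9, 9]
components, err = boost(xs, ys)
```

As in the Modeler run above, the loop stops well before the configured maximum of 100 components because additional models bring no measurable improvement.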
For cross-validation and the inspection of the RMSE, we now add an Analysis
node to the stream and connect it with the model nugget. As a trick, we further
change the data connection from the Select node to the Type node (see
Fig. 5.132). This will ensure that both the training and testing partitions are
considered simultaneously by the model. Hence, the predictions can be shown in
one Analysis node.
After running the stream again, the output of the Analysis node appears (see
Fig. 5.133). We can see that the RMSE of the training and testing data do not
differ much, hence indicating a robust model which is able to predict the “mpg”
of unseen data.
Fig. 5.129 Definition of the maximum components for the boosting model
Fig. 5.132 Change of the data connection from the Select node to the Type node
Generalized Linear (Mixed) Models (GLM and GLMM) are essentially a generalization
of the linear and multiple linear models described in the two previous
sections. They are able to process more complex data, such as dependencies
between the predictor variables and a nonlinear connection to the target variable.
A further advantage is the possibility of modeling data with non-Gaussian
distributed target variables. In many applications, a normal distribution is inappro-
priate and even incorrect to assume. An example would be when the response
variable is expected to take only positive values that vary over a wide range. In this
case, a constant change in the input can lead to a large change in the prediction,
which could even become negative under a normal distribution. In this particular example,
assuming an exponentially distributed target variable would be more appropriate. Besides
these fixed effects, the Mixed Models add additional random effects that
describe the individual behavior of data subsets. In particular, this helps us to
model the data on a more individual level.
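The fixed-plus-random-effects idea can be made concrete with a toy simulation (entirely illustrative; the group names and parameter values are invented):

```python
# Toy simulation of a mixed model: every classroom shares the same fixed
# intercept and slope, but each classroom gets its own random intercept,
# so pupils from the same classroom are correlated with each other.
import random

rng = random.Random(42)
fixed_intercept, fixed_slope = 20.0, 0.8                         # fixed effects
classrooms = {c: rng.gauss(0.0, 5.0) for c in ("A", "B", "C")}   # random intercepts

def posttest(pretest, classroom):
    # fixed part + classroom-specific random intercept + individual noise
    return (fixed_intercept + classrooms[classroom]
            + fixed_slope * pretest + rng.gauss(0.0, 1.0))

scores_a = [posttest(50, "A") for _ in range(5)]
scores_b = [posttest(50, "B") for _ in range(5)]
```

Scores within one classroom scatter only by the small individual noise, while classrooms as a whole are shifted against each other: exactly the clustering that a pure fixed-effects model cannot express.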
This flexibility and the possibility of handling more complex data come at a
price: the loss of automatic procedures and selection algorithms. There are no black
boxes that help to find the optimal variable dependencies, link function to the target
variable, or its distribution. YOU HAVE TO KNOW YOUR DATA! This makes
utilization of GLM and GLMM very tricky, and they should be used with care and
only by experienced data analysts. Incorrect use can lead to false conclusions and
thus false predictions.
The Modeler provides two nodes for building these kinds of models, the GenLin
node and the GLMM node. With the GenLin node, we can only fit a GLM to the
data, whereas the GLMM node can model Linear Mixed Models as well. After a
short explanation of the theoretical background, the GLMM node is described here
in more detail. It comprises many of the same options as the GenLin node, but the
latter has a wider range of options for the GLM, e.g., a larger variety of link
functions, and is more elegant in some situations. A detailed description of the
GenLin node is omitted here, but we recommend completing Exercise 2 in the later
Sect. 5.4.5, where we explore the GenLin node while fitting a GLM model.
5.4.1 Theory
The GLM and GLMM can process data with continuous or discrete target variables.
Since GLMs and Linear Mixed Models are two separate classes of model, with the
GLMM combining both, we will explain the theoretical background of these two
concepts in two short sections.
The difference from an ordinary MLR is that this linear term is linked to the mean
of the observation distribution via a so-called link function g, i.e.,
g(Mean of the target variable) = h(x_i1, . . . , x_ip).
This provides a connection between the target distribution and the linear predictor.
There are no restrictions concerning the link function, but its domain should
coincide with the support of the target distribution in order to be practical. There
are, however, a couple of standard functions which have proved to be adequate. For
example, if the target variable is binary and g is the logistic function, the GLM is a
logistic regression, which is discussed in Sect. 8.3. We refer to Fahrmeir (2013) for
a more detailed theoretical introduction to GLM.
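Two standard link functions make this concrete. A short Python illustration (textbook relationships, not Modeler output):

```python
# With a log link, the linear predictor eta may be any real number while
# the implied mean of the target stays strictly positive; with a logit
# link, eta is mapped to a probability in (0, 1) for binary targets.
import math

def mean_from_log_link(eta):
    """Inverse of the log link: g(mu) = log(mu)  =>  mu = exp(eta)."""
    return math.exp(eta)

def logistic(eta):
    """Inverse of the logit link, used in logistic regression."""
    return 1.0 / (1.0 + math.exp(-eta))

mu = mean_from_log_link(-3.0)   # negative eta, yet mu > 0
p = logistic(0.0)               # eta = 0 maps to probability 0.5
```

This is the sense in which the link function reconciles an unbounded linear predictor with the support of the target distribution.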
estimation). We should mention that other methods exist for estimating coefficients,
such as the Bayesian approach. For an example see once again Fahrmeir (2013).
A popular statistical measure for comparing and evaluating models is the
“corrected Akaike information criterion (AICC)”. As in the other regression models, the mean squared error is also a
commonly used value for determining the goodness of fit of a model. This parame-
ter is consulted during cross-validation in particular. Another criterion is the
“Bayesian information criterion (BIC)”, see Schwarz (1978), which is also calcu-
lated in the Modeler.
Here, we set up a stream for a GLMM with the GLMM node using the “test_scores”
dataset. It comprises data from schools that used traditional and new/experimental
teaching methods. Furthermore, the dataset includes the pretest and the posttest
results of the students, as well as variables to describe the student’s characteristics.
The learning quality not only depends on the method but also, for example, on the
teacher and how he or she delivers the educational content to the pupils and the class
itself. If the atmosphere is noisy and restless, the class has a learning disadvantage.
These considerations justify a mixed model and the use of the GLMM node to
predict the posttest score.
the target values with the curve of a normal distribution. This indicates a
Gaussian distribution of the target variable.
3. We add the GLMM node from the SPSS Modeler toolbar to the stream, connect
it to the Type node, and open it to define the data structure and model
parameters.
4. In the Data Structure tab, we can specify the dependencies of the data records, by
dragging the particular variables from the left list and dropping them onto the
subject field canvas on the right. First, we drag and drop the “school” variable,
then the “classroom” variable, and lastly the “student_id” variable, to indicate
that students in the same classroom correlate in the outcome of the test scores
(see Fig. 5.136).
5. The random effects, induced by the defined data structure, can be viewed in the
Fields and Effects tab. See Fig. 5.137 for the included random effects in our
stream for the test score data, which are “school” and the factor
“school*classroom”. These indicate that the performance of the students
depends on the school as well as on the classroom, and the data are therefore clustered.
We can manually add more random effects by clicking on the “Add Block. . .”
button at the bottom (see arrow in Fig. 5.137). See IBM (2015b, pp. 180–181) for
details on how to add further effects.
Fig. 5.136 Definition of the clusters for the random effects in the GLMM node
5.4 Generalized Linear (Mixed) Model 453
6. In the “Target” options, the target variable, its distribution, and relationship to
the linear predictor term can be specified. Here, we choose the variable “post-
test” as our target. Since we assume a Gaussian distribution and a linear
relationship to the input variables, we choose the “Linear model” setting (see
arrow in Fig. 5.138).
We can choose some other models predefined by the Modeler for continuous,
as well as categorical target variables [see the enumeration in Fig. 5.138 or IBM
(2015b, p. 175)]. We can further use the “Custom” option to define the distribution
and link function manually if none of the above models are appropriate for
our data. A list of the distributions and link functions can be found in IBM
(2015b, pp. 178–179).
7. In the “Fixed Effects” options, we can specify the variables with deterministic
influence on the target variable, which should be included in the model. This can
be done by dragging and dropping. We select the particular variable in the left
list and drag it to the “Effect builder canvas”. There we can select multiple
variables at once and we can also define new factor variables to be included in
the model. The latter can be used to describe dependencies between single input
variables. The options and differences between the columns in the “Effect
builder canvas” are described in the Infobox below.
In our example, we want to include the “pretest”, “teaching_method”, and
“lunch” variables as single predictors. Therefore, we select these three variables
in the left list and drag them into the “Main” column on the right. Then we select
the “school_type” and “pretest” variable and drag them into the “2-way” col-
umn. This will bring the new variable “school_type*pretest” into the model
since we assume a dependency between the type of school and the pretest results.
Fig. 5.138 Definition of the target variable, its distribution, and the link function
To include the intercept in the model, we check the particular field (see left
arrow in Fig. 5.139).
There are four different possible drop options in the “Effect builder canvas”:
Main, *, 2-way, and 3-way. If we drop the selected variables, for example A,
B, and C, into the Main effects column, each variable is added individually to
the model. If, on the other hand, the 2-way or 3-way is chosen, all possible
variable combinations of 2 or 3 are inserted. In the case of the 2-way
interaction, this means for the example here, that the terms A*B, A*C, and
B*C are added to the GLMM. The * column adds a single term to the model,
which is a multiplication of all the selected variables.
Besides these given options, we can specify our own nested terms and add
them to the model with the “Add custom term” button on the right [see the
right arrow in Fig. 5.139 and for further information see IBM (2015b)].
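The term labels generated by these drop options can be sketched with itertools (the “*” naming mirrors the effect labels shown in the Modeler):

```python
# The 2-way and 3-way drop options insert all variable combinations of the
# chosen order as interaction terms, labeled with "*" between the names.
from itertools import combinations

def interaction_terms(variables, order):
    return ["*".join(c) for c in combinations(variables, order)]

two_way = interaction_terms(["A", "B", "C"], 2)    # A*B, A*C, B*C
three_way = interaction_terms(["A", "B", "C"], 3)  # A*B*C
```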
8. In the “Build options” and “Model Options”, we can specify further advanced
model criteria that comprise convergence settings of the involved algorithms.
We refer the interested reader to IBM (2015b).
9. Finally, we run the stream and the model nugget appears.
Some of the tables and graphs in the model nugget of the GLMM are similar to the
ones of MLR, for example the Predicted by Observed, Fixed Effects, and Fixed
Coefficients views. For details, see Sect. 5.3.3, where the model nugget of the MLR
is described. It should not be surprising that these views equal those of the MLR,
since the fixed input variables form a linear term, as in the MLR. The only
difference to the MLR model is the consideration of covariate terms, here
“school_type*pretest” (see Fig. 5.140). These terms are treated as normal input variables
and are therefore handled no differently.
The single effects are significant, whereas the product variable “school_type*pretest”
has a relatively high p-value of 0.266. A more suitable
model might therefore be one without this variable.
Model Summary
In the summary view, we can get a quick overview of the model parameters, the
target variable, its probability distribution, and the link function. Furthermore, two
model evaluation criteria are displayed, the Akaike (AICC) and the Bayesian (BIC) criterion,
Fig. 5.140 Coefficients of the test scores GLMM with the co-variate term “pretest*school_type”
which can be used to compare models with each other. A smaller value thereby
means a better fit of the model (Fig. 5.141).
The covariances of the separate random effects are in the “Random Effect
Covariances” view. The color indicates the direction of the correlation: a darker
color means positively correlated and a lighter one negatively correlated. If there
are multiple blocks, we can switch between them in the bottom selection menu (see
Fig. 5.142). A block is thereby the level of subject dependence, which was defined
in the GLMM node, in this case the school and the classroom (see Fig. 5.137). The
value, displayed in Fig. 5.142 for example, is a measure of the variation between the
different schools.
In the “Covariance Parameter” view, the first table gives an overview of the number
of fixed and random effects that are considered in the model. In the bottom table,
parameters of the residual and covariance estimates are shown. As well as the
estimates, these include the standard error and confidence interval, in particular (see
Fig. 5.143). At the bottom, we can switch between the residual and the block estimates.
In the case of the test score model, we find that the estimated effect of school
subjects is much higher (22.676) than the effect of the residuals (7.868), and the
“school*classroom” subject, which is 7.484 (see Fig. 5.144 for school subject
variation and Fig. 5.143 for the residual variation estimate). This indicates that
most of the variation in the test scores that is unexplained by the fixed effects
stems from differences between the schools.
In all previous examples and models, the structure of the data and the relationships
between the input and target variables were clear before estimating the model. In
many applications, this is not the case, however, and the basic parameters of an
assumed model have to be chosen before fitting the model. One famous and often
referenced example is polynomial regression, see Exercise 5 in Sect. 5.3.7. Polyno-
mial regression means that the target variable y is connected to the input variable
x via a polynomial term
y = β0 + β1 x + β2 x^2 + . . . + βp x^p.
The degree p of this polynomial term is unknown however, and has to be deter-
mined by cross-validation. This is a typical application of cross-validation, to find
the optimal model, and afterwards test for universality (see Sect. 5.1.2). Recall
Fig. 5.4, where the workflow of the cross-validation process is visualized.
Cross-validation has an advantage over normal selection criteria, such as forward
or backward selection, since the decision on the exponent is made on an
independent dataset, and thus overfitting of the model is prevented. Selection
criteria, in contrast, can fail in this manner.
In this section, we perform a polynomial regression with cross-validation, with the
GLMM node based on the “Ozone.csv” dataset, containing meteorology data and
ozone concentration levels from the Los Angeles Basin in 1976 (see Sect. 10.1.27).
We will build a polynomial regression model to estimate the ozone concentration from
the temperature.
We split the description into two parts, in which we describe first the model
building and then the validation of these models. Of course, we can merge the two
resulting streams into one single stream: we have saved this under the name
“ozone_GLMM_CROSS_VALIDATION”.
Fig. 5.145 Filter node to exclude all irrelevant variables from the further stream
Fig. 5.145). For this we have to click on all the arrows from these variables,
which are then crossed out, so they won’t appear in the subsequent analysis.
3. Now, we insert the usual Type node, followed by a Plot node, in order to inspect
the relationship between the variables “O3” and “temp” (see Sect. 4.2 for
instructions on the Plot node). The relationship between the two variables can
be seen by the scatterplot output of the Plot node in Fig. 5.146.
As can be seen in the scatterplot in Fig. 5.146, the cloud of points suggests a slight
curve, which indicates a quadratic relationship between the O3 and temp
variables. Since we are not entirely sure, we choose cross-validation to decide
if a linear or quadratic regression is more suitable.
4. To perform cross-validation, we insert a Partition node into the stream, open it,
and select the “Train, test and validation” option to divide the data into three
parts (see Fig. 5.147). We choose 60 % of the data as the training set and split
the remaining data equally into validation and testing sets. Furthermore, we
should point out that we have to fix the seed of the sampling mechanism, so that
the three subsets are the same in the later validation stream. This is important,
as otherwise the validation would be with already known data and therefore
Fig. 5.146 Scatterplot of the “O3” and “temp” variables from the ozone data
Fig. 5.147 Partitioning of the test score data into training, validation and test sets
Fig. 5.148 Fields and Effects tab of the GLMM node. Target variable (O3) and the type of model
selection
inaccurate, since the model typically performs better on data it is fitted to. Here,
the seed is set to 1234567.
5. We generate a Select node for the training data, for example with the
“Generate” option in the Partition node (see Sect. 2.7.7). Add this node to the
stream between the Type node and the GLMM node.
6. Now, we add a GLMM node to the stream and connect it to the Select node.
7. Open the GLMM node and choose “O3” as the target variable in the Fields and
Effects tab (see Fig. 5.148). We also select the Linear model relationship, since
we want to fit a linear model.
8. In the Fixed Effects option, we drag the temp variable and drop it into the main
effects (see Fig. 5.149). The definition of the linear model is now complete.
9. To build the quadratic regression model, copy the GLMM node, paste it into the
stream canvas, and connect it with the Select node. Open it and go to the Fixed
Effects option. To add the quadratic term to the regression equation, we click on
the “Add a custom term” button on the right (see arrow in Fig. 5.149). The
custom term definition window pops up (see Fig. 5.150). There, we drag the
temp variable into the custom term field, click on the “By*” button, and drag
and drop the temp variable into the custom field once again. Now the window
should look like Fig. 5.150. We finish by clicking on the “Add Term” button,
Fig. 5.149 Input variable definition of the linear model. The variable temp is included as a
linear term
which adds the quadratic temp term to the regression as an input variable. The
Fixed Effects window should now look like Fig. 5.151.
10. Now we run the stream and the model nuggets appear. The stream for building
the three models is finished, and we can proceed with validation of those
models.
Fig. 5.152 Merge of the model outputs. Switching off of duplicate filters
Fig. 5.153 Validation of the models with the output of the Analysis node
quadratic regression does not strongly improve the goodness of fit, but when
computing the RMSE (Sect. 5.1.2), the error of the quadratic regression is
smaller than the error of the linear model. Hence, the former outperforms
the latter and so describes the data a bit better.
17. We still have to test this by cross-validating a chosen model with the
remaining testing data. The stream is built similarly to the other testing
streams (see, e.g., Sect. 5.2.5 and Fig. 5.154).
18. To test the model, first, we generate a Select node, connect it to the Type node,
and select the testing data.
19. We copy the model nugget of the best-validated model and paste it into the
canvas. Then, connect it to the Select node of the test dataset.
20. We add another Analysis node at the end of this stream and run it. The output
is shown in Fig. 5.155. We can see that the error mean is almost 0 and the
RMSE is of the same order as the standard deviation of the training and
validation data. Hence, a quadratic model is suitable for predicting the ozone
concentration from the temperature.
5.4.5 Exercises
1. Import the data file and perform a boxplot of the “distance” variable. Is the
assumption of adding individual effects suitable?
2. Build two models with the GLMM node, one with random effects and one
without. Then, use the “Age” and “Sex” variables as fixed effects in both
models, while adding an individual intercept as a random effect in the latter
model.
3. Does the random effect improve the model fit? Inspect the model nuggets to
answer this question.
1. Import the test score data and partition it into three parts, training, validation, and
test set, with the Partition node.
2. Add a GenLin node to the stream and select the target and input variables.
Furthermore, select the partition field as Partition, which indicates which
subset each record belongs to.
3. In the Model tab, choose the option “Use partitioned data” and select “Include
intercept”.
4. In the Expert tab, we set the normal distribution for the target variable and
choose Power as the link function with parameter 1.
5. Add two additional GenLin nodes to the stream by repeating steps 2–4. Choose
0.5 and 1.5 as power parameters for the first and second models, respectively.
6. Run the stream and validate the model performances with the Analysis nodes.
Which model is most appropriate for describing the data?
7. Familiarize yourself with the model nugget. What are the tables and statistics
shown in the different tabs?
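The power link used in steps 4 and 5 can be sketched as follows (standard definition; the function names are our own):

```python
# Power link family: g(mu) = mu ** p links the mean of the target to the
# linear predictor. p = 1 is the identity link, p = 0.5 a square-root link.
def power_link(mu, p):
    return mu ** p

def inverse_power_link(eta, p):
    return eta ** (1.0 / p)

# round trip through link and inverse link for the three exercise values of p
round_trips = [inverse_power_link(power_link(4.0, p), p) for p in (0.5, 1.0, 1.5)]
```

Comparing the three resulting GenLin models via the Analysis node then shows which curvature of the link best matches the data.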
1. Reflect on why a Poisson regression is suitable for this kind of data. If you are not
familiar with it, inform yourself about the Poisson distribution and the kind of
randomness it describes.
2. Use the GenLin node to build a Poisson regression model with the three
predictor variables mentioned above. Use 80 % for the training dataset and
20 % for the test data. Is the model appropriate for describing the data and
predicting the counts of ship damage? Justify your answer.
3. Inform yourself on the “Offset” of a Poisson regression, for example, on
Wikipedia. What does this field actually model, in general and in the current
dataset?
4. Update your stream with a logarithmic transformation of the “months of
service” variable and then add these values as the offset. Does this operation
improve the model fit?
5.4.6 Solutions
1. We import the “Orthodont.csv” data (see Sect. 10.1.26) with a Var. File node.
Then, we add the usual Type node to the stream.
Next, we add a Graphboard node to the stream and connect it with the Type
node. After opening it, we mark the “distance” and “subject” variables and click
on the Boxplot graphic (see Fig. 5.157). After running the node, the output
window appears, which displays the distribution of each subject’s distance via a
boxplot (see Fig. 5.158). As can be seen, the boxes are not homogeneous, which
confirms our assumption of individual effects. Therefore, the use of a GLMM is
highly recommended.
2. Next, we add a GLMM node to the stream canvas and connect it with the Type
node. We open it and select the “distance” variable as the target variable and
choose the linear model option as the type of the model (see Fig. 5.159).
470 5 Regression Models
Fig. 5.156 Complete stream of the exercise to fit a GLMM to the “Orthodont” data with the
GLMM node
Now, we add the “Age” and “Sex” variables as fixed effects to the model, in
the Fields and Effects tab of the GLMM node (see Fig. 5.160). This finishes the
definition of the model parameters for the nonrandom effects model.
Now, we copy the GLMM node and paste it into the stream canvas. After-
wards, we connect it with the Type node and open it. In the Data Structure tab,
we drag the “Subject” field and drop it into the Subject canvas (Fig. 5.161). This
will include an additional intercept as a random effect in the model and finishes
the parameters definition of the model with random effects.
We finally run the stream, and the two model nuggets appear.
3. We open both model nuggets and take a look at the “Predicted by Observed”
scatterplots. These can be viewed in Figs. 5.162 and 5.163. As can be seen, the
points of the model with random effects lie in a straight line around the diagonal,
whereas the points of the model without random effects have a more cloudy
Fig. 5.159 Target selection and model type definition in the GLMM node
Fig. 5.162 Predicted by Observed plot of the model without random effects
Fig. 5.164 Complete stream of the exercise to fit a regression into the test score data with the
GenLin node
shape. Thus, the model with random effects explains the data better and is more
capable of predicting the distance moved, using the age and gender of a young
subject.
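This conclusion can be illustrated numerically. The sketch below uses simulated data shaped like the Orthodont study (only age as a fixed effect, for brevity) and fixed per-subject dummy intercepts as a simple stand-in for the GLMM’s random intercept; this is an illustration of why individual effects help, not what the GLMM node computes internally:

```python
import numpy as np

rng = np.random.default_rng(7)
n_subj = 12
ages = np.array([8.0, 10.0, 12.0, 14.0])
subj = np.repeat(np.arange(n_subj), ages.size)     # 4 measurements per subject
age = np.tile(ages, n_subj)
b = rng.normal(0.0, 2.0, n_subj)                   # subject-specific intercepts
dist = 17.0 + 0.6 * age + b[subj] + rng.normal(0.0, 0.5, subj.size)

def rss(X, y):
    # residual sum of squares of an ordinary least-squares fit
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.sum((y - X @ beta) ** 2))

X_fixed = np.column_stack([np.ones_like(age), age])     # no individual effects
X_mixed = np.column_stack([age, np.eye(n_subj)[subj]])  # one intercept per subject
```

Comparing the two residual sums of squares shows that the model with per-subject intercepts absorbs the between-subject variation that the purely fixed model leaves in its residuals.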
In this exercise, we get to know the GenLin node. Figure 5.164 shows the complete
stream of the solution.
Fig. 5.165 Partitioning of the test score data into three parts: training, testing and validation
1. The “001 Template-Stream test_scores” is the starting point for our solution of
this exercise. After opening this stream, we add a Partition node to it and select a
three-part split of the dataset with the ratio 60 % training, 20 % testing, and 20 %
validation (see Fig. 5.165).
2. Now, we add a GenLin node to the stream and select “posttest” as the target
variable, “Partition” as the partitioning variable, and all other variables as input
(see Fig. 5.166).
3. In the Model tab, we choose the option “Use partitioned data” and select
“Include intercept” (see Fig. 5.167).
4. In the Expert tab, we set the normal distribution for the target variable and choose
“Power” as the link function with parameter 1 (see Fig. 5.168).
5. We add two additional GenLin nodes to the stream and connect the Partition
node to them. Afterwards, we repeat steps 2–4 for each node and set the same
parameters, except for the power parameters, which are set to 0.5 and 1.5 (see
Fig. 5.169 to see how the stream should look).
6. Now, we run the stream and the three model nuggets should appear in the
canvas. We then add an Analysis node to each of these nuggets and run the
Fig. 5.167 The model tab of the GenLin node. Enabling of partitioning use and intercept inclusion
Fig. 5.168 Expert tab in the GenLin node. Definition of the target distribution and the link
function
Fig. 5.170 Output of the Analysis node for the GenLin model with power 1
Fig. 5.171 Output of the Analysis node for the GenLin model with power 0.5
stream a second time to get the validation statistics of the three models. These
can be viewed in Figs. 5.170, 5.171, and 5.172.
We see that the standard deviations are very similar across all three models
and across the dataset partitions. This suggests that there is no real
difference in the performance of the models for this data. The model with
Fig. 5.172 Output of the Analysis node for the GenLin model with power 1.5
power 1 has the smallest values of these statistics, however, and is thus the most
appropriate for describing the test score data.
7. To get insight into the model nugget and its parameters and statistics, we will
only inspect the model nugget of the GenLin power 1 model here. The other
nuggets have the same structure and tables.
First, we observe that the GenLin model nugget can also visualize the predictor
importance in a graph. This is displayed in the Model tab. Not surprisingly, the
“Pretest” variable is by far the most important variable in predicting the “post-
test” score (see Fig. 5.173).
Now, we take a deeper look at the “Advanced” tab. Here, multiple tables are
displayed, but the first four (Model Information, Case Processing Summary, Categorical
Variable Information, Continuous Variable Information) summarize the
data used in the modeling process.
The next table is the “Goodness of Fit” table; it contains a couple of measures for
validating the model using the training data. These include, among others, the
Pearson Chi-Squared value and Akaike’s Information Criterion (AICC) (see
Fig. 5.174). For these measures, the smaller the value, the better the model fit. If
we compare, for example, the AICC of the three models with each other, we get the
same picture as before with the Analysis nodes. The model with a power of
1 explains the training data better than the other two models.
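The role of such information criteria can be sketched in a few lines. The snippet below compares a Gaussian AIC (computed from the residual sum of squares, with additive constants dropped) for an intercept-only model and a one-predictor model on hypothetical data; the predictor names and data are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=100)
y = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, 100)      # strong linear signal

def gaussian_aic(X, y):
    # AIC for a least-squares fit: n*log(RSS/n) + 2k (constants dropped)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    return y.size * np.log(rss / y.size) + 2 * X.shape[1]

aic_null = gaussian_aic(np.ones((y.size, 1)), y)                   # intercept only
aic_full = gaussian_aic(np.column_stack([np.ones_like(x), x]), y)  # with predictor
# the smaller AIC identifies the better-fitting model
```

Here the model containing the true predictor has the smaller AIC, just as the power-1 model does in the Goodness of Fit tables.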
The next table shows the significance level and test statistics from the comparison of
the model with the naive model, that is, the model consisting only of the intercept.
As displayed in Fig. 5.175, our model is significantly better than the naive model.
The penultimate table contains the results of the significance tests of the individual
input variables. The significance level is located in the rightmost column (see
Fig. 5.176). In the case of the test scores, all effects are significant, except for the
Fig. 5.177 Parameter estimate and validation for the GenLin model
1. Poisson regression is typically used to model count data and assumes a Poisson-
distributed target variable. Furthermore, the expected value is linked to the
linear predictor term via a logarithmic function, hence,

log(mean of the target variable) = h(x_i1, . . ., x_ip),

with the target variable, here “incidents”, following a Poisson law. A Poisson
distribution describes rare events and is therefore ideal for modeling random events
that occur infrequently. We refer to Fahrmeir (2013) and Cameron and Trivedi
(2013) for more information on Poisson regression.
Since ship damage caused by a wave is very unusual, the assumption of a
Poisson-distributed target variable is justified, and so Poisson regression is suitable.
Plotting a histogram of the “incidents” variable also affirms this assumption (see
Fig. 5.179).
2. To build a Poisson regression with the GenLin node, we first split the data into a
training set and a test set using a Partition node in the desired proportions
(Fig. 5.180).
After adding the usual Type node, we insert the GenLin node to the stream by
connecting it to the Type node. Now, we open the GenLin node and define the
Fig. 5.179 Histogram of the “incidents” variable, which suggests a Poisson distribution
target, Partition, and input variables in the Fields tab (see Fig. 5.181). In the
Model tab, we enable the “Use partitioned data” and “Include intercept” options,
as in Fig. 5.167.
In the Expert tab, we finally define the settings of a Poisson regression. That is,
we choose “Poisson” distribution as the target distribution and “logarithmic” as
the link function (see Fig. 5.182). Afterwards, we run the stream and the model
nugget appears.
To evaluate the goodness of our model, we add an Analysis node to the
nugget. Figure 5.183 shows the final stream of part 2 of this exercise. Now, we
Fig. 5.180 Partitioning of the ship data into a training set and a test set
Fig. 5.181 Definition of the target, partition, and input variables for Poisson regression of the
ship data
Fig. 5.182 Setting the model settings for the Poisson regression. That is, the “Poisson” distribu-
tion and “logarithmic” as link function
Fig. 5.184 Output of the Analysis node for Poisson regression on the ship data
run the stream again, and the output of the Analysis tab pops up. The output can
be viewed in Fig. 5.184. We observe that the standard deviations of the training
and test data differ a lot: 5.962 for the training data, but 31.219 for the test data.
This indicates that the model describes the training data very well, but is not
appropriate for independent data. Hence, we have to modify our model, which is
done in the following parts of this exercise.
3. Often, occurrences of events are counted on different timescales and so appear to
happen equally often, although this is not the case. For example, in our ship data,
the variable “service” describes the aggregated months of service, which differ
for each ship. So, discovered damage is based on different timescales for each
ship and thus is not exactly comparable. The “offset” is an additional tool that
balances this disparity in the model, so that the damages are observed at nearly
the same time intervals. For more information on the offset, we refer to
Hilbe (2014).
4. To update our model with an offset tool, we have to calculate the logarithm of
the “service” variable. Therefore, we insert a Derive node between the Source
and the Partition node. In this node, we set the formula “log(service)” to calculate
the offset term and name this new variable “log_month_service” (see
Fig. 5.185).
Then, we open the GenLin node and set the “log_month_service” variable as
the Offset field. This is displayed in Fig. 5.186. Now, we run the stream again,
Fig. 5.187 Output for the Poisson regression with offset term
and the output of the new model pops up (see Fig. 5.187). We observe that both
standard deviations do not differ much from each other. Thus, by setting an
offset term, the accuracy of the model has increased, and the model is now able
to predict ship damage from ship data not involved in the modeling process.
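The whole procedure, a log-link Poisson regression with a log-exposure offset, can be sketched in Python. This is a generic iteratively reweighted least squares (IRLS) fit on simulated exposure data, not the ship dataset and not the Modeler’s internal algorithm; the variable names are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 400
x = rng.normal(size=n)
months = rng.uniform(10.0, 1000.0, n)          # exposure, e.g. months of service
true_b0, true_b1 = -4.0, 0.3
y = rng.poisson(months * np.exp(true_b0 + true_b1 * x))  # counts per unit exposure

def poisson_irls(X, y, offset):
    # IRLS for Poisson regression with log link and a fixed offset term
    beta = np.zeros(X.shape[1])
    for _ in range(50):
        eta = X @ beta + offset
        mu = np.exp(eta)
        W = mu                                  # Poisson working weights
        z = eta - offset + (y - mu) / mu        # working response, offset removed
        WX = X * W[:, None]
        beta = np.linalg.solve(X.T @ WX, X.T @ (W * z))
    return beta

X = np.column_stack([np.ones(n), x])
beta = poisson_irls(X, y, offset=np.log(months))  # recovers the true rate model
```

Because the offset enters the linear predictor with a fixed coefficient of 1, the fitted model describes damage *rates* per month of service rather than raw counts, which is exactly the correction made in part 4 above.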
The SPSS Modeler provides us with an easy way to build multiple models in one
step, using the Auto Numeric node. The Auto Numeric node considers various
models, which use different methods and techniques, and ranks them according to a
quantification measure. This is very advantageous to data miners for two main
reasons. First, we can easily compare the different settings of a mining method,
such as the variable selection method or validation criteria, within a single stream
instead of running multiple streams with diverse settings. This helps to quickly find
the best method setup. A second reason to use the Auto numeric node is that it
comprises models from different mining approaches, such as regression models, as
described in this chapter, but also neural networks or decision trees. See
Fig. 5.188 and Kuhn and Johnson (2013) for descriptions of regression trees and
other regression modeling techniques, which are also provided by the Auto numeric
5.5 The Auto Numeric Node 489
Fig. 5.188 Nodes included within the Auto numeric node. The darker circles are the nodes for
regression models which are described in this chapter. The lighter circles are further regression
nodes of other models within the Auto numeric node
node. This can help find the most appropriate approach for understanding the data.
Either way, the Auto numeric node takes the best-fitting models and joins them
together into a model ensemble. That means, when predicting the target variable
value, each of the models in the ensemble processes the data and predicts an
output; these outputs are then aggregated into one final prediction using the mean. This
aggregation of values, predicted by different models, has the advantage
that data points treated as outliers or over-weighted by one model type are smoothed out
by the output of the other models. Furthermore, overfitting becomes less likely.
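The benefit of averaging can be made precise: by convexity of the squared error, the RMSE of the averaged prediction can never exceed the average of the individual RMSEs. A small simulation with three hypothetical “models”, one of them outlier-prone:

```python
import numpy as np

rng = np.random.default_rng(11)
truth = rng.normal(0.0, 1.0, 200)
# three imperfect models: two mildly noisy, one producing occasional outliers
preds = [truth + rng.normal(0.0, 0.4, 200),
         truth + rng.normal(0.0, 0.5, 200),
         truth + np.where(rng.random(200) < 0.05, 5.0, 0.0)]

def rmse(p):
    return float(np.sqrt(np.mean((p - truth) ** 2)))

ensemble = np.mean(preds, axis=0)   # the ensemble averages the model outputs
# RMSE(ensemble) <= mean of the individual RMSEs, by convexity of the L2 norm
```

In this simulation the ensemble also beats the outlier-prone model on its own, which is the smoothing effect described above.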
We would like to point out that building a huge number of models is very time
consuming. That’s why a large number of considered models in the Auto Numeric
node may take a very long time to calculate, up to a couple of hours.
In the following, we show how to effectively use the Auto numeric node to build an
optimal model for our data and mining task. A further advantage of this node is its
capability of running a cross-validation within the same stream. This results in more
clearly represented streams. In particular, no additional stream for validation of the
model has to be constructed. We include this useful property in our example stream.
1. First, we import the data, in this case the Boston housing data, housing.data.txt.
2. To cross-validate the models within the stream, we have to split the dataset into
two separate sets, the training data and the test data. Therefore, we add the
Partition node to the stream and partition the data appropriately: 70 % of the
data to train the models and the rest for the validation. The Partition node is
described in more detail in Sect. 2.7.7. Afterwards, we use the Type node to
assign the proper variable types.
3. Now, we add the Auto numeric node to the canvas and connect it with the Type
node, then open it with a double-click. In the “Fields” tab, we define the target
and input variables, where MEDV describes the median house price and is set as
the target variable, and all other variables, except for the partitioning field, are
chosen as inputs. The partition field is selected in the Partition drop-down menu,
to indicate to the Modeler that this field defines the training and the test
sets (see Fig. 5.189 for details).
Fig. 5.189 Definition of target and input variables and the partition field
4. In the “Model” tab, we enable the “Use partitioned data” option (see the top
arrow in Fig. 5.190). This option causes the models to be built on the
training data alone.
In the “Rank models by” selection field, we can choose the score that
validates the models and compares them with each other. Possible measures
are:
Fig. 5.190 Model tab with the criteria that models should be included in the ensemble
In particular, the relative error equals the variance of the observed values from
those predicted, divided by the variance of the observed values from
the mean.
With the “rank” selection, we can choose if the models should be ranked by
the training or the test partition, and how many models should be included in
the final ensemble. Here, we select that the ensemble should have four models
(see the bottom arrow in Fig. 5.190). At the bottom of the tab, we can define
more precise exclusion criteria, to ignore models that are unsuitable for our
purposes. If these thresholds are too strict, however, we might end up with an
empty ensemble, that is, no model fulfills the criteria. If this happens, we should
loosen the criteria.
In the Model tab, we can also choose to calculate the predictor importance,
and we recommend enabling this option each time. For predictor importance,
see Sect. 5.3.3.
5. The next tab is the “Expert” tab. Here, the models that should be calculated and
compared can be specified (see Fig. 5.191). Besides the known Regression,
Linear and GenLin nodes, which are the classical approaches for numerical
estimation, we can also include decision trees, neural networks, and support
vector machines as candidates for the ensemble. Although
these models were actually invented for different types of data mining tasks,
typically classification analysis, they are also capable of estimating numeric
values. We omit a description of the
particular models here and refer to the corresponding Chap. 8 on classification
and IBM (2015a) and Kuhn and Johnson (2013) for further information on the
methods and algorithms.
We can also specify multiple settings for one node, in order to include more
model variations and to find the best model of one type. Here, we used two
setups for the Regression node and eight for the Linear node (see Fig. 5.191).
We demonstrate how to pick settings with the Linear node. For all other nodes,
the steps are the same, but the options are of course different for each node.
6. To include more models of the same type in the calculations, we click on the
“Model parameters” field next to the particular model. We choose the option
“Specify” in the opening selection bar. Another window pops up in which we
can define all model variations. For the Linear node, this looks like Fig. 5.192.
7. In the opened window, we select the “Expert” tab and in the “Options” field
next to it, we click on each parameter that we want to change (see the arrow in
Fig. 5.192). Another, smaller window, opens with the possible selectable
parameter “values” (see Fig. 5.193). We mark all parameter options that we
wish to be considered, here, the “Forward stepwise” and “Best subset” model
selection methods, and confirm by clicking the “ok” button. With all other
parameters, we proceed in an analogous fashion.
Fig. 5.193 Specification of the variable selection methods to consider in the Linear node model
8. If we have set all the model parameters of our choice, we run the model, and the
model nugget should appear in the stream. For each possible combination of
selected parameter options, the Modeler now generates a statistical model and
compares it to all other built models. If it is ranked high enough, the model is
included in the ensemble.
We would again like to point out that running a huge number of models can
lead to time-consuming calculations.
9. We add an Analysis node to the model nugget to calculate the statistics of the
predicted values by the ensemble, for the training and test sets separately. We
make sure that the “Separate by partition” option is enabled, as displayed in
Fig. 5.194.
Fig. 5.194 Analysis node to deliver statistics of the predicted values. Select “Separate by
partition” to process the training and test sets separately
10. Figure 5.195 shows the output, which contains the distribution statistics of the
training data and test data. The subsequent section describes how these statistics
are calculated. To evaluate whether the ensemble model can be used to process
independent data however, we have to compare the statistics and especially the
standard deviation. Since this is not much higher for the test data than for the
training, the model passes the cross-validation test and can be used for
predictions of further unknown data.
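The relative error mentioned in step 4 can be transcribed directly from its definition, the variance of the observations around the predictions divided by their variance around the mean (equivalently, 1 − R²). A minimal sketch with made-up values:

```python
def relative_error(observed, predicted):
    """Variance of the observations around the predictions divided by
    their variance around the mean (i.e., 1 - R^2)."""
    mean = sum(observed) / len(observed)
    ss_pred = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_mean = sum((o - mean) ** 2 for o in observed)
    return ss_pred / ss_mean

y = [3.0, 5.0, 7.0, 9.0]
print(relative_error(y, y))                      # perfect model -> 0.0
print(relative_error(y, [6.0, 6.0, 6.0, 6.0]))   # mean-only model -> 1.0
```

A value of 0 therefore means perfect prediction, while a value of 1 means the model is no better than always predicting the mean.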
Fig. 5.195 Analysis output with statistics from both the training and the test data
In this short section, we take a closer look at the model nugget, generated by the
Auto numeric node, and the options it provides.
Fig. 5.196 Model tab of the Auto numeric model nugget. Specification of the models in the
ensemble, which are used to predict output
R2. This is highlighted by the left arrow in Fig. 5.196. This will open the model
nugget of each particular node in a new window, and the details of the model will be
displayed. Since each of the model nuggets is introduced and described separately
in the associated chapter, we refer to these chapters, as we omit a description of
each individual node here.
In the leftmost column, labeled “Use?”, we can choose which of the models
should process data for prediction. More precisely, each of the enabled models
takes the input data and estimates the target value individually. Then, all outputs are
averaged into one single output. This process of aggregation can prevent overfitting
and minimize the impact of outliers, which will lead to more trustworthy
predictions.
Fig. 5.197 Graph tab of the Auto numeric model nugget. Predictor importance and scatterplot of
observed and predicted values
Fig. 5.198 Settings tab of the Auto numeric model nugget. Specification of the output
Furthermore, we can add the standard error to our output, estimated for each
prediction (see the second check box in Fig. 5.198).
We recommend playing with these options and previewing the output data, to
understand the differences between each created output. For more information, we
recommend consulting IBM (2015b).
5.5.3 Exercises
5.5.4 Solutions
For this exercise, there is no definite solution, and it is just not feasible to consider
all possible model variations and parameters. This exercise serves simply as a
practice tool for the Auto numeric node and its options. One possible solution
stream can be seen in Fig. 5.199.
1. To build the “Longley_regression” stream of Fig. 5.199, we import the data and
add the usual Type node. The dataset consists of 7 variables and 16 data records,
which can be observed with the Table node. We recommend using this node in
order to inspect the data in the data file and to get an impression of the data.
2. We then add the Auto numeric node to the stream, define the variables, with the
“Employed” variable as the target, and all other variables as predictor variables
(see Fig. 5.200).
Fig. 5.199 Stream of the regression analysis of the Longley data with the Auto numeric node
In this solution to the exercise, we use the default settings of the Auto numeric
node. That is, using the partitioned data for ranking the models, with the
correlation as indicator. We also want to calculate the importance of the
predictors (see Fig. 5.201). Furthermore, we include only the standard models,
which are uniformly specified by the Auto numeric node, in the evaluation and
comparison process (see Fig. 5.202). We strongly recommend also testing other
models and playing with the parameters of these models, in order to find a more
suitable model and to optimize the accuracy of the prediction.
3. After running the stream, the model nugget pops up. When opening the nugget,
we see in the “Model rank” tab that the three best models are produced by the
Regression node, the GLMM node, and the Linear node (see Fig. 5.203). There,
the GenLin node tried to estimate the data relationship using a straight line, which
is the default setting. We observe that for all three models, the correlation coefficient is
extremely high, 0.998. The ranking is then decided by further decimal
places, which are not displayed.
node is slimmer, insofar as fewer predictors are included in the final model, i.e.,
4 instead of 6.
Fig. 5.202 Definition of the models included in the fitting and evaluation process
Fig. 5.203 Evaluation overview of the best models selected by the Auto numeric node
Fig. 5.204 Scatterplot to visualize the model performance and the predictor importance
Fig. 5.205 Complete stream of the exercise to fit a regression into the test score data with the
Auto numeric node
1. To build the above stream, we first open the “001 Template-Stream test_scores”
stream and save it under a different name. Now, add a Partition node to the
stream and split the dataset into training, test, and validation sets, as described in
Fig. 5.165.
2. To include both the model building and the validation process in a single node,
we now add the Auto numeric node to the stream and select the “posttest” field as
the target variable, the Partition field as the partition and all other variables as
inputs (see Fig. 5.206).
3. In the “Expert” tab, we specify the models that should be used in the modeling
process. We only select the Generalized Linear Model (see Fig. 5.207). Then, we
Fig. 5.207 Definition of the GLM that is included in the modeling process
click on the “Specify” option to specify the model parameters of the different
models that should be considered. This can be viewed in Fig. 5.207.
4. In the opened pop-up window, we go to the “Expert” tab and click on the right
“Option” field in the Link function, to define its parameters (see Fig. 5.208).
Then, we select the “Power” function as the link function, see Fig. 5.209, and
0.5, 1, and 1.5 as the exponents (see Fig. 5.210). After confirming these
selections by clicking on the “ok” button, all the final parameter selections are
shown (see Fig. 5.211). Now, three models are considered in the Auto numeric
node, each having the “Power” function as link function, but with different
exponents.
5. After running the stream, the model nugget appears. In the “Model” tab of the
nugget, we see that the three models all fit the training data very well. The
correlations are all high, around 0.981. As in Exercise 2 in Sect. 5.4.5, the
highest ranked is the model that has exponent 1 (see Fig. 5.212). The differences
between the models are minimal, however, and thus can be ignored.
6. If we look at the ranking of the test set, the order of the models changes. Now, the
best-fitting model is model number 3, with exponent 1.5 (see Fig. 5.213). We
double-check that all the models are properly fitted, and then we choose model
3 as the final model, based on the ranking of the test data.
7. After adding an Analysis node to the model nugget, we run the stream again (see
Fig. 5.214 for the output of the Analysis node). We see once again that the model
fits the data very well (low and equal RMSE), which in particular means that the
validation of the final model is as good as for the testing and training sets. This
coincides with the results of Exercise 2 in Sect. 5.4.5.
Fig. 5.212 Build models ranked by correlation with the training set
Fig. 5.213 Build models ranked by correlation with the testing set
Literature
Abel, A. B., & Bernanke, B. (2008). Macroeconomics (Addison-Wesley series in economics).
Boston: Pearson/Addison Wesley.
Boehm, B. W. (1981). Software engineering economics (Prentice-Hall advances in computing
science and technology series). Englewood Cliffs, NJ: Prentice-Hall.
Cameron, A. C., & Trivedi, P. K. (2013). Regression analysis of count data (Econometric society
monographs 2nd ed., Vol. 53). Cambridge: Cambridge University Press.
Fahrmeir, L. (2013). Regression: Models, methods and applications. Berlin: Springer.
Gilley, O. W., & Pace, R. (1996). On the Harrison and Rubinfeld data. Journal of Environmental
Economics and Management, 31(3), 403–405.
Harrison, D., & Rubinfeld, D. L. (1978). Hedonic housing prices and the demand for clean air.
Journal of Environmental Economics and Management, 5(1), 81–102.
Hilbe, J. M. (2014). Modeling of count data. Cambridge: Cambridge University Press.
Hyndman, R. J., & Koehler, A. B. (2006). Another look at measures of forecast accuracy.
International Journal of Forecasting, 22(4), 679–688.
IBM. (2015a). SPSS modeler 17 algorithms guide. Accessed September 18, 2015, from ftp://public.
dhe.ibm.com/software/analytics/spss/documentation/modeler/17.0/en/AlgorithmsGuide.pdf
IBM. (2015b). SPSS modeler 17 modeling nodes. Accessed September 18, 2015, from ftp://public.dhe.
ibm.com/software/analytics/spss/documentation/modeler/17.0/en/ModelerModelingNodes.pdf
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning
(Vol. 103). New York: Springer.
Kuhn, M., & Johnson, K. (2013). Applied predictive modeling. New York: Springer.
Kutner, M. H., Nachtsheim, C., Neter, J., & Li, W. (2005). Applied linear statistical models (The
McGraw-Hill/Irwin series operations and decision sciences 5th ed.). Boston: McGraw-Hill
Irwin.
Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6(2), 461–464.
Thode, H. C. (2002). Testing for normality, statistics, textbooks and monographs (Vol. 164).
New York: Marcel Dekker.
Tuffery, S. (2011). Data mining and statistics for decision making (Wiley series in computational
statistics). Chichester: Wiley.
Winer, B. J., Brown, D. R., & Michels, K. M. (1991). Statistical principles in experimental design
(McGraw-Hill series in psychology 3rd ed.). New York: McGraw-Hill.
Zhou, Z.-H. (2012). Ensemble methods: Foundations and algorithms (Chapman & Hall/CRC
machine learning & pattern recognition series). Boca Raton, FL: Taylor & Francis.
Factor Analysis
6
1. Evaluate data using more complex statistical techniques such as factor analysis
2. Explain the difference between factor and cluster analysis
3. Describe the characteristics of principal component analysis and principal factor
analysis
4. Apply the principal component analysis in particular and explain the results
Ultimately, the reader will be called upon to propose well-thought-out and
practical business actions from the statistical results.
ordinal scale. The concrete question was “Please rate how the following dietary
characteristics describe your preferences . . .”.
As depicted in Fig. 6.1, the answer of each respondent can be visualized in a
profile chart (columns). Based on these graphs, we can find some similarities
between the variables “filling” and “hearty”.
If we analyze the profile charts row by row, we find that the variables are
somehow similar because the answers often go in the same direction. In statistical
terms, this means that the fluctuation of the corresponding answers (the same items)
is “approximately” the same.
" The factors are the explanation and/or reason for the original fluctua-
tion of the input variables and can be used to represent or substitute
them in further analyses.
6.2 General Theory of Factor Analysis 515
" Factor analysis can be used to reduce the number of variables. The
algorithm determines factors that can explain the common variance
of several variable subsets. The strength of the relationship between
the factor and a variable is represented by the factor loadings.
" The variance of one variable explained by all factors is called the
communality of the variable. The communality equals the sum of the
squared factor loadings.
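As a small numeric example of this definition: given a loadings matrix, the communality of each variable is the row-wise sum of its squared loadings. The loadings below are hypothetical; only “filling” and “hearty” appear in the survey above, the other names are made up:

```python
# Hypothetical 4-variable, 2-factor loading matrix (made-up values)
loadings = {
    "filling": (0.85, 0.10),
    "hearty":  (0.80, 0.15),
    "sweet":   (0.05, 0.90),
    "fruity":  (0.10, 0.85),
}

# communality = sum of squared factor loadings per variable
communality = {var: sum(l * l for l in ls) for var, ls in loadings.items()}
# e.g. communality["filling"] = 0.85**2 + 0.10**2 = 0.7325
```

A communality close to 1 means the factors explain nearly all of that variable’s variance; a low value flags a variable the factors fail to represent.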
" Both methods, PFA and PCA, can be also differentiated by their
general approaches used to find the factors: the idea of the PCA is
that the variance of each variable can be completely explained by the
factors. If there are enough factors, there is no variance left unexplained. If the
number of factors equals the number of variables, the explained
variance proportion is 100 %.
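The last point can be verified numerically: for a PCA based on the correlation matrix, the eigenvalues (the variances of the components) always sum to the number of variables, so retaining all components explains 100 % of the variance. A sketch with random data:

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.normal(size=(300, 5))
data[:, 1] += data[:, 0]                    # introduce some correlation

R = np.corrcoef(data, rowvar=False)         # PCA is based on this matrix
eigenvalues = np.linalg.eigvalsh(R)         # component variances
explained = eigenvalues / eigenvalues.sum() # explained variance proportions
# the eigenvalues sum to the number of variables, so all 5 components
# together explain 100 % of the variance
```

Variable reduction then consists of keeping only the components with the largest eigenvalues and accepting the small remainder of unexplained variance.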
We want to demonstrate the PCA as well as the PFA in this chapter. Figure 6.2
outlines the structure of the chapter.
Fig. 6.3 Using correlation or covariance matrix as the basis for factor analysis
in the majority of the applications. This is because the covariance depends on the units
of the input variables.
The quality of the factor analysis heavily depends on the correlation matrix;
because of this, different measures have been developed to determine whether a matrix is
appropriate and a reliable basis for the algorithm. We want to give here an
overview of different aspects that should be assessed:
Bartlett-Test/Test of Sphericity
Based on Dziuban and Shirkey (1974), the test tries to ascertain whether the sample
comes from a population in which the input variables are uncorrelated, or, in other
words, whether the correlation matrix differs only incidentally from the unit matrix.
The test is based on the chi-square statistic; however, it is necessary but not
sufficient. Dziuban and Shirkey (1974, p. 359) stated: “That is, if one fails to reject
the independence hypothesis, the matrix need be subjected to no further analysis.
On the other hand, rejection of the independence hypothesis with the Bartlett test is
not a clear indication that the matrix is psychometrically sound”.
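Since the Modeler does not offer this test, it can be computed externally. A minimal Python sketch of Bartlett's test of sphericity, using the well-known chi-square approximation −(n − 1 − (2p + 5)/6) · ln|R| with p(p − 1)/2 degrees of freedom (the example correlation matrix and sample size are invented):

```python
import numpy as np
from scipy.stats import chi2

def bartlett_sphericity(R, n):
    """Bartlett's test of sphericity for a correlation matrix R and sample size n.

    Null hypothesis: the population correlation matrix is the unit matrix,
    i.e., the input variables are uncorrelated."""
    p = R.shape[0]
    statistic = -(n - 1 - (2 * p + 5) / 6) * np.log(np.linalg.det(R))
    df = p * (p - 1) / 2
    p_value = chi2.sf(statistic, df)
    return statistic, p_value

# Invented 3x3 correlation matrix from a sample of n = 200 respondents.
R = np.array([[1.0, 0.6, 0.5],
              [0.6, 1.0, 0.4],
              [0.5, 0.4, 1.0]])
stat, p = bartlett_sphericity(R, n=200)
print(stat, p)  # a small p-value rejects independence of the variables
```

As quoted above, a rejection is only a necessary condition; a non-rejection means the matrix should not be analyzed further.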
Measure of Sampling Adequacy (MSA)/Kaiser–Meyer–Olkin Criterion (KMO)
Based on Kaiser and Rice (1974), the values should be larger than 0.5, or better still,
larger than 0.7 (see also Backhaus 2011, p. 343).
In particular, the MSA/KMO criterion is widely used and strongly recommended
in literature on the subject. Unfortunately, the SPSS Modeler does not offer this or any
of the other statistics or measures described above. The user has to be sure that the
variables and the correlation matrix are a reliable basis for a factor analysis. In the
following section, we will show possibilities for assessing the quality of the matrix.
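One such possibility is to compute the MSA/KMO criterion outside the Modeler. A Python sketch following the standard definition, which compares the squared correlations with the squared anti-image (partial) correlations obtained from the inverse of the correlation matrix (the example matrix is invented):

```python
import numpy as np

def kmo(R):
    """Kaiser-Meyer-Olkin measure of sampling adequacy for a correlation matrix R."""
    R_inv = np.linalg.inv(R)
    # Anti-image (negative partial) correlations, scaled from the inverse matrix.
    d = np.sqrt(np.diag(R_inv))
    partial = -R_inv / np.outer(d, d)
    off = ~np.eye(R.shape[0], dtype=bool)  # mask for the off-diagonal elements
    r2 = (R[off] ** 2).sum()
    q2 = (partial[off] ** 2).sum()
    return r2 / (r2 + q2)

# Invented correlation matrix of three items.
R = np.array([[1.0, 0.6, 0.5],
              [0.6, 1.0, 0.4],
              [0.5, 0.4, 1.0]])
print(kmo(R))  # values above 0.5 (better: 0.7) indicate a usable matrix
```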
6.3.1 Theory
Principal Component Analysis (PCA) is a method for determining factors that can
be used to explain the common variance in several variables. As the name of the
method suggests, the PCA tries to reproduce the variance of the original variables
with the principal components it determines. As outlined in the previous section,
the identified factors or principal components can be described as “the general
description [not reason!] of the common variance” in the case of a PCA (see
Backhaus 2011, p. 357).
In this chapter, we want to extract the principal components of variables and
thereby reduce the actual number of variables. We are more interested in finding
“collective terms” (see Backhaus 2011, p. 357) or “virtual variables” than in finding
the causes of the dietary habits of the survey respondents. This is why we use the
PCA algorithm here.
There are several explanations of the steps in a PCA calculation. The interested
reader is referred to Smith (2002).
" PCA identifies factors that represent the strength of the relationship
between the hidden and the input variables. The squared factor
loadings equal the common variance in the variables. The factors
are ordered by their size or by the proportion of variance in the
original variables that can be explained.
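The mechanics behind this can be sketched in a few lines of Python: the eigenvectors of the correlation matrix define the components, the eigenvalues measure the explained variance, and the loadings are the eigenvectors scaled by the square roots of the eigenvalues. This is a generic sketch with simulated data, not the Modeler's implementation:

```python
import numpy as np

# Simulated data standing in for 200 respondents and three items
# (seeded so that the example is reproducible).
rng = np.random.default_rng(42)
X = rng.standard_normal((200, 3))

R = np.corrcoef(X, rowvar=False)        # correlation matrix of the items
eigvals, eigvecs = np.linalg.eigh(R)    # eigh is for symmetric matrices
order = np.argsort(eigvals)[::-1]       # order components by explained variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

loadings = eigvecs * np.sqrt(eigvals)   # loadings: eigenvectors scaled by sqrt(eigenvalue)
explained = eigvals / eigvals.sum()     # proportion of variance per component
print(explained)
# With as many components as variables, the communalities are all 1,
# i.e., 100 % of the variance is reproduced.
```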
The data we want to use in this section describe the answers of respondents to
questions on their dietary habits. Based on the categories “vegetarian”, “low meat”,
“fast food”, “filling”, and “hearty”, the respondents rated the characteristics of their
diets on an ordinal scale. The concrete question was “Please rate how the following
dietary characteristics describe your preferences . . .”. The scale offered the
values “1 = never”, “2 = sometimes”, and “3 = (very) often”. For details see also
Sect. 10.1.24.
We now want to create a stream to analyze the answers and to find common
factors that help us to describe the dietary characteristics. We also will explain how
PCA can help to reduce the number of variables (we call that variable clustering),
and help cluster similar respondents, in terms of their dietary characteristics.
" There are at least three options for excluding variables from a stream.
1. In a (Statistics) File node, the variables can be disabled in the tab “Filter”.
2. The role of the variables can be defined as “None” in a Type node.
3. A Filter node can be added to the stream to exclude a variable from
usage in any nodes that follow.
" The user should decide which option to use. We recommend creating
transparent streams that each user can understand, without having to
inspect each node. So we prefer Option 3.
Now we want to reduce the number of variables and give a general description of
the behavior of the respondents. Therefore, we perform a PCA. To do so we should
verify the scale types of the variables and their role in the stream. We double-click
the Type node. Figure 6.8 shows the result. The ID was excluded with a Filter node.
All the other variables with their three codes, “1 = never”, “2 = sometimes”, and
“3 = (very) often”, are ordinally scaled.
As explained in detail in Sect. 4.5, we want to calculate the correlations between
the variables. That means we determine the elements of the correlation matrix.
Here, we have ordinal input variables, which call for Spearman’s rho as an
appropriate measure (see for instance Scherbaum and Shockley 2015, p. 92).
Pearson’s correlation coefficient is an approximation, based on the assumption that
the distances between the scale items are equal.
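The two coefficients can be compared outside the Modeler, e.g., with SciPy. A sketch with invented ordinal answers, coded 1 to 3 as in the survey:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Invented ordinal answers, coded "1 = never", "2 = sometimes", "3 = (very) often".
filling = np.array([1, 2, 3, 3, 2, 1, 2, 3, 1, 2])
hearty = np.array([1, 2, 3, 2, 2, 1, 3, 3, 1, 2])

r_pearson, _ = pearsonr(filling, hearty)    # assumes equal distances between the codes
r_spearman, _ = spearmanr(filling, hearty)  # rank-based, suitable for ordinal scales
print(r_pearson, r_spearman)
```

With only three scale values, the rank-based Spearman coefficient handles the many ties through average ranks, while Pearson treats the codes as equidistant numbers.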
The number of correlations that are at most weak (smaller than 0.3) should be relatively
low. Otherwise, the dataset or the correlation matrix is not appropriate for a PCA.
" The SPSS Modeler does not provide the typical measures used to
assess the quality of the correlation matrix, e.g., the inverse matrix,
the Anti-Image-Covariance Matrix, and the Measure of Sampling
Adequacy (MSA)—also called Kaiser–Meyer–Olkin criterion (KMO).
" After reviewing the scale types of the variables, the correlation matrix
should be inspected. The number of correlations that are at most
weak or very low (below 0.3) should be relatively small.
To calculate the correlation matrix, we can use a Sim Fit node as explained in
Sect. 4.5. We add this type of node to the stream from the Output tab of the Modeler
and connect it with the Type node (Fig. 6.9).
2. Running the Sim Fit node and reviewing the results, we can find the correlation
matrix as shown in Fig. 6.10. With 200 records, the sample size is not too large.
The number of correlations that are at most weak (smaller than 0.3) is 4 out of
10, or 40 %, which is not very small. However, reviewing the correlations shows
that they can be explained logically. For example, the small correlation of
0.012 between the variables “fast_food” and “vegetarian” makes sense. We
therefore accept the matrix as a basis for a PCA.
3. As also explained in Sect. 4.5, the Sim Fit node calculates the correlations based
on an approximation of the frequency distribution. The determined approxima-
tion of the distributions can result in misleading correlation coefficients. The
Fig. 6.9 Sim Fit node is added to calculate the correlation matrix
Fig. 6.10 Correlation matrix is determined with the Sim Fit node
Sim Fit node is therefore a good tool, but the user should verify the results by
using other functions, e.g., the Statistics node.
To be sure that the results reflect the correlations correctly, we want to
calculate the Pearson correlation with a Statistics node. We add this node to
the stream and connect it with the Type node (see Fig. 6.11).
4. The parameter of the Statistics node must be modified as follows:
– We have to add all the variables in the field “Examine” by using the drop-
down list button on the right-hand side. This button is marked with an arrow
in Fig. 6.12. Now all the statistics for the frequency distributions of the
variables can be disabled.
– Finally, all the variables must also be added to the field “Correlate”. Once
more the corresponding button on the right-hand side of the dialog window
must be used.
Now we can run the Statistics node.
5. The results shown in Fig. 6.13 can be found in the last row or the last column
of the correlation matrix in Fig. 6.10. The values of the matrix are the same.
So we can accept our interpretation of the correlation matrix and close the
window.
N.B.: We outlined that there are several statistics for verifying that the
correlation matrix is appropriate for a PCA. In the download section of this book,
an R-Script for this PCA is also available. Here, the following statistics can be
determined additionally:
This shows that the matrix is indeed quite appropriate for demonstration
purposes. For more details on using R with the Modeler, see Sect. 9.
6. Now we can apply the concept of the PCA to the given data. For this, we add a
PCA/Factor node from the SPSS Modeler tab “Modeling” and connect the node
to the Type node (see Fig. 6.14).
7. By double-clicking on the new node, we can modify the details of the calculation
procedure. As we have carefully checked the available variables, as well as their
scale type in the stream, we can use the option “Use type node settings” in the
first tab of the dialog window shown in Fig. 6.15.
Fig. 6.16 Determination of the communality extraction method in the PCA/Factor node
8. We can find the most important options in the tab “Model” of the parameter
dialog (see Fig. 6.16). As decided before, we want to perform a Principal
Component Analysis (PCA) here. It computes factors that extract the maximum
of the variance of the input variables. Further details related to all other
algorithms offered by the Modeler can be found in IBM (2015, p. 163). We will
explain the application of “Principal Axis Factoring (PFA)” in the next section.
9. In the third parameter tab of the PCA/Factor node, shown in Fig. 6.17, the
calculation procedure can be determined in detail. We suggest using the corre-
lation matrix here. Theoretically, we can determine the results based just on the
covariance matrix. In the majority of applications, the correlation matrix should
be used. The correlation matrix should be preferred because of the dependency
of the covariance on the units of the input variables, especially in the case of
different dimensions; also when the variance in the input variables is different
(see Jackson 2003, pp. 64–65; Jolliffe 2002, p. 24).
Additionally, in the “Expert” tab we can determine how many factors to
extract. The aim of a PCA is dimension reduction. Therefore, eigenvectors are
determined to transform the given data (see Fig. 6.4). Every eigenvector has an
eigenvalue. This eigenvalue measures the amount of variance in the data
in the direction of the eigenvector.
Normally, principal components with an eigenvalue larger than one should be
extracted (see Kaiser 1960, p. 146). This rule is also implemented and activated
in the SPSS Modeler. The Modeler does not offer a wide range of statistics,
however, as in the case of the PCA/Factor node. For instance, a scree plot would
be helpful here but is not available. We will show how to create this plot later on
(see Fig. 6.18). Meanwhile, we recommend extracting all eigenvalues in the case
of a PCA. To do this, we set the parameter “Eigenvalues” in Fig. 6.17 to zero.
Based on the results of the PCA, we can easily determine the number of factors
to use.
" The PCA should be applied based on the correlation matrix. This is
especially useful for input variables with different units or
dimensions.
" All eigenvalues (above zero) should be extracted in the first run.
Reviewing the results, the user can determine the number of factors
to use later on. The Kaiser criterion and the scree plot shown in
Fig. 6.18 can help to determine an appropriate number of factors.
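Since the Modeler does not create a scree plot itself, the extracted eigenvalues can be copied and examined externally. A Python sketch that applies the Kaiser criterion and prints the data needed for a scree plot (the eigenvalues below are invented for illustration):

```python
import numpy as np

# Hypothetical eigenvalues of the correlation matrix of five items,
# already ordered by size (invented for illustration).
eigenvalues = np.array([2.65, 1.81, 0.28, 0.16, 0.10])

# Kaiser criterion: extract the components with an eigenvalue larger than one.
n_components = int((eigenvalues > 1).sum())
print("Kaiser criterion:", n_components, "components")

# Data for a scree plot: component number vs. eigenvalue and explained variance.
explained = eigenvalues / eigenvalues.sum()
for i, (ev, ex) in enumerate(zip(eigenvalues, explained), start=1):
    print(f"component {i}: eigenvalue {ev:.2f}, explained variance {ex:.1%}")
```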
10. We run the PCA and we get the additional model nugget node with the
determined model and its parameters (see Fig. 6.19). Usage of the model
nugget is explained in detail in Sect. 5.
11. If we double-click on the model nugget node, the Modeler shows us the results
in the tab “Advanced” (see Fig. 6.20). The initial communalities—the amount
of common variance of the variables explained—is 1.0. That’s because here we
performed a PCA. The PCA is based on the assumption that the whole variance
can be explained by the factors (see Sect. 6.1).
" There are many rules when determining the number of factors to
extract.
" First the user can determine the number of factors by deciding how
large the proportion of cumulative variance explained should be.
" At least all factors with an eigenvalue larger than one should be
extracted. This is called the Kaiser criterion (see, e.g., Guttman 1954).
" A Scree plot can be created, based on the unrotated solution. The plot
visualizes the number of components/factors extracted vs. the per-
centage of explained variance. A rule of thumb is to use as many
factors as eigenvalues before and including the elbow. Sometimes
the scree plot tends to encourage using too many factors (see Patil
et al. 2010). The Modeler does not provide the option of creating a
scree plot, but it can be produced easily with the results shown in the
“Advanced” tab of the factor analysis model nugget.
" Both the Kaiser criterion and the scree plot help to determine a useful
number of components to extract. The researcher should find the
best option with this information.
always be interpreted that easily. Besides this uncertainty, we can also draw a
diagram as shown in Fig. 6.22.
" It is important to note that the factors load to the variables and not
the other way around!
" To analyze the component matrix (see Fig. 6.21) row by row means to
determine which factor or component loads to which variable. If one
variable is associated with more than one factor (more than one value
is equal or larger 0.5), then the interpretation of the corresponding
factors is not that simple.
Finally, the components should be rotated to get a better result. Varimax is an
appropriate procedure for doing this.
12. Before we interpret the components, we want to rotate them and improve the
solution. This means that the coordinate system will be rotated. Based on
Backhaus (2011, p. 363), we can distinguish two types of rotation:
– If the factors (not the input variables) are not correlated with each other, then
an orthogonal rotation such as Varimax can be used.
– If the rotation should be done in respect to a correlation between the factors,
then an oblique-angled rotation should be used.
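An orthogonal Varimax rotation can be sketched with the standard SVD-based algorithm; this is a generic implementation, not the Modeler's internal procedure, and the loading matrix is invented:

```python
import numpy as np

def varimax(loadings, max_iter=100, tol=1e-8):
    """Orthogonal varimax rotation of a loading matrix (standard SVD-based algorithm)."""
    p, k = loadings.shape
    rotation = np.eye(k)
    criterion_old = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        # Gradient step of the varimax criterion, solved via SVD.
        gradient = loadings.T @ (rotated ** 3
                                 - rotated @ np.diag((rotated ** 2).sum(axis=0)) / p)
        u, s, vt = np.linalg.svd(gradient)
        rotation = u @ vt
        criterion = s.sum()
        if criterion_old != 0 and criterion / criterion_old < 1 + tol:
            break
        criterion_old = criterion
    return loadings @ rotation

# Hypothetical unrotated loadings of five items on two components.
L = np.array([[0.60, 0.55], [0.58, 0.50], [0.70, -0.40],
              [0.65, -0.45], [0.62, -0.42]])
L_rot = varimax(L)
print(np.round(L_rot, 3))
```

Because the rotation matrix is orthogonal, the communalities of the variables stay unchanged; only the distribution of the loadings across the factors changes.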
14. Finally, we can run the PCA node once more. Figure 6.25 shows the result.
Here, we can see a better association of all factors with the input variables. See
also Fig. 6.26 for the association of the factors to the variables. It is important to
note that the factors load to the variables.
The sum of all squared factor loadings per component equals the sum of the
squared values per column in Fig. 6.25. This sum equals the eigenvalue of the
component and can be found in Fig. 6.27 in the column “Rotation sums of Squared
Loadings/Total”. For instance,

0.041² + 0.241² + 0.961² + 0.913² + 0.913² = 2.65
The proportion of the variance explained by each factor equals the eigenvalue
divided by the number of variables. As the eigenvalues equal the sums of all the
squared factor loadings, they must be divided by the number of variables. In this
case, factor 1 can explain
(0.041² + 0.241² + 0.961² + 0.913² + 0.913²)/5 = 2.65/5 = 53 %, and factor
2 can explain (0.960² + 0.924² + 0.018² + 0.121² + 0.125²)/5 = 36.1 %. All in all,
53 % + 36.1 % = 89 %. This is what we determined earlier in the PCA (see Fig. 6.20).
" The squared factor loadings equal the percentage of the variance of
the input variable that can be explained by that factor.
" The sum of all squared factor loadings equals the eigenvalues of this
factor.
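These relationships can be verified directly with the rotated loadings from Fig. 6.25 (the sign of the first loading does not matter here, since it is squared):

```python
import numpy as np

# Rotated factor loadings of the five diet items (values from Fig. 6.25).
factor1 = np.array([0.041, 0.241, 0.961, 0.913, 0.913])
factor2 = np.array([0.960, 0.924, 0.018, 0.121, 0.125])
n_vars = 5

eigen1 = (factor1 ** 2).sum()  # column sum of squared loadings = eigenvalue
eigen2 = (factor2 ** 2).sum()

print(round(eigen1, 2), round(eigen1 / n_vars, 2))   # approx. 2.65 and 0.53 (53 %)
print(round(eigen2, 2), round(eigen2 / n_vars, 3))   # approx. 1.81 and 0.361 (36.1 %)
```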
In the last two columns of the table in Fig. 6.29, we can see the so-called “factor
scores” for each case. These are linear combinations of the input variables,
calculated with the PCA components. So each case is expressed in terms of the
determined components.
Using these results here, the answers given by the respondents can be
represented with only a loss of 100 % - 89.16 % = 10.84 % accuracy. The formula
given by the Modeler in the “Model” tab of the model nugget is used to calculate
these factor scores (see Fig. 6.30). For respondent number 1, the following equation
is used
In reference to Janssen and Laatz (2010, pp. 571–573), the factor scores can be
precisely determined for a PCA, but for all other factor analyses types, a multivari-
ate regression model is used. The SPSS Modeler manual does not provide any
detailed information here (see IBM 2015, p. 165). It is obvious though that a
Fig. 6.30 Equations for factor score calculation in the model nugget node
multivariate regression, based on the rotated factor loadings, is used in PCA cases.
See the structure of the equation in Fig. 6.30.
" The factor scores or better still the principal component scores express
the input variables in terms of the determined factors, by reducing the
amount of information represented. The loss of information depends
on the number of factors extracted and used in the formula.
" The factor scores are standardized. So they have a mean of zero and a
standard deviation of one.
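This standardization can be illustrated with a small PCA computed by hand; a sketch with simulated data, not the Modeler's regression-based formula from Fig. 6.30:

```python
import numpy as np

# Simulated correlated answers of 200 respondents on three items (seeded).
rng = np.random.default_rng(7)
X = rng.standard_normal((200, 3)) @ np.array([[1.0, 0.5, 0.3],
                                              [0.0, 1.0, 0.4],
                                              [0.0, 0.0, 1.0]])

# Standardize the input variables (z-scores).
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

R = (Z.T @ Z) / (len(Z) - 1)             # correlation matrix
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Component scores, scaled so that each score has mean 0 and standard deviation 1.
scores = Z @ eigvecs / np.sqrt(eigvals)
print(scores.mean(axis=0).round(6))
print(scores.std(axis=0, ddof=1).round(6))
```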
To visualize the factor scores, we add a Plot node from the Modeler’s Graph tab
to the stream and connect the model nugget node to this new node (see Fig. 6.31).
In the parameter section of the Plot node, we first want to assign factor 1 to the
x-axis and factor 2 to the y-axis (see Fig. 6.32).
Running the Plot node, we get the diagram in Fig. 6.33.
Figure 6.34 is an extended version of Fig. 6.33. It shows three clusters and the
numbers of some records corresponding to the Table node in Fig. 6.29. It is
important to note that these records or row numbers are not representative. Due
to the discrete character of the questionnaire data, or more to the point, because of
the ordinal scale, the data points are overlapping.
This example shows that PCA can also be used as a clustering algorithm. PCA
can do more than just find “clusters of related variables”, as depicted in Fig. 6.26.
Clusters of objects can also be identified. We will discuss this in Exercise 4 of Sect.
7.4.3. This PCA-object-clustering characteristic can only be visualized in cases
with two or three extracted components, however. As we want to show in the
following chapter, a cluster algorithm can also be used to identify clusters based on
the factor scores of the records.
To add more information to the graph, we can modify its parameters or add a
second Plot node to the stream (see Fig. 6.35).
We double-click on the new Plot node and modify the parameters as follows (see
Fig. 6.36):
Figure 6.37 shows the complete stream. The animated diagram produced by the
second Plot node is depicted in Fig. 6.38. We can see that . . .
Fig. 6.39 Diagram with the PCA results (part 1), “low_meat = never”
Fig. 6.40 Diagram with the PCA results (part 2), “low_meat = sometimes”
Depending on the slider position, and the answer to the preference regarding
“low_meat”, we get results as depicted in Figs. 6.39, 6.40, and 6.41.
Summarizing all our findings, we can state that visualization of the factor scores
helps find clusters of respondents with similar dietary characteristics. Figure 6.42
shows three different respondents. Each of them represents one of the clusters. For
an overview of the clusters, see Fig. 6.43 and Table 6.1.
Fig. 6.41 Diagram with the PCA results (part 3), “low_meat = (very) often”
6.3.3 Exercises
If the components in the diet example represent the “heartiness” or the “nativeness/
naturalness” of the food consumed by the respondents, then indices can be
calculated for each type of preference. The indices then measure the level of
“heartiness” or “nativeness/naturalness” a respondent prefers.
The key idea behind index calculation is to use a linear combination for each subset
of the input variables identified by the PCA. The coefficients represent the weight or
the importance of each variable for the index (see also Wendler 2004, pp. 187–196).
To determine the coefficients, factor loadings can be used.
Answer the following questions . . .
3. Merge the calculation results for both indices using a Merge node. Don’t forget
to disable duplicated variables in the options of the Merge node. After merging
the results, finally visualize the index values.
4. Interpret your findings.
1. Open the template stream. Analyze the data, as well as the variables.
Based on the following variables, use the PCA algorithm to determine two
components that represent more than 50 % of the volatility:
“starttime”, “system_availability”, “performance”, “training_quality”,
“user_orientation”, “data_quality”.
2. Explain your findings from the PCA in detail.
(a) Try especially to explain the components that have been determined.
(b) The variable “user_quali” represents the answer to the question “How do
you evaluate your knowledge, abilities, and accomplishments when dealing
with the IT system and the provided applications?”. Explain why this
variable could not be used in the PCA.
3. The input values of the variables named above can now be used to determine the
values of satisfaction indices. Using a separate calculation if necessary, e.g., a
Microsoft Excel spreadsheet, calculate the coefficients for each of the variables.
Standardize the coefficients so that the indices have the same scale as the input
variables.
4. Using the coefficients, extend the stream to calculate and visualize the values of
the satisfaction indices.
5. Summarize your findings in the form of a “Management Summary”.
would then evaluate mathematical terms such as “exponential function” and “divi-
sor”. A detailed description can be found in Sect. 10.1.28. Based on the template
stream “Template-Stream_OECD_PISA_Questionnaire_Question_45”, a PCA
should be performed.
1. Open the template stream and save it under a new name. Analyze the variables
included, as well as their scale.
2. Assess the quality of the correlation matrix.
3. Modifying the parameter of an appropriate Modeler node, perform a PCA.
Discuss especially the number of factors to extract. To do this you should also
create a scree plot.
4. Describe the identified components.
5. Assess the quality of the PCA and the result.
6.3.4 Solutions
Fig. 6.46 Parameters of the Derive node for calculating the “heartiness” index
Fig. 6.47 Parameters of the Derive node for calculating the “nativeness/naturalness” index
3. In the Modeler, a Derive node has to be used to calculate each index. After that,
the results have to be consolidated using a Merge node. In the options for this
node, duplicated input variables must be disabled, as shown in Fig. 6.48.
Now the stream can be extended with a table, as well as a Plot node, to
visualize the result (see Figs. 6.49 and 6.50). Figure 6.51 shows the parameters of
the Plot node, and Fig. 6.52 shows the diagram with the index values.
4. This example shows how to use squared factor loadings to create indices
generally. The idea is to standardize the squared loadings and to use them as
coefficients of a linear combination of the corresponding input variables.
The result helps to cluster the respondents by their preferences. Figure 6.52 is
similar to Fig. 6.43, and the algorithm to calculate the indices can be used
generally for all PCA results. For instance, we can calculate satisfaction indices
based on consumer survey results, by clustering questionnaire items. So we can
see if the satisfaction increases or decreases year by year.
But there are also some disadvantages:
1. Figure 6.53 shows the template stream. The meaning of the input variables is
explained in detail in Sect. 10.1.20.
To perform the PCA, the template stream has to be extended. First of all, we
recommend using a Filter node to select the variables named in the exercise.
Figure 6.54 shows the details. It is important to remember to disable the variables
that are not necessary for the PCA.
Additionally, we add a PCA/Factor node. Figure 6.55 shows the extended
template stream and the model nugget with the PCA. The factor scores are
shown in the Table node.
As described in the exercise, we should extract two components. Figure 6.56
shows the parameters of the PCA/Factor node. The number of components to
extract is at least 2, and the “Varimax” algorithm is used to rotate the solution.
2. Figures 6.57 and 6.58 show the most important details from the PCA results.
With the two components, we can explain 60.255 % of the variance of the used
input variables. This is not too much, but a larger number of components would
be even harder to interpret in the case of just six input variables.
(a) The factor loadings in Fig. 6.57 allow us to assign each variable to a
component. The start time, the system availability, as well as the perfor-
mance are more technical characteristics of an IT system. Therefore, the
component behind these variables will be called “technical satisfaction”.
Later, this will also be the name of the index.
Component 2 refers to the quality of the training offered by the firm, the
user orientation of the system, and the quality of the data that can be
accessed. We will call this component “organizational satisfaction”.
(b) The variable “user_quali” represents the answer to the question “How do
you evaluate your knowledge, abilities, and accomplishments when dealing
with the IT system and the provided applications?” So this is a user self-
assessment and doesn’t represent any aspect of satisfaction with the IT
system. We should not include this variable in any of the calculations.
3. As explained in Sect. 6.3.1, as well as in the previous exercise, the squared factor
loadings equal the explained variance of a variable by a factor. We can use these
numbers as coefficients, to cumulate the input variables into an index. The index
is a linear combination of the variables that are related to one of the components.
For straightforward calculation, we use Microsoft Excel and export the table
in Fig. 6.58 into a separate file with a right mouse-click. Several easy
calculations can be done in this spreadsheet. Standardization is necessary, to
ensure that the index has the same scale as the input variables. We divide the
squared factor loadings by their sum. For details see Fig. 6.59. The results can be
found in file “pca_IT_user_satisfaction_index_coefficients.xlsx”.
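The spreadsheet step can equally be done in a few lines of code: square the loadings, divide by their sum, and use the results as weights. The loadings below are invented placeholders, not the values from Fig. 6.58:

```python
import numpy as np

# Hypothetical loadings of the three "technical" items on component 1
# (placeholders; the real values are in Fig. 6.58 and the downloadable file).
loadings = np.array([0.85, 0.81, 0.78])

# Standardized coefficients: squared loadings divided by their sum, so they add up to 1.
coefficients = loadings ** 2 / (loadings ** 2).sum()
print(coefficients.round(3))

# Index for one respondent: a weighted average of the answers,
# so the index keeps the scale of the input variables.
answers = np.array([3, 2, 3])
index = float(coefficients @ answers)
print(round(index, 2))
```

Because the coefficients sum to one, the index is a weighted average and stays on the same 1-to-x scale as the answers themselves.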
4. Now we can extend the stream to calculate the values of the indices. We use two
Derive nodes. In the formula, we have to define the coefficients manually, as
calculated in Fig. 6.59. Figures 6.60 and 6.61 show the parameters of the nodes.
To have a chance to assess the index values, we have to combine the results of
both parts of the stream. Here, we use a Merge node. The parameters are shown
in Fig. 6.62. Again it is important to disable the duplicates at the bottom of the
dialog window.
To review the results, we add a Table node, a Data Audit node, and a Plot
node, as shown in Fig. 6.63. In the last two columns of the Table node in
Fig. 6.64, the index values for the first five respondents are shown. For a more
detailed analysis, we can use the Data Audit node. The statistical measures of
central tendency for the indices (see Fig. 6.65) show us that there is no respondent
who answered all questions with “1 = poor”. That is because the minimum of the
indices is larger than 1. Theoretically, we could rescale the indices, but the
frequency distributions, as well as the 2D scatterplot of the indices in Fig. 6.66,
show us their practical importance.
5. Management Summary: Technical aspects, as well as organizational aspects,
determine the satisfaction of IT users. In the analysis, we use three variables for
each of these categories. With statistical analysis (PCA), we can find two
important factors, which help us to summarize user opinions. Despite the fact
that the determined components do not represent all the information details
(at least 60 %), we can calculate values within a technical and an organizational
satisfaction index. As the survey will be repeated at adequate intervals, we can
measure satisfaction over time. Regardless of technical details, the firm can
evaluate the effect of IT expenditures based on the indices.
Fig. 6.65 Measures of the central tendency and volatility of the indices
1. The details of the dataset are outlined in Sect. 10.1.28. Here, the description of
the variables and their coding are explained. Figure 6.67 shows the structure of
the Modeler’s template stream, which is the basis for performing a PCA at the
end. Verifying the settings in the Type node, we can see that all the variables are
ordinally scaled. This is correct given the scale of the answers.
2. To inspect the correlation matrix, we use a Sim Fit node. As this node
approximates the frequency distributions of the variables in the background,
we add also a Statistics node to the stream. The extended template stream is
depicted in Fig. 6.68.
The node at the bottom shows the result of the Sim Fit node. Double-clicking
on this node, the correlation matrix in Fig. 6.69 appears. The correlations
determined by the Sim Fit node are correct, as we can see in Fig. 6.70, which
shows the results of the statistics node. The correlations in Fig. 6.70 can be found
in the first row of the correlation matrix.
We discussed the output of the Statistics node in detail in Sect. 4.4. The
absolute values of the correlations can be found in Table 4.2, but here the
correlations are assessed by the Modeler, using their inverse significance. As
shown in Table 4.3, there is a good chance that the variables are correlated if the
inverse significance is larger than 0.95. Scrolling through the results of the
Statistics node, we can see that all correlations (bar one) are strong. That means
that the values are reliable. The absolute values, however, tell us that the
correlation matrix is not a good basis for a PCA.
N.B.: We outlined that there are several statistical tools for verifying if the
correlation matrix is appropriate for a PCA. An R-Script for this PCA is also
available in the download section of this book. Here, the following statistics can
also be determined:
These statistics let us assume that we can use the correlation matrix as the
basis for a quite reliable PCA.
3. We add a PCA/Factor node with the parameters shown in Fig. 6.71. Rotation of
the solution is not yet enabled. We run the PCA/Factor node and add the model
nugget to the stream (see Fig. 6.72).
Assessing the Modeler’s output in Fig. 6.73, we can see that 55.36 % of the
variance in the responses can be explained by just three factors. Figure 6.74
depicts the scree plot. Also here we can see that three factors should be extracted.
Now we can modify the parameters of the PCA/Factor node so that the three
factors are extracted and the Varimax rotation will be used. Figure 6.75 shows
the result.
6.4.1 Theory
The most important details of the factor analysis are explained in Sect. 6.2. Addi-
tionally, we discussed the steps of a factor analysis in Sect. 6.3.1, where we used the
Principal Component Analysis (PCA). In the end, the PCA tries to identify factors
or principal components that are used as “a general description [not reason!] for the
common variance” (see Backhaus 2011, p. 357).
In this chapter, we want to extend our knowledge by looking at Principal Factor
Analysis (PFA). We will use the dataset with responses from 200 consumers
regarding their dietary preferences. For details see Sect. 10.1.24.
Fig. 6.76 Components vs. input variables with the variance proportion explained
" Using a PFA means finding reasons for the values of variables or the
behavior of respondents. PFA should only be used when there is a
theory on the dependency between variables.
" The aim of the PFA is to confirm that theory. This procedure is called a
confirmatory factor analysis.
" If it turns out that the theory is inappropriate, and has to be updated,
then an exploratory factor analysis approach is performed. Statisti-
cally speaking, that means determining factor loadings that best
reproduce the empirical correlation matrix (fundamental theorem)
(see Backhaus et al. 2013, p. 125).
The assumption behind PFA is different (e.g., Tacq 1997, pp. 298–301): The
variance of a variable can be divided into the communality and a residual variance
that cannot be explained using hidden variables. So the initial communality will
always be smaller than 1. Theoretically, we can determine the proportion of
variance that should be reproduced, based on our knowledge of the relationship.
The PFA will determine factors that can reproduce the given proportion. In fact, the
software contains an algorithm that tries to determine the common variance.
Assuming that the SPSS Modeler algorithm and the procedure implemented in
IBM SPSS Statistics are the same, the initial communalities are determined as
multiple determination coefficients of a multivariate regression, with each variable
as the target variable to reproduce.
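Assuming the Modeler follows the SPSS Statistics procedure, the initial communality of a variable is the squared multiple correlation (R²) of a regression of that variable on all other variables. For two predictors, R² has a closed form in terms of the pairwise correlations; the sketch below illustrates this on made-up toy data, not the book's diet dataset:

```python
import math

def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def initial_communality(y, x1, x2):
    """Squared multiple correlation R^2 of y regressed on x1 and x2,
    computed from pairwise correlations (two-predictor closed form)."""
    ry1, ry2, r12 = pearson(y, x1), pearson(y, x2), pearson(x1, x2)
    return (ry1**2 + ry2**2 - 2 * ry1 * ry2 * r12) / (1 - r12**2)

# Toy ordinal responses (illustrative only, not the diet dataset)
y  = [3, 2, 3, 1, 2, 3, 1, 2]
x1 = [3, 2, 3, 1, 1, 3, 1, 2]
x2 = [2, 2, 3, 1, 2, 3, 1, 1]
h2 = initial_communality(y, x1, x2)   # initial communality estimate for y
```

As required by the PFA assumption, the resulting initial communality lies below 1.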
Here, we also want to use data that represent the answers of respondents in relation
to their dietary habits. Based on the categories “vegetarian”, “low meat”, “fast food”,
“filling”, and “hearty”, the respondents should rate the characteristics of their diet on
an ordinal scale. The concrete question was “Please rate how the following dietary
characteristics describe your preferences . . .”. The scale offers the values
“1 = never”, “2 = sometimes”, and “3 = (very) often”. See also Sect. 10.1.24.
In Sect. 6.3.2, we explained in detail how to create a stream to perform a PCA. As
discussed above, the PFA and the PCA calculations are similar, except for the initial
communalities that are used and will be reproduced by the factors. Because the
calculations are so similar, we don’t have to create a new stream here. Instead, we
can reuse the stream created there.
3. The rotated factor loadings are shown in Fig. 6.83. In comparison with the PCA
results in Fig. 6.84, the loadings, and therefore also their communalities, are
smaller. That is because of the restriction imposed by the initial communalities at
the starting point of a PFA extraction.
4. As it does not make sense to plot the factor scores here, we should remove both
Plot nodes. Figure 6.85 shows the final stream.
Fig. 6.86 Reproduced correlations and the residuals determined using SPSS Statistics
Figure 6.86 shows the reproduced correlations and their deviations from the original
values, called residuals. These results are calculated with IBM SPSS Statistics. All
residuals are smaller than 0.05, so the factors can be used to explain the original
variables.
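Per the fundamental theorem, the reproduced correlation of two variables is the sum of the products of their factor loadings; the residual is the difference from the empirical correlation. A minimal sketch with hypothetical loadings and a hypothetical empirical correlation (the book's actual values are in Fig. 6.86):

```python
def reproduced_correlation(loadings_i, loadings_j):
    """Reproduce the correlation of variables i and j from their factor
    loadings: r_hat = sum over factors k of l_ik * l_jk."""
    return sum(li * lj for li, lj in zip(loadings_i, loadings_j))

# Hypothetical two-factor loadings for two variables (not the book's values)
l_veg  = [0.80, 0.10]
l_meat = [0.75, 0.05]
r_hat = reproduced_correlation(l_veg, l_meat)   # 0.80*0.75 + 0.10*0.05 = 0.605
residual = 0.62 - r_hat                         # 0.62 is an assumed empirical correlation
```

With a residual below 0.05, the factor solution would reproduce this correlation acceptably well by the rule used in the text.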
6.4.3 Exercises
1. Explain the difference between the PCA and the PFA algorithm in your own
words.
2. Remember the PCA and PFA diet example shown here. Figures 6.83 and 6.84
show the determined factor loadings. Give a detailed interpretation of the
“practical” meaning of both results.
3. The right-hand side of Fig. 6.81 shows the extracted communalities. Figure 6.83
shows the rotated factor loadings.
(a) Interpret the communality 0.857 for the variable “low_meat”.
(b) Explain the calculation of the value 0.857 for the variable “low_meat”,
using the factor loadings shown in Fig. 6.83.
1. Using the dataset, create a new stream from scratch. There is no template stream
given. Load the dataset using an appropriate node. Show the data and analyze the
variables, with regard to the possibility of using them in a PFA.
2. Now add all nodes that are necessary to perform a PFA to the stream.
3. Interpret your results.
6.4.4 Solutions
1. The difference between the PCA and the PFA algorithm is described in
Sects. 6.1, 6.3.2, and 6.4.1.
2. The factors determined by a Principal Factor Analysis (PFA) can be interpreted
as “the reason for the common variance”, whereas the factors determined by a
PCA are merely “a general description [not reason!] for the common variance”
(see Backhaus 2011, p. 357).
1. Figure 6.87 shows the complete stream. To import the dataset, we add a
Statistics File node and a Table node. As we can see in Fig. 6.88, four variables
are defined. The variable “name” cannot be used in a PFA. The other three
variables are metrically scaled and therefore provide a possible basis for a PFA.
To create a transparent stream, we recommend adding a Type node. As
depicted in Fig. 6.89, the variables “price”, “calories”, and “alcohol” are contin-
uous. Additionally, the role for the variable “name” is “None”.
In addition, here we can add a Filter node to entirely block the variable
“name” from usage in any other nodes that follow. This is not really necessary
as the role was already set to “None”, but in our experience, one should do
everything possible to create streams that can be easily understood by each user.
You can decide which option to use.
Here, we add a Filter node to exclude the variable “name” (see Fig. 6.90).
2. To perform a PFA, we need to add a PCA/Factor node at the end of the stream. In
the options of the node, we define “Principal Axis Factoring” as the extraction
method and, in the Expert tab, choose to extract all factors with eigenvalues larger
than 0. Finally, it is important to define the rotation method. Here, we used
“Varimax” (Figs. 6.91 and 6.92).
3. We can identify two factors. Factor 2 loads only the variable “price”. Factor
1 loads only “calories” and “alcohol”. Obviously, factor 2 is not very useful:
its factor loading is 0.410, and so the communality is only 0.168 (see
Fig. 6.93). The proportion of variance explained by factor 2 is therefore negligible,
whereas factor 1 can be expected to be useful. Looking at the communalities, factor
1 explains a significant proportion of the variance of both of its input variables.
Here, we can identify a “common” proportion of variance between “calories”
and “alcohol”. Knowing the aim of a PFA, we can describe factor 1 as the
“heaviness” or “strength” of a beer. This example is useful for training purposes
only, however.
Literature
Backhaus, K. (2011). Multivariate Analysemethoden: Eine anwendungsorientierte Einführung,
Springer-Lehrbuch (13th ed.). Berlin: Springer.
Backhaus, K., Erichson, B., & Weiber, R. (2013). Fortgeschrittene multivariate
Analysemethoden: Eine anwendungsorientierte Einführung, Lehrbuch (2nd ed.). Berlin:
Springer Gabler.
Cattell, R. B. (1966). The scree test for the number of factors. Multivariate Behavioral Research,
1, 245–276.
Dziuban, C. D., & Shirkey, E. C. (1974). When is a correlation matrix appropriate for factor
analysis? Some decision rules. Psychological Bulletin, 6(81), 358–361.
Guttman, L. (1953). Image theory for the structure of quantitative variates. Psychometrika, 18(4),
277–296.
Guttman, L. (1954). Some necessary conditions for common-factor analysis. Psychometrika, 19
(2), 149–161.
IBM. (2011). Kaiser-Meyer-Olkin measure for identity correlation matrix – United States.
Accessed March 18, 2015, from http://www-01.ibm.com/support/docview.wss?uid=swg21479963
IBM. (2015). SPSS Modeler 17 modeling nodes. Accessed September 18, 2015, from ftp://public.dhe.
ibm.com/software/analytics/spss/documentation/modeler/17.0/en/ModelerModelingNodes.pdf
Jackson, J. E. (2003). A user’s guide to principal components. New York: Wiley.
Janssen, J., & Laatz, W. (2010). Statistische Datenanalyse mit SPSS: Eine anwendungsorientierte
Einführung in das Basissystem und das Modul Exakte Tests [Zusatzmaterial online] (7th ed.).
Berlin: Springer.
Jolliffe, I. T. (2002). Principal component analysis, Springer series in statistics (2nd ed.).
New York: Springer.
Kaiser, H. F. (1960). The application of electronic computers to factor analysis. Educational and
Psychological Measurement, 20, 141–151.
Kaiser, H. F., & Rice, J. (1974). Little Jiffy, Mark IV. Educational and Psychological Measure-
ment, 34, 111–117.
Patil, V. H., McPherson, M. Q., & Friesner, D. (2010). The use of exploratory factor analysis in
public health: A note on parallel analysis as a factor retention criterion. American Journal of
Health Promotion, 24(3), 178–181.
Scherbaum, C., & Shockley, K. M. (2015). Analysing quantitative data: For business and
management students, mastering business research methods. London: Sage.
Smith, L. (2002). A tutorial on principal components analysis. Accessed March 13, 2015, from
http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf
Tacq, J. J. A. (1997). Multivariate analysis techniques in social science research: From problem to
analysis. London: Sage.
Wendler, T. (2004). Modellierung und Bewertung von IT-Kosten: Empirische Analyse mit Hilfe
multivariater mathematischer Methoden, Wirtschaftsinformatik. Wiesbaden: Deutscher
Universitäts-Verlag.
Cluster Analysis
7
1. Evaluate data using more complex statistical techniques such as cluster analysis,
2. Explain the difference between several approaches for dealing with large datasets in
cluster analysis, using the TwoStep or K-Means algorithms,
3. Describe the advantages and the pitfalls of the cluster analysis methods,
4. Apply TwoStep or K-Means and explain the results, as well as
5. Describe the usage of the Auto Clustering node of the SPSS Modeler and its
pitfalls.
Ultimately, the reader will be called upon to propose well thought-out and
practical business actions from the statistical results.
successful segmentation, the department store can forecast the possible needs of
a customer group and deliver attractive offers.
• Banking—identification of groups in nonperforming loans.
Examining a dataset with firm-specific information (type of firm, business
sector, number of employees, rated experience of owner, etc.) on nonperforming
loans, a bank can define firm categories. In the case of a new enquiry for a loan,
the bank is able to predict the risk, based on the data of the firm. This process is
also called rating.
• Medicine—finding diagnostic clusters.
Based on data, a risk evaluation with respect to the carcinogenic qualities of
certain substances can be performed.
• Education—Identifying groups of students with special needs.
Using data on socioeconomic background, combined with the performance of
the students in several courses or programs, the university staff can identify
student subgroups with special needs, such as more intensive support, additional
exercises, or advanced services, such as use of the electronic services offered on
a university-wide online platform.
Customer segmentation
Customer segmentation, also called market segmentation, is one of the main fields
where cluster analysis is often used. A market is divided into subsets of customers
who have typical characteristics. The aim of this segmentation is to identify target
customers or to reduce risks.
In the banking sector, this technique is used to improve the profitability of
business and to avoid risks. If a bank can identify customer groups with lower or
higher risk of default, the bank can define better rules for money lending or credit
card offers. We apply cluster algorithms for customer segmentation purposes. See
also Sect. 10.1.7.
As outlined in the previous section, the term “cluster analysis” stands for a set of
different algorithms for finding subgroups in a dataset. Before we start to explain
the three algorithms available in SPSS Modeler, we want to give an overview of the
different approaches. Figure 7.2 shows their names, as well as the structure of this
section.
The big picture allows us to characterize the procedures by their advantages and
disadvantages, so that later on we can identify the correct procedure for the data
given or the problem to solve.
Considering a given dataset with two variables, as depicted in Fig. 7.3, we can
imagine different procedures for finding subgroups of objects. The variables here
are metrical and we can find the subgroups by determining the distance between the
objects pairwise. That means, determining their dissimilarity. We will show this
approach later in detail. For now, it is important to note that measuring the
similarity or the dissimilarity/distance of objects is the basis of all cluster
algorithms. For the first step, we want to focus here on procedures for assigning
the objects to the subgroups. Later we will discuss the measures in detail.
Figure 7.4 shows the big picture: how to categorize the clustering algorithms.
Hierarchical approaches can be divided into agglomerative or divisive algorithms.
Both are easy to understand. The agglomerative algorithms measure the distance
between all objects. In the next step, objects that are close are assigned to one
subgroup. In a recursive procedure, the algorithms now calculate the distances
between the more or less large subgroups and merge them stepwise by their
distance.
The divisive algorithms assign all objects to the same cluster. This cluster is then
divided step-by-step, so that in the end homogeneous subgroups are produced.
" The scale type of the variables is crucial for the algorithm and the
result. The clusters contain a high homogeneity themselves (intra-
cluster homogeneity) and a small homogeneity between each other
(inter-cluster separability).
Considering such binary variables, several similarity functions can be used, e.g.,
Tanimoto, simple matching, or Russel & Rao coefficients. Nonbinary qualitative
variables must be recoded into a set of binary variables. This will be discussed in
more detail in exercise 2 of Sect. 7.2.1. The interested reader is referred to Timm
(2002), pp. 519–522 and Backhaus (2011), p. 402.
The Tanimoto coefficient

    s_ij = a / (a + b + c)

is based on a contingency table shown in Table 7.3.
As we can see, the Tanimoto coefficient is the proportion of the common
characteristics a to the characteristics that are present in at least one object,
represented by a + b + c. Other measures, e.g., the Russel–Rao coefficient, use
other proportions. See Timm (2002), p. 521.
Given the example of products A and B in Table 7.2, we determine the
frequencies as shown in Table 7.4. The Tanimoto coefficient is then

    s_AB = 2 / (2 + 2 + 1) = 0.4
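The Tanimoto calculation can be sketched in a few lines. The binary profiles below are chosen so that they reproduce the frequencies a = 2, b = 2, c = 1 from the text; the actual characteristics of products A and B in Table 7.2 are assumptions:

```python
def tanimoto(u, v):
    """Tanimoto similarity of two binary vectors: a / (a + b + c), where
    a = both 1, b = only the first is 1, c = only the second is 1."""
    a = sum(1 for x, y in zip(u, v) if x == 1 and y == 1)
    b = sum(1 for x, y in zip(u, v) if x == 1 and y == 0)
    c = sum(1 for x, y in zip(u, v) if x == 0 and y == 1)
    return a / (a + b + c)

# Illustrative binary profiles yielding a=2, b=2, c=1 (not Table 7.2 itself)
product_a = [1, 1, 1, 1, 0]
product_b = [1, 1, 0, 0, 1]
s_ab = tanimoto(product_a, product_b)   # 2 / (2 + 2 + 1) = 0.4
```

Note that the characteristics both objects lack (d in the contingency table) do not enter the coefficient at all.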
The solution can also be found in the Microsoft Excel file, “cluster dichotomous
variables example.xlsx”.
In the case of quantitative metrical variables, the geometric distance can be used
to measure the dissimilarity of objects. The larger the distance, the less similar or
the more dissimilar they are.
Considering two prices of products, we can use the distance, e.g., given by the
absolute value of the difference. If two or three metrical variables are available to
describe the product characteristics, then we can create a diagram as shown in
Fig. 7.6 and measure the distance by using the Euclidean distance, known from
school math. This approach can also be used in n-dimensional vector space. To
have a more outlier-sensitive measure, we can improve the approach by looking at
the squared Euclidean distance. Table 7.5 gives an overview of the different
measures, depending on the scale type of the variables.
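The distance measures for metrical variables mentioned above can be sketched as follows; the two example objects are illustrative and not taken from a table in this book:

```python
import math

def city_block(p, q):
    """L1 metric: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

def euclidean(p, q):
    """L2 metric: square root of the sum of squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def squared_euclidean(p, q):
    """More outlier-sensitive variant: no square root is taken."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

# Two objects described by two metrical variables (illustrative values)
p1 = (3.0, 2500.0)
p2 = (7.0, 2800.0)
d1 = city_block(p1, p2)          # 4 + 300 = 304
d2 = euclidean(p1, p2)           # sqrt(16 + 90000)
d3 = squared_euclidean(p1, p2)   # 90016
```

The example also hints at why standardization matters: the variable with the larger scale dominates every one of these distances.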
" Proximity measures are used to identify objects that belong to the
same subgroup in a cluster analysis. They can be divided into two
groups: similarity and dissimilarity measures. Nominal variables are
recoded into a set of binary variables, before similarity measures are
used. Dissimilarity measures are mostly distance-based. Different
approaches/metrics exist to measure the distance between objects
described by metrical variables.
" The SPSS Modeler offers the log-likelihood or the Euclidean distance
measures. In the case of the log-likelihood measure, the variables
have to be assumed as independent. The Euclidean distance can only
be calculated for continuous variables.
The SPSS Modeler implements two clustering methods in the classical sense:
TwoStep and K-Means. Additionally, the Kohonen algorithm, as a specific neural
network type, can be used for classification purposes. The Auto Cluster node
summarizes all these methods and helps the user to find an optimal solution.
Table 7.6 includes an explanation of these methods.
To understand the advantages and disadvantages of the methods mentioned, it is
helpful to understand in general, the steps involved in clustering algorithms.
Therefore, we will explain the theory of cluster algorithms in the following section.
After that, we will come back to TwoStep and K-Means. Based on the theory, we
can understand the difficulties in dealing with clustering algorithms and how to
choose the most appropriate approach for a given dataset.
7.2.1 Exercises
1. Define in your own words what is meant by proximity, similarity, and distance
measure.
2. Name one example for a measure of similarity as well as one measure for
determining dissimilarity/distances.
3. Explain why we can’t use distance measures for qualitative variables.
4. Similarity measures can typically only deal with binary variables. Consider a
nominal or ordinal variable with more than two values. Explain a procedure to
recode the variable into a set of binary variables.
5. In the theoretical explanation, we discussed how to deal with binary and with
quantitative/metrical variables in cluster analysis. For binary variables, we use
similarity measures as outlined in this section by using the Tanimoto coefficient.
In the case of metrical variables, we also have a wide range of measures to choose
from. See Table 7.5. Consider a dataset with binary and metrical variables that
should be the basis of a cluster analysis. Outline at least two possibilities of how to
use all of these variables in a cluster analysis. Illustrate your approach using an
example.
1. Figure 7.6 illustrates two objects represented by two variables. Consider a firm
with employees P1 and P2. Table 7.7 shows their characteristics, depending on
employment with a company and their monthly net income. Using the formulas
in Table 7.5, calculate the City-block metric (L1-metric), the Euclidean distance
(L2-metric), and the squared Euclidean distance.
2. In the theoretical part, we said that the squared Euclidean distance is more
outlier-sensitive than the Euclidean distance itself. Explain!
7.2.2 Solutions
1. A similarity measure helps to quantify the similarity of two objects. Such measures
are often used for qualitative (nominal or ordinal) variables. A distance measure
calculates the geometrical distance between two objects and can therefore be
interpreted as a dissimilarity measure. Similarity and distance measures together
are called proximity measures.
Table 7.9 Scheme for recoding a nominal or ordinal variable into a set of binary variables

          Binary variable 1   Binary variable 2   Binary variable 3
2-star    0                   0                   0
3-star    1                   0                   0
4-star    1                   1                   0
5-star    1                   1                   1
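The cumulative recoding scheme of Table 7.9 can be sketched as a small function, using the hotel-star categories from the table:

```python
# Ordinal categories in ascending order, as in Table 7.9
CATEGORIES = ["2-star", "3-star", "4-star", "5-star"]

def recode(rating):
    """Recode an ordinal rating into three binary variables following the
    cumulative scheme of Table 7.9: binary variable k is 1 if the rating
    lies above the k-th category."""
    level = CATEGORIES.index(rating)              # 0 .. 3
    return [1 if level > k else 0 for k in range(3)]
```

For example, `recode("4-star")` yields `[1, 1, 0]`, exactly the fourth row of the table.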
Frequencies for products A and C:

                Product C
                1     0
Product A   1   1     3
            0   1     0

    s_AC = 1 / (1 + 3 + 1) = 0.2
Frequencies for products B and C:

                Product C
                1     0
Product B   1   1     2
            0   1     0

    s_BC = 1 / (1 + 2 + 1) = 0.25
The squared Euclidean distance is

    d = Σ_{i=1}^{n} (x_i − y_i)²
Considering the 1-dimensional case of two given values 1 and 10, we can see
that the Euclidean distance is 9 and the squared Euclidean distance is 81. The
reason for the outlier-sensitivity, however, has nothing to do with simply getting a
larger number! If two objects are far apart, we square the differences of their
components; in the case of the Euclidean distance, this effect is then reduced by
calculating the square root at the end. See also the explanation of the standard
deviation, where the same approach is used to define an outlier-sensitive measure
for volatility in statistics.
So far, we have discussed the aim and general principle of cluster analysis
algorithms. The SPSS Modeler offers different clustering methods. These methods
are certainly advanced; however, to understand the advantages and the challenges
of using clustering algorithms, we want to explain the idea of hierarchical
clustering here in more detail. The explanation is based on an idea presented in
Handl (2010), pp. 364–383, but we will use our own dataset.
The data represent prices of six cars in different categories. The dataset includes
the name of the manufacturer, the type of car, and a price. Formally, we should
declare that the prices are not representative for the models and types mentioned.
Table 7.11 shows the values.
The only variable that can be used for clustering purposes here is the price. The
car ID is nominally scaled, and we could in principle use a similarity measure (e.g.,
the Tanimoto coefficient) for such variables; but remember that the IDs are arbitrarily
assigned to the cars.
Based on the data given in this example, we can calculate the distance between
the objects. We discussed distance measures for metrical variables in the previous
chapter. The Euclidean distance between car 1 and 2 is
    d_Euclidean(1; 2) = √((13 − 19)²) = 6
The K-Means algorithm of the SPSS Modeler is based on this measure. See IBM
(2015a), pp. 229–230. We will discuss this procedure later. For now we want to
explain with an example, how a hierarchical cluster algorithm works in general.
The following steps are necessary:
Table 7.12 Overview of initial clusters and the objects assigned

Cluster description   Objects/cars that belong to the cluster
Cluster 1             1
Cluster 2             2
Cluster 3             3
Cluster 4             4
Cluster 5             5
Cluster 6             6
5. Calculating the similarities/distances between the new cluster and all other
objects. Updating the similarity/distance matrix.
6. If not all objects are assigned to a cluster, go to step 3, otherwise stop.
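The steps above can be sketched as a small single-linkage implementation. The squared Euclidean distance is used, and the prices below are chosen to be consistent with the pairwise distances quoted in this section; the exact values of Table 7.11 are assumptions:

```python
def single_linkage(points, target_clusters):
    """Agglomerative single-linkage clustering with the squared Euclidean
    distance: start with one cluster per object, then repeatedly merge the
    pair of clusters with the smallest minimum pairwise distance until
    target_clusters remain."""
    d = lambda i, j: (points[i] - points[j]) ** 2
    clusters = [{i} for i in range(len(points))]
    while len(clusters) > target_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single linkage: distance of the closest pair of members
                dist = min(d(i, j) for i in clusters[a] for j in clusters[b])
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        _, a, b = best
        clusters[a] |= clusters[b]   # merge the two closest clusters
        del clusters[b]
    return clusters

# Car prices (assumed, consistent with the distances quoted in this section)
prices = [13, 19, 27.5, 28, 39, 44]
result = single_linkage(prices, 2)   # stop at two clusters
```

With these prices, the merges happen in the same order as in the manual calculation: first cars 3 and 4, then cars 5 and 6, then cars 1 and 2, and finally {1; 2} with {3; 4}.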
Initially, the objects are not assigned to a specific cluster, but mathematically we
can say that each of them form a separate cluster. Table 7.12 shows the initial status
of the clustering.
    d(1; 2) = d(2; 1)
    …
Start of Iteration 1
We can see in the steps of the algorithm that more than one iteration will be
necessary, most of the time. In the following paragraphs, we extend the description
of steps 3 to 6 with the number of the iteration.
Table 7.14 Overview of clusters and the objects assigned after iteration 1

Cluster description   Objects/cars that belong to the cluster
Cluster 1             1
Cluster 2             2
Cluster 3 new         3; 4
Cluster 5             5
Cluster 6             6
    d(3; 1) = 210.25
    d(4; 1) = 225

Once more based on the principle that the smallest distance will be used, the
distance between the new cluster {3; 4} and car 1 is

    d({3; 4}, 1) = min(210.25, 225) = 210.25
Start of Iteration 2
Iteration 2, Step 3: Determining the objects/clusters that are “most similar”
The minimum distance is d(5; 6) = 25.
Table 7.16 Overview of clusters and the objects assigned after iteration 2

Cluster description   Objects/cars that belong to the cluster
Cluster 1             1
Cluster 2             2
Cluster 3             3; 4
Cluster 4 new         5; 6
    d(5; 1) = 676
    d(6; 1) = 961
    d(5; 2) = 400
    d(6; 2) = 625
Table 7.17 shows the new distance matrix so far, but we also have to calculate the
distances between the cluster with cars 3; 4 and the new cluster with cars 5 and
6. We use Table 7.15 and find that

    d(3; 5) = 132.25, d(3; 6) = 272.25

and

    d(4; 5) = 121, d(4; 6) = 256

So the distance is

    d({3; 4}, {5; 6}) = min(132.25, 272.25, 121, 256) = 121
Table 7.19 Overview of clusters and the objects assigned after iteration 3

Cluster description   Objects/cars that belong to the cluster
Cluster 1 new         1; 2
Cluster 3             3; 4
Cluster 4             5; 6
Iteration 3, 4, and 5
The minimum distance in Table 7.18 is 36. So the objects or cars 1 and 2 are
assigned to a new cluster.
The distances from the new cluster {1; 2} to the other existing clusters can be
determined using Table 7.18. They are

    d(1; 3) = 210.25, d(1; 4) = 225

and

    d(2; 3) = 72.25, d(2; 4) = 81

So the distance

    d({1; 2}, {3; 4}) = min(210.25, 225, 72.25, 81) = 72.25

Furthermore, the distances from the new cluster {1; 2} to the existing cluster {5; 6}
are

    d(1; 5) = 676, d(1; 6) = 961, d(2; 5) = 400, d(2; 6) = 625

So the distance

    d({1; 2}, {5; 6}) = min(676, 961, 400, 625) = 400
Fig. 7.10 Dendrogram for simple clustering example using the car dataset
    (25 / 121) · 0.25 = 0.0517
Table 7.23 shows all the other values that can be found in the dendrogram in
Fig. 7.10. This calculation can also be found in the Microsoft Excel spreadsheet
“Car_Simple_Clustering_distance_matrices.xlsx”.
The different “distance measures” are shown in Table 7.5. So far, we have defined how to measure the
distance between the objects or clusters.
The next question to answer with a clustering algorithm is which objects or
clusters to merge. In the example presented above, we used the minimum distance.
This procedure is called “Single-linkage” or the “nearest neighbor” method. In the
distance matrices, we determine the minimum distance and merge the
corresponding objects by using the row and the column number. This means that
the distance from a cluster to another object “A” equals the minimum distance
between each object in the cluster and “A”.
    d(object A, {object 1; object 2}) = min(d(object A, object 1), d(object A, object 2))
used to describe the data best, however. This situation is similar to PCA or PFA
factor analysis. There also, we had to determine the number of factors to use (see
Sect. 6.3). In the dendrogram in Fig. 7.10, the cutoff shows an example using two
clusters.
A lot of different methods exist to determine the appropriate number of clusters.
Table 7.25 shows them with a short explanation. For more details, the interested
reader is referred to Timm (2002), p. 531–533. In our example, the rule of thumb
tells us that we should analyze clustering results with
    √(number of objects / 2) = √(6 / 2) ≈ 2 (clusters)
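The rule of thumb can be sketched in one function; rounding to the nearest integer is our assumption:

```python
import math

def max_clusters_rule_of_thumb(n_objects):
    """Rule of thumb: analyze clustering solutions with up to
    sqrt(n / 2) clusters (rounded to the nearest integer)."""
    return round(math.sqrt(n_objects / 2))

k = max_clusters_rule_of_thumb(6)   # sqrt(3) is about 1.73, so 2 clusters
```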
Later on we will also use the Silhouette plot of the SPSS Modeler.
moderate priced cars” and the latter cluster with two cars can be described as
“luxury cars”.
Advantages
– TwoStep can deal with categorical and metrical variables at the same time. In
these cases, the log-likelihood measure is used.
– By using Bayesian information criterion, or Akaike’s Information Criterion, the
TwoStep algorithm implemented in the SPSS Modeler determines the “optimal”
number of clusters.
– The minimum or maximum number of clusters that come into consideration can
be defined.
– The algorithm tends not to produce clusters with approximately the same size.
Disadvantages
In Sect. 7.3.1, we discussed the theory of agglomerative clustering methods using the
single-linkage algorithm. Additionally, we learned in the previous section that
the TwoStep algorithm is an improved version of an agglomerative procedure.
Here, we want to use TwoStep to demonstrate its usage for clustering objects in the
very small dataset “car_simple”. We know the data from the theoretical section. They
are shown in Table 7.11. For more details see also Sect. 10.1.4. First we will build the
stream and discuss the parameters of the TwoStep algorithm, as well as the results.
type of the variables is not the reason to exclude them from clustering, but the
variables “manufacturer”, “model”, and “dealer” do not provide any valuable
information. At most, a domain expert could cluster the cars based on some of
these variables, using knowledge about the reputation of the manufacturer and
the typical price ranges of the cars.
So the only variable that can be used here for clustering is the price. We now
should check the scale of measurement.
4. Using the Type node, we can see that all the variables are already defined as
nominal, except the price with its continuous scale type (Fig. 7.16). Normally,
we can define the role of the variables here and exclude the first three, but we
recommend building transparent streams and so we will use a separate node to
filter the variables.
We add a Filter node to the stream, right behind the Type node (see
Fig. 7.17).
We find out that apart from the variable “price”, no other variable can be
used for clustering. As we know from the theoretical discussion in Sect. 7.1
though, a clustering algorithm only provides cluster numbers. The researcher
must find useful descriptions for each cluster, based on the characteristics of the
assigned objects.
So any variable that helps us to identify the objects should be included in the
final table alongside the cluster number determined by the algorithm. In our
dataset, the manufacturer and the model of the car are probably helpful. The
name of the dealer does not give us any additional input, so we should exclude
this variable. We exclude it by modifying the Filter node parameter, as shown
in Fig. 7.18.
5. For the next step, we add a TwoStep node from the Modeling tab of the SPSS
Modeler. We connect this node to the Filter node (see Fig. 7.19). So the only
input variable for this stream is the price of the cars. Only this information
should be used to find clusters.
6. We double-click on the TwoStep node. In the Fields tab, we can choose to use
variables based on the settings in the Type node of the stream. Here, we will
add them manually. To do this, we use the button on the right marked with an
arrow in Fig. 7.20.
" The researcher should keep in mind, however, that the clusters must
be described based on the characteristics of the objects assigned. So
it is helpful not to filter out the object IDs or the names. Keeping them is
helpful even if they are not used in the clustering procedure itself.
" The variables used for clustering should be determined in the clus-
tering node, rather than using the Type node settings.
7. In the Model tab, we can find other parameters as shown in Fig. 7.21. We will
explain them in more detail here.
By default, numeric fields are standardized. This is very important for the
cluster procedure: different scales or different codings of the input variables
lead to very different magnitudes of their attributes. To make the values
comparable, they must be standardized. We outlined the z-standardization in
Sect. 2.7.6. Here, this method is automatically activated and should be used.
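The z-standardization mentioned above can be sketched as follows, using the car prices assumed for the simple example in this chapter:

```python
import math

def z_standardize(values):
    """z-standardization: subtract the mean and divide by the (sample)
    standard deviation, so every variable is on a comparable scale."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]

prices = [13, 19, 27.5, 28, 39, 44]   # assumed car prices from this example
z = z_standardize(prices)             # mean 0, standard deviation 1
```

After standardization, the price no longer dominates any other (hypothetical) input variable purely because of its scale.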
Using the option “Cluster label”, we can define whether a cluster label is a string
or a number. See the arrow in Fig. 7.21. Unfortunately, there is a bug in the
software of Modeler version 17 (and probably earlier versions). If the option
“Cluster label” is changed to “Number”, then the clustering result also changes for
no reason. We have reported this bug, and it will be fixed in the following releases.
For now, we do not recommend using this option, despite the fact that it makes
handling cluster numbers easier.
" The option “Cluster label” must be used carefully. The clustering
results can change for no reason. It is recommended to use the
option “string” instead.
Fig. 7.22 Options in the TwoStep node, with two clusters specified
8. The rule of thumb

    √(number of objects / 2) = √(6 / 2) ≈ 2

tells us to try a fixed number of two clusters. We modify the options in the TwoStep node
as shown in Fig. 7.22. We run the node once again with these new parameters.
9. We get a new model nugget as shown in Fig. 7.23.
10. Before we start to show the details of the analysis, we finish the stream by adding
a Table node behind the model nugget node. Figure 7.24 shows the final stream.
" Using the TwoStep algorithm, the following steps are recommended
1. The scale types defined in the Type node are especially impor-
tant for the cluster algorithms. That’s because the usage of the
distance measure (log-likelihood or euclidean), for example,
depends on these definitions.
2. In a Filter node, variables can be excluded that are unnecessary
for the clustering itself. The object ID (and other descriptive
variables) should not be filtered. That’s because it/they can
help the user to identify the objects.
Fig. 7.23 TwoStep cluster model nugget node is added to the stream
Fig. 7.24 Final TwoStep cluster stream for simple car example
" The SPSS Modeler shows the silhouette value in the model summary
of the Model Viewer on the left. Moving the mouse over the diagram
is one way to get the silhouette value. This is depicted in Fig. 7.26 in
the middle. To get a more precise result, one can use the option
“Copy Visualization Data”. To do this, the second button from left in
the upper part of the Model Viewer should be clicked. Then the
copied values must be pasted into simple word-processing soft-
ware. Table 7.26 shows the result.
Table 7.26 Silhouette measure with full precision—Copied with option “Copy Visualization
Data”

Category  Silhouette measure of cohesion and separation  V3
1         0.7411                                         0.7
On the right in Fig. 7.26, the Modeler shows us that we have two clusters, with
four and two elements, respectively. The ratio of the cluster sizes is therefore two.
In the left corner of Fig. 7.26, we select the option “Clusters” from the drop-
down list, instead of the “Model Summary”.
We can then analyze the clusters as shown in Fig. 7.27. By selecting cluster
1 (marked with an arrow in Fig. 7.27) on the left, we will get a more detailed output.
In the drop-down list on the right, field “View”, we can choose “Cell Distribution”.
By selecting the cluster on the left, we can see the frequency distribution of the
objects on the right.
Another valuable analysis is offered in the model viewer, if we use the symbols
in the left window below the clustering results. We can activate the “cell distribu-
tion” button, marked with an arrow in Fig. 7.28. The distribution of each variable in
the clusters then appears above.
626 7 Cluster Analysis
Focusing once more on Fig. 7.25, we can see that the clustering result is exactly
the same as the result achieved in Sect. 7.3.1. Comparing the results of the manual
calculation with the visualization in Fig. 7.10, in the form of a dendrogram, we see
that here too, cars 1–4 are assigned to one cluster and cars 5 and 6 to another.
Summary
With the TwoStep algorithm, we used a small dataset to find clusters, in order to
show that the result is the same as the one obtained with the “single-linkage method”
in the theory section. Unfortunately, we had to decide the number of clusters in
advance. Normally, this is not necessary when using the TwoStep algorithm.
After determining “segments” of the objects, the silhouette plot helps to assess
the quality of the model generally. Furthermore, we learn how to analyze the
different clusters step-by-step, using the different options from the model viewer.
7.3.4 Exercises
are you with the amount of time your IT system takes to be ready to work (from
booting the system to the start of daily needed applications)?” The users could rate
the aspects using the scale “1 = poor, 3 = fair, 5 = good, 7 = excellent”. See also
Sect. 10.1.20.
You can find the results in the file “IT user satisfaction.sav”. In Sect. 6.3.3,
exercise 3, we used results of a PCA to determine technical and organizational
satisfaction indices. Now the users should be divided into groups based on their
satisfaction with both aspects.
1. The data can be found in the SPSS Statistics file “nutrition_habites.sav”. The
Stream “Template-Stream_nutrition_habits” uses this dataset. Please open the
template stream and make sure the data can be loaded. Save the stream with
another name.
2. Use a TwoStep clustering algorithm to determine consumer segments. Assess
the quality of the clustering, as well as the different groups of consumers
identified. Use a table to characterize them.
3. The variables used for clustering are ordinally scaled. Explain how proximity
measures are used to deal with such variables in clustering algorithms.
Remark
Please note that in exercise 2 in Sect. 7.5.3 and also in the solution in Sect. 7.5.4, an
alternative Kohonen and K-Means model will be presented using the Auto
Cluster node.
7.3.5 Solutions
1. Unsupervised learning methods do not need a target field to learn how to handle
objects. In contrast to tree methods, for example, these algorithms try to
categorize the objects into subgroups by determining their similarity or
dissimilarity.
2. First of all, in clustering we find more than one correct solution for segmentation
of the data. So different parameters, e.g., proximity measures or clustering
methods, should be used and their results should be compared. This is also a
“type of validation”, but there are many reasons not to divide the original dataset
when using TwoStep:
(a) Partitioning reduces the information presented to the algorithm for finding
“correct” subgroups of objects. Reducing the number of records in cases of
small samples will often lead to worse clustering results. We recommend
using the partitioning option only in cases of huge sample size.
(b) Clusters are defined by inspecting each object and measuring the distance
from or similarity to all other objects. If objects are excluded from cluster-
ing, by separating them in a test or validation partition, the algorithm will
not take them into account and will not assign any cluster number to these
objects.
(c) TwoStep does not produce a “formula” for how to assign objects to a
cluster. Sure, we can assign a completely new object to one of the clusters,
but this process is only based on our “characterisation” of the clusters, using
our knowledge of the data’s background, where the data came from.
3. To measure the goodness of a classification, we determine the average distance
from each object to the points of the cluster it is assigned to and the average
distance to the other clusters. The silhouette plot shows this measure. If we
assume that the silhouette is a measure of the goodness of the clustering, we can
compare different models using their silhouette value. For instance, we can
modify the number of clusters to determine in a TwoStep model, and then use
the model with the highest silhouette value. This method is used by the Auto-
clustering node, as we will see later.
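The model-comparison idea just described can be sketched outside the Modeler, e.g., with scikit-learn; the dataset and the range of cluster numbers below are made up for illustration:

```python
# Fit K-Means models with different k and keep the one with the highest
# silhouette value, mimicking what the Auto-clustering node does.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Toy data: three well-separated groups of points
X = np.vstack([rng.normal(loc, 0.3, size=(50, 2)) for loc in (0.0, 4.0, 8.0)])

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)  # model with the highest silhouette wins
print(best_k)  # with three clear groups, k = 3 should win
```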
viewer tells us that the segments with 50 % of the users have exactly the same
size.
If we should need to generate a unique user ID for each record, we could do
that by adding a Derive node with the @INDEX function.
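The effect of such a Derive node can be illustrated in pandas; the column names and values below are hypothetical, and we assume @INDEX numbers the records starting at 1:

```python
# A running record number used as a unique user ID, like @INDEX in a Derive node
import pandas as pd

df = pd.DataFrame({"tech_satisfaction": [5, 6, 3], "org_satisfaction": [4, 6, 2]})
df["USER_ID"] = range(1, len(df) + 1)  # record index starting at 1
print(df["USER_ID"].tolist())  # -> [1, 2, 3]
```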
4. The solution for this part of the exercise can be found in the stream “cluster_i-
t_user_satisfaction_transformed”. The stream will not be explained in detail
here. The reader is referred to Sect. 3.2.5 “Transform node and SuperNode”.
As explained in Sect. 3.2.5, the Transform node should be used to assess the
normality of data and to generate Derive nodes to transform the values. We add a
Transform node. Assessing the data, we find that using a Log transformation
could help to move the distributions towards normality. We generate a
SuperNode for these transformations. A Shapiro–Wilk test would show that neither
the original nor the transformed variables are normally distributed. The
transformation nevertheless helps to improve the quality, as we will
see when assessing the clustering results.
We add a TwoStep node and use the two transformed variables to cluster the
users according to their level of satisfaction. Plotting the original (and not the
Fig. 7.35 Clustered users by their satisfaction indices, based on transformed data
transformed) variables against each other in a Plot node, we get the result shown
in Fig. 7.35. The algorithm can separate the groups, but the result is unsatisfying.
The two users with technical satisfaction ¼ 5 at the bottom belong more to the
cluster in the middle than to the cluster on the left.
If we restrict the number of clusters produced by TwoStep to exactly two, we
get the solution depicted in Fig. 7.36. Also here, 50 % of the users are assigned to
each of the clusters.
This solution shows that for the given dataset, the automatic identification of the
number of clusters produced more segments than appropriate. Generally
though, the algorithm works fine on these skewed distributions.
Remarks
The TwoStep algorithm assumes that non-continuous variables are multinomially
distributed. We do not verify this assumption here.
For the dependency of the silhouette measure, and the number of clusters
determined by TwoStep or K-Means, see also the solution to exercise 5 in
Sect. 7.4.4. The solution can be found in the Microsoft Excel file “kmeans_clus-
ter_nutrition_habits.xlsx”.
Please note that in exercise 2 in Sect. 7.5.3, and also in the solution in Sect. 7.5.4,
an alternative Kohonen and K-Means model will be presented using the Auto
Cluster node.
2. Before we start to cluster, we have to understand the settings in the stream. For
this, we open the Filter node and the Type node, to show the scale type (see
Figs. 7.37 and 7.38). Here, it can be helpful to enable the ID in the Filter node to
identify the consumers related to their assigned cluster number.
We add a TwoStep node to the stream. To be sure the correct variables are
used for the cluster analysis, we can add them in the Fields tab of the TwoStep
node (see Fig. 7.39). Additionally, we must make sure that the ordinal variables
are standardized in the Model tab, as shown in Fig. 7.40. Running the TwoStep
node, we get the final stream as shown in Fig. 7.41.
Visualization Data”, also highlighted with an arrow in Fig. 7.42. Here, the
silhouette value is 0.7201.
Using the option “clusters” from the drop-down list in the left-hand corner of
the Model Viewer in Fig. 7.42, we get the frequency distribution per variable and
cluster, as shown in Fig. 7.43. Table 7.27 shows a short assessment of the
different clusters. Cluster 4 is particularly hard to characterize. The TwoStep
algorithm should probably be used to determine only four clusters.
3. In this case, the proximity measure is a similarity measure. Ordinal variables are
recoded internally into many dual variables. For details, see exercise 2 in
Sect. 7.2.2. To determine the similarity between the dual variables, the Tanimoto
coefficient can be used. For details see exercise 3 in Sect. 7.2.2, as well as the
explanation in Sect. 7.1.
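A minimal sketch of the Tanimoto coefficient for binary vectors (toy values, not taken from the dataset): s = a / (a + b + c), where a counts positions in which both vectors are 1, and b and c count positions in which only one of them is 1.

```python
def tanimoto(x, y):
    """Tanimoto similarity for two binary vectors: a / (a + b + c)."""
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)  # both are 1
    b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)  # only x is 1
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)  # only y is 1
    return a / (a + b + c)

# Two consumers after binary recoding of their ordinal answers (toy values):
print(tanimoto([1, 0, 1, 1], [1, 1, 0, 1]))  # a=2, b=1, c=1 -> 0.5
```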
7.4.1 Theory
xi,new = (xi − xmin) / (xmax − xmin)
Nominal or ordinal variables (also called symbolic fields) are recoded, using
binary coding as outlined in Sect. 7.2.1, Exercise 2 and especially in Table 7.9.
Additionally, the SPSS Modeler uses a scaling factor to avoid having these
variables overweighted in the following steps. For details see also IBM
(2015a), pp. 227–228. Normally, the factor equals the square root of 0.5, i.e.,
approximately 0.70711, but the user can define his/her own value in the Expert tab of the
K-Means node.
3. The k cluster centers are defined as follows (see IBM (2015a), p. 229):
(a) The values of the first record in the dataset are used as the initial cluster
center.
(b) Distances are calculated from all records to the cluster centers so far
defined.
(c) The values from the record with the largest distance to all cluster centers are
used as a new cluster center.
7.4 K-Means Partitioning Clustering 641
(d) The process stops when the number of clusters equals the number predefined by
the user, i.e., when k cluster centers are defined.
4. The squared Euclidean distance (see Tables 7.5 and 7.10) between each record
or object and each cluster center is calculated. The object is assigned to the
cluster center with the minimal distance.
5. The cluster centers are updated, using the “average” of the objects assigned to
this cluster.
6. This process stops when either a maximum number of iterations has taken place or
there is no change in the recalculated cluster centers. Instead of “no change in the
cluster centers”, the user can define another threshold for the change that will
stop the iterations.
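The steps above can be sketched in Python with NumPy. This is a simplified illustration under the stated steps, not the Modeler's exact implementation; the data is a toy example with two continuous fields, min-max normalized as in step 2:

```python
import numpy as np

def min_max(x):
    """Step 2: rescale a continuous field to [0, 1]: (x - min) / (max - min)."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

def kmeans_sketch(X, k, max_iter=20):
    """Steps 3-6: farthest-point seeding (the first record is the first
    centre), then assign/update until the centres stop moving."""
    centers = [X[0]]                        # (a) first record is the first centre
    while len(centers) < k:
        # (b) distance of every record to its nearest centre so far
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])     # (c) farthest record becomes a centre
    centers = np.array(centers, dtype=float)
    for _ in range(max_iter):
        # step 4: assign each record to the closest centre (squared Euclidean)
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # step 5: each centre becomes the mean of its assigned records
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):  # step 6: stop when nothing changes
            break
        centers = new_centers
    return labels

raw = np.array([[20.0, 1.0], [22.0, 1.2], [60.0, 9.0], [62.0, 9.5]])
X = np.column_stack([min_max(raw[:, j]) for j in range(raw.shape[1])])
labels = kmeans_sketch(X, 2)
print(labels)  # the first two and the last two records form separate clusters
```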
" The Auto-Clustering node also uses K-Means clustering. Here, the
number of clusters does not have to be determined in advance.
Based on several goodness criteria, as defined by the user, “the
best” model will be determined.
Advantages
Disadvantages
(a) “ADDRESS” and “CUSTOMERID” are not helpful and can be excluded.
(b) “DEFAULTED” should be excluded too, because it is the result, and the
clustering algorithm is an unsupervised method, which should identify the
pattern by itself and should not “learn” or “remember” given facts.
(c) “AGE”, “CARDDEBT”, “EDUCATION”, “INCOME”,
“OTHERDEBT”, and “YEARSEMPLOYED” are relevant variables for
the segmentation.
5. The “EDUCATION” variable is defined as nominal because it is a string, but in
fact it is ordinal. To define an order, we use a Reclassify node as described in
Sect. 3.2.6. After adding this node from the “Field Ops” tab of the Modeler to
the stream (see Fig. 7.47), we can define its parameters as shown in Fig. 7.48.
We define here a new variable “EDUCATIONReclassified”. Later in this
stream, we have to add another Type node, for assigning the correct scale type
to this new variable. For now, we want to check all variables to see if any of
them can be used in clustering.
6. Interpreting the potentially useful variables, we can see that some of them are
correlated. A customer with a high income can afford higher credit card
and other debt. Using these correlated original variables doesn't add much information to the model.
Let us first check the correlation coefficients of the variables using a
Sim Fit node, however. We add the node to the stream and execute it. We
explained how to use this node in Sect. 4.4. Figure 7.49 shows the current
stream. Figure 7.50 shows the correlations.
We can see that the variables “CARDDEBT” and “OTHERDEBT”, as well
as “INCOME” and “CARDDEBT”, and “INCOME” and “OTHERDEBT” are
correlated.
7. So it is definitely necessary to calculate a new variable in the form of the ratio of
credit card debt and other debt to income. To do so, we add a Derive node from
the “Field Ops” tab of the Modeler. Using its expression builder, as outlined in
Sect. 2.7.2, we define the parameters as shown in Fig. 7.51. The name of the new
variable is “DEBTINCOMERATIO” and the formula is “(CARDDEBT +
OTHERDEBT)/INCOME * 100”. So “DEBTINCOMERATIO” equals the
customer's total debt as a percentage of income.
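The Derive node's formula can be reproduced in pandas; the column names follow the dataset, while the values below are made up for illustration:

```python
# Derived ratio: total debt as a percentage of income
import pandas as pd

df = pd.DataFrame({
    "CARDDEBT": [1.2, 0.5],
    "OTHERDEBT": [3.0, 1.5],
    "INCOME": [28.0, 40.0],
})
df["DEBTINCOMERATIO"] = (df["CARDDEBT"] + df["OTHERDEBT"]) / df["INCOME"] * 100
print(df["DEBTINCOMERATIO"].round(2).tolist())  # -> [15.0, 5.0]
```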
8. To assign the correct scale types to the reclassified educational characteristics
in “EDUCATIONReclassified”, and to the new derived variable
“DEBTINCOMERATIO”, we must add another Type node at the end of the
stream. We define “DEBTINCOMERATIO” as continuous and
“EDUCATIONReclassified” as ordinal (see Figs. 7.52 and 7.53).
Fig. 7.49 Reclassify and Sim Fit nodes are added to the stream
9. Now we have finished the preliminary work and are ready to cluster
the customers. We add a K-Means node to the stream from the “Modeling” tab
and open its parameter dialog window (Fig. 7.54).
Fig. 7.51 Parameters of the Derive node to calculate the debt-income ratio
Fig. 7.52 Stream with another added Type node at the end
quality is just “fair”. On the right, the Modeler shows that there is a cluster
3 with 5.8 % of the records.
To assess the model details, we choose the option “Clusters” in the left-hand
corner of Fig. 7.57. In the left part of the window in the Model viewer, we can
find the details, as shown in Fig. 7.58. Obviously, the difference between the
clusters is not that remarkable. So the first conclusion of our analysis is that we
have to reduce the number of clusters.
We also want to assess the quality of the predictors used to build the model,
however. The SPSS Modeler offers this option in the drop-down list on the
right-hand side of the window. It is marked with an arrow in Fig. 7.57.
Figure 7.59 shows us that the clustering is dominated by the education of the
customer. First of all, this is not very astonishing, so the practical conclusion is
not that useful. Furthermore, in terms of clustering quality, we want more than
one variable to have a significant influence in the process. The second
conclusion of our analysis is that we should try to exclude the variable
“EDUCATIONReclassified” from the K-Means clustering.
We can close the Model viewer with “OK”.
As shown in Fig. 7.70, the relative frequency of customers for whom we have no
information regarding their default is approximately the same in each cluster. So
this information gap does not affect our judgement very much.
More surprisingly, we can see that the default rate in cluster 1 and cluster 4 is
relatively low; because of this, customers in cluster 1 are a good target group for the
bank. First of all, private banking may generate profit, and the number of loans given to
these customers can be increased.
We described cluster 2 as the class of middle-aged customers with an average
debt-income ratio of above 20 %. These may be the customers who bought a house
or a flat and have large debts besides their credit card debt. The loss in case of a
default is high, but on the other hand the bank can generate high profit.
Because of the very high default rate of above 43 %, however, we recommend
separating these customers and paying close attention to them.
Fig. 7.68 Variables are assigned to columns and rows in the Matrix node
Fig. 7.69 Percentage per column should be calculated in the Matrix node
The characteristics of cluster 3, with its young customers, its lower debt-income
ratio, and the default rate of 32 %, are different from cluster 2, where in some
specific cases we found it better to discontinue the business relationship. These
customers may default very often, but the loss to the bank is relatively low.
Probably, payments for loans with lower rates are delayed. The bank would do
very well to support this group because of the potential to generate future profit.
Summary
We separated groups of objects in a dataset using the K-Means algorithm. To do
this, we assessed the variables related to their adequacy for clustering purposes.
This process is more or less a decision of the researcher, based on knowledge of the
practical background. New measures must be calculated, however, to condense
information from different variables into one indicator.
The disadvantage of the algorithm is that the number of clusters must be defined
in advance. Using statistical measures, e.g., the silhouette, as well as practical
knowledge or expertise, we found an appropriate solution. In the end, the clusters
must be described by assessing their characteristics in the model viewer of the
K-Means nugget node. Knowledge of the practical background is critically impor-
tant for finding an appropriate description for each cluster.
7.4.3 Exercises
3. Now update the clustering, using the new variable instead of the variables
“AGE” and “YEARSEMPLOYED” separately. Explain your findings using
the new variable.
1. The data can be found in the SPSS Statistics file “nutrition_habites.sav”. The
Stream “Template-Stream_nutrition_habits” uses this dataset.
Please open the template stream. Save the stream under another name.
Using the dietary types, “vegetarian”, “low meat”, “fast food”, “filling”, and
“hearty”, the consumers were asked “Please indicate which of the following
dietary characteristics describes your preferences. How often do you eat . . .”.
The respondents had the chance to rate their preferences on a scale “(very)
often”, “sometimes”, and “never”. The variables are coded as follows:
“1 = never”, “2 = sometimes”, and “3 = (very) often”. See also Sect. 10.1.24.
7.4.4 Solutions
1. As described in the exercise, we add a Means node to the existing stream and
connect it with the Model nugget node. Figure 7.71 shows the extended stream,
and Fig. 7.72 shows the parameters of the Means node. Running the node, we get
a result as shown in Fig. 7.73.
2. The results in Fig. 7.73 confirm our suspicion that customers in cluster 2 have
high other debts. The Means node performs a t-test in the cases of two different
clusters and a one-way ANOVA if there are more than two clusters or groups.
See IBM (2015c), pp. 298–300. The test tries to determine if there are differences
between the means of several groups. In our case, we find either a nearly 100 % or
a nearly 0 % chance that the means are the same. The significance levels can
be defined in the Options tab of the Means node. More detailed statistics can also
be found in Microsoft Excel file “customer_bank_data_ANOVA.xlsx”.
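What the Means node computes can be sketched with SciPy: a t-test for exactly two clusters, a one-way ANOVA for more. The cluster assignments and debt values below are toy data:

```python
# One-way ANOVA across more than two clusters, as the Means node performs it
from scipy import stats

other_debt_by_cluster = {
    1: [0.8, 1.1, 0.9, 1.3],   # low other debt
    2: [4.9, 5.4, 5.1, 5.6],   # high other debt
    3: [2.0, 2.3, 1.8, 2.2],
}

f_stat, p_value = stats.f_oneway(*other_debt_by_cluster.values())
print(p_value < 0.05)  # a small p-value: the cluster means differ
```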
1. Using the right Table node in the final stream, depicted in Fig. 7.66, we get the
variables included in the original dataset and the cluster numbers, as shown in
Fig. 7.74. The variables of interest here are “AGE” and “YEARSEMPLOYED”.
2. To add a Derive node to the stream, we remove the connection between the
Derive node for the “DEBTINCOMERATIO” and the second Type node. The
name of the new variable is “EMPLOY_RATIO” and the formula
“YEARSEMPLOYED/AGE”. This is also shown in Fig. 7.75.
We connect the new Type node with the rest of the stream. Figure 7.76 shows
the final stream and the new Derive node in its middle.
3. Now we update the clustering node by removing the variables “AGE” and
“YEARSEMPLOYED”. Subsequently, we add the new variable
“EMPLOY_RATIO” (see Fig. 7.77).
Running the stream, we can see in the model viewer that the quality of the
clustering could be improved, based on assessment of the silhouette plot in
Fig. 7.78. In comparison to the previous results presented in Fig. 7.62, cluster 1 is
larger with 37.4 % (previously 31.5 %), whereas all other clusters are smaller.
Using ratios or calculated measures can improve the quality of clustering, but the
result may be harder to interpret because of the increased complexity in the new
calculated variable. Additionally, the segmentation can be totally different, as we
can see here: although the percentage of records per cluster in Fig. 7.78 lets us
assume that there are probably some records now assigned to other clusters,
detailed analysis shows another picture.
Summarizing the results in Figs. 7.79 and 7.80, we can find two interesting
groups, in terms of risk management. The younger customers in cluster 2 have a
remarkable debt-income ratio and a default ratio of 28 %. Every second customer
assigned to cluster 4 defaulted in the past. The customers here are older and have a
lower debt-income ratio of 17 %. Clusters 1 and 3 consist of customers that are good
targets for new bank promotions.
4. Now we are able to use the transformed variable for clustering purposes in a
TwoStep node, so we connect the SuperNode with the original Type node. To
make sure the scale definitions for the transformed variables are correct, we
must add another Type node after the SuperNode. Finally, we can add a
TwoStep node. All these steps are shown in Fig. 7.85.
5. Additionally, we have to change the number of clusters the TwoStep node
should determine. The number of clusters should be four, as shown in Fig. 7.86.
Normally, TwoStep would offer only three clusters here.
6. We run the TwoStep node to get its model nugget.
7. To analyze the results of both streams, we must merge the results of both
sub-streams. We add a Merge node to the stream from the Record Ops tab (see
Fig. 7.87).
8. As outlined in Sect. 2.79, it is important to disable duplicates of variables in the
settings of this node. Otherwise the stream will no longer work correctly.
Figure 7.88 shows that we decided to remove variables coming from the
K-Means sub-stream section.
Fig. 7.87 Stream with added Merge node for comparing K-Means and TwoStep results
9. Theoretically, we could now start to add the typical nodes for analyzing the
data, but as we have a lot of different variables, reordering them is helpful.
Therefore, we add a Field Reorder node from the “Field Ops” tab (see Fig. 7.89).
Figure 7.90 shows the parameters of this node.
10. Finally, we add a Table node, a Data Audit node, and two Matrix nodes. The
parameters of the Matrix nodes are the same as shown in Figs. 7.68 and 7.69;
the second Matrix node uses the TwoStep results saved in the variable “$T-TwoStep”.
11. The Table node in Fig. 7.91 shows us that probably only the cluster names have
been rearranged.
Using the Data Audit node, however, we can find that the frequency distributions
that are dependent on cluster numbers are slightly different for both methods.
Comparing the distributions in Fig. 7.92 for K-Means and Fig. 7.93 for TwoStep,
we can see differences in the size of the clusters.
Figure 7.94 shows once more the default rates for K-Means in a Matrix node,
also previously analyzed in Fig. 7.70. As we explained in Sect. 7.4.2, analyzing the
percentage per column is useful for checking whether or not the default rates are
independent of the cluster numbers. In Figs. 7.94 and 7.95, we can see that this is true for
both methods.
Fig. 7.88 Settings of the Merge node for removing duplicated variables
Inspecting the details of the distribution for each variable, Fig. 7.96 shows us, in
comparison with Fig. 7.64, that the importance of the variable
“YEARSEMPLOYED” has increased. Previously, this variable was ranked as
number 2.
Summarizing all our findings, which are also shown in Fig. 7.97 for the TwoStep
algorithm, we can say that the usage of different methods leads to significantly
different clustering results. To come closer to TwoStep's assumption of normally
distributed variables, we transformed the input variables, but even then they do not
exactly meet this assumption.
Fig. 7.95 Default analysis per cluster, based on TwoStep, with log-likelihood distance measure
As outlined in Sect. 7.3.2, we can use the Euclidean distance measure instead.
Activating this distance measure in the TwoStep node, however, dramatically decreases the
quality of the clustering result: we get a silhouette measure of 0.2 and one
very small cluster that represents only 3.9 % of the customers. These results are
unsatisfying. We therefore conclude that the log-likelihood distance
measure should be used, even though the assumption of normally distributed
variables cannot be completely met here.
Fig. 7.97 Detailed analysis of the TwoStep result in the Model Viewer
To decide which model is best, we would need to assess the clusters by inspecting
the assigned records in detail. This is beyond the scope of this book. The higher
importance of “YEARSEMPLOYED” seems to address the risk aspect of the model
better than the higher ranked “AGE” using K-Means, but the smaller differences in
“DEBTINCOMERATIO” per cluster for TwoStep separate the subgroups less
optimally.
The factor scores, or more accurately the principal component scores, shown
in Fig. 7.98, express the input variables in terms of the determined factors, by
reducing the amount of information represented. The loss of information
depends on the number of factors extracted and used in the formula.
The reduced information and the reduced number of variables (factors) can
help more complex algorithms, such as cluster analysis, to converge or to
converge faster.
Cluster analysis represents a class of multivariate statistical methods. The aim
is to identify subgroups/clusters of objects in the data. Each object is
assigned to a cluster based on similarity or dissimilarity/distance measures.
So the difference between, let’s say PCA and K-Means, is that PCA helps us
to reduce the number of variables, and K-Means reduces the number of objects
we have to look at, using defined consumer segments and their nutrition habits.
3. Figure 7.99 shows the extended stream. First we added a Type node on the right.
This is to ensure the correct scale type is assigned to the factor scores defined by
the PCA/Factor node.
Figures 7.100 and 7.101 show the parameters of the K-Means node. First we
defined the factor scores to be used for clustering purposes, and then we defined
the number of clusters as three. This is explained in Sect. 6.3.2 and shown in
Fig. 6.43.
Figure 7.102 shows the results of the K-Means clustering. Judging from the
silhouette plot, the model is of good quality. The determined clusters are shown
in a 2D-plot using a Plot node. Figure 7.103 shows the parameters of the node.
Here, we used the size of the bubbles to visualize the cluster number, due to
printing restrictions with this book. There is of course the option to use
different colors instead.
As expected and previously defined in Fig. 6.43, however, K-Means finds
three clusters of respondents (see Fig. 7.104). The cluster description can be
found in Table 6.1.
4. Clustering data is complex. The TwoStep algorithm is based on a tree, to manage
the complexity of the huge number of proximity measures and to make
comparisons of the different objects. K-Means is based on a more pragmatic
approach that determines the cluster centers before starting to cluster.
In cluster analysis, however, using either approach can often lead to performance
problems. Here, PCA can help to reduce the number of variables
or, more to the point, to determine the variables that are most important. Figure 7.105
shows the process for combining PCA and clustering algorithms. Using original
data to identify clusters will lead to precise results, but PCA can also help with
using these algorithms, in the case of large datasets. Details can be found in Ding
and He (2004).
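The PCA-then-cluster process of Fig. 7.105 can be sketched with scikit-learn; the data below is synthetic, with the second half of the variables nearly duplicating the first half:

```python
# Reduce partly redundant variables to two component scores, then cluster them
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
base = rng.normal(size=(200, 4))
# eight variables, the last four nearly duplicating the first four
X = np.hstack([base, base + rng.normal(scale=0.1, size=(200, 4))])

scores = PCA(n_components=2).fit_transform(X)   # component scores
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(scores)
print(len(set(labels)))  # three segments, as requested
```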
Fig. 7.102 Results of the K-Means algorithm shown in the model viewer
Fig. 7.103 Parameters of the Plot node to visualize the results of K-Means
Remark
In this exercise, we specify the number of clusters to determine manually and fit a
model for each new cluster number. In the solution to exercise 1 in Sect. 7.5.4, we
demonstrate how to use the Auto Cluster node for the same procedure.
copied data into a text processing application, e.g., Microsoft Word. Here, the
silhouette value is 0.4395.
There is no need to assess the clusters here in detail. We are actually only
interested in the silhouette values depending on the number of clusters. So we
repeat the procedure for all other cluster numbers from 3 to 13. In the solution to
exercise 1 in Sect. 7.5.4, we demonstrate how to use the Auto Cluster node for
the same procedure.
Table 7.29 shows the dependency of silhouette measure vs. the number of
clusters determined. Figure 7.111 shows the 2D-plot of the data.
3. In the solution to exercise 3 in Sect. 7.3.5, we found that five clusters are difficult
to characterize. The graph for K-Means tells us that there is an approximately
linear dependency between the number of clusters and the silhouette value.
Using a simple regression function, we find that with each additional cluster,
the quality of the clustering improves by 0.0494, in terms of the silhouette
measure.

Table 7.29 Dependency of silhouette measure on the number of clusters determined by
K-Means

Number of clusters  Silhouette measure of cohesion and separation
2                   0.4395
3                   0.5773
4                   0.5337
5                   0.7203
6                   0.7278
7                   0.8096
8                   0.8494
9                   0.8614
10                  0.9116
11                  0.9658
12                  0.9708
13                  1.0000
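Using the values from Table 7.29, this regression can be reproduced with NumPy:

```python
# Least-squares fit of silhouette value vs. number of clusters (Table 7.29)
import numpy as np

k = np.arange(2, 14)
silhouette = np.array([0.4395, 0.5773, 0.5337, 0.7203, 0.7278, 0.8096,
                       0.8494, 0.8614, 0.9116, 0.9658, 0.9708, 1.0000])

slope, intercept = np.polyfit(k, silhouette, 1)
print(round(slope, 4))  # -> 0.0494, the per-cluster improvement quoted above
```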
7.5 Auto Clustering 685
Reducing the number of clusters from five to four, however, results in a very
low silhouette measure of 0.5337. So when using K-Means, the better option is
to determine three clusters.
4. Figure 7.112 shows results from using the TwoStep algorithm for clustering the
data. The average difference in the silhouette value is 0.0525 when increasing
the number of clusters by one. The linear character of the curve is clear. Here, we can also choose the number of clusters based on the background of the application of the clustering algorithm. As suggested in exercise 3 of Sect. 7.3.5, we can also use three or four clusters, if we think this is more appropriate.
General motivation
In the previous sections, we intensively discussed using the TwoStep and K-Means
algorithms to cluster data. An advantage of the TwoStep implementation in the SPSS Modeler is its ability to identify the optimal number of clusters to use. The user will
get an idea of which number will probably best fit the data. Although K-Means is
widely used, it does not provide this convenient option.
Based on practical experience, we believe the decision on how many clusters to
determine should be made first. Additionally, statistical measures such as the
silhouette value give the user the chance to assess the goodness of fit of the
model. We discussed the dependency of the number of clusters and the clustering
quality in terms of silhouette value in more detail in exercise 5 of Sect. 7.4.3.
Determining different models with different cluster numbers, and assessing the
distribution of the variables within the clusters, or profiling the clusters, leads
eventually to an appropriate solution.
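For readers who want to see what the silhouette measure actually computes, here is a minimal pure-Python version of the standard formula s(i) = (b(i) − a(i)) / max(a(i), b(i)); the six points and their cluster labels below are invented for illustration, and each cluster is assumed to have at least two members.

```python
import math

def mean_silhouette(points, labels):
    """Mean silhouette over all points: s(i) = (b(i) - a(i)) / max(a(i), b(i))."""
    scores = []
    for i, p in enumerate(points):
        # a(i): mean distance to the other members of the same cluster (cohesion)
        own = [q for j, q in enumerate(points) if labels[j] == labels[i] and j != i]
        a = sum(math.dist(p, q) for q in own) / len(own)
        # b(i): smallest mean distance to any other cluster (separation)
        b = min(sum(math.dist(p, q) for j, q in enumerate(points) if labels[j] == c)
                / labels.count(c)
                for c in set(labels) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# two well-separated toy clusters -> mean silhouette close to 1
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
quality = mean_silhouette(pts, [0, 0, 0, 1, 1, 1])
```

Values near 1 indicate compact, well-separated clusters, which is exactly how the tables and figures in this chapter should be read.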
In practice, realizing this process takes a lot of experience and time. Here, the
idea of supporting the user by offering an Auto Cluster node seems to be a good one,
especially if different clustering algorithms will be tested. We will show how to
apply this node here and then summarize our findings.
Implementation details
The Auto Cluster node summarizes the functionalities of the TwoStep node, the K-Means node, and a node called Kohonen. The functionalities of the TwoStep
as a hierarchical agglomerative algorithm are intensively discussed in Sect. 7.3. The
K-Means algorithm and its implementation are also explained in Sect. 7.4.
The Auto Cluster node also uses the partitioning K-Means clustering algorithm. Here too, the user must define the number of clusters in advance. Models can be selected based on several goodness-of-fit criteria.
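To make the role of the predefined cluster number concrete, the following sketch implements the basic K-Means iteration (Lloyd's algorithm) in plain Python. It illustrates the principle only — it is not the Modeler's implementation, and the data points and the naive initialization are invented for the example.

```python
# Minimal K-Means sketch (Lloyd's algorithm): k must be fixed in advance
def kmeans(points, k, iters=20):
    centroids = [list(p) for p in points[:k]]   # naive init: first k points
    for _ in range(iters):
        # assignment step: each point goes to its nearest centroid
        labels = [min(range(k),
                      key=lambda c: sum((p[d] - centroids[c][d]) ** 2
                                        for d in range(len(p))))
                  for p in points]
        # update step: move each centroid to the mean of its assigned points
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = [sum(p[d] for p in members) / len(members)
                                for d in range(len(members[0]))]
    return labels, centroids

data = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
labels, centroids = kmeans(data, k=2)
```

Note that nothing in the procedure suggests a value for k; that choice has to come from the user, which is exactly the inconvenience the Auto Cluster node addresses.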
Kohonen is the only algorithm that we have not discussed so far in detail.
Table 7.6 outlines some details. In this special type of neural network, an unsuper-
vised learning procedure is performed. So here no target variable is necessary.
Input variables defined by the user build an input vector. This input vector is then
presented to the input layer of a neural network. This layer is connected to a second
output layer. The parameters in this output layer are then adjusted using a learning
procedure, so that they learn the different patterns included in the input data.
Neurons in the output layer that are unnecessary are removed from the network.
After this learning procedure, new input vectors are presented to the model. The
output layer tries to determine a winning neuron that represents the most similar
pattern previously learned.
The procedure is also depicted in Fig. 7.113. The output layer is a
two-dimensional map. Here, the winning neuron is represented by its coordinates
Fig. 7.113 Visualization of Kohonen’s SOM algorithm used for clustering purposes
X and Y. The different combinations of the coordinates of the winning neuron are
the categories or the clusters recognized by the algorithm in the input data.
Interested readers are referred to Kohonen (2001) for more details.
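The learning procedure just described can be sketched in a few lines of Python. This is a toy version only: the grid size, the learning-rate and neighbourhood schedules, and the data are simplified assumptions for illustration, not the defaults of the Modeler's Kohonen node.

```python
import math, random

def train_som(data, width=2, height=2, epochs=60):
    """Toy Kohonen SOM: neurons arranged on a width x height output map."""
    random.seed(0)
    dim = len(data[0])
    weights = {(x, y): [random.uniform(0, 1) for _ in range(dim)]
               for x in range(width) for y in range(height)}
    for t in range(epochs):
        lr = 0.5 * (1 - t / epochs)                 # decaying learning rate
        radius = max(1.0 * (1 - t / epochs), 0.01)  # shrinking neighbourhood
        for v in data:
            bmu = winner(weights, v)                # best-matching (winning) neuron
            for n, w in weights.items():
                grid_d = math.dist(n, bmu)          # distance on the output map
                if grid_d <= radius * max(width, height):
                    h = math.exp(-grid_d ** 2 / (2 * radius ** 2))
                    for d in range(dim):            # pull neuron towards the input
                        w[d] += lr * h * (v[d] - w[d])
    return weights

def winner(weights, v):
    """X/Y coordinates of the neuron whose weight vector is closest to v."""
    return min(weights, key=lambda n: math.dist(weights[n], v))

data = [(0.1, 0.1), (0.0, 0.2), (0.9, 0.9), (1.0, 0.8)]
som = train_som(data)
cluster_of = winner(som, (0.05, 0.1))  # map coordinates serve as the cluster label
```

After training, `winner()` returns the coordinates of the winning neuron, and these coordinate pairs play the role of the cluster names, as described above.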
" The network tries to learn patterns included in the input data. After-
wards, new vectors can be presented to the algorithm, and the
network determines a winning neuron that represents the most
similar pattern learned. The different combinations of the coordinates
of the winning neuron serve as the number or the name of the cluster.
The number of neurons can be determined by restricting the width
and the length of the output layer.
" The Auto Cluster node offers the application of TwoStep, K-Means,
and Kohonen node functionalities for use with data. TwoStep and the
Kohonen implementation in the SPSS Modeler determine the “opti-
mal” number of clusters automatically. With K-Means, the user must
choose the number of clusters in advance.
" Using the Auto Cluster node allows the user steer three algorithms at
the same time. Several options allow the user to define selection
criterions for the models tested and presented.
We would like to show the Auto Cluster node in use, by clustering a dataset
representing diabetes data from a Pima Indian population near Phoenix, Arizona.
A detailed description of the variables included in the dataset can be found in
Sect. 10.1.8.
Using several variables from this dataset, we should be able to make predictions
about the individual risk of suffering from diabetes. Here, we would like to cluster
the population, so we look for typical characteristics. It is not the aim of this section
to go into the medical details. Rather, we would like to identify the best possible
algorithm for clustering the data and identify the advantages and the disadvantages
of the SPSS Modeler’s Auto Cluster node.
As we extensively discussed in Sect. 7.3.2, and in the solution to exercise 3 in
Sect. 7.4.4, the TwoStep algorithm also implemented in the Auto Cluster node
needs normally distributed variables. Using the Euclidian distance measure instead
of the log-likelihood measure seems to be a bad choice too, as shown at the end of
exercise 3 in Sect. 7.4.4. Therefore, we create a stream for clustering
the data based on the findings in Sect. 3.2.5 “Concept of ‘SuperNodes’ and
Transforming Variable to Normality”. Here, we assessed the variables
“glucose_concentration”, “blood_pressure”, “serum_insulin”, “BMI”, and
“diabetes_pedigree” and transformed them into normal distribution.
Remark
The solution can be found in the stream named “cluster_diabetes_auto.str”.
4. After checking the settings of the Type node for the role of the variables, we open the Auto Cluster node by double-clicking on it.
5. In the Fields tab of the Auto Cluster node, we activate the option “Use custom settings” and determine the “class_variable” as the evaluation variable (see Fig. 7.118).
6. By analyzing the meaning of the variables found in Sect. 10.1.8, we can state
that the following variables should be helpful for determining segments of
patients according to their risk of suffering from diabetes:
This result is based on different pre-tests. We can also imagine using the
other variables in the dataset. The variable settings in the Auto Cluster node in
Fig. 7.118, however, should be a good starting point.
7. We don’t have to modify any options in the Model tab of the Auto Cluster node
(see Fig. 7.119). It is important to note, however, that the option “Number of
models to keep” determines the number that will be saved and presented to the
user later in the model nugget.
8. The Expert tab (Fig. 7.120) gives us the chance to disable the usage of
algorithms and determine stopping rules, in the case of huge datasets. We
want to test all three algorithms, TwoStep, K-Means, and Kohonen, as outlined
in the introduction to the section.
We must pay attention to the column “Model parameters” though. For the K-Means algorithm, we need to define the number of clusters to identify.
We click on the “default” text in the second row, which represents the
K-Means algorithm (see Fig. 7.120). Then we use the “Specify” option, and a
new dialog window will open (see Fig. 7.121). Here, we can determine the
number of clusters. As we have patients that tested both positive and negative,
we are trying to determine two clusters in the data. We define the correct
number as shown in Fig. 7.122. After that, we can close all dialog windows.
9. We can start the clustering by clicking on “Run”. We then get a model nugget,
as shown in the bottom right corner of Fig. 7.123.
Fig. 7.121 Parameters of the K-Means algorithm in the Auto Cluster node
10. We double-click on the model nugget. As we can see in Fig. 7.124, three models are offered. The K-Means algorithm determines the model with the best silhouette value of 0.431. The model has two clusters, which equals the number of different values of the “class_variable” that represents the test result (“0 = negative test result” and “1 = positive test result”) for diabetes. This is also true for the model determined by TwoStep.
11. We can now decide which model should be assessed. We discussed the
consequences of the variable transformation towards normality for the cluster-
ing algorithms in exercise 3 of Sect. 7.4.4. To outline here the consequences
once more in detail, we will discuss the TwoStep model. We enable it in the
first column in the dialog window shown in Fig. 7.124.
As we know from discussing TwoStep and the K-Means in Sects. 7.3 and
7.4, we can assess the cluster by double-clicking on the symbol in the second
column, as in Fig. 7.124.
Figures 7.125 and 7.126 show details from the TwoStep model. The model
quality is fair, and the averages of the different variables in the clusters are
Fig. 7.125 Overview of results for TwoStep clustering in the Model Viewer
Fig. 7.126 Model results from TwoStep clustering in the Model Viewer
different. Using the mouse, we can see that this is also true for
“diabetes_pedigree”, with 0.42 in cluster 1 on the left and 0.29 in cluster
2 on the right. The importance of the predictors is also good.
" In the Auto Cluster nugget node, the determined models will be listed
if they meet the conditions defined in the “Discard” dialog of the Auto
Cluster node.
" There are many dependencies between the parameters regarding the
models to keep, to determine, and to discard; for instance, when the
determined number of models to keep is three and the number of
models defined to calculate in K-Means is larger. Another example is
when models with more than three clusters should be discarded, but
in the K-Means node, the user defines models with four or more
clusters to determine. So, if not all expected models are presented
in the model nugget, the user is advised to verify the restrictions
“models to keep” vs. the options in the “discard section”.
" A model can be selected for usage in the following calculations of the
stream, by activating it in the first column of the nugget node.
12. The variable “test_result”, also shown as a predictor on the right in Fig. 7.126, is definitely not a predictor. This is obviously an incorrect representation of the settings in the Modeler and not the result of inappropriate parameters in Fig. 7.118: there, we excluded the variable “test_result”, because in our model it is the evaluation variable. If we close the Model Viewer and click on the first column of Fig. 7.124, we get a bar chart as shown in Fig. 7.127. Obviously, the patients that tested positive are assigned to cluster 2. So the classification is not very good, but satisfactory.
13. Finally, we can verify the result by adding a Matrix node to the stream and
applying a chi-square test of independence to the result. Figure 7.128 shows the
final stream. Figure 7.129 shows the parameters of the Matrix node, using a
cross tabulation of the cluster result vs. the clinical test result. The resulting
chi-square test of independence in Fig. 7.130 confirms our result: the cluster
results are not independent of the clinical test result.
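The statistic behind such a test can be reproduced by hand. The snippet below computes the chi-square statistic of independence for a 2×2 cross table; the counts are invented for illustration and are not the figures from Fig. 7.130.

```python
def chi_square(table):
    """Chi-square statistic of independence for a contingency table (list of rows)."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n  # expected count under independence
            chi2 += (observed - expected) ** 2 / expected
    return chi2

# hypothetical cluster vs. clinical-test cross table (invented counts)
table = [[50, 10],   # cluster 1: negative / positive test result
         [20, 40]]   # cluster 2: negative / positive test result
statistic = chi_square(table)
print(round(statistic, 3))  # 30.857
```

A large statistic, compared against the chi-square distribution with (rows − 1)(columns − 1) degrees of freedom, indicates that cluster membership and test result are not independent.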
14. So far, we have found that the Auto Cluster node offers a good way to apply
different clustering algorithms in one step. K-Means and TwoStep algorithms
seem to produce fair clustering results for the given dataset. By modifying the
Auto Cluster node parameters, we want to hide models, such as the Kohonen
model, which are not appropriate.
We double-click on the Auto Cluster node and open the Discard tab as shown
in Fig. 7.131. As we know, we want to produce a model that distinguishes two
groups of patients. So we set the parameter “Number of clusters is greater than”
at the value 2.
Fig. 7.127 Distribution of the evaluation variable “class_variable”, depending on the cluster
15. We run the Auto Cluster node with the new settings and get the results shown in
Fig. 7.132.
The Auto Cluster node can help to apply the TwoStep, K-Means, and
Kohonen algorithms to the same variables at the same time. Furthermore, it
allows testing of different model parameters. For instance, in Figs. 7.120, 7.121, and especially Fig. 7.122, multiple numbers of clusters for testing can be defined for the K-Means algorithm. The user can determine the parameters
of all these models at the same time. So using the Auto Clustering node can
help to determine many models and select the best.
The Auto Cluster node can also be used to produce a series of models with
different cluster numbers at the same time. Then the user can compare the
models and select the most appropriate one. This functionality will be
demonstrated in exercise 1.
It is important to note here, however, that all the algorithms must use the
same input variables. This is a disadvantage, because the TwoStep algorithm
needs transformed variables to meet the assumption of normally distributed
values. Using the Euclidean distance measure for TwoStep instead produces a
bad model. Interested users can test this. We came to the same conclusion in
exercise 3 of Sect. 7.4.4. In summary, the user has to decide whether he/she
should avoid the assumption of normally distributed values for the TwoStep or
produce a model based on this assumption. The consequence is that the
K-Means algorithm, which uses the same variables in the Auto Cluster node,
often cannot perform well. We will show in exercise 1 that using untransformed
data will lead to better results for K-Means.
" The Auto Cluster node allows access details inside the data, by
determining a set of models, all with different parameters at the
same time. When applying this node to data, the following aspects
should be kept in mind:
Advantages
– The node allows testing of the K-Means, TwoStep, and Kohonen
algorithms at the same time, on the same selected variables.
– The user is presented with potentially appropriate models for the
selected variables.
– Several options can be defined for the algorithms individually.
– Restrictions for the identified models, such as the number of clusters
and the size of clusters, can be determined in the Discard tab of the
Auto Cluster node.
– The Auto Cluster node can be used to produce a series of models
with different cluster numbers at once. Then the user can compare
the models and select the most appropriate.
Disadvantages
– To meet the TwoStep algorithm’s assumption of normally distributed
input variables, the variables must be transformed. As all algorithms
must deal with the same input variables, the K-Means node
does not perform very well. So the user must often deal with the
trade-off between avoiding the normality assumption for TwoStep
or producing probably better results with K-Means.
– Usage of the Auto Cluster node needs experience, because a lot of
parameters, such as “different numbers of clusters to test”, can be
defined separately for each algorithm. For experienced users, it is a
good option for ranking models. Nevertheless, caution is needed to
keep track of all parameters and discard unhelpful options.
7.5.3 Exercises
According to the Auto Cluster results, the K-Means algorithm produces the best result in terms of silhouette value.
In this exercise, you should assess the model. Furthermore, you should examine the
dependency of the silhouette value and the number of clusters to separate.
1. The data can be found in the SPSS Statistics file “nutrition_habites.sav”. The
Stream “Template-Stream_nutrition_habits” uses this dataset. Save the stream
under another name.
2. Use an Auto Cluster node to determine useful algorithms for determining
consumer segments. Assess the quality of the different cluster algorithms.
7.5.4 Solutions
Fig. 7.135 K-Means model with two clusters in the model nugget
The silhouette value increases from 0.373, found in Fig. 7.132, to 0.431. To start the model viewer,
we double-click on the model nugget in the second column.
More detailed K-Means results are shown in Fig. 7.136. Based on the silhouette value of 0.431, the clustering quality can be assessed as good. The importance
of the variables in Fig. 7.137 changes, in comparison to the results found in
Fig. 7.126.
We now can investigate the model quality using the Matrix node. In
Fig. 7.138, we illustrate the results for the frequency distribution of the diabetes
test results per cluster. In comparison with the results of the Auto Cluster node
shown in Fig. 7.130, we can state that here also cluster 1 represents the patients
that tested negative. With its frequency of 212 patients, against 196 in the
Fig. 7.137 Model details from K-Means clustering in the Model Viewer
TwoStep model, the K-Means model is of better quality in terms of test results
correctly assigned to the clusters.
3. To make the stream easier to understand, we can add a comment to the Auto
Cluster node (see Fig. 7.139). Then we copy the Auto Cluster node and paste
it. After that, we connect the new node to the Type node. Figure 7.139 shows the current status of the stream.
We double-click on the pasted Auto Cluster node and activate the “Model”
tab. It is important to define the number of models to keep. Here, we want to
determine nine models. So this number must be set to at least 9 (see Fig. 7.140).
In the original stream, we only determined models with two clusters. Here, we
have to remove this restriction in the Discard tab of the Auto Cluster node.
Otherwise, other models will not be saved (see Fig. 7.141).
To determine the models with different cluster numbers, we activate the
Expert tab (see Fig. 7.142). Then we click on “Specify . . .”. As shown in
Fig. 7.143, we open the dialog window in Fig. 7.144. Here, we define models
with between 2 and 10 clusters to determine.
This option has an advantage over the possibilities offered in the Expert tab of the K-Means node. There, we can only specify one model to determine at a
Fig. 7.143 Defining the number of clusters to determine in the K-Means part of the Auto
Cluster node
time. In that way, the Auto Cluster node is more convenient for comparing
different models with different cluster numbers.
We can close all dialog windows.
We do not need to add a Matrix node, because we only want to compare the
different models based on their silhouette values. Figure 7.145 shows the final
stream, with an additional comment added. We can run the second Auto Cluster
node now.
Assessing the results in Fig. 7.146, we can see that the model quality decreases
if we try to determine models with more than two clusters. This confirms the model
quality determined in part 2 of the exercise. Furthermore, we can say that the
algorithm indeed tries to determine segments that describe patients suffering or
not suffering from diabetes.
Fig. 7.146 Auto Cluster node results for different numbers of clusters
Remark
The TwoStep node is implemented in the Auto Cluster node. The log-likelihood
distance measure needs normally distributed continuous variables and multi-
nomially distributed variables. In Sect. 7.5.2, we found that using the Euclidean
distance measure does not result in better models. Therefore, we use untransformed data in combination with the log-likelihood measure here.
algorithm tends to produce models with too many clusters. Here, with its six
clusters and a silhouette value of 0.794, it’s a good and useful model.
The predictor importance on the right of Fig. 7.151, and characteristics of the
different clusters shown on the left in Fig. 7.151, can also be used for consumer
segmentation purposes. Based on this short assessment, we have found an
alternative model to the TwoStep clustering presented in the solution to exercise
3 of Sect. 7.3.5.
7.6 Summary
Summarizing all our findings, we can state that clustering data often entails more
than one attempt at finding an appropriate model. Not least, the number of clusters
must be determined based on practical knowledge.
" The Modeler offers K-Means, TwoStep, and the Kohonen nodes for
identifying clusters in data. K-Means is widely used and portions the
datasets, by assigning the records to identified cluster centers. The
user has to define the number of cluster in advance.
" This is also necessary using the Kohonen node. This node is an imple-
mentation of a two-layer neural network. The Kohonen algorithm
transforms a multidimensional input vector into a two-dimensional
space. Vectors presented to the network are assigned to a pattern
first recognized in the data during a learning period.
" The Auto Cluster node can be used with the TwoStep, K-Means, and
Kohonen algorithm all steering in the background. This gives the user
a chance to find the best algorithm in reference to the structure of the
given data. This node is not as easy to use as it seems, however. Often
not all appropriate models are identified, so we recommend using the
separate TwoStep, K-Means, and Kohonen model nodes instead.
Literature
Bacher, J., Wenzig, K., & Vogler, M. (2004). SPSS TwoStep Cluster—A first evaluation.
Accessed 07/05/2015, from http://www.statisticalinnovations.com/products/twostep.pdf
Backhaus, K. (2011). Multivariate Analysemethoden: Eine anwendungsorientierte Einführung, Springer-Lehrbuch (13th ed.). Berlin: Springer.
Bühl, A. (2012). SPSS 20: Einführung in die moderne Datenanalyse, Scientific tools (13th ed.).
München: Pearson.
Ding, C., & He, X. (2004). K-means Clustering via Principal Component Analysis. Accessed
18/05/2015, from http://ranger.uta.edu/~chqding/papers/KmeansPCA1.pdf
In Chap. 5, we dealt with regression models and applied them to datasets with
continuous, numeric target variables, the common data for these kinds of models.
We then dedicated Chap. 7 to unsupervised data and described various cluster
algorithms. In this chapter, we turn back to the supervised methods and attend to
the third big group of data mining methods, the classification algorithms.
Here, we are confronted with the problem of assigning a category to each input
variable vector. As the name suggests, the target variable is categorical, e.g.,
“Patient is ill” or “Patient is healthy”. As in this example, the possible values of the target variable in a classification problem often cannot be ordered. These kinds of problems are very common in all kinds of areas and fields, such as Biology, Social
Science, Economics, Medicine, and Computer Science. Almost any problem you can think of involves deciding between different kinds of discrete outcomes.
In the first section, we describe some real-life classification problems in more
detail, where data mining and classification methods are useful. After these
motivating examples, we explain in brief the concept of classification in data
mining, using the basic mathematical theory that all classification algorithms
have in common. Then in the remaining sections, we focus on the most famous
classification methods, which are provided by the SPSS Modeler individually, and
describe their usage on data examples. Figure 8.1 shows the outline of the chapter
structure. A detailed list of the classification algorithms discussed here is displayed in the subsequent Fig. 8.6.
After finishing this chapter, the reader . . .
1. Is familiar with the most common challenges when dealing with a classification problem and knows how to handle them.
2. Possesses a large toolbox of different classification methods and knows their
advantages and disadvantages.
3. Is able to build various classification models with the SPSS Modeler, and is able to apply them to new data for prediction.
4. Knows various validation methods and criteria and can evaluate the quality of
the trained classification models within the SPSS Modeler stream.
Classification methods are needed in a variety of real world applications and fields,
and in many cases they are already utilized. In this chapter, we present some of
these applications as motivating examples.
Example 1
Diagnosis of breast cancer
To diagnose breast cancer, breast mass is extracted from numerous patients by
fine needle aspiration. Each sample is then digitalized into an image from which
different features can be extracted. Among others, these include the size of the cell
core or the number of mutated cells. The feature records of the patients, together
with the target variable (cancer or no cancer), are then used to build a classification
model. In other words, the model learns how the features have to look, for the tissue
sample to be tumorous or not. Now for a new patient, doctors can easily decide
whether or not breast cancer is present, by establishing the above-mentioned
features and putting these into the trained classifier.
This is a classical application of a classification algorithm, and logistic regres-
sion is a standard method used for this problem. See Sect. 8.3 for a description of
logistic regression.
An example dataset is the Wisconsin Breast Cancer Data (see Sect. 10.1.35).
Details on the data, as well as the research study, can be found in Wolberg and
Mangasarian (1990).
Example 2
Credit scoring of bank customers
When applying for a loan at a bank, your creditworthiness is calculated based on
some personal and financial characteristics, such as age, family status, income,
number of credit cards, or amount of existing debt. These variables are used to
estimate a personal credit score, which indicates the risk to the bank when giving
you a loan.
An example of credit scoring data is the “tree_credit” dataset. For details, see
Sect. 10.1.33.
Example 3
Mathematical component of a sleep detector (Prediction of sleepiness in EEG
Signals)
Drowsiness brings with it a lesser ability to concentrate, which can be dangerous
and should be avoided in some everyday situations. For example, when driving a
car, severe drowsiness is the precursor of microsleep, which can be life-threatening.
Moreover, we can think of jobs where high concentration is essential, and a lack of
concentration can lead to catastrophes. For example, the work of plane pilots,
surgeons, or technical observers at nuclear reactors. This is one reason why
scientists are interested in detecting different states in the brain, to understand its
functionality.
For the purpose of sleep detection, EEG signals are recorded in different sleep
states, drowsiness and full consciousness, and these signals are analyzed to identify
patterns that indicate when drowsiness may occur. The EEG_Sleep_Signals.csv
(see Sect. 10.1.10) is a good example, which is analyzed with an SVM in
Sect. 8.5.2.
Example 4
Handwritten digits and letter recognition
When we send a letter by mail, this letter is gathered together with a pile of
other letters. The challenge for the post office is to sort this huge mass of letters by
their destination (country, city, zip-code, street). In former days, this was done by
hand, but nowadays computers do this automatically. These machines scan the
address on each letter and allocate it to the right destination. The main challenge is
that handwriting differs noticeably between different people. Today’s sorting machines use an algorithm that is able to recognize alphabetical characters and
numbers from individual people, even if it has never seen the handwriting before.
This is a fine example of a machine learning model, trained on a small subset and
able to generalize to the entire collective.
Other examples where automatic letter and digit identification can be relevant
are signature verification systems or bank-check processing. The problem of
handwritten character recognition falls into the more general area of pattern recog-
nition. This is a huge research area with many applications, e.g., automatic identifi-
cation of the correct image of a product in an online shop or analysis of satellite
images.
The optical recognition of handwritten digits data, obtained from the UCI Machine Learning Repository (1998), is a good example. See also Sect. 10.1.25.
• Sports betting. Prediction of the outcome of a sports match, based on the results
of the past.
• Determining churn probabilities. For example, a telecommunication company
wants to know if a customer has a high risk of switching to a rival company. In
this case, they could try to avoid this with an individual offer to keep the
customer.
• In the marketing area: to decide if a customer in your database has a high
potential of responding to an e-mail marketing campaign. Only customers
with a high response probability are contacted.
The procedure for building a classification model follows the same concept as
regression modeling (see Chap. 5). As described in the regression chapter, the
original dataset is split into two independent subsets, the training and the test
set. A typical partition is 70 % training data and 30 % test data. The training data is
used to build the classifier, which is then applied to the test set for evaluation. Using
a separate dataset is the most common way to measure the goodness of the
classifier’s fit to the data and its ability to predict. This process is called cross-validation (see Sect. 5.1.2, or James et al. (2013)). We thus recommend always using
this training and test set framework when building a classifier.
Often, some model parameters have to be predefined. These, however, are not always naturally given and have to be chosen by the data miner. To find the optimal
parameter, a third independent dataset is used, the validation set. A typical
partition of the original data is then 60 % training, 20 % validation and 20 % test set.
After training several classifiers with different parameters, the validation set is used
to evaluate these models in order to find the one with best fit (see evaluation
measures in Sect. 8.2.5). Afterwards, the winner of the previous validation is
applied to the test set, to measure its predicting performance on independent data.
This last step is necessary to eliminate biases that might have occurred in the
training and validation steps. For example, it is possible that a particular model
performs very well on the validation set, but is not very good on other datasets. In
Fig. 8.2, the process of cross-validation and building a prediction model is
illustrated.
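The 60/20/20 partition described above can be sketched in a few lines. The split ratios follow the text; the fixed seed and the function name are illustrative choices, not part of any particular tool.

```python
import random

def split(records, train=0.6, valid=0.2, seed=42):
    """Shuffle the records and split them into train / validation / test sets."""
    rows = list(records)
    random.Random(seed).shuffle(rows)     # fixed seed -> reproducible partition
    n_train = int(len(rows) * train)
    n_valid = int(len(rows) * valid)
    return (rows[:n_train],
            rows[n_train:n_train + n_valid],
            rows[n_train + n_valid:])

train_set, valid_set, test_set = split(range(100))   # 60 / 20 / 20 records
```

Shuffling before splitting matters: if the records are ordered, e.g., by target class, an unshuffled split would give the three sets systematically different class distributions.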
The general idea behind a classification model is pretty simple. When training a
classification model, the classification algorithm inspects the training data and tries
to find regularities in data records with the same target value and differences
between data records of different target values. In the simplest case, the algorithm
converts these findings into a set of rules, such that the target classes are
characterized in the best possible way through these “if . . . then . . .” statements.
In Fig. 8.3, this is demonstrated with a simple data example. There, we want to
predict if we can play tennis this afternoon based on some weather data. The classification algorithm transforms the data into a set of rules that define the classifier.
For example, if the outlook is “Sunny” and the temperature is greater than 15 °C, we
will play tennis after work. If otherwise the outlook is “Rain” and the wind
prognosis is “strong”, we are not going to play tennis.
Applying a classifier to new and unseen data now simply becomes the assign-
ment of a class (target value) to each data record using the set of rules. For example,
in Fig. 8.4, the classifier that suggests whether we can play tennis is applied to new
data. For day 10, where the Outlook = “Sunny”, Temperature = 18 °C and Wind = “strong”, the classifier predicts that we can play tennis in the afternoon,
since these variable values fulfill the first of its rules (see Fig. 8.3).
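The two rules quoted above can be written out directly as code. This is only a sketch: the full rule set of Fig. 8.3 is not listed in the text, so cases not covered by the two quoted rules return "Unknown" here.

```python
def play_tennis(outlook, temperature, wind):
    """Two rules from the tennis example, written as plain if/then statements."""
    if outlook == "Sunny" and temperature > 15:
        return "Play"
    if outlook == "Rain" and wind == "strong":
        return "No play"
    return "Unknown"   # the remaining rules of Fig. 8.3 are not listed in the text

# day 10: Sunny outlook, 18 degrees C, strong wind -> first rule fires
prediction = play_tennis("Sunny", 18, "strong")   # "Play"
```

Applying the classifier to a new record is then nothing more than evaluating these statements on its variable values.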
Many classification models pre-process the input data via a numeric function,
which is used to transform and quantify each data record. Each data record is then
assigned a score, which can be a measure of the probabilities of each class, or some
distances. When training the model, this scoring function and its parameters are
determined, and a set of decision rules for the function’s value are generated. The
class of unseen data is now predicted by calculating the score with the scoring
function and assigning the target class suggested by the score. See Fig. 8.5 for an
illustration of a classification model with an internal scoring function.
The models of the first two types follow a mainly mathematical approach and
generate functions for scoring and separation of the data and classes. Whereas
linear methods try to separate the different classes with linear functions, nonlinear
classifiers can construct more complex scoring and separating functions. In contrast
to these mathematical approaches, the rule-based models search the input data for
structures and commonalities without transforming the data. These models generate
“if . . . then . . .” clauses on the raw data itself. Figure 8.6 lists the most important
and favored classification algorithms that are also implemented in the SPSS
Modeler. Chapters where the particular models are discussed are shown in brackets.
Each classification model has its advantages and disadvantages. There is no
method that outperforms every other model for every kind of data and classification
problem. The right choice of classifier strongly depends on the data type. For
example, some classifiers are more applicable to small datasets, while others are
more accurate on large data. Other considerations when choosing a classifier are the
properties of the model, such as accuracy, robustness, speed, and interpretability of
the results. A very accurate model is never bad, but sometimes a robust model,
which can be easily updated and is insensitive to strange and unusual data, is more
important than accuracy. In Table 8.1, some advantages and disadvantages of the
classification algorithms are listed to give the reader guidelines for the selection of
the right method. See also Tuffery (2011).
720 8 Classification Models
In the case of a binary target variable, i.e., a “yes” or “no” decision, four possible
events can occur when assigning a category to the target variable via classification.
– True positive (TP). The true value is “yes” and the classifier predicts “yes”.
A patient has cancer and cancer is diagnosed.
– True negative (TN). The true value is “no” and the classifier predicts “no”.
A patient is healthy and no cancer is diagnosed.
– False positive (FP). The true value is “no” and the classifier predicts “yes”.
A patient is healthy but cancer is diagnosed.
– False negative (FN). The true value is “yes” and the classifier predicts “no”.
A patient has cancer but no cancer is diagnosed.
These four possible classification results are displayed below in Table 8.2.
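Counting the four cases for lists of true and predicted labels can be sketched as follows; the label pairs are made-up illustration data:

```python
# Count TP, TN, FP, FN for a binary target with categories "yes"/"no".
def coincidence_counts(true_vals, predicted):
    tp = sum(1 for t, p in zip(true_vals, predicted) if t == "yes" and p == "yes")
    tn = sum(1 for t, p in zip(true_vals, predicted) if t == "no" and p == "no")
    fp = sum(1 for t, p in zip(true_vals, predicted) if t == "no" and p == "yes")
    fn = sum(1 for t, p in zip(true_vals, predicted) if t == "yes" and p == "no")
    return tp, tn, fp, fn

truth = ["yes", "yes", "no", "no", "yes"]
pred  = ["yes", "no",  "no", "yes", "yes"]
print(coincidence_counts(truth, pred))  # -> (2, 1, 1, 1)
```

Arranged in a 2×2 table, these four counts form the coincidence matrix of Table 8.2.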
A model is said to be a good classifier if it predicts the true value of the outcome
variable with high accuracy, that is, if the proportion of TPs and TNs is high and
the number of FPs and FNs is low. Unfortunately, a perfect classifier with no
misclassifications is almost impossible to find in practice.
Now, one can justifiably interject that nearly every dataset of different labeled
points can be separated perfectly using a function, which is also called a decision
boundary. This is undoubtedly true. However, a complex decision boundary might
perfectly classify the existing data, but if a new and unknown data point has to be
classified, it can easily lie on the wrong side of the decision boundary, and thus be
misclassified, and the classifier is no longer perfect. The problem here is that the
classification model overfitted the data and is inappropriate for independent data,
even though it perfectly classified the training data. Thus, we have to
lower our expectations and tolerate a few misclassifications in favor of a simpler
and more universal decision boundary. See Kuhn and Johnson (2013), for more
information on the decision boundary and overfitting, as well as Sect. 5.1.2.
These considerations are illustrated in Fig. 8.8. There, the decision boundary in
the graph to the left separates the squares from the circles perfectly. In the middle
graph, a new data point, the filled square, is added, but it lies on the circle side of
the boundary. Hence, the decision boundary is no longer optimal; it has overfitted
the original data. So, the linear boundary in the graph on the right is as good as the
boundary in the other graphs, but much simpler.
So, in order to avoid overfitting, we recommend always using the training/testing
model setting and potentially the validation model setting too.
Fig. 8.8 Illustration of the problem of overfitting, when choosing a decision boundary
8.2 General Theory of Classification Models 723
Various measures and methods exist for the evaluation of models. Here, we discuss
the most common ones implemented in the SPSS Modeler.
Classification error
The obvious measure for evaluating a classifier is the rate of misclassified instances,
i.e.,

(number of FP + FN) / (number of data points).

This quotient is called the classification error and should be small for a well-fitted
model.
ROC curve and AUC
The receiver operating characteristic (ROC) curve plots the true positive rate (TPR)
against the false positive rate (FPR) over all possible decision thresholds, where

FPR = FP / (FP + TN)   and   TPR = TP / (TP + FN).
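These formulas translate directly into code; the four counts below are made-up illustration values:

```python
# Classification error, FPR and TPR computed from the four counts,
# a direct transcription of the formulas in the text.
def classification_error(tp, tn, fp, fn):
    return (fp + fn) / (tp + tn + fp + fn)

def fpr(fp, tn):
    return fp / (fp + tn)

def tpr(tp, fn):
    return tp / (tp + fn)

print(classification_error(tp=50, tn=40, fp=6, fn=4))  # 10 / 100 = 0.1
print(fpr(fp=6, tn=40))   # 6 / 46  ≈ 0.130
print(tpr(tp=50, fn=4))   # 50 / 54 ≈ 0.926
```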
More precisely, consider a binary classifier (0 and 1 as target categories) with a
score function that calculates the probability of the data point belonging to class
1. Then each of these scores (probabilities) t is considered as a threshold for the
decision, if a data point is of class 1 or 0; that is, a data point is considered to be
of class 1 if its predicted probability is larger than t. From these predictions, the
FPR(t) and TPR(t) are calculated. Thus, the FPR and TPR are actually vectors with
values between 0 and 1. Of course, this works analogously with every other form of
score.
As an example, consider the following probability predictions (Fig. 8.9):
To calculate the ROC, the FPR and TPR are determined for thresholds 1, 0.8,
0.6, 0.5, 0.2, 0. These are displayed in Fig. 8.10, together with the coincidence
matrix that contains the TP, FP, FN, and TN values in the different cases.
As an example, let us take a look at t ¼ 0.2. There are three data points with a
probability for class 1 larger than 0.2; hence, they are assumed to belong to class
1. Two of them are actually of class 1, while one is misclassified (the one with
probability 0.6) (see Fig. 8.11). The last data point with probability 0.2 is assigned
to class 0 since its predicted probability is not larger than t. From these predicted
classes, compared with the true classes, we can construct the coincidence matrix
and then easily calculate the TPR and FPR. See the second column in Fig. 8.10.
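The threshold sweep described above can be sketched as follows. The probabilities and true classes are made-up values, not those of Fig. 8.9:

```python
# Sketch of the ROC construction: every predicted probability serves as a
# threshold t, and FPR(t)/TPR(t) are computed from the induced predictions.
def roc_points(probs, true_classes):
    points = []
    for t in sorted(set(probs) | {0.0, 1.0}, reverse=True):
        pred = [1 if p > t else 0 for p in probs]   # class 1 iff probability > t
        tp = sum(1 for p, y in zip(pred, true_classes) if p == 1 and y == 1)
        fp = sum(1 for p, y in zip(pred, true_classes) if p == 1 and y == 0)
        fn = sum(1 for p, y in zip(pred, true_classes) if p == 0 and y == 1)
        tn = sum(1 for p, y in zip(pred, true_classes) if p == 0 and y == 0)
        points.append((fp / (fp + tn), tp / (tp + fn)))  # (FPR(t), TPR(t))
    return points

probs = [0.9, 0.7, 0.4, 0.2]
truth = [1, 1, 0, 0]
print(roc_points(probs, truth))
```

The resulting list of (FPR, TPR) pairs, plotted in order, is the ROC curve; for this perfectly separable toy data it passes through the ideal point (0, 1).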
The ideal classifier has TPR = 1 and FPR = 0, which means that a good classifi-
cation model should have an ROC that goes from the origin in a relatively straight
line to the top left corner, and then to the point (1,1). The diagonal symbolizes the
random model, and if the curve coincides with it, the model performs no better than
random guessing.
In cases where the ROC is under the diagonal, it is most likely that the model
classifies the inverse, i.e., 0 is classified as 1 and vice versa. A typical and an ideal
ROC is displayed in Fig. 8.12.
This ROC curve can be transformed into a single goodness of fit measure for
the classification model: the AUC. This is simply the area under the ROC curve
and can take values between 0 and 1. With the AUC, two models can be compared,
and a higher value indicates a more precise classifier and thus a more suitable
model for the data.
For more detailed information on the ROC and AUC, we recommend Kuhn
and Johnson (2013) and Tuffery (2011).
In the SPSS Modeler, the ROC can be drawn with the Evaluation node. This is
demonstrated in Sect. 8.3.2.
Gini index
Another common goodness of fit measure for the classifier is the Gini index, which
is just a transformation of the AUC, i.e.,
Gini = 2 · AUC − 1.
Fig. 8.12 A typical ROC curve is illustrated in the first graph and an ideal ROC curve in the
second graph
As before with the AUC, a higher Gini index indicates a better classification model.
The shaded area in the first graph in Fig. 8.12 visualizes the Gini index and shows
the relationship with the AUC.
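The AUC can be approximated from the ROC points with the trapezoidal rule, and the Gini index then follows from it; the ROC points below are illustrative values, assumed sorted by FPR:

```python
# AUC as the area under the ROC polygon (trapezoidal rule) and the
# Gini index derived from it via Gini = 2 * AUC - 1.
def auc_from_roc(points):
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0   # trapezoid between consecutive points
    return area

roc = [(0.0, 0.0), (0.0, 0.8), (0.4, 1.0), (1.0, 1.0)]
auc = auc_from_roc(roc)
gini = 2 * auc - 1
print(round(auc, 3), round(gini, 3))  # 0.96 0.92
```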
These presented measures are implemented in the SPSS Modeler and displayed
in the output of the Analysis node. To get these statistics on the model’s
goodness of fit, the Analysis node has to be added to the model nugget and then
executed.
The statistics that can be calculated and displayed with the Analysis node can be
seen in Fig. 8.13. The three most important additional statistics and measures that
are available for a classifier model are listed in Table 8.3, and the output of the
Analysis node for a binary target variable can be viewed in Fig. 8.14. The output in
the multiple classification case is basically the same, except for the missing AUC
and Gini values. For an example of such a multiple classifier output, see Fig. 8.89.
In the output window of the Analysis node, the accuracy is shown at the top,
followed by the coincidence matrix. The Gini and AUC are displayed at the bottom
of the Analysis node output. All statistics are individually calculated for the training
and test set, respectively. See Fig. 8.14 for the example of the Analysis node
statistics for a classifier on the Wisconsin breast cancer data.
Fig. 8.14 Output of the Analysis node of a classifier with two target categories, on the example of
the Wisconsin Breast Cancer Data (see Sect. 10.1.35) and a logistic regression classifier
8.2.7 Exercises
Fig. 8.15 Defining the target variable measurement as “Flag” in the Type node. This ensures
correct calculation of the AUC and Gini
Use this classifier to predict the credit rating of the following customers:
“good” credit rating. The probabilities of the customers and the true credit ratings
are as follows:
1. Predict the credit rating in cases where the classifier suggests a “good” rating, if
the probability is greater than 0.5, and a “bad” rating otherwise. What are the
number of TP, FP, TN, and FN and the error rate in this case?
2. Calculate the ROC and the AUC from the probabilities of a “good” credit rating
and a true credit rating.
Give reasons why these effects can worsen your model and eliminate a good
prediction. Explain the term “overfitting” in your own words.
8.2.8 Solutions
1. This has been explained in Sect. 8.2.1, and we refer to that section for the answer
to this first question.
2. The predicted credit ratings are listed in Fig. 8.16.
1. The predicted credit ratings are the same as in exercise 1 and can be viewed in
Fig. 8.16. Table 8.4 shows the credit card rating values of TP, FP, TN, FN.
With these values, we obtain the error rate
Error rate = (number of FP + FN) / (number of data points) = (1 + 2) / 13 ≈ 0.231.
2. The values of TPR and FPR for the relevant thresholds are listed in Fig. 8.17, and
they are calculated as described in the ROC part of Sect. 8.2.5. The ROC then
looks like Fig. 8.18.
Fig. 8.17 FPR and TPR for the relevant threshold of the ROC
– Imbalanced data: Classifier algorithms can prefer the majority class and opti-
mize the model to predict this class very accurately. As a consequence, the
minority class is then poorly classified. This leaning towards the larger target
class is often the optimal choice for highest prediction accuracy, when dealing
with highly imbalanced data.
In Fig. 8.19, the imbalanced data problem is visualized. The dots are under-
represented compared with the rectangles, and thus the decision boundary is
matched to the latter class of data points. This leads to high misclassification in
the dots’ class; here, 5 out of 7 dots are misclassified, as they lie on the wrong
side of the boundary. See Haibo He and Garcia (2009) for details on this issue
and a variety of methods for dealing with this problem.
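One simple countermeasure can be sketched as follows: randomly oversample the minority class until both classes are equally frequent. This is an illustration only; He and Garcia (2009) survey far more refined methods:

```python
# Naive random oversampling of the minority class (illustrative sketch).
import random

def oversample_minority(records, labels, seed=1):
    rng = random.Random(seed)
    by_class = {}
    for rec, lab in zip(records, labels):
        by_class.setdefault(lab, []).append(rec)
    target = max(len(v) for v in by_class.values())   # size of the majority class
    out_recs, out_labs = [], []
    for lab, recs in by_class.items():
        extra = [rng.choice(recs) for _ in range(target - len(recs))]
        for rec in recs + extra:          # originals plus random duplicates
            out_recs.append(rec)
            out_labs.append(lab)
    return out_recs, out_labs

recs, labs = oversample_minority([1, 2, 3, 4, 5, 6], ["a", "a", "a", "a", "a", "b"])
# both classes now occur equally often
```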
– Outliers: An outlier in the data can vastly change the location of the decision
boundary and, thus, highly influence the classification quality. In Fig. 8.20, the
calculated decision boundary is heavily shifted, in order to take into account the
outlier at the top. The dotted decision boundary might be a better choice for
separation of the classes.
– Missing values: There are two pitfalls when it comes to missing values. First, a
data record with missing values lacks information. For some models, these
data are not usable for training, or they can lead to incorrect models, due to
misinterpretation of the distribution or importance of the variables with missing
values. Moreover, some models are unable to predict the target for incomplete
data. There are several methods to handle missing values and assign a proper
value to the missing field. See Han et al. (2012) for a list of concepts to deal with
missing values.
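Two of the standard concepts, mean substitution for numeric fields and mode substitution for categorical fields, can be sketched like this; None marks a missing value and the data are illustrative:

```python
# Minimal sketch of mean and mode imputation for missing values.
from collections import Counter

def impute_mean(values):
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)                    # mean of the observed values
    return [mean if v is None else v for v in values]

def impute_mode(values):
    known = [v for v in values if v is not None]
    mode = Counter(known).most_common(1)[0][0]        # most frequent category
    return [mode if v is None else v for v in values]

print(impute_mean([1.0, None, 3.0]))       # -> [1.0, 2.0, 3.0]
print(impute_mode(["a", "b", None, "a"]))  # -> ['a', 'b', 'a', 'a']
```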
– A huge number of predictor variables: If the ratio of input variables to the
number of observations is extremely high, classifiers tend to overfit. The more
input variables that exist, the more possibilities for splitting the data. So, in this
case we can definitely find a way to distinguish between the different samples
through at least one variable. One possible technique for dealing with this
problem is dimension reduction, via PCA or factor analysis (see Chap. 6). For
further information, we refer to James et al. (2013).
Overfitting is a phenomenon that occurs when the model is overly adapted to the
training data, such that it is very accurate on the training data but inaccurate when
predicting unseen data. See Sect. 8.2.4, Fig. 8.8, and Sect. 5.1.2.
Logistic regression (LR) is one of the most famous classification models and is used
for many problems in a variety of fields. It is of such relevance and importance,
especially in the financial sector, that the main contributors to the theory received
a Nobel prize in economics in 2000 for their work. LR is pretty similar to
Linear Regression as described in Chap. 5. The main difference is the scale of the
target value. Linear regression assumes a numeric/continuous target variable and tries
to estimate the functional relationship between the predictors and the target variables,
whereas in a classification problem, the target variable is categorical and linear
regression models become inadequate for this kind of problem. Hence, a different
approach is required. The key idea is to perform a transformation of the regression
equation to predict the probabilities of the possible outcomes, instead of predicting
the target variable itself. This resulting model is then the “LR model”.
8.3.1 Theory
The setting for an LR is the following: consider n data records xi1, . . ., xip,
i = 1, . . ., n, each consisting of p input variables, and each record having an
observation yi. The observations y1, . . ., yn are binary and take the values 0 or 1.
Instead of predicting the categories (0 and 1) directly, the LR uses a different
approach and estimates the probability of the observation being 1, based on the
co-variables, i.e.,

P(yi = 1 | xi1, . . ., xip),

with a regression

h(xi1, . . ., xip) = β0 + β1 xi1 + . . . + βp xip.
The regression term h is mapped onto the interval (0, 1) by the logistic function

F(t) = exp(t) / (1 + exp(t)).
See Fig. 8.21 for the graph of the logistic function. Hence,
P(yi = 1 | xi1, . . ., xip) = exp(h(xi1, . . ., xip)) / (1 + exp(h(xi1, . . ., xip))).

Solving this equation for h, i.e.,

log( P(yi = 1 | xi1, . . ., xip) / (1 − P(yi = 1 | xi1, . . ., xip)) )
    = h(xi1, . . ., xip) = β0 + β1 xi1 + . . . + βp xip,

we get the usual (linear) regression term on the right-hand side. This equation is
referred to as log-odds or logit, and it is usually stated as the defining equation of
logistic regression.
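The resulting probability model can be sketched directly from these equations; the coefficients below are made-up illustration values, not estimates from any dataset in the book:

```python
# The logistic transform and the resulting class-1 probability.
import math

def logistic(t):
    return math.exp(t) / (1.0 + math.exp(t))   # F(t) = exp(t) / (1 + exp(t))

def prob_class1(x, beta0, betas):
    h = beta0 + sum(b * xi for b, xi in zip(betas, x))   # linear predictor h(x)
    return logistic(h)

p = prob_class1([2.0, 1.0], beta0=-1.0, betas=[0.8, -0.5])
# h = -1.0 + 1.6 - 0.5 = 0.1, so p = exp(0.1) / (1 + exp(0.1)) ≈ 0.525
```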
8.3 Logistic Regression 735
Odds
Recall that in the linear regression models, the coefficients β0, . . ., βp give the effect
of the particular input variable. In LR, we can give an alternative interpretation of
the coefficients, by looking at the following equation
P(yi = 1 | xi1, . . ., xip) / P(yi = 0 | xi1, . . ., xip)
    = exp(β0) · exp(β1 xi1) · . . . · exp(βp xip),
which is derived from the former equation, by taking the exponential. The quotient
of probabilities on the left-hand side of this equation is called odds and gives the
weight of the probability to the observation being 1, compared to the probability
that the observation is 0. So, if the predictor xik increases by 1, the odds change by
a factor of exp(βk). This particularly means that a coefficient βk > 0, and thus
exp(βk) > 1, increases the odds and therefore the probability of the target variable
being 1. On the other hand, if βk < 0, the target variable tends to be of category
0. A coefficient of 0 does not change the odds, and the associated variable has
therefore no influence on the prediction of the observation.
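A quick numeric check of this interpretation; the coefficients are illustrative assumptions:

```python
# Increasing predictor k by one unit multiplies the odds by exp(beta_k).
import math

beta0, beta1 = -0.5, 0.7

def odds(x1):
    return math.exp(beta0) * math.exp(beta1 * x1)   # P(y=1|x) / P(y=0|x)

ratio = odds(3.0) / odds(2.0)
# ratio equals exp(beta1) ≈ 2.01, independent of the starting value of x1
```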
This model is called the Logit model, due to the transformation function F, and
it is the most common regression model for binary target variables. Besides the
Logit model, there are other models that use the same approach but different
transformation functions, and we refer interested readers to Fahrmeir (2013) for
further information.
To perform an LR, some requirements are needed, which should be checked
before trusting the results of the model.
Necessary conditions
Goodness of fit measures
Besides the common measures of fit introduced in Sect. 8.2.5, there are a variety of
other measures and parameters to quantify the goodness of fit of the logistic
regression model to the data. First, we mention the Pseudo R-Square Measures,
in particular the Cox and Snell, the Nagelkerke, and the McFadden measures,
which compare the fitted model with the naïve model, i.e., the model which
includes only the intercept. See Tuffery (2011) and Allison (2014). These
measures are similar to the coefficient of determination R2, but each of them has
its limitations. We would further like to point out that we have to pay attention
when interpreting the Cox and Snell R2 measure since its maximum value is always
less than 1.
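Assuming the log-likelihoods LL0 (intercept-only model) and LL1 (fitted model) are known, the three measures can be sketched with their standard formulas (see Allison 2014); the numeric inputs are illustrative:

```python
# Cox & Snell, Nagelkerke and McFadden pseudo R-square measures,
# computed from the log-likelihoods of the null and the fitted model.
import math

def pseudo_r2(ll0, ll1, n):
    mcfadden = 1.0 - ll1 / ll0
    cox_snell = 1.0 - math.exp(2.0 * (ll0 - ll1) / n)
    nagelkerke = cox_snell / (1.0 - math.exp(2.0 * ll0 / n))  # rescales the maximum to 1
    return cox_snell, nagelkerke, mcfadden

cs, nk, mf = pseudo_r2(ll0=-120.0, ll1=-70.0, n=200)
# cs ≈ 0.393, nk ≈ 0.563, mf ≈ 0.417; note that cs stays below its own
# maximum 1 - exp(2*ll0/n) < 1, the limitation mentioned above.
```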
Another measure for the goodness of fit used by the Modeler is the likelihood
ratio test, which is also a measure that compares the fitted model with the model
including only the intercept. This statistic is asymptotically Chi-square distributed.
If the value of the likelihood ratio test is large, the predictors significantly improve
the model fit. For details see Azzalini and Scarpa (2012).
A logistic regression model can be built with the Logistic node in the SPSS
Modeler. We will present how to set up the stream for a logistic regression, using
this node and the Wisconsin breast cancer dataset, which comprises data from
breast tissue samples for the diagnosis of breast cancer. A more detailed description
of the variables and the medical experiment can be gleaned from motivating
Example 1, Sect. 10.1.35 and Wolberg and Mangasarian (1990).
Fig. 8.23 Detailed view of the Type node for the Wisconsin Breast Cancer data
this example. As input variables, we select all but the “SampleID”, as this
variable only names the samples, and so is irrelevant for any classification.
5. In the Model tab, we can select if the target variable is binary or has more than
two categories. See top arrow in Fig. 8.25. We recommend using the Multino-
mial option, as this is more flexible and the procedure used to estimate the model
doesn’t differ much from the Binomial, while the results are equally valid.
As in the other regression models (see Chap. 5), we can choose between
several variable selection methods. See Sect. 5.3.1 for more information on these
methods. Table 8.5 shows which variable selection methods are available for
Multinomial and Binomial logistic regression.
In the example of the breast cancer data, we choose the Backwards stepwise
method, see middle arrow in Fig. 8.25. That means that the selection process starts
with a complete model, i.e., all variables included, and then removes variables step
by step, until the resulting model can no longer be improved upon.
We can also specify the base category, which is the category of the target
variables that all other variables are compared with. In other words, the base
category is interpreted as 0 for the logistic regression, and the model estimates the
probability of not belonging to the base category.
By default, each input variable will be considered separately, with no
dependencies or interactions between each other. This is the case when we
Fig. 8.24 Selection of the criterion variable and input variables for logistic regression, in the case
of the Wisconsin Breast Cancer data
select the model type “Main Effects” in the model options (see Fig. 8.25). On the
other hand, if “Full Factorial” is selected as the model type, all the possible
interactions between predictor variables are considered. This will lead to a more
complex model that can describe more complicated data structures. The model
may be likely to suffer from overfitting in this situation however, and the
calculations may increase intensively due to the amount of new coefficients
that have to be estimated. If we know the interactions between variables, we can
also declare them manually. This is shown in detail in Sect. 8.3.3.
If we select the Binomial logistic model, it is also possible to specify the
contrast and base category for each categorical input. This can sometimes be
useful if the categories of a variable are in a certain order, or a particular value is
the standard (base) category to which all other categories are compared. As this
is a feature for experienced analysts, however, we omit giving a detailed
description of the options here, and refer interested readers to IBM (2015b).
Fig. 8.26 The “Analyze” tab, with the option “Calculate predictor importance”
for the accuracy and Gini/AUC values of the model for the Wisconsin Breast
cancer data. Here, we have 97 % correctly classified samples and a very high
Gini value of 0.992. Both indicate a well-fitted model using the training data.
8. As we are faced with a binary classification problem, we add an Evaluation node
to the model nugget, to visualize the ROC in a graph and finish the stream. Open
the node and select the ROC graph as the chart type (see Fig. 8.28).
After clicking on the “Run” button at the bottom, the graph output pops up in a
new window. This is displayed in Fig. 8.29. The ROC is visualized with a line
above the diagonal and has nearly the optimal form (recall Sect. 8.2.6), whereas
the diagonal symbolizes the purely random prediction model.
Fig. 8.29 Graph of the ROC of the logistic regression model on the Wisconsin breast cancer data
There are three possible model types for a logistic regression. Two of these are:
“Main Effects”, where the input variables are treated individually and independently
with no interactions between them, and “Full Factorial”, where all possible
dependencies between the predictors are considered for the modeling. In the latter
case, the model describes more complex data dependencies, and itself becomes
more difficult to interpret. Since there is an increase of terms in the equation, the
calculations may take much longer, and the resulting model is likely to suffer from
overfitting.
With the third model type, i.e., “Custom” (see Fig. 8.30), we are able to
manually define variable dependencies that should be considered in the modeling
process. We can declare these in the Model tab of the Logistic node, in the bottom
field, as framed in Fig. 8.30.
– Single interaction: a term is added that is the product of all selected variables.
For example, if A, B, and C are chosen, then the term A*B*C is included in
the model.
– Main effects: each variable is added individually to the model, hence, A, B,
and C separately, for the above example.
– All *-way interaction: All possible products with variable combinations of *,
which stands for 2, 3, or 4, are inserted. In the case of “All 2-way”
interactions, for example, this means that the terms A*B, A*C, and B*C
are added to the logistic regression model.
3. We choose one of the five term types and mark the relevant variables for the term
we want to add, by clicking on them in the field below. In the Preview field, the
selected terms appear and by clicking on the Insert button, these terms are
included in the model. See Fig. 8.32, for an example of “All 2-way” interaction.
The window closes, and we are back in the options view of the Logistic node.
4. The previous steps have to be repeated until every necessary term is added to the
model.
The estimated coefficients and the parameters describing the goodness of fit can be
viewed by double-clicking on the model nugget.
In the right-hand field, the predictor importance of the input variables is displayed.
The predictor importance gives the same description as it would with linear
regression; that is, the relative effect of the variables on the prediction of the
observation in the estimated model. The length of the bars indicates the importance
of the variables in the regression model, which add up to 1. For more information on
predictor importance, we refer to the “Multiple Linear Regression” Sect. 5.3.3, and
for a complete description of how these values are calculated, read IBM (2015a).
Fig. 8.35 Case summary statistics of logistic regression in the Wisconsin breast cancer data
variables, the model-fitting criteria value is displayed, together with test statistics
that were determined in order to evaluate if the variable should be contained in the
model or removed. Here, the variables “cell size” (clem_var3_) and “single
epithelial cell size” (clem_var6_) were removed, based on the −2 log-likelihood
model-fitting criterion. For details, we refer to Fahrmeir (2013) and James et al. (2013).
The model-fitting information of the final model can be viewed in an additional
table, also contained in the Advanced tab. See the first table in Fig. 8.37. Here, the
model-fitting criterion is visualized, alongside scores from the model validation test
and the selected variables.
Beneath this overview, further model-fitting criteria values are listed, the pseudo
R2 values. These comprise the Cox and Snell, Nagelkerke, and McFadden R2
values. These characteristic numbers are described in the theory section (Sect.
8.3.1) and the references given there. Here, all R2 values indicate that the
regression model describes the data well.
Output setting
In the Settings tab, we can specify which additional values, besides the estimated
class or category, should be included in the output of a prediction (see Fig. 8.38).
These can be “Confidences”, that is the probability of the estimated target category,
or the “Raw propensity score”, which is the estimated probability actually calculated
by the model. This is the probability of occurrence of the non-base
category. Alongside these options, all probabilities can also be appended to the
output. An example on how such an output might look is pictured in Fig. 8.39. Next
to the estimated class ($L-class), the probability for the outcome of this class ($LP-
class), the probabilities of all possible outcomes ($LP-2, $LP-4), and the probability
estimated by the model ($LRP-class) are added to the output of each inserted
sample record.
Fig. 8.38 Output definition of the model, when used for prediction
Predicting classes for new records or for a logistic regression model with an
unknown dataset is done just like linear regression modeling (see Sect. 5.2.5). We
copy the model nugget and paste it into the modeler canvas. Then, we connect it to
the Source node with the imported data that we want to classify. Finally, we add an
Output node to the stream, e.g., a Table node, and run it to obtain the predicted
classes. The complete prediction stream should look like Fig. 8.40.
Related exercises: 1, 2
1. We consider the stream for the logistic regression as described in Sect. 8.3.2 and
add a Partition node to the stream in order to split the dataset into a training
set and a test set. This can be, for example, placed before the Type node. See
Sect. 2.7.7 for a detailed description of the Partition node. We recommend using
70–80 % of the data for the training set and the rest for the test set. This is a
typical partition ratio. Additionally, let us point out that the model and the hit
rates depend on the randomly selected training data. To get the same model
in every run, we fix the seed of the random number generator in the Partition
node.
2. In the Logistic node, we have to select the field that indicates an affiliation to the
training set or test set. This is done in the Fields tab (see Fig. 8.41).
3. Afterwards, we mark the “Use partitioned data” option in the Model tab (see
Fig. 8.42). Now, the modeler builds the model only on the training data and uses
the test data for cross-validation. All other options can be chosen as in a normal
model building procedure and are described in Sect. 8.3.2.
4. After running the stream, the summary and goodness of fit information can be
viewed in the model nugget that now appears. These are of the same kind as in Sect.
8.3.4, although fewer data were used in the modeling procedure. See
Fig. 8.43 for a summary of the data used in the modeling process. Here, we see
that the total number of samples has reduced to 491, which is 70 % of the whole
Wisconsin dataset.
Since fewer data are used to train the model, all parameters, the number of
included variables and the fitting criteria values change. See Fig. 8.44 for the
parameters of this model. We note that compared with the model built on the
whole dataset (see Sect. 8.3.4), this model, using 70 % trained data, contains
only 6 rather than 7 predictor variables.
Fig. 8.42 Use the partitioned data option in the Logistic node
Fig. 8.43 Summary of the data size used in the modeling process with the training data
5. In the Analysis node, we mark the “Separate by partition” option to calculate the
hit rates for each partition separately; see Fig. 8.45. This option is enabled by
default. The output of the Analysis node can be viewed in Fig. 8.14, which
shows the hit rates and further statistics for both the training set and the test set.
Fig. 8.44 Model parameters of the model built on a subset of the original data
Fig. 8.45 Options in the Analysis node with “Separate by partition” selected
In our example of the Wisconsin Breast Cancer data, the regression model
classifies the data records in both sets very well, e.g., with a hit rate of over
97 % and a Gini of over 0.98.
Fig. 8.46 Settings in the Evaluation node. The “Separate by partition” option is enabled, to treat
the training data and test data separately
6. In the Evaluation node, we also enable the “Separate by partition” option, so that
the node draws an ROC for the training set, as well as the test set, separately (see
Fig. 8.46). These plots can be viewed in Fig. 8.47.
Fig. 8.47 Output of the Evaluation node for the logistic regression of Wisconsin breast cancer
data. ROC for the training sets and test sets separately
8.3.7 Exercises
1. Import and inspect the Titanic data. How many values are missing from the
dataset and in which variable fields?
2. Build a logistic regression model to predict if a passenger survived the Titanic
tragedy. What is the accuracy of the model and how are the records with missing
values handled?
3. To predict an outcome for the passengers with missing values, the blanks have to
be replaced with appropriate values. Use the Auto Data Prep node to fill in the
missing data. Which values are inserted in the missing fields?
4. Build a second logistic regression model on the data without missing values. Has
the accuracy improved?
5. Compare the model with and without missing values, by calculating the Gini and
plotting the ROC curve. Use the Merge node to show all measures in an Analysis
node output and draw the ROC of both models in one graph.
8.3.8 Solutions
Fig. 8.48 Stream of the credit rating via logistic regression exercise
Fig. 8.50 Definition of the variable “Credit rating” as the target variable in the Type node
Fig. 8.51 Definition of the stepwise variable selection method and cross-validation process
log( P(CR = Good | x) / P(CR = Bad | x) )
    = -1.696 + 0.1067 * Age
      + (  1.787 if IL = High;   0 if IL = Medium;   -1.777 if IL = Low )
      + ( -2.124 if NCC = 5 or more;   0 if NCC = Less than 5 )
where CR represents the “Credit rating”, IL the “Income level”, and NCC the
“Number of credit cards”.
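Turning the fitted log-odds into a probability works via the inverse logit. The sketch below uses the coefficients from the reconstructed equation above (the sign placement is an assumption recovered from the extraction) for one made-up customer, aged 40 with a high income level and 5 or more credit cards:

```python
# Sketch: probability of a "Good" rating from the fitted log-odds equation.
from math import exp

def credit_logit(age, income_level, many_cards):
    # coefficients as reconstructed above (sign placement is an assumption)
    logit = -1.696 + 0.1067 * age
    logit += {"High": 1.787, "Medium": 0.0, "Low": -1.777}[income_level]
    logit += -2.124 if many_cards else 0.0
    return logit

logit = credit_logit(40, "High", True)
p_good = 1 / (1 + exp(-logit))   # inverse logit
print(round(p_good, 3))          # probability of CR = Good
```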
These predictors are significant to the model. This can be viewed in the
Advanced tab at the bottom of the table (see Fig. 8.54).
4. To calculate the performance measures and to evaluate our model, we add an
Analysis node to the stream and choose the coincidence matrix and AUC/Gini
options. See Sect. 8.2.6 for a description of the Analysis node. We run the
stream. See Fig. 8.55 for the output of the Analysis node and the evaluation
statistics.
We note that the hit ratios in both the training set and the testing set are nearly
the same at a bit over 80 %. Furthermore, the Gini and AUC are close to each
other in both cases at about 0.78. All these statistics suggest good prediction
accuracy and no overfitting.
5. We visualize the ROC by adding an Evaluation node to the stream and
connecting it to the model nugget. We open it and choose the ROC chart option,
Fig. 8.55 Evaluation statistics of the logistic regression model for predicting credit rating
as shown in Fig. 8.28. Figure 8.56 shows the ROC of the training dataset and test
dataset. Both curves are similarly shaped.
Now we add a second Evaluation node to the model nugget, open it, and
choose “Gains” as our chart option. Furthermore, we mark the box “Include best
line”. See Fig. 8.57. Then, we click the “Run” button at the bottom. The output
graph is displayed in Fig. 8.58.
The gain/lift chart visualizes the effectiveness of a model and shows the
share of positive samples captured by the model and the share lost relative to
the optimal model. More precisely, the gain chart in Fig. 8.58 displays three
lines: the diagonal, also called
baseline, and two lines representing the gains of the logistic regression model
(middle line) and the optimal model (top line). The gain chart is drawn as
follows. All customers are sorted according to their score assigned by the logistic
regression (or another classification model). On the x-axis, the percentage of top
scored customers are shown, while the y-axis displays the percentage of all
“Good” rated customers which are in the top scored customers. If we take, for
example, the top 40 % scored customers from the test set, we have identified
about 60 % of all “Good” rated customers with the regression model, while in the
optimal case, we would detect about 65 %. So, we gained 60 %, but lost 5 % of
the “Good” rated customers. In the test set, there are 61 % “Good” rated
customers, but when taking the best 61 % scored, we identified about 83 % of
them, which is still very good.
Looking at it the other way around: if we want our sub-sample to contain 80 %
of the "Good" rated customers, we have to select the 57 % of customers that
were scored best.
For additional information on the gain/lift chart and its interpretation, we refer
to Kuhn and Johnson (2013).
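The construction described above can be sketched in a few lines: sort the records by model score in descending order and track which share of all "Good" cases falls into the top x % of records. The scores and labels below are made up for illustration:

```python
# Sketch of how a gains curve is computed from model scores.
import numpy as np

def gains_curve(scores, labels):
    order = np.argsort(-np.asarray(scores))          # best-scored first
    hits = np.asarray(labels)[order]                 # 1 = "Good", 0 = "Bad"
    cum_hits = np.cumsum(hits) / hits.sum()          # y-axis: share of all Goods
    depth = np.arange(1, len(hits) + 1) / len(hits)  # x-axis: top x % of records
    return depth, cum_hits

scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1,   1,   0,   1,   1,   0,   0,   0]
depth, gain = gains_curve(scores, labels)
# the top 50 % of scored records already contain 3 of the 4 Goods
print(depth[3], gain[3])
```

Plotting `gain` against `depth` gives the middle line of the gain chart; the baseline is the diagonal, and the optimal line rises with slope 1/(share of Goods) until it reaches 100 %.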
The final stream for this exercise is shown in Fig. 8.59. Here, the two models that
are built in this exercise are compressed into SuperNodes, so the stream is more
structured and clear.
Fig. 8.61 Quality output with missing value inspection of the Titanic data
Fig. 8.62 Stream of the logistic regression classifier with missing values in the Titanic data
furthermore have only very few missing values, but the “Age” field is only 80 %
filled. This latter variable can become problematic in our further analysis.
2. We now build the logistic regression classifier on the data with missing values.
Figure 8.62 shows the complete part of this stream. This is the part that is
wrapped up in a SuperNode named “with missing data”.
First, we open the Type node and declare the “Survived” variable as the target
field and set the measurement of it to “Flag”. This will ensure that the “Sur-
vived” variable is automatically selected as the target variable in the Modeling
node. See Fig. 8.63. Then, we add a Logistic node to the stream and connect it to
the Type node. After opening it, we specify “stepwise” as our variable selection
method and enable the use of partitioned data and the predictor importance
calculations. See Figs. 8.51 and 8.52 for the set up of these options in the
Logistic node.
Now, we run the stream and the model nugget appears, which we then open.
In the Model tab (see Fig. 8.64), we note that the variables “Sex”, “Pclass”,
“Embarked”, and “Age” are selected as predictors by the stepwise algorithm.
Fig. 8.63 Definition of the “Survived” field as the target and its measurement as “Flag”
Fig. 8.64 Input variables and their importance for predicting the survival of a passenger on the
Titanic
Fig. 8.65 Performance measures of the Titanic classifier with missing values
The most important of them is the variable “Sex”, and we see by looking at the
regression equation on the left that women had a much higher probability of
surviving the Titanic sinking, as their coefficient is 2.377 and the men’s coeffi-
cient is 0. Furthermore, the variables “Age” and “Embarked”, which contain
missing values, are also included in the classifier.
To evaluate the model, we add the usual Analysis node to the model nugget, to
calculate the standard goodness of fit measures. These are the coincidence
matrix, AUC and Gini. See Sect. 8.2.6. After running the stream, these statistics
can be obtained in the opening window. See Fig. 8.65. We observe that the
accuracy in the training set and test set is similar (about 63 %). The same holds
true for the Gini (0.53). This is a good indicator that the model is not
overfitting. In the coincidence matrix, however, we see a column named $null$.
These are the passenger records with missing values. None of these records can
be classified, because a person's age, for example, is required for any
prediction; their prediction is therefore treated as 0, i.e., these passengers
are counted as non-survivors. This obviously reduces the goodness of fit of our
model. So, we have to assign a value to the missing fields in order to predict
the survival status properly.
Fig. 8.66 Renaming the prediction fields for the model with missing values
Fig. 8.67 Stream of the logistic regression classifier and missing value data preparation in the
Titanic data
To distinguish between the predictions of the model with missing values and the
one built thereafter without missing values, we change the names of the
prediction fields with a Filter node. To do this, we add a Filter node to the
model nugget and rename these fields, as seen in Fig. 8.66.
3. Now, we use the Auto Data Prep node to replace the missing values and predict
their outcome properly. Figure 8.67 shows this part of the stream, which is
wrapped in a SuperNode named “mean replacement”.
We add an Auto Data Prep node to the Type node and open it. In the Settings
tab, the top box should be marked in order to perform data preparation. Then, we
tick the bottom three boxes under “Inputs”, so the missing values in continuous,
nominal, and ordinal fields are replaced with appropriate values. Since the
variables “Age” and “Fare” are continuous, the missing values are replaced by
the mean of the field’s values, and the missing embarked value is replaced by the
field’s mode. See Fig. 8.68. Furthermore, we enable the standardization of
continuous fields. This is not obligatory, but recommended, as it typically
improves the prediction, or at least does not worsen it.
After finishing these settings, we click the Analyze Data button at the top,
which starts an analysis process on the data-prepared fields. In the Analysis tab,
the report of the analyzed data can be viewed. We see, e.g., for the “Age”
variable, that in total 263 missing values were replaced by the mean, which is
29.88. Moreover, the age variable was standardized and now follows a normal
distribution, as can be seen in the top right histogram. See Fig. 8.69.
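The replacements performed by the Auto Data Prep node can be mimicked with pandas: continuous fields get the field mean, categorical fields the mode, and continuous fields are then z-standardized. The tiny Titanic-like frame below is made up for illustration:

```python
# Sketch mimicking the Auto Data Prep node's missing-value replacement.
import pandas as pd

df = pd.DataFrame({
    "Age":      [22.0, None, 38.0, None, 30.0],
    "Fare":     [7.25, 71.28, None, 8.05, 10.50],
    "Embarked": ["S", "C", None, "S", "S"],
})

for col in ["Age", "Fare"]:                       # continuous: fill with mean
    df[col] = df[col].fillna(df[col].mean())
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])  # mode

for col in ["Age", "Fare"]:                       # z-standardize continuous
    df[col] = (df[col] - df[col].mean()) / df[col].std()

print(df["Embarked"].tolist())
```

After the fill, every record is complete and the standardized continuous fields have zero mean, as reported in the node's Analysis tab.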
4. To build a classifier on the prepared data, we add another Type node to the
stream, to ensure that the Model node recognizes the input data correctly.
Afterwards, we add a Logistic node to the stream, connect it to the Type node
and choose the same options as in the first Logistic node. See step 2 for details.
Afterwards, we run the stream and the Model nugget appears.
We note that “Sex”, “Pclass”, and “Age” are still included as input variables
in the model, with nearly the same prediction importance. The Embarked
variable is now replaced by “sibsp”, however. This is due to the more powerful
“Age” variable, which is now able to influence more passenger records, as it now
contains complete data. See Fig. 8.70.
Fig. 8.69 Analysis of the “Age” field after data preparation by the Auto Data Prep. node
Fig. 8.70 Input variables and their importance for predicting the survival of a passenger on the
Titanic after data preparation
Fig. 8.71 Performance measures of the Titanic classifier without missing values
To view the Gini and accuracy, we add an Analysis node to the stream and set
up the usual options (see step 2 and Sect. 8.2.6). The output of the Analysis node
can be viewed in Fig. 8.71. We see that no missing values are left in the data and
thus all passengers can be classified with higher accuracy. Consequently, both
the accuracy and Gini have improved, when compared to the model with missing
values (see Fig. 8.65). So, the second model has a higher prediction power than
the one that ignores the missing values.
5. As in the previous model, we add a Filter node to the stream and connect it to the
model nugget; this Filter node allows us to change the name of the prediction
outcome variables by adding “mean repl” as a suffix. See Fig. 8.72.
Now, we add a Merge node to the stream canvas and connect the two Filter
nodes to it. In the Filter tab of the Merge node, we cross out every duplicate field,
to eliminate conflicts from the merging process. See Fig. 8.73.
Now, we add an Evaluation node and an Analysis node to the stream and
connect them to the Merge node. See Fig. 8.74. The settings of these nodes are as
usual. See Sect. 8.2.6 for a description of the Analysis node. In the Evaluation
node, we choose ROC as the chart type, see Fig. 8.28.
Fig. 8.72 Renaming of the prediction fields for the model without missing values
Fig. 8.75 ROC of the logistic regression classifiers with and without missing values
In Fig. 8.75, the ROC of the model is shown, with and without missing values
considered. As can be seen, the ROC of the model without missing values lies
noticeably above the ROC of the classifier ignoring missing values. Hence, the
first model has a better prediction ability.
In the Analysis output, the performance measures of both models are stated in
one output. Additionally, the predictions of both models are compared with each
other. This part of the Analysis output is displayed in Fig. 8.76. The first
table shows the percentage of equally predicted classes; in other words, it
shows how many passengers are classified as survivors or non-survivors by both
classifiers. The second table then takes the commonly classified records
(passengers) and compares their prediction with the actual category. Here, of
the equally classified passengers, about 80 % are correct.
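The two comparison tables can be reproduced with a few lines of plain Python: the share of records both models classify identically, and the accuracy among those agreed records. The toy predictions below are made up:

```python
# Sketch of the model-agreement comparison in the Analysis output.
pred_a = [1, 1, 0, 0, 1, 0, 1, 0, 1, 1]   # model with missing values
pred_b = [1, 1, 0, 1, 1, 0, 1, 0, 0, 1]   # model after mean replacement
actual = [1, 0, 0, 1, 1, 0, 1, 0, 0, 1]

# records on which both models agree
agree = [i for i in range(len(actual)) if pred_a[i] == pred_b[i]]
agreement_rate = len(agree) / len(actual)

# accuracy restricted to the commonly classified records
accuracy_on_agreed = sum(pred_a[i] == actual[i] for i in agree) / len(agree)
print(agreement_rate, accuracy_on_agreed)
```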
– It is a linear classifier.
– It is robust to outliers.
– It has a probabilistic interpretation.
If the input variables are highly correlated, the performance of the logistic
regression can suffer.
4. The correct probabilities are pA ≈ 0.22, pB ≈ 0.14, and pC ≈ 0.64.
The calculations are the following. Starting with the odds equations

    pA / pC = 0.34   and   pB / pC = 0.22,

we imply

    pA = 0.34 * pC   and   pB = 0.22 * pC.

Since pA + pB + pC = 1, we get

    (0.34 + 0.22 + 1) * pC = 1,

and finally

    pC ≈ 0.64.
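The arithmetic can be verified in a couple of lines:

```python
# Quick check: with pA = 0.34*pC and pB = 0.22*pC, the constraint
# pA + pB + pC = 1 gives (0.34 + 0.22 + 1) * pC = 1.
pC = 1 / (0.34 + 0.22 + 1)
pA, pB = 0.34 * pC, 0.22 * pC
print(round(pA, 2), round(pB, 2), round(pC, 2))
```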
Linear discriminant analysis is one of the oldest classifiers and goes back to
Fisher (1936) and Welch (1939), who developed it within a biological framework.
It is still one of the best-known and most widely used classifiers. For
example, the Linear discriminant
classifier (LDC) is very popular with banks, since it works well in the area of credit
scoring. When used correctly, the LDC provides great accuracy and robustness.
The LDC only works properly, however, if the data are normally distributed
within the target categories. The LDC thus makes strict assumptions and is
therefore not applicable to all data. The LDC furthermore uses a linear
approach and tries to find linear
functions that describe an optimal separation of the classes. The theory of LDC is
described in more detail in the subsequent section.
8.4.1 Theory
The idea of LDC goes back to Fisher (1936) and Welch (1939), who developed the
method independently of each other and with different approaches. We here
describe the method by Fisher, called Fisher's linear discriminant method. The
key idea
behind the method is to find linear combinations of input variables that separate
the classes in an optimal way. See, e.g., Fig. 8.77. The LDC chooses a linear
discriminant, such that it maximizes the distance of the classes from each other,
8.4 Linear Discriminate Classification 777
while at the same time minimizing the variation within. In other words, a linear
function is estimated that best separates the distributions of the classes from each
other. This algorithm is outlined in the following binary case. For a more detailed
description, we recommend Runkler (2012), Kuhn and Johnson (2013), Tuffery
(2011) and James et al. (2013).
Consider that the target variable is binary. This method now finds the optimal
linear separator of the two classes in the following way: First, it calculates the mean
of each class. The linear separator now has to go through the mid-point between
these two means. See the top left graph in Fig. 8.78. There, a cross symbolizes the
mid-point, and the decision boundary has to go through this point, just like the two
possibilities in the graph. To find the optimal discriminant function, the data points
are projected along the candidates. Now, on the projected data, the "within
classes" variance and "between classes" variance are calculated, and the linear
separator with the minimal variance within the projected classes, and
simultaneously a high variance between the projected classes, is picked as the
decision boundary.
This algorithm is visualized in Fig. 8.78. In the top right graph, the data are
projected on the two axes. As can be seen, the class distributions of the projected
data overlap and thus the classes are not clearly distinguishable. So the discriminant
functions, parallel to the axes, are not optimal separators. In the bottom left graph,
however, the data points are projected onto the dotted line. This separates the
distributions better, as can be seen in the bottom right graph. Hence, the solid
black line is chosen as the linear discriminant function.
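In the binary case, the direction of the optimal separator has a closed form: w = Sw⁻¹ (m1 − m0), where Sw is the pooled within-class scatter matrix. This direction maximizes the between-class separation relative to the within-class variance of the projected data. A minimal sketch on made-up 2-D data:

```python
# Sketch of Fisher's linear discriminant for two classes.
import numpy as np

rng = np.random.default_rng(0)
class0 = rng.normal([0, 0], 0.5, size=(50, 2))
class1 = rng.normal([2, 2], 0.5, size=(50, 2))

m0, m1 = class0.mean(axis=0), class1.mean(axis=0)
# pooled within-class scatter matrix
Sw = np.cov(class0.T) + np.cov(class1.T)
w = np.linalg.solve(Sw, m1 - m0)       # discriminant direction

# projecting onto w separates the class means far better than either axis
proj0, proj1 = class0 @ w, class1 @ w
print(proj1.mean() > proj0.mean())
```

The decision boundary is then the hyperplane through the mid-point of the projected class means, orthogonal to w.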
Compared to logistic regression, LDC makes more strict assumptions on the
data, but if these are fulfilled it typically gives better results, particularly in
accuracy and stability. For example, if the classes are well separated, then logistic
regression can have a surprisingly unstable parameter estimate and thus perform
particularly poorly. On the other hand, logistic regression is more flexible, and thus
it is still preferred in most cases. As an example, linear discriminant
analysis requires numeric/continuous input variables, since the covariance, and
with it the distance between data points, has to be calculated. This is a
drawback of LDC when compared with logistic regression, which can also process
categorical data.
Necessary conditions
Besides the numeric requirement of input variables, Fisher’s discriminant algo-
rithm, and thus LDC, requires the data in each category to be approximately
normally distributed in order to work properly. If they are not, the data first
have to be normalized before an LDC can be applied.
The Iris dataset contains data on 150 flowers from three different Iris species,
together with width and length information on their petals and sepals. We want to
build a discriminant classifier that can assign the species of an Iris flower, based
on its sepal and petal width and length. The Discriminant node can automatically
perform a cross-validation when a training set and test set are specified. We will
make use of this feature. This is not obligatory however; if the test set comes from
external data, then a cross-validation process will not be needed.
1. To build the above stream, we first open the stream “012 Template-
Stream_Iris” (see Fig. 8.79) and save it under a new name. This stream already
contains a Partition node that divides the Iris dataset into two parts, a training
and a test set in the ratio 70 % to 30 %.
2. Next, we run the stream and observe the output of the Audit node. We see that
all the continuous variables are nearly Gaussian distributed, which is necessary
for the discriminant analysis to work properly. See Fig. 8.80.
3. To get a deeper insight into the Iris data, we add several Graphboard nodes
to the stream by connecting them to the Type node. We use these nodes to
visualize the data in scatterplots and boxplots. Further descriptive analyses
could be considered as well. Here, we just inspect the two petal variables
in one scatterplot and two boxplots. See Figs. 8.81, 8.82, and 8.83. As can be
seen in the scatterplot, the petal width and length cluster the data by species.
Furthermore, the petal widths and lengths of the diverse species vary in
different ranges. This is evidenced by the boxplots. These descriptive analyses
indicate that a species-detecting classifier can be built.
4. We add the Discriminant node to the Modeler canvas and connect it to the
Partition node. After opening the former, we specify the target, partition, and
input variables in the Field tab. Thereby, the species variable is chosen as the
target, the Partition field as the partition and all four remaining variables are
chosen as input variables. See Fig. 8.84.
5. In the Model tab, we enable the usual “Use partition data” option, in order to
ensure that the training set and the test set are treated differently in their
purpose, for the cross-validation technique. See Fig. 8.85.
Furthermore, we choose the “Stepwise” method, to find the optimal input
variables for our classification model. See the bottom arrow in Fig. 8.85.
6. In the Analyze tab, we also enable the predictor importance calculation option.
See Fig. 8.86.
7. In the Expert tab, further parameters can be specified, in particular, additional
outputs, which will be displayed in the model nugget. These include the
Fig. 8.81 Scatterplot of the Iris data. Species are visualized with different shapes and colors
parameters and statistics of the feature selection process and estimated discrim-
inant functions. See Fig. 8.87 for the output options. We omit here the descrip-
tion of every possible additional output and refer the reader to IBM (2015b) for
more information. We recommend trying several of these options, however, in
order to see the additional information they give about the model and its
building process.
8. Now we run the stream, and the model nugget appears on the canvas.
9. We add an Analysis node to the model nugget, to evaluate the goodness of
fit of our discriminant classifier. Since the Iris data has a target variable
with three categories, we cannot calculate the AUC and Gini index (see
Sect. 8.2.5), but we intend to evaluate the model based on all the other
measures (Fig. 8.88).
10. After pressing the “Run” button in the Analysis node, the output statistics
appear and are shown in Fig. 8.89. A look at these model evaluation statistics
shows that the model has very high accuracy. That is, an accuracy of 97 %
correctly classified Iris flowers in the training set and 100 % correctly classified
in the test data. In total numbers, only three of the 150 flowers are
miscategorized.
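The same analysis can be sketched with scikit-learn: LDA on the Iris data with a 70/30 split (the exact partition differs from the one drawn by the Modeler, so the hit rates may deviate slightly):

```python
# Sketch: linear discriminant analysis on the Iris data, evaluated per partition.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=1, stratify=y)

lda = LinearDiscriminantAnalysis().fit(X_train, y_train)
print("train accuracy:", accuracy_score(y_train, lda.predict(X_train)))
print("test accuracy: ", accuracy_score(y_test, lda.predict(X_test)))
```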
Fig. 8.89 The output statistics in the Analysis node for the Iris dataset
The main information on the built model is provided in the Model tab and the
Advanced tab, the contents of which are roughly described hereafter.
Model tab
When opening the model nugget, first the Model tab is shown with the predictor
importance, provided its calculation was enabled in the Discriminant node. As
presumed by descriptive analysis of the Iris data (see Figs. 8.81, 8.82, and 8.83),
the variables “Petal Length” and “Petal Width” are the most important for
differentiating between the Iris species. See Fig. 8.90.
Advanced tab
The Advanced tab comprises statistics and parameters from the estimated model
and the final parameters. Most of the outputs in the Advanced tab are very technical
Fig. 8.90 Predictor importance of the linear discriminant analysis for the Iris data
and are only interpretable with extensive background and mathematical knowledge.
We therefore keep the explanations of the parameters and statistics rather short and
refer the reader to IBM (2015b) and Tuffery (2011) for further information if
desired.
The outputs chosen in the output options of the Expert tab in the Discriminant
node are displayed in this tab of the model nugget (see Sect. 8.4.2). Here, we will
briefly describe just the standard reports and statistics of the Advanced tab. There
are several further interesting and valuable statistical tables that can be displayed,
however, and we recommend trying some of these additional outputs.
The first two tables in this tab give a summary of the groups, i.e., the samples
with the same category. These include, e.g., the number of valid sample records in
each group and the number of missing values.
The next two tables show the number of estimated discriminant functions, their
quality, and their contribution to the classification. Here, two linear
functions are needed to separate the three Iris species properly. See Fig. 8.91. The
parameter calculation of the linear functions can be traced back to an eigenvalue
problem, which then has to be solved. See once again Runkler (2012) and Tuffery
(2011). The first table shows the parameters of the eigenvalue estimation. See
Fig. 8.91. The eigenvalue thereby quantifies the discriminating ability.
Fig. 8.91 Quality measures and parameters of the estimated discriminants, including the
eigenvalues and significance test
Score1 = 0.432 * Sepal Length - 0.514 * Sepal Width + 0.943 * Petal Length + 0.563 * Petal Width

Score2 = 0.265 * Sepal Length + 0.629 * Sepal Width - 0.445 * Petal Length + 0.583 * Petal Width.
The coefficients are stated so that the scores are standardized, meaning that
their distributions have zero mean and a standard deviation equal to one. The
coefficients themselves specify the magnitude of the effect of each
discriminating variable. For example, "Petal Length" has the largest effect on
the first score among the predictors.
8.4.4 Exercises
1. Import the data with an appropriate Source node and divide the dataset into
training data and test data.
2. Add a Type node and specify the scale type as well as the role of each variable.
3. Build a linear discriminant classification model with the Discriminant node.
Include the previously created partitioning and the calculation of the
predictor importance in your model.
4. Survey the model nugget. What is the most important predictor variable?
Determine the equation for calculating the discriminant score. Does the model
have a discriminant ability?
5. Add an Analysis node to the nugget and run the stream. Is the model able to
classify the samples? What is the hit rate and AUC of the test set?
3. Train a Logistic Regression model on the same training data as the Linear
Discriminant model. To do this, use the forwards stepwise variable selection
method. What are the variables included in the model?
4. Compare the performance and goodness of fit of the two models with the
Analysis and Evaluation node.
5. Use the Ensemble node to combine the two models into one model. What are the
performance measures of this new ensemble model?
8.4.5 Solutions
1. First, import the dataset with the File Var. File node and connect it to a Partition
node, to divide the dataset into two parts in a 70:30 ratio. See Sect. 2.7.7 for
partitioning datasets.
Fig. 8.93 Stream of the LDA classifier for the Wisconsin Breast Cancer dataset
Fig. 8.94 Definition of the scale level and role of the variables
2. Now, add a Type node to the stream and open it. The variable “Sample code”
just labels the samples and thus has no further meaning for the model building
process. Hence, its role is declared as “None”. Now, assign the role “Target” to
the “class” variable. Furthermore, it contains flag values. So, set the measure-
ment to “Flag”. See Fig. 8.94. Afterwards, click the “Read Values” button to
make sure that the stream knows all the scale levels of the variables. Otherwise,
the further discriminant analysis might fail and produce an error.
3. Now, build a linear discriminant classification model by adding a Discriminant
node to the stream. Open the node, enable the predictor importance calculation
in the Analyze tab and choose the “Use type node settings” option in the Fields
tab. The latter now uses the roles of the variables as declared in the Type node.
Fig. 8.95 Predictor importance of the LDA for the Wisconsin breast cancer data
Furthermore, make sure the “Use partitioned data” option is marked in the
Model tab. Now run the stream to build the model.
4. Open the model nugget. In the Model tab, you can see the predictor importance
plot. The “Bare nuclei” variable, with 0.42, has the most importance for
classifying the cancer probes. See Fig. 8.95.
In the Advanced tab, in the Wilks' lambda table, we can see that the discrimi-
nating ability is very high, since the significance level is almost zero. See
Fig. 8.96. In the Standardized Canonical Discriminant Function Coefficients,
you can find a list of the coefficients of the linear equation, from which the
discriminant scores are calculated. See Fig. 8.97 for these coefficients.
5. Now add an Analysis node to the nugget and select at least the options “Coinci-
dence matrices” and “Evaluation metric” in the node. Then run the stream again.
A window similar to Fig. 8.98 opens with the evaluation statistics. We see that
the LDA performs very well, as the accuracy is greater than 96 % in both the
training data and testing data. Furthermore, the AUC is extremely high, with
0.996 for the training data and 0.992 for the test data.
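This exercise can be sketched with scikit-learn as well, using the library's bundled breast cancer dataset as a stand-in for the Wisconsin file (an assumption; the split and thus the exact figures differ from the Modeler stream):

```python
# Sketch: LDA on breast cancer data, reporting test accuracy and AUC.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score, roc_auc_score

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.7, random_state=1)

lda = LinearDiscriminantAnalysis().fit(X_tr, y_tr)
acc = accuracy_score(y_te, lda.predict(X_te))
auc = roc_auc_score(y_te, lda.predict_proba(X_te)[:, 1])
print(f"test accuracy {acc:.3f}, AUC {auc:.3f}")
```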
Fig. 8.99 Stream of the LDA and Logistic Regression classifiers for the diabetes dataset
Fig. 8.101 Type node with the measurement of the target variable (class_variable) set to “Flag”
evaluation measures of the later trained models. See Fig. 8.101. Furthermore, we
make sure that the role of the “class_variable” is “Target” and all other roles are
“Input”. Now we add the usual Partition node and split the data into training
(70 %) and test (30 %) sets. See Sect. 2.7.7 for a detailed description of the
Partition node.
2. We add an LDA node to the stream, connect it to the Partition node and open
it. As the roles of the variables are already set in the Type node (see Fig. 8.101),
the node automatically identifies the roles of the variables and thus nothing has
to be done in the Fields tab. In the Model tab however, we select the “Stepwise”
variables selection method (see Fig. 8.102) and in the Analyze tab, we enable the
variable importance calculations (see Fig. 8.103).
Now we run the stream and open the model nugget that now appears.
We observe in the Model tab that the variable “glucose_concentration” is
by far the most important input variable, followed by “age”, “BMI”, and
“times_pregnant”. See Fig. 8.104.
3. We now add a Logistic node to the stream and connect it to the Partition node, to
train a logistic regression model. Afterwards, we open the node and select the
Fig. 8.102 Model tab of the LDA and definition of the stepwise variable selection method
Fig. 8.103 Analyze tab in the LDA node and the enabling of the predictor importance calculation
forwards stepwise variable selection method in the Model tab. See Fig. 8.105.
There, we choose the Binomial procedure, since the target is binary. The
multinomial procedure with the stepwise option is also possible and results in
the same model.
Before running the node, we further enable the importance calculation in the
Analyze tab. We then open the model nugget and note that the most important
variables included in the LDA are also the most important ones in the logistic
regression model, i.e., “glucose_concentration”, “age”, and “BMI”. These three
are the only variables included in the Model when using the variable selection
method (Fig. 8.106).
4. Before comparing both models, we rearrange the stream by connecting the two
model nuggets in series. See Fig. 8.107. This can easily be done by dragging
the connection arrow that runs from the Partition node to the logistic
regression model nugget onto the LDA model nugget.
Now, we add an Analysis and an Evaluation node to the stream and connect
them to the logistic regression nugget. Then, we run these two nodes to calculate
the accuracy and Gini values. The setting of these two nodes is explained in
Sect. 8.2.5 and Fig. 8.28. In Fig. 8.108 the model statistics of the LDA and
logistic regression can be viewed. We see that the AUC/Gini values are slightly
better for the linear discriminant model; this is also visualized in Fig. 8.109,
where the ROC of the LDA model is located a bit above the Logistic Regression
ROC. Accuracy in the test set is higher in the logistic regression model,
however. This is a good example of how a higher Gini does not have to go along
with higher accuracy, and vice versa.

Fig. 8.105 Model tab in the Logistic node. Definition of the variable selection method

The decision for a final model is thus
always associated with the choice of the performance measure. Here, the LDA
would be preferred when looking at the Gini, but when taking accuracy as an
indicator, the logistic regression would be slightly preferred.
To analyze where these differences arise from, we inspect the coincidence
matrices. We note that the two models differ in their favoritism of the target
class. The logistic regression model has a higher tendency to the non-diabetes
class (class_variable = 0) than the linear discriminant model. The first one thus
predicts non-diabetes diagnosis more often than the LDA, for both the training
set and the test set. Moreover, the LDA predicts diabetes more often than it
occurs in the data; that is, in the training set the LDA predicts 98 patients will
have diabetes although there are only 90 diabetes patients in the set. This is
similar for the test set. Thus, to compensate for this over-prediction of diabetes
patients and create a more robust prognosis, one possibility is to combine the two
models into an ensemble model.
Fig. 8.107 Rearranging the stream by connecting the model nuggets in series
5. To combine the two models into an ensemble model, we add the Ensemble node
to the stream and connect it to the logistic regression model nugget. We open the
node and choose “class_variable” as the target field for the ensemble model. See
Fig. 8.110. Additionally, we set the aggregation method of the ensemble to
“Average raw propensity”. When running this node now, the predictions,
namely the probabilities for the target classes, of the two models LDA and Logistic Regression are averaged, and the target class with the higher probability wins and is therefore predicted by the ensemble.

Fig. 8.108 Performance measures in the LDA and Logistic Regression model for the diabetes data
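The "Average raw propensity" aggregation can be sketched in a few lines of Python; the propensity values below are made up purely for illustration:

```python
def ensemble_average(propensities_a, propensities_b, threshold=0.5):
    """Average the raw propensities (probabilities for class 1) of two
    models and predict class 1 where the mean exceeds the threshold."""
    preds = []
    for pa, pb in zip(propensities_a, propensities_b):
        mean = (pa + pb) / 2.0
        preds.append(1 if mean >= threshold else 0)
    return preds

# hypothetical propensities of the LDA and the logistic regression model
lda   = [0.80, 0.40, 0.55]
logit = [0.60, 0.30, 0.35]
print(ensemble_average(lda, logit))  # -> [1, 0, 0]
```

Averaging in this way tends to smooth out the individual biases of the two models, which is exactly the compensation effect we are after here.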
For the averaging to work properly, we have to ensure that the propensities of
the models are calculated. This can be enabled in the model nuggets in the
Settings tab. See Fig. 8.111 for the LDA model nugget; the same setting is made analogously in the Logistic Regression nugget.
We add an Analysis node to the stream and connect it to the Ensemble node.
In Fig. 8.112, the accuracy and Gini of the ensemble model are displayed. We see that the Gini of the test set has increased by 0.004 points compared to the LDA (see Fig. 8.108). Furthermore, the accuracy in the training set and test set has slightly improved. In conclusion, the prediction power of the LDA improves slightly when it is combined with a Logistic Regression model within an ensemble model.

Fig. 8.112 Evaluation measures in the ensemble model, consisting of the LDA and Logistic Regression models
Fig. 8.113 Solution stream for the exercise of mining gene expression data and dimensional
reduction
Fig. 8.114 Template stream of the gene expression leukemia all data
Fig. 8.115 Selection of the subset that contains only ALL or healthy patients
Fig. 8.116 Distribution of the Leukemia variable within the selected subset
Figure 8.116 shows the distribution of the Leukemia variable in the subset. There are 207 patients in total, of which 73 are healthy and the rest suffer from ALL.
When building a classifier on this subset, the large number of input variables (851) relative to the number of observation records (207) is a problem: in such a situation, many classifiers tend to overfit.
We add a Discriminant node to the stream and connect it to the Type node. As
the roles of the variables are already defined in the Type node, the Modeler
automatically detects the target, input, and Partition fields. In the Model tab of
the Discriminant node, we further choose the stepwise variable selection method
before running the node. After the model nugget appears, we add an Analysis
node to the stream and connect it to the nugget. For the properties of the Analysis
node, we recall Sect. 8.2.5. The evaluation statistics, including accuracy and
Gini, are shown in Fig. 8.117. We see that the statistics are very good for the training set but clearly worse for the test set. The Gini shows this most plainly: in the test set (0.536) it is only about half as large as in the training set (1.0). This signals an overfitted model.
So, the LDA is unable to handle the gene expression data when the number of input variables is so disproportionate to the number of patients. A likely reason is that the huge number of features obscures the basic structure in the data.
2. We add a PCA node to the stream and connect it to the Type node in order to
consolidate variables and identify common structures in the data. We open the
PCA node and select all genomic position variables as inputs in the Fields tab.
See Fig. 8.118.
In the Model tab, we mark the "Use partitioned data" option in order to only
use the training data for the factor calculations. The method we intend to use is
the “Principal Component” method, which we also select in this tab. See
Fig. 8.119.
Fig. 8.117 Analysis output for the LDA on the raw gene expression data
Fig. 8.119 Definition of the extraction method and usage of only the training data to calculate the
factors
Fig. 8.120 Setup of the scatterplot to visualize factors 1 and 2, determined by the PCA
Now, we run the PCA node and add a Plot node to the appearing nugget. In the
Plot node, we select factor 1 and factor 2 as the X and Y field. Furthermore, we
define the “Leukemia” variable as a coloring and shape indicator, so the two
groups can be distinguished in the plot. See Fig. 8.120.
In Fig. 8.121, the scatterplot of the first two factors is shown. As can be seen,
the “ALL” patients and the “healthy” patients are concentrated in clusters and
the two groups can be separated by a linear boundary.
3. We add another Type node to the stream and connect it to the PCA model
nugget, so the following Discriminant node is able to identify the new variables
with its measurements. As just mentioned, we add a Discriminant node and
connect it to the new Type node. In the Fields tab of the Discriminant node, we select
“Leukemia” as the target, “Partition” as the partition variable and all 5 factors,
which were determined by the PCA, as input variables. See Fig. 8.122. In the
Model tab, we further choose the stepwise variable selection method before
running the stream.
4. We connect the model nugget of the second LDA to a new Analysis node and
choose the common evaluation statistics as the output. See Sect. 8.2.5. The
output of this Analysis node is displayed in Fig. 8.123. We immediately see
the accuracy, as well as the Gini, has improved for the test set, while the stats in
the training set have not decreased. Both values now indicate extremely good
separation and classification power, and the problem of overfitting has faded.
Fig. 8.122 Selection of the factors as inputs for the second LDA
Fig. 8.123 Analysis output for the LDA with factors as input variables
An explanation of why the LDA on the PCA calculated factors performs better
than on the raw data is already given in previous steps of this solution. The huge
number of input variables has supposedly hidden the basic structure that
separates the two patient groups, and the model trained on the raw data has
therefore overfitted. The PCA has now uncovered this basic structure and the
two groups are now linearly separable, as visualized in Fig. 8.121. Another
advantage of this dimensional reduction is an improvement in the time it takes to
build the model and predict the classes. Furthermore, less memory is needed to
save the data in this reduced form.
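The dimensional reduction step described above can also be sketched outside the Modeler. Below is a minimal NumPy illustration (our own sketch, not the PCA node's implementation) that fits principal components on the training partition only and projects both partitions onto the first five factors; the random matrix merely stands in for the gene expression data.

```python
import numpy as np

def pca_fit_transform(X_train, X_new, n_components):
    """Fit principal components on the training data only and project
    both datasets onto the first n_components directions."""
    mean = X_train.mean(axis=0)
    # SVD of the centered training matrix; rows of Vt are the components
    _, _, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
    W = Vt[:n_components].T                      # (p, k) loading matrix
    return (X_train - mean) @ W, (X_new - mean) @ W

rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 20))              # 50 "patients", 20 "genes"
X_test = rng.normal(size=(10, 20))
F_train, F_test = pca_fit_transform(X_train, X_test, n_components=5)
print(F_train.shape, F_test.shape)
```

Note that the test partition is projected with the mean and loadings learned from the training data, mirroring the "Use partitioned data" setting above.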
8.5 Support Vector Machine

In the previous two sections, we introduced the linear classifiers, namely, logistic
regression and linear discriminant analysis. From now on, we leave linear cases and
turn to the nonlinear classification algorithms. The first one in this list is the
Support Vector Machine (SVM), which is one of the most powerful and flexible
classification algorithms and is discussed below. After a brief description of the
theory, we attend to the usage of SVM within the SPSS Modeler.
8.5.1 Theory
The SVM is one of the most effective and flexible classification methods, and it can
be seen as a connection between linear and nonlinear classification. Although the
SVM separates the classes via a linear function, it is often categorized as a nonlinear
classifier due to the following fact: the SVM comprises a preprocessing step, in
which the data are transformed so that previously nonlinear separable data can now
be divided via a linear function. This transformation technique makes the SVM
applicable to a variety of problems, by constructing highly complex decision
boundaries. We refer to James et al. (2013) and Lantz (2013).
Fig. 8.124 Illustration of the decision boundary discovery and the support vectors
The SVM node of the SPSS Modeler offers four kernel functions:

• Linear
• Polynomial
• Sigmoid
• Radial basis function (RBF)
See Lantz (2013) for a more detailed description of the kernels. Among these, the RBF kernel is the most popular for SVMs, as it performs well on most types of data, so it is always a good starting point.
The right choice of the kernel guarantees robustness and a high accuracy in
classification. The kernel function and its parameters must be chosen carefully
however, as an unsuitable kernel can cause the model to overfit the training data.
For this reason, a cross-validation with the training set and test set is always
strongly recommended. Furthermore, an additional validation set can be used to
find the optimal kernel parameters (Sect. 8.2.1). See Schölkopf and Smola (2002)
and Lantz (2013) for additional information on the kernel trick.
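For readers who prefer to see the kernels as formulas, the four kernel types can be written in a few lines. The parameterization below follows the common LIBSVM-style convention with a gamma and bias parameter; the function names and defaults are our own illustration, not the SVM node's internals.

```python
import math

def linear(u, v):
    """Plain inner product of two feature vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))

def polynomial(u, v, degree=3, gamma=1.0, bias=0.0):
    """(gamma * <u, v> + bias) ** degree"""
    return (gamma * linear(u, v) + bias) ** degree

def sigmoid(u, v, gamma=1.0, bias=0.0):
    """tanh(gamma * <u, v> + bias)"""
    return math.tanh(gamma * linear(u, v) + bias)

def rbf(u, v, gamma=1.0):
    """exp(-gamma * ||u - v||^2); equals 1 for identical points."""
    sq_dist = sum((ui - vi) ** 2 for ui, vi in zip(u, v))
    return math.exp(-gamma * sq_dist)

u, v = [1.0, 2.0], [2.0, 0.0]
print(rbf(u, u))   # identical points -> 1.0
```

The RBF value shrinks with the squared distance between the points, which is why it adapts so flexibly to local structure in the data.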
In this section, we introduce the SVM node of the Modeler and explain how it is
used in a classification stream. The model we are building here refers to the sleep
detection problem described as a motivating example in Sect. 8.1.
The first part is dedicated to the feature calculation, and the model is built in the second part
of the stream.
The feature calculation is performed in R (R Core Team (2014)) via an R node.
Therefore, R has to be installed on the computer and included within the SPSS
Modeler. This process and the usage of the R node are explained in detail in
Chap. 9. We split the stream up into two separate streams, a feature calculation
stream and a model building stream. If one is not interested in the feature calcula-
tion, the first part can be skipped, as the model building process is described on the
already generated feature matrix. For the interested reader, we refer to Niedermeyer
et al. (2011) for detailed information on EEG signals, their properties, and analysis.
In this situation, the raw data has to be pre-processed and transformed into more
appropriate features on which the model is able to separate the data. This transfor-
mation of data into a more suitable form and generating new variables out of given
ones is common in data analytics. Finding new variables that will improve model
performance is one of the major tasks of data science. We will experience this in
Exercise 3 in Sect. 8.5.4, where new variables are obtained from the given ones,
which will increase the prediction power of a classifier.
Feature generation
1. We start the feature calculations by importing the EEG signals with a Var. File
node. The imported data then look like Fig. 8.126.
2. Now, we add an R Transform node, in which the feature calculations are then
performed. Table 8.6 lists the features we extract from the signals.
The first three features are called Hjorth parameters. They are true classic
statistical forms within signal processing and are often used for analytical
purposes, see Niedermeyer et al. (2011) and Oh et al. (2014).
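The Hjorth parameters have compact standard definitions: activity is the variance of the signal, mobility is the square root of the variance of the first derivative divided by the variance of the signal, and complexity is the mobility of the derivative divided by the mobility of the signal. The book performs the calculation in R; purely as an illustration, a NumPy sketch of these features, together with the mean-crossing count, could look as follows.

```python
import numpy as np

def hjorth(x):
    """Hjorth parameters of a 1-D signal segment:
    activity (variance), mobility, and complexity."""
    x = np.asarray(x, dtype=float)
    dx = np.diff(x)                          # discrete first derivative
    ddx = np.diff(dx)                        # discrete second derivative
    activity = x.var()
    mobility = np.sqrt(dx.var() / x.var())
    complexity = np.sqrt(ddx.var() / dx.var()) / mobility
    return activity, mobility, complexity

def mean_crossings(x):
    """Number of times the signal crosses its own mean; this counts the
    fluctuations around the mean, like the 'crossing0' feature."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    return int(np.sum(x[:-1] * x[1:] < 0))

t = np.linspace(0, 1, 1000, endpoint=False)
activity, mobility, complexity = hjorth(np.sin(2 * np.pi * 5 * t))
```

For a pure sine wave, the activity is 0.5 and the complexity is close to 1, since the derivative of a sinusoid is again a sinusoid of the same frequency.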
Fig. 8.128 The R Transform node in which the feature calculations are declared
We open the R Transform node and include in the R Transform syntax field
the R syntax that calculates the features. See Fig. 8.128 for the node and
Fig. 8.129 for the syntax inserted into the R node. The syntax is thereby
displayed in the R programming environment RStudio, RStudio Team (2015).
The syntax is provided under the name “feature_calculation_syntax.R”.
Fig. 8.129 R syntax that calculates the features and converts them back to SPSS Modeler format,
displayed in RStudio
Now we will explain the syntax in detail. The data inserted into the R
Transform node for manipulation is always named “modelerData”, so that the
SPSS Modeler can identify the input and output data of the R nodes, when the
calculations are complete.
Row 1: In the first row, the modelerData are assigned to a new variable named
“old_variable”.
Rows 3 + 4: Here, the signal data (the first 3000 columns) and the sleep
statuses of the signal segments are assigned to the variables “signals” and
“sleepiness”.
Row 6: Here, a function is defined that calculates the mobility of a signal
segment x.
Rows 8–15: Calculation of the features. These, together with the sleepiness
states, are then consolidated in a data.frame, which is in an R-matrix format.
This data.frame is then assigned to the variable “modelerData”. Now the Mod-
eler can further process the feature matrix, as the variable “modelerData” is
normally passed onto the next node in the stream.
Rows 17–24: So that the data can be processed correctly, the SPSS
Modeler must know the fields and their measurement types, and so the fields
in the “modelerData” data.frame have to be specified in the data.frame variable
“modelerDataModel”. The storage type is defined for each field, which is “real”
for the features and “string” for the sleepiness variable.
3. We add a Data Audit node to the R Transform node, to inspect the new
calculated features. See Fig. 8.130 for the distributions and statistics of the
feature variables.
4. To save the calculated features, we add the output node Flat File to the stream
and define a filename and path, as well as a column delimiter and the further
structure of the file.
By preprocessing the EEG signals, we are now able to build an SVM classifier
that separates sleepiness states from awake states.
1. We start by importing the data in the features_eeg_signal.csv file with a Var. File
node. The statistics obtained with the Data Audit node can be viewed in
Fig. 8.130. Afterwards, we add a Partition node to the stream and allot 70 %
of the data as a training set and 30 % as a test set. See Sect. 2.7.7 for how to use
the Partition node.
2. Now, we add the usual Type node to the stream and open it. We have to make
sure that the measurement type of the “sleepiness variable”, i.e., the target
variable, is “nominal” or “flag” (see Fig. 8.131).
3. We add an SVM node to the stream and connect it to the Type node. After
opening it, we declare the variable “sleepiness” as our target in the Fields tab,
“Partition” as the partitioning variable and the remaining variable, i.e., the
calculated features listed in Table 8.6, are declared as input variables (see
Fig. 8.132).
4. In the Model tab, we enable the “Use partitioned data” option so that cross-
validation is performed. In other words, the node will only use the training data
to build and the test set to validate the model (see Fig. 8.133).
5. In the Expert tab, the kernel function and parameters for use in the training
process can be defined. By default, the simple mode is selected (see Fig. 8.134).
This mode utilizes the RBF kernel function, with its standard parameters for
building the model. We recommend using this option if the reader is unfamiliar
with the precise kernels, their tuning parameters, and their properties. We
proceed with the simple mode in this description of the SVM node.
If one has knowledge and experience with the different kernels, then the
kernel function and the related parameters can be specified in the expert mode
(see Fig. 8.135). We omit a precise description of the kernel parameters and refer
interested readers to Lantz (2013) and Schölkopf and Smola (2002).
6. Now, we run the stream and the model nugget appears.
7. We connect the nugget to an Analysis node and an Evaluation node. See
Sect. 8.2.5 and Fig. 8.28 for the settings of these nodes. Figures 8.136 and
8.137 display the evaluation statistics and the ROC of the training set and test
set for the SVM. We note that accuracy in both the training and test sets is very
high (over 90 %), and the Gini is also of high value in both cases. This is
visualized by the ROC curves. In conclusion, the model is able to separate
sleepiness from an awake state.
Compared with other model nuggets, the SVM model nugget provides very few statistics and goodness-of-fit measures. The only statistic the SPSS Modeler offers in the SVM nugget is the predictor importance view (see Fig. 8.138). For
the sleep detection model, the “crossing0” feature is the most important variable,
followed by “Range”, “Complexity”, and “Activity”. The importance of x-axis
crossings suggests that the fluctuation around the mean of the signal is an indicator
of being asleep or being awake.
8.5.4 Exercises
1. Import the data and familiarize yourself with the gene expression data. How
many different types of leukemia are there, and how often do they occur in the
dataset?
2. Unite all leukemia types into a new variable value that just indicates the patient
has cancer. How many patients have cancer and how many are healthy?
3. Build an SVM classifier that can differentiate between a leukemia patient and a
non-leukemia patient. What is the accuracy and Gini value for the training set
and the test set? Draw an ROC to visualize the Gini.
1. Import the Titanic data and inspect each variable. What additional information
can be derived from the data?
2. Create three new variables that describe the deck of the passenger’s cabin,
his/her (academic) title, and the family size. The deck can be extracted from
the cabin variable, as it is the first symbol in the cabin code. The passenger’s title
can be derived from the name variable, it is located between the symbols “,” and
“.” in the name variable entries. The family size is just the sum of the variables
“sibsp” and “parch”. Use the Derive node to generate these new variables. What
are the values and frequencies of these variables? Adjoin the values that only
occur once with other similar values.
3. Use the Auto Data Prep node to normalize the continuous variables and replace
the missing data.
4. Build four classifiers with the SVM node to retrace the prediction accuracy and
performance when adding new variables. So, the first model is based on the
original variables, the second model includes the deck variable, and the third
also comprises the title of the passengers. Finally, the fourth model also takes the
family size of the passenger as an input variable.
5. Determine the most important input variables for each of the models. Are the
new variables (deck, title, family size) relevant for the prediction? Compare the
accuracy and Gini of the models. Have the new variables improved the predic-
tion power? Visualize the change in model fitness by plotting the ROC of all four
models with the Evaluation node.
8.5.5 Solutions
The stream displayed in Fig. 8.139 is the complete solution of this exercise.
Fig. 8.141 Data Audit node for the gene expression data
2. To assign all four leukemia types (AML, ALL, CLL, CML) to a joint “cancer”
class, we add a Reclassify node to the stream. In the node options, select the
variable “Leukemia” as the reclassify field and click on the “Get” button, which
will load the variable categories as original values. In the New value fields of the
four leukemia types, we put “Cancer” and “healthy” for the non-leukemia entry
(see Fig. 8.143). At the top, we then choose the option "reclassify into existing field", to overwrite the original values of the "Leukemia" variable with the new values.

Fig. 8.143 Reclassification node that assigns the common variable label "Cancer" to all leukemia types
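Conceptually, the Reclassify node applies a simple value mapping. A hypothetical Python sketch of the same step:

```python
# Map all four leukemia types onto the joint class "Cancer";
# values not in the mapping (here "healthy") are kept as-is.
RECLASSIFY = {"AML": "Cancer", "ALL": "Cancer",
              "CLL": "Cancer", "CML": "Cancer"}

def reclassify(values, mapping=RECLASSIFY):
    """Replace each value by its mapped class, leaving unmapped values."""
    return [mapping.get(v, v) for v in values]

print(reclassify(["AML", "healthy", "CML"]))
# -> ['Cancer', 'healthy', 'Cancer']
```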
Now, we add a Distribution node to the stream and connect it to the Reclassify
node. In the graph options, we choose “Leukemia” as the field and run the
stream. Figure 8.144 shows the distribution of the new assigned values of the
“Leukemia” variable. 94.27 % of the patients in the dataset have leukemia,
whereas 5.73 % are healthy.
3. Before building the SVM, we add another Type node to the stream and define in
it the target variable, i.e., the “Leukemia” field (see Fig. 8.145). Furthermore, we
set the role of the “Patient_ID” field to None, as this field is just the patients
identifier and irrelevant for predicting leukemia.
Now, we add the SVM node to the stream. As with the previous definitions
of the variable roles, the node automatically identifies the target, input, and
partitioning variables. Here, we choose the default settings for training the SVM.
Hence, nothing has to be changed in the SVM node. We run the stream and the
model nugget appears.
To evaluate the model performance, we add an Analysis node and Evaluation
node to the stream and connect it to the nugget. The options in these nodes are
described in Sect. 8.2.5 and Fig. 8.28. After running these two nodes, the
goodness of fit statistics and the ROC pop up in a new window. These are
displayed in Figs. 8.146 and 8.147. We see that SVM model accuracy is pretty
high in both the training set and the test set, and the Gini also indicates good
prediction ability. The latter is visualized by the ROC in Fig. 8.147.
Although these statistics look pretty good, we wish to point out one fact that could
be seen as a little drawback of this model. When just looking at the prediction
performance of the “Healthy” patients, we note that only 11 out of 18 predictions
are correct in the test set. This is just 61 % correctness in this class, compared to
96 % overall accuracy. This could be caused by the high imbalance of the target
classes, and the SVM may slightly favor the majority class, i.e., “Cancer”. One
should keep that in mind when working with the model.
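The per-class view can be reproduced directly from the coincidence matrix. In the sketch below, the "Healthy" row uses the counts from the text (11 of 18 correct); the "Cancer" row contains made-up counts purely for illustration.

```python
def per_class_recall(matrix):
    """Recall per class from a coincidence (confusion) matrix given as
    {true_class: {predicted_class: count}}."""
    recall = {}
    for cls, row in matrix.items():
        total = sum(row.values())
        recall[cls] = row.get(cls, 0) / total
    return recall

matrix = {"Healthy": {"Healthy": 11, "Cancer": 7},   # 11 of 18 correct
          "Cancer": {"Healthy": 5, "Cancer": 290}}   # illustrative counts
print(per_class_recall(matrix)["Healthy"])
```

Looking at recall per class, rather than overall accuracy alone, is what exposes the weakness on the minority class here.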
Fig. 8.147 ROC of the training set and test set of the SVM leukemia classifier
Fig. 8.148 Complete stream of the leukemia detection classifier with sigmoid and RBF kernel
The stream displayed in Fig. 8.148 is the complete solution of this exercise.
1. We follow the first part of the solution to the previous exercise, and open the
template stream “019 Template-Stream gene_expression_leukemia_short”, as
seen in Fig. 8.140, and save it under a different name. This template already
contains a Partition node, which splits the data into a training set and a test set in
the usual ratio. We add a Distribution node to the Type node, to display the
frequencies of the leukemia types. This can be viewed in Fig. 8.142.
In the Type node, we set the roles of the variables as described in step 3 of the
previous solution. See also Fig. 8.145. Now, the SVM nodes can automatically
identify the target and input variables.
2. We add an SVM node to the stream and connect it to the Type node. As we
intend to use a sigmoid kernel, we open the model building node and go to the
Expert tab. There, we enable the expert mode and choose sigmoid as the kernel
function. See Fig. 8.149. Here, we work with the default parameter settings, i.e., Gamma is 1 and Bias equals 0. See IBM (2015a) and Lantz (2013) for the meaning and
influence of the parameters. Now, we run the stream.
3. We add an Analysis node to the model nugget and set the usual evaluation
statistics calculations. Note that the target variable is multinomial here, and no
Gini or AUC can be calculated. See Sect. 8.2.5. We run the Analysis node
and inspect the output statistics in Fig. 8.150. We observe that the training set
and the test set are about 72–73 % accurate, which is still a good value. When
looking at the coincidence matrix however, we see that the model only predicts
the majority classes AML and CLL. The minority classes ALL, CML, and
Non-leukemia are neglected and cannot be predicted by the SVM model with
a sigmoid kernel. Although the accuracy is quite good, the sigmoid kernel
Fig. 8.149 Choosing the sigmoid kernel type in the SVM node
Fig. 8.150 Accuracy and coincidence matrix in the SVM with sigmoid kernel
Fig. 8.151 Accuracy and coincidence matrix of the SVM with RBF kernel type
second model describes the data better and is therefore preferred to the sigmoid
one. This is an example of how the wrong choice of kernel function can lead to
corrupted and inadequate models.
The stream displayed in Fig. 8.152 is the complete solution to this exercise. The
stream contains commentaries that point to its main steps.
Fig. 8.152 Complete stream of the Titanic survival prediction stream with new feature
generation
The sex of the passenger is already one of the variables, as a woman is more likely to survive a sinking ship than a man. There are further differences in survival indicators, e.g., masters are normally rescued before their servants. Furthermore, the probability of survival can differ between married and unmarried passengers; a married woman, for example, may refuse to leave her husband on the ship. This information is hidden in the name variable: there, the civil status, academic status, or aristocratic title of the person is located after the "," (Fig. 8.155).
Furthermore, when thinking of the chaotic situation on a sinking ship, families
are separated, some get lost and fall behind and passengers are looking for
their relatives. Thus, it is reasonable to assume that the family size can have
an influence on the survival probability. The two variables “sibsp” and “parch”
describe the number of siblings/spouses and parents/children. So, the sum of
these two variables gives the number of relatives that were traveling with the
passenger.
2. Figure 8.156 shows the stream to derive these three new variables “deck”,
“title”, and “family size”. In Fig. 8.152, the complete stream is joined together
into a SuperNode.
First, we add a Derive node to the Type node, to extract the deck from the
cabin number. In this node, we set the field type to nominal, since the values are
letters. The formula to extract the deck can be seen in Fig. 8.157. If the cabin
number is present, then the first character is taken as the deck, and otherwise, the
deck is named with the dummy value “Z”. In Fig. 8.158, the distribution of the
new deck variable is displayed.
Next, we extract the title from the name variable. For that purpose, we add
another Derive node to the stream, open it, and choose “nominal” as the field
type once again. The formula that dissects the title from the name can be seen in
Fig. 8.159. The title is located between the characters “,” and “.”. Therefore, the
locations of these two characters are established with the “locchar” statement,
and then the sub-string between these two positions is extracted. Figure 8.160
visualizes the occurrence of the titles in the names.
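The three Derive steps boil down to simple string and arithmetic operations. A Python sketch of the same logic (the Modeler uses CLEM formulas instead, e.g., the "locchar" function):

```python
def deck(cabin):
    """First character of the cabin code, or the dummy 'Z' if missing."""
    return cabin[0] if cabin else "Z"

def title(name):
    """Substring between ',' and '.' in entries like 'Braund, Mr. Owen'."""
    return name[name.index(",") + 1 : name.index(".")].strip()

def family_size(sibsp, parch):
    """Number of relatives traveling with the passenger."""
    return sibsp + parch

print(deck("C85"))                          # -> 'C'
print(title("Cumings, Mrs. John Bradley"))  # -> 'Mrs'
print(family_size(1, 2))                    # -> 3
```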
When reviewing Figs. 8.158 and 8.160, we note that there are some values of
the new variables, “deck” and “title”, that occur uniquely, in particular in the
“title” variable. We assign these single values a similar, more often occurring
value. To do that, we first add a Type node to the stream and click on the “read
value” button, so the values of the two new variables are known in the
succeeding nodes. Now, we add a Reclassify node to the stream and open
it. We choose the deck variable as our reclassify field and check the option "reclassify into existing field". The latter ensures that the original values are overwritten and no
new field is created. Then, we click on the “Get” button to get all the values of
the deck variable. Lastly, we put all the existing values as new values, except for
the value “T” which is assigned to the “Z” category. See Fig. 8.161.
We proceed similarly with the title variable. We add a Reclassify node to the
stream, select “Title” as the reclassify field, and make sure that the values are
overwritten by the new variables and no additional variable is created. We click
on “Get” and assign the new category to the original values. Thereby, the
following values were reclassified:
3. We add an Auto Data Prep node to the stream and select the standard options for
the data preparation, i.e., replacement of the missing values with the mean and
mode, and performing of a z-transformation for the continuous variables.
See Fig. 8.165 and Sect. 2.7.6 for additional information on the Auto Data
Prep node. After running the Auto Data Prep node, we add another Type node
to the stream, in order to determine all the variable values, and we define the "survived" variable as our target variable and set its measurement type to "Flag".
4. Now, we add four SVM nodes to the stream and connect them all to the last Type
node. We open the first one, and in the Fields tab we define the variable
“survived” as the target and the following variables as input: “sibsp_transformed”,
“parch_transformed”, “age_transformed”, “fare_transformed”, “sex_transformed”,
“embarked_transformed”, “pclass_transformed”. Furthermore, we put “Parti-
tion” as the partition field. In the Analyze tab, we then enable the predictor
importance calculations. See Fig. 8.166.
We proceed with the other three SVM nodes in the same manner, but add
successively the new established variables “deck”, “Title”, and “famSize”. We
then run all four SVM nodes and rearrange the appearing model nuggets by
connecting them into a series. See Fig. 8.167, where the alignment of the model
nuggets is displayed.
5. We open the model nuggets one after another to determine the predictor impor-
tance. These are displayed in Figs. 8.168, 8.169, 8.170, and 8.171. We observe
that in the model with only the original variables as input, the “sex” variable is
the most important for survival prediction, followed by “pclass” and
“embarked.”
If the “deck” variable is considered as an additional input variable, the
importance of the “sex” is reduced, but this variable is still the most important
one. The second most important variable for the prediction is the new variable
"deck", however. This means that this new variable describes a new aspect of the data.

Fig. 8.166 Selection of the variable roles in the SVM node of the model without new established features
When the “Title” variable is also included in the SVM model, it becomes the
most important one, even before the “sex” variable. This could be due to the fact
that the Title variable describes the civil standing of a passenger as well as their
gender and therefore contains more information.
The "famSize" variable, in contrast, has the least predictor importance in the model that includes all variables. See Fig. 8.171.
Fig. 8.168 Variable importance in the SVM with no new features included
Fig. 8.169 Variable importance in the SVM with the deck variable included
Fig. 8.170 Variable importance in the SVM, with deck and title included
Fig. 8.171 Variable importance in the SVM, with “deck,” “Title,” and “famSize” included
Fig. 8.173 Evaluation statistics calculated by the Analysis node for the four Titanic survival
SVM classifiers
We now add a Filter node to the stream and connect it to the last model nugget. This is done only to rename the predictor fields. See Fig. 8.172.
We then add the Analysis and Evaluation nodes to the stream and connect
them to the Filter node. See Sect. 8.2.5 and Fig. 8.28 for the options in these
nodes. When inspecting the evaluation statistics from the Analysis node
(Fig. 8.173), we observe that accuracy as well as the Gini increase successively
in the training set and test set. There is just one exception in the test set statistics.
When adding the “deck” variable, the accuracy and Gini are both a bit lower than
when we exclude the "deck". All in all, however, the newly generated features improve the prediction performance: the Gini rises from 0.723 to 0.737 in the test set and from 0.68 to 0.735 in the training data. This improvement
is visualized by the ROC of the four classifiers in Fig. 8.174. There, the model
including all new generated variables lies above the other ones.
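The Gini measure used throughout this chapter is commonly related to the area under the ROC curve (AUC) by Gini = 2·AUC − 1. The following sketch illustrates how such a value arises; the scores and labels are made up for illustration, not taken from the Titanic stream.

```python
def roc_points(scores, labels):
    """ROC points (FPR, TPR) obtained by sweeping a threshold over the scores."""
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    pos = sum(labels)
    neg = len(labels) - pos
    tp = fp = 0
    pts = [(0.0, 0.0)]
    for _, y in pairs:
        if y == 1:
            tp += 1
        else:
            fp += 1
        pts.append((fp / neg, tp / pos))
    return pts

def auc(pts):
    """Trapezoidal area under the ROC curve."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Hypothetical classifier scores and true classes (1 = survived)
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 0, 1, 0]
a = auc(roc_points(scores, labels))
gini = 2 * a - 1  # Gini = 2 * AUC - 1
```

For these made-up scores the AUC is 0.75 and the Gini therefore 0.5; a perfect classifier would reach a Gini of 1, a random one a Gini of 0.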
8.6 Neural Networks
Neural networks (NN) are inspired by the functionality of the brain. They also
consist of many connected units that receive multiple inputs from other units,
process them, and pass new information on to yet other units. This network of
units simulates the processes of the brain in a very basic way. Due to the
relationship to the brain, the units are also called neurons, hence, neural
network. An NN is
a black box algorithm, just like the SVM, since the structure and mechanism of
data transformation and the transfer between neurons are complex and
unintuitive. The results of an NN are difficult to retrace and therefore hard to
interpret. On the other hand, its complexity and flexibility make the NN one of
the most powerful and universal classifiers, which can be applied to a variety of
problems where other methods, such as rule-based ones, would fail. In the first
section, we briefly describe the theoretical background of an NN by following
Lantz (2013), before proceeding to look at its utilization in the SPSS Modeler.
8.6.1 Theory
Each neuron computes a weighted sum of its inputs and passes it through an
activation function φ, i.e., it outputs φ(ω0x0 + ω1x1 + … + ωnxn), where x0 = 1.
The input x0 is added as a constant in the sum and is often called a bias. The
purpose of the weights is to regulate the contribution of each input signal to the
sum. Since every neuron has multiple inputs with different weights, this gives
huge flexibility in tuning the inputs individually for each neuron. This is one of
the biggest strengths of the NN, qualifying its application to a variety of complex
classification problems. The weights are not interpretable in their contribution to
the results, however, due to the complexity of the network.
The activation function φ is typically the sigmoid function or the hyperbolic
tangent function; the latter is used in the SPSS Modeler, see IBM (2015a). A
linear function is also plausible, but a function that is nearly linear in a
neighborhood of 0 and nonlinear towards the limits, as the two above-mentioned
functions are, is a more suitable choice, since both situations can be modeled
with these kinds of functions.
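The behavior of these two activation functions, near-linear around 0 and saturating towards the limits, can be checked numerically; a small sketch:

```python
import math

def sigmoid(x):
    # Logistic (sigmoid) function: maps any real input into (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

# math.tanh maps inputs into (-1, 1); per IBM (2015a) it is the
# activation function used by the SPSS Modeler's MLP
mid = sigmoid(0)        # 0.5: the midpoint of the sigmoid
near = math.tanh(0.1)   # ~0.0997: nearly linear close to 0
sat = sigmoid(10)       # ~0.99995: saturation towards the limits
```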
outputs the score for this category. Between the input and output layers, there
can be one or multiple hidden layers. The neurons of these layers get the data
from the neurons of the previous layer and process them as described above. The
manipulated data are then passed on to the neurons of the next hidden layer or
the output layer.
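The layered data flow just described can be sketched for a toy network with two inputs, one hidden layer of three tanh neurons, and one sigmoid output neuron. All weights here are arbitrary illustration values, not weights estimated by the Modeler.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def neuron(inputs, weights, activation):
    # weights[0] is the bias weight for the constant input x0 = 1
    z = weights[0] + sum(w * x for w, x in zip(weights[1:], inputs))
    return activation(z)

# One hidden layer with three tanh neurons; each row is [bias, w1, w2]
hidden_weights = [[0.1, 0.4, -0.6],
                  [-0.3, 0.8, 0.2],
                  [0.05, -0.5, 0.7]]
output_weights = [0.2, 1.0, -1.0, 0.5]  # bias + one weight per hidden neuron

x = [1.0, 0.5]                                   # input layer values
hidden = [neuron(x, w, math.tanh) for w in hidden_weights]
score = neuron(hidden, output_weights, sigmoid)  # output score in (0, 1)
```

Each hidden neuron forms a weighted sum of the inputs and squashes it with tanh; the output neuron does the same with the hidden activations and the sigmoid, yielding a category score between 0 and 1.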
A neural network (NN) can be trained with the Neural Network node in the SPSS
Modeler. We now present how to set up the stream for an NN with this node, based
on the digits recognition data, which comprises data on handwritten digits from
different people. The goal now is to build a classifier that is able to identify the
correct digit from an arbitrary person’s handwriting. These classifiers are already in
use in many areas, as described in Sect. 8.1.
The stream consists of two parts, the training and the validation of the model.
We therefore split the description into two parts as well.
Training of an NN
Here, we describe how to build the training part of the stream. This is displayed in
Fig. 8.177.
Fig. 8.177 Training part of the stream, which builds a digits identification classifier
Fig. 8.179 Type node of the digits data and assignment of the target field and values
Fig. 8.181 Definition of the target and input variables in the Neural Network node
In the Basic options, the type of the network, with its activation function and
topology, has to be specified. The Neural Network node provides the two
network models MLP and RBF, see Sect. 8.6.1 for a description of these two
model types. We choose the MLP, which is the default setting and the most
common one. See Fig. 8.183. The number of hidden layers and the unit sizes can
be specified here too. Only networks with a maximum of two hidden layers can
be built with the Neural Network node, however. Furthermore, the SPSS Modeler
provides an algorithm that automatically determines the number of layers and
units. This option is enabled by default, and we accept it. See the bottom arrow in
Fig. 8.183. We should point out that automatic determination of the network
topology is not always optimal, but it is a good choice to start with.
With the next options, the stopping rules of network training can be defined.
Building a neural network can be time and resource consuming. Therefore, the
SPSS Modeler provides a couple of possibilities for terminating the training
process at a specific time. These include a maximum training time, a maximum
number of iterations of the coefficient estimation algorithm, and a minimum
accuracy. The latter can be set if a particular accuracy is considered to provide
adequate prediction power. We choose the default settings and fix a maximum
processing time of 15 min. See Fig. 8.184.
In the Ensemble options, the aggregation function and number of models in
the ensemble can be specified. See Fig. 8.185. These options are only relevant if
an ensemble model is trained.
The available aggregation options for a categorical target, as in classifiers, are
listed in Table 8.7. For additional information on ensemble models, as well as
boosting and bagging, we refer to Sect. 5.3.6.
In the Advanced option view, the size of the overfitting prevention set can be
specified; 30 % is the default setting. Furthermore, an NN is unable to handle
missing values. Therefore, a missing values handling tool should be specified.
The options here are the deletion of data records with missing values or the
replacement of missing values. For continuous variables, the average of the
minimum and maximum values is imputed, while for categorical fields the most
frequent category is used. See Fig. 8.186 for the Advanced option view of the
Neural Network node.
Fig. 8.183 Definition of the network type and determination of the layer and unit number
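The replacement rules just mentioned can be mimicked outside the Modeler as follows; this is a rough sketch of the same idea, not the node's internal implementation: continuous fields receive the midrange (average of minimum and maximum), categorical fields the modal category.

```python
def impute_continuous(values):
    # Replace missing values (None) by the average of the observed min and max
    present = [v for v in values if v is not None]
    fill = (min(present) + max(present)) / 2.0
    return [fill if v is None else v for v in values]

def impute_categorical(values):
    # Replace missing values (None) by the most frequent observed category
    present = [v for v in values if v is not None]
    fill = max(set(present), key=present.count)
    return [fill if v is None else v for v in values]

# Hypothetical passenger fields with missing entries
age = impute_continuous([22.0, None, 58.0, 30.0])      # midrange = 40.0
deck = impute_categorical(["A", "C", None, "C", "B"])  # mode = "C"
```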
In the Model Options tab, the usual calculation of predictor importance can be
enabled, which we do in this example. See Fig. 8.187.
4. The option settings for the training process are now complete, and we run the
stream, thus producing the model nugget. The model nugget, with its graphs and
statistics, is explained in the subsequent Sect. 8.6.3.
5. We now add an Analysis node to the stream, connect it to the model nugget, and
enable the calculation of the coincidence matrix. See Sect. 8.2.5 for a description
of the Analysis node. The output of the Analysis node can be viewed in
Fig. 8.188. We see that accuracy is extremely high, with a recognition rate of
over 97 % on handwritten digits. In the coincidence matrix, we can also see that
the prediction is very precise for all digits. In other words, there is no digit that
falls significantly behind in the accuracy of the prediction by the NN.
Fig. 8.186 Overfit prevention and the setting of missing values handling
Fig. 8.187 The predictor importance calculation is enabled in the Model Option tab
Fig. 8.188 Analysis node statistics for the digits training and NN classifier
Fig. 8.189 Validation part of the stream, which builds a digits identification classifier
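The accuracy and coincidence matrix reported by the Analysis node can be reproduced from pairs of observed and predicted labels; the digit labels below are made up for illustration, not taken from the stream output.

```python
from collections import Counter

def coincidence_matrix(observed, predicted):
    # Counts of (observed, predicted) pairs, as tabulated by the Analysis node
    return Counter(zip(observed, predicted))

def accuracy(observed, predicted):
    hits = sum(o == p for o, p in zip(observed, predicted))
    return hits / len(observed)

obs = [3, 3, 5, 1, 5, 3, 1]
pred = [3, 3, 5, 1, 3, 3, 1]  # one 5 misread as a 3
acc = accuracy(obs, pred)
matrix = coincidence_matrix(obs, pred)
```

Here six of seven digits are classified correctly, and the off-diagonal entry (5, 3) of the matrix records the single misclassification.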
Validation of the NN
Now, we validate the trained model from part one with a test set. Figure 8.189
shows the stream.
Since the validation of a classifier is part of the modeling process, we continue
the enumeration of the model training here.
6. First, we copy the model nugget and paste it into the stream canvas. Afterwards,
we connect the new nugget to the Type node of the stream segment that imports
the test data (“optdigits_test.csv”).
Fig. 8.190 Analysis node statistics for the digits test set and NN classifier
7. We then add another Analysis node to the stream and connect it to the new
nugget. After setting the option in the Analysis node, see Sect. 8.2.5, we run it
and the validation statistics open in another window. See Fig. 8.190 for the
accuracy statistics of the test set. We observe that the NN still predicts the digits
very precisely, with over 94 % accuracy, without neglecting any digit. Hence, we
see that the digits recognition model is applicable to independent data and can
identify digits from unfamiliar handwriting.
8.6.3 The Model Nugget
In this section, we introduce the contents of the Neural Network model nugget.
All graphs and statistics from the model are located in the Model tab, which is
described in detail below.
Model summary
The first view in the Model tab of the nugget displays a summary of the trained
network. See Fig. 8.191. There, the target variable and model type, here “Digit” and
MLP, are listed as well as the number of neurons in every hidden layer that was
included in the network structure. In our example, the NN contains one hidden layer
with 18 neurons. This information on the number of hidden layers and neurons is
particularly useful when the SPSS Modeler has determined them automatically.
Furthermore, the reason for stopping is displayed. This is important to know, as
termination due to time limits or overfitting, instead of an “Error cannot be
further decreased” stop, means that the model is not optimal and could be
improved by adjusting parameters or allowing a longer run-time.
Basic information on the accuracy of the NN on the training data is displayed
below. Here, the handwritten digits NN classifier has a 97.9 % accuracy. See
Fig. 8.191.
Predictor importance
The next view displays the importance of the input variables in the NN. See
Fig. 8.192. This view is similar to the one in the Logistic node, and we refer to
Sect. 8.3.4 for a description of this graph. At the bottom of the graph there is a
sliding regulator, where the number of displayed input fields can be selected. This is
convenient when the model includes many variables, as in our case with the digits
data. We see that nearly all fields are equally important for digit identification.
Coincidence matrix
In the classification view, the predicted values against the original values are
displayed in a heat map. See Fig. 8.193. The background color intensity of a cell
thereby correlates with its proportion of cross-classified data records. The entries on
the matrix can be changed at the bottom, see arrow in Fig. 8.193. Depending on the
selected option, the matrix displays the percentage of correctly identified values for
each target category, the absolute counts, or just a heat map without entries.
Network structure
The Network view visualizes the constructed neural network. See Fig. 8.194. This
can be very complex with a lot of neurons in each layer, especially in the input
layer. Therefore, only a portion of the input variables can be selected, e.g., the most
important variables, by the sliding regulator at the bottom. Furthermore, the align-
ment of the drawn network can be changed at the bottom, from horizontal to vertical
or bottom to top orientation. See left arrow in Fig. 8.194.
Besides the structure of the network, the estimated weights or coefficients within
the NN can be displayed in network form. See Fig. 8.195. One can switch between
these two views with the select list. See bottom right arrow in Fig. 8.194. Each
connecting line of the coefficients network represents a weight, which is displayed
when the mouse cursor is moved over it. Each line is also colored: darker tones
indicate a positive weight, and lighter tones a negative weight.
8.6.4 Exercises
1. Import the chess endgame data with a proper Source node and reclassify the
“Result for White” variable into a binary field that indicates whether white wins
the game or not. What is the proportion of “draws” in the dataset?
2. Train a neural network with 70 % of the chess data and use the other 30 % as a
test set. What are the accuracy and Gini values for the training set and test set?
Does the classifier overfit the training set?
3. Build an SVM and a Logistic Regression model on the same training data. What
are the accuracy and Gini values for these models? Compare all three models
with each other and plot the ROC for each of them.
Exercise 2 Credit rating with a Neural network and finding the best network
topology
The “tree_credit” dataset (see Sect. 10.1.33) comprises demographic and historical
loan data from bank customers, as well as a prognosis on credit worthiness (“good”
or “bad”). In this exercise, a neural network has to be trained that decides if the bank
should give a certain customer a loan or not.
1. Import the credit data and set up a cross-validation scenario with training, test,
and validation sets, in order to compare two networks with different topologies
from each other.
2. Build a neural network to predict the credit rating of the customers. To do this,
use the default settings provided by the SPSS Modeler and, in particular, the
automatic hidden layer and unit determination method. How many hidden layers
and units are included in the model and what is its accuracy?
3. Build a second neural network with custom-defined hidden layers and
units. Try to improve the performance of the automatically determined network.
Is there a set-up with a higher accuracy on the training data and what is its
topology?
4. Compare the two models, automatic and custom determination of the network
topology, by applying the trained models to the validation and test sets. Identify
the Gini values and accuracy for both models. Draw the ROC for both models.
φ(x) = 1 / (1 + e^(−x)).
The task now is to assign values to the weights ω0, ω1, ω2, such that
φ(x) ≈ 1, if x1 = 1 and x2 = 1,
φ(x) ≈ 0, otherwise.
A proper solution for the AND operator is shown in Fig. 8.196. When looking at the
four possible input values and calculating the output of this NN we get
x1  x2  φ(x)
0   0   φ(−200) ≈ 0
1   0   φ(−50) ≈ 0
0   1   φ(−50) ≈ 0
1   1   φ(100) ≈ 1
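The truth table can be verified numerically. The weights below (ω0 = −200, ω1 = 150, ω2 = 150) are one assignment consistent with the sums shown above; Fig. 8.196 may display different values.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Assumed weights reproducing the sums -200, -50, -50, 100 in the table
w0, w1, w2 = -200.0, 150.0, 150.0

def and_neuron(x1, x2):
    return sigmoid(w0 + w1 * x1 + w2 * x2)

# Rounding the sigmoid output yields exactly the logical AND
truth = {(x1, x2): round(and_neuron(x1, x2)) for x1 in (0, 1) for x2 in (0, 1)}
```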
Construct a simple neural network for the logical OR and NOT operators, by
proceeding in the same manner as just described for the logical AND. Hint: the
logical OR is not exclusive, which means the output of the network has to be
nearly 0 if and only if both input variables are 0.
8.6.5 Solutions
1. First, we import the dataset with the File Var. File node and connect it to a Type
node. We then open the latter and click the “Read Values” button, to make sure
the following nodes know the values and types of the variables in the data.
Afterwards, we add a Reclassify node and connect it to the Type node. In the
Reclassify node, we select the “Result for White” field, since we intend to
change its values. By clicking on the “Get” button, the original values appear
and can be assigned to another value. In the New value column, we write “win”
next to each original value that represents a win for white, i.e., those which are
not a “draw”. The “draw” value, however, remains the same in the newly
assigned variable. See Fig. 8.198.
We now connect a Distribution node to the Reclassify node, to inspect how
often a “win” or “draw” occurs. See Sect. 3.2.2 for how to plot a bar plot with the
Distribution node. In Fig. 8.199, we observe that a draw occurs about 10 % of the
time in the present chess dataset.
2. We add another Type node to the stream after the Reclassify node, and set the
“Result for White” variable as the target field and its measurement type to
“Flag”, which ensures a calculation of the Gini values for the following
classifiers. See Fig. 8.200.
We now add a Partition node to the stream to split the data into a training set
(70 %) and test set (30 %). See Sect. 2.7.7 for how to perform this step in the
Partition node. Afterwards, we add a Neural Network node to the stream and
connect it to the Partition node. Since the target and input variables are defined in
the preceding Type node, the roles of the variables should be automatically
identified by the Neural Network node and, hence, appear in the right role field.
See Fig. 8.201.
Fig. 8.198 Reclassification of the “Results for White” variable to a binary field
We furthermore use the default settings of the SPSS Modeler, which in
particular include the MLP network type, as well as an automatic determination
of the neurons and hidden layers. See Fig. 8.202. We also make sure that the
predictor importance calculation option is enabled in the Model Options tab.
See how this is done in Fig. 8.187.
Fig. 8.200 Definition of the target variable and setting its measurement type to “Flag”
Fig. 8.201 Target and input variable definition in the Neural Network node
Fig. 8.202 Network type and topology settings for the chess outcome prediction model
Fig. 8.203 Summary of the neural network classifier that predicts the outcome of a chess game
Now, we run the stream and the model nugget appears. In the following, the
model nugget is inspected and the results and statistics interpreted.
The first thing that strikes us is the enormous accuracy of 99.5 % correct
outcome predictions in the training data. See Fig. 8.203. To rule out
overfitting, we have to inspect the statistics for the test set later in the Analysis
node. These, however, are almost as good (see Fig. 8.206) as the accuracy here,
and we can assume that the model is not overfitting the training data. Moreover,
we see in the model summary overview that one hidden layer with 7 neurons
was included in the network. See Fig. 8.203.
Fig. 8.204 Importance of the pieces’ positions on the chessboard for outcome prediction
Fig. 8.205 Classification heat map of counts of predicted versus observed outcomes
When inspecting the predictor importance, we detect that the positions of the
white rook and black king are important for prediction, while the white king’s
position on the board plays only a minor role in the outcome of the game in
16 moves. See Fig. 8.204.
In the classification heat map of absolute counts, we see that only 100 out of
19,632 game outcomes are misclassified. See Fig. 8.205.
To also calculate the accuracy of the test set, and the Gini values for both
datasets, we add an Analysis node to the neural network model nugget. See
Sect. 8.2.5 for the setting options in the Analysis node. In Fig. 8.206, the output
of the Analysis node is displayed, and we verify that the model performs as well
on the test set as on the training set. More precisely, the accuracies are 99.49 %
and 99.48 %, and the Gini values are 0.999 and 0.998 for the training and test
set, respectively. Hence, the model does not overfit the training set.
Fig. 8.206 Analysis node statistics of the chess outcome Neural Network classifier
3. To build an SVM and Logistic Regression model on the same training set, we
add an SVM node and a Logistic node to the stream and connect each of them to
the Partition node. In the SVM node, no options have to be modified, while in the
Logistic node, we choose the “Stepwise” variable selection method. See
Fig. 8.207. Afterwards, we run the stream.
Before comparing the prediction performances of the three models, NN,
SVM, and Logistic Regression, we rearrange the model nuggets and connect
them into a series. The models are now executed successively on the data.
Compare the rearrangement of the model nuggets with Fig. 8.107.
We now add an Analysis node and an Evaluation node to the last model
nugget in this series and run these two nodes. See Sect. 8.2.5 and Fig. 8.28 for a
description of these two nodes. The outputs are displayed in Figs. 8.208, 8.209,
and 8.210. When looking at the accuracy of the three models, we notice that the
NN performs best here; it has over 99 % accuracy in the training set and test set,
followed by the SVM with about 96 % and 95 % accuracy, and lastly the Logistic
Regression, with a still very good accuracy of about 90 % in both datasets. The
coincidence matrix, however, gives a more detailed insight into the prediction
performance and reveals the actual weakness of the last model. While the NN and
SVM are able to detect both “win” and “draw” outcomes, the Logistic Regression
model has classified every game as a win for white. See Fig. 8.208. This still
yields good accuracy, since only 10 % of the games end with a draw (recall
Fig. 8.199), but it is deceptive. The reason for this behavior could be that the
chess problem is nonlinear, and a linear classifier such as Logistic Regression is
unable to separate the two classes from each other, a problem intensified by the
imbalance in the data.
better, and this is also confirmed by the Gini values of the three models in
Fig. 8.209. The Ginis of the NN and SVM are very high and nearly 1, while the
Gini of the Logistic Regression is, at about 0.2, noticeably smaller. This also
indicates an anomaly in the model. The Gini or AUC is visualized by the
ROC in Fig. 8.210. The ROCs of the NN and SVM are almost perfect, while the
ROC of the Logistic Regression runs clearly beneath the other two.
Fig. 8.208 Accuracy of the Neural Network, SVM, and Logistic regression chess outcome
classifier
Fig. 8.209 AUC/Gini of the Neural Network, SVM, and Logistic regression chess outcome
classifier
Fig. 8.210 ROC of the Neural Network, SVM, and Logistic regression chess outcome classifier
Exercise 2 Credit rating with a Neural network and finding the best network
topology
Name of the solution stream: tree_credit_nn
Theory discussed in: Sect. 8.2, Sect. 8.6.1
and since we use the default settings, nothing has to be modified in the network
settings. In particular, we use the MLP network type and automatic topology
determination. See Fig. 8.214.
Now we run the stream.
In the Model summary view in the model nugget, we see that one hidden layer
with six neurons is included while training the data. See Fig. 8.215. We also
notice that the model has an accuracy of 76.4 % in the training data. The network
is visualized in Fig. 8.216.
3. We add a second Neural Network node to the stream and connect it to the Type
node. Unlike with the first Neural Network node, here we manually define the
number of hidden layers and units. We choose two hidden layers with ten and
five neurons as our network structure, respectively. See Fig. 8.217. The type
remains MLP.
We run this node and open the model nugget that appears. The accuracy of this
model with two hidden layers (ten units in the first one and five in the second),
has increased to 77.9 %. See Fig. 8.218. Hence, the model performs better on the
training data than the automatically determined network. Figure 8.219 visualizes
the structure of this network with two hidden layers.
Fig. 8.213 Definition of target and input fields in the Neural Network node
Fig. 8.214 Default network type and topology setting in the Neural Network node
Fig. 8.215 Summary of the NN, with automatic topology determination that predicts credit ratings
Fig. 8.216 Visualization of the NN, with automatic topology determination that predicts credit ratings
Fig. 8.217 Manually define the network topology of the Neural Network node
Fig. 8.218 Summary of the NN, with manually defined topology that predicts credit ratings
4. To compare the two models with each other, we add a Filter node to each model
nugget and rename the prediction fields with a meaningful name. See Fig. 8.220
for the Filter node, after the model nugget with automatic network
determination. The setting of the Filter node for the second model is analogous,
except for the inclusion of the “Credit rating” variable, since this is needed to
calculate the evaluation statistics.
Fig. 8.219 Visualization of the NN, with manually defined topology that predicts credit ratings
Fig. 8.221 Evaluation statistics from the two networks that predict credit ratings
In Fig. 8.221, the accuracy and Gini measures can be viewed for both models and all three subsets. We notice
that for the network with a manually defined structure, the accuracy is higher in
all three sets (training, validation, and test), as are the Gini values. This is also
visualized by the ROCs in Fig. 8.222, where the curves of the manually defined
topology network lie above the automatically defined network. Hence, a network
with two hidden layers having ten and five units describes the data slightly better
than a network with one hidden layer containing six neurons.
For the logical OR operator, the network yields
x1  x2  φ(x)
0   0   φ(−100) ≈ 0
1   0   φ(100) ≈ 1
0   1   φ(100) ≈ 1
1   1   φ(300) ≈ 1
and for the logical NOT operator
x1  φ(x)
0   φ(100) ≈ 1
1   φ(−100) ≈ 0
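These two solutions can also be verified numerically. The weights below are assumed assignments consistent with the sums shown in the tables (OR: ω0 = −100, ω1 = 200, ω2 = 200; NOT: ω0 = 100, ω1 = −200); the figures in the book may use different values.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Assumed weights reproducing the sums in the OR and NOT tables above
def or_neuron(x1, x2):
    return sigmoid(-100.0 + 200.0 * x1 + 200.0 * x2)

def not_neuron(x1):
    return sigmoid(100.0 - 200.0 * x1)

# Rounding the sigmoid outputs yields the logical OR and NOT exactly
or_truth = {(a, b): round(or_neuron(a, b)) for a in (0, 1) for b in (0, 1)}
not_truth = {a: round(not_neuron(a)) for a in (0, 1)}
```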
8.7 k-Nearest Neighbor
The k-nearest neighbor (kNN) algorithm is nonparametric and one of the
simplest of the classification methods in machine learning. It is based on the
assumption that data points that are similar to each other belong to the same
class. So, the classification of an object is simply done by majority voting among
the data points in its neighborhood. The theory and concept of kNN are described
in the next section. Afterwards, we turn to the application of kNN to real data
with the SPSS Modeler.
8.7.1 Theory
The kNN algorithm is nonparametric, which means that no model parameters
have to be calculated. A kNN classifier is trained with just the set of training data
and the values of the involved features. This learning technique is also called
lazy learning, so training a model is very fast. In return, however, the prediction
of new data points can be very resource and time consuming. We refer to Lantz
(2013) and Peterson (2009) for information that goes beyond our short
introduction here.
Fig. 8.225 Visualization of the kNN method for different k, with k = 3 in the left and k = 1 in the
right graph
Choosing the right k is important but, unfortunately, not trivial. A small k will
take into account only points within a small radius and thus give each neighbor a
very high importance when classifying. In so doing, however, it makes the model
prone to noisy data and outliers. If, for example, the rectangle nearest to the star
is an outlier in the right graph of Fig. 8.225, the star will probably be
misclassified, since it would rather belong to the circle group. If k is large, then
the model becomes more stable and less affected by noise. On the other hand,
more attention will then be given to the majority class, as a huge number of data
points are engaged in the decision making. This can be a big problem for skewed
data. In this case, the majority class most often wins, suppressing the minority
class, which is thus ignored in the prediction.
Of course, the choice of k depends upon the structure of the data and the number
of observations and features. In practice, k is typically set somewhere between
three and ten. A common choice for k is the square root of the number of
observations. This is just a suggestion that has turned out to be a good choice of
k for many problems, but it does not have to be the optimal value for k and can
even result in poor predictions.
The usual way to identify the best k is via cross-validation. That is, various
models are trained for different values of k and validated on an independent set. The
model with the lowest error rate is then selected as the final model.
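This selection procedure can be sketched with a minimal kNN (Euclidean distance, majority vote) on a tiny made-up 2D dataset; it is an illustration of the idea, not the Modeler's internal algorithm.

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    # train: list of ((x, y), label); classify query by majority vote among
    # the k nearest training points under the Euclidean distance
    neighbors = sorted(train, key=lambda p: math.dist(p[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

def error_rate(train, validation, k):
    wrong = sum(knn_predict(train, point, k) != label
                for point, label in validation)
    return wrong / len(validation)

# Made-up dataset: class "a" clusters near (0, 0), class "b" near (5, 5)
train = [((0, 0), "a"), ((1, 0), "a"), ((0, 1), "a"), ((1, 1), "a"),
         ((5, 5), "b"), ((4, 5), "b"), ((5, 4), "b"), ((6, 6), "b")]
validation = [((0.5, 0.5), "a"), ((5.5, 5.5), "b"), ((1.5, 1.0), "a")]

# Train a model for each candidate k, validate, and keep the best one
errors = {k: error_rate(train, validation, k) for k in (1, 3, 5)}
best_k = min(errors, key=errors.get)  # k with the lowest validation error
```

On this toy data all candidate values of k classify the validation points correctly; on real data the error rates differ and the minimum singles out one k. (`math.dist` requires Python 3.8 or later.)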
Distance metrics
A fundamental aspect of the kNN algorithm is the metric with which the distance
between data points is calculated. The most common distance metrics are the
Euclidean distance and the City-block distance. Both are visualized in Fig. 8.226.
The Euclidean metric describes the usual straight-line distance between data
points and is the black solid line between the two points in Fig. 8.226. The
City-block distance is the sum of the distances between the points in every
dimension. This is indicated as the dotted line in Fig. 8.226. The City-block
distance can also be thought of as the way a person has to walk in Manhattan to
get from one street corner to another. This is why this metric is also called the
Manhattan metric. Both distance formulas are shown in Table 8.8, and we also
refer to the Clustering Chap. 7 and IBM (2015a).
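Both metrics are easy to compute; a short sketch for two arbitrary points:

```python
import math

def euclidean(p, q):
    # Straight-line distance (the solid line in Fig. 8.226)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def city_block(p, q):
    # Manhattan distance: sum of per-dimension differences (the dotted line)
    return sum(abs(a - b) for a, b in zip(p, q))

d_euc = euclidean((0, 0), (3, 4))   # 5.0
d_man = city_block((0, 0), (3, 4))  # 7
```

The City-block distance is never smaller than the Euclidean one, since walking along the grid can only lengthen the path.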
Feature normalization
A problem that occurs when calculating distances across different features is the
scaling of those features: the values of different variables lie on differing scales.
For example, consider the data in Fig. 8.227 on bank customers who might be
rated for creditworthiness.
The consequence of this is that a change in the number of credit cards that John or
Frank own has nearly no effect on the distance between these two customers,
although this might have a huge influence on their credit scoring.
To prevent this problem, all features are transformed before being entered into
the kNN algorithm, so they all lie on the same scale and thus contribute equally
to the distance. This process is called normalization. The most common
normalization is the min–max normalization, that is,
x_norm = (x − min(X)) / (max(X) − min(X))
for a value x of the feature X. The SPSS Modeler provides an adjusted min–max
normalization, namely,
x_norm = 2 (x − min(X)) / (max(X) − min(X)) − 1.
Whereas the min–max normalization maps the data between 0 and 1, the
adjusted min–max normalized data take values between −1 and 1.
Additionally, to calculate distances, all variables have to be numeric. To ensure
this, the categorical variables are transformed into numerical ones by dummy
coding their categories with integers.
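The two normalization formulas and the dummy coding can be sketched as follows; the income values and categories are made-up illustration data.

```python
def min_max(values):
    # Standard min-max normalization: maps values into [0, 1]
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def adjusted_min_max(values):
    # The SPSS Modeler's adjusted variant: maps values into [-1, 1]
    lo, hi = min(values), max(values)
    return [2 * (v - lo) / (hi - lo) - 1 for v in values]

income = [35000, 52000, 40000, 98000]
norm = adjusted_min_max(income)  # minimum -> -1.0, maximum -> 1.0

# Dummy coding a categorical field with integers (illustrative only)
codes = {cat: i for i, cat in enumerate(sorted({"single", "married"}))}
```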
A k-nearest neighbor classifier (kNN) can be trained with the KNN node in the
SPSS Modeler. Here, we show how this node is utilized for classification of wine
data. This dataset contains chemical analysis data on three Italian wines, and the
goal is to identify the wine based on its chemical characteristics.
1. First, we open the template stream “021 Template-Stream wine” (see Fig. 8.228)
and save it under a different name. The target variable is called “Wine” and can
take values 1, 2, or 3, indicating the wine type.
This template stream already contains a Type node, in which the variable
“Wine” is set as the target variable and its measurement type is set to nominal.
See Fig. 8.229. Additionally, a Partition node is already included that splits the
wine data into a training set (70 %) and a test set (30 %), so a proper validation of
the kNN is provided. See Sect. 2.7.7 for a description of the Partition node.
2. Next, we add a KNN node to the stream and connect it to the Partition node.
After opening the KNN node, the model options can be set.
3. In the Objectives tab, we will define the purpose of the KNN analysis. Besides
the usual prediction of a target field, it is also possible to identify just the nearest
neighbors, to get an insight into the data and maybe find the best data
representatives to use as training data. Here, we are interested in predicting the
type of wine and thus select the “predict target field” option. See Fig. 8.230.
Furthermore, predefined settings can be chosen in this tab, which are related to
the performance properties, speed, and accuracy. Switching between these
options changes the settings of the KNN node, in order to provide the desired
performance. When the settings are changed manually, the option switches
automatically to “custom analysis”.
4. In the Fields tab, the target and input variables can be selected. See Fig. 8.231 for
variable role definition with the wine data. If the roles of the fields are already
defined in a Type node, the KNN node will automatically identify them.
5. The model parameters are set in the Settings tab. If a predefined setting is chosen
in the Objectives tab, these parameters are preset according to the choice of objective.
Model
In the Model options, the cross-validation process can be initialized by marking the
usual “Use partitioned data” option. See top arrow in Fig. 8.232. Furthermore,
feature normalization can be enabled. See bottom arrow in Fig. 8.232. The SPSS
Modeler uses the adjusted min–max normalization method; see Sect. 8.7.1 for
details and the formula. We also recommend always normalizing the data, since
this improves the prediction in almost all cases.
Neighbors
In the Neighbors view, the value of k, i.e., the number of neighbors, is specified
along with the metric used to calculate the distances. See Fig. 8.233. The KNN node
provides an automatic selection of the “best” k. Therefore, a range for k has to be
defined, in which the node identifies the “optimal” k, based on the classification
error. The value with the lowest error rate is then chosen as k. Detection of the best
k is done either by feature selection or cross-validation. This depends on whether or
not the feature selection option is requested in the Feature Selection panel. Both
options are discussed later.
886 8 Classification Models
Fig. 8.232 Model setting and normalization are enabled in the KNN node
Fig. 8.233 The neighborhood and metric are defined in the KNN node
8.7 k-Nearest Neighbor 887
" The method used to detect the best k depends upon whether or not
the feature selection option is requested in the Feature Selection
panel.
" In both cases, the k with the lowest error rate will be chosen as the
size of the neighborhood. A combination of both options cannot be
selected, due to performance issues. These options are described in
the introduction within their panels.
The number of neighbors can also be fixed to a specific value. See framed field in
Fig. 8.233. Here, we chose the automatic selection of k, with a range between 3 and
5.
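Outside the Modeler, this automatic search for the best k in a range can be sketched in Python with scikit-learn (an assumption made for illustration; the book itself works only with the Modeler GUI). The snippet normalizes the features, evaluates each k from 3 to 5 by cross-validated accuracy, and keeps the best one, using scikit-learn's built-in copy of the wine data:

```python
# Sketch: pick the best k in the range 3..5 by cross-validated error rate.
# scikit-learn is assumed as a stand-in for the Modeler's KNN node.
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = load_wine(return_X_y=True)  # 13 features, 3 wine types

# Normalize features first (min-max), as recommended in the text.
pipe = make_pipeline(MinMaxScaler(), KNeighborsClassifier())
search = GridSearchCV(
    pipe, {"kneighborsclassifier__n_neighbors": [3, 4, 5]}, cv=10
)
search.fit(X, y)

best_k = search.best_params_["kneighborsclassifier__n_neighbors"]
print("best k:", best_k, "cv accuracy:", round(search.best_score_, 3))
```

The k with the highest cross-validated accuracy, i.e., the lowest error rate, is kept, mirroring what the node does internally.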
As an option, the features, i.e., the input variables, can be weighted by their
importance, so more relevant features have a greater influence on the prediction, in
order to improve accuracy. See Sect. 8.7.1 for details. When this option is selected,
see bottom arrow in Fig. 8.233, the predictor importance is calculated and shown in
the model nugget. We therefore select this option in our example with the
wine data.
Feature Selection
In the Feature Selection panel, feature selection can be enabled in the model
training process. See arrow in Fig. 8.234. Thereby, features are added one by one
to the model, until one of the stopping criteria is reached. More precisely, the
feature that reduces the error rate most is included next. The stopping criteria will
either be a predefined maximum number of features or a minimum error rate
improvement. See Fig. 8.234. Feature selection has the advantage of reducing
dimensionality and just focusing on a subset of the most relevant input variables.
See Sect. 8.7.1 for further information on the “curse of dimensionality”.
We exclude feature selection from the model training process, as we intend to
use the cross-validation process to find the optimal k.
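The forward selection procedure described above can also be sketched directly in Python (scikit-learn is assumed as a stand-in for the Modeler's Feature Selection panel; the 0.01 threshold mirrors the minimum error rate improvement criterion):

```python
# Sketch of forward feature selection for a 4-nearest-neighbor classifier:
# in each round, add the feature that raises cross-validated accuracy most,
# and stop when the improvement falls below 0.01 (scikit-learn assumed).
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler

X, y = load_wine(return_X_y=True)
X = MinMaxScaler().fit_transform(X)

def cv_accuracy(features):
    knn = KNeighborsClassifier(n_neighbors=4)
    return cross_val_score(knn, X[:, features], y, cv=5).mean()

selected, best_acc = [], 0.0
remaining = list(range(X.shape[1]))
while remaining:
    # Try each remaining feature and keep the one with the highest accuracy.
    scores = {f: cv_accuracy(selected + [f]) for f in remaining}
    f_best = max(scores, key=scores.get)
    if scores[f_best] - best_acc < 0.01:  # minimum improvement criterion
        break
    selected.append(f_best)
    remaining.remove(f_best)
    best_acc = scores[f_best]

print("selected feature indices:", selected)
```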
Cross-validation
This panel defines the setting of the cross-validation process for finding the best k in
a range. The method used here is V-fold cross-validation, which randomly
separates the dataset into V segments of equal size. Then V models are trained,
each on a different combination of V−1 of these subsets, and the remaining subset
is not included in the training, but is used as a test set. The V error rates on the test
sets are then aggregated into one final error rate. V-fold cross-validation gives very
reliable information on model performance.
Fig. 8.235 Visualization of the V-fold cross-validation process. The gray boxes are the test sets
and the rest are subsets of the training data in each case
" Since each data record is used for test data, V-fold cross-validation
eliminates a “good luck” error rate that can occur by splitting the data
into training and test sets just once. For this reason, it gives very
reliable information on model performance.
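A minimal sketch of V-fold cross-validation, again assuming scikit-learn and its wine data as stand-ins:

```python
# Sketch of V-fold cross-validation: V models, each tested on its held-out
# fold, and the V error rates aggregated into one estimate.
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler

X, y = load_wine(return_X_y=True)
X = MinMaxScaler().fit_transform(X)

V = 10
errors = []
for train_idx, test_idx in KFold(n_splits=V, shuffle=True, random_state=1).split(X):
    model = KNeighborsClassifier(n_neighbors=4).fit(X[train_idx], y[train_idx])
    errors.append(1 - model.score(X[test_idx], y[test_idx]))  # fold error rate

print("mean error rate:", round(float(np.mean(errors)), 3))
```

Every record appears in exactly one test fold, which is why the aggregated error rate is less sensitive to a single lucky split.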
Fig. 8.237 Analysis output and evaluation statistics in the KNN wine classifier
The main tab in the model nugget of the KNN node comprises all the graphs and
visualizations from the model finding process. The tab is split into two frames. See
Fig. 8.239. In the left frame, a 3-dimensional scatterplot of the data is shown,
colored by the target classes. The three axes describe the most important variables
in the model. Here, these are “Alcohol”, “Proline”, and “color_intensity”. The
scatterplot is interactive and can be spun around with the mouse cursor. Further-
more, the axes variables can be changed. A detailed description would exceed the
purpose of this book, so we refer interested readers to IBM (2015b).
" The KNN node and the model nugget are pretty sensitive to variable
names with special characters or blanks, and they have problems
dealing with them. Therefore, in order for all graphs in the model
nugget and all predictions to work properly, we recommend
avoiding special characters and blanks in the variable
names. Otherwise, the graphic will not be displayed or parts will be
missing. Furthermore, predictions with the model nugget might fail,
and produce errors, as shown in Fig. 8.238.
The second graph in Fig. 8.239 shows error rates for each k considered in the
neighborhood selection process. To do this, the “k Selection” option is selected in
the drop-down menu at the bottom of the panel. See arrow in Fig. 8.239. In our
example of the wine data, the model with k = 4 has the lowest error rate, indicated
by the minimum of the curve in the second graphic, and is thus picked as the final
model.
Fig. 8.238 Error message when a variable with a special character, such as (, is present in the
model
Fig. 8.239 Main graphics view with a 3-dimensional scatterplot (left side) and the error rates of
the different variants of k (right side)
When selecting the “Predictor Importance” option in the bottom right drop-
down menu, the predictor importance graph is shown, as we already know from the
other models. See Fig. 8.240 and Sect. 8.3.4, for more information on this chart. In
the 4-nearest neighbor model on the wine data, all the variables are almost equally
relevant with “Alcohol”, “Proline”, and “color_intensity” being the top three most
important in the model.
The stream is split into two parts, the data import and target setting part, and
the dimensional reduction and model building part. As the focus of this section is
dimension reduction with the PCA, we keep the target setting part short, as it is
unimportant for the purpose of this section.
solution. The template already consists of the usual Type and Partition nodes,
which define the target variable and split the data into 70 % training data and
30 % test data.
2. We then add a second Type node, a Reclassify node, and a Select node to the
stream and insert them between the Source node and the Partition node. In the
Type node, we click the “Read Values” button, so that the nodes that follow
know the variable measurements and value types. In the Reclassify node, we
select the “Leukemia” variable as our reclassification field, enable value replace-
ment at the top, and run the “Get” button. Then, we merge AML and ALL into
one group named “acute” and proceed in the same manner with the chronic
leukemia types CML and CLL by relabeling both “chronic”. See Fig. 8.241.
3. As we are only interested in the treatment of leukemia patients, we add a Select
node to exclude the healthy people. See Fig. 8.242 for the formula in the Select
node.
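The Reclassify and Select steps can be mimicked in pandas (a hypothetical stand-in for the Modeler nodes; the column name "Leukemia" follows the text, the sample values are made up):

```python
# Sketch of the Reclassify and Select nodes in pandas (stand-in, not the
# Modeler itself). The sample records below are made up for illustration.
import pandas as pd

df = pd.DataFrame({"Leukemia": ["AML", "ALL", "CML", "CLL", "Healthy", "AML"]})

# Reclassify: merge the acute (AML, ALL) and chronic (CML, CLL) types.
df["Leukemia"] = df["Leukemia"].replace(
    {"AML": "acute", "ALL": "acute", "CML": "chronic", "CLL": "chronic"}
)

# Select: keep only leukemia patients, excluding the healthy records.
patients = df[df["Leukemia"] != "Healthy"]
print(patients["Leukemia"].tolist())
```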
Fig. 8.242 Healthy people are excluded with the Select node
4. To perform dimension reduction with PCA, we add a PCA node to the stream
and connect it to the last Type node. In the PCA node, we then select all the
genomic positions as inputs and the partition indicator variable as the partition
field. See Fig. 8.243.
In the Model tab, we mark the box that enables the use of partitioned data, so
the PCA is only performed on the training data. Furthermore, we select the
principal component method, so that the node uses PCA. See Fig. 8.244.
In the Expert tab, we choose the predefined “Simple” set-up (Fig. 8.245). This
ensures a calculation of the first five factors. If more factors are needed, these
can be customized under the “Expert” options. Please see Sect. 6.3 for more
information on these options.
5. After running the PCA node, it calculates the first five factors, and we observe in
the PCA model nugget that these five factors explain 41.6 % of the data
variation. See Fig. 8.246.
6. An advantage of dimensional reduction with PCA is the consolidation of mutually
correlated variables into fewer, more meaningful new variables. With these, we can get an
impression of the position of the groups (acute, chronic) and the potential for
classification. For that purpose, we add a Plot node to draw a scatterplot of the
first two factors and a Graphboard node to draw a 3D scatterplot of the first three
factors. In both cases, the data points are colored and shaped according to their
group (acute, chronic). These plots are shown in Figs. 8.247 and 8.248, respec-
tively, and we immediately see that the two groups are separated, especially in
the 3D graph, which indicates that separation might be feasible.
Fig. 8.245 Expert tab and set-up of the PCA factors that are determined by PCA
Fig. 8.246 Variance explained by the first five factors calculated by PCA, on the gene expression
dataset
7. Now, we are ready to build a kNN on the five established factors with PCA.
Before we do, we have to add another Type node to the stream, so that the KNN
node is able to read the measurement and value types of the factor variables.
8. We finally add a KNN node to the stream and connect it to the last Type node.
In the KNN node, we select the 5 factors calculated by the PCA as input
variables and the “Leukemia” field as the target variable. See Fig. 8.249.
In the Settings tab, we select the Euclidean metric and set k to 3. See
Fig. 8.250. This ensures a 3-nearest neighbor model.
9. Now we run the stream and the model nugget appears. We connect an Analysis
node to it and run it to get the evaluation statistics of the model. See Sect. 8.2.5
for a description on the options in the Analysis node. See Fig. 8.251, for the
output from the Analysis node. As can be seen, the model has high prediction
accuracy for both the training data and the test data. Furthermore, the Gini
values are extremely high. So, we conclude that the model fits the reduced data
well and is able to distinguish between acute and chronic leukemia from factor
variables alone, which explain just 41.6 % of the variance, with a minimal error
rate.
Fig. 8.249 The factors are selected as input variables in the KNN node
The Gini error rate can probably be improved by adding more factors,
calculated by PCA, as input variables to the model.
10. To sum up, the PCA reduced the multidimensional data (851 features) into a
smaller dataset with only five variables. This smaller data, with 170 times fewer
data entries, contains almost the same information as the multidimensional
dataset, and the kNN model has very high prediction power on the reduced
dataset. Furthermore, the development and prediction speed have massively
increased for the model trained on the smaller dataset. Training and evaluation
of a kNN model on the original and multidimensional data takes minutes,
whereas the same process requires only seconds on the dimensionally reduced
data (five factors), without suffering in prediction power. The smaller dataset
obviously needs less memory too, which is another argument for dimension
reduction.
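The whole PCA-then-kNN pipeline can be sketched as follows (scikit-learn assumed; its built-in breast cancer dataset stands in for the gene expression data, which is not publicly bundled):

```python
# Sketch of the PCA-then-kNN pipeline: reduce a high-dimensional dataset to
# 5 factors, then classify with a 3-nearest-neighbor model.
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Stand-in data: a high-dimensional two-class dataset (not the gene data).
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=1
)

pipe = make_pipeline(PCA(n_components=5), KNeighborsClassifier(n_neighbors=3))
pipe.fit(X_train, y_train)
print("test accuracy:", round(pipe.score(X_test, y_test), 3))
```

Because the kNN model only ever sees the 5 factor scores, both training and prediction run on a fraction of the original feature space.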
8.7.5 Exercises
1. Normalize the features of the training set and the new data, with the min–max
normalization method.
2. Use the normalized data to calculate the Euclidean distance between the
customers John, Frank, Penny, and Howard, and each of the customers in the
training set.
3. Determine the 3-nearest neighbors of the four new customers and assign a credit
rating to each of them.
Fig. 8.252 The training dataset of bank customers that already have a credit rating
Fig. 8.253 List of four new bank customers that need to be rated
4. Repeat steps two and three, with the City-block metric as your distance measure.
Has the prediction changed?
1. Build a 4-nearest neighbor model as in Sect. 8.7.2 on the wine data, but enable
the feature selection method, to include only a subset of the most relevant
variables in the model.
2. Inspect the model nugget. Which variables are included in the model?
3. What is the accuracy of the model for the training data and test data?
1. Import the dataset “gene_expression_leukemia.csv” and merge the data from all
leukemia patients (AML, ALL, CML, CLL) into a single “Cancer” group. What
is the frequency of both the leukemia and the healthy data records in the dataset?
Is the dataset skewed?
2. Perform a PCA on the gene expression data of the training set.
3. Build a kNN model on the factors calculated in the above step. Use the automatic
k selection method with a range of 3–5. What is the performance of this model?
Interpret the results.
4. Build three more kNN models for k equal to 10, 3, and 1, respectively. Compare
these three models and the one from the previous step with each other. What are
the evaluation statistics? Draw the ROC. Which is the best performing model
from these four and why?
8.7.6 Solutions
1. Figure 8.254 shows the min–max normalized input data, i.e., age, income,
number of credit cards. The values 0 and 1 indicate the minimum and maximum
values. As can be seen, all variables are now located in the same range, and thus
the effect of the large numbers and differences in Income is reduced.
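The min–max normalization itself, x' = (x − min)/(max − min) applied per column, can be reproduced in a few lines (the customer values below are made up for illustration, since the original table is only shown as a figure):

```python
# Min-max normalization, column-wise, on hypothetical customer data.
import numpy as np

data = np.array([
    [25.0, 20000.0, 1.0],   # age, income, number of credit cards
    [40.0, 55000.0, 3.0],
    [55.0, 90000.0, 4.0],
])

normalized = (data - data.min(axis=0)) / (data.max(axis=0) - data.min(axis=0))
print(normalized)  # every column now lies in [0, 1]
```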
Fig. 8.255 Euclidean distance between John, Frank, Penny, and Howard and all the other bank
customers
Fig. 8.256 Final credit rating for John, Frank, Penny, and Howard
2. The calculated Euclidean distance between customers John, Frank, Penny, and
Howard and each of the other customers is shown in Fig. 8.255. As an example,
we show here how the distance between John and Sandy is calculated:
d(John, Sandy) = √((0 − 0.29)² + (0.17 − 0.44)² + (0.67 − 0.67)²) ≈ 0.4.
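The same distance computation, using the normalized values of John and Sandy quoted above, with the City-block (Manhattan) distance computed alongside for comparison:

```python
# Recompute the distance between John and Sandy from their normalized
# feature vectors (age, income, credit cards) as given in the text.
import math

john = (0.0, 0.17, 0.67)
sandy = (0.29, 0.44, 0.67)

euclidean = math.sqrt(sum((a - b) ** 2 for a, b in zip(john, sandy)))
cityblock = sum(abs(a - b) for a, b in zip(john, sandy))

print(round(euclidean, 2))  # ≈ 0.4, as in the text
print(round(cityblock, 2))  # ≈ 0.56
```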
3. The 3-nearest neighbors for each new customer (John, Frank, Penny, Howard)
are highlighted in Fig. 8.255. For example, John’s 3-nearest neighbors are
Andrea, Ted, and Walter, with values 0.3, 0.31, and 0.27.
Counting the credit ratings of these three neighbors for each of the four new
customers, we get, by majority voting, the following credit ratings, as listed in
Fig. 8.256.
4. Figure 8.257 displays the City-block distances between John, Frank, Penny, and
Howard and each of the training samples. The colored values symbolize the
3-nearest neighbors in this case. As an example, here we calculate the distance
between John and Sandy:
d(John, Sandy) = |0 − 0.29| + |0.17 − 0.44| + |0.67 − 0.67| = 0.56.
Fig. 8.257 City-block distance between John, Frank, Penny, and Howard and all the other bank
customers
We see that the nearest neighbors haven’t changed compared with the
Euclidean distance. Hence, the credit ratings are the same as before and can be
viewed in Fig. 8.256.
1. Since this is the same stream construct as in Sect. 8.7.2, we open the stream
“k-nearest neighbor—wine” and save it under a different name. We recall that
this stream imports the wine data, defines the target variables and measurement,
splits the data into 70 % training data and 30 % test data, and builds a kNN model
that automatically selects the best k; in this case k = 4. This is the reason why we
perform a 4-nearest neighbor search in this exercise.
To enable a feature selection process for a 4-nearest neighbor model, we open
the KNN node and go to the Neighbors panel in the Settings tab. There, we fix
k to be 4, by changing the automatic selection option to “specify fixed k” and
writing “4” in the corresponding field. See Fig. 8.259.
In the Feature Selection panel, we now activate the feature selection process
by checking the box at the top. See top arrow in Fig. 8.260. This enables all the
options in this panel, and we can choose the stopping criteria. We can choose to
stop when a maximum number of features is included in the model or when the
error rate cannot be lowered by more than a minimum value. We choose the
second stopping criterion and set the minimum improvement for each inserted
feature to 0.01. See Fig. 8.260.
Fig. 8.260 Activation of the features selection procedure in the KNN node
Fig. 8.261 Model view in the model nugget and visualization of the feature selection process
graph. See Fig. 8.261. In this case, seven features are included in the model. These
are, in order of their inclusion in the model, “Flavanoids”, “Color_intensity”,
“Alcohol”, “Magnesium”, “Proline”, “Proanthocyanins”, and “Ash”.
Fig. 8.262 Predictor importance in the case of feature selection on the wine data
Since not all variables are considered in the model, the predictor importance
has changed, which we already recognized by the axes variables in the 3-D
scatterplot. The importance of variables in this case is shown in Fig. 8.262.
3. The output of the Analysis node is displayed in Fig. 8.263. As can be seen, the
accuracy has not changed much, since the performance of the model in
Sect. 8.7.2 was already extremely precise. The accuracy has also not suffered
from feature selection, however, and the dimensionality of the data has halved,
which reduces memory requirements and speeds up prediction.
1. Since we are dealing here with the same data as in Sect. 8.7.4, where a kNN model
was also built on PCA processed data, we use this stream as a starting point. We
therefore open the stream “knn_pca_gene_expression_acute_chronic_leukemia”
and save it under a different name. See Fig. 8.265 for the stream in Sect. 8.7.4.
This stream has already imported the proper dataset, partitioned it into a
training set and a test set, and reduced the data with PCA. See Sect. 8.7.4 for
details.
Starting from here, we first need to change the reclassification of target values.
For that purpose, we delete the Select node and connect the Distribution node to
the Reclassify node. The latter is then opened and the value of the “Leukemia”
variable representing all leukemia types (AML, ALL, CML, CLL) is set to
“Cancer”. See Fig. 8.266.
Fig. 8.264 Stream of the kNN leukemia and healthy patient identification exercise
Next we run the Distribution node, to inspect the frequency of the new values
of the “Leukemia” variable. As can be seen in Fig. 8.267, the dataset is highly
skewed, with the healthy data being the minority class, comprising only 5.73 %
of the whole data.
Fig. 8.268 Explained variation in the first five components established by PCA
Fig. 8.269 Scatterplot matrix for all five factors calculated by PCA
2. The PCA node is already included in the stream, but since the healthy patients
are now added to the data, unlike in the stream from Fig. 8.265, we have to run
the PCA node again to renew the component computations. In the PCA model
nugget, we can then view the variance in the data, explained by the first five
components. This is 41.719 %, as shown in Fig. 8.268.
To get a feeling for the PCA components, we add a Graphboard node to the
PCA model nugget and plot a scatterplot matrix for all five factors. See
Fig. 8.269. We notice that in all dimensions, the healthy patient data seems to
Fig. 8.272 Evaluation statistics in the KNN model, with automatic neighborhood selection
To build the 3-nearest and 1-nearest neighbor models, we proceed in the same
manner. Afterwards, we run all three models (10-, 3-, and 1-nearest neighbor).
To provide a clear overview and make comparison of the models easier, we
connect all KNN model nuggets into a series. See Fig. 8.107 as an example of
how this is performed. Next, we add a Filter node to the end of the stream, to
rename the predictor fields of the models with a proper name.
We then add an Analysis node and Evaluation node to the stream and connect
them to the Filter node. See Sect. 8.2.5 and Fig. 8.28 for options within these two
nodes. Figure 8.274 shows the accuracy and Gini, as calculated by the Analysis
node. We notice that the models improve according to Gini and accuracy
measures, by reducing the value of k. The smaller the neighborhood, the better
the model prediction. Thus, the statistics indicate that a 1-nearest neighbor
classifier is the best one for this data. This is evidenced by the perfect classifica-
tion of this model, i.e., an accuracy of 100 %. The analysis further shows that the
Fig. 8.274 Accuracy and Gini for the four kNN models
automatic k selection method output is not always the best solution. k = 3 may be
more appropriate than k = 4 in this situation.
Improvement of the models is also visualized by the ROCs in Fig. 8.275.
The coincidence matrices can also confirm improvement in the prediction
performance of models with a smaller k. A smaller k brings more attention to the
absolute nearest data points, and thus to the minority class. Hence,
misclassification of “Healthy” patients is reduced when k is lowered. See Fig. 8.276.
We now turn to the rule-based classification methods, of which Decision Trees
(DT) are the most famous group of algorithms. In comparison with the classifica-
tion methods discussed so far, rule-based classifiers have a completely different
approach; they inspect the raw data for rules and structures that are common in a
target class. These identified rules then become the basis for decision making.
This approach is closer to real-life decision making. Consider a situation where
we plan to play tennis. The decision on whether to play tennis or not depends on the
weather forecast. We will go to the tennis court “if it is not raining and the
temperature is over 15 °C”. If “rainy weather or a temperature of less than 15 °C”
is forecasted we decide to stay at home. These rules are shown in Fig. 8.277 and we
recall Sect. 8.2.1 for a similar example.
A decision tree algorithm now constructs and represents this logical structure in
the form of a tree, see Fig. 8.278. The concept of finding appropriate rules within
the data is discussed in the next chapter. Then the building of a DT model with the
SPSS Modeler is presented.
Rule-based classifiers have the advantage of being easily understandable and
interpretable without statistical knowledge. For this reason, DT, and rule-based
classifiers in general, are widely used in fields where the decision has to be
transparent, i.e., in credit scoring, where the scoring result has to be explained to
the customer.
8.8.1 Theory
Decision trees (DT) belong to a group of rule-based classifiers. The main charac-
teristic of a DT lies within how it orders the rules in a tree structure. Looking again
at the tennis example, where a decision is made based upon the weather, the logical
rule on which the decision will be based is shown in Fig. 8.277. In a DT, these rules
are represented in a tree, as can be seen in Fig. 8.278.
A DT is like a flow chart. A data record is classified by going through the tree,
starting at the root node, and deciding in each tree node which of the conditions the
particular variable satisfies, and then following the branch of this condition. This
procedure is repeated until a leaf is reached that brings the combination of decisions
made thus far to a final classification. For example, let us consider an outlook “No
rain”, with a temperature of 13 °C. We want to make our decision to play tennis
using the DT in Fig. 8.278. We start at the root node and see that our outlook is “No
rain” and we follow the left branch to the next node. This node checks the temperature
variable: is it larger or smaller than 15 °C? As the temperature in our case is 13 °C,
we follow the branch to the right and arrive at a “No” leaf. Hence, in our example,
we decide to stay home as the temperature is too cold.
As can be seen in this short example, our decision to cancel our plans to play
tennis was easily determined by cold weather. This shows the great advantage of
DT or rule-based classifiers in general. The cause of a certain decision can be
reconstructed and is interpretable without statistical knowledge. This is the reason
for DT’s popularity of use for a variety of fields and problems, where the classifi-
cation has to be interpretable and justifiable to other people.
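The play-tennis rules can be written down directly as code (a sketch; the outlook values and the 15 °C threshold follow the figure):

```python
# The play-tennis decision tree from the text as explicit rules.
def play_tennis(outlook: str, temperature: float) -> str:
    if outlook == "Rain":
        return "No"   # rainy forecast: stay at home
    if temperature > 15:
        return "Yes"  # no rain and warm enough: play
    return "No"       # no rain but too cold

print(play_tennis("No rain", 13))  # the example from the text: "No"
```

Reading the function top to bottom retraces exactly the path through the tree, which illustrates why such classifiers are interpretable without statistical knowledge.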
– All (or almost all) training data in the node are of the same target category.
– The data can no longer be partitioned by the variables.
– The tree has reached its predefined maximum level.
– The node has reached the minimum occupation size.
7. A node that fulfills one of the stopping criteria is called a leaf and indicates the
final classification. The target category that occurs most often in the subset of the
leaf is taken as the predictor class. This means each data sample that ends up in
this leaf when passed through the DT is classified as the majority target category
of this leaf.
Fig. 8.279 The DT nodes are partitioned for the “play tennis” example
and the rectangles indicate days we stay at home. In the first node, the variable
“Outlook” is chosen and the data are divided into the subsets “No rain” and
“Rain”. See the first graph in Fig. 8.279. If the outlook is “Rain”, we choose not
to play tennis, while in the case of “No rain”, we can make another split on the
values of the temperature. See right graph in Fig. 8.279. Recalling Fig. 8.278, this
split is done at 15 °C, whereby on a hotter day, we decide to play a tennis match.
As can be seen in the above example and Fig. 8.279, a DT is only able to perform
axis-parallel splits in each node. This divides the data space into several rectangles,
where each of them is assigned to a target category.
Pruning
If the stopping criteria are set too narrowly, the finished DT is very small and tends
to underfit the training data. In other words, the tree is unable to describe particular
structures in the data, as the stopping criteria are too restrictive. Underfitting can occur, for
example, if the minimum occupation size of each node is set too large, or the
maximum level size is set too small. On the other hand, if the stopping criteria are
too broad, the DT can continue splitting the training data until each data point is
perfectly classified, and then the tree will be overfitted. The constructed DT is
typically extremely large and complex.
Several pruning methods have therefore been developed to solve this dilemma,
originally proposed by Breiman et al. (1984). The concept is basically the following:
instead of stopping the tree growth at some point, e.g., at a maximum tree level, the
tree is over-constructed, allowing the tree to overfit the data. Then, nodes and
“sub-branches” that do not contribute to the general accuracy are removed from
the overgrown tree.
Growing the tree to its full size and then cutting it back is usually more effective
than stopping at a certain point, since determination of the optimal tree depth is
difficult without growing it first. As this method allows us to better identify
important structures in the data, this pruning approach generally improves the
generalization and prediction performance.
For more information on the pruning process and further pruning methods, see
Esposito et al. (1997).
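The grow-then-prune idea can be sketched with scikit-learn's cost-complexity pruning, which implements the approach of Breiman et al. (1984); this is a stand-in assumption, since the Modeler's tree nodes expose pruning through their own options:

```python
# Sketch of grow-then-prune: fit a full tree, then cut it back with
# cost-complexity pruning (ccp_alpha), the approach of Breiman et al. (1984).
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)

full = DecisionTreeClassifier(random_state=0).fit(X, y)  # overgrown tree
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.02).fit(X, y)

print("leaves before/after pruning:", full.get_n_leaves(), pruned.get_n_leaves())
```

Larger values of `ccp_alpha` remove more sub-branches, trading training accuracy for a simpler tree that usually generalizes better.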
The split in each node is selected with the Gini coefficient, sometimes also
called the Gini index. The Gini coefficient is an impurity measure and describes the
dispersion of a split. The Gini coefficient should not be confused with the Gini
index that measures the performance of a classifier, see Sect. 8.2.5. The Gini
coefficient at node σ is defined as
Gini(σ) = 1 − Σ_j (N(σ, j)/N(σ))²,
where j is a category of the target variable, N(σ, j) the number of data in node σ with
category j, and N(σ) the total number of data in node σ. In other words,
N(σ, j)/N(σ)
is the relative frequency of category j among the data in node σ. The Gini coefficient
reaches its maximum, when the data in the node are equally distributed across the
categories. If, on the other hand, all data belong to the same category in the node,
then the Gini equals 0, its minimum value. A split is now measured with the Gini
Gain
GiniGain(σ, s) = Gini(σ) − (N(σ_L)/N(σ)) · Gini(σ_L) − (N(σ_R)/N(σ)) · Gini(σ_R),
where σ_L and σ_R are the two child nodes of σ, and s is the splitting criterion. The
binary split that maximizes the Gini Gain will be chosen. In this case, the child
nodes deliver maximal purity, with regard to category distribution, and so it is best
to partition the data along these categories.
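The Gini coefficient and the Gini Gain can be computed directly from the per-category counts in a node (a minimal sketch of the formulas above):

```python
# Direct implementation of the Gini coefficient and the Gini Gain.
# Each argument is a list of per-target-category counts in a node.
def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def gini_gain(parent, left, right):
    n = sum(parent)
    return gini(parent) - sum(left) / n * gini(left) - sum(right) / n * gini(right)

print(gini([5, 5]))    # 0.5 -- maximal impurity for two equal classes
print(gini([10, 0]))   # 0.0 -- pure node
print(round(gini_gain([5, 5], [5, 1], [0, 4]), 3))  # gain of an impure/pure split
```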
When there are a large number of target categories, the Gini coefficient can
encounter problems. The CART therefore provides another partitioning selection
measure, called twoing. Briefly, twoing divides the data into two groups of equal
size, instead of trying to split the data so that the subgroups are as pure as possible.
We refrain from a detailed description of this measure within the CART here and refer to
Breiman et al. (1984) and IBM (2015a).
C5.0
The C5.0 algorithm was developed by Ross Quinlan and is an evolution of his own
C4.5 algorithm (Quinlan 1993), which itself originated from the ID3 decision tree
(Quinlan 1986). Its ability to split is not strictly binary, but allows for partitioning of
a data segment into more than two subgroups. As with the CART, the C5.0 tree
provides a pruning method after the tree has grown, and the splitting rules of the
nodes are selected via an impurity measure. The measure used is the Information
Gain, based on the entropy. The entropy quantifies the homogeneity of categories in
a node and is given by
Entropy(σ) = − Σ_j (N(σ, j)/N(σ)) · log₂(N(σ, j)/N(σ)),
where σ symbolizes the current node, j is a category, N(σ, j) the number of data
records of category j in node σ, and N(σ) the total number of data in node σ (see the
description of CART). If all categories are equally distributed in a node segment,
the entropy is maximal, and if all data are of the same class, it takes its
minimum value of 0. The Information Gain is now defined as
InformationGain(σ, s) = Entropy(σ) − Σ_{σ₁} (N(σ₁)/N(σ)) · Entropy(σ₁),
with σ₁ being one of the child nodes of σ resulting from the split. The Information
Gain measures the change in purity in the data segments of the nodes. The splitting
criterion s that maximizes the Information Gain is selected for this particular node.
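Likewise, the entropy and the Information Gain can be computed from per-category counts (a minimal sketch of the formulas above):

```python
# Direct implementation of entropy and Information Gain.
# Each node is given as a list of per-target-category counts.
from math import log2

def entropy(counts):
    n = sum(counts)
    return sum(-(c / n) * log2(c / n) for c in counts if c > 0)

def information_gain(parent, children):
    n = sum(parent)
    return entropy(parent) - sum(sum(c) / n * entropy(c) for c in children)

print(entropy([5, 5]))   # 1.0 -- maximal for two equal classes
print(entropy([10, 0]))  # 0.0 -- pure node
print(round(information_gain([5, 5], [[5, 1], [0, 4]]), 3))  # ≈ 0.61
```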
In 2008, the C4.5 was picked as one of the top ten algorithms for data mining (Wu
et al. 2008). More information on the C4.5 and C5.0 decision trees can be found in
Quinlan (1993) and Lantz (2013).
Table 8.9 Decision tree algorithms with the corresponding node in the modeler
Decision tree | Method | SPSS Modeler node
CART | Gini coefficient; twoing criterion | C&R Tree
C5.0 | Information Gain (Entropy) | C5.0
CHAID | Chi-squared statistics | CHAID
QUEST | Significance statistics | QUEST
In Table 8.9, the decision tree algorithms, their splitting methods, and the
corresponding nodes in the Modeler are displayed.
For additional information and more detailed descriptions of these decision
trees, we refer the interested reader to Mingers (1989) and Rokach and Maimon.
Boosting (AdaBoost)
In Sect. 5.3.6, the technique of ensemble modeling and particularly Boosting was
discussed. Since the concept of Boosting originated with decision trees and is still
mostly used for classification models, we hereby explain the technique once again,
but in more detail.
Boosting was developed to increase prediction accuracy by building a sequence of models. The key idea behind this method is to give misclassified data records a higher weight and correctly classified records a lower weight, in order to focus the next component model on the incorrectly predicted records. With this approach, the classification problem shifts to the data records that usually get lost in the analysis, while the records that are easy to handle and correctly classified anyway are neglected. All component models in the ensemble are built on the entire dataset, and the weighted models are aggregated into a final prediction.
This process is demonstrated in Fig. 8.280. In the first step, the unweighted data
(circles and rectangles) are divided into two groups. With this separation, two
rectangles are located on the wrong side of the decision boundary, hence, they
are misclassified. See the circled rectangles in the top right graph. These two data
rectangles are now more heavily weighted, and all other points are down-weighted.
This is symbolized by the size of the points in the bottom left graph in Fig. 8.280.
Now, the division is repeated with the weighted data. This results in another
decision boundary, and thus, another model. See the bottom right graph in
Fig. 8.280. These two separations are now combined through aggregation, which
results in perfect classification of all points. See Fig. 8.281.
The most common and best-known boosting algorithm is AdaBoost, or adaptive boosting (Zhou 2012). We refer to Sect. 5.3.6 and Lantz (2013), Tuffery (2011), Wu et al. (2008), Zhou (2012) and James et al. (2013) for further information on boosting and other ensemble methods, such as bagging.
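The weight-update loop described above can be sketched in Python. This is a minimal, self-contained illustration with hand-made decision stumps on toy one-dimensional data; it is not the exact algorithm implemented in the Modeler, and all data values are made up:

```python
import math

def train_adaboost(X, y, candidate_stumps, rounds=5):
    """AdaBoost sketch: X holds 1-D points, y labels in {-1, +1},
    candidate_stumps is a list of weak classifiers f(x) -> -1 or +1."""
    n = len(X)
    w = [1.0 / n] * n                      # start with uniform record weights
    ensemble = []                          # list of (alpha, stump) pairs
    for _ in range(rounds):
        def weighted_error(f):             # weighted misclassification rate
            return sum(wi for wi, xi, yi in zip(w, X, y) if f(xi) != yi)
        stump = min(candidate_stumps, key=weighted_error)
        err = max(weighted_error(stump), 1e-10)
        if err >= 0.5:                     # weak learner no better than chance
            break
        alpha = 0.5 * math.log((1 - err) / err)   # weight of this model
        ensemble.append((alpha, stump))
        # up-weight misclassified records, down-weight correct ones
        w = [wi * math.exp(-alpha * yi * stump(xi))
             for wi, xi, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]       # renormalize
    return ensemble

def predict(ensemble, x):
    """Aggregate the weighted votes of all component models."""
    return 1 if sum(alpha * f(x) for alpha, f in ensemble) >= 0 else -1

# Toy data: class -1 only in the middle interval [2, 4)
X = [0, 1, 2, 3, 4, 5]
y = [1, 1, -1, -1, 1, 1]
stumps = [lambda x, t=t, s=s: s * (1 if x < t else -1)
          for t in range(6) for s in (1, -1)]
model = train_adaboost(X, y, stumps)
print([predict(model, x) for x in X])      # [1, 1, -1, -1, 1, 1]
```

No single threshold stump can classify this interval pattern, but the weighted combination of a few stumps does, which is exactly the effect shown in Figs. 8.280 and 8.281.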
Fig. 8.281 The concept of boosting. Aggregation of the models and the final model
There are four nodes that can be used to build one of the above-described decision trees in the SPSS Modeler. See Table 8.9. As their selection options are relatively similar, we only present the C5.0 node in this section and the CHAID node in the subsequent section, and refer to the exercises for usage of the remaining nodes.
We show how a C5.0 tree is trained, based on credit rating data “tree_credit”, which
contains demographic and historic loan data from bank customers and their related
credit rating (“good” or “bad”).
1. First, we open the stream “000 Template-Stream tree_credit”, which imports the
tree_credit data and already has a Type node attached to it. See Fig. 8.282. We
save the stream under a different name.
2. To set up a validation of the tree, we add a Partition node to the stream and place
it between the source and the Type node. Then, we open the node and define
70 % of the data as training data and 30 % as test data. See Sect. 2.7.7 for a
description of the Partition node.
3. Now we add a C5.0 node to the stream and connect it to the Type node.
4. In the Fields tab of the C5.0 node, the target and input fields have to be selected.
As in the other model nodes, we can choose between a manual setting and
automatic identification. The latter is only applicable if the roles of the variables
have already been defined in a previous Type node. Here, we select the “Credit
rating” variable as the target and “Partition” as the partition defining field. All
the other variables are chosen as the input. See Fig. 8.283.
5. In the Model tab, the parameters of the tree building process are set. We first
enable the “Use partitioned data” option, in order to build the tree on the training
data and validate it with the test data. Besides this common option, the C5.0 tree
offers two other output types. In addition to the decision tree, one can choose
“Rule set” as the output. In this case, the set of rules is derived from the tree and
contains a simplified version of the most important information of the tree. Rule
sets are handled a bit differently, as now, multiple rules, or no rule at all, can
apply to a particular data record. The final classification is thus done by voting; see IBM (2015b). Here, we select a decision tree as our output; see the arrow in Fig. 8.284. For "Rule set" building, see Exercise 4 in Sect. 8.8.5.

8.8 Decision Trees 927

Fig. 8.284 The options for the C5.0 tree training process are set
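The voting logic of a rule set can be illustrated with a small Python sketch; the rules, variables, and the Age threshold of 29.28 are purely illustrative stand-ins, not the actual rules generated by the node:

```python
# Each rule: (condition function, predicted class). A record may match
# several rules, or none; the final class is chosen by voting.
def classify_with_rule_set(record, rules, default="good"):
    votes = {}
    for condition, label in rules:
        if condition(record):
            votes[label] = votes.get(label, 0) + 1
    if not votes:
        return default    # no rule applies: fall back to a default class
    return max(votes, key=votes.get)

# Hypothetical rules loosely inspired by the credit rating example
rules = [
    (lambda r: r["Age"] <= 29.28, "bad"),
    (lambda r: r["Income"] == "low", "bad"),
    (lambda r: r["Age"] > 29.28, "good"),
]
print(classify_with_rule_set({"Age": 25, "Income": "low"}, rules))   # bad
print(classify_with_rule_set({"Age": 40, "Income": "high"}, rules))  # good
```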
For the training process, three additional methods that may improve the quality of the tree can be selected; see Fig. 8.284.
In the bottom area of the Model tab, the parameters for the pruning process are
specified. See Fig. 8.284. We can choose between a “Simple” mode, with many
predefined parameters, and an “Expert” mode, in which experienced users are
able to define the pruning settings in more detail. We select the “Simple” mode
and declare accuracy as more important than generality. In this case, the pruning process will focus on improving the model's accuracy, whereas if the "Generality" option was selected, trees that are less susceptible to overfitting would be favored.
If the proportion of noisy records in the training data is known, this informa-
tion can be included in the model building process and will be considered while
fitting the tree. For further explanation of the “Simple” options and the “Expert”
options, we refer to IBM (2015b).
6. In the Costs tab, one can specify the cost incurred when a data record is misclassified. See
Fig. 8.285. With some problems, misclassifications are more costly than others.
For example, in the case of a pregnancy test, a diagnosis of non-pregnancy of a
pregnant woman might be more costly than the other way around, since in this
case the woman might return to drinking alcohol or smoking. To incorporate this
into the model training, the error costs can be specified in the misclassification
cost matrix of the Costs tab. By default, all misclassification costs are set to 1. To
change particular values, enable the “Use misclassification costs” option and
enter new values into the matrix below. Here, we stay with the default misclassi-
fication settings.
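How such a misclassification cost matrix enters an evaluation can be sketched as follows; the concrete cost values are illustrative assumptions, with only the default off-diagonal cost of 1 mirroring the node's default:

```python
# Misclassification cost matrix: cost[true][predicted], diagonal = 0.
# The higher cost of 5 for diagnosing a pregnant woman as "not pregnant"
# is a made-up value illustrating the asymmetry described in the text.
cost = {
    "pregnant":     {"pregnant": 0.0, "not pregnant": 5.0},
    "not pregnant": {"pregnant": 1.0, "not pregnant": 0.0},
}

def total_cost(y_true, y_pred):
    """Sum the cost of every (mis)classified record."""
    return sum(cost[t][p] for t, p in zip(y_true, y_pred))

y_true = ["pregnant", "pregnant", "not pregnant"]
y_pred = ["pregnant", "not pregnant", "pregnant"]
print(total_cost(y_true, y_pred))  # 0 + 5 + 1 = 6.0
```

A model trained with such costs will prefer errors in the cheap direction over errors in the expensive one.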
7. In the Analyze tab, we further select the "Calculate predictor importance" option.
8. Now we run the stream and the model nugget appears. Views of the model
nugget are presented in Sect. 8.8.3.
9. We add the usual Analysis node to the stream, to calculate the accuracy and Gini
for the training set and test set. See Sect. 8.2.6 for a detailed description of the
Analysis node options. The output of the Analysis node is displayed in
Fig. 8.286. Both the training set and testing set have an accuracy of about
80 % and a Gini of 0.704 and 0.687, respectively. This shows quite good
prediction performance, and the tree doesn’t seem to be overfitting the training
data.
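The two reported measures can be recomputed from a model's predictions. Here is a sketch, with the Gini taken, as is common, as 2·AUC − 1; the labels and scores are made-up illustration values, not the tree_credit results:

```python
def accuracy(y_true, y_pred):
    """Fraction of correctly classified records."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def auc(y_true, scores):
    """AUC via the Mann-Whitney rank formulation: the probability that a
    random positive record gets a higher score than a random negative one."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def gini(y_true, scores):
    """Gini coefficient as commonly derived from the AUC."""
    return 2 * auc(y_true, scores) - 1

# Made-up example: scores = predicted probability of the positive class
y_true = [1, 1, 0, 0, 1]
scores = [0.9, 0.7, 0.4, 0.2, 0.6]
y_pred = [1 if s >= 0.5 else 0 for s in scores]
print(accuracy(y_true, y_pred), gini(y_true, scores))  # 1.0 1.0
```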
Fig. 8.286 Accuracy and Gini in the C5.0 decision tree on the “tree_credit” data
The model nuggets of all four decision trees (C5.0, CHAID, C&R Tree, and QUEST) have exactly the same structure, with the same views, graphs, and options. Here, we present the
model nuggets and graphs of these four trees, by inspecting the model nugget of the
C5.0 model built in the previous Sect. 8.8.2 on the credit rating data.
Fig. 8.287 Model view in the model nugget. The tree structure is shown on the left and the
predictor importance in the right panel
Fig. 8.288 Part of the rule tree of the C5.0 tree, which classifies customers into credit score
groups
Each rule describes the conditions that the properties of a data record have to fulfill, in order to belong in this branch. Figure 8.288 shows part of the tree structure in the left panel. Behind each rule, the
mode of the branch is displayed, that is, the majority target category of the branch
belonging to this element. If the element ends in a leaf, the final classification is
further added, symbolized by an arrow. In Fig. 8.288, for example, the last two rules
define the splitting of a node using the age of a customer. Both elements end in a
leaf, where one assigns the customers a "Bad" credit rating (Age ≤ 29.28) and the other a "Good" credit rating (Age > 29.28).
Fig. 8.289 Visualization of the tree in the Viewer tab of a tree model nugget
For each node, several statistics can be displayed. First, the absolute and relative frequencies of the whole training data belonging to this node are shown at the bottom. In our example, 49.94 % of the training
data are in node 3, which is 839 data records in total. Furthermore, the distribution
of the target variable of the training data in this node is displayed. More precisely,
for each category in the target, the absolute and relative frequencies of the data in
the node are shown. In this example, 70.56 % of the training data from node 3 are
from customers with a “Bad” credit rating. That is 592 in total. See first graph in
Fig. 8.290. Besides the statistics visualization, the node can also be set to show only a bar graph, visualizing the distribution of the data's target variable in the node. See the second graph in Fig. 8.290. As a third option, the nodes in the tree can combine both the statistics and the bar-graph visualization and present them in the tree view.
See last graph in Fig. 8.290.
The choice of visualization just depends on the analyst’s preference, the situa-
tion, and the audience for the results.
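The statistics displayed in a node can be recomputed easily; the following sketch uses made-up labels for a hypothetical node rather than the actual tree_credit figures:

```python
from collections import Counter

def node_statistics(node_labels, n_training_total):
    """Absolute and relative frequencies as displayed inside a tree node."""
    n = len(node_labels)
    classes = {c: (cnt, 100.0 * cnt / n)
               for c, cnt in Counter(node_labels).items()}
    return {"records": n,
            "percent_of_training": 100.0 * n / n_training_total,
            "classes": classes}

# Hypothetical node: 4 of 8 training records, 3 of them rated "Bad"
stats = node_statistics(["Bad", "Bad", "Bad", "Good"], 8)
print(stats["percent_of_training"], stats["classes"]["Bad"])  # 50.0 (3, 75.0)
```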
Here, we present how to build a classifier with the CHAID node. To do this, we reuse the credit rating data from Sect. 8.8.2, where we built a C5.0 decision tree.
The options and structure of the CHAID node are similar to the C&R Tree and
QUEST node and are introduced as representative of these three nodes. The C&R
Tree and QUEST node are described in more detail in the exercises later.
4. In the Build Options tab, the parameters of the tree growing process are specified. In the Objectives panel, we first choose whether to build a new model or continue training an existing model with the new data; the latter will save us from building a completely new one. Furthermore, we can select to build a single decision tree or to use an ensemble model to train
several trees and combine them into a final prediction. The CHAID node
provides a boosting and bagging procedure for creating a tree. For a description
of ensemble models, boosting, and bagging, we refer to Sects. 5.3.6 and 8.8.1.
Here, we select to build a single tree. See Fig. 8.292.
If an ensemble algorithm is selected as the training method, the finer options
of these models can be specified in the Ensemble view. There, the number of
models in the ensemble, as well as the aggregation methods, can be selected. As
this is the same for other classification nodes that provide ensemble techniques,
we refer to Fig. 8.185 and Table 8.7 in the chapter on neural networks (Sect. 8.6)
for further details.
In the Basics view, the tree growing algorithm is defined. See Fig. 8.293. By
default, this is the CHAID algorithm, but the CHAID node provides another
algorithm, the “Exhaustive CHAID” algorithm, which is a modification of the
basic CHAID. We refer to Biggs et al. (1991) and IBM (2015a). Furthermore, in
this panel the maximum tree depth is defined. See bottom arrow in Fig. 8.293.
The default tree depth is five, which means the final tree has at most five levels beneath the root node. The maximum depth of the decision tree can be changed by clicking on the Custom option and inserting the desired depth.
Fig. 8.292 Selection of the general model process, i.e., a single or ensemble tree
Fig. 8.293 The tree growing algorithm is selected in the CHAID node
In the Stopping Rules panel, we define the criteria for when a node stops
splitting and is defined as a leaf. See Fig. 8.294. These criteria are based on the
number of records in the current or child nodes, respectively. If these are too low
in absolute numbers relative to the total data size, the tree will stop branching in
that particular node. Here, we stay with the default settings, which relate the number of data records in the current node (2 %) and in the child nodes (1 %) to the whole dataset. The latter stopping rule comes into effect if one of the child nodes would contain less than 1 % of the whole dataset.
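The default percentage-based stopping rules can be sketched as a simple check; the record counts below are illustrative:

```python
def stop_splitting(n_node, child_sizes, n_total,
                   min_parent_pct=2.0, min_child_pct=1.0):
    """Return True if the node must become a leaf under the default
    percentage-based stopping rules (2 % for the node, 1 % per child)."""
    if 100.0 * n_node / n_total < min_parent_pct:
        return True
    return any(100.0 * c / n_total < min_child_pct for c in child_sizes)

# 10,000 records in total: a node with 150 records (1.5 %) stops splitting
print(stop_splitting(150, [100, 50], 10_000))   # True
# a node with 400 records (4 %) whose children hold 2.5 % and 1.5 % may split
print(stop_splitting(400, [250, 150], 10_000))  # False
```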
In the Cost panel, the misclassification cost can be adjusted. This is useful if
some classification errors are more costly than others. See Fig. 8.295 for the Cost
panel and Sect. 8.8.2 for a description of the misclassification problem and how
to change the default misclassification cost.
In the Advanced view (see Fig. 8.296), the tree building process can be fine-
tuned, mainly the parameters of the algorithm that selects the best splitting
criteria. As these options should only be applied by experienced users, and the explanation of each option would be far beyond the scope of this book, we omit a detailed description here and refer the interested reader to IBM (2015b).
5. We are now finished with the parameter settings in the model building process
and can run the stream. The model nugget appears.
6. Before inspecting the model nugget, we add an Analysis node to the nugget and
run it to view the accuracy and Gini of the model. See Sect. 8.2.6 for a
description of the Analysis node. The output of the Analysis node is displayed
in Fig. 8.297. We notice that the accuracy of the training set and testing set are
both slightly over 80 %, and the Gini is 0.787 and 0.77, respectively. That
indicates quite precise classification with no overfitting.
Fig. 8.296 Advanced option in the tree building process of the CHAID node
Fig. 8.297 Accuracy and Gini of the CHAID decision tree on the “tree_credit” data
Fig. 8.298 Model view of the CHAID model nugget on the credit rating data
The complete tree structure can be viewed in the Viewer tab of the CHAID model nugget, but it is too large to show here.
8.8.5 Exercises
1. Import the “DRUG1n.sav” dataset and divide it into a training (70 %) and testing
(30 %) set.
2. Add a C&R Tree node to the stream. Inspect the node and compare the options
with the CHAID node as described in Sect. 8.8.4. What settings are different?
Try to find out what their purpose is, e.g., by looking them up in IBM (2015b).
3. Build a single CART with the Gini impurity measure as the splitting selection
method. What are the error rates for the training and testing set?
4. Try to figure out why the accuracy of the above tree differs so much between the
training and testing set. To do so, inspect the K and Na variables, e.g., with a
scatterplot. What can be done to improve the model's precision?
5. Create a new variable that describes the ratio of the variables Na and K and
discard Na and K from the new stream. Why does this new variable improve the
prediction properties of the tree?
6. Add another C&R Tree node to the stream and build a model that includes the
ratio variable of Na and K as input variable instead of Na and K separately. Has
the accuracy changed?
1. Import the chess endgame data with a proper Source node and reclassify the "Result for White" variable into a binary field that indicates whether white wins
the game or not. What is the proportion of “draws” in the dataset? Split the data
into 70 % training and 30 % test data. See Exercise 1 in Sect. 8.6.4.
2. Add a QUEST node to the stream and inspect its options. Compare them to the
CHAID node, as described in Sect. 8.8.4, and the C&R Tree node introduced in
Exercise 1. What settings are different? Try to find out what their purpose is, e.g.,
by looking them up in IBM (2015b).
3. Train a decision tree with the QUEST node. Determine the accuracy and Gini
values for the training and test set.
4. Build a second decision tree with another QUEST node. Use the boosting
method to train the model. What are accuracy and Gini values for these models?
Compare them with the first QUEST model.
1. Import the diabetes data and set up a cross-validation stream with training,
testing, and validation set.
2. Build three decision trees with the C&R Tree, CHAID, and C5.0 node.
3. Compare the structures of the three trees to each other. Are they similar to each
other or do they branch completely differently?
4. Calculate appropriate evaluation measures and graphs to measure their predic-
tion performance.
5. Combine these three decision trees into an ensemble model by using the Ensemble node. What are the accuracy and Gini of this ensemble model?
1. Import the data and divide it into 70 % training and 30 % test data.
2. Use two C5.0 nodes to build, respectively, a decision tree and a rule set that
predict the income of a citizen based on the variables collected in the census
study.
3. Compare both approaches to each other by calculating the accuracy and Gini of
both models. Then draw the ROC for each model.
8.8.6 Solutions
The final stream for this exercise should look like the stream in Fig. 8.299.
Fig. 8.300 Enabling and pruning setting in the C&R Tree node
1. First we import the dataset with the Statistics File node and connect it with a
Partition node, followed by a Type node. We open the Partition node and define
70 % of the data as training and 30 % as test set. See Sect. 2.7.7 for a detailed
description of the Partition node.
2. Next, we add a C&R Tree node to the stream and connect it to the Type node.
We open it to inspect the properties and options. We see that compared to the
CHAID node, there are three main differences in the node Build Options.
In the Basics panel, the splitting method selection is missing, but the pruning
process with its parameter can be set. See Fig. 8.300. Remember that pruning
cuts back the fully grown tree in order to counter the problem of overfitting. To
manipulate the pruning algorithm, the maximum risk change can be defined.
Furthermore, the maximum number of surrogates can be changed. Surrogates are
used for handling missing values. For each split, the input field that is most similar to the splitting field is set as its surrogate. If a data record with missing values has to be classified, the surrogate's value is used as input for the missing value. See IBM (2015b) for more details. Increasing the number of surrogates increases the flexibility of the model, but also increases memory usage.
In the Cost & Priors panel, priors can be set for each target category, in addition to the misclassification costs. See Fig. 8.301. The prior probability describes
the relative frequency of the target categories of the total population from which
the training data is drawn. It gives prior information about the target variable and
can be changed if, e.g., the distribution of the target variable in the training data
does not equal the distribution of the population. There are three possible
settings:
– Base on training: This is the default setting and is the distribution of the target
variable in the training data.
Fig. 8.302 Setting of the impurity measure in the C&R Tree node
– Equal for all classes: All target categories appear equally in the population
and therefore have the same prior probability.
– Custom: Specification of customized probabilities for each target category.
The probabilities have to add up to 1. Otherwise, an error will occur.
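One common way such priors enter a tree algorithm is by reweighting the class frequencies in a node. The following sketch (with made-up counts) shows this idea, not the Modeler's exact internal computation:

```python
def class_probabilities(node_counts, train_counts, priors):
    """p(j | node) proportional to prior_j * N(node, j) / N_train(j),
    i.e., the node's class frequencies rescaled by the priors."""
    assert abs(sum(priors.values()) - 1.0) < 1e-9   # priors must add up to 1
    raw = {j: priors[j] * node_counts[j] / train_counts[j] for j in priors}
    total = sum(raw.values())
    return {j: v / total for j, v in raw.items()}

node_counts = {"good": 30, "bad": 10}     # class frequencies in the node
train_counts = {"good": 600, "bad": 400}  # class frequencies in the training data
equal_priors = {"good": 0.5, "bad": 0.5}  # "Equal for all classes" setting
probs = class_probabilities(node_counts, train_counts, equal_priors)
print(round(probs["good"], 4), round(probs["bad"], 4))  # 0.6667 0.3333
```

With "Base on training", the priors equal the training frequencies and the rescaling cancels out; unequal priors shift the class probabilities accordingly.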
Fig. 8.303 Tree structure and variable importance of the CART for the drug exercise
Fig. 8.304 Accuracy of the first CART for the drug exercise
4. To figure out the reason for this discrepancy in the accuracy of the training and
testing set, we add a Plot node to the Type node to draw a scatterplot of the
variables Na and K. We further want to color the data of each drug (target class)
differently. See Sect. 4.2 for how to create a scatterplot with the Plot node. The
scatterplot is shown in Fig. 8.305. We see that the drug Y can be separated from
all other drugs by a line. However, this line is not parallel to the axes. As we learned in Sect. 8.8.1, a decision tree is only able to divide the data space parallel to the axes. This can be a reason for the model to overfit, as it separates the training data perfectly, but the horizontal and vertical decision boundaries found are not sufficient to classify the testing data. A ratio variable of Na and K might therefore be more appropriate and can lead to more precise predictions.
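The effect can be illustrated with a few made-up Na and K values: a single axis-parallel split on the derived ratio separates groups whose true boundary is diagonal in the original Na-K space (the threshold of 3 is also illustrative):

```python
# A decision tree can only split parallel to the axes ("Na < c" or "K < c"),
# so a diagonal class boundary such as Na/K = 3 cannot be matched by a single
# split, while the derived ratio variable turns it into one axis-parallel cut.
data = [
    {"Na": 0.9, "K": 0.2},   # Na/K = 4.5 -> e.g., drug Y
    {"Na": 0.8, "K": 0.1},   # Na/K = 8.0 -> drug Y
    {"Na": 0.6, "K": 0.3},   # Na/K = 2.0 -> other drugs
    {"Na": 0.3, "K": 0.2},   # Na/K = 1.5 -> other drugs
]
for r in data:
    r["Na_K_ratio"] = r["Na"] / r["K"]    # same idea as the Derive node

# a single axis-parallel split on the ratio separates the two groups
print([r["Na_K_ratio"] > 3 for r in data])  # [True, True, False, False]
```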
5. We add a Derive node to the stream and connect it to the Type node to calculate
the ratio variable of Na and K. See Fig. 8.306 for the formula entered in the
Derive node that calculates the new ratio variable.
To descriptively validate the separation power of the new variable
“Na_K_ratio”, we add a Graphboard node to the stream and connect it with
the Derive node. In this node, we select the new variable and choose the Dot plot.
Furthermore, we select the data of the drugs to be plotted in different colors. See
Sect. 4.2 for a description of the Graphboard node. In the Dot Plot in Fig. 8.307,
we can see that the drug Y can be perfectly separated from all other drugs by this new ratio variable.

Fig. 8.308 Filter node to discard Na and K variables from the following stream
We now add a Filter node to the stream and discard the Na and K variable
from the following stream. See Fig. 8.308. Then we add another Type node to
the stream.
6. Then we add another C&R Tree node to the stream and connect it with the last Type node. We choose the same parameter settings as in the first C&R Tree node, except for the input variables. Here, the new variable "Na_K_Ratio" is included instead of the Na and K variables. See Fig. 8.309.
We run the stream, and the model nugget appears. In the nugget, we can immediately see that the new variable "Na_K_Ratio" is the most important predictor, and it is chosen as the field of the root split. See Fig. 8.310. In addition, we notice that the tree is slimmer compared to the first built tree (see Fig. 8.303), meaning that the tree has fewer branches. See the Viewer tab for a visualization of the built tree.
Next we add the standard Analysis node to the model nugget (see Sect. 8.2.6
for details on the Analysis node) and run it. In Fig. 8.311, the accuracy of the
decision tree with the “Na_K_Ratio” variable included is shown. In this model,
the accuracy of the testing data has noticeably improved from about 79 %
(see Fig. 8.304) to more than 98 %. The new variable thus contains a higher
separation ability which improves the prediction power and robustness of a
decision tree.
Fig. 8.309 Tree structure and variable importance of the CART for the drug exercise
Fig. 8.310 Variable importance of the CART for the drug exercise with the new ratio variable
Na_K_Ratio
Fig. 8.311 Accuracy of the CART for the drug data with the new ratio variable NA_K_Ratio
The final stream for this exercise looks like Fig. 8.312.
1. The first part of the exercise is analogous to the first two parts of Exercise 1 in Sect. 8.6.4, so a detailed description is omitted here, and we refer to this solution for importing, reclassifying, and partitioning the data. See Figs. 8.198,
8.199, and 8.200. We recall that a chess game ends in a tie about 10 % of the
time, while in 90 % of the games, white wins. See Fig. 8.199.
2. We now add a QUEST node to the stream and connect it with the Partition node.
Comparing the model node and its options with the CHAID and C&R Tree node,
we notice that the QUEST node is very similar to both of them, especially to the
C&R Tree node. The only difference to the C&R Tree node options appears in
the Advanced panel in the Build Options tab. See Fig. 8.313. In addition to the
overfit prevention setting, the significance level for the splitting selection method can be set. The default here is 0.05. See the solution of Exercise 1 in Sects. 8.8.5 and 8.8.4 for a description of the remaining options and the differences from the other tree nodes. We further refer to the manual IBM (2015b) for additional information on the QUEST node.

Fig. 8.312 Stream of the chess endgame prediction with the QUEST node exercise
3. We run the QUEST node and the model nugget appears. In the model nugget, we
see no splits are executed, and the decision tree consists of only one node, the
root node. Thus, all chess games are classified as having the same ending, “white
wins”. This is also verified by the output of an Analysis node (see Sect. 8.2.6),
which we add to the model nugget. See Fig. 8.314 for the performance statistics
of the QUEST model. We see in the coincidence matrix that all data records
(chess games) are classified as white winning games. With this simple strategy, the model already achieves an accuracy of about 90 %, since white wins about 90 % of the games.

Fig. 8.314 Performance statistics of the QUEST model for the chess endgame data

Fig. 8.315 Definition of a boosted decision tree modeling in the QUEST node
The final stream for this exercise looks like the stream in Fig. 8.317.
Fig. 8.317 Stream of the diabetes detection with decision trees exercise
20 % as test and validation set, respectively. See Sect. 2.7.7 for the description of
the Partition node.
2. To build the three decision tree models, we add a C&R Tree, a CHAID, and a
C5.0 node to the stream and connect each of them with the Type node. The input
and target variable are automatically detected by the three tree nodes since the
variables roles were already set in the Type node.
Here, we use the default settings provided by the SPSS Modeler to build the
models, and simply run the stream to construct the decision trees. We
align the three model nuggets so the data is passed through the three trees
successively to predict whether a patient suffers from diabetes. See Fig. 8.107
for an example of the rearrangement of the model nuggets in a line.
3. Inspecting the three model nuggets and the structures of the constructed decision
trees, we first see that the CART, built by the C&R Tree, consists of only a single
split and is therefore a very simple tree. See Fig. 8.319. This division is done
based on the variable "glucose_concentration". So the Gini coefficient partition criterion is not able to find additional partitions of the data that would improve the accuracy of the model.
The variable “glucose_concentration” is also the first variable according to
which the data are split in the CHAID and C5.0. The complete structures of these
trees (CHAID and C5.0) are shown in Figs. 8.320 and 8.321. These two trees are
more complex with more branches and sub-trees. When comparing the CHAID
and C5.0 trees to each other, we see that the structure and node splits of the
CHAID can be also found in the C5.0. The C5.0 however contains further splits
and thus divides the data into finer segments. So where the CHAID has three levels beneath the root, the C5.0 has five tree levels.
4. At the last model nugget, we add an Analysis and an Evaluation node to
calculate and visualize prediction performance of the three models. See
Sect. 8.2.6 and Fig. 8.28 for a description of these two nodes. We run these
two nodes to view the evaluation statistics. The accuracy and Gini values are
shown in Fig. 8.322. We see that the accuracy, apart from the training set of the C&R Tree, is nearly the same across all models within each dataset. The number of misclassified patients differs by a maximum of 2 data records per dataset among the decision trees. The Gini values, however, paint a different picture: the Gini coefficient of the CART model is significantly lower in all datasets than that of the other two models. The CHAID and C5.0, on the other hand, have nearly the same Gini coefficients, with the C5.0 model slightly ahead in the training and validation set. This indicates that a more complex tree might increase prediction ability.

Fig. 8.322 Accuracy and Gini values of all three decision trees for diabetes prediction
In Fig. 8.323, the Gini values are visualized by the ROCs for the three models and sets. This graph confirms our conjecture that CHAID and C5.0 are pretty similar in prediction performance (the ROCs of these models have nearly the same shape), while the CART curve is located far beneath the other two curves, indicating a less ideal fit to the data.
However, none of the three curves is consistently located above the other two.
So, the three models perform differently in some regions of the data space. Even
the CART outperforms the other two models in some cases. See the curves of the
test set in Fig. 8.323. Thus, an ensemble model of the three trees might improve
the quality of the prediction.
Fig. 8.323 ROC of all three decision trees for diabetes prediction
5. To combine the three trees into an ensemble model, we add an Ensemble node
to the stream and connect it with the last model nugget. In the Ensemble node,
we set “class_variable” as target for the ensemble and choose “Voting” as the
aggregation method. See Fig. 8.324 for the setting of these options, and Table 8.7
for a list of the aggregation methods available in the Ensemble node.
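The "Voting" aggregation itself is simple majority voting over the component predictions; a sketch with an illustrative example:

```python
from collections import Counter

def vote(predictions):
    """'Voting' aggregation: the class predicted by most component models wins.
    Counter.most_common breaks ties by insertion order, i.e., the first model."""
    return Counter(predictions).most_common(1)[0][0]

# hypothetical predictions of the C&R Tree, CHAID, and C5.0 model for a patient
print(vote(["diabetes", "no diabetes", "diabetes"]))  # diabetes
```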
We add another Analysis node to the stream and connect it with the Ensemble
node. Then, we run the Analysis node. The output statistics are presented in
Fig. 8.325. We note that the accuracy has not changed much compared to the other models, but the Gini has increased for the testing and validation set. This indicates that the ensemble model balances the errors of each individual model and is thus more precise in the prediction of unknown data.

Fig. 8.325 Analysis node output statistics for the Ensemble node
The final stream for this exercise looks like the stream in Fig. 8.326.
1. First we import the dataset with the Var. File node and connect it with a Type
node. In the Type node, we set the measurement of the variable “income” to Flag
and the role to target. Then we add a Partition node, open it, and define 70 % of
the data as training and 30 % as test set. See Sect. 2.7.7 for a detailed description
of the Partition node.
2. We now add two C5.0 nodes to the stream and connect them with the Partition
node. Since the roles of the variables are already defined in the Type node, the
C5.0 nodes automatically identify the target and input variables. We use the
default model settings provided by the SPSS Modeler and so just have to change
the output type of one node to “Rule Set”. See Fig. 8.327.
Now, we run the stream and the two model nuggets appear. We then rearrange
them in a line, so the models are applied successively to the data. As a result, the
models can be more easily compared to each other. See Fig. 8.107 for an
example of the rearrangement of model nuggets.
Fig. 8.327 Definition of the rule set output type in the C5.0 node
The final constructed decision tree is very complex and large, with a depth of 23, meaning the rule set contains a large number of individual rules. The rule set as well as the decision tree are too large and complex to describe here and have therefore been omitted.
3. To compare the two models to each other, we first add a Filter node to the stream
and connect it with the last model nugget. This node is added simply to rename
the prediction fields, which are then more easily distinguishable in the Analysis node. See Sect. 2.7.5 for the description of the Filter node.

Fig. 8.328 Analysis node output statistics for the two C5.0 models
Afterwards, we add an Analysis and Evaluation node to the stream and
connect it with the Filter node. See Sect. 8.2.6 and Fig. 8.28 for a description
of the Analysis and Evaluation node options. We then run these two nodes. The
Analysis output with the accuracy and Gini of the two models is shown in
Fig. 8.328. We see that the accuracy of the C5.0 decision tree and the C5.0 rule set model are similar, with about a 12 % error rate in the training set and a 14 % error rate in the test set. However, the decision tree model has a slightly better performance, as additionally confirmed by the Gini values, which are a bit higher for both datasets in this case. This indicates that the decision processes of the rule set and the decision tree are close to each other but have minor differences, which are evident in the evaluation statistics.
The ROCs of the two models are displayed in Fig. 8.329. As can be seen, the
curve of the decision tree model lies slightly above the curve of the C5.0 rule set
model. Hence, the C5.0 decision tree provides better predictive power than the
rule set model.
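The Gini measure reported by the Analysis node is directly related to the area under the ROC curve. A minimal sketch in plain Python (illustrative only, not Modeler code; the confidence scores are made up) shows the pairwise-ranking view of the AUC and the relation Gini = 2·AUC − 1:

```python
# Illustrative sketch: how the Gini value relates to the AUC.
def auc(scores_pos, scores_neg):
    """Probability that a positive record receives a higher score than a
    negative one (ties count 1/2) -- the Mann-Whitney form of the AUC."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# hypothetical model confidences for 3 positive and 3 negative records
pos = [0.9, 0.8, 0.6]
neg = [0.7, 0.4, 0.3]
a = auc(pos, neg)      # 8 of 9 pairs ranked correctly
gini = 2 * a - 1       # Gini coefficient as 2*AUC - 1
print(round(a, 3), round(gini, 3))  # 0.889 0.778
```

A perfect ranking gives AUC = 1 and hence Gini = 1, which is why Gini values near 1 in the outputs below indicate excellent separation of the classes.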
Fig. 8.329 ROC of the two C5.0 models (decision tree and rule set)
8.9 The Auto Classifier Node
As for the regression and clustering methods (Chaps. 5 and 7), the SPSS Modeler
also provides a node, the Auto Classifier node, which comprises several different
classification methods and can thus build various classifiers in a single step. The
Auto Classifier node provides us with the option of trying out and comparing a
variety of different classification approaches without adding the particular node and
setting the algorithm parameters of each model individually, which would otherwise
involve a very complex stream with many different nodes. Finding the optimal
parameters of a method, e.g., the best kernel and its parameters for an SVM or the
number of neurons in a Neural Network, can be extremely cumbersome
and is thus reduced to a very clear process in a single node. Furthermore, the
utilization of the Auto Classifier node is an easy way to consolidate several different
classifiers into an Ensemble model. See Sects. 5.3.6 and 8.8.1 and the references
given there for a description of Ensemble techniques and modeling. All models
built with the Auto Classifier node are automatically evaluated and ranked
according to a predefined measure. So the best performing models can be easily
identified and added to the ensemble.
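This evaluate-rank-select mechanism can be sketched in a few lines of plain Python (hypothetical model names and AUC values; the Auto Classifier node performs this internally):

```python
# Conceptual sketch (not the Modeler API): rank candidate models by an
# evaluation measure and keep the best ones for the ensemble.
candidates = [
    ("C5.0",       0.91),
    ("Logistic",   0.94),
    ("Neural Net", 0.93),
    ("KNN",        0.88),
    ("SVM",        0.95),
]  # (model name, AUC on the test partition) -- made-up values

ranked = sorted(candidates, key=lambda m: m[1], reverse=True)
ensemble = ranked[:3]  # "number of models to use" = 3
print([name for name, _ in ensemble])  # ['SVM', 'Logistic', 'Neural Net']
```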
Besides the classification methods and nodes introduced in this chapter, the Auto
Classifier node also comprises the Bayesian Network and Decision List techniques.
We cite Ben-Gal (2008) and Rivest (1987), respectively, for a description of these
classification algorithms and IBM (2015b) for a detailed introduction of their nodes
and options in the SPSS Modeler. See Fig. 8.330 for a list of all nodes included in
the Auto Classifier node.
Fig. 8.330 Nodes included within the Auto Classifier node. The darker circles are the nodes for
classification models which are described in this chapter. The lighter circles are additional
classification nodes of other models within the Auto Classifier node
Before turning to the description of the Auto Classifier node and how to apply it
to a dataset, we would like to point out that building a huge number of models is
very time consuming. That’s why one must pay attention to the number of different
parameter settings chosen in the Auto Classifier node since a large number of
different parameter values leads to a huge number of models. In this case, the
building process may take a very long time, sometimes hours.
Below, you will learn how to use the Auto Classifier node effectively to build
different classifiers for the same problem in a single step and identify the optimal
models for our data and mining task. A further advantage of this node is its ability to
unite the best classifiers into an ensemble model, combining the strengths of different
classification approaches and counteracting their weaknesses. Furthermore,
cross-validation to find the optimal parameters of a model can easily be carried out
within the same stream. We introduce the Auto Classifier node by applying it to the
Wisconsin breast cancer data, to build classifiers that are able to distinguish benign
from malignant cancer samples.
Fig. 8.332 Definition of target and input variables and the partition field
3. If the variable roles are already defined in a previous Type node, these are
automatically recognized by the Auto Classifier node if the "Use type node
setting" option is chosen in the Fields tab. For
description purposes, we set the variable roles here manually. So we select the
“Use custom settings” option and define “class” as target, “Partition” as
partitioning identification variable, and all remaining variables, except for
“Sample Code”, as input variables. This is shown in Fig. 8.332.
4. In the Model tab, we enable the "Use partitioned data" option. See the top
arrow in Fig. 8.333. With this option, the models are built on the training
data alone.
In the “Rank models by” selection field, we can choose the score that
validates the models and compares them to each other. Possible measures are
listed in Table 8.10. Some of the rank measures are only available for a binary
(Flag) target variable. Here, we chose the “Area under the curve” (AUC) rank
measure.
With the “rank” selection, we can choose whether the models should be
ranked by the training or the test partition, and how many models should be
included in the final ensemble. Here, we elect that the ensemble should have
3 models and is ranked by the calculations on the test set. See the bottom arrow
in Fig. 8.333. At the bottom of the Model tab, we can further set the revenue
and cost values used to calculate the profit. Furthermore, a weight can be
specified to adjust the results. In addition, the percentile considered for the
Lift measure calculations can be set (see Table 8.10 and IBM (2015b)). The
default here is 30.
In the Model tab, we can also choose to calculate the predictor importance,
and we recommend enabling this option each time.
5. The next tab is the "Expert" tab. Here, the classification models that should
be calculated and compared with each other can be specified. See Fig. 8.334.
We can include a classification method by checking its box on the left. All
models marked in this way are built on the training set, compared to each other,
and the best ranked are selected and added to the final ensemble model.
Fig. 8.333 Model tab with the criteria that models should be included in the ensemble
Fig. 8.335 Parameter setting of the Neural Net node in the Auto Classifier node
We can further specify multiple settings for one model type, in order to
include more model variations and to find the best model of one type. Here, we
also want to consider a boosted Neural Network in addition to the standard
approach. How to set the parameter to include this additional Neural Network
in the Auto Classifier node building process is described below.
To include more models of the same type in the building process, we click
on the “Model parameters” field next to the particular model, Neural Net in
this example, and choose the option “Specify” in the opening selection bar
(Fig. 8.334). A window pops up which comprises all options of the particular
node, the Neural Net node in this example. See Fig. 8.335.
In this window, we can specify the parameter combinations which should be
considered in separate models. Each parameter or option can thereby be
assigned multiple values, and the Auto Classifier node then builds a model
for every possible combination of these parameters.
For our case, we also want to consider a boosted neural network. So we click
on the “Options” field next to the “Objectives” parameter and select the
“Specify” option in the drop-down menu. In the pop-up window, we select
the boosting and standard model options. This is shown in Fig. 8.336. Then we
click the OK button. This will enable the Auto Classifier node to build a neural
network with and without boosting. The selected options are shown in the
“Option” field next to the particular “Parameter” field in the main settings
window of the Neural Net node. See Fig. 8.335.
6. Then we go to the "Expert" tab to specify the aggregation method for the
boosting method in the same manner as for the model objective, i.e., boosting
and standard modeling procedure. We choose two different methods here, the
"Voting" and "Highest mean probability" techniques. So, a neural network is
constructed for each of these two aggregation methods (Fig. 8.337).
Fig. 8.336 Specification of the modeling objective type for a neural network
" The Auto Classifier node takes all specified options and parameters of
a particular node and builds a model for each of the combinations.
For example, in the Neural Net node the modeling objective is chosen
as “standard” and “boosting”, and the aggregation methods “Voting”
and “Highest mean probability” are selected. Although the aggrega-
tion methods are only relevant for a boosting model, 4 different
models are created by the Auto Classifier node:
Fig. 8.337 Specification of the aggregation methods for the boosting model in the Neural
Net algorithm settings
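How the Auto Classifier node expands multiple option values into separate models can be sketched with a Cartesian product (a minimal illustration in Python, not Modeler code):

```python
# Every combination of the specified option values yields one model.
from itertools import product

objectives   = ["standard", "boosting"]
aggregations = ["Voting", "Highest mean probability"]

combos = list(product(objectives, aggregations))
print(len(combos))  # 4 models are built
for obj, agg in combos:
    print(obj, "/", agg)
```

This is why the number of built models grows multiplicatively with each additional option value and can quickly become large.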
8. Rules that a model has to fulfill to be considered as a candidate for the ensemble
can be specified in the Discard tab of the Auto Classifier node. If a model fails
to satisfy one of these criteria, it is automatically discarded from the subsequent
process of ranking and comparison.
Fig. 8.338 Definition of the discard criteria in the Auto Classifier node
The Discard tab and its options are shown in Fig. 8.338. The discarding
criteria comprise the ranking criteria, i.e., Overall accuracy, Number of fields,
Area under the curve, Lift, and Profit. In our example case of the Wisconsin
breast cancer data, we discard all models that have an accuracy lower than 80 %,
so that the final model has a minimum hit rate. See Fig. 8.338.
9. In the Settings tab, the aggregation method can be selected: this combines the
outputs of all component models in the ensemble generated by the Auto
Classifier node into a final prediction. See Fig. 8.339. The most important aggregation
methods are listed in Table 8.7. Besides these methods, the Auto Classifier
node provides weighted voting methods like “Confidence-weighted voting”
and “Raw propensity-weighted voting”. See IBM (2015b) for details on these
methods. We select the "Confidence-weighted voting" here. The ensemble
method can also be changed later in the model nugget. See Fig. 8.343.
10. When we have set all the model parameters and Auto Classifier options of our
choice, we run the model, and the model nugget appears in the stream. For each
possible combination of selected model parameter options, the Modeler now
generates a classifier, all of which are compared to each other and then ranked
according to the specified criteria. If a model is ranked among the top three
here, it is included in the ensemble. The description of the model nugget can
be found in Sect. 8.9.2.
11. We add an Analysis node to the model nugget to calculate the evaluation
statistics, i.e., accuracy and Gini. See Sect. 8.2.6 for the description of the
Analysis node and its options.
Figure 8.340 shows the output of the Analysis node. We see that the accuracies
in both the training and the test set are high, at about 97 %. Furthermore, the Gini
values are nearly 1, which indicates an excellent prediction ability with the
inserted variables.
Fig. 8.339 Definition of the aggregation method for the ensemble model
Fig. 8.340 Analysis output with evaluation statistics from both the training and the test data
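The "Confidence-weighted voting" aggregation selected in the Settings tab can be illustrated with a small sketch (plain Python with hypothetical component predictions, not the Modeler's internal code): each model's vote is weighted by its confidence, so a single very confident model can outvote two less confident ones.

```python
from collections import defaultdict

# hypothetical (class, confidence) predictions of three ensemble members
predictions = [
    ("malignant", 0.51),
    ("benign",    0.99),
    ("malignant", 0.40),
]

weights = defaultdict(float)
for label, conf in predictions:
    weights[label] += conf  # sum the confidences per predicted class

# plain voting would pick "malignant" (2 votes to 1); confidence
# weighting picks "benign" (0.99 vs. 0.51 + 0.40 = 0.91)
final = max(weights, key=weights.get)
print(final)  # benign
```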
In this short section, we will take a closer look at the model nugget generated by the
Auto Classifier node and the graphs and options it provides.
Fig. 8.341 Model tab of the Auto Classifier model nugget. Specification of the models in the
ensemble used to predict target class
Fig. 8.342 Graph tab of the Auto Classifier model nugget. Predictor importance and bar plot that
shows the accuracy of the ensemble model prediction
In the leftmost column, labeled "Use?", we can choose which of the models
should contribute to the ensemble model. More precisely, each of the enabled
models takes the input data and estimates the target value individually. Then, all
outputs are aggregated according to the specified method in the Auto Classifier
node to one single output. This process of aggregating can prevent overfitting and
minimize the impact of outliers, which will lead to more reliable predictions.
To the left of the models, the distribution of the target variable and the predicted
outcome is shown for each model individually. Each graph can be viewed in a
larger, separate window by double-clicking on it.
Fig. 8.343 Settings tab of the Auto Classifier model nugget. Specification of the ensemble
method
8.9.3 Exercises
Exercise 1: Finding the best models for credit rating with the Auto
Classifier node
The “tree_credit” dataset (see Sect. 10.1.33) comprises demographic and loan data
history of bank customers as well as a prognosis for giving a credit (“good” or
“bad”). Determine the best classifiers to predict the credit rating of a bank customer
with the Auto Classifier node. Use the AUC measure to rank the models. What is the
best model node and its AUC value, as suggested by the Auto Classifier procedure?
Combine the top five models to create an ensemble model. What is its accuracy and
AUC?
types in a new variable value which only indicates that the patient has cancer. Use
the Auto Classifier node to determine the best kernel function to be considered in
the SVM. What are the AUC values and which kernel function should be used in the
final SVM classifier?
8.9.4 Solutions
Exercise 1: Finding the best models for credit rating with the Auto
Classifier node
Name of the solution stream: tree_credit_auto_classfier_node
Theory discussed in Sect. 8.9.1
The final stream for this exercise looks like the stream in Fig. 8.344.
Fig. 8.344 Stream of the credit rating prediction with the Auto Classifier node exercise
2. We add a Partition node to the stream and place it between the Source and Type
node. In the Partition node, we declare 70 % of the data as training and the
remaining 30 % as test data. We then open the Type node and define the
measurement type of the variable “Credit rating” as Flag and its role as target.
3. Now we add an Auto Classifier node to the stream and connect it with the Type
node. The variable roles are automatically identified. This means that nothing
has to be changed in the Fields tab settings.
4. In the Model tab, we select the AUC as the ranking criterion and set the number
of models to use to 5, since the final ensemble model should comprise 5 different
classifiers. See Fig. 8.346.
5. In the Expert tab, we add the SVM to the models that should be considered in the
building and ranking process by checking the box next to the SVM model type.
See Fig. 8.347.
Fig. 8.346 Definition of the ranking criteria and number of used models
Fig. 8.347 Selection of the models to be considered in the building and ranking process. Adding
of SVM to this list
Fig. 8.349 Top five classifiers to predict the credit rating of a bank customer built by the Auto
Classifier node
Fig. 8.350 Analysis node with performance statistics of the ensemble model that classifies
customers according to their credit rating
9. To evaluate the performance of the ensemble model that comprises these five
models, we add an Analysis node to the stream and connect it with the model
nugget. We refer to Sect. 8.2.6 for information on the Analysis node options.
Figure 8.350 presents the accuracy and AUC of the ensemble model. Like all of
the individual component models, the ensemble model has an accuracy a little
above 80 % for both the training and the test set. The AUC for the test set, at
0.887, is in the same range as that of the best ranked model, i.e., the logistic
regression.
The stream displayed in Fig. 8.351 is the complete solution of this exercise.
Fig. 8.351 Complete stream of the best kernel finding procedure for the SVM leukemia detection
classifier
Fig. 8.352 Sub stream of data preparation of the solution stream of this exercise
1. The first part of this exercise is the same as in Exercise 1 in Sect. 8.5.4. We
therefore omit a detailed description of the importation, partitioning, and
reclassification of the data into healthy and leukemia patients, and instead refer
to the first steps of the solution of the above-mentioned exercise. After following
the steps of this solution, the stream should look like that in Fig. 8.352. This is
our new starting point.
2. We add an Auto Classifier node to the stream and connect it with the last Type
node. In the Auto Classifier node, we select the "Area under the curve" ranking
criterion in the Model tab and set the number of models to use to 4, as four kernel
functions are provided by the SPSS Modeler. See Fig. 8.353.
3. In the Expert tab, we check the box next to the SVM model type and uncheck the
boxes of all other model types. See Fig. 8.354. We then click on the model
parameter field of the SVM and select “Specify”. See arrow in Fig. 8.354.
4. The parameter option window of the SVM node opens, and we go to the Expert
tab. There, we change the Mode parameter to "Expert" so that all other options
become changeable. Afterwards, we click on the "Options" field of the Kernel
type parameter and click on "Specify". In the pop-up window which appears, we
can select the kernel methods that should be considered in the building process
of the Auto Classifier node. As we want to identify the best among all kernel
functions, we check all the boxes: the RBF, Polynomial, Sigmoid, and Linear
kernels. See Fig. 8.355. We now click the OK buttons until we are back at the
Auto Classifier node. For each of these kernels, an SVM is constructed, which
means four in total. This is displayed in the Expert tab, see Fig. 8.354.
Fig. 8.353 Selection of the rank criteria and number of considered models
5. As the target variable and input variables are already specified in a Type node,
the Auto Classifier node identifies them and we can run the stream without
additional specifications.
6. We open the model nugget that has appeared to inspect the evaluation and the
ranking of the four SVMs with different kernels in the Model tab (Fig. 8.356).
We see that the SVM named "SVM 2" has the highest AUC value, which is
0.953. This model is the SVM with a polynomial kernel. The values of the
models "SVM 1" (RBF kernel) and "SVM 4" (linear kernel), at 0.942 and 0.925,
respectively, are not far from that of the polynomial kernel SVM. The AUC
value of the remaining SVM (sigmoid kernel), however, is considerably lower
at 0.66. Thus, the prediction quality of this last model is not as good as that of
the other three. By looking at the bar plot of each model, we see that the
sigmoid kernel model classifies all patients as leukemia patients, whereas the
other three models are able to recognize healthy patients. This explains its
much lower AUC.
Fig. 8.354 Selection of the SVM model in the Auto Classifier node
7. To recap, the SVM with a polynomial kernel function has the best performance
in detecting leukemia from gene expression data, and the Modeler suggests
using this kernel in an SVM model for this problem. However, the RBF
and Linear kernel models are nearly as good and are thus also appropriate
choices.
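For reference, the four kernel types compared here can be written out as simple functions (a sketch in their common libsvm-style form; the hyperparameters gamma, r, and d are hypothetical defaults, not the values used by the Modeler's SVM node):

```python
import math

# The four kernel functions compared in this exercise, as plain functions.
def linear(x, y):
    return sum(a * b for a, b in zip(x, y))

def polynomial(x, y, gamma=1.0, r=0.0, d=3):
    return (gamma * linear(x, y) + r) ** d

def rbf(x, y, gamma=1.0):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def sigmoid(x, y, gamma=1.0, r=0.0):
    return math.tanh(gamma * linear(x, y) + r)

x, y = [1.0, 0.0], [0.5, 0.5]
print(linear(x, y), polynomial(x, y), rbf(x, y), sigmoid(x, y))
```

The kernel determines how similarity between two records is measured, which is why the choice of kernel can change the classifier's quality so drastically, as seen with the sigmoid kernel above.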
Fig. 8.355 Specification of the kernel functions considered during the model building process of
the Auto Classifier node
Literature
Allison, P. D. (2014). Measures of fit for logistic regression. Accessed 19/09/2015, from http://
support.sas.com/resources/papers/proceedings14/1485-2014.pdf
Azzalini, A., & Scarpa, B. (2012). Data analysis and data mining: An introduction. Oxford:
Oxford University Press.
Ben-Gal, I. (2008). Bayesian Networks. In F. Ruggeri, R. S. Kenett, & F. W. Faltin (Eds.),
Encyclopedia of statistics in quality and reliability. Chichester, UK: Wiley.
Beyer, K., Goldstein, J., Ramakrishnan, R., & Shaft, U. (1999). When is “Nearest Neighbor”
meaningful? In G. Goos, J. Hartmanis, J. van Leeuwen, C. Beeri, & P. Buneman (Eds.),
Database Theory—ICDT’99, Lecture notes in computer science (Vol. 1540, pp. 217–235).
Berlin: Springer.
Biggs, D., de Ville, B., & Suen, E. (1991). A method of choosing multiway partitions for
classification and decision trees. Journal of Applied Statistics, 18(1), 49–62.
Breiman, L., Friedman, J., Stone, C. J., & Olshen, R. A. (1984). Classification and regression
trees. Boca Raton, FL: CRC Press.
Cheng, B., & Titterington, D. M. (1994). Neural Networks: A review from a statistical perspective.
Statistical Science, 9(1), 2–30.
Cormen, T. H. (2009). Introduction to algorithms. Cambridge: MIT Press.
Esposito, F., Malerba, D., Semeraro, G., & Kay, J. (1997). A comparative analysis of methods for
pruning decision trees. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19
(5), 476–493.
Fahrmeir, L. (2013). Regression: Models, methods and applications. Berlin: Springer.
Fisher, R. A. (1936). The use of multiple measurement in taxonomic problems. Annals of
Eugenics, 7(2), 179–188.
Han, J., Kamber, M., & Pei, J. (2012). Data mining: Concepts and techniques, The Morgan
Kaufmann series in data management systems (3rd ed.). Waltham, MA: Morgan Kaufmann.
He, H., & Garcia, E. A. (2009). Learning from imbalanced data. IEEE Transactions on
Knowledge and Data Engineering, 21(9), 1263–1284.
IBM. (2015a). SPSS Modeler 17 Algorithms Guide. Accessed 18/09/2015, from ftp://public.dhe.ibm.com/software/analytics/spss/documentation/modeler/17.0/en/AlgorithmsGuide.pdf
IBM. (2015b). SPSS Modeler 17 Modeling Nodes. Accessed 18/09/2015, from ftp://public.dhe.ibm.com/software/analytics/spss/documentation/modeler/17.0/en/ModelerModelingNodes.pdf
IBM. (2015c). SPSS Modeler 17 Source, Process, and Output Nodes. Accessed 19/03/2015, from ftp://public.dhe.ibm.com/software/analytics/spss/documentation/modeler/17.0/en/ModelerSPOnodes.pdf
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning
(Vol. 103). New York: Springer.
Kanji, G. K. (2009). 100 statistical tests (3rd ed.). London: Sage (reprinted).
Kass, G. V. (1980). An exploratory technique for investigating large quantities of categorical data.
Applied Statistics, 29(2), 119.
Kuhn, M., & Johnson, K. (2013). Applied predictive modeling. New York: Springer.
Lantz, B. (2013). Machine learning with R: Learn how to use R to apply powerful machine
learning methods and gain an insight into real-world applications, Community experience
distilled. Birmingham: Packt Publishing.
Loh, W.-Y., & Shih, Y.-S. (1997). Split selection methods for classification trees. Statistica Sinica,
7(4), 815–840.
Machine Learning Repository. (1998). Optical recognition of handwritten digits. Accessed 2015,
from https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits
Niedermeyer, E., Schomer, D. L., & Lopes da Silva, F. H. (2011). Niedermeyer's
electroencephalography: Basic principles, clinical applications, and related fields (6th ed.). Philadelphia:
Wolters Kluwer/Lippincott Williams & Wilkins Health.
Oh, S.-H., Lee, Y.-R., & Kim, H.-N. (2014). A novel EEG feature extraction method using Hjorth
parameter. International Journal of Electronics and Electrical Engineering, 2(2), 106–110.
Peterson, L. E. (2009). K-nearest neighbor. Scholarpedia, 4, 1883.
Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1(1), 81–106.
Quinlan, J. R. (1993). C4.5: Programs for machine learning, The Morgan Kaufmann series in
machine learning. San Mateo, CA: Morgan Kaufmann.
R Core Team. (2014). R: A language and environment for statistical computing. http://www.R-project.org/
Rivest, R. (1987). Learning decision lists. Machine Learning, 2(3), 229–246.
RStudio Team. (2015). RStudio: Integrated development environment for R. http://www.rstudio.com/
Runkler, T. A. (2012). Data analytics: Models and algorithms for intelligent data analysis.
Wiesbaden: Springer Vieweg.
Schölkopf, B., & Smola, A. J. (2002). Learning with kernels: Support vector machines,
regularization, optimization, and beyond, Adaptive computation and machine learning. Cambridge,
MA: MIT Press.
Tuffery, S. (2011). Data mining and statistics for decision making, Wiley series in computational
statistics. Chichester: Wiley.
Welch, B. L. (1939). Note on discriminant functions. Biometrika, 31, 218–220.
Wolberg, W. H., & Mangasarian, O. L. (1990). Multisurface method of pattern separation for
medical diagnosis applied to breast cytology. Proceedings of the National Academy of
Sciences, 87(23), 9193–9196.
Wu, X., Kumar, V., Ross Quinlan, J., Ghosh, J., Yang, Q., Motoda, H., McLachlan, G. J., Ng, A.,
Liu, B., Yu, P. S., Zhou, Z.-H., Steinbach, M., Hand, D. J., & Steinberg, D. (2008). Top
10 algorithms in data mining. Knowledge and Information Systems, 14(1), 1–37.
Zhou, Z.-H. (2012). Ensemble methods: Foundations and algorithms (Chapman & Hall/CRC
machine learning & pattern recognition series). Boca Raton, FL: Taylor & Francis.
9 Using R with the Modeler
1. Explain how to connect the IBM SPSS Modeler with R and why this can be
helpful,
2. Describe the advantages of implementing R features in a Modeler stream,
3. Extend a stream by using the correct node to incorporate R procedures, as
well as
4. Use R features to determine the best transformation of a variable towards
normality.
The SPSS Modeler provides a wide range of algorithms, procedures, and options to
build statistical models. In most instances, the Modeler offers appropriate options
for creating models and preparing data, and these are easy to understand and
intuitive to use. Why, then, does IBM offer the user the option of implementing R
functionalities in the SPSS Modeler graphical environment? There are several
answers to this question:
1. It allows users with R knowledge to switch to the SPSS Modeler and structure
the analysis process in a clearer way.
2. The user can sometimes modify data more easily by using the R language.
3. The variety of advanced and flexible graphical functions provided by R can be
used to visualize the data in more meaningful plots.
Fig. 9.1 Interaction of the IBM SPSS Modeler and R via the "IBM SPSS Modeler Essentials for R": the objects modelerData and modelerDataModel link the user's dataset ('test_scores.sav') to the R language and the packages/libraries that extend its functionalities
4. As with any other software, the SPSS Modeler and R offer different options to
analyze data and to create models. Sometimes it may also be helpful to assess the
fit of a model by using R.
5. R can be extended by a wide range of libraries that researchers all over
the world have implemented. In this way, R is constantly updated, easier to
modify, and better at coping with specific modeling challenges.
6. Embedding R in the SPSS Modeler has the overall benefit of combining two
powerful analytics packages, so the strengths of each can be used in an analysis.
Each statistical software package has its advantages, and the option to use R
functionalities within the IBM SPSS Modeler as well gives the user the chance to
look at the same data from different angles and to use the best method provided by
both packages.
The aim of this chapter is to explain the most important steps in using R
functionalities and implementing the correct code in the SPSS Modeler. We will
have a look at how to install R and the IBM SPSS R Essentials. Furthermore, we
will discuss the R nodes of the Modeler that use the R functionalities and present
the results to the user. Figure 9.1 depicts the interaction of both programs by accessing
the same dataset "test_scores".
The authors do not intend to explain the details of the R language here, because
there is an overwhelming number of different options and functionalities that are
beyond the scope of this book.
9.2 Connecting with R
In order to use R with the Modeler, we have to install the IBM SPSS Modeler
Essentials for R. This is the Modeler toolbox for connecting with R, as shown in
Fig. 9.1. It links the data of the Modeler and of R so that both applications have
access to the data and can exchange it.
Here, we will present the steps to set up the IBM SPSS R Essentials. Additionally,
we want to use a stream to test the connection with the R engine. A detailed
description of the installation procedure can also be found in IBM (2014a).
Assumptions
1. The R Essentials and therefore R can only be used with the Professional or
Premium Version of the Modeler.
2. The R Version 3.1.0 must be installed on the computer, and the folder of this
installation must be known, e.g., "C:\Program Files\R\R-3.1.0".
DOWNLOAD: http://cran.r-project.org/bin/windows/base/old/3.1.0/
3. The folder “C:\Program Files\IBM\SPSS\Modeler\17\ext\bin\pasw.rstats” must
exist.
" In order to use R with the Modeler, the "IBM SPSS Modeler—Essentials
for R" must be installed. The reader should distinguish between "IBM
SPSS Statistics—Essentials for R" and "IBM SPSS Modeler—Essentials
for R". The latter must be used. Furthermore, it is
essential to start the setup program as administrator! Details can be
found in the following detailed description and in IBM (2014a).
1. Download the “IBM SPSS Modeler—Essentials for R”. See IBM Website
(2015). The version of the Modeler and the version for the Essentials are
corresponding. So if the Modeler Version 17 is being installed, then Essentials
version 17 should be used too.
Depending on the operating system and the Modeler version, we must make
sure to use the correct 32 or 64 bit version. The correct name for the 64 bit
Microsoft Windows version is “SPSS_Modeler_REssentials_17.0_Win64”.
2. We must make sure not to start the install program directly after using the IBM
download program. Instead, we strongly recommend making a note of the folder
where the download is saved and terminating the original download procedure
after the file has been saved.
Then we navigate to the folder with the setup program
"SPSS_Modeler_REssentials_17.0_Win64.exe". We have to start the setup as administrator. To do so,
we click the file with the right mouse button and then choose the option "Run as
Administrator".
3. After unzipping the files, the setup program comes up and asks us to choose
the correct language (Fig. 9.2).
4. We read the introduction and accept the license agreement.
5. Then we make sure to define the correct R folder (Fig. 9.3).
Fig. 9.2 Setup process “IBM SPSS Modeler—Essentials for R” initial step
Fig. 9.3 Setup process “IBM SPSS Modeler—Essentials for R”—define the R folder
6. As suggested by the setup program, we have to determine the path to the
"pasw.rstats" extension. In the previous steps, we verified that this folder exists.
Figure 9.4 shows an example. The user may find that the standard folder offered
in this dialog window is not correct and must be modified.
7. We carefully check the summary of the settings as shown in Fig. 9.5.
8. At the end, the setup program tells us that the procedure was successfully
completed (Fig. 9.6).
Fig. 9.4 Setup process “IBM SPSS Modeler—Essentials for R”—define the pasw.rstats folder
Fig. 9.5 Setup process “IBM SPSS Modeler—Essentials for R”—Pre-Installation summary
Fig. 9.6 Setup process “IBM SPSS Modeler—Essentials for R”—Installation summary
9.3 Test the SPSS Modeler Connection to R
To test that the R Essentials were installed successfully and the R Engine can also
be used from the Modeler, we suggest taking the following steps.
1. We open the stream "R_Connect_Test.str". In Fig. 9.7, we can see that there is a
User Input node as well as an R Transform node.
2. In the User Input node, a variable called "Test_Variable" is defined, and the
value of the variable is just 1. We use this very simple node because we then do
not have to connect the stream to a complicated data source, whose link might be
missing and make the stream harder to use.
3. If we right(!)-click on the left Table node, we can use "Run" to see which
variables are defined so far (Fig. 9.8).
4. As expected, there is one variable and one record. The value of the variable
“Test_Variable” is 1.
5. We can close the window with “OK”.
6. To finish the procedure, we right-click (!) the right Table node (see Fig. 9.9) and use “Run” to start the calculation.
7. The table has been modified: the new value is simply “Test_Variable + 1”. If we can see the new value as shown in Fig. 9.10, the Modeler is successfully connected with R.
" The R Transform node enables R to grab the SPSS Modeler data and to
modify the values stored in an object called “modelerData” using a
script.
" Not all operations, e.g., sorting or aggregation, are available in the Modeler R nodes. For further details, see also IBM (2014b).
We have not yet covered the usage of the R language in the Transform node
itself. If we double click the R Transform node in Fig. 9.9, we can find the first short
R script as shown in Fig. 9.11.
9.3 Test the SPSS Modeler Connection to R 993
The SPSS Modeler Essentials link the two objects “modelerData” and “modelerDataModel” from the Modeler to R and back. As shown in Fig. 9.1, the Modeler copies the information to “modelerData”. The R script modifies the value(s), and in the Modeler we can see the new values in the stream by looking at the Table node. Any other variables defined in R will not be recognized by the Modeler. See IBM (2015), p. 4.
The object “modelerDataModel” contains the structure of the object
“modelerData”. See also IBM (2015), p. 10–11. Because the structure of
“modelerData” is not being modified here, the object “modelerDataModel” does
not have to be addressed in the script.
As long as we only modify the variable previously defined and do not add a column or change the name of a column, we do not have to add more commands to the script. In the next example, we will show how to deal with data frames modified by R and how to make sure the correct values appear in the Modeler.
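The first script in Fig. 9.11 boils down to a single assignment. The following standalone sketch mimics what the R Transform node does in this test stream; since it runs outside the Modeler, the data frame that the node would inject as “modelerData” is simulated by hand, and the actual script in the stream may differ slightly:

```r
# stand-in for the User Input node: one variable, one record, value 1
modelerData <- data.frame(Test_Variable = 1)

# what the R Transform node effectively does in this test stream
modelerData$Test_Variable <- modelerData$Test_Variable + 1

print(modelerData$Test_Variable)  # 2
```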
Related exercises: 1
9.4 Calculating New Variables in R

By using a new stream, we will now look at the data transport mechanism from the Modeler to R and back. We will describe the analysis of the stream step by step.
1. We show the predefined variables and their values in the dataset “salary_simple.sav” by double-clicking on the Table node on the left. Then we click “Run”. We find 5000 values in the column “salary”. See Fig. 9.12. We can close the window of the Table node now.
2. In the Type node, we can see that the variable “salary” is defined as continuous and metrical and that its role is “Input”. This is also shown in Fig. 9.13.
3. Now we double-click on the R Transform node. In the dialog window shown in
Fig. 9.14, we can see the R script that transforms the data. The commands used
here are explained in Table 9.1.
The tab “Console Output” in the R Transform node shows the internal calculation results of R. After running the Transform node, the user can find the error message below the last line shown in Fig. 9.15, which helps to identify the R command that has to be modified.
4. We can see the values of the three variables “salary”, “bonus”, and “new_salary”
in the Table node at the end of the stream (see Fig. 9.16).
" In the R Transform node, the dialog window in the tab “Console
Output” helps the user to identify R commands that are not correct
and must be modified.
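The actual commands of the stream are listed in Table 9.1. Purely as an illustration, a transform script of this kind could look as follows; the 10 % bonus rule and the sample salaries are assumptions made for this sketch, not the values of “salary_simple.sav”:

```r
# stand-in for the dataset the Modeler passes in as modelerData
modelerData <- data.frame(salary = c(30000, 45000, 52000))

# derive a bonus (hypothetical rule: 10 % of the salary) and a new salary
bonus      <- modelerData$salary * 0.10
new_salary <- modelerData$salary + bonus

# attach both new columns to the data frame
modelerData <- cbind(modelerData, bonus, new_salary)
print(modelerData)
```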
Related exercises: 2, 3, 4
With this equation, we are now able to predict the outcome of the final exam if we
know the student’s pretest result.
We want to demonstrate here how to use R to get the same result so that we can
learn how to deal with R in the SPSS Modeler and can verify at the same time
whether the determined parameters are equal. We do not want to build the stream
step by step here, because defining the necessary R scripts manually would be just a copy-and-paste job. It is more convenient to analyze the nodes used in the final stream and to understand how it works. Based on this knowledge, we can define other streams and R scripts correctly too.

9.5 Model Building in R

Fig. 9.18 Determined parameter of the linear regression function using the Linear node
Table 9.3 R Script in R model building syntax field of the R Building node in the stream
“R_simple_linear_regression.str”
Command Explanation
1 # correlation of the model variables
cor(modelerData$pretest,modelerData$posttest)
The correlation between the pretest and the posttest score of the students will be
calculated. The result will be shown in the Text Output tab of the R model nugget.
See Fig. 9.23.
2 # fit the model
modelerModel<-lm(posttest~pretest,data=modelerData)
The parameters of the linear regression model are determined. The model is
assigned to the variable “modelerModel”.
3 # show the results
summary(modelerModel)
A detailed statistic will be shown in the Text Output tab of the R model nugget. See
Fig. 9.23.
4 # diagnostic plots
plot(modelerModel)
Several plots such as the residual plot will be shown in the Graph Output tab of the
R model nugget. See Fig. 9.24.
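Put together, the four commands of Table 9.3 form the complete model building script. The following standalone version runs on simulated pretest/posttest scores instead of the “test_scores.sav” data, so its numbers differ from those in the book:

```r
# simulated scores standing in for test_scores.sav
set.seed(1)
pretest  <- rnorm(50, mean = 60, sd = 10)
posttest <- pretest + rnorm(50, sd = 3)
modelerData <- data.frame(pretest, posttest)

# correlation of the model variables
print(cor(modelerData$pretest, modelerData$posttest))
# fit the model
modelerModel <- lm(posttest ~ pretest, data = modelerData)
# show the results
print(summary(modelerModel))
# diagnostic plots (sent to the active graphics device)
plot(modelerModel)
```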
Table 9.4 R Script in R model scoring syntax field of the R Building node in the stream
“R_simple_linear_regression.str”
Command Explanation
1 # calculating forecast
result<-predict(modelerModel,newdata=modelerData)
The final scores of the students are determined from the pretest results by using the fitted linear regression model. The values are saved in the variable “result”.
2 # attach results
modelerData<-cbind(modelerData,result)
The data frame “modelerData” links the original values of the “test_scores.sav” dataset from the SPSS Modeler to R. See Sect. 9.4. Here, this data frame is extended by a new column “result” containing the predicted values of the final test results.
3 # define characteristic of new variable
var1<-c(fieldName="posttest_prediction",fieldLabel="",fieldStorage="real",
fieldMeasure="",fieldFormat="",fieldRole="")
The details of the new column “result” are defined. Here, the name is
“posttest_prediction”.
4 # return to the Modeler and import the new data frame
modelerDataModel<-data.frame(modelerDataModel,var1)
The object “modelerDataModel” now describes the original variables plus the new column “posttest_prediction”. See also IBM (2015), p. 10. The name of the new variable or column and its scale type are defined by using the information in the previously defined dummy variable “var1”.
used to predict the final test scores of the students based on the parameter saved
in the variable “modelerModel”. The original dataset is being extended by the
predicted values in the column “posttest_prediction”. Table 9.4 shows the
detailed description of the commands.
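Assembled from the rows of Table 9.4, the complete scoring script of the R Building node reads as follows. This is a fragment for the node itself: “modelerModel”, “modelerData”, and “modelerDataModel” are provided by the Modeler at run time, so the snippet is not runnable on its own:

```r
# calculating forecast with the fitted model
result <- predict(modelerModel, newdata = modelerData)
# attach results
modelerData <- cbind(modelerData, result)
# define characteristics of the new variable
var1 <- c(fieldName = "posttest_prediction", fieldLabel = "", fieldStorage = "real",
          fieldMeasure = "", fieldFormat = "", fieldRole = "")
# announce the new column to the Modeler
modelerDataModel <- data.frame(modelerDataModel, var1)
```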
5. In the Model Options tab shown in Fig. 9.22, several options of the variable and
output handling can be determined. The options are self-explanatory. A detailed
description can be found in IBM (2015), p. 3.
We can run the scripts of the R Building node, and an R model nugget will appear in the middle of the stream. See Fig. 9.19. We open the R model nugget by double-clicking it. Figure 9.23 shows the Text Output tab of the nugget. Here, all the results that the statistics package R would print to its console are shown.
We can see the correlation coefficient of the pretest vs. the final test result. Its value of +0.9508843 indicates a very strong positive linear relationship between the two variables.
Furthermore, we can find the details of the determined linear regression model printed by the R command “summary(modelerModel)”, explained in row 3 of Table 9.3. The determined model parameters are equal to the results previously obtained with the SPSS Modeler’s Linear node. See Fig. 9.18.
6. By activating the Graph Output tab of the R model nugget, we can find several
residual plots as shown in Fig. 9.24. We can close the R model nugget node by
clicking OK.
7. The predicted test scores are shown in the Table node at the end of the stream.
See Fig. 9.25.
8. Besides all the nodes mentioned so far, we can find the R Output node at the bottom of the stream in Fig. 9.17. Here, the same commands as explained in Table 9.3 are used to determine the parameters of the linear regression model and to print the results (see Fig. 9.26). The residual plots are also created. Running this node, we get the same results in the Text Output tab and in the Graph Output tab as in the R Building node. See Fig. 9.27.
By fitting the linear regression model to predict the final test scores, we analyzed the R Building node, the R model nugget, as well as the R Output node. We found that the statistical parameters determined by R were equal to those determined by the Modeler itself. The statistics produced by R are detailed, and the residual plots are also useful. So R can help the user to assess a model in more detail.
" The R Output node can be used to calculate several statistics or to fit models. The output is similar to the text and graphics output of the R Building node, but the calculated results are not linked back to the Modeler.
9.6 Exercises
In Sect. 2.7.2, we introduced the Derive node of the SPSS Modeler to do more or less simple calculations. We created the stream “simple_calculations.str”. It is based on the dataset “IT_user_satisfaction.sav”. Here, respondents should assess the quality of an IT system. They stated the number of training days they had last year (variable “training_days_actual”) and the number of days they would like to add (variable “training_days_to_add”) to improve their skills in using the IT resources. By adding both variables in the stream, a variable “training_expected_total_1” was derived as the total number of days the user expects overall. The same is done using another method for the variable “training_expected_total_2”. In this exercise, the stream should be extended by using an R node to get the same results, but calculated in R.
2. Save the stream using a different name. The final solution stream is named
“R_simple_calculations.str”.
3. Add an appropriate R node mentioned in Table 9.2 and connect it with the
Type node.
4. Calculate values of a new variable “training_expected_total_3”.
5. Show the results in a Table node.
1. Open the stream “pca_nutrition_habits.str”. The stream shown in Fig. 9.28 was created and explained in Sect. 6.3.2 to perform a PCA. We want to use this stream as a starting point to determine the correlation matrix of the input variables.
2. Save the stream using another name. The final stream here is “R_correlation_nutrition_habits.str”.
3. Now remove all nodes in the rectangle in Fig. 9.28. They are used to perform the PCA, which we do not need here.
4. Pearson’s correlations can be calculated if the input variables are defined as metrical/continuous. So this scale type must be assigned to all variables, even though they are ordinal in this case. Add a Type node on the right and make sure the variables are defined as metrical/continuous.
5. Now add an appropriate R node as mentioned in Table 9.2 to calculate the
correlation matrix in R.
Remarks:
Using the command
round(cor(modelerData,modelerData),3)
determine the values of the correlation matrix.
The function “cor(modelerData,modelerData)” calculates the correlations between all variables defined in the dataset modelerData. These are the variables of the original dataset “nutrition_habites.sav”, without the “ID” variable that is filtered out in the Filter node.
The function “round(xxx,3)” rounds all the correlations to three decimal places.
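Outside the stream, the effect of the two functions can be checked on a toy data frame (the variable names here are arbitrary):

```r
# two perfectly correlated toy variables
d <- data.frame(x = c(1, 2, 3, 4), y = c(2, 4, 6, 8))

# correlation matrix, rounded to three decimal places
print(round(cor(d, d), 3))  # every entry is 1
```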
6. Execute the different sub-streams and compare the correlations determined by the Sim Fit node and the R node.
Researchers can create new libraries and offer them as packages of functions and procedures to all other R users, who can download them and use them to perform similar calculations without having to program the same procedures again. The library “Hmisc” contains useful functions for data analysis. See CRAN—Package Hmisc (2015) for more details.
We will show you how to add the library to R and additionally how to define our
own functions and how to use them in an R script.
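A minimal sketch of the idea, assuming the “Hmisc” package has already been installed: its function “rcorr” returns the correlation coefficients together with the corresponding p-values, which is what makes marking significant correlations possible. Inside the stream, “as.matrix(modelerData)” would take the place of the toy matrix:

```r
library(Hmisc)  # assumes install.packages("Hmisc") has been run once

# toy matrix standing in for as.matrix(modelerData)
m <- cbind(a = c(1, 2, 3, 4, 5), b = c(2, 1, 4, 3, 5))
rcorr(m)  # prints the matrices of r, n, and the p-values
```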
Now paste the R script into the “R Output syntax” script field of the predefined R Output node as shown in Fig. 9.32. The old command “round(cor(modelerData,modelerData),3)” is also included in the first row, so that the difference between both calculations is easier to see.
5. Run the R Output node and compare the results of both correlation matrices.
Fig. 9.32 R Output node with the extended R script to determine correlation matrix
In Sect. 5.3, we built a linear regression model with the Linear node. Based on data from
the real estate market in Boston, the model predicts the median value of owner-occupied
homes based on multiple input variables. For a description of the variables see Sect.
10.1.17. Here, we want to use R to create the same model and to compare the results.
the Linear node and activate the tab “Build Options”. In the section “Basics”
disable the option “Automatically prepare data”. See Fig. 9.34.
6. Run the Linear node to update the model nugget and the parameters of the model, now without data preparation.
7. Open the model nugget and activate the coefficients view on the left. You
should find the parameters as shown in Fig. 9.35. Close this window.
8. Now add an R Building node and connect it with the existing Type node.
9. Determine the appropriate R commands to . . .
(a) determine the model parameter by using the command
lm(MEDV~CRIM+xxx,data=modelerData)
The dummy “xxx” in this command must be substituted by the correct
variable names equal to the variables used in the Linear node.
(b) Save the parameter by using the variable “modelerModel”.
(c) Print a model summary.
(d) Create residual plots.
10. Add R commands so the values of MEDV are predicted by using the MLR
model.
11. Add a Table node to show the predicted values.
12. Compare the determined parameter of the model with those determined by
using the Linear node.
13. Add an appropriate comment to the R nodes.
1. Open the stream “transform_diabetes.str”. This stream should be the basis for
implementation of the Box–Cox transformation and the normality tests.
2. Save the stream using a different name. The final solution stream is named
“R_transform_diabetes.str”.
3. Review the R script “R_transform_diabetes_data.R” in the R scripts folder that
should be the basis for the implementation here.
4. Find and add an R node mentioned in Table 9.2 that is appropriate to implement
the transformation function as well as the normality tests as shown in the R script
mentioned above. Keep in mind that the R script produces the test results both as text output to the console and as diagrams.
Connect the new node with the existing Type node.
5. Using the R script “R_transform_diabetes_data.R” implement the Box–Cox
transformations and normality tests (Kolmogorov–Smirnov with Lilliefors mod-
ification and Shapiro–Wilk normality test) in the SPSS Modeler stream.
6. Interpret the results.
9.7 Solutions
3. Following the characterization of the different R nodes in Table 9.2, here we use
an R Transform node from the Modeler Record Ops tab. Figure 9.36 shows the
node at the bottom.
4. The R script in Table 9.6 can be used to calculate the new variable “training_expected_total_3”. The structure of the script equals that of the R Building node implemented in the stream “R_simple_linear_regression.str”. An explanation of the commands can be found in Sect. 9.5, e.g., Table 9.4. Figure 9.37 shows the R Transform node with the R script.
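A sketch of such a Transform script, using the variable names from the exercise; the script in the solution stream may be organized differently:

```r
# sum of the actual and the additionally requested training days
training_expected_total_3 <- modelerData$training_days_actual +
                             modelerData$training_days_to_add
# attach the new column and announce it to the Modeler
modelerData <- cbind(modelerData, training_expected_total_3)
var1 <- c(fieldName = "training_expected_total_3", fieldLabel = "",
          fieldStorage = "real", fieldMeasure = "", fieldFormat = "", fieldRole = "")
modelerDataModel <- data.frame(modelerDataModel, var1)
```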
5. To show the results in a Table node, we add a node at the end of the stream.
Figure 9.38 shows the final stream and Fig. 9.39 shows the calculated new values
of the variable “training_expected_total_3”.
Fig. 9.37 R Transform node with the R script to calculate the new variable
Fig. 9.40 Added new Type node to redefine the scale of measurement of the variables
5. We can use an R Building node and define the command in the dialog field “R model building syntax” (see, e.g., Fig. 9.21). But here we neither need to modify the original dataset nor link a modified data frame back to the Modeler. Therefore, we can also use an R Output node from the Modeler’s Output tab. We add it to the stream and paste the command mentioned above into the “R Output syntax” dialog field of the node. See Fig. 9.42.
6. As shown in Figs. 9.43 and 9.44: apart from their order, the determined
correlations are the same.
Fig. 9.44 Correlation Matrix determined with the Sim Fit node
Running the R Output node, we can compare the results of the two correlation matrices determined in R, as shown in Fig. 9.45. The second matrix is obviously easier to interpret: significant correlations are marked.
3. We add a Table node and connect it to the model nugget. Figure 9.46 shows the
stream with the new Table node and Fig. 9.47 shows the predicted values.
4. We explained in Sect. 2.4 how to add a comment to a stream. Here, we can’t assign the comment to a specific node, so we do not mark any node in advance. We add the comment by using the toolbar. The result is shown below (Fig. 9.48).
5. In the exercise, we described how to disable the automatic data preparation in
the Linear node. See Fig. 9.34.
6. Running the Linear node, we update the model nugget.
7. We get the parameter as shown in Fig. 9.35.
8. We added an R Building node to the stream and connected it with the existing
Type node (Fig. 9.49).
9. Figure 9.50 shows, in the upper dialog field, the R script in the “R model building syntax” field. The correct lm command to fit the model is:
modelerModel<-lm(MEDV~CRIM+ZN+INDUS+CHAS+NOX+RM+AGE
+DIS+RAD+TAX+PTRATIO+B+LSTAT,data=modelerData)
Fig. 9.46 Stream with added Table node to show the predicted values
10. To predict the MEDV values by using the determined R model, we can use the
same script as explained in detail in Table 9.4. We only have to modify the
name of the variable from “posttest_prediction” to “MEDV_prediction” as
highlighted with an arrow in Fig. 9.50.
11. The predicted values are saved in the data frame “modelerData”. We add a Table node to show the predicted values. See Fig. 9.51. The predicted values shown in Fig. 9.52 are the same as the results in Fig. 9.47.
12. The determined parameters of the R model in Fig. 9.53 equal the model parameters of the Linear node shown in Fig. 9.35.
13. Figures 9.54 and 9.55 show the final streams with the comment related to the R
Building, R model nugget, and the Table node.
4. Following Table 9.2, we use here an R Building node. Figure 9.56 shows the
final stream “R_transform_diabetes.str”.
5. To determine the optimal transformation, the R scripts in Tables 9.7 and 9.8 are used. These scripts equal those in the R script “R_transform_diabetes_data.R”; some modifications for the handling of the data are necessary. The comments in the script explain the commands. First, all required libraries are installed in R. Then these libraries are loaded.
In Table 9.7, we define a function “my.bc.transform”. The Box–Cox transformation is used to determine the optimal exponent to transform the data towards normality. Then the original as well as the transformed variable are tested with the Kolmogorov–Smirnov test with Lilliefors modification and the Shapiro–Wilk normality test.
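The core of the Box–Cox step can be reproduced standalone with the “MASS” package; the skewed sample data below are simulated and merely stand in for the diabetes values:

```r
library(MASS)  # ships with the standard R distribution

set.seed(7)
x <- rexp(200, rate = 1) + 0.1        # right-skewed, strictly positive data

bc <- boxcox(x ~ 1, plotit = FALSE)   # profile log-likelihood over a lambda grid
best.lambda <- bc$x[which.max(bc$y)]  # lambda with the highest likelihood
cat("estimated transformation parameter:", best.lambda, "\n")
```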
In the main part of the script in Table 9.8, the data from the Modeler are copied to an object “my.data” with the command “my.data<-modelerData”. In the script, we do not have to address the object “modelerDataModel”, because we do not modify the object “modelerData”. The R Building node therefore only performs the calculations and tests and writes the output to the console. There are no variables that have to be returned to the SPSS Modeler.
Finally, in Table 9.8 the function “my.bc.transform” is applied to the variables “glucose_concentration”, “blood_pressure”, “serum_insulin”, “BMI”, and “diabetes_pedigree”. Compared to the given R script “R_transform_diabetes_data”, the variable names here are slightly different. That is because the function “spss.system.file”, used in the R script to import the SPSS file, cuts off longer variable names. In the SPSS Modeler, the variable names must be used as they can be found in the Table node.
6. A detailed description of the functionalities can be found in Sect. 3.2.5. The
results mentioned there for the variable “Serum Insulin” are also shown in
Table 9.7 Function “my.bc.transform” in the R Script to perform the Box–Cox transforma-
tion and the normality tests in the stream “R_transform_diabetes”
# user-defined function for transformation and tests
my.bc.transform <- function(org.var, var.name="")
{
print(var.name)
par(mfrow=c(1, 1))
# show Log-Likelihood profile
boxcox(org.var~1)
# determine best lambda for box cox
bc.best.power<-powerTransform(org.var)
cat("Estimated transformation parameter: ",bc.best.power$lambda,"\n\n")
# transform original variable with box-cox
bc.best.data<-bcPower(org.var,bc.best.power$lambda)
par(mfrow=c(1, 2))
# create histograms
hist(org.var, main=var.name)
hist(bc.best.data, main=paste(var.name," transformed", sep=""))
# normal probability plot for original variable
qqnorm(org.var, main=var.name)
qqnorm(bc.best.data, main=paste(var.name," transformed", sep=""))
# === test normality
# Lilliefors test
# H0: normally distributed
print("original variable: ")
print(lillieTest(org.var))
print("transformed variable: ")
print(lillieTest(bc.best.data))
# === perform Shapiro-Wilk tests
# H0: normally distributed
print("original variable: ")
print(shapiro.test(org.var))
print("transformed variable: ")
print(shapiro.test(bc.best.data))
# restore old parameter
par(mfrow=c(1, 2))
}
Figs. 9.57 and 9.58. Additionally, the Kolmogorov–Smirnov test result can be found here. The null hypothesis is that the values are normally distributed. We can reject this hypothesis based on the result “p < 2.2E-16” for the original variable. The p-value for the transformed values of “Serum Insulin” is much better (“p = 0.4443”, not shown in Fig. 9.57), so normality cannot be rejected for the transformed variable. The Shapiro–Wilk normality test shows broadly the same results for this variable.
Table 9.8 Main part of R Script to perform the Box–Cox transformation and the normality
tests in the stream “R_transform_diabetes”
# Automatically install all necessary packages in R
# Source: https://gist.github.com/benmarwick/5054846
list.of.packages <- c("Matrix","stats","car","MASS","fBasics")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages)
lapply(new.packages, require, character.only=T)
# loading the required libraries
require(Matrix)
require(stats)
require(car)
require(MASS)
require(fBasics)
# === determine variable to transform
my.data<-modelerData
my.bc.transform(my.data$glucose_concentration, var.name="Glucose")
my.bc.transform(my.data$blood_pressure, var.name="Blood Pressure")
my.bc.transform(my.data$serum_insulin, var.name="Serum Insulin")
my.bc.transform(my.data$BMI, var.name="BMI")
my.bc.transform(my.data$diabetes_pedigree, var.name="Diabetes Pedigree")
Fig. 9.57 Text Output of the R Building node for the variable “Serum Insulin”
Fig. 9.58 Graph Output of the R Building node for the variable “Serum Insulin”
Literature
CRAN – Package Hmisc. (2015). Package Hmisc. Accessed 13/08/2015, from https://cran.r-project.org/web/packages/Hmisc/index.html
IBM. (2014a). SPSS Modeler 16 Essentials for R: Installation Instructions. Accessed 18/09/2015, from ftp://public.dhe.ibm.com/software/analytics/spss/documentation/modeler/16.0/en/modeler_r_plugin_install_project_book.pdf
IBM. (2014b). SPSS Modeler 16 R Nodes. Accessed 18/09/2015, from ftp://public.dhe.ibm.com/software/analytics/spss/documentation/modeler/16.0/en/modeler_r_nodes_book.pdf
IBM. (2015). SPSS Modeler 17 R Nodes. Accessed 18/09/2015, from ftp://public.dhe.ibm.com/software/analytics/spss/documentation/modeler/17.0/en/ModelerRnodes.pdf
IBM Website. (2015). Downloads for IBM® SPSS® Modeler. Accessed 06/08/2015, from https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/We70df3195ec8_4f95_9773_42e448fa9029/page/Downloads%20for%20IBM%C2%AE%20%20SPSS%C2%AE%20%20Modeler
myowelt.blogspot.de. (2015). R script correlation matrix improved. Accessed 13/08/2015, from http://myowelt.blogspot.de/2008/04/beautiful-correlation-tables-in-r.html
10 Appendix

10.1 Data Sets Used in This Book
10.1.1 adult_income_data.txt
The dataset was downloaded from the UCI Machine Learning Repository (1996) and contains census data of 32,651 people. The data were originally extracted from a census bureau database. The variables in the dataset are listed in Table 10.1.
10.1.2 beer.sav
10.1.3 benchmark.xlsx
This dataset includes the performance test results of personal computer processors
published in c’t Magazine for IT Technology (2008). The names of the processors
can be found alongside the names of the manufacturers Intel and AMD. The
processor speed was determined using the “Cinebench” benchmark test (see
Table 10.3).
10.1.4 car_simple.sav
The data represent the prices of six cars in different size categories. The dataset
includes the name of the manufacturer, the type of car, and the price. We formally
declare that the prices are not representative of the models and types mentioned.
Table 10.4 shows the values.
This dataset was created based on an idea presented in Handl (2010), p. 364–383.
10.1.5 car_sales_modified.sav
10.1.6 chess_endgame_data.txt
The dataset was downloaded from the UCI Machine Learning Repository (1994) and contains data of 28,056 chess endgame black-to-move positions with a white king and rook against a black king. For those not familiar with chess: in this position only two outcomes are possible, a White win or a draw. The dataset comprises seven variables; six of them describe the positions of the three pieces on the board, and the last one gives the number of moves White needs to win, or indicates a draw. If White has not won the game within 16 moves, the game automatically ends in a draw (Table 10.6).
10.1.7 customer_bank_data.csv
This dataset was created by merging data found on the IBM Website (2014). The
records describe several characteristics of the customers of a bank. Additionally,
the customers are marked as having defaulted, or not. See Table 10.7 for details.
10.1.8 diabetes_data_reduced.sav
This dataset comes from the Machine Learning Repository (1990) and originally
from the National Institute of Diabetes and Digestive and Kidney Diseases. It
represents the data from a study of the Pima Indian population. The Pima Indians
are affected by higher rates of diabetes and obesity than the average. See Schulz
et al. (2006).
The included variables, as well as their meaning, can be found in Table 10.8. We
removed all records with missing values in any of the variables and converted the
data into an SPSS-data file. 392 records are included. All patients are female and at
least 21 years old.
10.1.9 DRUG1n.sav
The dataset contains data of a drug treatment study. The patients in this study all suffer from the same illness but respond to different medications. The data are provided with the SPSS Modeler as the basis of a demo; see IBM (2015), p. 73. The variables in the dataset are listed in Table 10.9.
10.1.10 EEG_Sleep_Signals.csv
The dataset contains EEG signal data of a single person in a drowsy and an awake state. The electrical impulses of the brain are measured every 10 ms, and the data are split into segments of 30 s. For information on EEG, we refer to Niedermeyer et al. (2011) (Table 10.10).
These Microsoft Excel datasets were generated for demonstration purposes only.
The sample sizes are three records each. Tables 10.11 and 10.12 show the structure
of the sets.
The source of these data is the UK Office for National Statistics and its website
NOMIS UK (2014). These data are based on an annual survey of hours and
earnings—workplace analysis coming from the Annual Survey of Hours and
Earnings (ASHE).
The median of the weekly or hourly payments has been downloaded. Because the variables are medians, they are not additive: e.g., the median of the weekly payment excluding overtime plus the median of the overtime payment does not equal the median of the weekly gross payment. Table 10.13 shows the variables from the CSV file for 2013. Table 10.14 shows the variables for 2014.
The payments for female and male employees in 2014 are included in the Microsoft Excel files “england_payment_fulltime_female_2014” and “england_payment_fulltime_male_2014”. The variables are the same as those described in Tables 10.13, 10.14, and 10.15.
The coefficient of variation is described on the website NOMIS UK (2014) as
follows:
10.1.13 Features_eeg_signals.csv
The dataset contains aggregated data which are obtained from the EEG dataset
“EEG_Sleep_Signal.csv” (Sect. 10.1.10). The features were calculated for each
EEG signal segment of the mentioned dataset. The first three features are called
Hjorth, see Niedermeyer et al. (2011) and Oh et al. (2014) (Table 10.17).
10.1.14 gene_expression_leukemia.csv
The dataset contains gene expression of various leukemia patients. The data were
measured on 851 positions on the human genome which refer to the genes from a
list of cancer related genes, which were consolidated in Futreal et al. (2004). The
dataset here is a subset of the large leukemia dataset that was the basis of the study by Haferlach et al. (2010). In this study, however, gene expression measurements of more genes were included. The subset here was downloaded from the open source
project Leukemia Gene Atlas (Hebestreit et al. 2012), which is an open repository of leukemia datasets and studies (Table 10.18).
The term gene expression and its relation to gene regulation are explained in O’Connor and Adams (2010). For basic background information on leukemia, we refer to the National Cancer Institute (2013).
10.1.15 gene_expression_leukemia_short.csv
10.1.16 gravity_constant_data.csv
This Microsoft Excel dataset has been generated for demonstration purposes
only. The sample sizes are three records each. Table 10.20 shows the structure of
the sets.
10.1.17 Housing.data.txt
The dataset was downloaded from the Machine Learning Repository (1993). It
contains housing values for certain Boston suburbs. Details can be found in
Harrison and Rubinfeld (1978) and Gilley and Pace (1996). The variables included
are described in Table 10.21.
10.1.18 Iris.csv
This dataset contains the sepal and petal width and length of 50 flowers for each of
the three iris species. It is included in R, version 3.1.0, and can be shown with the R command “data(iris)”. For details, see Fisher (1936) and Longley
(1967). Table 10.22 explains the variables and their meaning.
10.1.19 IT-projects.txt
This dataset includes several variables related to IT Projects. For details, see
Table 10.23.
10.1.20 IT_user_satisfaction.sav

This dataset represents the opinions of IT users in a particular firm. 180 users were
asked to assess the quality of the IT system. The questionnaire used to collect the
data is based on Heinrich (2002b). A detailed description can be found in Heinrich
(2002a). The dataset used here is a modified version, first presented by Wendler
(2004). For details, see Table 10.24.
10.1.21 longley.csv
This dataset is included in the R package datasets, version 3.1.0, and can be loaded with the R command “data(longley)”. For details, see Longley (1967).
The data are macroeconomic data with several yearly observed economic variables from 1947 to 1962. See Table 10.25.
10.1.22 LPGA2009.csv
This dataset from the Journal of Statistical Education Data Archive (2009) contains performance and success statistics for golfers on the LPGA tour in 2009. See Table 10.26 for more details.
10.1.23 Mtcars.csv
This dataset is included in the R package datasets, version 3.1.0, and can be loaded with the R command “data(mtcars)”. For details, see Henderson and Velleman (1981).
The data represent ten performance and design parameters, as well as the fuel consumption, of 32 automobiles from 1974 (see Table 10.27).
10.1.24 nutrition_habites.sav
The key idea of this dataset, and some of the steps and interpretations found in this book, are based on explanations in Bühl (2012). The authors, however, created a completely new dataset of their own.
In relation to the “diet types”:
– Vegetarian
– Low meat
– Fast food
– Filling
– Hearty
the consumers were asked “Please indicate which of the following dietary
characteristics describe your preferences. How often do you eat . . .”.
The respondents could rate their preferences on the scale “(very) often”, “sometimes”, and “never”.
The ID is an ordinal variable, because the values can be ordered; but since the role of this variable has been defined as “none”, its scale type in fact does not matter. All the other variables are coded as follows: “1 = never”, “2 = sometimes”, and “3 = (very) often”. These variables are ordinally scaled.
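The ordinal coding described above can be made explicit in code. Only the mapping (1 = never, 2 = sometimes, 3 = (very) often) is taken from the text; the example responses below are invented for illustration:

```python
# Ordinal coding of the dietary variables, as described in the text.
CODES = {1: "never", 2: "sometimes", 3: "(very) often"}

def decode(responses):
    """Translate numeric answer codes into their labels."""
    return [CODES[r] for r in responses]

# Invented example: one respondent's answers for the five diet types.
answers = [3, 1, 2, 2, 1]
labels = decode(answers)
print(labels[0])     # (very) often

# Because the coding is ordinal, numeric comparisons respect the
# ordering never < sometimes < (very) often.
print(max(answers))  # 3
```

Keeping the numeric codes (rather than the text labels) is what allows ordinal operations such as sorting or taking a maximum, while the lookup table recovers the labels for reporting.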
10.1.26 Orthodont.csv
The dataset contains measurements of orthodontic change (distance) over time for 27 teenagers. It is included in the R package “nlme”, version 3.1.0, and can be loaded with the R command “data(Orthodont, package = 'nlme')”. See Potthoff and Roy (1964) for details. Table 10.29 explains the variables and their meaning.
10.1.27 Ozone.csv
This dataset contains meteorological data and ozone concentrations from the Los Angeles Basin in 1976. It is included in the R package “faraway”, version 3.1.0, and can be loaded with the R command “data(ozone, package = 'faraway')”. See Breiman and Friedman (1985) for details. Table 10.30 explains the variables and their meaning.
10.1.28 pisa2012_math_q45.sav
This dataset is based on OECD (2012b). To reduce the number of records and variables, the authors of this book preprocessed the data. The sample size is 551. The responses relate to the question “Thinking about mathematical concepts: how familiar are you with the following terms?” Details can be found in Table 10.31. Table 10.32 shows the coding of the answers.
10.1.29 sales_list.sav
This dataset was created by the authors of this book, based on an idea presented in
IBM (2014b), p. 57. For details, see Table 10.33.
10.1.30 ships.csv
This dataset is included in the R package MASS, version 3.1.0, and can be loaded with the R command “data(ships, package = 'MASS')”. For details, see McCullagh and Nelder (1983).
The data give the number of incidents, the year of construction, the aggregated months of service, and the period of operation for 40 ships. See Table 10.34 for details.
10.1.31 test_scores.sav
The dataset “test_scores.sav” comes with IBM SPSS Modeler Version 16. Table 10.35 shows the fields and a short description. See also IBM (2014c).
10.1.32 Titanic.xlsx
The dataset contains information about the Titanic passengers, including whether they survived the sinking. The data do not contain crew information. The data were collected by Thomas Cason; see Vanderbilt University School of Medicine (2004). Table 10.36 lists the variables of the dataset and their meaning.
10.1.33 tree_credit.sav
10.1.34 wine_data.txt
The dataset was downloaded from the UCI Machine Learning Repository (Machine Learning Repository 1991) and contains chemical analysis data on three Italian wines from different cultivators. In the analysis of the wines, 13 indicators were determined for each of the three wines (Table 10.38).
10.1.35 WisconsinBreastCancerData.csv
The dataset was downloaded from the UCI Machine Learning Repository (Machine Learning Repository 1992). The dataset, originally from Wolberg (2003), represents medically relevant values determined to diagnose breast cancer. See also Wolberg and Mangasarian (1990) for details. The variables are described in Table 10.39.
10.1.36 z_pm_customer1.sav
Literature
Beer-Shop-Hamburg. (2014). Beer from all over the world. Accessed 26/08/2014, from http://
www.biershop-hamburg.de/Biere-aus-aller-Welt-17
Breiman, L., & Friedman, J. H. (1985). Estimating optimal transformations for multiple regression
and correlation. Journal of the American Statistical Association, 80(391), 580–598.
Bühl, A. (2012). SPSS 20: Einführung in die moderne Datenanalyse, Scientific tools (13th ed.).
München: Pearson.
c’t Magazine for IT Technology. (2008). CPU-Wegweiser: x86-Prozessoren im Überblick, Vol.
2008 No. 7, pp. 178–182.
Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2), 179–188.
Futreal, P. A., Coin, L., Marshall, M., Down, T., Hubbard, T., Wooster, R., Rahman, N., &
Stratton, M. R. (2004). A census of human cancer genes. Nature Reviews Cancer, 4(3),
177–183.
Gilley, O. W., & Pace, R. (1996). On the Harrison and Rubinfeld Data. Journal of Environmental
Economics and Management, 31(3), 403–405.
Haferlach, T., Kohlmann, A., Wieczorek, L., Basso, G., Kronnie, G. T., Béné, M.-C., de Vos, J.,
Hernández, J. M., Hofmann, W.-K., Mills, K. I., Gilkes, A., Chiaretti, S., Shurtleff, S. A.,
Kipps, T. J., Rassenti, L. Z., Yeoh, A. E., Papenhausen, P. R., Liu, W.-M., Williams, P. M., &
Foà, R. (2010). Clinical utility of microarray-based gene expression profiling in the diagnosis
and subclassification of leukemia: report from the International Microarray Innovations in
Leukemia Study Group. Journal of Clinical Oncology Official Journal of the American Society
of Clinical Oncology, 28(15), 2529–2537.
Handl, A. (2010). Multivariate Analysemethoden: Theorie und Praxis multivariater Verfahren unter besonderer Berücksichtigung von S-PLUS, Statistik und ihre Anwendungen (2nd ed.).
Heidelberg: Springer.
Harrison, D., & Rubinfeld, D. L. (1978). Hedonic housing prices and the demand for clean air.
Journal of Environmental Economics and Management, 5(1), 81–102.
Hebestreit, K., Gröttrup, S., Emden, D., Veerkamp, J., Ruckert, C., Klein, H.-U., Müller-Tidow,
C., Dugas, M., & Speletas, M. (2012). Leukemia Gene Atlas – A Public Platform for Integra-
tive Exploration of Genome-Wide Molecular Data. PLoS One, 7(6), e39148.
Heinrich, L. J. (2002a). Informationsmanagement: Planung, Überwachung und Steuerung der Informationsinfrastruktur, Wirtschaftsinformatik (7th ed.). München: Oldenbourg.
Heinrich, L. J. (2002b). Questionnaire for a success factor analysis in SME.
Henderson, H. V., & Velleman, P. F. (1981). Building multiple regression models interactively.
Biometrics, 37, 391–411.
Hoffmann-Beverages. (2014). Beverage-details. Accessed 27/08/2014, from http://www.
getraenke-hoffmann.de/download/durstexpress/DurstExpress_Katalog.pdf
IBM. (2014a). SPSS Modeler 16 Applications Guide. Accessed 18/09/2015, from ftp://public.dhe.
ibm.com/software/analytics/spss/documentation/modeler/16.0/en/modeler_applications_guide_book.pdf
IBM. (2014b). SPSS Modeler 16 Source, Process, and Output Nodes. Accessed 18/09/2015, from
ftp://public.dhe.ibm.com/software/analytics/spss/documentation/modeler/16.0/en/
modeler_nodes_general.pdf
IBM. (2014c). Test scores dataset. Accessed 18/09/2015, from http://www-01.ibm.com/support/
knowledgecenter/SSLVMB_22.0.0/com.ibm.spss.statistics.cs/components/glmm/glmm_testscores_intro.htm
IBM. (2015). SPSS Modeler 17 Applications guide. ftp://public.dhe.ibm.com/software/analytics/
spss/documentation/modeler/17.0/en/ModelerApplications.pdf
IBM Website. (2014). Customer segmentation analytics with IBM SPSS. Accessed 08/05/2015,
from http://www.ibm.com/developerworks/library/ba-spss-pds-db2luw/index.html
Journal of Statistical Education Data Archive. (2009). LPGA Performance Statistics for 2009.
Accessed 12/06/2015, from http://www.stat.ufl.edu/~winner/data/lpga2009.dat
Longley, J. W. (1967). An appraisal of least squares programs for the electronic computer from the
point of view of the user. Journal of the American Statistical Association, 62(319), 819–841.
Lichman, M. (2013). UCI Machine learning repository. http://archive.ics.uci.edu/ml
Machine Learning Repository. (1990). Pima Indians Diabetes Data Set. Accessed 18/09/2015,
from http://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes
Machine Learning Repository. (1991). Wine Data Set. Accessed 2015, from http://archive.ics.uci.
edu/ml/datasets/Wine
Machine Learning Repository. (1992). Breast Cancer Wisconsin (Original) Data Set. Accessed
29/10/2015, from https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%
28Original%29
Machine Learning Repository. (1993). Boston Housing Data Set. Accessed 12/06/2015, from
https://archive.ics.uci.edu/ml/datasets/Housing
Machine Learning Repository. (1994). Chess Endgame Database for White King and Rook against
Black King (KRK). Accessed 2015, from https://archive.ics.uci.edu/ml/datasets/Chess+(King-
Rook+vs.+King)
Machine Learning Repository. (1998). Optical Recognition of Handwritten Digits. Accessed 2015,
from https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits
McCullagh, P., & Nelder, J. A. (1983). Generalized linear models, Monographs on statistics and
applied probability. London: Chapman and Hall.
National Cancer Institute. (2013). What you need to know about leukemia, NIH publication,
no. 13-3775. Revised September 2013, digital edition.
Niedermeyer, E., Schomer, D. L., & Lopes da Silva, F. H. (2011). Niedermeyer’s electroencepha-
lography: Basic principles, clinical applications, and related fields (6th ed.). Philadelphia:
Wolters Kluwer/Lippincott Williams & Wilkins Health.
NOMIS UK. (2014). Official Labour Market Statistics – Annual Survey of Hours and Earnings –
Workplace Analysis. Accessed 18/09/2015, from http://nmtest.dur.ac.uk/
O’Connor, C. M., & Adams, J. U. (2010). Essentials of cell biology. Cambridge, MA: NPG
Education.
OECD. (2012a). PISA 2012 Technical Report.
OECD. (2012b). Programme for International Student Assessment (PISA) 2012. Accessed 02/03/
2015, from http://pisa2012.acer.edu.au/downloads.php
Oh, S.-H., Lee, Y.-R., & Kim, H.-N. (2014). A novel EEG feature extraction method using Hjorth
parameter. International Journal of Electronics and Electrical Engineering, 2(2), 106–110.
Potthoff, R. F., & Roy, S. N. (1964). A generalized multivariate analysis of variance model useful
especially for growth curve problems. Biometrika, 51, 313–326.
Schulz, L. O., Bennett, P. H., Ravussin, E., Kidd, J. R., Kidd, K. K., Esparza, J., & Valencia, M. E.
(2006). Effects of traditional and western environments on prevalence of type 2 diabetes in
Pima Indians in Mexico and the U.S. Diabetes Care, 29(8), 1866–1871.
Smith, J. W., Everhart, J. E., Dickson, W. C., Knowler, W. C., & Johannes, R. S. (1988). Using the
ADAP learning algorithm to forecast the onset of diabetes mellitus. Proceedings of the Annual
Symposium on Computer Application in Medical Care, 261–265.
Stacey, K., & Turner, R. (2015). Assessing mathematical literacy: The PISA experience.
UCI Machine Learning Repository. (1996). UCI Machine Learning Repository – Adult Data Set.
Accessed 12/09/2015, from https://archive.ics.uci.edu/ml/datasets/Adult
Vanderbilt University School of Medicine. (2004). Department of Biostatistics – Titanic Data.
Accessed 12/09/2015, from http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic.
html
Wendler, T. (2004). Modellierung und Bewertung von IT-Kosten: Empirische Analyse mit Hilfe
multivariater mathematischer Methoden, Wirtschaftsinformatik. Wiesbaden: Deutscher
Universitäts-Verlag.
Wolberg, W. H. (2003). Wisconsin breast cancer data. Accessed 12/06/2015, from http://www.
stat.yale.edu/~pollard/Courses/230.spring03/WBC/
Wolberg, W. H., & Mangasarian, O. L. (1990). Multisurface method of pattern separation for
medical diagnosis applied to breast cytology. Proceedings of the National Academy of Sciences of the USA, 87(23), 9193–9196.