
Tilo Wendler · Sören Gröttrup

Data Mining with SPSS Modeler
Theory, Exercises and Solutions

Tilo Wendler
HTW Berlin, University of Applied Sciences
Berlin, Germany

Sören Gröttrup
Stuttgart, Germany

ISBN 978-3-319-28707-2 ISBN 978-3-319-28709-6 (eBook)


DOI 10.1007/978-3-319-28709-6

Library of Congress Control Number: 2016941509

Mathematics Subject Classification (2010): 62-07; 62Hxx; 62Pxx; 62-01

© Springer International Publishing Switzerland 2016


This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of
the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations,
recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission
or information storage and retrieval, electronic adaptation, computer software, or by similar or
dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this
publication does not imply, even in the absence of a specific statement, that such names are exempt
from the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this
book are believed to be true and accurate at the date of publication. Neither the publisher nor the
authors or the editors give a warranty, express or implied, with respect to the material contained
herein or for any errors or omissions that may have been made.

Printed on acid-free paper

This Springer imprint is published by Springer Nature


The registered company is Springer International Publishing AG Switzerland
Preface

Data Analytics, Data Mining and Big Data are terms often used in everyday
business. Companies collect more and more data and store it in databases, with
the hope of finding helpful patterns that can improve business. Shortly after
deciding to make more use of such data, managers often confess that analysing these
datasets is resource-consuming and anything but easy. Involving the firm’s
IT-experts leads to a discussion regarding which tools to use. Very few applications
are available in the marketplace that are appropriate for handling large datasets in a
professional way. Two commercial products worth mentioning are ‘Enterprise
Miner’ by SAS and ‘SPSS Modeler’ by IBM.
At first glance, these applications are easy to use. After a while, however, many
users realize that more difficult questions require deeper statistical knowledge.
Many people are interested in gaining such statistical skills and applying them,
using one of the data mining tools offered by the industry.
This book will help users to become familiar with a wide range of statistical
concepts or algorithms and apply them to concrete datasets. After a short statistical
overview of how the procedures work and what assumptions to keep in mind, step-
by-step procedures show how to find the solutions with the SPSS Modeler.

Features of This Book


– Easy to read
– Standardised chapter structure, including exercises and solutions
– All datasets are provided as downloads and explained in detail
– Template streams help the reader focus on the interesting parts of the stream and
leave out recurring tasks
– Complete solution streams are ready to use
– Each example includes step-by-step explanations
– Short explanations of the most important statistical assumptions used when
applying the algorithms are included
– Hundreds of screenshots are included, to ensure successful application of the
algorithms to the datasets
– Exercises teach how to consolidate and systematise this knowledge
– Explanations and solutions are provided for all exercises
– Skills acquired through solving the exercises allow the user to create his/her own
streams

The authors of the book, Tilo Wendler and Sören Gröttrup, want to thank all the
people who supported the writing process. These include IBM support experts who
dealt with some of the more difficult tasks, discovering more efficient ways to
handle the data. Furthermore, the authors want to express gratitude to Jeni
Ringland, Katrin Minor and Maria Sabottke for their outstanding efforts and their
help in professionalising the text, layout, figures and tables.

Berlin, Germany Tilo Wendler


Stuttgart, Germany Sören Gröttrup
Contents

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 The Concept of the SPSS Modeler . . . . . . . . . . . . . . . . . . . . . 2
1.2 Structure and Features of This Book . . . . . . . . . . . . . . . . . . . . 5
1.2.1 Prerequisites for Using This Book . . . . . . . . . . . . . . 5
1.2.2 Structure of the Book and the Exercise/Solution
Concept . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2.3 Using the Data and Streams Provided with the
Book . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.2.4 Datasets Provided with This Book . . . . . . . . . . . . . . 9
1.2.5 Template Concept of This Book . . . . . . . . . . . . . . . 10
1.3 Introducing the Modeling Process . . . . . . . . . . . . . . . . . . . . . 13
1.3.1 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.3.2 Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
Literature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2 Basic Functions of the SPSS Modeler . . . . . . . . . . . . . . . . . . . . . . 25
2.1 Defining Streams and Scrolling Through a Dataset . . . . . . . . . 25
2.2 Switching Between Different Streams . . . . . . . . . . . . . . . . . . 32
2.3 Defining or Modifying Value Labels . . . . . . . . . . . . . . . . . . . 34
2.4 Adding Comments to a Stream . . . . . . . . . . . . . . . . . . . . . . . . 40
2.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.6 Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.7 Data Handling and Sampling Methods . . . . . . . . . . . . . . . . . . 49
2.7.1 Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
2.7.2 Calculations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
2.7.3 String Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
2.7.4 Extracting/Selecting Records . . . . . . . . . . . . . . . . . . 61
2.7.5 Filtering Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
2.7.6 Data Standardization: Z-Transformation . . . . . . . . . . 73
2.7.7 Partitioning Datasets . . . . . . . . . . . . . . . . . . . . . . . . 82
2.7.8 Sampling Methods . . . . . . . . . . . . . . . . . . . . . . . . . 88
2.7.9 Merge Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111


2.7.10 Append Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . 124


2.7.11 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
2.7.12 Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
Literature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
3 Univariate Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
3.1 Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
3.1.1 Discrete Versus Continuous Variables . . . . . . . . . . . 185
3.1.2 Scales of Measurement . . . . . . . . . . . . . . . . . . . . . . 187
3.1.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
3.1.4 Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
3.2 Simple Data Examination Tasks . . . . . . . . . . . . . . . . . . . . . . . 194
3.2.1 Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
3.2.2 Frequency Distribution of Discrete Variables . . . . . . 194
3.2.3 Frequency Distribution of Continuous Variables . . . . 199
3.2.4 Distribution Analysis with the Data Audit Node . . . . 202
3.2.5 Concept of “SuperNodes” and Transforming a
Variable to Normality . . . . . . . . . . . . . . . . . . . . . . . 207
3.2.6 Reclassifying Values . . . . . . . . . . . . . . . . . . . . . . . . 224
3.2.7 Binning Continuous Data . . . . . . . . . . . . . . . . . . . . 236
3.2.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248
3.2.9 Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259
Literature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 286
4 Multivariate Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 287
4.1 Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 287
4.2 Scatterplot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 290
4.3 Scatterplot Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 296
4.4 Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 302
4.5 Correlation Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 310
4.6 Exclusion of Spurious Correlations . . . . . . . . . . . . . . . . . . . . . 314
4.7 Contingency Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315
4.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323
4.9 Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 325
Literature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 345
5 Regression Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 347
5.1 Introduction to Regression Models . . . . . . . . . . . . . . . . . . . . . 348
5.1.1 Motivating Examples . . . . . . . . . . . . . . . . . . . . . . . 348
5.1.2 Concept of the Modeling Process
and Cross-Validation . . . . . . . . . . . . . . . . . . . . . . . 350
5.2 Simple Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . 353
5.2.1 Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 353
5.2.2 Building the Stream in SPSS Modeler . . . . . . . . . . . 356
5.2.3 Identification and Interpretation of the Model
Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 360
5.2.4 Assessment of the Goodness of Fit . . . . . . . . . . . . . 362

5.2.5 Predicting Unknown Values . . . . . . . . . . . . . . . . . . 365


5.2.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 367
5.2.7 Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 369
5.3 Multiple Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . 390
5.3.1 Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 390
5.3.2 Building the Model in SPSS Modeler . . . . . . . . . . . . 392
5.3.3 Final MLR Model and Its Goodness of Fit . . . . . . . . 397
5.3.4 Prediction of Unknown Values . . . . . . . . . . . . . . . . 404
5.3.5 Cross-Validation of the Model . . . . . . . . . . . . . . . . . 404
5.3.6 Boosting and Bagging (for Regression Models) . . . . 406
5.3.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 415
5.3.8 Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 418
5.4 Generalized Linear (Mixed) Model . . . . . . . . . . . . . . . . . . . . 448
5.4.1 Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 448
5.4.2 Building a Model with the GLMM Node . . . . . . . . . 450
5.4.3 The Model Nugget . . . . . . . . . . . . . . . . . . . . . . . . . 455
5.4.4 Cross-Validation and Fitting a Quadric
Regression Model . . . . . . . . . . . . . . . . . . . . . . . . . . 458
5.4.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 468
5.4.6 Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 469
5.5 The Auto Numeric Node . . . . . . . . . . . . . . . . . . . . . . . . . . . . 488
5.5.1 Building a Stream with the Auto Numeric Node . . . . 490
5.5.2 The Auto Numeric Model Nugget . . . . . . . . . . . . . . 497
5.5.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 500
5.5.4 Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 500
Literature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 511
6 Factor Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 513
6.1 Motivating Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 513
6.2 General Theory of Factor Analysis . . . . . . . . . . . . . . . . . . . . . 515
6.3 Principal Component Analysis . . . . . . . . . . . . . . . . . . . . . . . . 519
6.3.1 Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 519
6.3.2 Building a Model in SPSS Modeler . . . . . . . . . . . . . 520
6.3.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 547
6.3.4 Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 550
6.4 Principal Factor Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 569
6.4.1 Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 569
6.4.2 Building a Model . . . . . . . . . . . . . . . . . . . . . . . . . . 573
6.4.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 579
6.4.4 Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 579
Literature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 584
7 Cluster Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 587
7.1 Motivating Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 587
7.2 General Theory of Cluster Analysis . . . . . . . . . . . . . . . . . . . . 589

7.2.1 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 596


7.2.2 Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 598
7.3 TwoStep Hierarchical Agglomerative Clustering . . . . . . . . . . . 601
7.3.1 Theory of Hierarchical Clustering . . . . . . . . . . . . . . 601
7.3.2 Characteristics of the TwoStep Algorithm . . . . . . . . 614
7.3.3 Building a Model in SPSS Modeler . . . . . . . . . . . . . 615
7.3.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 627
7.3.5 Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 629
7.4 K-Means Partitioning Clustering . . . . . . . . . . . . . . . . . . . . . . 640
7.4.1 Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 640
7.4.2 Building a Model in SPSS Modeler . . . . . . . . . . . . . 642
7.4.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 659
7.4.4 Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 662
7.5 Auto Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 685
7.5.1 Motivation and Implementation of the Auto
Cluster Node . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 685
7.5.2 Building a Model in SPSS Modeler . . . . . . . . . . . . . 687
7.5.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 699
7.5.4 Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 700
7.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 710
Literature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 711
8 Classification Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 713
8.1 Motivating Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 714
8.2 General Theory of Classification Models . . . . . . . . . . . . . . . . 716
8.2.1 Process of Training and Using a Classification
Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 716
8.2.2 Classification Algorithms . . . . . . . . . . . . . . . . . . . . 718
8.2.3 Classification vs. Clustering . . . . . . . . . . . . . . . . . . 720
8.2.4 Making a Decision and the Decision Boundary . . . . . 721
8.2.5 Performance Measures of Classification Models . . . . 723
8.2.6 The Analysis Node . . . . . . . . . . . . . . . . . . . . . . . . . 725
8.2.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 727
8.2.8 Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 730
8.3 Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 733
8.3.1 Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 734
8.3.2 Building the Model in SPSS Modeler . . . . . . . . . . . . 736
8.3.3 Optional: Model Types and Variable Interactions . . . 743
8.3.4 Final Model and Its Goodness of Fit . . . . . . . . . . . . 746
8.3.5 Classification of Unknown Values . . . . . . . . . . . . . . 750
8.3.6 Cross-Validation of the Model . . . . . . . . . . . . . . . . . 751
8.3.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 756
8.3.8 Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 758

8.4 Linear Discriminate Classification . . . . . . . . . . . . . . . . . . . . . 776


8.4.1 Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 776
8.4.2 Building the Model with SPSS Modeler . . . . . . . . . . 779
8.4.3 The Model Nugget and the Estimated Model
Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 785
8.4.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 788
8.4.5 Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 789
8.5 Support Vector Machine . . . . . . . . . . . . . . . . . . . . . . . . . . . . 808
8.5.1 Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 809
8.5.2 Building the Model with SPSS Modeler . . . . . . . . . . 810
8.5.3 The Model Nugget . . . . . . . . . . . . . . . . . . . . . . . . . 820
8.5.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 821
8.5.5 Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 822
8.6 Neuronal Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 843
8.6.1 Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 844
8.6.2 Building a Network with SPSS Modeler . . . . . . . . . . 846
8.6.3 The Model Nugget . . . . . . . . . . . . . . . . . . . . . . . . . 856
8.6.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 860
8.6.5 Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 862
8.7 k-Nearest Neighbor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 878
8.7.1 Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 878
8.7.2 Building the Model with SPSS Modeler . . . . . . . . . . 882
8.7.3 The Model Nugget . . . . . . . . . . . . . . . . . . . . . . . . . 891
8.7.4 Dimensional Reduction with PCA for Data
Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 893
8.7.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 901
8.7.6 Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 903
8.8 Decision Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 917
8.8.1 Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 917
8.8.2 Building a Decision Tree with the C5.0 Node . . . . . . 925
8.8.3 The Model Nugget . . . . . . . . . . . . . . . . . . . . . . . . . 929
8.8.4 Building a Decision Tree with the CHAID Node . . . . . 932
8.8.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 938
8.8.6 Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 939
8.9 The Auto Classifier Node . . . . . . . . . . . . . . . . . . . . . . . . . . . . 960
8.9.1 Building a Stream with the Auto Classifier Node . . . 961
8.9.2 The Auto Classifier Model Nugget . . . . . . . . . . . . . . 971
8.9.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 973
8.9.4 Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 974
Literature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 983
9 Using R with the Modeler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 985
9.1 Advantages of R with the Modeler . . . . . . . . . . . . . . . . . . . . . 985
9.2 Connecting with R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 986
9.3 Test the SPSS Modeler Connection to R . . . . . . . . . . . . . . . . . 990
9.4 Calculating New Variables in R . . . . . . . . . . . . . . . . . . . . . . . 994

9.5 Model Building in R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 999


9.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1008
9.7 Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1018
Literature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1035
10 Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1037
10.1 Data Sets Used in This Book . . . . . . . . . . . . . . . . . . . . . . . . . 1037
10.1.1 adult_income_data.txt . . . . . . . . . . . . . . . . . . . . . . . 1037
10.1.2 beer.sav . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1037
10.1.3 benchmark.xlsx . . . . . . . . . . . . . . . . . . . . . . . . . . . 1037
10.1.4 car_simple.sav . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1039
10.1.5 car_sales_modified.sav . . . . . . . . . . . . . . . . . . . . . . 1039
10.1.6 chess_endgame_data.txt . . . . . . . . . . . . . . . . . . . . . 1039
10.1.7 customer_bank_data.csv . . . . . . . . . . . . . . . . . . . . . 1040
10.1.8 diabetes_data_reduced.sav . . . . . . . . . . . . . . . . . . . . 1040
10.1.9 DRUG1n.sav . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1041
10.1.10 EEG_Sleep_Signals.csv . . . . . . . . . . . . . . . . . . . . . 1042
10.1.11 employee_dataset_001 and employee_dataset_002 . . . . 1042
10.1.12 England Payment Datasets . . . . . . . . . . . . . . . . . . . 1042
10.1.13 Features_eeg_signals.csv . . . . . . . . . . . . . . . . . . . . . 1044
10.1.14 gene_expression_leukemia.csv . . . . . . . . . . . . . . . . 1044
10.1.15 gene_expression_leukemia_short.csv . . . . . . . . . . . . 1045
10.1.16 gravity_constant_data.csv . . . . . . . . . . . . . . . . . . . . 1045
10.1.17 Housing.data.txt . . . . . . . . . . . . . . . . . . . . . . . . . . . 1046
10.1.18 Iris.csv . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1046
10.1.19 IT-projects.txt . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1047
10.1.20 IT user satisfaction.sav . . . . . . . . . . . . . . . . . . . . . . 1047
10.1.21 longley.csv . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1047
10.1.22 LPGA2009.csv . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1049
10.1.23 Mtcars.csv . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1050
10.1.24 nutrition_habites.sav . . . . . . . . . . . . . . . . . . . . . . . . 1051
10.1.25 optdigits_training.txt, optdigits_test.txt . . . . . . . . . . . 1051
10.1.26 Orthodont.csv . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1052
10.1.27 Ozone.csv . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1052
10.1.28 pisa2012_math_q45.sav . . . . . . . . . . . . . . . . . . . . . 1052
10.1.29 sales_list.sav . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1054
10.1.30 ships.csv . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1054
10.1.31 test_scores.sav . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1054
10.1.32 Titanic.xlsx . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1055
10.1.33 tree_credit.sav . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1055
10.1.34 wine_data.txt . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1056
10.1.35 WisconsinBreastCancerData.csv . . . . . . . . . . . . . . . 1056
10.1.36 z_pm_customer1.sav . . . . . . . . . . . . . . . . . . . . . . . . 1057
Literature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1057
1 Introduction

The amount of collected data has risen exponentially over the last decade, as
companies worldwide store more and more data on customer interactions, sales,
logistics, and production processes. For example, Walmart handles up to ten million
transactions and 5000 items per second (see Walmart 2012). According to The
Economist (2010a), the company feeds about 2.5 petabytes of data into its databases.
To put that in context, this is roughly the volume of all letters sent within the US over
six months.
In recent years, companies have discovered that understanding their collected
data is a very powerful tool and can reduce overheads and optimize work-flows,
giving their firms a big advantage in the market.
The challenge in this area of data analytics is to consolidate data from different
sources and analyze the data, to find new structures or patterns and rules that can
help predict the future of the business more precisely, or create new fields of
business activity. The job of a data scientist is predicted to become one of the
most interesting jobs in the twenty-first century (see Davenport and Patil 2012), and
there is currently strong competition for the brightest and most talented analysts in
this field.
A wide range of applications, such as R, SAS, MATLAB, and SPSS Statistics,
provide a huge toolbox of methods to analyze large datasets and can be used by experts
to find patterns and interesting structures in the data. Many of these tools are mainly
programming languages, which assume that the analyst has deeper programming skills
and an advanced background in IT and mathematics. Since this field is becoming
more important, data analysis software with a graphical user interface is starting to enter
the market, providing “drag and drop” mechanisms for career changers and people
who are not experts in programming or statistics.
One of these easy-to-handle data analytics applications is the IBM SPSS
Modeler. This book is dedicated to the introduction and explanation of its data
analysis power, delivered in the form of a teaching course.


1.1 The Concept of the SPSS Modeler

IBM’s SPSS Modeler offers more than a single application. In fact it is a family of
software tools based on the same concept. These are:

– SPSS Modeler (Personal, Professional, Premium, Gold)


– SPSS Modeler Server
– SPSS Modeler Gold on Cloud
– SPSS Predictive Analytics Enterprise (combination of several products)
– IBM SPSS Modeler for Linux on System z

The difference between the versions of the Modeler is in the implemented
functionalities or models (see Table 1.1). IBM has excellently managed the imple-
mentation and deployment of the application concept through different channels.
Here we will show how to use the SPSS Modeler Premium software functionalities.
The IBM SPSS Modeler is a powerful tool, which gives the user access to a
wide range of statistical procedures for analyzing data and developing predictive
models. Upon finishing the modeling procedure, the models can be deployed and
implemented in data analytics production processes. The SPSS Modeler’s concepts
help companies to handle data analytics processes in a very professional way. In
complex environments however, the Modeler cannot be installed as a stand-alone
solution. In these cases the Modeler Server provides more power to users. Still, it is
based on the same application concept.

Table 1.1 Features of the different SPSS Modeler Editions

Key feature    SPSS Modeler      SPSS Modeler      SPSS Modeler        SPSS Modeler
               Personal          Professional      Premium             Gold
Deployment     Desktop only      Desktop/server    Desktop/server      Desktop/server
Techniques     Classification    Classification    Classification      Classification
               Segmentation      Segmentation      Segmentation        Segmentation
               Association       Association       Association         Association
Capabilities   -                 -                 Text analytics      Text analytics
                                                   Entity analytics    Entity analytics
                                                   Social network      Social network
                                                   analysis            analysis
Enhancements   -                 -                 -                   Analytical decision
                                                                       management
                                                                       Collaboration and
                                                                       deployment services

Source: IBM Website (2015a)

" SPSS Modeler is a family of software tools provided for use with a wide
range of different operating systems/platforms. Besides the SPSS
Modeler Premium used here in this book, other versions of the
Modeler exist. Most of the algorithms discussed here are available in
all versions.

The Modeler’s Graphical User-Interface (GUI)


To start the SPSS Modeler we click on
Start > Programs > IBM SPSS Modeler 17 > IBM SPSS Modeler 17
Figure 1.1 shows the Modeler workspace with a stream. We can load the
particular stream by using the toolbar item and navigating to the folder with the
stream files provided in the book. The stream name here is “cluster_diabetes_K_means.str”.
The stream is the solution of an exercise using cluster analysis
methods discussed later in this book.
At the bottom of the workspace, we can find the nodes palette. Here all available
nodes are organized in tabs by topic. First we have to click the proper tab and then
select the nodes we need to build the stream. We will show how to build a stream in
Sect. 2.1.
If the tab “Streams” is activated in the upper part of the Modeler’s manager, on
the right in Fig. 1.1, we can switch to one of the streams we opened before. We can
also inspect the so-called SuperNodes. We explain the SuperNode concept in
Sect. 3.2.5.
In the “Outputs” tab of the Modeler manager, we can find the relevant outputs
created by a stream.
Additionally, an important part of the Modeler manager is the tab “Models”.
Here, we find all the models created by one of the open streams. If necessary, these
models can be added to a stream.

Fig. 1.1 SPSS Modeler GUI (labeled areas: Toolbar, Modeler Manager, Stream, Nodes Palette)



For now, these are the most important details of the IBM SPSS Modeler editions
and the workspace. We will go into more detail in the following chapters. To
conclude this introduction to the Modeler, we want to present a list of the
advantages and challenges of this application, by way of a summary of our findings
while working with it.

Advantages of using IBM SPSS Modeler

– The Modeler supports data analysis and the model building process, with its
accessible graphical user-interface and its elaborate stream concept. Creating a
stream allows the data scientist to very efficiently model and implement the
several steps necessary for data transformation, analysis, and model building.
The result is very comprehensible; even statisticians who are not involved
in the process can understand the streams and the models very easily.
– Due to its powerful statistics engine, the Modeler can be used to handle and
analyze huge datasets, even on computers with restricted performance. The
application is very stable.
– The Modeler offers a connection to the statistics program R. We can pass the
dataset to R, perform several calculations there, and send the results back to the
Modeler. Users wishing to retain the functionalities of R have the chance to
implement them in the Modeler streams. We will show how to install and how to
use R in Chap. 9.
– Finally, we have to mention the very good IBM support that comes as part of the
professional implementation of the Modeler in a firm’s analytics environment.
Even the best application sometimes raises queries that must be discussed with
experts. IBM support is very helpful and reliable.

Challenges with IBM SPSS Modeler

– IBM’s strategy is to deliver a meticulously thought-out concept in data mining
tools, to users wishing to learn how to apply statistics to real-world data. At first
glance, the SPSS Modeler can be used in a very self-explanatory way. The
potential risk is that difficult statistical methods can be applied in cases where
the data does not meet all the necessary assumptions, and so inaccuracies can
occur.
– This leads us to a fundamental criticism: the Modeler focuses on handling large
datasets in a very efficient way, but it does not provide that much detailed
information or statistics on the goodness of the data and the models developed.
A well-trained statistician may argue that other applications better support
assessment of the models built. As an example, we will stress here the factor
analysis method. Statistics to assess the correlation matrix, such as the KMO and
Bartlett’s test or the anti-image correlation matrix, are not provided. Also a scree
plot is missing. This is particularly hard to understand as the Modeler is
obviously based on the program IBM SPSS Statistics, in which often more
details are provided.

– Furthermore, results calculated in a stream, e.g., factor loadings, must often be
used in another context. The Modeler shows the calculation results in a well-
structured output tab, but the application does not provide an efficient way to
access the results with full precision from other nodes. To deal with the results in
other calculations, often the results must be copied manually.
– Data transformation and aggregation for further analysis is tedious. The calcu-
lation of new features by aggregating other variables is challenging and can
greatly increase the complexity of a stream. Outsourcing these transformations
into a different statistical program, e.g., using the R node, is often more efficient
and flexible.

All in all the IBM SPSS Modeler, like any other application, has its advantages
and its drawbacks. Users can analyze immense datasets in a very efficient way.
Statisticians with a deep understanding of “goodness of fit” measures and sophisti-
cated methods for creating models will not always be satisfied, because of the
limited amount of output detail available. However, the IBM SPSS Modeler is
justifiably one of the leading data mining applications on the market.

1.2 Structure and Features of This Book

1.2.1 Prerequisites for Using This Book

This book can be read and used with minimal mathematical and statistical knowl-
edge. Besides an interest in dealing with statistical topics, the reader is required to have
a general understanding of statistics and its basic terms and measures, like fre-
quency, frequency distribution, mean, and standard deviation. Deeper statistics and
mathematics are briefly explained in the book when needed, or references to the
relevant literature are given, where the theory is explained in an understandable
way. The following books focusing on more basic statistical analyses are
recommended to the reader: Herkenhoff and Fogli (2013) and Weiers et al. (2011).
Since the main purpose of this book is the introduction of the IBM SPSS
Modeler and how it can be used for data mining, the reader needs a valid IBM
SPSS Modeler licence in order to properly work with this book and solve the
provided exercises.
For readers who are not completely familiar with the theoretical background, the
authors explain at the beginning of each chapter the most relevant and needed
statistical and mathematical fundamentals and give examples where the outlined
methods can be applied in practice. Furthermore, many exercises are related to the
explanations in the chapters and are intended for the reader to recapitulate the
theory and gain more advanced statistical knowledge. The detailed explanations of
and comments on these exercises clarify further essential terms and procedures used
in the field of statistics.

" It is recommended that the reader of this book has a valid IBM SPSS
Modeler licence and is interested in dealing with current
issues in statistical and data analysis applications. This book facilitates
the easy familiarization with statistical and data mining concepts
because . . .

– No advanced mathematical or statistical knowledge is required
besides some general statistical understanding and knowledge of
basic statistical terms.
– Chapters start with explanations of the necessity of the procedures
discussed.
– Exercises cover all levels of complexity, from the basics of statistics to
advanced predictive modeling.
– Detailed explanations of the exercises help the reader to understand
the terms and concepts used.

" The following books focusing on more basic statistical analyses are
recommended to the reader:

Herkenhoff, L. and Fogli, J. (2013), Applied statistics for business and
management using Microsoft Excel, Springer, New York
Weiers, R.M., Gray, J.B. and Peters, L.H. (2011), Introduction to business
statistics, 7th ed., South-Western Cengage Learning, Australia, Mason,
OH

1.2.2 Structure of the Book and the Exercise/Solution Concept

The goal of this book is to help users of the Modeler to become familiar with the
wide range of data analysis and modeling methods offered by this application. To
this end, the book has the structure of a course or teaching book. Figure 1.2 shows
the topics discussed, allowing the user easy access to different focus points and the
information needed for answering his own particular questions.
Each section has the following structure:

1. An introduction to the basic principles of using the Modeler nodes.


2. A short description of the theoretical or statistical background.
3. Learning how to use the statistical methods by applying them to an example.
4. Figuring out the most important parameters and their meaning, using “What-If?”
scenarios.
5. Solving exercises and reviewing the solution.

Fig. 1.2 Topics discussed and structure of the book

Table 1.2 Overview of stream and data structure used in this book

Stream name                     distribution_analysis_using_data_audit_node
Based on dataset                tree_credit.sav
Stream structure                (stream diagram not reproduced here)
Important additional remarks    It is important to define the scale type of each variable
                                correctly, so the Data Audit node applies the proper chart
                                (bar chart or histogram) to each variable. For discrete
                                variables, the SPSS Modeler uses a bar chart, whereas for
                                continuous/metric variables, the distribution is visualized
                                with a histogram
Related exercises: 10

Introducing the Details of the Streams Discussed


At the beginning of each section, an overview can be found, as shown in Table 1.2,
for the reader to identify the necessary dataset and the streams that will be discussed
in the section.

Table 1.3 Example of a solution

Name of the solution streams    File name of the solution
Theory discussed in section     Section XXX

The names of the dataset and stream are listed before the final stream is depicted,
followed by additional important details. At the end, the exercise numbers are
shown, where the reader can test and practice what he/she has learned.

Exercises and Solutions


To give the reader a chance to assess what he/she has learned, exercises and their
solutions can be found at the end of each section. The text in the solution usually
explains any details worth bearing in mind with regard to the topic discussed. To
link more extensive discussion in the theoretical part with the questions at the end
of the book, each solution begins with a table, as shown in Table 1.3. Here, a cross-
reference to the theoretical background can be found.

1.2.3 Using the Data and Streams Provided with the Book

The SPSS Modeler streams need access to datasets that can then be analyzed. In the
so-called “Source” nodes, the path to the dataset folder must be defined. To work
more comfortably the streams provided with this book are based on the following
logic:

– All datasets are copied into one folder in the “C:” drive
– There is just one file “registry_add_key.bat” in the folder “C:
\SPSS_MODELER_BOOK\”
– The name of the dataset folder is “C:\SPSS_MODELER_BOOK\001_Datasets”
– The streams are normally copied to “C:\SPSS_MODELER_BOOK
\002_Streams”, but the folder names can be different.

If other folders are to be used, then the procedure described here, in particular the
Batch file, must be modified slightly.

" All datasets, IBM SPSS Modeler streams, R scripts, and Microsoft Excel
files discussed in this book are provided as downloads on the website:

" http://www.statistical-analytics.net

" Password: “spssmodelerspringer”

" For ease, the user can add a key to the registry of Microsoft Windows.
This is done using the script “registry_add_key.bat” provided with
this book.

" Alternatively the command:

" REG ADD “HKLM\Software\IBM\IBM SPSS Modeler\17.0\Environment”/


v “BOOKDATA”/t REG_SZ/d “C:\SPSS_MODELER_BOOK\001_Datasets”,
where “C:\SPSS_MODELER_BOOK\001_Datasets” is the folder of
datasets to be used.

To work with this book, we recommend the following steps:

1. Install SPSS Modeler Premium (or other version) on your computer


2. Download the ZIP file with all files related to the book from the website
http://www.statistical-analytics.net. The password is “spssmodelerspringer”.
3. Move the folder “C:\SPSS_MODELER_BOOK” from the ZIP file to your disk
drive “C:”
4. Add a key to the Microsoft Windows Registry that will allow the SPSS Modeler
to find the datasets. To do so:
(a) Navigate to the BATCH file named “registry_add_key.bat” in the folder
“C:\SPSS_MODELER_BOOK\”
(b) Right-click on the file and choose the option “Run as Administrator”. This
allows the script to add the key to the registry.
(c) After adding the key, restart the computer.

The key can also be added to the Windows Registry manually by using the
command:
REG ADD "HKLM\Software\IBM\IBM SPSS Modeler\17.0\Environment" /v
"BOOKDATA" /t REG_SZ /d "C:\SPSS_MODELER_BOOK\001_Datasets"
Instead of using the original Windows folder name, e.g., “C:
\SPSS_MODELER_BOOK\001_Datasets”, to address a dataset, the shortcut
“$BOOKDATA” defined in the Windows registry should now be used (see also
IBM Website 2015b).
We should pay attention to the fact that the backslash in the path must be
substituted with a slash. So the path
“C:\SPSS_MODELER_BOOK\001_Datasets\car_sales_modified.sav”
equals
“$BOOKDATA/car_sales_modified.sav”.

1.2.4 Datasets Provided with This Book

With this book, the reader has access to more than 30 datasets, which can be downloaded
from the authors’ website using the given password (see Sect. 1.3.1). The data are
available in different file formats, such as:

– TXT files with spaces between the values


– CSV files with values separated by a comma

Additionally, the reader can download extra files in R or in Microsoft Excel
format. This enables the user to do several analysis steps in other programs too, e.g.,
R or Microsoft Excel. Furthermore, some calculations are presented in Microsoft
Excel files.
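To illustrate how such files can also be opened outside the Modeler, for instance in R (the program connected to the Modeler in Chap. 9), the following minimal sketch reads a whitespace-separated TXT file and a comma-separated CSV file. The file names used here are hypothetical and only serve as an illustration; the datasets actually provided with the book are described in Sect. 10.1.

    # Minimal R sketch with hypothetical file names; adjust the paths to your dataset folder
    txt_data <- read.table("C:/SPSS_MODELER_BOOK/001_Datasets/example_data.txt",
                           header = TRUE)   # values separated by spaces
    csv_data <- read.csv("C:/SPSS_MODELER_BOOK/001_Datasets/example_data.csv")
                                             # values separated by commas
    str(txt_data)                            # inspect the imported variables
    str(csv_data)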

" All datasets discussed in this book can be downloaded from the authors’
website using the password provided. The datasets have different file
formats, so the user learns to deal with different nodes to load the
sets for data analysis purposes.

" Additionally, R- and Microsoft Excel Files are provided to demonstrate


some calculations.

" Since version 17, SPSS Modeler does not support Excel file formats
from 1997 to 2003. It is necessary to use file formats from at
least 2007.

1.2.5 Template Concept of This Book

In this book, we show how to build different streams and how to use them for the
analysis of datasets. We will explain how to create a stream from scratch. Although
we outline this process, it is unhelpful and time-consuming to always have to
add the data source and certain recurring nodes, e.g., the Type node, to every stream.
To work more efficiently, we would like to introduce the concept of
so-called template streams. Here, the necessary nodes for loading the dataset and
defining the scale types for the variables are implemented. So the users don’t have
to repeat these steps in each exercise. Instead, they can focus on the most important
steps and learn the new features of the IBM Modeler. The template streams can be
extended easily by adding new nodes. Figure 1.3 shows a template stream.
We should mention the difference between the template streams and the solution
streams provided with this book. In the solution streams, all the necessary features
and nodes are implemented, whereas in the template streams only the first steps,
e.g., the definition of the data source, are incorporated. This is depicted in Fig. 1.4.
So the solution is not only a modification of the template stream. This can be seen
by comparing Fig. 1.5 with the template stream in Fig. 1.3.

Fig. 1.3 Stream “Template-Stream_Car_Simple”

Fig. 1.4 Template-stream concept of the book



Fig. 1.5 Final Stream “car_clustering_simple”

" A template-stream concept is used in this book. In the template streams,


the datasets will be loaded and the scale types of the variables are
defined.

" The template streams can be found in the sub-folder “Template_Streams”,


provided with this book.

" Template streams access the data by using the “$BOOKDATA” short-
cut defined in the registry. Otherwise the folder in the Source nodes
would need to be modified manually before running the stream.

" The template-stream concept allows the user to concentrate on the


main parts of the functionalities presented in the different sections. It
also means that when completing the exercises, it is not necessary to
deal with dataset handling before starting to answer the questions.
Simply by adding specific nodes, the user can focus on the main tasks.

The details of the datasets and the meaning of the variables included are
described in Sect. 10.1.
Before the streams named above can be used, it is necessary to take into account
the following aspects of data directories. The streams are created based on the
concept presented in Sect. 1.2. As shown in Fig. 1.6, the Windows registry shortcut
“$BOOKDATA” is being used in the Source node. Before running a stream,
the location of the data files should be verified and possibly adjusted. To do
this, double-click on the Source node and modify the file path if necessary (see
Fig. 1.6).

Fig. 1.6 Example of verifying the data directory in a Statistics node

1.3 Introducing the Modeling Process

Before we dive into statistical procedures and models, we want to address some
aspects relevant to the world of data from a statistical point of view. Data analytics
are used in different ways, such as for (see Ville 2001, pp. 12–13):

– Customer acquisition and targeting in marketing


– Reducing customer churn by identifying their expectations
– Loyalty management and cross-selling
– Enabling predictive maintenance
– Planning sales and operations
– Managing resources
– Reducing fraud

This broad range of applications lets us assume that there are an infinite number
of opportunities for the collection of data and for creating different models. So we
have to focus on the main aspects that all these processes have in common. Here we
want to first outline the characteristics of the data collection process and the main
steps in data processing.
The results of a statistical analysis should be correct and reliable, but this always
depends on the quality of the data the analysis is based upon. In practice, we have to
deal with the effects that dramatically influence the volume of data, the quality, and
the data analysis requirements. The following list, based on text from the IBM
Website (2015c), gives an overview:

1. Scale of data:
New automated discovery techniques allow the collection of huge datasets.
There has been an increase in the number of devices that are able to generate data
and send it to a central source.
2. Velocity of data:
Due to increased performance in all business processes, data must be analyzed
faster and faster. Managers and consumers expect results in minutes or seconds.
3. Variety of data:
The time has long passed since data were collected in a structured form,
delivered, and used more or less directly for data analysis purposes. Data are
produced in different and often unstructured or less structured forms, such as
through social network comments, information on websites, or through streaming
platforms.
4. Data in doubt:
Consolidated data from different sources enable statisticians to draw a more
accurate picture of which entities to analyze. The data volume increases dramat-
ically, but improved IT performance allows the combination of many datasets
and the use of a broader range of sophisticated models.

The source of data determines the quality of the research or analysis. Figure 1.7
shows a scheme to characterize datasets by source and size. If we are to describe a
dataset, we have to use two terms, one from either side of the scheme. For instance,
collected data that relate to consumer behavior, based on a survey, are typically a
sample. If the data are collected by the researchers themselves, they are called primary
data, because the researchers are responsible for the quality.
Once the data are collected, the process of building a statistical model can start,
as shown in Fig. 1.8. As a first step, the different data sources must be consolidated,
using a characteristic that is unique for each object. Once the data can be combined
in a table, the information must be verified and cleaned. This means removing
duplicates or finding spelling mistakes or semantic failures.
At the end of the cleaning process, the data are prepared for statistical analysis.
Typically, further steps such as normalization or re-scaling are used and outliers are
detected. This is to meet the assumptions of the statistical methods for prediction or
pattern identification.

Fig. 1.7 Characteristics of datasets

Fig. 1.8 Steps to create a statistical model

Unfortunately, a lot of methods have particular requirements that are hard to
achieve, such as normally distributed values. Here the statistician has to know the
consequences of deviations between theory and practice. Otherwise, the
“goodness of fit” measures or the “confidence intervals” determined, based on the
assumptions, are often biased or questionable.
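To make these steps a little more concrete, the following minimal R sketch (R can be connected to the Modeler, see Chap. 9) consolidates two invented tables via a unique customer ID, removes duplicate records, and z-standardizes a metric variable. The data and variable names are made up purely for illustration; within the Modeler itself, these tasks are handled by the nodes discussed in Sects. 2.7.6 and 2.7.9.

    # Invented example data: consolidate, clean, and prepare for analysis
    orders    <- data.frame(customer_id = c(1, 2, 2, 3),
                            amount      = c(120, 80, 80, 200))
    customers <- data.frame(customer_id = c(1, 2, 3),
                            age         = c(34, 51, 28))

    # Step 1: consolidate the sources via the unique key
    combined <- merge(orders, customers, by = "customer_id")

    # Step 2: clean the data, here by removing exact duplicate records
    cleaned <- unique(combined)

    # Step 3: prepare for analysis, e.g., z-standardize a metric variable
    cleaned$amount_z <- as.numeric(scale(cleaned$amount))   # (x - mean) / standard deviation
    summary(cleaned)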
A lot of further details regarding challenges in the data analysis process could be
mentioned here. Instead of stressing theoretical facts, however, we would like to
dive into the data handling and model building process with the SPSS Modeler. We
recommend the following exercises to the reader.

1.3.1 Exercises

Exercise 1: Data Measurement Units


If we wish to deal with data, we have to have an understanding of data measurement
units. The units for measuring data volume, from a bit up to a terabyte, are well
known. As a recap, answer the following questions:

1. Write all the data measurement units in the correct order and explain how to
convert them from one to another.
2. After a terabyte of data, come the units petabyte, exabyte, and zetabyte. How do
they relate to the units previously mentioned?
3. Using the Internet, find examples that help you to imagine what each unit means
in terms of daily life data volumes.

Exercise 2: Details and Examples for Primary and Secondary Data


In Fig. 1.7, we distinguished between primary and secondary data. Primary data can
be described, in short, as data that are generated by the researchers themselves, so
they are responsible for the data. Data that come from one’s own
department or company are often also referred to as primary data. Secondary data
are collected by others; the researcher (or, more often, the researcher’s firm) is unable to
judge the quality in detail.

1. Name and briefly explain at least three advantages and three drawbacks of a
personal interview, a mail survey, or an internet survey from a data collection
perspective.
2. Name various possible sources of secondary data.

Exercise 3: Distinguishing Different Data Sources


The journal The Economist (2014, p. 5), in its special report “Advertising and
Technology” in 2014, describes how . . .

The advertising industry obtains its data in two ways. “First-Party” data are collected by
firms with which the user has a direct relationship. Advertisers and publishers can compile
them by requiring users to register online. This enables the companies to recognize
consumers across multiple devices and see what they read and buy on their site.
“Third-party” data are gathered by thousands of specialist firms across the web [. . .] To
gather information about users and help serve appropriate ads, sites often host a slew of
third parties that observe who comes to the site and build up digital dossiers about them.

Using the classification scheme of data sources shown in Fig. 1.7, list the correct
terms for describing the different data sources named here. Explain your decision,
as well as the advantages and disadvantages of the different source categories.

Exercise 4: Data Preparation Process


In the article “How Companies Learn Your Secrets”, The New York Times
Magazine (2012) describes the ability of data scientists to identify pregnant female
customers. They do this so that stores, such as Target and Walmart, can generate
significantly higher margins from selling their baby products. So if the consumer
realizes, through reading personalized advertisements, that these firms offer inter-
esting products, the firm can strengthen its relationship with these consumers. The
earlier a company can reach out to this target group, the more profit can be
generated later. Here we want to discuss the general procedure for such a data
analysis process.
The hypothesis is that consumer habits of women and their male friends change
in the event of a pregnancy. By the way, the article also mentioned that according to
studies, consumers also change their habits when they marry. They become more
likely to buy a new type of coffee. If they divorce they start buying different brands
of beer and if they move into a new house there is an increased probability they will
buy a new kind of breakfast cereal.
Consider you have access to the following data:

• Primary data
– Unique consumer ID generated by the firm
– Consumers’ credit card details
– Purchased items from the last 12 months, linked to credit card details or
customer loyalty card
– An internal firm registry, including the parents’ personal details collected in a
customer loyalty program connected with the card
• Secondary data:
– Demographic information (age, number of kids, marital status, . . .)
– Details, e.g., ingredients in lotions, beverages, types of breads, etc. bought
from external data suppliers

Answer the following questions:

1. Describe the steps in a general data preparation procedure, to enable identification
of changing consumer habits.
2. Explain how you can learn if and how purchase patterns of consumers change
over time.
3. Now you have a consolidated dataset. Additionally, you know how pregnant
consumers are more likely to behave. How can you then benefit from this
information and generate more profit for the firm that hired you?
4. Find out the risks in trying to contact the identified consumers, e.g., by mail,
even if you can generate more profit afterwards.

Exercise 5: Data Warehousing Versus Centralized Data Storage


Figure 1.9 shows the big picture of an IT system implemented within a firm (see
also Abts and Mülder 2009, p. 68). The company sells shoes from a website. The
aim of analyzing this system is to find out if concentrating data in a centralized
database is appropriate in terms of security and data management.
Answer the following questions:

Fig. 1.9 IT system with central database in a firm [Figure adapted from Abts and Mülder (2009,
p. 68)]

1. Describe the main characteristics of the firm’s IT system in your own words.
2. List some advantages of data consolidation realized by the firm.
3. Discussing the current status, describe the drawbacks and risks that the firm faces
from centralizing the data.
4. Summarize your findings and make a suggestion of how to implement data
warehousing within a firm’s IT landscape.

1.3.2 Solutions

Exercise 1: Data Measurement Units


Theory discussed in section: Section 1.1

The details of a possible solution can be found in Table 1.4. This table is based on
The Economist (2010b).

Exercise 2: Details and Examples of Primary and Secondary Data

1. Table 1.5 shows possible answers. Interested readers are referred to Lavrakas (2008).
2. Related to Fig. 1.7, we name here different sources for secondary data based on
Weiers et al. (2011, pp. 112–113).

Table 1.4 Data measure units and their interpretation

Unit | Size | Example
Bit | Two options, 0 or 1 | Abbreviation of "binary digit". Smallest unit for storing data
Byte (B) | 8 bits | Can create a simple alphabet with up to 256 characters or symbols
Kilobyte (KB) | 2^10 = 1024 Bytes | One page of text equals 2 KB
Megabyte (MB) | 2^10 KB = 2^20 Bytes | A pop song is about 4–5 MB; Shakespeare's works need 5 MB
Gigabyte (GB) | 2^10 MB = 2^20 KB = 2^30 Bytes | A 2-h movie is about 1–2 GB on the hard disk of a laptop
Terabyte (TB) | 2^10 GB = 2^20 MB, etc. | All the books in America's Library of Congress amount to about 15 TB
Petabyte (PB) | 2^10 TB = 2^20 GB, etc. | All letters sent in the US per year equal about 5 PB. Google processes 5 PB per hour
Exabyte (EB) | 2^10 PB = 2^20 TB, etc. | 50,000 years of high-quality video
Zettabyte (ZB) | 2^10 EB = 2^20 PB, etc. | All information coded in the world in 2010 equals about 1.2 ZB
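To double-check these orders of magnitude, the powers of two can simply be multiplied out. The following small Python snippet is only an illustrative side calculation (it is not part of the original exercise) and verifies two of the examples from Table 1.4:

  # Illustrative only: convert storage units into bytes using powers of two
  KB, MB, GB, TB = 2**10, 2**20, 2**30, 2**40
  page = 2 * KB                 # one page of text, roughly 2 KB
  print(GB // page)             # about 524,288 such pages fit into 1 GB
  print(15 * TB // GB)          # the 15 TB Library of Congress example equals 15,360 GB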

Table 1.5 Comparison of primary data collection methods

Personal interview
– Advantages: Respondents often cooperate because of the influence of the interviewer; open-ended questions can be asked
– Drawbacks: Not anonymous; high time pressure; respondents are influenced by the interviewer's appearance and behavior; relatively expensive in comparison to other methods; depending on the cultural background, several topics should not be discussed

Mail survey
– Advantages: Respondents feel more anonymous; nearly all topics can be addressed; distance between firms and respondents is not essential
– Drawbacks: Coding of answers is necessary; the number of questions must be limited because of the often lower motivation to answer; postal charges

Internet survey
– Advantages: Low costs per respondent; answers are coded by the respondents; usage of visualizations is possible
– Drawbacks: Only a limited number of open-ended questions can be asked; internet access is necessary for respondents/firewalls in firms must be considered; respondents are annoyed by the overwhelming number of internet surveys and are therefore less motivated to answer; professional design of the survey requires experience; the internet does not ensure confidentiality

Examples of official statistics sources:

– Census data, e.g., Census of Population or Economic Censuses


http://www.census.gov/
http://ec.europa.eu/eurostat/data/database
– Central banks, e.g., Federal Reserve and European Central Bank
http://www.federalreserve.gov/econresdata/default.htm
http://sdw.ecb.europa.eu/
– Federal Government and Statistical Agencies
https://nces.ed.gov/partners/fedstat.asp

Examples of nonofficial statistics sources:

– Firms and other commercial suppliers, e.g., Bloomberg, Reuters, etc.


– NGOs
– University of Michigan Consumer Sentiment Index

Exercise 3: Distinguishing Different Data Sources


Theory discussed in section: Section 1.4

Table 1.6 shows the solution. More details can be found, e.g., in Ghauri and
Grønhaug (2005).

Table 1.6 Advantages and disadvantages of data sources

First party = primary data: data collected by the researcher, department, or firm itself. The firm is responsible for the quality.
– Advantages: The researcher can influence the quality of the data and the sample size
– Drawbacks: Takes time and needs own resources to collect the data, e.g., through conducting a survey

Third party = secondary data: data collected by other firms, departments, or researchers. The firm itself cannot influence the quality of the data.
– Advantages: Often easy and faster to get
– Drawbacks: Probably more expensive than primary data; explanation of the variables is less precise; answering the original research question may be more difficult, because the third party had another focus/reason for collecting the data; quality of the data depends on the know-how of the third party

Exercise 4: Data Preparation Process

1. It is necessary to clean up and consolidate the data of the different sources. The
consumer ID is the primary key in an interesting database table. Behind these
keys, the other information must be merged. In the end, e.g., demographic
information including the pregnancy status calculated based on the baby register
is linked to consumer habits in terms of purchased products. The key to
consolidating the purchased items of a consumer is the credit card or the
consumer loyalty card details. As long as the consumer shows one of these
cards and doesn’t pay cash, the purchase history becomes more complete
each time.
2. Assuming we have clean personal consumer data linked to the products bought,
we can now determine the relative frequency of purchases over time. Moreover,
the product itself may change, and then the percentage of specific ingredients in
each of the products may be more relevant. Pregnant consumers buy more
unscented lotion at the beginning of their second trimester and sometimes in
the first 20 weeks they buy more scent-free soap, extra-big bags of cotton balls
and supplements such as calcium, magnesium, and zinc, according to The New York
Times Magazine (2012) article.
3. We have to analyze and check the purchased items per consumer. If the pattern
changes are somehow similar to the characteristics we found from the data
analysis for pregnancy, we can reach out to the consumer, e.g., by sending
personalized advertisements to the woman or her family.
4. There is another interesting aspect to the data analytics process. Analyzing the
business risk is perhaps even more important than the risk of missing the chance
to do the correct analysis at the correct point in time. There are at least some risks: how do
consumers react when they realize a firm can determine their pregnancy status?
Is contacting the consumer by mail a good idea? One interesting issue the firm
Target had to deal with was the complaints received from fathers who didn't
even know that their daughters were pregnant. They only discovered it upon
receipt of the personalized mail promotion. So it is necessary to determine the
risk to the business before implementing, e.g., a pregnancy-prediction model.

Exercise 5: Data Warehousing Versus Centralized Data Storage

1. All data is stored in one central database. External and internal data are
consolidated here. The incoming orders from a website, as well as the management
of relevant information, are saved in the database. A database management
system allows for restriction of user access. By accessing and transforming the
database content, complex management reports can be created.
2. If data is stored in a central database, then redundant information can be reduced.
Also each user can see the same data at a single point in time. Additionally, the
database can be managed centrally, and the IT staff can focus on keeping this
system up and running. Furthermore, the backup procedure is simpler and
cheaper.

3. There are several risks, such as . . .


– Within the database, different types of information are stored, such as trans-
actional data and managerial and strategic information. Even if databases
allow only restricted access to information and documents, there remains the
risk that these restrictions will be circumvented, e.g., by hacking a user
account and getting access to data that should be confidential.
– Running transactions to process the incoming orders is part of the daily
business of a firm. These transactions are not complex and the system
normally won’t break down, as long as the data processing capacities are in
line with the number of incoming web transactions.
Generating managerial-related data is more complex, however. If an application
or a researcher starts a performance-consuming transaction, the database
service can collapse. If this happens, then the orders from the consumer
website won’t be processed either. In fact, the entire IT system will
shut down.

4. Overall, however, consolidating data within a firm is a good idea. In particular, providing consistent data improves the quality of managerial decisions.
Avoiding the risk of complete database shutdown is very important, however.
That is why data relevant to managerial decision-making, or data that are the
basis of data analysis procedures, must be copied to a separate data warehouse.
There the information can be accumulated day by day. A time lag of 24 h, for
example, between data that is relevant to the operational process and the data
analysis itself is not critical in most firms. An exception would be firms such as
Dell that sell products with a higher price elasticity of demand.

Literature
Abts, D., & Mülder, W. (2009). Grundkurs Wirtschaftsinformatik: Eine kompakte und praxisorientierte Einführung, STUDIUM (6th ed.). Wiesbaden: Vieweg + Teubner.
Davenport, T., & Patil, D. J. (2012). Data scientist: The sexiest job of the 21st century. Harvard
Business Review, 90(10), 70–76.
de Ville, B. (2001). Microsoft data mining: Integrated business intelligence for e-Commerce and
knowledge management. Boston: Digital Press.
Ghauri, P. N., & Grønhaug, K. (2005). Research methods in business studies: A practical guide
(3rd ed.). New York: Financial Times Prentice Hall.
Herkenhoff, L., & Fogli, J. (2013). Applied statistics for business and management using Microsoft
Excel. New York: Springer.
IBM Website. (2015a). SPSS modeler edition comparison. Accessed June 16, 2015, from http://
www-01.ibm.com/software/analytics/spss/products/modeler/edition-comparison.html
IBM Website. (2015b). Why does $CLEO_DEMOS/DRUG1n find the file when there is no
$CLEO_DEMOS directory? Accessed June 22, 2015, from http://www-01.ibm.com/support/
docview.wss?uid=swg21478922
IBM Website. (2015c). Big data in the cloud. Accessed June 24, 2015, from http://www.ibm.com/
developerworks/library/bd-bigdatacloud/
Lavrakas, P. (2008). Encyclopedia of survey research methods. London: Sage Publications.

The Economist. (2010a). Data, data everywhere: A special report on managing information (Vol.
2010, No. 3).
The Economist. (2010b). All too much. Accessed June 26, 2015, from http://www.economist.com/
node/15557421
The Economist. (2014). Data – Getting to know. Economist, 2014(13), 5–6.
The New York Times Magazine. (2012). How companies learn your secrets. Accessed June
26, 2015, from http://www.nytimes.com/2012/02/19/magazine/shopping-habits.html?_r=0
Walmart. (2012). Company thanks and rewards associates for serving millions of customers.
Accessed July 1, 2015, from http://news.walmart.com/news-archive/2012/11/23/walmart-us-
reports-best-ever-black-friday-events
Weiers, R. M., Gray, J. B., & Peters, L. H. (2011). Introduction to business statistics (7th ed.).
Mason, OH: South-Western Cengage Learning.
2 Basic Functions of the SPSS Modeler

After finishing this chapter, the reader is able to . . .

1. Define and run streams.


2. Explain the necessity of value labels and know how to implement them in the
SPSS Modeler.
3. Use data handling methods, e.g., filtering, to extract specific information needed
to answer research questions.
4. Explain in detail the usage of methods to handle datasets, such as sampling
and merging.
5. Finally, explain why splitting datasets into different partitions is required to create
and assess statistical models.

So the successful reader will gain proficiency in the use of the computer and the
IBM SPSS Modeler to prepare data and handle even more complex datasets.

2.1 Defining Streams and Scrolling Through a Dataset

Description of the model


Stream name Read SAV file data.str
Based on dataset tree_credit.sav
Stream structure

Related exercises: 1, 2


Theoretical Background
The first step in an analytical process is getting access to the data. We assume that
we have access to the data files provided with this book and we know the folder
where the data files are stored. Here we would like to give a short outline of how to
import the data into a stream. Later in Sect. 3.1.1, we will learn how to distinguish
between discrete and continuous variables. In Sect. 3.1.2, we will refer to the
procedure for determining the correct scale type in more detail.

Getting Access to a Data Source


The dataset “tree_credit.sav” should be used here as an example. The file extension
“sav” shows us that it is an SPSS Statistics file. Now we will describe how to add
such a data source to the stream and define value labels. Afterwards we will add a
Table node, which enables us to scroll through the records.

1. We open a new and empty stream by using the shortcut “Ctrl+N” or the toolbar
item “File/New”.
2. Now we save the file to an appropriate directory. To do this, we use “File/Save
Stream”, as shown in Fig. 2.1.
3. The source file used here is an SPSS-Data file with the extension “SAV”. So, we
add a “Statistics File” node from the Modeler tab “Sources”. We then double-
click on the node to open the settings dialog window. We define the folder, as
shown in Fig. 2.2. Here the prefix “$BOOKDATA/” represents the placeholder
for the data directory. We have already explained that option in Sect. 1.2.3. If we

Fig. 2.1 Saving a stream



Fig. 2.2 Parameters of the Statistics File node

Fig. 2.3 Unconnected nodes in a stream

choose not to add such a parameter to the Windows registry, this part of the file
path should be replaced with the correct Windows path, e.g., “C:\DATA\” or
similar paths. The filename is “tree_credit.sav” in every case, and we can
confirm the parameters with the “OK” button.
4. Until now, the stream has only included a Statistics File node. In the next step,
we add a Type node from the Modeler toolbar tab “Field Ops”, by dragging and
dropping it into the stream. Using this method, the nodes are unconnected (see
Fig. 2.3).
5. Next, we have to connect the Statistics File node with the Type node. To do this,
we simply left-click on the Statistics file node once and press F2. Then left-click
on the Type node. The result is shown in Fig. 2.4.

Fig. 2.4 Two connected nodes in a stream

Fig. 2.5 Details of a Type node

There is also another method for connecting two nodes: In step four, we added
a Statistics File node to the stream. Normally, the next node has to be connected
to this Statistics File node. For this reason, we would left-click once on the
current node in the stream, e.g., the Statistics File node. This would mark the
node. If we now left-click twice on the new Type node in “Field Ops”, the Type
node will be added and automatically connected to the Statistics File node.
6. So far, we have added the data to the stream and connected the Statistics File
node to the Type node. If we want to see and probably modify the settings, we
double-click on the Type node. As shown in Fig. 2.5, the variable names appear,
their scale type and also their role in the stream. We will discuss the details of the
different scales in Sect. 3.1.1 and 3.1.2. For now we can summarize that a Type
node can be used to define the scale of measurement per variable, as well as their
role, such as with input or target variables. If we use nodes other than a Statistics
File node, the scale types are not predefined and must be determined by the user.

Fig. 2.6 Final stream “Read SAV file data”

7. To have the chance to scroll through the records, we should also add a
Table node. We can find it in the Modeler toolbar tab “Output”, and add it to
the stream. Finally, we should connect the Type node to the Table node, as
outlined above. Figure 2.6 shows the three nodes in the stream.

" To extend a stream with a new node, we can use two different
procedures:

1. We add the new node. We connect it with an existing node afterwards as follows: We click the existing node once. We press F2 and then we click the new node. Both nodes are now connected in the correct order.
2. Normally, we know the nodes that have to be connected. So before
we add the new node, we should click on the existing node, to
activate it. Then we double-click the new node in the specific tab.
Both nodes are now automatically connected.

" A Type node must be included right after a Source node. Here the
scale of measurement can be determined, as well as the role of a
variable.

" We have to keep in mind that it is not possible to show the results of
two sub-streams in one Table node. To do that, we would have to use a
merge or append operation to consolidate the results. See Sect. 2.7.9
and 2.7.10.

8. If we now double-click on the Table node, the dialog window in Fig. 2.7 appears.
To show the table with the records, we click on “Run”.
9. Now we can scroll through the records as shown in Fig. 2.8. After that, we can
close the window with “OK”.

” Often it is unnecessary to modify the settings of a node, e.g., a Table node, because we can simply right-click on the node in the stream and select "Run" (see Fig. 2.9).

Fig. 2.7 Dialog window in the Table node

Fig. 2.8 Records shown with a Table node

Recommended Nodes in a Stream


In Sect. 1.2.3, we explained the template concept used in this book. It is true that we
normally don’t deal with nodes when loading the data into the Modeler and defining
the scale of measurement of the included variables. It is necessary, however, to
discuss which nodes in general must be a part of a stream.
As outlined in Table 2.1, we must use a Source node to open the file with the
given information. After the Source node, we should add a Type node to define

Fig. 2.9 Options that appear


by right-clicking on a node

Table 2.1 Important nodes and their functionalities


Source node, e.g., Loads the data in the SPSS Modeler. The type of the Source node
Statistics node depends on the type of file where the dataset is saved. They are . . .
– Variable File node:
Text files with values separated by a delimiter, e.g., a comma (CSV).
– Fixed File node:
Text files with values saved in columns, but no delimiters used
between the values.
– Excel File Node:
Microsoft Excel files in “xlsx” format (not “xls”).
– Statistics File node:
Opens SPSS Statistics files.
Type node A Type node can be used to . . .
– Determine the scale of measurement of the variables.
– Define the role of each variable, e.g., Input or Target variable.
– Define value labels.
Table node Shows the values/information included in the data file opened by a
Source node.

the scale of measurement for each variable. Finally, we recommend adding a


Table node to each stream, which gives us the chance to scroll though the data.
This helps us inspect the data and become familiar with the different variables.
Figure 2.6 shows the final stream we created for loading the dataset, given in
the form of the statistics file “tree_credit.sav”. Here we added a Table node, right
after the Type node. Sometimes we also connect the Table node directly to the
Source node.

2.2 Switching Between Different Streams

Description of the model


Stream name Filter_processor_dataset and Filter_processor_dataset modified
Based on dataset benchmark.xlsx

The aim of this section is to show how to handle different streams at the
same time. We will explain shortly the functions of both streams used here in
Sect. 2.7.5.

1. We start with a clean SPSS Modeler environment, so please make sure each
stream is closed. We will probably need to close the Modeler and open it again.
2. We open the stream “Filter_processor_dataset”. In this stream several records
from a dataset will be selected and others are hidden in the nodes at the end of the
stream.
3. We also open the stream “Filter_processor_dataset modified”.
4. At this time, two streams are open in the Modeler. We can see that in the Modeler
Managers sidebar on the right. Here on top we can find all the open streams. By
clicking on the stream name, we can switch between the streams (Fig. 2.10).
5. Additionally, it is possible to execute other commands, e.g., open a new stream
or close a stream here. To do so we click with the right mouse button inside the
“Streams” section of the sidebar (see Fig. 2.11).
6. Switching between the streams sometimes helps us find out which different
nodes are being used. As depicted in Fig. 2.12 in the stream

Fig. 2.10 Open streams in the Modeler sidebar



Fig. 2.11 Advanced options in the Modeler's sidebar

"Filter_processor_dataset", the Filter node and the Table node in the middle have
been added, in comparison with the stream "Filter_processor_dataset modified".
We will explain the functions of both streams in Sect. 2.7.5.

" Switching between streams is helpful for finding the different nodes
used or for showing some important information in data analysis,
without modifying the actual stream. If the streams are open, then in
the Modeler sidebar on the right we can find them and can switch to
another stream by clicking it once.

” Using the right mouse button in the Modeler's sidebar on the right, a
new dialog appears that allows us, for example, to close the active
stream.

Fig. 2.12 Stream “Filter_processor_dataset”

" If there are SuperNodes, they would also appear in the Modeler’s
sidebar. For details see Sect. 3.2.5.

2.3 Defining or Modifying Value Labels

Description of the model


Stream name modify_value_labels
Based on dataset tree_credit.sav
Stream structure

Related exercises: 3

Theoretical Background
Value labels are useful descriptions. They allow convenient interpretation. By defining
a good value label, we can later determine the axis annotations in diagrams too.

Working with Value Labels

1. Here we want to use the stream we discussed in the previous section. To extend
this stream, let’s open “Read SAV file data.str” (Fig. 2.13).
2. Save the stream under another name.
3. To open the settings dialog window shown in Fig. 2.14, we double-click on the
Type node. In the second column of the settings in this node, we can see that the
Modeler automatically inspects the source file and tries to determine the correct

Fig. 2.13 Stream for reading the data and modifying the value labels

Fig. 2.14 Parameters of the Type node



scale of measurement. In Sects. 3.1.1 and 3.1.2, we will show how to determine
the correct scale of measurement and to modify the settings. Here we accept the
parameters.
4. Let’s discuss here the so-called “value labels”. If the number of different values
is restricted, it makes sense to give certain values a name or a label. In this
example we will focus on the variable “Credit_cards”. In column 3 of Fig. 2.14,
we can see that at least two different values, 1.0 and 2.0, are possible. If we click
on the Values field that is marked with an arrow in Fig. 2.14, a drop-down list
appears. See step 1 of how to specify a value label, Fig. 2.15.
5. First click the button “Read Values”. See Fig. 2.15! To show and to modify the
value labels, let’s click on “Specify”. Figure 2.16 shows that in the SPSS-data
file, two labels are predefined: 1 = "Less than 5" and 2 = "5 or more" credit
cards. Normally, these labels are passed through the Type node without any
modification (see arrow in Fig. 2.16).

" In a Type node, value labels can be defined or modified in the


“Values” column. First click the button “Read Values”. Then the
option “Specify. . .” in the column “Values” should be used to define
the labels.
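Conceptually, a value label is nothing more than a mapping from a stored code to a readable text. As a minimal sketch outside the Modeler (illustrative only; the sample codes are invented), the same idea looks like this in Python with pandas:

  # Illustrative sketch: value labels as a code-to-text mapping
  import pandas as pd

  credit_cards = pd.Series([1, 2, 2, 1, 2])       # stored codes (invented sample)
  labels = {1: "Less than 5", 2: "5 or more"}     # value labels as defined in the Type node
  print(credit_cards.map(labels))                 # readable output, e.g., for tables and charts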

6. If we wish to define another label, we set the option to “Specify values and
labels”. We can now modify our own new label. In this stream, we use the label

Fig. 2.15 Step 1 of how to specify a value label in a Type node



Fig. 2.16 Step 2 of how to specify a value label in a Type node

“1, 2, 3, or 4” instead of “Less than 5”. Figure 2.17 shows the final step for
defining a new label for the value 1.0 within the variable “Credit_cards”.
7. We can confirm the new label with “OK”. The “<Read+>” text in the
column "Values" shows us that we successfully modified the value labels
(Fig. 2.18).
Usually, the Modeler will determine the correct scale of measurement of the
variables. However, an additional Type node will give us the chance to actually
change the scale of measurement for each variable. We therefore suggest adding
this node to each stream.
The dataset "tree_credit.sav" includes a definition of the scale of measurement,
but this definition is incorrect. So, we should adjust the settings in the
“Measurement” column, as shown in Fig. 2.19. It is especially important to
check variables that should be analyzed with a “Distribution node”. They have
to be defined as discrete! This is the case with the “Credit_cards” variable.
8. We can close the dialog window without any other modification and click on
“OK”.

Fig. 2.17 Step 3 of how to specify a value label in a Type node

Fig. 2.18 Modified value labels in the Type node



Fig. 2.19 Parameters of a Type node

Fig. 2.20 Records without value labels in a Table node

9. If we want to scroll through the records and wish to have an overview of the
new value labels, we should right-click on the Table node. Then we can use
“Run” to show the records. Figure 2.20 shows the result.
10. To show the value labels defined in step 6, we use the button in the middle of
the toolbar of the dialog window. It is marked with an arrow in Fig. 2.20.
11. Now we can see the value labels in Fig. 2.21. We can close the window with
“OK”.

Fig. 2.21 Records with value labels in a Table node

2.4 Adding Comments to a Stream

Description of the model


Stream name tree_credit_commented.str
Based on dataset tree_credit.sav
Related exercises: 3

Good and self-explanatory documentation of a stream can help reduce the time
needed to understand the stream or at least to identify the nodes that have to be
modified to find the correct information. We want to start, therefore, with a simple
example of how to add a comment to a template stream.

1. Open the template stream “Template-Stream tree_credit”.


2. To avoid any changes to the original template stream, we save the stream under
another name, e.g., “tree_credit_commented.str”.
3. Now we comment on the two nodes in the stream by assigning them a box with
the comment that they are part of the template stream. So modifications afterwards
will be easier to find.
To comment on a specific part of a stream, without assigning the comment to a
specific node, we have to make sure that no node is active. To do this, we should
click on the background of the stream and not on a specific node. This
deactivates any nodes that could be active. Now we click on the icon "Add
new comment" on the right-hand side of the Modeler's toolbar, shown in
Fig. 2.22.
4. A new empty comment appears and we can define the text or the description,
e.g., “Template stream nodes”. Figure 2.23 shows the result.

Fig. 2.22 Toolbar icon “Add


new comment”

Fig. 2.23 A comment is


added to the stream

Fig. 2.24 The comment is


moved and resized into the
background of two nodes

5. Now we move and resize the comment so that it is in the background of both
nodes (see Fig. 2.24).
6. If we want to assign a comment to more than one node, we have to first mark the
nodes with the mouse by clicking them once. We will probably have to use the
Shift key to mark more than one node.
7. Now once more, we can add a comment by using the icon in the toolbar (see
Fig. 2.22). Alternatively, we can right-click and choose “New Comment . . .”.
(see Fig. 2.25).
8. We define the comment, e.g., as shown in Fig. 2.26, with “source node to load
dataset”. If there is no additional comment in the background, a dotted line
appears to connect the comment to the assigned node.

Fig. 2.25 Context dialog following a right-click

Fig. 2.26 A comment is assigned to a specific node

" Commenting on a stream helps us work more efficiently and can describe
the functionalities of several nodes. There are two types of comments:

1. Comments assigned to one or more nodes, by activating the nodes before adding the comment.
2. Remarks that are in the background, used for commenting on
several parts of a stream, but that are not associated with one or
more nodes.

" Comments can be added by using the “Add new comment . . .” toolbar
icon or by right-clicking the mouse.

" Due to print-related restrictions with this book, we did not use comments
in the streams. Nevertheless, we strongly recommend this procedure.

2.5 Exercises

Exercise 1: Fundamental Stream Details

1. Name the nodes that should, at a minimum, be included in each stream. Explain
their role in the stream.
2. Explain especially the different roles of a Type node in a stream.
3. In Fig. 2.6 we connected the Table node to the Type node.
(a) Open the stream “Read SAV file data.str”.
(b) Remove the connection of this node with a right click on it. Then connect it
instead directly to the Source/Statistics File node.

Exercise 2: Using Excel Data in a Stream


In this exercise, we would like you to become familiar with the Excel File node.
Often we download data in Excel format from websites, or we have to deal with
secondary data that we’ve gotten from other departments. Unfortunately, the SPSS
Modeler cannot deal with the Excel 2003 file format or previous versions. So we
have to make sure to get the data in the 2010 file format at least or convert it. For
this, we have to use Excel itself.
Figure 2.27 shows some official labor market statistical records from the UK
Office for National Statistics and its “Nomis” website. This Excel file has some
specific features:

– There are two worksheets in the workbook; the "data_modified" worksheet includes the data we would like to import here. We can ignore the second spreadsheet "Data".
– The first row includes a description of the columns below. We should use these
names as our variable names.
– The end of the table has no additional information. The SPSS Modeler’s import
procedure can stop at the last row that contains values.

Fig. 2.27 Part of the dataset “england_payment_fulltime_2014_reduced.xls”



Please create a stream, as described in the following steps:

1. Open a new stream.


2. Add an Excel node. Modify its parameters so that the data described can be
analyzed in the stream.
3. Add a Table node to scroll through the records.

Exercise 3: Defining Value Labels and Adding Comments to a Stream


In this exercise, we want to show how to define value labels in a stream, so that the
numbers are more self-explanatory.

1. Open the stream "Template-Stream_Customer_Bank.str". Save the stream under another name.
2. The dataset contains a "DEFAULTED" variable. Define labels so that 0 equals "no",
1 equals "yes", and $null$ equals "N/A" (not available).
3. Assign a new comment to the Type node, which shows the text "modifies value
labels".
4. Verify the definition of the value labels using a Table node.

2.6 Solutions

Exercise 1: Fundamental Stream Details


Name of the solution streams Read SAV file data modified.str
Theory discussed in section: Section 2.1

1. The recommended nodes that should, at a minimum, be included in each stream can be found with a description in Table 2.1.
2. See also Table 2.1 for the different roles of a Type node in a stream.
3. The modified stream can be found in “Read SAV file data modified.str”.
To remove the existing connection between a Type node and a Table node, we
right-click on the connection and use “Delete connection”. Then with a left-
click, we activate the Statistics file node. We press the F2 key and finally click on
the Table node. Now the Source node and the Table node should be connected as
shown in Fig. 2.28.

Exercise 2: Using Excel Data in a Stream


Name of the solution streams using_excel_data.str
Theory discussed in section: Section 2.1

1. The final stream can be found in “using_excel_data.str”. Figure 2.29 shows the
stream and its two nodes.
2. To get access to the Excel data, the parameters of the Excel node should be
modified, as shown in Fig. 2.30.

Fig. 2.28 Stream “Read SAV file data modified.str”

Fig. 2.29 Nodes in the stream “using_excel_data”

Fig. 2.30 Parameters of the Excel node



Fig. 2.31 Paste only values


in Excel

3. The path to the file can be different, depending on the folder where the file is
being stored. Here we used a placeholder "$BOOKDATA" (see Sect. 1.2.3).
4. In particular, we would like to draw attention to the options “Choose worksheet”
and “On blank rows”.
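For readers who want to cross-check the import outside the Modeler, the same logic (choosing the worksheet, taking the variable names from the first row, and ignoring blank rows) can be sketched with pandas. This is only an illustration, not part of the solution stream, and it assumes a suitable Excel engine is installed:

  # Illustrative only: the same import logic with pandas
  import pandas as pd

  df = pd.read_excel(
      "england_payment_fulltime_2014_reduced.xls",
      sheet_name="data_modified",   # corresponds to the "Choose worksheet" option
      header=0                      # first row supplies the variable names
  )
  df = df.dropna(how="all")         # skip completely blank rows, similar to "On blank rows"
  print(df.head())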

" The Modeler does not always import calculations included in an Excel-
Worksheet correctly. Therefore, the new values should be inspected,
with a Table node for example. If NULL-values occur in the Excel-
Worksheet, all cells should be marked, copied, and pasted using the
function “Paste/Values Only” (see Fig. 2.31).

Exercise 3: Defining Value Labels and Adding Comments to a Stream


Name of the solution streams Stream_Customer_Bank_modified.str
Theory discussed in section: Section 2.3

In this exercise, we want to show how to define value labels in a stream, so that
the numbers are more self-explanatory.

Fig. 2.32 User-defined value labels are specified in a Type node (step 1)

1. The solution stream can be found in “Stream_Customer_Bank_modified.str”.


2. To define the new value labels, we open the Type node with a double-click. Then
we use “specify” in the second column for the “DEFAULTED” variable (see
Fig. 2.32).
First the option “Specify values and labels” must be activated. Then the
recommended value labels can be defined, as shown in Fig. 2.33.
3. To assign a comment to the Type node, we activate the Type node with a single
mouse click. Then we use the toolbar icon “Add new comment”, as shown in
Fig. 2.22, and add the text. Figure 2.34 shows the result.
4. To verify the definition of the value labels, the existing Table node can’t be used.
That's because it is connected to the Source node, whereas the Type node changes
the labels only later in the stream. So it is better to add a new Table node and connect it to
the Type node (see Fig. 2.35).
To verify the new labels, we double-click the second Table node and click
“Run”. To show the value labels, we press the button “Display field and value
labels” in the middle of the toolbar. The last column in Fig. 2.36 shows the new
labels.

Fig. 2.33 User-defined value labels are specified in a Type node (step 2)

Fig. 2.34 Comment is assigned to a Type node



Fig. 2.35 Final stream “Stream_Customer_Bank_modified.str”

Fig. 2.36 Dataset “customer_bank_data” with value labels

2.7 Data Handling and Sampling Methods

2.7.1 Theory

Data mining procedures can only be successfully applied to well-prepared datasets.


In Chap. 3, we will discuss methods to analyze variables one by one. These
procedures can be used, e.g., to calculate a measure of central tendency and
volatility, as well as to combine both types of measures to identify outliers.
In the first part of the chapter “Multivariate Statistics”, we then discuss methods
to combine two variables, e.g., to analyze their (linear) correlation.
But before that, we want to show how to calculate new variables, as well as how to deal
with variables that represent text. Furthermore, in this section we will discuss

Fig. 2.37 Data handling topics discussed in the following section

methods that help us to generate specific subsets within the original dataset. By so
doing, more methods, e.g., cluster analysis, can be applied. This is necessary
because of the complexity of these multivariate techniques. If we can separate
representative subsets, then these methods become applicable despite any
limitations of time or hardware restrictions.
Figure 2.37 outlines the big picture for methods discussed in the following
sections. The relevant section has been listed alongside each method. We will use
various datasets to discuss the different procedures. To have the chance to focus on
particular sets and use different methods to deal with the values or records included,
we have reordered the methods applied.

2.7.2 Calculations

Description of the model


Stream name simple_calculations
Based on dataset IT_user_satisfaction.sav
Stream structure

Related exercises: 2, 3, 4, 5 and exercise “Creating a flag variable” in Sect. 3.2



Table 2.2 Questionnaire items related to the training days

training_days_actual – How many working days per year do you actually spend on IT-relevant training?
Answer options: 0 = none, 1 = 1 day, 2 = 2 days, 3 = more than 2 days

training_days_to_add – How many working days should be available per year for training and further education, additional to the above-mentioned existing time budget for training during working hours?
Answer options: 0 = none, 1 = 1 day, 2 = 2 days, 3 = more than 2 days

Theory
Normally, we want to analyze all the variables that are included in our datasets.
Nevertheless, often it is not enough to calculate measures of central tendency, etc.,
or to determine the frequency of several values. In addition, we also have to
calculate other measures, e.g., performance indicators. We will explain here how
to do that using the Modeler.

Simple Calculations Using a Derive Node


In the dataset "IT_user_satisfaction", we find two questions regarding the actual
and the additional training days the users in a firm get or expect. Table 2.2 displays
the actual questions and their coding.
If we now want to determine the total number of training days the IT
users expect, we can simply add the values or codes of both variables. The only
uncertainty is that code "3" represents "more than 2 days", so a user could in fact
expect 4 or even 5 days. Hopefully, the probabilities for these options are
relatively small, so the workaround of calculating the sum should suffice.

1. We open the "Template-Stream IT_user_satisfaction" to get access to the dataset "IT_user_satisfaction".
2. Behind the Type node we add from the Field Ops tab a Derive node and connect
both nodes (see Fig. 2.38).
3. Now we want to name the new variable, as well as define the formula to calculate
it. We double-click on the Derive node. In the dialog window that opens we can
choose in “Derive field” the name of the new variable. We use here
“training_expected_total_1”.

" With a Derive node new variables can be calculated. We suggest using
self-explanatory names for those variables. Short names may be easier
to handle in a stream but often it is hard to figure out what they
stand for.

Fig. 2.38 Derive node is


added to the initial stream

” After a Derive node, a Type node should always be added to assign the correct scale of measurement to the new variable.

4. To show the results, we add a Table node behind the Derive node. We connect
both nodes and run the Table node (see Fig. 2.41).
The last column of Fig. 2.40 shows the result of the calculation.
5. To interpret the results more easily we can use a frequency distribution, so we
add a Distribution node behind the Derive node. We select the new variable
"training_expected_total_1" in the Distribution node to show the results. Figure 2.41
depicts the actual status of the stream. As we can see in Fig. 2.42, more
than 30 % of the users expect to have more than 3 days sponsored by the firm to
become familiar with the IT system.
6. Now we want to explain another option for calculating the result, because the
formula “training_days_actual+training_days_to_add” used in the first Derive
node shown in Fig. 2.39 can be substituted with a more sophisticated version.
Using the predefined function “sum_n” is simpler in this case, and we can also
learn how to deal with a list of variables.
The new variable name is “training_expected_total_2”. Figure 2.43 shows the
formula.
7. To define the formula, we double-click on the Derive node. A dialog window
appears, as shown in Fig. 2.43. With the little calculator symbol on the left-hand
side (marked in Fig. 2.43), we can start using the “expression builder”. It helps us
to select the function and to understand the parameter each function expects.
Figure 2.44 shows the expression builder. In the middle of the window, we can
select the type of function we want to use. Here, we choose the category
“numeric”. In the list below we select the function “sum_n”, so that we can
find out the parameters this function expects. An explanatory note below the
table tells us that "sum_n(list)" expects a list. The most important details are the
brackets [. . .] used to create the list of variables. The final formula used here is:
sum_n([training_days_actual,training_days_to_add])
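As a cross-check of this Derive node logic outside the Modeler, the same row-wise sum can be sketched in Python with pandas. The sample codes below are invented and serve only as an illustration:

  # Illustrative only: the same calculation as the Derive node, done with pandas
  import pandas as pd

  df = pd.DataFrame({"training_days_actual": [0, 1, 2, 3],
                     "training_days_to_add": [1, 2, 0, 3]})    # invented sample codes
  df["training_expected_total"] = df[["training_days_actual",
                                      "training_days_to_add"]].sum(axis=1)
  print(df)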

Fig. 2.39 Parameters of a Derive node

Fig. 2.40 Calculated values in the Table node



Fig. 2.41 Distribution node to show the results of the calculation

Fig. 2.42 Distribution of the total training days expected by the user

" The expression builder in the Modeler can be used to define formulas.
It offers a wide range of predefined functions.

" In functions that expect lists of variables, we must use brackets [].

” An example is "sum_n([var1,var2])" to calculate the sum of both variables.

8. To show the result, we add a new Table node behind the second Derive node.
Figure 2.45 shows the final stream. The results of the calculation are the same, as
shown in Fig. 2.40.

Fig. 2.43 Derive node with the expression builder symbol

Fig. 2.44 Expression builder used in a Derive node

We want to add another important remark here: we defined two new variables,
"training_expected_total_1" and "training_expected_total_2", in the Derive nodes.
Both have unique names. Even so, they cannot be shown in the same Table node,
although connecting one Table node to both Derive nodes would seem to make sense here.

Fig. 2.45 Final stream “simple_calculations”

We should divide a stream, however, if we want to calculate different measures or to analyze different aspects. It is inadvisable to join both parts together.
" When defining sub-streams for different parts of an analysis, we have
to keep in mind that it is not possible (or error prone and inadvisable)
to show the results of two sub-streams in one Table node.

2.7.3 String Functions

Description of the model


Stream name string_functions
Based on dataset england_payment_fulltime_2014_reduced.xls
Stream structure

Related exercises: 6

Fig. 2.46 Expression builder of a Derive node

Theory
Until now we used the Derive node to deal with numbers, but this node type can also
deal with strings. If we start the expression builder in a Derive node (Fig. 2.43 shows
how to do this), we find the category “string” on the left-hand side (see Fig. 2.46). In
this section, we want to explain how to use these string functions generally.

Separating Substrings
The dataset “england_payment_fulltime_2014_reduced” includes the median of
the weekly payments in different UK regions. The source of the data is the UK
Office for National Statistics and its website NOMIS UK (2014). The data are based
on an annual workplace analysis coming from the Annual Survey of Hours and
Earnings (ASHE). For more details see Sect. 10.1.12. Figure 2.47 shows some
records, and Table 2.3 shows the different region or area codes.
We do not want to examine the payment data, however. Instead, we want to
extract the different area types from the first column. As shown in Fig. 2.47, in the
first column the type of the region and the names are separated by a colon ":". Now we
use different string functions of the Modeler to extract the type. We will explain
three “calculations” to get the region type. Later in the exercise we want to extract
the region names.

Fig. 2.47 Weekly payments in different UK regions

Table 2.3 Area codes

ualad09 – District
pca10 – Parliamentary constituencies
gor – Region

Fig. 2.48 Derive node is added to the template stream

1. We open the stream "Template-Stream England payment 2014 reduced" to get access to the data. The aim of the steps that then follow is to extend this stream. At the end there should be a variable with the type of the area each record represents.
2. After the Type node we add a Derive node and connect both nodes (see
Fig. 2.48).
3. Double-clicking on the Derive node, we can define the name of the new variable
calculated there. We use the name “area_type_version_1” to distinguish the
different results from each other (Fig. 2.49).

Fig. 2.49 Parameters of the first Derive node

4. The formula here is:


startstring(locchar(“:”, 1, admin_description)-1,admin_description)
The function “startstring” needs two parameters. As the extraction procedure
always starts at the first character of the string, we only need to define the
number of characters to extract. This is the first parameter. Here we use "locchar
(':',1, admin_description)" to determine the position of the ":". We subtract one
to exclude the colon from the result.
The second parameter of “startstring” tells this procedure to use the string
“admin_description” for extraction.
5. After the Derive node, we add a Table node to show the results (see the upper
part of Fig. 2.51). If we run the Table node, we get a result as shown in Fig. 2.50.
We can find the area type in the last column.
6. Of course we could now calculate the frequencies of each type here. Instead of
this, we want to show two other possible ways to extract the area types. As
depicted in the final stream in Fig. 2.51, we therefore add two Derive nodes and
two Table nodes.

Fig. 2.50 Table node with the extracted area type in the last column

Fig. 2.51 Final stream “string_functions”

7. In the second Derive node, we use the formula


substring(1,locchar(“:”,1, admin_description)-1,admin_description)
The function “substring” also extracts parts of the values represented by the
variable “admin_description”, but the procedure needs three parameters:
(1) Where to start the extraction: in this case, position 1
(2) The number of characters to extract: “locchar(‘:’,1, admin_description)-1”
(3) Which string to extract from: admin_description
8. In the third Derive node, we used a more straightforward approach. The function
is
“textsplit(admin_description,1,‘:’)”
It separates the first part of the string and stops at the colon.
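The three Derive node formulas all implement the same idea: take everything before the colon. As an illustration only, with an invented sample value, the equivalent logic looks like this in Python:

  # Illustrative only: extracting the area type before the colon
  admin_description = "gor:East Midlands"                          # invented sample value

  area_type_v1 = admin_description[:admin_description.find(":")]   # similar to startstring/locchar
  area_type_v2 = admin_description.split(":", 1)[0]                # similar to textsplit(..., 1, ":")
  print(area_type_v1, area_type_v2)                                # both print "gor"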

2.7.4 Extracting/Selecting Records

Description of the model


Stream name selecting_records
Based on dataset benchmark.xlsx
Stream structure

Related exercises: 4

Theoretical Background
Datasets with a large number of records are common. Often not all of the records
are useful for data mining purposes, so there should be a way to determine records
that meet a specific condition. Therefore, we want to have a look at the Modeler’s
Select node.

Filtering Processor Data Depending on the Manufacturer’s Name


The file “benchmark.xlsx” contains a list of AMD and Intel processors. Table 2.4
explains the variables. For more details see also Sect. 10.1.3. Here, we want to
extract the processors that are produced by the firm Intel.

1. We use the template stream "Template-Stream_processor_benchmark_test" as a starting point. Figure 2.52 shows the structure of the stream.
2. Running the Table node, we can find 22 records. The first column of Fig. 2.53
shows that the processors are manufactured by Intel or AMD.
3. To extract the records related to Intel processors, we add a Select node from the
Modeler's Record Ops tab and connect it with the Excel File node (see
Fig. 2.54).

Table 2.4 Variables in dataset "benchmark.xlsx"

Firm – Name of the processor company
Processor type – Name of the processor
EUR – Price of the processor
CB – Score of the Cinebench 10 test

Fig. 2.52 Structure of “Template-Stream_processor_benchmark_test”

Fig. 2.53 Processor data in “benchmark.xlsx”

Fig. 2.54 Select node is added to the initial stream



4. To modify the parameters of the Select node, we double-click on it. In the dialog
window we can start the expression builder using the button on the right-hand
side. It is marked with an arrow in Fig. 2.55.
In the expression builder (Fig. 2.56), we first double-click on the variable “firm”
and add it to the command window. Then we can extend the statement manually
by defining the complete condition as: firm = "Intel".

Fig. 2.55 Dialog window of the Select node

Fig. 2.56 Expression Builder with the Selection statement



Figure 2.56 shows the complete statement in the expression builder. We can
confirm the changes with “OK” and close all Select node dialog windows.

5. Finally, we should add a Table node behind the Select node to inspect the
selected records. Figure 2.57 shows the final stream.

Running the Table node at the end of the stream, we can find the selected
12 records of processors from Intel (see Fig. 2.58).

Fig. 2.57 Stream “selecting_records”

Fig. 2.58 Selected Intel processor data



" A Select node can be used to identify records that meet a specific
condition. The condition can be defined by using the expression
builder.
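Outside the Modeler, the Select node corresponds to a row filter. The following sketch is only an illustration; it assumes the column is named "firm", as in the expression builder above:

  # Illustrative only: the condition firm = "Intel" as a row filter in pandas
  import pandas as pd

  df = pd.read_excel("benchmark.xlsx")        # column names assumed: firm, processor type, EUR, CB
  intel_only = df[df["firm"] == "Intel"]      # keep only the records (rows) that meet the condition
  print(len(intel_only))                      # should report the 12 Intel processors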

2.7.5 Filtering Data

Description of the model


Stream name Filter_processor_dataset
Based on dataset benchmark.xlsx
Stream structure

Related exercises: 7

Theoretical Background
In data mining we often have to deal with many variables, but usually not all of
them should be used in the modeling process. The record ID or the names of objects
that are often included in each dataset are good examples of this. With the ID we
can specify certain objects or records, but neither the ID nor the name is useful for
the statistical modeling process itself.

To reduce the number of variables or to assign another name to a variable, the Filter node can be used. This is particularly necessary if we would like to cut down the number of variables in a specific part of a stream.

Filtering the IDs of PC Processors


Benchmark tests are used to identify the performance of computer processors. The file
“benchmark.xlsx” contains a list of AMD and Intel processors. Alongside the price in
Euro and the result of a benchmark test performed with the test application Cinebench
10 (“CB”), the name of the firm and the type of processor are also included. For more
details see Sect. 10.1.3.
Of course we do not need the type of the processor to examine the correlations
between price and performance etc. So we can eliminate this variable from the
calculations in the stream. The name of the firm is important though, as it will be
used to create a scatterplot for the processors coming from Intel or from AMD. We
therefore should not remove this variable completely.

" A Filter node can be used to (1) reduce the number of variables in a
stream and (2) to rename variables. Indeed in the Source nodes of the
Modeler, the same functionalities are available, but an extra Filter
node should be used for transparency reasons, or if the number of
variables should be reduced only in a specific part of the stream (and not
in the whole stream).

" The Filter node does not filter the records! It only reduces the number
of variables.
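To make the contrast explicit: a Filter node works on columns (variables), not on rows (records). A minimal sketch of the same two operations, excluding a variable and renaming two others, could look as follows; the original column names are assumptions based on Table 2.4, and the new names are only examples:

  # Illustrative only: column filtering and renaming, analogous to a Filter node
  import pandas as pd

  df = pd.read_excel("benchmark.xlsx")                     # assumed columns: firm, processor type, EUR, CB
  df = df.drop(columns=["processor type"])                 # exclude a variable; all records are kept
  df = df.rename(columns={"EUR": "price_eur", "CB": "cinebench_score"})
  print(df.columns.tolist())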

We use the stream “Correlation_processor_benchmark”. A detailed description


of this stream and its functionalities can be found in the Exercises 2 and 3 of Sect.
4.8. Figure 2.59 shows the stream. Here we calculated the Pearson Correlation
Coefficient for the variables “Euro” (Price) and “CB” (Cinebench result).
To reduce the number of variables in the analytical process, we now want to
integrate a Filter node between the Excel node and the Type node.

1. We open the stream "Correlation_processor_benchmark", which is to be modified.
2. We use “File/Stream Save as . . .” to save the stream with another name, e.g.,
“Filter_processor_dataset”.
3. Now we remove the connection between the Excel node and the Type node by
right-clicking on it (on the connection—not on the nodes themselves!). We use
the option “Delete Connection”.
4. Now we insert a Filter node from the "Field Ops" tab and place it between the
Excel node and the Type node (see Fig. 2.60).

Fig. 2.59 Stream “Correlation_processor_benchmark”

5. Finally, we connect the new nodes in the right directions: we connect the Excel
node with the Filter node and the Filter node with the Type node. Figure 2.60
shows the result.
6. To have the chance to understand the functionality of the Filter node, we should
add another Table node and connect it with the Filter node (see Fig. 2.60).
7. To exclude some variables, we now double-click on the Filter node. Figure 2.61
shows the dialog window with all four variables.

Fig. 2.60 Stream “Filter_processor_dataset”

8. To exclude the variable “processor type” from the part of the stream behind the
Filter node and the analysis process, we click on the second arrow. Figure 2.62
shows the result.
9. We now can click “OK” to close the window.
10. To understand and check the functionality of the Filter node, we should
now inspect the dataset before and after usage of the Filter node. To do this
we double-click on the left as well as on the right Table node, as Fig. 2.63
visualizes. Figures 2.64 and 2.65 show the results.

” A Filter node can be used to exclude variables and to rename a variable without modifying the original dataset.

” If we instead would like to identify records that meet a specific condition, then we have to use the Select node!

The names of the variables are sometimes hard to figure out. If we would like to
improve the description and to modify the name, we can use the Filter node too. We
open the Filter node and overwrite the old names. Figure 2.66 shows an example.

Fig. 2.61 Parameters of a Filter node

Fig. 2.62 Excluding variables using a Filter node



Fig. 2.63 Table nodes to compare the variables in the original dataset and those behind the
Filter node

Fig. 2.64 Variables of the original “benchmark.xlsx” dataset



Fig. 2.65 Filtered variables of the “benchmark.xlsx” dataset

Fig. 2.66 Using a Filter node to modify variable names



Fig. 2.67 Excel Input node options with a disabled variable

Unfortunately, the stream now won't work correctly anymore! We also have to
adjust this variable name in the following nodes or formulas of the stream. In the
end, we can summarize that renaming a variable makes sense and should be done in
the Filter node, as long as the subsequent nodes are adjusted accordingly.
Here, we have explained how to use the Filter node in general, but there are also
other options for reducing the number of variables. If we would like to reduce them
for the whole stream, we can also use options in the Source node itself, e.g., in an
Excel Input node or a Variable File node. Figure 2.67 shows how to disable the
variable “processor type” directly in the Excel File node.

" The parameters of the Source nodes can be used to reduce the
number of variables in a stream. We suggest, however, adding a
separate Filter node right behind a Source node, to create easy to
understand streams.

2.7.6 Data Standardization: Z-Transformation

Description of the model


Stream name verifying_z_transformation
Based on dataset england_payment_fulltime_2014_reduced.xls
Stream structure

Related exercises: 9

Theory
In Chap. 4, we will discuss tasks in multivariate statistics. In such an analysis more
than one variable is used to assess or explain a specific effect. Here—and also in
the following chapter—we are interested in determining the strength with which a variable contributes to the result/effect. For this reason, and also to interpret the variables
themselves more easily, we should rescale the variables to a specific range.
In statistics we can distinguish between normalization and standardization. To normalize a variable, all values will be transformed with

x_norm = (x_i - x_min) / (x_max - x_min)

to an interval of [0, 1].
In statistics, however, we more often use the so-called standardization or z-transformation to equalize the range of each variable. After the transformation, all of them spread around zero. First we have to determine the mean x̄ and the standard deviation s of each variable. Then we use the formula

z_i = (x_i - x̄) / s

to standardize each value x_i. The result is z_i. These values z_i have interesting characteristics. First, we can compare them in terms of standard deviations. Second, the mean of the z_i is always zero and their standard deviation is always one.
In addition, we can identify outliers easily: standardizing the values means that
we can interpret the z-values in terms of multiples of the standard deviations they
are away from the mean. The sign tells us the direction of the deviation from the
mean to the left or to the right.

" Standardized values (z values) are calculated for variables with differ-
ent dimensions/units and make it possible to compare them.
Standardized values represent the original values in terms of the
distance from the mean in standard deviations.

" In a multivariate analysis, z-standardized values should be used. It


helps to determine the importance of the variables for the results.

" A standardized value of, e.g., 2.18 means that it is 2.18 standard
deviations away from the mean to the left. The standardized values
itself have always have a mean of zero and a standard deviation
of one.
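Although the Modeler performs these calculations via nodes, it can be helpful to verify the two formulas outside the software. The following short Python sketch (using pandas, which is not part of the SPSS Modeler) normalizes and standardizes a small, invented column; the column name "pretest" and the values are purely hypothetical and only mimic the test-score data used below.

import pandas as pd

# Hypothetical test scores; any numeric column works the same way
df = pd.DataFrame({"pretest": [41, 50, 55, 62, 84]})
x = df["pretest"]

# Normalization to [0, 1]: (x - min) / (max - min)
df["pretest_norm"] = (x - x.min()) / (x.max() - x.min())

# Standardization (z-transformation): (x - mean) / standard deviation
df["pretest_z"] = (x - x.mean()) / x.std()

print(df)
print("mean of z:", round(df["pretest_z"].mean(), 10))  # approximately 0
print("std of z: ", round(df["pretest_z"].std(), 10))   # 1

Note that pandas' std() uses the sample standard deviation (divisor n - 1); tools that divide by n produce slightly different z-values.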

Standardizing Values
In this chapter, we want to explain the procedure for “manually” calculating the z
values. We also want to explain how the Modeler can be used to do the calculation
automatically. This will lead us to some functionalities of the Auto Data Prep(aration) node.
We use the dataset “test_scores”, which represents test results in a specific
school. See also Sect. 10.1.31. These results should be standardized and the
calculated values should be interpreted. We use the template stream “Template-
Stream test_scores” to build the stream. Figure 2.68 shows the initial status of the
stream.

1. We open the stream "Template-Stream test_scores" and save it using another name.
2. To inspect the original data, we add a Data Audit node and connect it with the
Statistics node (see Fig. 2.69).
3. We can now inspect the data by double-clicking on the Data Audit node.
After that, we use the run button to see the details in Fig. 2.70. In the
measurement column, we can see that the variable “pretest” is continuous.
Furthermore, we can find the mean of 54.956 and the standard deviation of 13.563. We will use these measures later to standardize the values. For now, we
can close the dialog window in Fig. 2.70 with “OK”.

Fig. 2.68 Template stream “test_scores”

Fig. 2.69 Template stream with added Data Audit node

Fig. 2.70 Details of the results in the Data Audit node



4. We first want to standardize the value of the variable “pretest” manually, so that
we can understand how the standardization procedure works.
As shown in Sect. 2.7.2, we should therefore add a Derive node and connect it with the Type node. Figure 2.71 shows the name of the new variable in the Derive node, that is "pretest_manual_standardized". As explained in the first paragraph
of this section, we usually standardize values by subtracting the mean and then
dividing the difference by the standard deviation. The formula is “(pretest-
54.956)/13.563”.
We should keep in mind that this procedure is for demonstration purposes only! It is never appropriate to use fixed values in a Derive node—the values in the dataset can be different each time, and fixed values would then lead to wrong results! Unfortunately, in this case we cannot substitute the mean 54.956. As shown in Sect. 2.7.9, the predefined function "mean_n" calculates the average of values in a row using a list of variables. Here, however, we would need the mean of a column, i.e., of a variable.
Figure 2.72 shows the actual status of the stream.
5. To show the results we add a Table node behind the Derive node.

Fig. 2.71 Parameters of the Derive node to standardize the pre-test values

Figure 2.73 shows some results. We find that the pretest result of 84 points equals a standardized value of (84 - 54.956)/13.563 = +2.141. That means that 84 points is 2.141 standard deviations away from the mean, to the right. It is outside the 2s-interval (see also Sect. 3.2.7) and is therefore a very good test result!

Fig. 2.72 Derive node is added

Fig. 2.73 Standardized Pre-test results



" We strongly suggest not using fixed values in a calculation or a Derive


node! Instead of this, certain measures should be calculated using the
build-in functions of the Modeler.

6. Finally, to check the results we add a Data Audit node to that sub-stream (see
Fig. 2.74).
7. As explained, the standardized values have a mean of zero and a standard
deviation of one. Apart from a small deviation of the mean, which is not exactly zero, Fig. 2.75 shows the correct results.
8. To use a specific procedure to calculate the z-values, we have to make sure that
the variable “pretest” is defined as continuous and as input variable. To check

Fig. 2.74 Stream with Table and Data Audit node

Fig. 2.75 Data Audit results for the standardized values



this, we double-click on the Type node. Figure 2.76 shows that in the template
stream the correct options are being used.
9. As explained above, it is error-prone to use fixed values (e.g., the mean and the
standard deviation of the pretest results) in the Derive node. So there should be
a possibility to standardize values automatically in the Modeler. Here we want
to discuss one of the functionalities of the Auto Data Prep(aration) node for this
purpose.
Let’s select such an Auto Data Preparation node from the Field Ops tab of
the Modeler toolbar. We add it to the stream and connect it with the Type node
(see Fig. 2.77).

Fig. 2.76 Type node settings

Fig. 2.77 Auto Data Preparation node is added to the stream



Fig. 2.78 Auto Data Preparation node parameters to standardize continuous input variables

10. If we double-click on the Auto Data Preparation node, we can find an overwhelming number of different options (see Fig. 2.78). Here we want to focus on
the preparation of the input variables. These variables have to be continuous.
We checked both assumptions by inspecting the Type node parameters in step 8.
11. Now we activate the tab “Settings” in the dialog window of the Auto Data
Preparation node (see Fig. 2.78). Additionally, we choose the category "Prepare Inputs and Target" on the left-hand side.
At the bottom of the dialog window, we can activate the option “Put all
continuous fields on a common scale (highly recommended if feature construction will be performed)".
12. We can now close the dialog window of the Auto Data Preparation node.
Finally, we should add a Table node as well as a Data Audit node to this part
of the stream. This is to show the results of the standardization process.
Figure 2.79 shows the final stream.
13. Figure 2.80 shows the results of the standardization procedure using the Auto
Data Preparation node. We can see that the variable name is
“pretest_transformed”. We can define the name extension by using the Field
Names settings on the left-hand side in Fig. 2.78.

Fig. 2.79 Final stream “verifying_z_transformation”

Fig. 2.80 Table node with the results of the Auto Data Preparation procedure

The standardized values are the same as those calculated with the Derive node before. We can find the value 2.141 in Fig. 2.80. It is the same as that shown in Fig. 2.73.
Scrolling through the table, we can see that the variable "pretest" is no longer presented here. It has been replaced by the standardized variable "pretest_transformed".

" The Auto Data Preparation node offers a lot of options. This node can
also be used to standardize input variables. Therefore, the variables
should be defined as input variables, using a Type node in the stream.

" Furthermore, only continuous values can be standardized here. We


should make sure to check the status of a variable as “continuous
input variable”, before we use an Auto Data Preparation node.

" In the results, transformed variables replace the original variables.

2.7.7 Partitioning Datasets

Description of the model


Stream name partitioning_data_sets
Based on dataset test_scores.sav
Stream structure

Related exercises: 9

Theoretical Background
In data mining, the concept of cross-validation is often used to test the model for its
applicability to unknown and independent data and to determine the optimal
parameters for which the model best fits the data. For this purpose, the dataset has to be split into several parts: a training dataset, a test dataset, and a validation dataset. The training dataset will be used to find the model parameters, and the smaller validation and test datasets will be used for finding the optimal parameters and for testing the accuracy of these calculated parameters. This is depicted in Fig. 2.81. The process of cross-validation is described in more detail in Sect. 5.1.2.
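As a rough illustration of such a three-way split outside the Modeler, the following hedged Python sketch cuts an invented dataset of 2133 records into 60 % training, 20 % validation, and 20 % test records; the proportions and the column name "score" are assumptions made for this example only, not settings taken from the stream.

import numpy as np
import pandas as pd

# Invented dataset with 2133 records (the sample size used in this section)
rng = np.random.default_rng(1)
df = pd.DataFrame({"score": rng.normal(55, 13, 2133)})

# Shuffle once with a fixed seed, then cut into training/validation/test parts
shuffled = df.sample(frac=1.0, random_state=1)
n = len(shuffled)
train = shuffled.iloc[: int(0.6 * n)]
validation = shuffled.iloc[int(0.6 * n): int(0.8 * n)]
test = shuffled.iloc[int(0.8 * n):]

print(len(train), len(validation), len(test))  # 1279 427 427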

Partitioning a Dataset into Two Subsets


We would like to start by separating two subsets based on dataset “test_scores.sav”.
For this we use the template stream “Template-Stream test_scores” (see Fig. 2.82).

1. We open the stream "Template-Stream test_scores" and save it using another name.
2. To inspect the original data, we add a Data Audit node and connect it to the
Statistics node (see Fig. 2.83).

Fig. 2.81 Training, validation and test of models

Fig. 2.82 Template stream “test_scores”



Fig. 2.83 Template stream with added Data Audit node

Fig. 2.84 Details of the results in the Data Audit node

3. Let’s have a first look at the data by double-clicking the Data Audit node. After
that we use the Run button to see the details in Fig. 2.84. In the column for the
valid number of values per variable, we find the sample size of 2133 records.
We will compare this value with the sample size of the subsets after the
procedure to divide the original dataset. For now we can close the dialog window
in Fig. 2.84 with “OK”.
4. After the Type node, we must add a “Partition” node. To do so, we first activate
the Type node by clicking on it once. Then double-click on a Partition node in
the Modeler tab “Field Ops”. The new Partition node is now automatically
connected with the Type node (Fig. 2.85).
5. Now we should adjust the parameters of the Partition node. We double-click on
the node and modify the settings as shown in Fig. 2.86. We can define the name

Fig. 2.85 Extended template stream with added Partition node

Fig. 2.86 Parameters of the Partition node

of the new field that consists of the “labels” for the new partition. Here we can
use the name “Partition”.
In addition, we would like to create two subsets, for training and testing. We use the option with the name "Train and test".

After that we should determine the percentage of values in each partition. Normally 80/20 is a good choice.
With the other options, we can determine the label and values of the subsets or
partitions. Unfortunately, the Modeler only allows defining strings as values.
Here, we can choose “1_Training” and “2_Testing”. See Fig. 2.86.
6. In the background, the process will assign the records to the specific partitions randomly. That means that in each trial the records are assigned differently, and the measures calculated in each partition also differ slightly. To avoid this, we could fix the initial value of the random generator by activating the option "Repeatable partition assignment", but we do not use this option here.
7. There are no other modifications necessary. So we can close the dialog with
“OK”.

" The Partition node should be used to define:

(i) A training and a test subset or


(ii) A training, test, and validation subset.

" All of them must be representative. The Modeler defines which record
belongs to which subset and assigns a name of the subset.

" Normally, the partitioning procedure will assign the records randomly
to the subset. If we would like to have the same result in each trial, we
should use the option “Repeatable partition assignment”.

8. To understand what happens if we use the partitioning method, we add a Table node as well as a Distribution node and connect them with the Partition node. In the Distribution node, we use the new variable "Partition" to get the frequency of records per subset (see Fig. 2.88).
Figure 2.87 shows the final stream.
9. We can inspect the assignment of the records to the partitions by scrolling through the records themselves. To do this we double-click on the Table node to the
right of the stream. With the Run button we get the result also shown in
Fig. 2.89. In the last column, we can find the name of the partition to which
each record belongs. As outlined above, here we can only define strings as
values. The result is different each time because there is a random selection of
the records in the background. We can now close the window.
10. The Table node helps us to understand the usage of the Partition node.
Nevertheless, we should still analyze the frequency distribution of the variable
“Partition”. We run the Distribution node and we get a result similar to
Fig. 2.90. The frequencies here are also random, and they are just an approximation of the defined ratio 80/20 set with the parameters in the Partition
node in Fig. 2.86. We can see that each record is being assigned to one subset,
however, because the sum of the frequencies 1721 + 412 equals the sample size
2133, as determined by the Data Audit node in Fig. 2.84.

Fig. 2.87 Final stream “partitioning_data_sets”

Fig. 2.88 Distribution node settings



Fig. 2.89 Table with the new variable “Partition” (RANDOM VALUE!)

Fig. 2.90 Frequency distribution for the new variable “Partition” (RANDOM VALUES)

In Sect. 5.3.7, the procedure to divide a set into three subsets (training, validation, and test) will be discussed.

2.7.8 Sampling Methods

Theory
In the previous chapter, we explained methods to divide the data into several
subsets or partitions. Every record belongs to one of the subsets and no record
(except outliers) is being removed from the sample.

Here, too, an understanding of the term "representative sample" is particularly important. The sample and the objects in the sample must have the same characteristics as the objects in the population. So the frequency distribution of the variables of interest is the same in the population as in the sample.
In data mining, however, we have to deal with complex procedures, e.g., regression
models or cluster analysis. Normally, the researcher is interested in using all the
information in the data and therefore all the records that are available. Unfortu-
nately, the more complex methods need powerful computer performance to calcu-
late the results in an appropriate time. In addition, the statistical program has to
handle all the data and has also to reorder, sort, or transform the values. Sometimes,
the complexity of these routines exceeds the capacity of the hardware or the time
available. So we have to reduce the number of records using a well thought out and
intuitive selection procedure. Despite the many variants, we can focus on three general techniques. Table 2.5 gives an overview.
The SPSS Modeler offers a wide range of sampling techniques that help with all
these methods. In the following examples, we want to show how to use the Sample node to realize the sampling methods as outlined.

" The key point of sampling is to allow the usage of complex methods,
by reducing the number of records in a dataset. The subsample
must be unbiased and representative, to come to conclusions
that correctly describe the objects and their relationship to the
population.

Table 2.5 Sampling techniques


Random sampling: The records in the dataset all have the same, predefined chance of being selected.
Stratified sampling: An intelligent subtype of the random sampling procedures that reduces sampling error. Strata are parts of the population that have at least one characteristic in common—e.g., the gender of respondents. After the most important strata are identified, their frequencies are determined. Random sampling is then used to create a subsample in which the values are represented with "approximately" the same proportions. The subset is in this particular sense a representation of the original dataset. This applies only to the selected characteristics that should be reproduced.
Systematic sampling: Every n-th record is selected. Assuming that the values in the original list have no hidden order, there is no unwanted pattern in the result.

Simple Sampling Methods in Action


Description of the model
Stream name sampling_data
Based on dataset test_scores.sav
Stream structure

Related exercises: 

Here we want to use the dataset “test_scores.sav”. The values represent the test
scores of students in several schools with different teaching methods. For more
details see Sect. 10.1.31.

1. We start with the template stream "Template-Stream test_scores" shown in Fig. 2.91.
2. We open the stream “Template-Stream test_scores” and save it using
another name.
3. To inspect the original data, we run the Table node. Then we can activate the
labels by using the button in the middle of the toolbar also shown in Fig. 2.92.
We find some important variables that can be used for stratification. Further-
more, we find the sample size of 2133 records.
4. Later we want to check if the selected records are representative for the original
dataset. There are a lot of methods that can be used to do this. We want to have
a look at the variables “pre test” and “post test”, especially at the mean and the
standard deviation.
We add a Data Audit node to the stream and connect it with the Source node.
Figure 2.93 shows the extended stream and in Fig. 2.94 we can find the mean
and the standard deviation of “pre test” and “post test”.

Fig. 2.91 Template stream “test_scores”

Fig. 2.92 Records of “test_scores” dataset

Fig. 2.93 Added Data Audit node to the stream



Fig. 2.94 Mean and standard deviation of “pre-test” and “post-test”

Fig. 2.95 Extended stream to sample data

5. To be able to sample the dataset, we add a Sample node from the Record Ops tab and connect it with the Type node.
In addition, we extend the stream by adding a Table node and a Data Audit
node behind the Sample node. This gives us the chance to show the result of the
sampling procedure. Figure 2.95 shows the actual status of the stream.
So far we have not selected any sampling procedure in the Sample node, but
what we can see is that the Sample node reduces the number of variables by
one. The reason is that in the Type node the role of the “student_id” is defined
as “None” for the modeling process (see Fig. 2.96). It is a unique identifier and
so we do not need it in the subsamples for modeling purposes.
6. Now we can analyze and use the options for sampling provided by the Sample
node. Therefore, double-click on the Sample node. Figure 2.97 shows the
dialog window.

Fig. 2.96 Role definition for “student_id” in the Type node

7. The most important option in the dialog window in Fig. 2.97 is the “Sample
method”. Here we have activated the option “Simple”, so that the other
parameters will indeed be simple to interpret.
The mode option determines whether the records selected by the other parameters are included in or excluded from the sample. Here we should normally use "Include sample".
The option "Sample" has three parameters to determine what happens in the selection process. "First" just cuts off the sample after the number of records specified. This option is only useful if we can be sure there is definitely no pattern in the dataset; only then are the first n records representative of the whole dataset.
The option "1-in-n" selects every n-th record, and the option "Random %" randomly selects n % of the records.
8. Here we want to reduce the dataset by 50 %. So we have two choices: either we
select every second record or we choose randomly 50 % of the records. We start
with the “1-in-n” option as shown in Fig. 2.98. We can also restrict the number
of records by using the “Maximum sample size”, but this is not necessary here.
We confirm the settings with “OK”.
9. The Table node at the end of the stream tells us that the number of records
selected is 1066 (see Fig. 2.99).

Fig. 2.97 Parameters of a Sample node

10. The corresponding Data Audit node in Fig. 2.100 shows us that the mean and
the standard deviation of “pre test” and “post test” differ slightly in comparison
with the original values in Fig. 2.94.
11. To check the usage of the option “Random %” in the Sample node we add
another Sample node as well as a new Data Audit node to the stream (see
Fig. 2.101).
12. If we run the Data Audit node at the bottom in Fig. 2.101, we can see that the number of records differs each time. Note that roughly half of the original 2133 records, i.e., 2133/2 = 1066.5 records, should be selected. Sometimes, however, the actual sample size deviates noticeably, e.g., 1034 records.

" The Sample node can be used to reduce the number of records in a
sample. To create a representative sub-sample, the simple sampling
methods “1-in-n” or “Random %” can be used. The number of records
selected can be restricted. Variables whose roles are defined as
“None” in a Source or Type node are excluded from the sampling
process. When using the option "Random %", the actual sample size can differ from the defined percentage.
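The difference between "1-in-n" and "Random %" can also be illustrated with a short Python sketch on an invented stand-in for the 2133 records: the systematic sample always has the same size, while the random sample varies from run to run, which explains the deviating sample sizes observed above. The code is only a sketch and does not reproduce the Sample node itself.

import numpy as np
import pandas as pd

# Invented stand-in for the 2133 records of "test_scores.sav"
df = pd.DataFrame({"student_id": range(1, 2134)})

# "1-in-n" (systematic sampling): every 2nd record -> always 1067 records here
one_in_two = df.iloc[::2]

# "Random %": keep each record independently with probability 0.5,
# so the resulting sample size varies from run to run
rng = np.random.default_rng()
random_50 = df[rng.random(len(df)) < 0.5]

print(len(one_in_two), len(random_50))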

Fig. 2.98 Parameter “1-in-n” of a Sample node

Fig. 2.99 Number of records selected shown in the Table node



Fig. 2.100 Details of the sampled records in the Data Audit node

Complex Sampling Methods


Description of the model
Stream name sampling_data_complex_methods
Based on dataset sales_list.sav
Stream structure

Related exercises: 

Random sampling can avoid unintentional patterns appearing in the sample, but
very often random sampling also destroys patterns that are evidently useful and
necessary in data mining. Recalling the definition of the term “representative
sample”, which we discussed at the beginning of this section, we have to make
sure that “The frequency distribution of the variables of interest is the same in the
population and in the sample”.
Obviously, we cannot be sure that this is the case if we select each object
randomly. Consider the random sampling of houses on sale from an internet

Fig. 2.101 Final stream

database; the regions in the sample are not necessarily distributed as they are in the whole database.
The idea is to add constraints to the sampling process, to ensure the representa-
tiveness of the sample. The concept of stratified sampling is based on the following
pre-conditions:

– Each object/record of the population is assigned to exactly one stratum.
– All strata together represent the population and no element is missing.

Stratification helps to reduce sampling error. In the following example, we want to show more complex sampling methods using shopping data from a tiny shop. This
gives us the chance to easily understand the necessity for the different procedures.
The dataset was created by the authors based on an idea by IBM (2014, p. 57). For
more details see Sect. 10.1.29.

1. Let’s start with the template stream “009_Template-Stream_shopping_data”.


We open it in the SPSS Modeler (see Fig. 2.102).
2. The predefined Table node gives us the chance to inspect the given dataset. We
double-click and run it. We get the result shown in Fig. 2.103. As we can see,
these are the transactions of customers in a tiny shop.
If we used random sampling here—although it is clearly not necessary to reduce the number of records—we could create a smaller sample, but, e.g., the distribution of gender would not necessarily be the same or representative.
3. To use stratified sampling, we have to add a Sample node behind the Type node
and connect both nodes (see Fig. 2.104).

Fig. 2.102 Template Stream shopping data

Fig. 2.103 Data of “sales_list.sav”

4. By double-clicking on it, we get the dialog window as shown in Fig. 2.105. We must then activate the option "Complex" using the radio button in the option "Sample method".
Please keep in mind the defined sample size of 0.5 = 50 % in the middle of the window!

Fig. 2.104 Sample node is added

We open another dialog window by clicking on the button “Cluster and


Stratify . . .” (see Fig. 2.106). We can now add “gender” to the “Stratify by”
list. Therefore, we use the button on the right-hand side marked in Fig. 2.106
with an arrow. We can now close the dialog window with “OK”.
5. If we want to add an appropriate label/description to the Sample node, we can
click on the “annotations” tab in Fig. 2.105. We can add a name with the option
“Custom”. (see Fig. 2.107). We used “Stratified by Gender”. Now also close this
window with “OK”.
6. To scroll through the records in the sampled subset, we must add another
Table node at the end of the stream and connect it with the Sample node (see
Fig. 2.108).
7. Running this Table node gives us the result as shown in Fig. 2.109.
8. Remembering the sample size of 50 % as defined in the Sample node in
Fig. 2.105, we can accept the new size of six records. In the original dataset,
we had records (not transactions!) of eight female and four male customers. In
the new one we can find records of four female and two male customers. So the
proportions of the gender—based on the number of records—are the same. This
is exactly what we want to ensure by using stratified sampling.
9. So far the stratified sampling seems to be clear, but we also want to explain the
option of defining individual proportions for the strata in the sample.
To duplicate the existing sub-stream, we can simply copy the Sample and the
Table node. For this we have to first mark both nodes by clicking on them once
while pressing the Shift key. Alternatively, we can mark them with the mouse by
drawing a virtual rectangle around them.

Fig. 2.105 Parameters of a Sample node

Now we can simply copy and paste them. We get a stream as shown in
Fig. 2.110.
10. Now we have to connect the Type node and the new sub-stream, so we click on
the Type node once and press the F2 key. At the end we click on the Target
node—which in this case is the Sample node. Figure 2.111 shows the result.

" Nodes with connections between them can be copied and pasted. To
do this the node or a sub-stream must be marked with the mouse,
then simply copy and paste. Finally, the new components have to be
connected to the rest of the stream.

11. We now want to modify the parameters of the new Sample node. We double-
click on it. In the “Cluster and Stratify . . .” option, we defined the gender as the
variable for the relevant strata (see Fig. 2.105). If we think that records with

Fig. 2.106 Cluster and Stratify options of a Sample node

Fig. 2.107 Defining a name for the Sample node



Fig. 2.108 Table node is added to the stream

Fig. 2.109 Result of stratified sampling

women are under-represented, then we must modify the strata proportions. In the field "Sample Size" we activate the option "Custom".
12. Now let’s click on the “Specify sizes . . .” button. In the new dialog window, we
can read the values of the variable “gender” by clicking on “Read Values”. This
is depicted in Fig. 2.113.
13. To define an individual proportion of women in the new sample, we can modify
the sample size as shown in Figs. 2.114 and 2.115. We can then close this
dialog window with “OK”.

Fig. 2.110 Copied and pasted sub-stream

Fig. 2.111 Extended stream with two sub-streams

14. Finally, we can modify the description of the node as shown in Fig. 2.116. We
can then close this dialog window too.
15. Running the Table node in this sub-stream we get the result as shown in
Fig. 2.117. We can see that the number of female customers increased in
comparison with the first dataset shown in Fig. 2.109.
Now we want to use another option to ensure we get a representative sample.
In the procedure used above, we focused on gender. In the end we found a
representative sample regarding the variable “gender”. If we want to analyze
products that are often sold together, however, we can consider reducing the
sample size, especially in the case of an analysis of huge datasets. Here a

Fig. 2.112 Custom Strata options enabled in the Sample node

stratified sample related to gender is useless. We have to make sure that all the
products sold together are also in the result.
In this scenario, it is important to understand the characteristics of a flat table
database scheme: as shown once more in Fig. 2.118, the first customer bought
three products, but the purchase is represented by three records in the table. So
it is not appropriate to sample the records randomly based on the customer_ID.
If one record is selected then all the records of the same purchase must also be
assigned to the new subset or partition.
Here we can use the customer_ID as a unique identifier. Sometimes it can be
necessary to define another primary key first for this operation.
This is a typical example where the clustering option of the Sample node can
be used. We want to add a new sub-stream by using the same original dataset.
Figure 2.119 shows the actual status of our stream.

Fig. 2.113 Definition of individual strata proportions—step 1

Fig. 2.114 Definition of individual strata proportions—step 2

16. Now we add another Sample node and connect it with the Type node (see
Fig. 2.122). In the parameters of the Sample node, we activate complex
sampling (see Fig. 2.112) and click on “Cluster and Stratify . . .”. Here, we
select the variable “customer_ID” in the drop-down list of the “Clusters” option
(see Fig. 2.120).
By using this option, we make sure that if a record of purchase X is selected,
all other records that belong to purchase X will be added to the sample.
We can close this dialog window with “OK”.

Fig. 2.115 Definition of individual strata proportions—step 3

Fig. 2.116 Definition of a new node name

17. Within the parameters of the Sample node, we can define a name for the node as
shown in Fig. 2.121. After that we can close the dialog box for the Sample node
parameters.
18. Now we can add a Table node at the end of the new sub-stream. Figure 2.122
shows the final stream.

Fig. 2.117 New dataset with individual strata proportions

Fig. 2.118 Data of “sales_list.sav”



Fig. 2.119 Actual status of the stream

Fig. 2.120 Definition of a cluster variable in the sampling

19. Double-clicking the Table node marked with an arrow in Fig. 2.122, we
probably get the result as shown in Fig. 2.123. Each trial gives a new dataset
because of the random sampling, but the complete purchase of a specific
customer is always included in the new sample, as we can see by comparing

Fig. 2.121 Defining a name for the Sample node

the original dataset in Fig. 2.118 and the result in Fig. 2.123. The specific
structure of the purchases can be analysed by using the new dataset.

" The Sample node allows stratified sampling that produces represen-
tative samples. In general, variables can be used to define strata.
Additionally, cluster variables can be defined to make sure that all
the objects belonging to a cluster will be assigned to the new dataset,
in case only one of them is randomly chosen. Furthermore, the
sample sizes of the strata proportions can be individually defined.
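For readers who want to see the two ideas side by side outside the Modeler, the following Python sketch applies stratified sampling (by gender) and cluster sampling (by customer_ID) to a small invented shopping table; the column names follow the text, but the rows are made up and are not taken from "sales_list.sav".

import pandas as pd

# Invented mini-version of a flat sales table (one row per item sold)
sales = pd.DataFrame({
    "customer_ID": [1, 1, 1, 2, 3, 3, 4, 5, 5, 6, 7, 8],
    "gender":      ["f", "f", "f", "m", "f", "f", "m", "f", "f", "m", "f", "m"],
})

# Stratified sampling: draw 50 % within each gender, so the gender
# proportions of the original table are preserved in the subsample
stratified = sales.groupby("gender", group_keys=False).sample(frac=0.5, random_state=1)

# Cluster sampling: randomly pick whole purchases (customer_IDs); every row
# belonging to a chosen customer_ID ends up in the sample
chosen = pd.Series(sales["customer_ID"].unique()).sample(frac=0.5, random_state=1)
clustered = sales[sales["customer_ID"].isin(chosen)]

print(stratified, clustered, sep="\n\n")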

Fig. 2.122 Final stream

Fig. 2.123 New dataset produced by clustering using “customer_ID”



2.7.9 Merge Datasets

Description of the model


Stream name merge_England_payment_data
Based on dataset england_payment_fulltime_2013.csv
england_payment_fulltime_2014_reduced.xls
Stream structure

Related exercises: 12, 14

In practice, we often have to deal with datasets that come from different sources
or that are divided into different partitions, e.g., by years. As we would like to
analyze all the information, we have to consolidate the different sources first. To do
so, we need a primary key in each source, which we can use for that consolidation
process.

" A primary key is a unique identifier for each record. It can be a


combination of more than one variable, e.g., of the name, surname,
and birthday of a respondent, but we recommend whenever possible
to avoid such “difficult” primary keys. It is hard to deal with them and
most likely error-prone. Instead of combining variables, we should
always try to find a variable with unique values. Statistical databases
usually offer such primary keys.

Considering the case of two datasets, with a primary key in each subset, we can
imagine different scenarios:

1. Merging the datasets to combine the relevant rows and to get a table with more
columns, or
2. Adding rows to a subset by appending the other one.

In this section, we would like to show how to merge datasets. Figure 2.124
depicts the procedure. Two datasets should be combined by using one variable as a
primary key. The SPSS Modeler uses this key to determine the rows that should be
“combined”.
Figure 2.124 shows an operation called inner join. Dataset 1 will be extended by
dataset 2, using a primary key that both have in common, but there are also two keys (3831 and 6887) that appear in only one of the datasets. The join types differ in how they handle these cases.
Table 2.6 shows the join types that can be found in the Modeler.
In the following scenario, we want to merge two datasets coming from the UK.
For more details see Sect. 10.1.12. Figure 2.125 shows the given information for
2013, and Fig. 2.126 shows some records for 2014. In both sets, a primary key

Fig. 2.124 Process to merge datasets (inner join)



Table 2.6 Join types


Inner join: Only the rows that both sources have in common are matched.
Full outer join: All rows from both datasets are in the joined table, but a lot of values are not available ($null$) (see also Fig. 2.124).
Partial outer join left: Records from the first-named dataset are in the joined table. From the second dataset, only those records with a key that matches a key in the first dataset will be copied.
Partial outer join right: Records from the second-named dataset are in the joined table. From the first dataset, only those records with a key that matches a key in the second dataset will be copied.
Anti-join: Only the records with a key that is not in the second dataset are in the joined table.
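The join types in Table 2.6 correspond to the standard database joins. As a rough illustration outside the Modeler, the following Python/pandas sketch merges two tiny invented tables on a key column named "area_code"; the values are not taken from the England payment files used below.

import pandas as pd

# Two invented tables sharing the key "area_code"
d2013 = pd.DataFrame({"area_code": ["A", "B", "C"], "pay_2013": [475.4, 500.0, 480.0]})
d2014 = pd.DataFrame({"area_code": ["A", "B", "D"], "pay_2014": [462.1, 510.0, 495.0]})

inner = d2013.merge(d2014, on="area_code", how="inner")   # keys A, B
outer = d2013.merge(d2014, on="area_code", how="outer")   # A, B, C, D (with NaN)
left  = d2013.merge(d2014, on="area_code", how="left")    # A, B, C
right = d2013.merge(d2014, on="area_code", how="right")   # A, B, D

# Anti-join: keep only the 2013 rows whose key does not appear in 2014
anti = d2013[~d2013["area_code"].isin(d2014["area_code"])]  # C

print(inner, outer, left, right, anti, sep="\n\n")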

Fig. 2.125 England employee data 2013

Fig. 2.126 England employee data 2014

“area_code” can be identified that is provided by the official database. The primary
key is unique because no two areas share the same official code. The relatively complicated variable "admin_description" (administrative description) stands for a combination of the type of the region with its name.
We will separate both parts in the Exercise 5 “Extract area names” at the end of this
section. Here we want to deal with the original values.
Looking at Figs. 2.125 and 2.126, it is obvious that besides the weekly
payment, the confidence value (CV) for these variables also exists in both subsets,

Fig. 2.127 Template-Stream England payment 2013 and 2014 reduced

but there is no variable for the year. We will solve that issue by renaming the
variables in each subset to make clear which variable represents the values for
which year.
The aim of the stream is to extend the values for 2013 with the additional
information from 2014. In the end, we want to create a table with the area code,
the administrative description of the area, the weekly gross payment in 2013 and
2014 as well as the confidence values for 2013 and 2014.

1. We open the template stream "Template-Stream England payment 2013 and 2014 reduced". As shown in Fig. 2.127, there is a Source node in the form of a Variable file node to load the data from 2013. Below, the data for 2014 are imported by using an Excel file node. We can see here that different types of sources can be used to merge the records afterwards (see Fig. 2.127).
2. If we double-click on the Table nodes on top and at the bottom, we get the values
as shown in Figs. 2.125 and 2.126. We can scroll through the records and then
close both windows.
3. To be able to exclude several variables in each subset, we add a
Filter node behind each source node. We can find this node type in the
Field Ops tab of the SPSS Modeler. Figure 2.128 shows the actual status of
the stream.

Fig. 2.128 Filter nodes are added behind each Source node

4. In the end, we want to create a table with the area code, the administrative
description of the area, the weekly gross payment in 2013 and 2014 as well as the
confidence values for 2013 and 2014. To do so we have to exclude all the other
variables, but additionally we must rename the variables so they get a correct and
unique name for 2013 and 2014.
Figure 2.129 shows the Parameters of the Filter node behind the Source node
for 2013. In rows three and four, we changed the name of the variables
“weekly_payment_gross” and “weekly_payment_gross_CV” by adding the
year. Additionally, we excluded all the other variables. To do so we click once
on the arrow in the column in the middle.
5. In the Filter node for the data of 2014 we must exclude the “admin_description”
and the “area_name”. Figure 2.130 shows the parameters of the second Filter
node. The variable names should also be modified here.
6. Now the subsets should be ready for the merge procedure. First we add from the
“Record Ops” tab a Merge node to the stream. Then we connect both Filter nodes
with this new Merge node (see Fig. 2.131).

Fig. 2.129 Filter node for data in 2013 to rename and exclude variables

7. We double-click on the Merge node. In the dialog window we can find some
interesting options. We suggest taking control of the merge process by using the
method “Keys”. As shown in Fig. 2.132, we use the variable “area_code” as a
primary key. With the four options at the bottom of the dialog window we can
determine the type of the join-operation that will be used. Table 2.6 shows the
different join-types and a description.
8. To check the result we add a Table node behind the Merge node. Figure 2.133
shows the actual stream that will be extended later. In Fig. 2.134 we can find
some of the records. The new variable names and the extended records are
shown for the first four UK areas.
In general, we can also rename the variables and exclude several of them in the "Filter" dialog of the Merge node. We do not recommend this, as we wish to keep the stream's functionality transparent (Fig. 2.135).

Fig. 2.130 Filter node for data in 2014 to exclude and rename variables

" A Merge node should be used to combine different data sources. The
Modeler offers five different join types: inner join, full outer join,
partial outer join left/right, and anti-join. There is no option to select
the leading subset for the anti-join operation.

" To avoid misleading interpretations, the number of records in the


source files and the number of records in the merged dataset must
be verified.

" The names of the variables in both datasets must be unique. Therefore
Filter nodes should be implemented before the Merge node is applied.
Renaming the variables in the Merge node is also possible, but to
ensure a more transparent stream, we do not recommend this option.

" In the case of a full outer join, we strongly recommend checking the
result. If there are records in both datasets that have the same
primary key but different values in another variable the result will
be an inconsistent dataset. For example, one employee has the ID
“1000” but different “street_names” in its address.

" Sometimes it is challenging to deal with the scale of measurement of


the variables in a merged dataset. We therefore suggest incorporating
a Type node right after the Merge node, just to check and modify the
scales of measurement.

Fig. 2.131 New Merge node in the stream

" If two datasets are to be combined row by row then the Append node
should be used (see Sect. 2.7.10).

Stream Extension to Calculate Average Income


To show the necessity of merging two different sources, we want to address the calculation of the average gross income. The manual calculation is easy: for "Hartlepool" in the first row of Fig. 2.134 we get (475.4 + 462.1)/2 = £468.75 per week.
Next we will explain how to calculate the average income per week for all
regions. In Sect. 2.7.2 we outlined the general usage of a Derive node:

9. From the tab “Field Ops” we add a Derive node to the stream and connect it with
the Merge node (see Fig. 2.136).

Fig. 2.132 Parameters of a Merge node

10. With a double-click on the Derive node we can now define the name of the new
variable with the average income. We use "weekly_payment_gross_2013_2014_MEAN" as shown in Fig. 2.137.
11. Finally, we have to define the correct formula using the Modeler's expression
builder. To start this tool, we click on the calculator symbol on the right-hand
side, as depicted in Fig. 2.137.
12. A new dialog window pops up as shown in Fig. 2.138.
As explained in Sect. 2.7.2, we can use the formula category list to determine
the appropriate function. The category “Numeric” is marked with an arrow in
Fig. 2.138. The correct formula to determine the average weekly income for
2013 and 2014 per UK region is:
mean_n([weekly_payment_gross_2013,weekly_payment_gross_2014])
We can click “OK” and close the Derive node dialog (Fig. 2.139).
13. Finally, we add another Table node to show the results. The expected result of £468.75 per week for Hartlepool is the last value in the first row of Fig. 2.140.
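For comparison, the row-wise average that "mean_n" computes in the Derive node can be sketched in Python/pandas as follows; apart from the Hartlepool figures quoted above, the values are invented and do not come from the merged dataset.

import pandas as pd

merged = pd.DataFrame({
    "area_code": ["A", "B"],
    "weekly_payment_gross_2013": [475.4, 500.0],
    "weekly_payment_gross_2014": [462.1, 510.0],
})

# Row-wise mean over the two payment columns, analogous to mean_n([...])
cols = ["weekly_payment_gross_2013", "weekly_payment_gross_2014"]
merged["weekly_payment_gross_2013_2014_MEAN"] = merged[cols].mean(axis=1)

print(merged)  # first row: (475.4 + 462.1) / 2 = 468.75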

Fig. 2.133 Stream to merge 2013 and 2014 payment data

Fig. 2.134 Merged data with new variable names



Fig. 2.135 Filter and rename dialog of the Merge node

Fig. 2.136 Derive node is added to the stream



Fig. 2.137 Parameters of the Derive node

Fig. 2.138 Using the expression builder to find the correct formula

Fig. 2.139 Final stream

Fig. 2.140 Average income 2013 and 2014 per UK area



2.7.10 Append Datasets

Description of the model


Stream name append_England_payment_data
Based on dataset england_payment_fulltime_2013.csv
england_payment_fulltime_2014_reduced.xls
Stream structure

Related exercises: 13

In the previous section we explained how to combine two datasets column by column. For this we used a primary key. To ensure unique variable names, we had to
rename the variables by extending them by the year. Figure 2.141 shows the
parameters of a Filter node used for that procedure.
If datasets have the same columns but represent different years, then it should
also be possible to append the datasets. To distinguish the subsets in the result, we
should extend them by a new variable that represents the year. In the end each row
will belong to a specific year, as shown in Fig. 2.142.
In this example we want to use the same datasets as in the previous example: one CSV file with the weekly gross payments for different UK regions in 2013 and one Excel spreadsheet that contains the gross payments for 2014. Now we will explain
how we defined a new variable with the year and appended the tables. Figure 2.142
depicts the procedure and the result.

Fig. 2.141 Filter node for data in 2014 to exclude and rename variables

Fig. 2.142 Two appended datasets

1. We open the template stream "Template-Stream England payment 2013 and 2014 reduced". Figure 2.143 shows a Variable file node for the data from 2013. Below, the data for 2014 are imported using an Excel file node.

Fig. 2.143 Template-Stream England payment 2013 and 2014 reduced

Fig. 2.144 England employee data 2013

2. If we double-click on the Table nodes on the top and the bottom we get the
values as shown in Figs. 2.144 and 2.145.
3. Now we need to add a new variable that represents the year in each subset. We
add a Derive node and connect it with the Variable file node. To name the new
unique variable we use “year” and the formula we define as a constant value
2013, as shown in Fig. 2.146.
4. We add a second Derive node and connect it with the Excel file node. We use
the name “year” here also, but for the formula the constant value is 2014
(Fig. 2.147).

Fig. 2.145 England employee data 2014

Fig. 2.146 Formula in the first Derive node

5. To show the results of both operations, we add two Table nodes and connect
them with the Derive nodes. Figure 2.148 shows the actual status of the stream.
6. If we use the Table node at the bottom of Fig. 2.148, we get the records shown
in Fig. 2.149.
7. Here it is not necessary to exclude variables in the dataset for 2014. Neverthe-
less, as shown in Fig. 2.144, we can remove some of them from the dataset for
2013, because they do not match with the other ones from 2014 (Fig. 2.145).

Fig. 2.147 Formula in the second Derive node

Fig. 2.148 Extended template stream to define the “year”



Fig. 2.149 Values for 2014 and the new variable “year”

Fig. 2.150 Filter node is added behind the Variable File node

To enable us to exclude these variables from the subset 2013, we add a Filter
node behind the Variable File node. We can find this node in the Field Ops tab
of the Modeler. Figure 2.150 shows the actual status of the stream.
8. Now we can exclude several variables from the result. Figure 2.151 shows us
the parameters of the Filter node.

Fig. 2.151 Filter node parameters

9. Now we can append the modified datasets. We use an Append node from the
Records Ops tab. Figure 2.152 shows the actual status of the stream.
10. In the Append node, we must state that we want to have all the records from
both datasets in the result. Figure 2.153 shows the parameters of the node.
11. Finally, we want to scroll through the records using a Table node. We add a
Table node at the end of the stream (see Fig. 2.154).

Fig. 2.152 Append node is added

" The Append node can be used to combine two datasets row by row. It
is absolutely vital to ensure that the objects represented in the
datasets are unique. A Merge node with an inner join can help here.
For details see the Exercise 13 “Append vs. Merge Datasets”.

" We suggest using the option “Match fields by Name” of a Append


node to append the datasets (see Fig. 2.153).

" The option “Tag records by including source dataset in field” can be
used to mark the records with the number of the dataset they come
from. To have the chance of differentiating between both sets and
using user-defined values, e.g., years, we suggest using Derive nodes
to add a new variable with a constant value.

" The disadvantage of the Append node is that it is more difficult to


calculate measures such as the mean. To do this we should use a
Merge node as shown in Sect. 2.7.9.
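As a final illustration outside the Modeler, the following Python sketch appends two invented stand-ins for the 2013 and 2014 tables after adding a constant "year" column to each, which corresponds to the role of the two Derive nodes; columns that exist in only one subset are filled with null values, just as described for the Append node. All names and numbers except the Hartlepool figures are assumptions.

import pandas as pd

# Invented stand-ins for the 2013 and 2014 payment tables
d2013 = pd.DataFrame({"area_code": ["A", "B"],
                      "weekly_payment_gross": [475.4, 500.0],
                      "area_name": ["Hartlepool", "Area B"]})
d2014 = pd.DataFrame({"area_code": ["A", "B"],
                      "weekly_payment_gross": [462.1, 510.0]})

# Add a constant "year" column to each subset (the role of the Derive nodes)
d2013["year"] = 2013
d2014["year"] = 2014

# Append row by row; columns missing in one subset are filled with null (NaN)
appended = pd.concat([d2013, d2014], ignore_index=True)

print(appended)  # "area_name" is NaN for the 2014 rows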

Fig. 2.153 Parameters of the Append node

12. Running the Table node we will find the records partially shown in Fig. 2.155.
We can find the variable with the year and the expected sample size 2048. As
explained in Fig. 2.142, the rows of variables that were not present in both
datasets are filled with $null$ values. We can see that in the last column of the
table in Fig. 2.155.

2.7.11 Exercises

Exercise 1: Identify and Count Non-available Values


In the Microsoft Excel dataset “england_payment_fulltime_2014_reduced.xlsx”,
some weekly payments are missing. Please use the template stream “Template-
Stream England payment 2014 reduced” to get access to the dataset. Extend the
stream by using appropriate nodes to count the number of missing values.

Exercise 2: Comfortable Selection of Multiple Fields


In Sect. 2.7.2, we explained how to use the function “sum_n”. The number of
training days of each IT user was calculated by adding the number of training days
they participated in the last year and the number of additional training days they

Fig. 2.154 Final stream to Append datasets for UK payment data

Fig. 2.155 Table with the final records

would like to have. To do so, the function “sum_n” needs a list of variables. In this
case it was simply “([training_days_actual,training_days_to_add])”.
It can be complex to add more variables, however, because we have to select
all the variable names in the expression builder. The predefined function
“@FIELDS_MATCHING()” can help us here.

1. You can find a description of this procedure in the Modeler help files. Please
explain how this function works.
2. Open the stream “simple_calculations”. By adding a new Derive and
Table node, calculate the sum of the training days in the last year and the days
to add by using the function “@FIELDS_MATCHING()”.

Exercise 3: Counting Values in Multiple Fields


In the dataset “IT_user_satisfaction.sav”, we can find many different variables (see
also Sect. 10.1.20). A lot of them have the same coding that represents satisfaction
levels with a specific aspect of the IT system. For this exercise, count the answers
that signal a satisfaction level of “good” or “excellent” with one of the IT system
characteristics.

1. Open the stream "Template-Stream IT_user_satisfaction".
2. Using the function "count_equal", count the number of people that answered
“good” regarding the (1) start-time, (2) system_availability, and (3) performance.
Show the result in a Table node.
3. Using the function “count_greater_than”, determine the number of people that
answered at least “good”, regarding the (1) start-time, (2) system_availability,
and (3) performance. Show the result in a Table node.
4. The variables “start-time”, “system_availability”, . . ., and “slimness” are coded
on the same scale. Now calculate the number of people that answered “good” for
the aspects asked for. Use the function “@FIELDS_BETWEEN” to determine
which variables to inspect.
5. Referring to the question above, we now want to count the number of “good”
answers for all variables except “system_availability”. Use the “Field Reorder”
node to define a new sub-stream and count these values.

Exercise 4: Determining Non-reliable Values


In the file “england_payment_fulltime_2014_reduced.xls”, we can find the median
of the sum of the weekly payments in different UK regions. The detailed
descriptions can be found in Sect. 10.1.12. Table 2.7 summarizes the variable
names and their meaning.

Table 2.7 Variables in dataset "england_payment_fulltime_2014_reduced"

admin_description: Represents the type of the region and the name, separated by ":". For the type of the region, see Table 2.3.
area_code: A unique identifier for the area/region.
weekly_payment_gross: Median of the sum of weekly payments.
weekly_payment_gross_CV: Coefficient of variation of the value above.

The source of the data is the UK Office for National Statistics and its website
NOMIS UK (2014). The data represent the Annual Survey of Hours and Earnings
(ASHE).
We should note that the coefficient of variation for the median of the sum of the
payments is described on the website NOMIS UK (2014) in the following
statement:
“The quality of an estimate can be assessed by referring to its coefficient of
variation (CV), which is shown next to the earnings estimate. The CV is the ratio of
the standard error of an estimate to the estimate. Estimates with larger CVs will be
less reliable than those with smaller CVs.
In their published spreadsheets, ONS use the following CV values to give an
indication of the quality of an estimate . . .” (see Table 2.8).
Therefore, we should pay attention to the records with a CV value above 10 %.
Please do the following:

1. Explain why we should use the median as a measure of central tendency for
payments or salaries.
2. Open the stream with the name “Template-Stream England payment 2014
reduced”.
3. Save the stream using another name.
4. Extend the stream to show in a new Table node the CV values in descending
order.
5. In another table node, show only the records that have a confidence value
(CV) of the weekly payments below (and not equal to) “reasonably precise”.
Determine the sample size.
6. In addition, add nodes to show in a table those records with a CV that indicates
they are at least “reasonably precise”. Determine the sample size.

Exercise 5: Extract Area Names


In the dataset “england_payment_fulltime_2014_reduced.xls”, the weekly
payments for different UK regions are listed. The region is described here by a
field called “admin_description” as shown in Fig. 2.156. The aim of this exercise is
to separate the names from the description of the region type.

1. Use the stream "Template-Stream England payment 2014 reduced".
2. Add appropriate nodes to extract the names of the regions using the variable
“admin_description”. The names should be represented by a new variable called
“area_name_extracted”.
3. Show the values of the new variable in a table.

Table 2.8 CV value and quality of estimate

5 % or lower: Precise
over 5 %, up to 10 %: Reasonably precise
over 10 %, up to 20 %: Acceptable, but use with caution
over 20 %: Unreliable, figures suppressed

Fig. 2.156 Records in the dataset “england_payment_fulltime_2014_reduced.xls”

Exercise 6: Distinguishing Between the Select and the Filter Node


We discussed the Select and the Filter node in the previous chapters. Both can be
used to reduce the data volume a model is based upon.

1. Explain the functionality that each of these nodes can help to realize.
2. Outline finally the difference between both node types.

Exercise 7: Filtering Variables Using the Source Node


In Sect. 2.7.5, we created a stream based on the IT processor benchmark test
dataset, to show how to use a Filter node. We found that we can exclude the
variable “processor type” from the analytical process. If the variable will never be used anywhere in the stream, we can instead exclude it directly in the Source node, where the data come from in the first place.
The stream “Correlation_processor_benchmark” should be modified. Open the
stream and save it under another name. Then modify the Excel File node so that the
variable “processor type” does not appear in the nodes that follow.

Exercise 8: Standardizing Values


The dataset “benchmark.xlsx” includes the test results of processors for personal
computers published by c’t Magazine for IT Technology (2008). As well as the
names of the manufacturers Intel and AMD, the names of the processors can also be
found. The processor speed was determined using the “Cinebench” benchmark test.
Before doing multivariate analysis, it is helpful to identify outliers. This can be
done by an appropriate standardization procedure, as explained in Sect. 2.7.6.

1. In Sect. 2.7.4, the stream “selecting_records” was used to select records that
represent processor data for the manufacturer “Intel”. Please now open the
stream “selecting_records” and save it under another name.
2. Make sure or modify the stream so that the “Intel” processors will now be
selected.
3. Add appropriate nodes to standardize the price (variable “EUR”) and the
Cinebench benchmark results (variable “CB”).
4. Show the standardized values.

5. Determine the largest standardized benchmark result and the largest standardized price. Interpret these values in detail. Are these outliers?
6. Now analyze the data for the AMD processors. Can you identify outliers here?

Exercise 9: Dataset Partitioning


In the dataset “housing.data.txt” we find values regarding Boston housing
neighborhoods. We want to use that file as a basis to learn more about the
Partitioning node. Please do the following:

1. Open the template stream “008_Template-Stream_Boston_data”.


2. Add a Partition node at the end of the stream.
3. Use the option “Train and Test” in the Partition node and set the training
partition size at 70 % and the test partition size at 30 %.
4. Now generate two Select nodes automatically with the Partition node as shown
in Fig. 2.157.
5. Move the Select nodes to the end of the stream and connect them with the
Partition node.

Fig. 2.157 Generate the Select nodes using a Partition node



Fig. 2.158 Final stream “dataset_partitioning”

6. To have a chance of analyzing the partitions, add two Data Audit nodes. For the
final stream, see Fig. 2.158.
7. Now please run the Data Audit node behind the Select node for the training
records TWICE. Compare the results for the variables and explain why they are
different.
8. Open the dialog window of the Partition node once more. Activate the option
“Repeatable partition assignment” (see Fig. 2.157).
9. Check the results from two runs of the Data Audit node once more. Are the
results different? Try to explain why it could be useful in modeling processes to
have a random but repeatable selection.

Exercise 10: England Payment Gender Difference


The payments for female and male employees in 2014 are included in the Excel
files “england_payment_fulltime_female_2014” and “england_payment_fulltime_
male_2014”. The variables are described in Sect. 10.1.12. We focus here only on
the weekly gross payments.
By modifying the stream “merge_employee_data”, calculate the differences in
the medians between payments for female and male employees.

Exercise 11: Merge Datasets


In Sect. 2.7.9, we discussed several ways to merge two datasets. Using small
datasets, in this exercise we want to go into the details of the merge operations.
Figure 2.159 depicts the “Template-Stream Employee_data”. This stream obvi-
ously gives us access to the datasets “employee_dataset_001.xls” and
“employee_dataset_002.xls”. The records of these datasets, with a sample size of
three records each, are shown in Figs. 2.160 and 2.161.

1. Now please open the stream “Template-Stream Employee_data”.


2. Add a Merge node and connect it with the Source File nodes.
3. Also add a Table node to show the results produced by the Merge node.
Figure 2.162 shows the final stream.

Fig. 2.159 Nodes in the “Template-Stream Employee_data”

Fig. 2.160 Records of “employee_dataset_001.xlsx”



Fig. 2.161 Records of “employee_dataset_002.xlsx”

Fig. 2.162 Final stream to check the effects of different merge operations

4. As shown in Fig. 2.163, Merge nodes in general offer four methods for merging
data. By changing the type of merge operation (inner join, full outer join, . . .),
produce a screenshot of the results. Explain in your own words the effect of the
join type. See, e.g., Fig. 2.164.

Fig. 2.163 Parameters of a Merge Node

Fig. 2.164 Result of a merge operation



Exercise 12: Append Datasets


In Sect. 2.7.10, we discussed how to add rows to a dataset by appending another
one. Figure 2.169 shows the nodes in the “Template-Stream Employee_data_
modified”. The stream is based on the datasets “employee_dataset_001.xlsx” and
“employee_dataset_003.xlsx” (not “. . .002”!). The records of these datasets, with a
sample size of three records each, are shown in Figs. 2.165 and 2.166.
In contrast to the dataset “employee_dataset_002.xls”, the customer ID’s in these datasets are unique. This is important because otherwise we would get an inconsistent dataset.
The aim of this exercise is to understand the details of the Append operation and
the Append node.

1. Open the stream “Template-Stream Employee_data”.


2. Add an Append node and connect it with the Excel File nodes.

Fig. 2.165 Records of “employee_dataset_001.xlsx”

Fig. 2.166 Records of “employee_dataset_003.xlsx”



Fig. 2.167 Final stream to check the functionalities of the Append node

3. Add a Table node to show the results produced by the Append node. For the final
stream, see Fig. 2.167.
4. Modify the parameters of the Append node (see Fig. 2.168), using the options

– “Match fields by . . .”,
– “Include fields from . . .”, and
– “Tag records by including source dataset in field”.

Find out what happens and describe the functionality of these options.

Exercise 13: Append Versus Merge Datasets


In Sect. 2.7.9, we discussed how to merge datasets and in Sect. 2.7.10, we explained
how to add rows to a dataset by appending another one. Figure 2.169 shows
the nodes in the “Template-Stream Employee_data”. This stream is based on
the datasets “employee_dataset_001.xlsx” and “employee_dataset_002.xlsx”. The

Fig. 2.168 Parameters of the Append node

records of these datasets, with a sample size of three records each, are shown in
Figs. 2.170 and 2.171.
The aim of this exercise is to understand the difference between the Merge and
the Append node. Therefore, both nodes should be implemented into the stream.
After that, the mismatch of results should be explained. You will also become
aware of the challenges faced when using the Append node.

1. Open the stream “Template-Stream Employee_data”.


2. Run both Table nodes and compare the data included in the datasets.
3. Add a Merge node and connect it with the Excel File nodes.
4. In the Merge node, use the full outer join operation to get the maximum number
of records in the result.
5. Also add an Append node and connect it with the Excel File nodes.
6. In the Append node, enable the option “Include fields from . . . All datasets”.
7. Behind the Merge and the Append node, add a Table node to show the results
produced. Figure 2.172 shows the final stream.
8. Now run each of the Table nodes and outline the differences between the Merge
and the Append node results.

Fig. 2.169 Nodes in the “Template-Stream Employee_data”

Fig. 2.170 Records of “employee_dataset_001.xlsx”

9. Describe the problems you find with the results of the Append operation. Try to make a suggestion as to how we could become aware of such problems before appending two datasets.

Fig. 2.171 Records of “employee_dataset_002.xlsx”

Fig. 2.172 Final stream “append_vs_merge_employee_data”



2.7.12 Solutions

Exercise 1: Identify and Count Non-available Values


Name of the solution stream identifing_non-available_values
Theory discussed in section Section 2.7.2

Figure 2.173 shows one possible solution.

1. We extended the template stream by first adding a Data Audit node from the
Output tab and then connecting that node with the Source node. We do this to
have a chance of analyzing the original dataset. Figure 2.174 shows that the
number of values in row “weekly_payment_gross” is 1021.

Fig. 2.173 Stream “identifing_non-available_values”

Fig. 2.174 Data Audit node analysis for the original dataset

Fig. 2.175 Table node with the new variable “null_flag”

Fig. 2.176 Frequency distribution of the variable “null_flag”

In comparison, the number of values for “admin_description” is 1024. So we


can guess that there are three missing values. The aim of this exercise, however,
is to count the missing values themselves.
2. Behind the Type node we add a Derive node, from the Field Ops tab of the
Modeler. The formula used here is “@NULL(weekly_payment_gross)”.
3. In the Table node we can see the results of the calculation (see the last column in
Fig. 2.175).
4. To calculate the frequencies we use a Distribution node. Here we find that there
are indeed three missing values (Fig. 2.176).
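Outside the Modeler, the same idea can be sketched in a few lines of Python with pandas; the column name follows the dataset description, while the file path and the use of pandas are our own assumptions and not part of the solution stream:

import pandas as pd

# Illustration only: the solution itself uses Modeler nodes, not pandas
df = pd.read_excel("england_payment_fulltime_2014_reduced.xls")

# Equivalent of the Derive node with @NULL(weekly_payment_gross): a missing-value flag
df["null_flag"] = df["weekly_payment_gross"].isna()

# Equivalent of the Distribution node: frequency of the flag values
print(df["null_flag"].value_counts())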

Exercise 2: Comfortable Selection of Multiple Fields


Name of the solution stream simple_calculations_extended
Theory discussed in section Section 2.7.2

1. The function “@FIELDS_MATCHING(pattern)” selects the variables that match


the condition given as the pattern. If we use “training_days_*”, the function gives us
a list containing the variables “training_days_actual” and “training_days_to_add”.
Figure 2.177 shows the final stream “simple_calculations_extended”. Here we added a third sub-stream. The formula in the Derive node is “sum_n(@FIELDS_MATCHING(‘training_days_*’))”.

Fig. 2.177 Final stream “simple_calculations_extended”

Fig. 2.178 Results of the calculation

We can find the results in the Table node, as shown in Fig. 2.178.
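As a rough, hedged analogy to “sum_n(@FIELDS_MATCHING(...))”, the following Python/pandas sketch sums all columns whose names match the pattern; the two training-day columns follow the exercise, the values are invented:

import pandas as pd

# Invented example values for the two training-day columns mentioned above
df = pd.DataFrame({"training_days_actual": [3, 5, 2],
                   "training_days_to_add": [1, 0, 4]})

# Select all columns matching the pattern "training_days_*" and sum them per record
cols = [c for c in df.columns if c.startswith("training_days_")]
df["sum_training_days"] = df[cols].sum(axis=1)
print(df)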

Exercise 3: Counting Values in Multiple Fields


Name of the solution stream counting_values_multiple_fields
Theory discussed in section Section 2.7.2

1. Figure 2.179 shows the initial stream “Template-Stream IT_user_satisfaction”.


2. Figure 2.180 shows the final stream. First we want to explain the sub-stream,
counting the number of people that answered “5 = good” for satisfaction with the start-time, the system_availability, and the performance.
We use a Filter node to reduce the number of variables available in the first
and second sub-stream. Here we disabled all the variables except “start-time”,
“system_availability”, and “performance”, as shown in Fig. 2.181. This is not

Fig. 2.179 Template-Stream IT_user_satisfaction

Fig. 2.180 Stream “counting_values_multiple_fields”



Fig. 2.181 Enabled/disabled variables in the Filter node

necessary but it helps us to have a better overview, both in the expression builder
of the Derive node, and when displaying the results in the Table node.
In the Derive node, we use the function
count_equal(5,[starttime, system_availability, performance])
as shown also in Fig. 2.182. Figure 2.183 shows the results.
3. The second sub-stream is also connected to the Filter node. That’s because we
also want to analyze the three variables mentioned above. The formula for the
Derive node is (Fig. 2.184):
count_greater_than(3,[starttime,system_availability,performance])
To count the number of answers that represent satisfaction of at least “5 = good”, we used the function “count_greater_than”, with the first parameter “3”.
Figure 2.185 shows the result.
The variables “start-time”, “system_availability”, . . ., “slimness” should be
analyzed here. We are interested in the number of values that are 5 or 7. To get a
list of the variables names, we can use the function “@FIELDS_BETWEEN”.
We have to make sure that all the variables between “start-time” and “slimness”
have the same coding. The formula
count_greater_than(3,@FIELDS_BETWEEN(start-time, slimness))
can be found in the Derive node in Fig. 2.186. Figure 2.187 shows the results.
4. The function “@FIELDS_BETWEEN(start-time, slimness)” produces a list of
all the variables between “start-time” and “slimness”. If we want to exclude a
variable we can filter or reorder the variables. Here we want to show how to use
the Field Reorder node.

Fig. 2.182 Parameters of the first Derive node

Fig. 2.183 Results of the first calculation



Fig. 2.184 Parameters of the second Derive node

Fig. 2.185 Results of the second calculation



Fig. 2.186 Parameters of the third Derive node

Fig. 2.187 Results of the third calculation



If we add a Field Reorder node to the stream and double-click on it, the list of
the fields is empty (see Fig. 2.188). To add the field names, we click on the
button to the right of the dialog window that is marked with an arrow in
Fig. 2.188. Now we can add all the variables (see Fig. 2.189).
After adding the variable names, we should reorder the variables. To exclude
“system_availability” from the analysis, we make it the first variable in the list.
To do so, we select the variable name by clicking on it once and then we use the
reorder buttons to the right of the dialog window (see Fig. 2.190).
In the Derive node, we only have to modify the name of the variable that is
calculated. We use “count_5_and_7_for_all_except_system_availabilty”. The
formula is the same as in the third sub-stream. As shown in Fig. 2.191, it is:
count_greater_than(3,@FIELDS_BETWEEN(start-time, slimness))
Running the last Table node, we get slightly different results than in the third
sub-stream. Comparing Figs. 2.187 and 2.192, we can see that the number of
answers with code 5 or 7 is sometimes smaller. This is because we excluded the
variable “system_availability” from the calculation, by moving it to first place
and starting with the second variable “start-time”.
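The following Python/pandas sketch mimics this logic outside the Modeler: it counts, per record, how often the codes 5 or 7 occur in a set of satisfaction variables while excluding “system_availability”. The variable names follow the exercise; the answer values are invented:

import pandas as pd

# Invented answer codes, only to illustrate the counting logic
df = pd.DataFrame({"starttime":           [5, 3, 7],
                   "system_availability": [7, 5, 2],
                   "performance":         [4, 7, 5],
                   "slimness":            [5, 1, 7]})

# Count the codes 5 and 7 per record, excluding "system_availability"
fields = [c for c in df.columns if c != "system_availability"]
df["count_5_and_7_without_system_availability"] = df[fields].isin([5, 7]).sum(axis=1)
print(df)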

Fig. 2.188 Field selection in a Field Reorder node



Fig. 2.189 Adding all the variables in a Field Reorder node

Fig. 2.190 Reorder variables in a Field Reorder node



Fig. 2.191 Parameters of the fourth Derive node

Fig. 2.192 Results of the fourth calculation



Exercise 4: Determining Non-reliable Values


Name of the solution stream identifing_non-reliable_values
Theory discussed in section Section 2.7.4

1. The median should be used as a measure of central tendency for payments or


salaries because it is less sensitive to outliers. The mean would be a questionable
measure because it is very sensitive to outliers.1
2. Figure 2.193 shows the initial template stream.
3. We save the stream with the name “identifing_non-reliable_values”.
4. To sort the values, we can use a Sort node from the Record Ops tab. Figure 2.194
shows the parameters of the node. Additionally, we add a Table node behind the
Sort node to show the results (see Fig. 2.195).

Fig. 2.193 Template-Stream England payment 2014 reduced

Fig. 2.194 Parameters of the Sort node

1 See also the solution of Exercise 7 in Sect. 3.2.8 and here especially Table 3.15.

Fig. 2.195 Sorted records shown in a Table node

Fig. 2.196 Select node parameters

Fig. 2.197 Selected records

5. To extract only the records that have a coefficient of variation (CV) of the weekly payments below (and not equal to) “reasonably precise”, we can use a Select
node with the parameters shown in Fig. 2.196 from the Record Ops tab and a
Table node to show the results.
To determine the sample size, we run the Table node. Figure 2.197 shows a
sample size of 983 records.

Fig. 2.198 Final stream “identifing_non-reliable_values”

6. If we want to extract the records with a CV that is at least “reasonably precise”, we can use another Select node with the condition “weekly_payment_gross_CV <= 10”, instead of “weekly_payment_gross_CV < 10”. The sample size is now 994. Obviously, 994 − 983 = 11 records have a coefficient of variation of exactly 10 %. Figure 2.198 shows the final stream.
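A hedged Python/pandas equivalent of the two Select conditions might look as follows; the filename and the variable name follow the exercise, but reading the data with pandas is only an illustration, not part of the solution stream:

import pandas as pd

# Illustration only: the solution itself uses Modeler nodes
df = pd.read_excel("england_payment_fulltime_2014_reduced.xls")

below_10 = df[df["weekly_payment_gross_CV"] < 10]    # strictly below 10 %
up_to_10 = df[df["weekly_payment_gross_CV"] <= 10]   # at most 10 %

print(len(below_10), len(up_to_10), len(up_to_10) - len(below_10))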

Exercise 5: Extract Area Names


Name of the solution stream string_functions_extract_names
Theory discussed in section Section 2.7.3

Figure 2.199 shows the extended template stream. We added a Derive node and a
Table node.
There are a lot of possible formulas for extracting the region names. Figure 2.200 shows the formula:
substring(locchar(“:”, 1, admin_description) + 1, length(admin_description) - locchar(“:”, 1, admin_description), admin_description)
The function substring helps us to separate parts of a string. With the first parameter “locchar(‘:’, 1, admin_description) + 1”, we determine the position of the character “:”; the substring extraction should then begin at this position + 1, i.e., right after the colon, to extract the region name.
With the second parameter “length(admin_description) - locchar(‘:’, 1, admin_description)”, we calculate the length of the string to be extracted: “length(admin_description)” gives us the length of the whole string, and from it we subtract the length of the leading part (up to and including “:”) that should be ignored.
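The same string logic can be sketched in Python; the example value is invented and only serves to illustrate the split at “:”:

# Invented example value of admin_description, used only for illustration
admin_description = "ua:Northumberland"

# locchar(":", 1, admin_description) corresponds to locating the position of ":"
pos = admin_description.find(":")

# substring(position + 1, remaining length, admin_description) corresponds to slicing after ":"
area_name_extracted = admin_description[pos + 1:]
print(area_name_extracted)   # -> Northumberland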

Exercise 6: Distinguishing the Select and the Filter Node


Name of the solution stream –
Theory discussed in section Section 2.7.4

Fig. 2.199 Solution “string_functions_extract_names”

Fig. 2.200 Derive node with the formula to extract the region names

1. The functionality of the Select node is explained in Sect. 2.7.4. A description of


the functionalities of the Filter node can be found in Sect. 2.7.5.
2. The difference between the Select and the Filter node is that the Select node
reduces the number of records and the Filter node reduces the number of

Fig. 2.201 Stream “Filter_processor_dataset modified”

variables. Consider a table with 100 rows and 10 columns. We can minimize the
number of columns shown in the modeling process if we use a Filter node,
whereas the Select node helps us to cut down the number of records or rows that are available for the following nodes. We can also use a Filter node to rename the variables.
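A small, purely illustrative Python/pandas analogy (with invented values) may help to memorize the difference: the Select node corresponds to filtering rows, the Filter node to selecting and renaming columns:

import pandas as pd

# Invented values, only to contrast the two operations
df = pd.DataFrame({"Firm": ["Intel", "AMD", "Intel"],
                   "EUR": [250, 180, 950],
                   "CB": [9000, 7000, 12000]})

# Select node analogy: keep only certain records (rows)
intel_only = df[df["Firm"] == "Intel"]

# Filter node analogy: keep and rename certain variables (columns)
price_only = df[["Firm", "EUR"]].rename(columns={"EUR": "price_eur"})

print(intel_only)
print(price_only)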

Exercise 7: Filtering Variables Using the Source Node


Name of the solution stream Filter_processor_dataset modified
Theory discussed in section Section 2.7.5

We can find the solution in the stream “Filter_processor_dataset modified” (see


Fig. 2.201). To exclude the variable “processor type”, we modify the parameters of
the Excel File node, as depicted in Fig. 2.202.

If we compare the results in the Type node before (see Fig. 2.203) and after (see
Fig. 2.204) the modification of the Excel File node, we can see that the variable
“processor type” no longer appears.

Fig. 2.202 Parameters of Excel File node

Fig. 2.203 Type node parameters before stream modification



Fig. 2.204 Type node parameters after stream modification

Exercise 8: Standardizing Values


Name of the solution stream standardize_IT_Processor_data
Theory discussed in section Filtering data see Sect. 2.7.5
Standardizing data see Sect. 2.7.6

1. For the solution stream discussed here, we used the name


“standardize_IT_Processor_data”.
2. To modify the stream so that the processors of the firm “Intel” will be selected,
we have to adjust the parameters in the Select node. The correct expression is:
Firm = “Intel”.
3. To standardize the price and the Cinebench benchmark results within the
variables “EUR” and “CB”, we use an Auto Data Preparation node. As discussed
in Sect. 2.7.6, this node transforms continuous input variables. To do this, we
activate the Settings tab in the dialog window of the Auto Data Preparation node
(see Fig. 2.205). Then we choose the category “Prepare Inputs and Target”.
At the bottom of the dialog window, we can activate the option “Put all
continuous fields on a common scale (highly recommended if feature construc-
tion will be performed)”.
We can close the dialog window of the Auto Data Preparation node. Finally,
we should add a Table node as well as a Data Audit node. This is to show the
results of the standardization process (see Fig. 2.206 for the final stream).

Fig. 2.205 Auto Data Preparation node parameters to standardize continuous input variables

Fig. 2.206 Stream “standardize_IT_Processor_data”

4. For the standardized values, see Fig. 2.207.


5. We can use a Data Audit node to determine the largest standardized benchmark
result and the largest standardized price. We add this node to the stream and
connect it with the Auto Data Preparation node (see Fig. 2.206).

Fig. 2.207 Standardized values of the price and the benchmark test result

If we run the Data Audit node, we get the results shown in Fig. 2.208. Using
the 3s-rule, explained in more detail in Sect. 3.2.7, we can definitely identify one
or more outliers. The maximum standardized price is 3.057. That means that the largest price is more than 3 standard deviations above the mean. The value is outside the 3s interval and therefore this processor is very expensive. Scrolling
through the records in the Table node, we can see that the processor “Core
2 Quad QX9770” is the outlier. It also has the maximum CPU performance.
In practice, it could be helpful to filter the records that have an absolute standardized value larger than 3. This can be done by using another Select node. For an explanation, see also Exercise 3 and the small sketch following this solution.
6. Data on the AMD processors can be analyzed by modifying the condition in the
Select node. No outliers are found among the standardized values.
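For readers who want to reproduce the z-score logic outside the Modeler, here is a minimal Python/pandas sketch with invented prices and benchmark results. With so few invented rows the 3s threshold will usually not be exceeded; the point is only the mechanics of the standardization and of the outlier flag:

import pandas as pd

# Invented prices (EUR) and Cinebench results (CB), only to show the mechanics
df = pd.DataFrame({"EUR": [180, 250, 320, 290, 1400],
                   "CB": [7000, 9000, 9500, 9200, 15000]})

# z-standardization: (value - mean) / standard deviation, per column
z = (df - df.mean()) / df.std()
print(z.round(2))

# 3s-rule: flag records whose absolute standardized value exceeds 3 in any column
print(df[(z.abs() > 3).any(axis=1)])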

Exercise 9: Dataset Partitioning


Name of the solution stream Dataset partitioning
Theory discussed in section Section 2.7.7

Figure 2.209 shows the final stream once more. Additionally, Fig. 2.210 shows the options of the Partition node.
If the option “Repeatable partition assignment” in the Partition node (see Fig. 2.210) is not activated, the records will be selected randomly each time and assigned to one of the partitions. Each run therefore produces different results in the Data Audit node.
If we want to assign the same records to the partitions and thereby get the same
results in each trial, we can activate the option “Repeatable partition assignment”
and set a seed value. This gives us the opportunity to reproduce the results each time
we run the stream.
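The effect of a seed can be illustrated with a short Python sketch; this is not the Modeler’s internal algorithm, just an analogy with invented data and a 70/30 split:

import numpy as np
import pandas as pd

# Invented data frame with ten records, only to illustrate the partitioning idea
df = pd.DataFrame({"id": range(10)})

def partition(data, seed=None):
    # Assign each record randomly to training (70 %) or testing (30 %)
    rng = np.random.default_rng(seed)
    return np.where(rng.random(len(data)) < 0.7, "1_Training", "2_Testing")

print(partition(df))            # changes on every run ...
print(partition(df))
print(partition(df, seed=42))   # ... but is reproducible with a fixed seed
print(partition(df, seed=42))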

Fig. 2.208 Maximum values of the price and the test results

Fig. 2.209 Final stream “Dataset partitioning”

Exercise 10: England Payment Gender Difference


Name of the solution stream England_payment_gender_difference
Theory discussed in section Section 2.7.8

Fig. 2.210 The Select nodes are generated using a Partition node

By modifying the given stream “merge_England_payment_data”, the structure


of the stream shown in Fig. 2.211 remains the same. Here we describe the
modifications:

1. First we have to make sure we access the correct source files. In the Excel File
nodes, we have to define the filenames “england_payment_fulltime_female_
2014” in the upper node and “england_payment_fulltime_male_2014” in the
node below. Figure 2.212 shows the parameters of the first Excel File node.
2. We can check the parameters by using the Table nodes connected with the
Source nodes. Figure 2.213 shows some records from female employee data,
2014.
3. We can disable a lot of variables in each of the following Filter nodes behind the
Source nodes. Apart from the weekly gross payment, we do not need the other
payment related variables (see Fig. 2.214).
4. Because the variable names have to be unique, we have to disable the variable
“area”. Figure 2.215 shows the parameters for the payments to male employees
dataset.
5. After filtering the variables needed, we can now merge the two subsets (see
Fig. 2.216).

Fig. 2.211 Final stream “England_payment_gender_difference”

6. In an additional Derive node, we define a new field called “weekly_payment_gross_DIFFERENCE” with the formula “weekly_payment_gross_male - weekly_payment_gross_female”, as shown in Fig. 2.217.
7. In two Table nodes at the end of the stream, we can find the records with and
without the weekly payment difference (see Fig. 2.211 for the final stream
structure). Figure 2.218 shows the results in the last column.

Exercise 11: Merge Datasets


Name of the solution stream merge_employee_data
Theory discussed in section Section 2.7.9

Figures 2.219 and 2.220 show the records of the sample datasets with their small
sample size of three records each. This gives us the chance to understand the effect
of the different join operations. The final stream can be found in Fig. 2.221.
Figure 2.222 shows again the options of a Merge node, in particular the different merge operations.
We choose and describe the merge operations step-by-step and show the results.
Inner join

Fig. 2.212 Excel node parameters

Fig. 2.213 Records of “england_payment_fulltime_female_2014”



Fig. 2.214 Filter node parameters for the payments of female employees dataset

Fig. 2.215 Filter node parameters for the payments to male employees dataset

Fig. 2.216 Merge UK payment datasets

Rows are only matched where the customer_ID in both datasets is the same (see
Fig. 2.223).
Full outer join
All rows from both datasets are in the joined table, but a lot of values are not
available ($null$) (see also Fig. 2.124). Figure 2.224 shows that all the variables
that are not in one of the subsets are filled with “non-available” or “$null$” values.
We get a sample of size four because we can find the customer_ID’s 1711 and
9001 in both datasets, and the ID’s 3831 and 6887 are only included in one of the
sets. So the number of unique values is four.
Partial outer join
Records from the first-named dataset are in the joined table. From the second
dataset, only those with a key that matches the first dataset key will be copied.
In the Merge node, we select the partial outer join option and select the first
dataset, as shown in Fig. 2.225.

Fig. 2.217 Derive node to calculate the gross payment difference

Fig. 2.218 Gross payment difference

Figure 2.226 shows the result. ID 3831 appears in the result because it is in the
first subset, selected as the leading subset. If we set dataset 2 as the leading subset
with the option of the merge node (see Fig. 2.227), then ID 6887 appears in the
result instead of 3831 (see Fig. 2.228).

Fig. 2.219 Records of “employee_dataset_001.xlsx”

Fig. 2.220 Records of “employee_dataset_002.xlsx”

It does not make sense to activate both datasets in the options, using partial outer
join, because the result equals the full outer join (see also the explanation in the
middle of Fig. 2.225).
Anti-join
The joined table only shows the records with a key that does not appear in the
second dataset. Here, the leading subset is “employee_dataset_001.xls”, as shown
in Fig. 2.219. The only unique customer ID not included in the second dataset is
3831. The result is therefore a dataset with just one record, as shown in Fig. 2.229.
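The four join types can also be reproduced with a hedged Python/pandas sketch; the customer IDs follow Figs. 2.219 and 2.220, all other values are invented:

import pandas as pd

# Customer IDs follow the figures above; the remaining values are invented
ds1 = pd.DataFrame({"customer_ID": [1711, 9001, 3831], "monthly_salary": [3000, 3500, 2800]})
ds2 = pd.DataFrame({"customer_ID": [1711, 9001, 6887], "family_status": ["married", "single", "single"]})

inner   = ds1.merge(ds2, on="customer_ID", how="inner")     # only matching keys: 1711, 9001
full    = ds1.merge(ds2, on="customer_ID", how="outer")     # all four unique keys, with NaN gaps
partial = ds1.merge(ds2, on="customer_ID", how="left")      # partial outer join, dataset 1 leading
anti    = ds1[~ds1["customer_ID"].isin(ds2["customer_ID"])] # anti-join: keys of dataset 1 not in dataset 2

print(inner, full, partial, anti, sep="\n\n")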

Fig. 2.221 “merge_employee_data” is streamed to check the effects of different merge


operations

Exercise 12: Append Datasets


Name of the solution stream append_employee_data
Theory discussed in section Section 2.7.10

Figure 2.230 shows once more the different parameters of an Append node.
If we modify the field “Match fields by . . .”, we can determine how the append
operation will select the variables to append. In the case of the datasets used here,
there is no difference in the results. The option “Match case” enables case sensitiv-
ity for the names of the variables. So that variables such as “employer” or
“Employer” would be differentiated. This is sometimes useful. In our datasets
however, we cannot see any changes that depend on this option.
With the option “Include fields from . . .”, we can define which variables will
appear in the result. Note, however, that the results also depend on

Fig. 2.222 Parameters of a Merge Node

Fig. 2.223 Inner join results

the option “Match fields by”. The append operation will only have an effect if
“Name” is activated! (see Fig. 2.230).
If we choose “Main dataset only”, then the variables “customer_ID”,
“monthly_salary”, and “employer” will be in the result. Only if we enable “All
datasets” will “gender” also be in the result. The difference is shown in Figs. 2.231
and 2.232.

Fig. 2.224 Full outer join results

Fig. 2.225 Partial outer join options

Fig. 2.226 Partial outer join result part 1



Fig. 2.227 Partial outer join options

Fig. 2.228 Partial outer join result part 2

Fig. 2.229 Anti-join result

The last option “Tag records by including source dataset in field” is easy to
understand. As outlined also in Sect. 2.7.10, it adds a new variable to the result that indicates the number of the dataset each record comes from. If we enable the option, we get the result as shown in
Fig. 2.233. We can also determine the name of the new variable, but unfortunately it

Fig. 2.230 Parameters of the Append node

Fig. 2.231 Result with “All datasets” disabled



Fig. 2.232 Result with “All datasets” enabled

Fig. 2.233 Number of dataset included in the result



is not possible to redefine the values that are used to mark the subsets. Often we would rather use, e.g., the year that the data represent; for this we have to use an extra Derive node to set the value of a new variable in each sub-stream and then append the sets afterwards. This is shown in Sect. 2.7.10.
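A hedged Python/pandas sketch of the three options might look as follows; the customer IDs of the first dataset follow the figures, everything else (including the IDs of the third dataset) is invented for illustration:

import pandas as pd

# Customer IDs of dataset 1 follow the figures; all other values are invented
ds1 = pd.DataFrame({"customer_ID": [1711, 9001, 3831],
                    "monthly_salary": [3000, 3500, 2800],
                    "employer": ["A", "B", "C"]})
ds3 = pd.DataFrame({"customer_ID": [1001, 1002, 1003],
                    "monthly_salary": [2900, 3100, 2600],
                    "gender": ["f", "m", "f"]})

# "Include fields from ... All datasets": union of all columns (missing cells become NaN)
all_fields = pd.concat([ds1, ds3], ignore_index=True)

# "Main dataset only": restrict the result to the columns of the first (main) dataset
main_only = all_fields[ds1.columns]

# "Tag records by including source dataset in field": add an indicator of the origin
tagged = pd.concat([ds1.assign(Input="1"), ds3.assign(Input="2")], ignore_index=True)

print(all_fields, main_only, tagged, sep="\n\n")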

Exercise 13: Append Versus Merge Datasets


Name of the solution stream append_vs_merge_employee_data
Theory discussed in section Section 2.7.9
Section 2.7.10

The records of the datasets “employee_dataset_001.xlsx” and “employee_


dataset_002.xlsx” are shown once more in Figs. 2.234 and 2.235.

Fig. 2.234 Records of “employee_dataset_001.xls”

Fig. 2.235 Records of “employee_dataset_002.xls”



Fig. 2.236 Records after merging the subsets

Fig. 2.237 Records after appending the subsets

It is important to realize that the customer ID’s 1711 and 9001 are included in
both datasets. As we outlined in Sect. 2.7.10, the objects represented in each dataset
should be unique if we want to use the Append node. Otherwise we get inconsistent
data, as we will show here.
If we run the Merge node, we get the results shown in Fig. 2.236. As expected,
all unique customer ID’s are present. Variables for ID 3831 that are not present in dataset 2 are filled with $null$. Variables not present in dataset 1 for ID 6887 are also filled with $null$ for unavailable data. So far, no errors can be found here. The data are
consistent.
Now we run the Table node behind the Append node. Figure 2.237 shows the
result. As we can see, this result is useless. That’s because we have two records per

Fig. 2.238 Merge node is modified to find duplicated primary keys in two datasets

customer ID 1711 and 9001. The information cannot be consolidated—worse still,


we get inconsistent data. The “family_status” of ID 1711 in row 2 is $null$, but in
row 4 we can see that the person is in fact “married”.
How can we avoid such problems? We should also make sure that in the
“combined” dataset, the primary key is unique. One way to check this is to merge
both subsets by using the customer ID and an inner join. We can easily check that
by modifying the Merge node parameters in the first sub-stream. Figure 2.238
shows that we should use the inner join to find the duplicated “customer_ID” in
the datasets (see Fig. 2.239).
Do not forget to reset the parameters of the Merge node to a full outer join
afterwards!
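The same duplicate-key check can be sketched in Python/pandas; the customer IDs follow the figures above, the remaining values are invented:

import pandas as pd

# Customer IDs follow the figures above; the remaining values are invented
ds1 = pd.DataFrame({"customer_ID": [1711, 9001, 3831], "monthly_salary": [3000, 3500, 2800]})
ds2 = pd.DataFrame({"customer_ID": [1711, 9001, 6887], "family_status": ["married", "single", "single"]})

# An inner join on the primary key returns exactly those keys present in both datasets,
# i.e. the records that a plain append would duplicate
duplicated_keys = ds1.merge(ds2, on="customer_ID", how="inner")["customer_ID"]
print(duplicated_keys.tolist())   # -> [1711, 9001]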

Fig. 2.239 Result of the Merge node to find duplicated primary keys in two datasets

Literature

c’t Magazine for IT Technology. (2008). CPU-Wegweiser: x86-Prozessoren im Überblick (Vol.
2008, No. 7, pp. 178–182).
IBM. (2014). SPSS Modeler 16 Source, Process, and Output Nodes. Accessed September
18, 2015, from ftp://public.dhe.ibm.com/software/analytics/spss/documentation/modeler/16.
0/en/modeler_nodes_general.pdf
NOMIS UK. (2014). Official labour market statistics – Annual survey of hours and earnings –
Workplace analysis. Accessed September 18, 2015, from http://nmtest.dur.ac.uk/
3 Univariate Statistics

After finishing this chapter, the reader is able to . . .

1. Explain in detail the necessity and the characteristics of the different scales of
measurement, while keeping hold of the big picture.
2. Create diagrams of frequency distributions to assess the shape and to determine
outliers, after assigning the correct scale of measurement to a variable.
3. Describe the necessity of using reclassification or binning procedures to determine
frequencies of values in specific intervals, and finally
4. Use SuperNodes to transform variables and to compact streams.

So the successful reader will be familiar with the statistical theory of assigning the correct scale of measurement to variables, and will then be able to select and apply the correct methods to show, determine, and assess frequency distributions.

3.1 Theory

3.1.1 Discrete Versus Continuous Variables

Measuring a variable means assigning values to it. How we choose the method to
examine the variables highly depends upon the scale of measurement or so-called
“scale type” of each variable. Therefore, it is important to assign the correct scale to
the variables in the first place, before starting the analysis.
As an example, Fig. 3.1 shows a small stream called “car_sales_modified”.
Before we go into the details and explain how to use the SPSS Modeler, it is
necessary to understand the theory behind how we describe the variables used in
each sample data file. Each analysis in data mining is based on this information. In
other words: if the description of the type of information a variable represents is
wrong, the variable cannot be handled correctly and the results of the analysis may
be incorrect. So it is essential to at least check the scale of measurement.


Fig. 3.1 Variables used in stream “car_sales_modified”

The dataset inspected in Fig. 3.1 includes details of different cars. The dialog
window on the right shows several details from the included variables. For more
details see also Sect. 10.1.5.
First, we are interested in the manufacturer of each car, represented by the
variable “manufact”. Its possible values “Acura”, “Audi”, etc. in the column
marked “values” give us some initial rough information about the car. Variables
that have such a restricted, i.e., finite, or countable number of values are called
discrete, categorical, or qualitative. A variable that can only take on two different
values is called dichotomous, for example, male or female.

Discrete Variables
Discrete variables can only take on a limited number of values. It is theoretically
impossible to find a third value between two very close values of the variable.
Often discrete variables are unsuitable for data mining, however, because of
their limited information. Therefore, let us now discuss the variable “fuel capacity”. It is
denoted “fuel_cap” and is the third in the list of variables, as shown in Fig. 3.1.
In comparison with “manufact”, “fuel_cap” values can be measured with utmost
precision. The number of decimal places after the decimal point is infinite, and the
precision of the values is theoretically unrestricted. These types of variables are
called continuous or quantitative.

Continuous Variables
Continuous variables have an infinite number of values between two points.
Furthermore, a variable can be called continuous if there is always a theoretical
chance that between two very close values another third value can exist. Variables
that represent currencies will also always be called continuous.
In the case of the SPSS modeler, an additional explanation for the term “contin-
uous” is necessary. This is because the Modeler also uses this term for so-called
“absolute scaled variables”. In the dataset “car_sales_modified”, for example, the
variable “sales” is included. We can easily see that between 1,000 and 1,001 sold cars no other car can be sold, but we can do more than just order the sales numbers: we can calculate with them and, more importantly, we can interpret ratios of the values. We
calculate with them and, more importantly, we can interpret ratios of the values. We
therefore assign the scale type “continuous” to such variables in the modeler. These
are variables that include the most detailed information we can expect.
The discrete and the continuous variables represent two extremes of informa-
tion: one gives us a rough overview, the other, very detailed information.

" To decide if a variable is discrete or continuous, the following steps


are helpful:

1. We have to decide if the variable is infinitely valued and the values


can be ordered.
2. If this is the case, we must see if we can imagine in each case a
value between two values that are as close together as possible.
3. If there is a theoretical chance that a third value can exist (e.g., the
average of the two chosen values), then the variable is continuous.
In all other cases, the scale type of the variable is discrete.

Now, it is possible to assign a characteristic to each variable: discrete/qualitative


and/or continuous/quantitative. In the next section, additional terms are explained
to help differentiate the variables in even more detail.

3.1.2 Scales of Measurement

There are three other important terms for distinguishing the different types of
variable measurements. They are used in the SPSS Modeler. Let’s focus once
more on the example shown in Fig. 3.1. As well as the continuous variable
“fuel_cap”, in the column “Measurement” we can find the terms nominal and
ordinal. To distinguish between both types let’s define these terms.

Nominal Scale
Discrete variables with values that have no natural order are called “nominally
scaled”. The variable values are most likely strings/text or numbers; if numbers are used, they serve only to assign the object to
a particular group of objects. The variable values can only be ordered alphabeti-
cally, but there is no implicit order. The SPSS Modeler uses three circles to
symbolize the scale type “nominal” (see Fig. 3.1).

Ordinal Scale
An ordinal variable is similar to a nominally scaled variable, but the values have
a natural order too. The “type” of a car can be “small” or “large”, and these two
values can only be ordered ascending or descending. In the SPSS Modeler, this
scale type is represented by a column chart (see Fig. 3.1).

Table 3.1 Variables and their scale type used in the “car_sales_modified” dataset

Variable name   Scale type
manufact        Represents the manufacturer of a car. Values are “Acura”, “Audi”, etc. The values have no implicit order, and between two values we cannot find a third. Therefore, the variable is discrete.
type            Possible values are “Automobile” (small) or “Truck” (large), so there is an implicit order. But there is a limited number of possible values. The variable is ordinally scaled.
fuel_cap        The fuel capacity of a car can be determined theoretically with infinite precision. The variable is metrically/continuously scaled.

Metrical Scale
A variable is called metrical if it can take on an infinite number of values. So all
continuous variables and currency variables are metrical. In the SPSS Modeler, the
metrical scaled variables are called “continuous” and represented with a ruler
symbol (see Fig. 3.1).
Coming back to the variables shown in Fig. 3.1, Table 3.1 shows a more detailed
description of the variable scale types used in the dataset “car_sales_modified”. All
the other variables included in this dataset are discussed in Exercise 4 of Sect. 3.1.3.
There are many more terms to describe the characteristics of a variable in more
detail, but we do not need more sophisticated descriptions to use the functionalities
of the SPSS Modeler. Interested readers are referred to Anderson et al. (2014).

3.1.3 Exercises

Exercise 1: “Fundamental Terms”


Please answer the following:

1. Explain in your own words the difference between a discrete and a continuous
variable. Give two examples for each term.
2. We also discussed the scale types nominal, ordinal, and metrical. Can you
remember which of these scale types can also be named “discrete”?
3. There are also variables that are called dichotomous. Explain in your own words
the meaning of this term and give an example.
4. Give two examples for each of these categories: Nominally, ordinally, and
metrically scaled variables.

Exercise 2: “Scale Type Examples”


State the scaling type for the following features:

(a) Color of a car


(b) Age of people in different countries
(c) Gender
(d) Nationality

(e) Satisfaction with a product (very good, good,. . .)


(f) Number of semesters taken by a college student
(g) Weight of vehicles
(h) Number of traffic accidents
(i) Hair color of people
(j) Types of a motor vehicle (automobile, truck, motorcycle)
(k) Ticket price for local transport in different countries

Are the following features discrete and/or dichotomous, quasi-continuous or


continuous?

(a) Election results of a party in percentage terms


(b) Fuel consumption of an automobile per 100 km
(c) Lot size
(d) Number of people dispatched at a counter per hour
(e) Income of an employee
(f) Number of married couples without children

Exercise 3: “Multiple Choice Test Scale Types”


Question Possible answers
1. Which of the following scale types define an ☐ Nominal scale
order, and a reference point or zero point too? ☐ Ordinal scale
☐ Metrical scale
2. Please tick all the scales that are only used to ☐ Nominal scale
label qualitative statements about statistical units. ☐ Ordinal scale
☐ Ratio scale
3. Please tag the metrical scales. ☐ Nominal scale
☐ Ratio scale
☐ Absolute scale
4. Please tag the necessary characteristics of a ☐ No transformation possible
transformation function that have to be fulfilled ☐ Inherent order must be preserved
for an ordinal variable to avoid information ☐ Uniqueness of the function
losses.
5. The variable of interest is ordinal scaled. Which ☐ Arithmetic mean
of the following statistical parameters can you ☐ Minimum
calculate without additional assumptions? ☐ Maximum
Please indicate if the following statements are correct or Yes No
incorrect
6. In statistics, scales are used to record the values ☐ ☐
of the variables of interest qualitatively or
quantitatively.
7. The different scales (nominal, ordinal etc.) ☐ ☐
cannot be hierarchically sorted according to their
characteristics.
8. The variable “gender of a person” is a ☐ ☐
dichotomous variable.

9. Variables with unit currency are always quasi- ☐ ☐
continuous.
10. Dichotomous variables are discrete. ☐ ☐
11. The variable “quality of hotels” is nominally ☐ ☐
scaled.
12. The variable “number of registered visitors in ☐ ☐
the library” is quasi-continuous, because the
arithmetic mean is 14.3.
13. The variable “length of the underground- ☐ ☐
platform” in Berlin is continuous.

Exercise 4: “Car characteristics and Variable Measurements”


In Sect. 3.1.1, we discussed measurement of the variables included in the dataset “car_sales_modified.sav”. Figure 3.2 shows all of the variables. Their meaning is
self-explanatory. Now assign the correct scale types “nominal”, “ordinal”, or
“continuous” to each of the variables. Explain your decisions.

Fig. 3.2 Variables of “car_sales_modified” dataset



3.1.4 Solutions

Exercise 1: “Fundamental Terms”


Theory discussed in section Section 3.1.1
Section 3.1.2

1. The explanation of the terms can be found in Sect. 3.1.1.


2. Each nominal and each ordinal variable is also discrete. This is because both
types are assigned to variables that have “gaps” between two close values.
3. Dichotomous variables can only take on two exact values. For example, the
variables “Patient has critical illness insurance” or “The animal is infected with
pathogenic bacteria” are called dichotomous. This term is often used in the
medical sector.
4. Examples are for instance:

Scale type Examples


Nominal Color of cars
License plates of cars
Ordinal Classification of hotels, e.g., two-star, three-star etc.
Metrical Size of an apartment
Waiting time of guests until they are served in a restaurant

Exercise 2: “Scale Type Examples”


Theory discussed in section Section 3.1.2

The scaling types are:

(a) Nominal and discrete


(b) Metrical and discrete
(c) Nominal and discrete, dichotomous
(d) Nominal and discrete
(e) Ordinal and discrete
(f) Metrical and discrete
(g) Metrical and continuous
(h) Metrical and discrete
(i) Nominal and discrete
(j) Nominal and discrete
(k) Metrical and quasi-continuous because there are “gaps” between the smallest
national currency units. Nevertheless, there are in each case a lot of possible
values, so statisticians have decided to call it a quasi-continuous scale. This
theory can be applied to all currencies. All variables that represent currencies
are quasi-continuous and metrical.

The features are:

(a) Continuous
(b) Continuous
(c) Continuous
(d) Discrete
(e) Quasi-continuous (see the example “ticket price for local transport in different
countries” discussed above)
(f) Discrete

Exercise 3: “Multiple Choice Test Scale Types”


Theory discussed in section Section 3.1.2

The correct answers are:

1. The metrical scale defines an order and also a reference point. The “number of
rooms” in an apartment is an example of such a variable. It has an order because
the number of rooms can be ordered ascending or descending. With the value
zero, there is also a (theoretical) reference point. It is necessary to mention that
the values of a nominally scaled variable can never be ordered. Whereas, an
ordinally scaled variable, e.g., the classification of hotels, can be used to order
the records of a dataset. With hotel classification, there is also a theoretical
reference point of zero, but in comparison to the number of rooms there is no
measurable distance between a three- and a four-star hotel. The only thing we
know is that the four-star hotel is hopefully of better quality.
2. The nominal and ordinal scales are used to label qualitative statements about
statistical units.
3. The ratio scale and the absolute scale are metrical scales.
4. To avoid information loss, the inherent order has to be preserved and the
function has to be unique for a transformation function.
5. Minimum and maximum can be calculated without additional assumptions.
6. Yes, they are used to record the values of the variables of interest qualitatively
or quantitatively.
7. No, it’s wrong.
8. Yes, this is a dichotomous variable.
9. Yes, variables with currency are always quasi-continuous.
10. Yes, dichotomous variables are discrete.
11. No, it’s wrong. It’s ordinal scaled.
12. No, it’s wrong.
13. Yes, this variable is continuous.

Exercise 4: “Car Characteristics and Variable Measurements”


Theory discussed in section Section 3.1.2

Figure 3.3 shows the solution in the column “Measurement”. We also explain
the correct scale type in Table 3.2.

Fig. 3.3 Scale of measurement of variables of the “car_sales_modified” dataset

Table 3.2 Scale type of the variables included in the “car_sales_modified” dataset
Name of the variable Scale type Explanation
manufact Nominal This variable represents the name of the manufacturer.
The values represent “groups” of cars that can have no
implicit order. Additionally, the values are discrete.
type Ordinal This variable is also discrete but the values can be
ordered.
fuel_cap Continuous Between two very close values for the fuel capacity of a
car, it is always possible to find a theoretical value that
can exist. Additionally, it is possible to measure the
fuel capacity with infinite precision.
sales Continuous The variable is discrete because between two values,
e.g., 1,000 and 1,001, there can be no third value. Nevertheless, we can say that ratios make sense. So sales of 2,000 are double 1,000. In this case, we assign
the type “continuous” to the variable. See also the
detailed explanation in Sect. 3.1.1.
model Nominal These values are once more just names that cannot be
ordered.
resale, price, Continuous The values of these variables can also be measured with
horsepower, width, an infinite precision.
length

3.2 Simple Data Examination Tasks

3.2.1 Theory

Using the correct data analysis tool and method depends on the determination of the
scale of measurement of the variables included in the dataset. We discuss the
different scales of measurement and their characteristics in Sect. 3.1. Figure 3.4
now shows the different steps in a univariate analysis, depending on whether the scale is discrete or continuous. In the case of a continuous variable, the researcher must define classes and bin the values before creating a frequency table. In the following sections, we will show example datasets and the appropriate methods to apply.

3.2.2 Frequency Distribution of Discrete Variables

Description of the model


Stream name Distribution discrete values
Based on dataset tree_credit.sav
Stream structure

Related exercises: 2, 3, 4, 6, 7

Fig. 3.4 Steps of an analysis dependent on the scale type



Theoretical Background
In Sect. 3.1.1, we learned to distinguish between discrete and continuous variables. In
Sect. 3.1.2, we discussed a procedure for determining the correct scale type. It is
easy to understand that nominally and ordinally scaled variables are always dis-
crete. The reverse, however, is not always the case, as discrete values, e.g., the
number of car accidents, can be metrically scaled.
A suitable example of a discrete variable is included in the dataset “tree_credit”
and is analyzed here: the variable, “number of credit cards”, describes the number
of credit cards owned per person and is coded as follows:

1. . . . less than 5 credit cards


2. . . . 5 or more credit cards.

For more details see Sect. 10.1.33.

Begin to Create A Stream: Adding a Data Source and Defining the Scale Type
Now we describe how to add a data source as well as how to assign the correct scale
type to the variables. We will start with an empty stream.

1. We open a new stream by using the shortcut “Ctrl+N” or the toolbar item “File/
New”.
2. We save the file to an appropriate directory.
3. The source file used here is an SPSS data file with the extension “.sav”. We add a node of type “Statistics File” from the Modeler tab “Sources” and double-click on it to open its settings. We define the folder and the filename “tree_credit.sav” and confirm the modification with the “OK” button.
4. Next, we add a Type node from the tab “Field Ops”. As outlined in Sect. 2.1, we
activate the Source node by clicking on it once and then we click on the new
Type node. Both nodes are now connected in the correct order.
5. To open the settings of the Type node as shown in Fig. 3.5, we double click on
this node. In the second column of the settings of this node, we can see that the
Modeler automatically inspects the source file and tries to determine the correct
scale type.
Usually, the Modeler will determine the correct scale types of the variables,
but nonetheless, we have the chance to change this scale type with an additional
Type node. We suggest adding this node to each stream.
The dataset “tree_credit.sav” includes the definition of the scale types, but this definition is incorrect. So, we should adjust the settings in the column “Measurement” as shown in Fig. 3.5. It is especially important to check the variables that are to be analyzed with a node called “Distribution”: such a variable has to be defined as discrete. This is correct for the variable “Credit_cards”.
6. We can close the dialog window without any other modification and click “OK”.

Fig. 3.5 Parameter of the Type node

" We strongly suggest adding a Type node to each stream right after
the Source node. The Type node has several functionalities, such as:

1. To inspect and to modify the scale type of a variable (see Sect. 3.1).
2. To define or to modify the value labels (see Sect. 2.3).
3. To disable a variable by using the option “none” in the column.

Using the Distribution Node to Plot a Bar Chart


After building up this part of the stream, we can now analyze the given dataset. It is
always a good idea to do this first analysis using diagrams. In the case of our
ongoing example, we would like to inspect the variable “Credit_Cards” with an
appropriate chart. Since the variable is discrete, we should use a bar chart. For the
relationship between scale types and chart types, see also Exercise 2 in Sect. 3.2.8.

1. We add a node of type “Distribution” from the “Graphs” section to the stream.
2. We connect this node with the Type node. Figure 3.6 shows the current stream.
Up to now, a question mark can be found beneath the Distribution node. Obviously, we still have to define the variable that should be visualized with the
Distribution node.
3. To define the target variable, we double-click on the Distribution node and select
“Credit_cards” in the first dialog box as the variable of interest (see Fig. 3.7). We
can also modify the other parameters that influence the diagram.

Fig. 3.6 Stream “Distribution discrete values” before the selection of the target variable

Fig. 3.7 Parameter of the Distribution node

4. Finally, we click “OK” and we will get the stream shown in Fig. 3.8.
5. We use the button “Run” in the upper toolbar. A new window will appear that
shows the frequency distribution of the variable “Credit_cards” (see Fig. 3.9).
6. The number of credit cards is a discrete variable. In the dataset, a name for each
category is defined (see the description in Fig. 3.9). If we want to see the data
labels instead of the values, we should use the button “Display field and value
labels”, which is included in the middle of the window’s toolbar and marked with
an arrow in Fig. 3.9.
7. Furthermore, in the tab “Graph” of the dialog window, we can find a more
detailed diagram of the frequency distribution as shown in Fig. 3.10.

Fig. 3.8 Final stream “Distribution discrete values”

Fig. 3.9 Frequency distribution of variable “Credit_cards” (Table)

Fig. 3.10 Frequency distribution of variable “Credit_cards” (Graph)

8. This is the result of the graphical analysis using a bar plot, and we can close the
window.

" If the axis annotations in a diagram should be modified, the values


labels in the Type node of the stream should be modified (see
Sect. 2.3).

3.2.3 Frequency Distribution of Continuous Variables

Description of the model


Stream name Distribution continuous values
Based on dataset tree_credit.sav
Stream structure

Important additional remarks


It is important to assign the scale type to each variable correctly when the Histogram node is
used. In the settings of this node, only variables that are defined as metrical can be selected.
This means that if a variable that should be inspected with a Histogram node does not appear,
the scale type definition of this variable in the Type node should be modified and set to
continuous (metrical).
Related exercises: 2, 6, 7

In general, discrete variables are used to reduce the amount of information and to
get a better overview. Variables with continuous values are used in order to obtain
more information (see Sect. 3.1.1 for more details).
Two types of charts have to be distinguished when analyzing values. A bar
chart can be used when we have discrete or categorical values to analyze, e.g., the
number of credit cards as shown in Sect. 3.2.1.
If a variable is continuously scaled, however, this approach won’t work. This is
because of the huge/infinite number of values the variable can take on. The results
would then normally show a frequency of one for each value, and so the bar chart
would give unsatisfactory results, since each bar would have the same height, and
the distribution of the variable would be unidentifiable by the graph.
To handle this problem and to create a chart that we can interpret, the typically
used method is to split the co-domain of the continuous variable into intervals and
to determine the frequency of the values in each of these subintervals. This transforms
the continuous variable into a discrete one, and then a bar chart can be drawn as
described in the previous subsection. The SPSS Modeler offers different methods
for this procedure. At the end of the classification process, using equidistant classes
(intervals with the same length), a special kind of bar chart called a histogram will
be produced. Based on the stream template “credit_cards”, we will demonstrate
how to use the SPSS Modeler for this kind of graphical analysis. Interested readers
are referred to Anderson et al. (2014) for more details in using histograms.
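The classification principle behind a histogram can also be illustrated quickly outside the Modeler, e.g., in R, which we use later for the Box–Cox transformation anyway. The following minimal sketch assumes the package “haven” for reading the SPSS file and that the age variable in “tree_credit.sav” is named “Age”; it is only meant to show the idea of counting values per interval, not the Modeler’s exact procedure.

  library(haven)                            # read_sav() for SPSS files (assumption)

  credit <- read_sav("tree_credit.sav")     # read the SPSS data file
  age    <- as.numeric(credit$Age)          # variable name is an assumption

  # split the continuous variable into equal-width intervals and plot the counts
  hist(age,
       breaks = 10,                         # approximate number of equidistant classes
       main   = "Histogram of Age",
       xlab   = "Age in years")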

1. We open the template stream “credit_cards” and save it under another name.
Alternatively, a Statistics File node can be added to an empty stream, and the
“tree_credit.sav” file should be defined as the data source. Finally, we add a
Type node and connect it with the Statistics File node.

In both cases, we get a stream as shown in Fig. 3.11 with the settings in the
Type node as shown in Fig. 3.12.
2. We focus on the scale type of the variable “Age”. The settings of the Type node
in Fig. 3.12 shows us that the variable is continuous.
3. Now, we add a “Histogram” node to the stream, from the graph section of the
toolbar.
4. We connect the Type node with the Histogram node.
5. Figure 3.13 shows the Histogram node at the end of the stream on the right-
hand side. Within it, the variable that will be graphically analyzed must be
defined. For this purpose, we double-click on this node and select the variable
age as shown in Fig. 3.14.
We have to pay attention to the fact that in this node only metrical values can
be selected! If the scale type is not appropriately defined in the Type node, then
the variable will not appear in the drop-down list of the Histogram node.

Fig. 3.11 Stream as defined in template “Credit_cards”

Fig. 3.12 Type node settings



Fig. 3.13 Stream “Distribution continuous values” before selection of the target variable

Fig. 3.14 Histogram node settings

Fig. 3.15 Final stream “Distribution continuous values”

We click “OK” and confirm the settings in the node.


6. The stream is now ready to analyze the variable “age” (see Fig. 3.15).
To run the stream, we click the “Run” button in the toolbar of the Modeler.
After binning the values automatically, the histogram in Fig. 3.16 appears. The
values start from the age of 20 years, and the distribution has a longer right tail.
The dialog window can now be closed.

Fig. 3.16 Histogram of variable “Age”

3.2.4 Distribution Analysis with the Data Audit Node

Description of the model


Stream name Distribution analysis using Data Audit node
Based on dataset tree_credit.sav
Stream structure

Related exercises: 5, 9

With the Distribution node and the Histogram node, only one variable can be
analyzed at a time. As shown in Sect. 3.2.3, the definition of the scale type is
important for getting correct results. To sum up, when using these nodes a lot of
stream parameters have to be defined just to inspect the shape of a single
distribution. For a powerful data-mining tool, this is a rather inefficient way of
condensing the information included in a larger dataset. The Data Audit node helps
us to reduce this overhead and to condense the information much faster. Here, we
present how this node can be used.

1. Open the template stream “credit_cards” and save it under another name, e.g.,
“Distribution analysis using Data Audit node”. Alternatively, create a stream
based on the dataset “tree_credit.sav” as demonstrated in Sect. 2.1.
Either way, we get a stream as shown in Fig. 3.17 with the settings in the Type
node as shown in Fig. 3.12.
2. Now, we add a Data-Audit node from the output section of the toolbar. We
connect the Data Audit node to the existing Type node. Immediately, the
Modeler scans the data source for the number of included variables that can be
inspected by this node. Here, there are six variables. The number of variables is
displayed under the Data Audit node (see Fig. 3.18).
3. To become familiar with this type of node, we will discuss the different options.
To open the dialog window we double-click on the Data Audit node. In the
dialog window, we enable the option “Advanced statistics” as shown in
Fig. 3.19. This option gives us the chance to select certain more detailed
measures. They are then shown in the final analysis (see Fig. 3.21). Additionally,
we suggest enabling the option to calculate the median and mode. We confirm
and finish the modifications with “Run”.
4. Figure 3.20 shows the result of the analysis for dataset “tree_credit”. Besides the
shape of the distribution, a lot of useful measures can be inspected and
interpreted. To reduce the number of measures, we can use the button “Display
statistics”, marked with an arrow in Fig. 3.20.
Figure 3.21 compares the measures offered with and without the option
“Advanced statistics” in Fig. 3.19 enabled. It will be part of Exercise 7 to explain

Fig. 3.17 Stream defined in template “Credit_cards”

Fig. 3.18 Stream “Distribution analysis using Data Audit node”



Fig. 3.19 Settings of the Data Audit node

Fig. 3.20 Data Audit node—results for dataset “tree_credit”

the calculated measures in your own words. Finally, we can close the dialog
window with “OK”.
5. We can see another interesting feature of the Data Audit node by defining a
variable as a target variable in the Type node. The variable “Credit_rating” can
take on two possible values (0 = bad and 1 = good) as shown also in Fig. 3.20.
For demonstration purposes, we define this variable as the target variable in the

Fig. 3.21 Comparison of the measures offered by the Data Audit node—advanced statistics on
the right-hand side

Fig. 3.22 Type Node settings modified



Fig. 3.23 Data Audit node with defined target variable in Type node

Type node (see Fig. 3.22). If we now open the Data Audit node once again we
can find stacked bar charts which allow us to determine the proportion of bad and
good ratings (see Fig. 3.23).

" The Data Audit node offers a chart of the frequency distribution of
each variable as well as a wide range of statistical measures of central
tendency, volatility, and skewness.

" It is important to define the scale type of each variable correctly, so


the Data Audit node applies the proper chart (bar chart or histogram)
to each variable. For discrete variables, the SPSS Modeler uses a bar
chart, whereas for continuous/metrical variables, the distribution is
visualized by a histogram.

" A more detailed analysis can be created by defining one variable as a


target variable. Then the Data Audit node produces stacked bar
charts that allow determining the proportion of subgroups
depending on specific variable values.

3.2.5 Concept of “SuperNodes” and Transforming a Variable to Normality

Description of the model


Stream name transform_diabetes.str
An additional R-script to find the optimal Box–Cox transformation:
“transform_diabetes_data.R”.
Based on dataset diabetes_data_reduced.sav
Stream structure

Related exercises: 8, 9, 11

Theory
Many algorithms for creating statistical models need normally distributed variables
to produce reliable results. Otherwise, either the algorithms can’t determine the
correct solution or the goodness-of-fit statistics are imprecise. For instance, we can fit
a linear regression model on non-normally distributed variables, but the hypothesis
tests for the parameters, as well as their confidence intervals, are then inaccurate. See
Aczel and Sounderpandian (2009, p. 448) for the regression model and Zimmerman
(1998) for the influence of non-normality on parametric and nonparametric tests.
Dealing with data from real, practical backgrounds often means having to
transform them to normality, or at least towards more normally distributed variables.
In this section, we will show how to analyze data and how to determine the best
transformation.
Normal distributions are represented by the typical bell-shaped curve. On
the one hand, Fig. 3.24 shows two examples with different expected values of 40 and
65, but the same standard deviation of 3; consequently, the range of the values, i.e., the
width of the two distributions, is equal. See also the 3s-rule mentioned later in Sect. 3.2.7.
On the other hand, Fig. 3.25 shows an example of two distributions with the same
expected value of 55 but different standard deviations of 3 and 13. It is important to
note that the probability density is represented on the y-axis. The area under a curve,
between minus infinity and a value x, represents the probability of observing a value of at most x.
Each normal distribution can be described by its expected value μ and its
standard deviation σ. The skewness of the curve is always zero because the curve
is completely symmetrical. In general, distributions are called left-skewed or negatively skewed
if they have outliers or a longer tail on the left, and vice versa.
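For reference, the density of a normal distribution with expected value μ and standard deviation σ, and the probability interpretation of the area under the curve mentioned above, can be written as follows (a standard textbook formula, not specific to the Modeler):

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \qquad P(X \le x) = \int_{-\infty}^{x} f(t)\, dt$$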

Fig. 3.24 Normal distributions with μ = 40 or μ = 65, but the same standard deviation σ = 3

Fig. 3.25 Normal distributions with standard deviations σ = 3 and σ = 10, but the same μ = 55

The link between the frequency distribution of variables and the probability
density curve is given by the law of large numbers. We want to approximate a
frequency distribution by a normal distribution whose expected value equals the
sample mean and whose standard deviation equals the sample standard deviation;
but this only works if the other characteristics of the distribution, e.g., a skewness
of zero, meet the assumptions.
If a distribution has a skewness other than zero, however, we can try to transform
it to a more symmetrical shape. Table 3.3 shows common transformations such
as inverse, log, square root, and other power transformations. The different
exponents in a transformation have a massive impact on the result. For instance,
the log and the square root behave differently between 0 and 1, in comparison with
what happens for values larger than 1. Neither function is defined for values smaller than zero.

" A transformation replaces the original variable by a function of this


variable. The transformation of data is necessary to satisfy the
assumptions of several data-mining algorithms.

Table 3.3 Common transformations

Exponent n of x^n    Transformation function
−2                   x^(−2) = 1/x^2
−1                   x^(−1) = 1/x
−0.5                 x^(−0.5) = 1/√x
0                    log_e x = ln x
+0.5                 x^(+0.5) = √x
+1                   x
+2                   x^2
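As a small illustration of the different behaviors mentioned above, the following R sketch (plain base R, no extra packages) applies some of the transformations of Table 3.3 to a few made-up example values below and above 1:

  # The transformations of Table 3.3 behave quite differently for values
  # between 0 and 1 than for values larger than 1.
  x <- c(0.1, 0.5, 1, 2, 10)

  cbind(x,
        inverse = 1 / x,      # x^(-1)
        ln      = log(x),     # natural logarithm, negative for x < 1
        sqrt    = sqrt(x),    # x^(+0.5)
        square  = x^2)        # x^(+2)
  # Neither log() nor sqrt() is defined for negative values.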

" The disadvantage when transforming variables is that their interpre-


tation becomes more complicated than for the original values. If a
target variable must be transformed, the result of the calculations
must be retransformed back to the units of the original variable.

" Normally distributed values are not equal to standardized


values. Standardized values are calculated for variables with different
dimensions/units, and make it possible to compare them.
Standardized values represent the original values in terms of the
distance from the mean in standard deviations, whereas the nor-
mally distributed data are transformed so the shape of the distribu-
tion equals a bell-shaped curve.

Box–Cox Transformations
Finding the correct transformation can sometimes be challenging. Tukey (1957)
introduced a family of power transformation functions, later improved by Box and
Cox (1964), which covers all these cases. We want to outline some details here, so
that we are familiar with the bigger picture of statistical theory.
The Box–Cox transformation as a family of functions can be denoted as:
$$T(x) = \begin{cases} \dfrac{x^{\lambda} - 1}{\lambda} & \text{if } \lambda \neq 0 \\ \log_e x & \text{if } \lambda = 0 \end{cases}$$

As we can see, this transformation also covers the functions presented in Table 3.3,
and as summarized also in Table 3.4.
The SPSS Modeler offers the Transform node for visualizing the original
variables and the variables after different transformations. To find the optimal
value for the Box–Cox transformation, other statistical software packages offer
specific functions. The interested reader is referred to the R package “car”; in
Fox and Weisberg (2015, p. 106), documentation for the function “powerTransform”
can be found. In the following example, we will show how this
procedure works and how it selects the best λ, based on the log-likelihood profile.
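A minimal R sketch of this estimation, assuming the packages “car” and “haven” (the latter only to read the SPSS file) are installed; the file and variable names follow the diabetes example used below and are assumptions about the dataset:

  library(car)      # provides powerTransform() and bcPower()
  library(haven)    # read_sav() for SPSS files

  diabetes <- read_sav("diabetes_data_reduced.sav")
  insulin  <- as.numeric(diabetes$serum_insulin)   # must be strictly positive for Box-Cox

  # estimate the optimal Box-Cox lambda via the log-likelihood profile
  pt <- powerTransform(insulin)
  summary(pt)                          # estimated lambda and its confidence interval

  # apply the Box-Cox transformation with the estimated lambda
  insulin_bc <- bcPower(insulin, coef(pt))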

Table 3.4 Different values of the Box–Cox λ and their meaning

λ        Meaning
−2       Inverse square
−1       Inverse
−0.5     Inverse square root
0        Natural logarithm
+0.5     Square root
+1       No transformation

An R-script for transforming the variables can be found as “R_transform_diabetes_data.R”
in the “streams” folder of this book. Additionally, we will show in
Sect. 9 how to use an R Building node to add this Box–Cox transformation function
to the SPSS Modeler. Here, we will use the stream “R_transform_diabetes”.

" The Box–Cox power transformations are a family of transformation


functions for moving data towards normality. The user should try to
estimate the optimal transformation by using the Transform node in
the SPSS Modeler or other statistical programs such as R (see also
Sect. 9).

Transforming Variables to Normality


The SPSS Modeler offers the Transform node for visualizing the original variables
and the variables after different transformations. We can use this node to identify a
useful transformation. Then we can automatically produce nodes to implement this
transformation in a stream. Here, the concept of SuperNodes in the SPSS Modeler is
used to incorporate functionalities that would normally be implemented in more
than one node. In this way, a SuperNode encapsulates and accumulates several
functions in one node.
We will use the diabetes dataset here. This dataset represents the results of a
study of the Pima Indian population, conducted by the National Institute of Diabetes
and Digestive and Kidney Diseases. The Pima Indians are affected by higher
rates of diabetes and obesity (see Schulz et al. 2006). A detailed description of the
meaning of the variables can be found in Sect. 10.1.8.
Based on the diabetes dataset, we will show how to analyze and transform
variables, as well as how to implement the transformation in the stream
automatically, by using the SPSS Modeler’s concept of SuperNodes.

1. We open the template stream “Template-Stream_Diabetes” and save it with


another name. Figure 3.26 shows the initial stream.
2. The stream offers the option of scrolling through the records using the Table node
or assessing the frequency distribution with the Data Audit node. As we can see
from the Table node results in Fig. 3.27, the last variable shows us the medical test
result. The variable is binary with the values “0 = tested negative for diabetes” and
“1 = tested positive for diabetes”. The variable is also defined as the target variable
for data-mining models, e.g., cluster analysis, in the Type node. This is why the
frequency distributions in Fig. 3.28 are shown as stacked bar charts.

Fig. 3.26 Initial template stream “Template-Stream_Diabetes”

Fig. 3.27 Sample records from “diabetes_data_reduced.sav”

We now want to focus, however, on the five variables “glucose_concentration”,
“blood_pressure”, “serum_insulin”, “BMI”, and
“diabetes_pedigree”, shown in Fig. 3.28. The skewness of the frequency
distributions is shown in the third column from the right and highlighted in
Fig. 3.28 with an arrow. All distributions are positively skewed except
“blood_pressure”. The log and the root transformation should not be applied
here, because the diabetes pedigree function shown at the bottom, with its range
from 0.085 to 2.420, has values smaller than 1. We mentioned in the introduction
the different behaviors between 0 and 1, in comparison with what happens for
independent variables larger than 1.
3. To find a good power transformation, we need to add a Transform node from the
Modeling tab of the SPSS Modeler. We connect it with the Type node (see
Fig. 3.29).

Fig. 3.28 Frequency distributions in the Data Audit node

Fig. 3.29 A transform node is added to the template stream

4. We double-click the new Transform node to open the dialog window. In the
Fields tab, we add the variables “glucose_concentration”, “blood_pressure”,
“serum_insulin”, “BMI”, and “diabetes_pedigree” as also shown in Fig. 3.30
with an arrow. Figure 3.31 shows the result.
5. We activate the Options tab. Here, we have the chance to run all possible
transformations or to just select and modify some of them (see Fig. 3.32).
The Modeler offers:

– Inverse 1/x
– Log n, equal to the natural logarithm log_e x
– Log 10, equal to log_10 x (no Box–Cox power transformation function, but similar results to log n)
– Exponential, equal to e^x (no Box–Cox power transformation function)
– Square root, equal to √x

Fig. 3.30 Adding variables to transform

Fig. 3.31 Variables are added in the Fields tab of the Transform node

Fig. 3.32 Options tab in the Transform node

Adding an offset would mean substituting the original variable x by
x + offset. As we saw in the theoretical introduction to this chapter, this option
can be helpful for moving the values outside the interval 0 to 1 before
transforming them. For now, we do not modify the options. The Modeler should
apply all possible transformations to show us the results.
6. In the Output tab, we can define a name for the output; in the annotations tab, we
can determine a name for the Transform node itself. We do not want to modify
these options here. We click on “Run” to see the results, as shown in Fig. 3.33.
7. In the second column of Fig. 3.33, the Transform node shows the current
distribution of the variable. Under the curve, we can find the mean and the
standard deviation. If we double-click on one of these charts, a new window
appears that shows us the details of the frequency distribution selected and an
added normal curve (see Fig. 3.34).
For the curves, both the mean and the standard deviation of the selected
variable are calculated. The SPSS Modeler determines the expected frequencies
and draws the bell-shaped curve. By assessing the deviation of both curves, we
have the chance to decide whether the distribution is normal or not. We can close
this window.

" The Transform node helps assess the frequency distribution of the
variables. The user can decide if the original variable is approximately
normally distributed. The histogram as well as the mean and the
standard deviation will be calculated and shown.

Fig. 3.33 Transform node results

Fig. 3.34 Histogram of “glucose_concentration” with normal curve

" The following transformations can be applied to the variables:

– Inverse 1/x
– Log n equals natural logarithm logex
– Log 10 equals log10x
– Exponential equals ex (no Box–Cox power transformation function)
pffiffiffi
– Square root equals x

" Transformed data are also not normally distributed, but in general the
logarithm of the original values helps to move the distribution
towards normality. So if an algorithm assumes there are normally
distributed values, the user is recommended to test the logarithm
of values instead of the original variable itself.

" To calculate the skewness of the distribution of the original variable, it


is recommended to add a Data Audit node to the stream. The SPSS
Modeler does not provide statistical tests such as Shapiro Wilk to
assess whether the distributions are normal or not.
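Since the Modeler itself offers no normality test, such a check can be done outside the Modeler, for instance with the following R sketch; the package “haven” for reading the SPSS file and the variable name are assumptions based on the diabetes example:

  library(haven)

  diabetes <- read_sav("diabetes_data_reduced.sav")
  insulin  <- as.numeric(diabetes$serum_insulin)

  shapiro.test(insulin)           # test of the original values
  shapiro.test(log10(insulin))    # test of the log 10-transformed values
  # A small p-value means the null hypothesis of normality has to be rejected.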

8. Starting at the third column in Fig. 3.33, we can see the charts of the frequency
distributions, depending on the transformations mentioned above. Here, we can
decide which transformation to use, to shift the distribution of the original variable
more towards a normal curve. To see more details, we can also double-click on
one of those diagrams. A window similar to that in Fig. 3.34 appears.
9. Assessing the distributions, we find that in particular “serum_insulin”, “BMI”,
and “diabetes_pedigree” do not have a bell-shaped form. The second column of
Table 3.5 shows the result of the Shapiro Wilk test.
The null hypothesis of the test is that the values come from a population with
normally distributed values. The stated significance/p-value is the probability,
assuming that this null hypothesis is true. In the result, none of the original
variables are normally distributed.
For all variables except “blood_pressure”, however, we can see that the
log-transformed values look much better. As the log 10 transformation leads
to a smaller standard deviation, we prefer it to the natural logarithm.
We select the identified transformations by clicking once on the corresponding distributions.
As we see in Fig. 3.35, the name of the transformation, as well as the histogram
of the transformed values, appear in the first column.
The fourth column of Table 3.5 shows us that not all of these transformations will
result in normally distributed variables, but other transformations cannot be determined,
even if we use the automated Box–Cox transformation implemented in R.

Creating SuperNodes to Implement the Transformation of a Stream


So far, we have identified transformations of the variables included in the diabetes
dataset that are probably helpful. The Transform node helps us to assess the
distribution of the original variables, as well as the histogram of the different
transformations. We selected the best transformations and they now appear in the
first column of the Transform node. This is illustrated in Fig. 3.35.
We now want to use the Transform node to implement the transformations of the
variables in the stream. We could use four Derive nodes, but we would have to add
them manually and we would then have to define the formulas. Using the Transform
node is much more practical. As we will see, the Modeler produces one SuperNode
that encapsulates all these transformations in one place.
Table 3.5 Assessment and transformation results

Variable name | Original variable: significance of Shapiro Wilk’s normality test | Transformation used in the Modeler’s Transform node | Transformed variable: significance of Shapiro Wilk’s normality test | Lambda identified for Box–Cox with R | Significance of Shapiro Wilk’s normality test for the Box–Cox lambda
glucose_concentration | Non-normally distributed, 0.000 | Not transformed | Non-normally distributed, 0.000 | 0.06295239 | 0.002
blood_pressure | Non-normally distributed, 0.009 (K–S test 0.053) | log n or log 10 | Non-normally distributed, 0.000 | 1.181447 | 0.014
serum_insulin | Non-normally distributed, 0.000 | log n or log 10 | Normally distributed, 0.259 | 0.04507211 | 0.377
BMI | Non-normally distributed, 0.000 | log n or log 10 | Normally distributed, 0.063 | 0.1490053 | 0.089
diabetes_pedigree | Non-normally distributed, 0.000 | log n or log 10 | Normally distributed, 0.212 | 0.044299 | 0.266

Fig. 3.35 Transformations are selected in the Transform node

Fig. 3.36 Creating a SuperNode to transform many variables at once (step 1)

1. Our stream looks as depicted in Fig. 3.29. Additionally, as shown in Fig. 3.35,
a dialog window appears on the screen and the different transformations for the
variables are determined. Now we can use the toolbar item “Generate” and
select “Derive Node” (Fig. 3.36).
2. A dialog window appears that lets us choose either to transform the values or to
transform and standardize the values at the same time. Here, we use the first
option, as shown in Fig. 3.37. For an explanation of the standardization
procedure, see Sect. 2.7.7. We click on “OK”.
3. The dialog window in Fig. 3.37 disappears, but a SuperNode is added automat-
ically to the stream by the SPSS Modeler (Fig. 3.38).
4. To evaluate the SuperNode, we activate it by clicking it once.

Fig. 3.37 Creating a SuperNode to transform many variables at once (step 2)

Fig. 3.38 Stream with added SuperNode

5. We can now have a look at the details by clicking on “Zoom into SuperNode”
in the toolbar, shown in Fig. 3.39.
6. The Modeler shows the different nodes encapsulated in the SuperNode (see
Fig. 3.40). Furthermore, we can navigate between the SuperNode and the
stream by activating them in the Streams tree, to the right of the Modeler’s
window. This is illustrated in Fig. 3.41.
7. In the Transform node, we defined four transformations. They are now
implemented in the SuperNode, using one Derive node for each transformation.
If we want to verify the formula used in the Derive nodes, we can double-click
on them (see Fig. 3.42).
8. We can close the dialog window in Fig. 3.42 and navigate back to the stream by
activating the stream in Fig. 3.41.

Fig. 3.39 SuperNode details symbol in the Modeler’s toolbar

Fig. 3.40 Nodes included in the SuperNode

Fig. 3.41 Navigating between Streams and SuperNodes

9. We must connect the SuperNode to the rest of the stream. To do this, we


activate the Type node and press F2. Then we click on the SuperNode.
10. Finally, we can add a Table node and a Data Audit node to the SuperNode,
which enables us to assess the results of the transformation process. In the
Data Audit node, we can see thirteen variables. These are the nine original
variables plus the four transformed variables. They are named with the
extension “_Log10”, as shown in Fig. 3.43.

" SuperNodes help the user to encapsulate several nodes into one.
SuperNodes can be created using the option “Create SuperNode”
that is available when more than one node is marked. The option
then appears in the dialog field, after a right click with the mouse.

" SuperNodes can be helpful, especially in larger streams. The disad-


vantage is that the stream is less transparent (Fig. 3.44).

Fig. 3.42 Derive node parameters for “glucose_concentration”

Fig. 3.43 Data Audit node with the transformed variables



Fig. 3.44 Final stream with a SuperNode

Summary and Outlook of How to Use Box–Cox Transformation


We used the Transform node to calculate new values that are shifted towards a
(more) normally distributed shape. The Transform node helped us to assess the
original values and to determine transformation functions. Unfortunately, the SPSS
Modeler does not provide any normality test, such as Shapiro Wilk. As outlined in
column 2 of Table 3.5, the original variables are not normally distributed.
Transforming variables doesn’t always lead to normally distributed values
anyway, however. The test results in column 4 of Table 3.5 show that the
transformed values of “glucose_concentration” are also not normally distributed.
As we mentioned at the beginning of the theoretical section, the transformations
offered by the Modeler are often part of the Box–Cox power transformation
function family. The difference between using the base e for the natural logarithm
or base 10, which is not part of the Box–Cox family, is insignificant. The user often
wants to determine the optimal lambda for the Box–Cox transformation function, to
shift the values towards normal distribution. This is to reduce the difference
between the realized frequencies and the expected frequencies, in cases of normally
distributed values.
Some software packages offer functions to determine the optimal exponent for
a Box–Cox transformation. We implemented the assessment and transformation
of the diabetes dataset variables in R. The interested reader can find the R-script
“transform_diabetes_data.R” in the streams folder. This script determines the
optimal Box–Cox lambda for all five variables discussed in this section. Alternatively,
the stream “R_transform_diabetes” can be used; here, the R script is
implemented in an R Building node.
Figure 3.45 shows an example. Here, the optimal value for lambda is determined
as 0.04507. As we know from the introduction, the Box–Cox transformation for
λ = 0 equals the natural logarithm. As the determined value is not far away from
zero, the transformation used in the Modeler is a good estimation of this result. This
would also be true if we used log 10 instead.

Fig. 3.45 Log-likelihood profile of “Serum Insulin”

Fig. 3.46 Histogram of the original “Serum Insulin” and the transformed “Serum Insulin”

Using the determined λ in the Box–Cox-function, we transform the distribution


as shown in Fig. 3.46. The result can be assessed in the Q–Q-plot in Fig. 3.47.
The λ values for the other variables can be found in the fifth column of Table 3.5.
The transformation is better than in the Modeler, but as shown in the last column
of this table, not all of the transformed values are normally distributed either. The
script also provides the Kolmogorov–Smirnov and Shapiro Wilk test results for
normality.
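The transformation and the Q–Q plot behind Figs. 3.46 and 3.47 can be reproduced with a few lines of R; the lambda of about 0.045 is the value determined for “serum_insulin” in Table 3.5, and the packages “car” and “haven” are again assumptions:

  library(car)
  library(haven)

  diabetes <- read_sav("diabetes_data_reduced.sav")
  insulin  <- as.numeric(diabetes$serum_insulin)
  lambda   <- 0.045                      # value determined for serum_insulin (Table 3.5)

  insulin_bc <- bcPower(insulin, lambda) # Box-Cox transformation

  # compare the original and the transformed values with the normal distribution
  par(mfrow = c(1, 2))
  qqnorm(insulin, main = "Original serum insulin");         qqline(insulin)
  qqnorm(insulin_bc, main = "Box-Cox transformed insulin"); qqline(insulin_bc)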

Fig. 3.47 Q–Q plot of the original “Serum Insulin” and the transformed “Serum Insulin”

Another application of the Box–Cox transformation, as well as an R example,


can be found in Chapman and Feit (2015, pp. 102–104).

3.2.6 Reclassifying Values

Description of the model


Stream name reclassify IT_user_satisfaction single
Based on dataset IT_user_satisfaction.sav
Stream structure

Related exercises: 10

Fig. 3.48 Data analysis depending on the scale type

Theory
Figure 3.48 shows the differences between procedures used to analyze discrete or
continuous values in general. In this section we will show how to reduce the number
of values on the scale of a discrete variable.
When talking about data and its analysis, we have to distinguish between
discrete and continuous variables, as described in detail in Sect. 3.1.1. The number
of different discrete values can generally be infinite. Nevertheless, a small gap can
always be identified between two, even very close, values. Typical examples of
these types of data are variables based on questionnaires. Respondents are often
asked about their satisfaction with a specific product or the characteristics of an
object. We can consider that the respondents often tend to the values in the middle
of a scale and avoid giving positive or negative answers. That’s because the high
and low ends of the scales are less-used and underrepresented. Therefore, it makes
sense to combine the frequencies of two directly adjacent values, e.g., excellent and
good. In theory, we could simply add the frequencies, but if we would like to use
this information in a stream and in different ways, we should understand how to
transform the values using a so-called “reclassification”.

Reclassify Values
The dataset “IT_user_satisfaction.sav” represents the opinions of IT users in a
particular firm. 180 users were asked to assess the quality of a specific IT system.
For details, see the dataset description in Sect. 10.1.20. As an example of the
reclassification/recoding procedure, we would like to analyze the frequency distri-
bution of a variable. After that we want to reclassify the values.

1. We start with the template stream “Template-Stream IT_user_satisfaction.str”


and open it (see Fig. 3.49). Additionally, Figs. 3.50 and 3.51 show some
records. We should remember to switch the button in the middle of the dialog

Fig. 3.49 Template stream “IT_user_satisfaction”

Fig. 3.50 Records of dataset “IT_user_satisfaction” without labels

window from the values to the labels of the values. In Fig. 3.51, the button is
marked with an arrow (see also Sect. 2.3).
2. Obviously, the variables are discrete. In order to have the opportunity to
reclassify the values, the type “discrete” has to be assigned to the variables.
To do this, we should use the Type node shown in the template stream in
Fig. 3.49. We double-click on the Type node, and the scale types as shown in
Fig. 3.52 should be assigned in the second column “measurement”.

" Reclassification means to modify the scale values/coding of nominal


or ordinal variables.

Fig. 3.51 Records of dataset “IT_user_satisfaction” with labels

Fig. 3.52 Variables and their scale type in “IT_user_satisfaction”

" Often the SPSS Modeler does not determine the scale type correctly
using its internal procedures. Therefore, a Type node should be
included in the stream, right after the Source node. The user should
check and possibly modify the scale type settings!

" After a Reclassify node always a additional Type node should be


implemented to assign the correct scale to the new variable.

3. The first column in Fig. 3.51 shows some values for the variable “starttime” that
represent the satisfaction of respondents with the time the IT system needs to
start up and be ready for login. To get a better overview, we can use a Data
Analysis node for example and connect it with the Type node (see Fig. 3.53).
4. We double-click on the Data Analysis node. In Fig. 3.54, we get a rough
overview of the distributions.
5. To have the chance to assess the distribution in detail, we double-click on the
first small diagram in Fig. 3.54. In the new window, we can assess the details of
the distribution.
6. As we can see in Fig. 3.55, the number of respondents that use the option
“poor” to characterize the start time of the IT system is very small. To reduce
the number of categories, we can combine the categories “good” and “poor”.
Normally, we could simply add the frequencies 3 + 109 = 112, but we want to
show how to modify the values in the stream itself, to obtain in the end a
completely new variable with the transformed values. We therefore have to
implement the transformation summarized in Table 3.6 (see also the short R
sketch after the table).

Fig. 3.53 Data Analysis node is added to the template stream

Fig. 3.54 Results of the Data Analysis node



Fig. 3.55 Frequency distribution of “starttime”

Table 3.6 Summary of the reclassification procedure

Variable “starttime”              Variable “starttime_recoded”
Value label     Value             Reclassified value     Value label
Excellent       7                 7                      Excellent
Fair            5                 5                      Fair
Good            3                 3                      Good or poor
Poor            1                 3                      Good or poor
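Outside the Modeler, the same recoding can be sketched in a few lines of base R; the vector below is made-up example data and only illustrates the mapping of Table 3.6:

  # Codes 1 ("poor") and 3 ("good") are merged into the new code 3
  # ("good or poor"); the codes 5 and 7 stay unchanged.
  starttime <- c(7, 5, 3, 1, 5, 1)                   # made-up example codes

  starttime_recoded <- ifelse(starttime == 1, 3, starttime)

  table(starttime, starttime_recoded)                # check the mapping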

Fig. 3.56 Reclassify node is added to the stream

7. To do so we add a Reclassify node to the stream from the “Field Ops” tab of the
Modeler and connect it with the Type node as shown in Fig. 3.56.
8. To implement the reclassification procedure shown in Table 3.6, we double-
click on the new Reclassify node. A dialog window as shown in Fig. 3.57
appears.
9. We now adjust all the parameters as follows:
(a) Mode:
Here, we would like to reclassify just one variable. Therefore, the first
“Mode” option is “Single”;

Fig. 3.57 Parameters of the Reclassify node

(b) Reclassify into:


In the second row of the dialog window, we have the chance to decide if
we create a new variable or overwrite the existing variable “starttime”. We
strongly suggest not using the second option “Existing field”. The reason
for using the Modeler is to realize a transparent and easy to understand
analysis process. Therefore, a scenario where we modify an existing
variable is inappropriate. It is better to use the first option “New field”.
(c) Reclassify field:
In the dropdown list “Reclassify field”, we can define which variable
should be reclassified. The available choices in this field depend on the
setting of the “Mode” option. If we choose a multiple reclassification

procedure, then we have the opportunity to select more than one variable.
We select the field “starttime”.
Once more we have to realize that in this dialog box only discrete
variables appear! If we miss a variable, then we have to adjust the settings
in the Type node.
(d) New field name:
The SPSS Modeler displays a generic name with a number. It would be
better to select a name that is self-explanatory, so the name for our new
reclassified variable will be “starttime_recoded” (see Fig. 3.58).

Fig. 3.58 Parameters to reclassify the variable “starttime” (part 1)



(e) Reclassify values:


We have to make sure to define the reclassification as shown in
Table 3.6. The handiest way to do this is to gather all the values of the
original variable by using the button “Get”. In the left-hand column
“original values”, we now find the values 1, 3, 5, and 7. These are the
codes we saw also in Fig. 3.50. Now we can define the new values in the
second column of the table, as in Fig. 3.58. To do so, we click in each row
and choose “specify” and define the new value. Figure 3.59 shows the final
parameters of the Reclassify node.

Fig. 3.59 Parameters to reclassify the variable “starttime” (part 2)



" The Reclassify node enables the transformation of the values of a


variable. This node can be used, e.g., to combine certain values within
a scale. It is critically important that the variables used are defined as
discrete (nominal, ordinal, or categorical).

" The term “reclassify” does not mean “binning”. A better description
would be “Recode node”.

" It is critically important to make sure that the last row of the reclassify
definition has absolutely no value, that it is blank!

" To define value labels for a reclassified variable, a Type node must be
added to the stream after the Reclassify node.

10. Figure 3.60 shows the current stream.


We have so far implemented the transformation of the variable “starttime”
into the new variable “starttime_recoded”, but we have not defined value labels
for the new variable, as shown in the last column of Table 3.6. We have
explained in detail in Sect. 2.3 how to do that. There we have also outlined
that it is necessary to add a Type node to define these new labels. Therefore, we
should add a Type node to this stream too, right behind the Reclassify node (see
Fig. 3.61).
11. If we double-click on the Type node and scroll down to the new variable
“starttime_recoded”, we can define the new value labels as shown in Fig. 3.62.
12. In the new dialog window in Fig. 3.63, we use the option “Specify values and
labels” and define the labels in the second column of the table below. We use
the coding scheme of Table 3.6.

Fig. 3.60 Stream to Reclassify values



Fig. 3.61 Stream with Reclassify and Type node

Fig. 3.62 Step 1—Definition of value labels in the Type node

13. We can confirm the settings with “OK” and then close the dialog window of the
Type node with “OK”.
14. Finally, we add a Data Audit node at the end of the stream, as shown in Fig. 3.64.
This gives us the chance to analyze the new variable “starttime_recoded”.

Fig. 3.63 Step 2—Definition of value labels in the Type node

Fig. 3.64 Final stream to reclassify values and add value labels

15. To see the frequency distribution, we double-click on the Data Audit node, run
the analysis, and scroll down to the new variable “starttime_recoded” (see
Fig. 3.65).
16. Finally, if we double-click on the small diagram marked with an arrow in
Fig. 3.65, we can see the new value labels and the frequency distribution as
shown in Fig. 3.66.

Fig. 3.65 Details of the new variable in the Data Audit node

Fig. 3.66 New codes for the variable “starttime_recode” and their frequencies

3.2.7 Binning Continuous Data

Description of the model


Stream name binning_continuous_data
Based on dataset tree_credit.sav
Stream structure

Related exercises: 8, 11, 12



A continuous variable can take on an infinite number of values within a certain


interval. As statisticians, we want to analyze the shape of the distribution and
several measures of central tendency and volatility. These tasks are explained
in Sects. 3.2.3 and 3.2.4. Often, we need a better overview of the information
represented by a variable, or it is necessary to reduce the number of different values.
Now, we describe the procedure to bin values:

1. We open the template stream “tree_credit” and save it under another name, e.g.,
“binning_continuous_data”. Alternatively, we create a stream based on the
dataset “tree_credit.sav” as demonstrated in Sect. 3.2.2. Either way, we get a
stream as shown in Fig. 3.67 with the settings in the Type node, as shown in
Fig. 3.12.
2. Now, we add a node called “Binning” from the “Field Ops” panel, and we
connect the Binning node with the existing Type node (see Fig. 3.68).
3. Before we bin the values, it is always a good idea to analyze the original dataset
(see Sect. 3.2.4). For this reason, we add a Data Audit node to the stream and
connect it with the Statistics File node (see Fig. 3.69).
4. Starting the analysis with this Data Audit node, we realize that only “age” is a
continuous variable. The second row in Fig. 3.70 shows a minimum age of
20 years and a maximum age of 63.350 years. Furthermore, the average age is
33.816 years and the standard deviation is 8.539 years. We will come back to
those results when we discuss the different binning procedures.
5. The Binning node determines the values of a new variable that we would like to
analyze afterwards. To do so, we finally add an additional Data Audit node to the
stream and connect it with the Binning node (see Fig. 3.71).
At the moment, the Modeler tells us that there are just six variables. Normally,
we would expect seven—the six original variables and the new additional binned
variable—but so far we have not defined any parameter in the Binning node.
After the next step, there will be seven variables including the new binned age.

Fig. 3.67 Stream as defined in template “tree_credit”

Fig. 3.68 Added Binning node to the stream



Fig. 3.69 Stream with the Data Audit node added to the Statistics File node

Fig. 3.70 Results of the analysis using the Data Audit node

6. First double click on the Binning node and choose “age”, the only variable in the
dataset that is continuous (see Fig. 3.72).

In any case, the user has to determine the binning method. Figure 3.72
shows the methods. To become more familiar with these options, we now want to
use and explain them.

Options Within the Binning Node


Figure 3.72 shows the names of the different binning methods in a drop-down list.
We explain the most important parameters here. More details can be found in IBM

Fig. 3.71 Final stream to bin values

Fig. 3.72 Variables and methods available in the Binning node



(2014, pp. 125–130). If we would like to specify cut points by ourselves to define
a flag variable (status indicator), we should use a Derive node (see Exercise 12
in Sect. 3.2.8). Additionally, we have to mention that sometimes we would like to
reclassify the categories of a variable. The Reclassify node should be used for this
(see Sect. 3.2.6).

Method “Fixed-Width”—Constant Bin Width


This method can be used to calculate the values of a new categorical variable that
has equidistant cut points. After specifying the width of 10 for each of the new
classes, the modeler can calculate the cut points. To do so we activate the tab “Bin
Values” and click on the button “Read Values”. When we compare the minimum
for “age” with 20.003 (see Fig. 3.70), we can see in Fig. 3.73 that the Modeler
adjusts the boundaries so that the first value lies in the middle of the first class
(20.003 − 10/2 = 15.003).
The name extension for the new variable can be defined in the middle of the
dialog window shown in Fig. 3.72. Normally this is set to “_BIN”.
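A rough equivalent of this fixed-width binning can be sketched in plain R with cut(), using a bin width of 10 and the shifted starting point described above; the age values here are made up for illustration:

  age   <- c(20.003, 24.7, 31.2, 38.5, 44.2, 63.35)   # made-up example values
  width <- 10
  start <- min(age) - width / 2                       # 20.003 - 10/2 = 15.003

  breaks  <- seq(start, max(age) + width, by = width) # equidistant cut points
  age_bin <- cut(age, breaks = breaks)                # assign each value to a class
  table(age_bin)                                      # frequency per class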

Fig. 3.73 Cut points for equidistant classes



Method “Fixed-Width”—Predefined Bin Number


Alternatively, the fixed bin option can be used to create a new variable with
a predefined number of classes or categories (see Fig. 3.74). After using “Read
values” in the “Bin values” tab, the Modeler starts the first class with the
minimum age of 20.003 (see Fig. 3.75). For more powerful options to define the
number of classes, we refer to the next method “Tiles (equal count)”. Here, we can
define what happens with values that are equal to a cut point.

Method “Tiles (Equal Count)”


We should use the option “Tiles (equal count)” to determine the values of a new
variable so that the number of values of the original variable in each class is the

Fig. 3.74 Predefined number of categories



Fig. 3.75 Cut points for a predefined number of categories

same (see Fig. 3.76). It is essential to understand all the terms “quartile”, “quintile”,
etc. that we find in the middle of the same dialog box shown in Fig. 3.76.
In general, a p-quantile x_p is the value where p % of all the values can be found
on the left-hand side of x_p. By definition, the median is x_0.50, the 50 % quantile,
or the 50th percentile. All other terms are explained in Table 3.7.
The Binning node also offers the option of putting a certain number of values in
each class. This option is similar to the cut point definition for a predefined number
of categories, using the “Fixed-width” option as mentioned above. The advantage
here, however, is that we can determine what happens with the values at the cut
points themselves. If a value is equal to a cut point, then the options
“Add to next”, “Keep in current”, and “Assign randomly” determine how this value is assigned.
In the case of a continuous variable, the probability that this will happen is
approximately zero, but we can imagine this, e.g., in the case of a variable “age
in years”. If the respondent is 25 years old and the cut point definition is exactly
25, then we can determine the class of ages the person is assigned to.
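In R, such equal-count classes can be sketched with quantile-based cut points; the age vector below is again made-up example data:

  # Equal-count ("tiles") binning sketch: quintile cut points put roughly 20 %
  # of the values into each of five classes.
  age <- c(20, 23, 27, 29, 31, 34, 36, 40, 45, 58)           # made-up example values

  cuts      <- quantile(age, probs = seq(0, 1, by = 0.2))    # quintile boundaries
  age_tile5 <- cut(age, breaks = cuts, include.lowest = TRUE)
  table(age_tile5)                                           # roughly equal counts per class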

Fig. 3.76 Defining categories with the same frequency of original values

Table 3.7 Quantile function in the Binning node


Quartile Each bin includes 25 % of the values. Four bins are generated.
Quintile 20 % of the values fall in each class. Five classes are generated.
Decile Each decile represents 10 % of the values. The cut points for 10 classes are
therefore determined.
Vingtile 5 % of the values fall within the boundaries of each class.
Percentile Each class represents one % of the values. All in all, 100 classes are determined.

If we would like to define several new variables, we can activate several of the
options explained in Table 3.7. Figure 3.76 shows an example with the quintile and
the decile. To show the results, we click once more on the tab “Bin Values” and
choose the option “Tile”. The drop-down list is marked with an arrow in

Fig. 3.77 Defining more than one variable to be generated with the “Tile” option

Fig. 3.77. Let’s analyze the details of the generated variables “AGE_TILE5” and
“AGE_TILE10”. Both are also shown in the analysis of the Data Audit node
connected to the Binning node (see Fig. 3.78).

Method “Ranks”
This option should be used to rank the values in ascending or descending order.
Figure 3.79 shows a definition for three new variables. For an explanation, see
Table 3.8.

Method “Mean/Standard Deviation”


One of the most important and interesting options is the binning method, using a
combination of the mean/average and a multiple of the standard deviation. As a

Fig. 3.78 Newly generated variables using the Binning node

standard measure of volatility, the standard deviation is one of the most important
measures. It is the square root of the average squared distance of the values from
the mean. Although the median sometimes represents the center of a distribution
better, the standard deviation can only be interpreted in combination with the mean.
To become familiar with the theory, we will focus once more on the variable
“age” and the measures determined with the Data Audit node in Sect. 3.2.4.
Figure 3.80 shows the results once more. The average age is 33.816 and the
standard deviation is 8.539.
We now want to draw attention to the following intervals:

[x̄ − 2·s, x̄ + 2·s] = [33.816 − 2·8.539, 33.816 + 2·8.539] = [16.738, 50.894]

[x̄ − 3·s, x̄ + 3·s] = [33.816 − 3·8.539, 33.816 + 3·8.539] = [8.199, 59.433]

" Probability theory has a rule of thumb: in the first interval—also


called “2s-interval”—usually approximately 90–95 % of all the values
of the sample can be found. Between 95 and 100 % of the values can
also usually be found within the boundaries of the so-called “3s-
interval”. As we said, these percentages merely represent a rule of
thumb! Table 3.9 shows the exact percentages, depending on the
shape of the distribution.

Fig. 3.79 Defining an order of values and generating new variables

Table 3.8 Options for ranking cases


Rank Depending on the order type, the first value is marked with 1. In the
case of ascending ordered values, the smallest value is ranked 1
Fractional rank The new value represents the rank mentioned above, divided by the
sum of all the weights
Percentage fractional rank The rank is divided by the number of valid cases and multiplied by
100. The result is a value between 1 and 100
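A quick R sketch of these three rank variants, assuming all case weights equal 1, so that the sum of the weights equals the number of cases:

  x <- c(34, 21, 45, 28, 60)

  rank_asc            <- rank(x)                     # 1 = smallest value (ascending order)
  fractional_rank     <- rank(x) / length(x)         # rank divided by the sum of the weights
  pct_fractional_rank <- 100 * rank(x) / length(x)   # value between 1 and 100

  cbind(x, rank_asc, fractional_rank, pct_fractional_rank)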

Fig. 3.80 Data Audit node—results for dataset “tree_credit” (see Sect. 3.2.4)

Table 3.9 s-intervals around the mean

Interval               Normal distribution   Unimodal, symmetrical distribution   Arbitrary distribution
[x̄ − 1·s, x̄ + 1·s]     68.27 %               55.56 %                              0 %
[x̄ − 2·s, x̄ + 2·s]     95.45 %               88.90 %                              75.00 %
[x̄ − 3·s, x̄ + 3·s]     99.75 %               95.06 %                              88.89 %

The attentive reader will probably ask why we did not discuss the 1s-interval and
its boundaries

[x̄ − 1·s, x̄ + 1·s] = [33.816 − 1·8.539, 33.816 + 1·8.539] = [25.277, 42.355]

As we can see in Table 3.9, the percentage of values falling within those boundaries
highly depends upon the shape of the distribution. Furthermore, even in the case of a
normally distributed variable, only about 68 % of the values can be found in that
interval, and so this interval is of less importance to us.
We would like to use the 2s- and the 3s-interval to identify outliers. Looking
at the percentage of values in these intervals, we can say that values outside the
2s-interval are potentially outliers; they are suspicious. Values outside the
3s-interval are definitely outliers.
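As a small sketch of this rule in R, the following lines flag suspicious and definite outliers for a variable; the mean and standard deviation correspond to the values for “age” reported by the Data Audit node, but the data vector itself is only simulated for illustration:

  # 2s-/3s-rule sketch: flag values outside the intervals around the mean
  set.seed(1)
  age <- rnorm(2000, mean = 33.816, sd = 8.539)     # simulated placeholder data

  m <- mean(age)
  s <- sd(age)

  suspicious <- age < m - 2 * s | age > m + 2 * s   # outside the 2s-interval
  definite   <- age < m - 3 * s | age > m + 3 * s   # outside the 3s-interval

  sum(suspicious)                                   # number of potential outliers
  sum(definite)                                     # number of definite outliers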
Coming back to the Binning node, we now use the 2s-interval option for the
variable “age”, as shown in Fig. 3.81. The “Bin values” tab of the Binning node
in Fig. 3.82 shows the results. If we then use the connected Data Analysis node,
we find the frequency distribution as shown in Fig. 3.83. When we double click
on the distribution chart, we get the diagram in Fig. 3.84. Clearly, some values are
larger than the upper boundary of the 2s-interval, so we find 78 potential outlier
values in this dataset. This is what we expected from the shape of the distribution
of “age” shown in Fig. 3.80: the distribution has a long right tail. These are the
“outliers” we have now found once again with the more refined 2s-interval
method (Fig. 3.84).

Fig. 3.81 Using the mean and the standard deviation to determine values of a new variable

3.2.8 Exercises

Exercise 1: “Sorting Values”


In this chapter, we discussed several nodes that can be used to analyze data in a
more or less simple way. We did not discuss all the available nodes, however. For
example, the Sort node is a node that has not been mentioned up to now, but it is
simple to use so let’s have a look at how it works.

1. Open the predefined stream “car_sales_modified” that refers to a dataset


representing car sales statistics for several car types. See also Sect. 10.1.5.
2. Add an appropriate node to the stream that shows the records in the order
originally defined.

Fig. 3.82 Multiple standard deviation intervals around the mean

3. Now the records should be reordered. Therefore, add a Sort node to the end of
the stream on the right-hand side. Modify the node settings so that the records are
ordered in descending order in relation to the frequency of sales.
4. Finally, add a node that shows you the records in the modified order.

Exercise 2: “Scale Types Versus Diagram Types”


To assess a frequency distribution by using a correct diagram, as well as to calculate
the measures of central tendency and volatility, the identification of the proper scale
type is fundamental. Please answer the following:

1. We distinguish between nominal, ordinal, and continuous variables. Name and


explain one practically relevant example for each scale type.

Fig. 3.83 Results of the outlier analysis

Fig. 3.84 Frequency distribution of the classified variable using the 2s-rule

2. Explain the difference between these scale types.


3. The most important chart types are pie, bar, and curve. Using Table 3.10, try
to match these chart types to the different scale types. Mark the appropriate chart
types with “X” and explain your decision.

Exercise 3: “Create a Diagram for Discrete Values”


The dataset “z_pm_customer_train1.sav” should be used to create a diagram. For
more details see Sect. 10.1.36.

Table 3.10 Matching of scale and diagram types

Scale type      Pie chart    Bar chart    Curve chart
Nominal
Ordinal
Continuous

Fig. 3.85 Data in “IT-projects.txt”

1. Create a new stream.


2. Add an appropriate node and load the data into the stream.
3. Add a Type node and determine the correct scale type for the variable “campaign”,
which represents the number of campaigns a customer has participated in.
4. Create a bar chart using the Distribution node.

Exercise 4: “Analyzing IT-Data”


Figure 3.85 shows a screenshot of the dataset “IT-projects.txt” in a simple text
editor format. For more details see Sect. 10.1.19. This file should be used to
understand how the Variable File node works. Additionally, other questions should
be answered using the SPSS Modeler. Please create a new stream and follow the
instructions described below.

1. Open a new empty stream and add a Variable File node from the sources section.
2. Then try to load the records from the data file “IT-projects.txt”. As you can see,
this is a simple text file with the variable names in the first line and tabs between
the numbers in the following lines. To read the file, we will use the following
Variable File node options: “Read file names from file”, “File delimiters”, “Tab
(s)”, and “Newline”. All the other options we do not have to modify.
Close the options dialog of the Variable File node.
3. Now we would like to see the imported records. To do so, add a Table node to
the stream and connect it with the Variable File node. Try to show the records
in a table.
4. Finally, diagrams should be used to visualize the data. Before we do that, please
analyze the variables and determine the correct scale type using the theoretical
findings in this chapter.
5. Add a Type node and make sure to check the scale types so that a Histogram
node can be used.

Fig. 3.86 Analysis of the variables in “tree_credit.sav”

6. Can you use another node to create a graph of the frequency distribution of
several variables?

Exercise 5: “Comfortable Distribution Analysis”


Figure 3.86 shows an analysis of the variable distributions in the file “tree_credit.
sav”. Using a Data Audit node create a proper stream that reproduces the same
result. Please note that only certain measures per distribution will appear! Try to
modify the settings of the Data Audit node to get the same results.

Exercise 6: “Multiple Choice Test Measures of Central Tendency”


1. The appearance of a distribution in a histogram depends on . . .
   ☐ The number of classes
   ☐ The width of the classes
   ☐ The start of the classing
2. Distributions can be characterized by the following measures . . .
   ☐ Skewness
   ☐ Dispersion
   ☐ Location on the x-axis = central tendency
   ☐ Kurtosis
3. Measures of location for distributions were calculated to describe clearly and concisely the following characteristics . . .
   ☐ Maximum distribution
   ☐ Location of the distribution relative to the x-axis
   ☐ Location of the distribution relative to the y-axis
   ☐ Labeling of outliers
4. To calculate the average as a measure of central tendency, the following aspects should be considered . . .
   ☐ Whether data are classified/unclassified
   ☐ Dimension of the values
5. A list with four values is given: 3, 2, 10, 7. What is the median?
   ☐ 10
   ☐ 2
   ☐ 6
   ☐ 5
6. To calculate the arithmetic mean of classified values, the following values are used . . .
   ☐ The left boundary of each class
   ☐ The right boundary of each class
   ☐ The class midpoints
   ☐ The abs./rel. frequency per class
   ☐ If applicable, the sample size
7. A list for the variable “visitors to the library per hour” with 4, 4, 2, 5, 5 is given. Which statements are correct?
   ☐ The arithmetic mean is 4
   ☐ The median is 2
   ☐ The mode is 4
   ☐ The mode is 5
   ☐ The median is 4
8. The median always equals a value in RAW data if the sample size is . . .
   ☐ n = even
   ☐ n = odd
   ☐ n = 1
   ☐ n = infinite
9. The wages of a company were statistically analyzed. What can you state about the distribution of the wages if x_mean = 95,000 EUR, x_med = 50,000 EUR and x_mod = 45,000 EUR? The distribution is . . .
   ☐ Symmetrical
   ☐ Skewed to the right
   ☐ Skewed to the left
   ☐ Extremely skewed
10. A distribution of the wages was analyzed. Which of the following statements describes the payments for employees in general? Which of the following statements describes the distribution of the payments best?
   ☐ The middle income is . . . EUR
   ☐ The median income is . . . EUR
   ☐ The arithmetic mean of the income is . . . EUR
   ☐ The mode of the income is . . . EUR

Are the following statements correct? Tick “Yes” or “No”.
11. The distribution of incomes is always skewed to the left. ☐ Yes ☐ No
12. The arithmetic mean is sensitive to outliers. ☐ Yes ☐ No
13. The arithmetic mean of a unimodal distribution skewed to the right always lies on the right side of the maximum of the distribution. ☐ Yes ☐ No
14. To calculate the arithmetic mean of a variable using relative frequencies, the sum of all xi*f(xi) has to be divided by the sample size at the end. ☐ Yes ☐ No
15. The arithmetic mean of a variable can be greater or less than most of the values. ☐ Yes ☐ No
16. The median of a data file can be greater or less than most individual values. ☐ Yes ☐ No
17. Outliers in the data of a sample have no consequences for the median. ☐ Yes ☐ No

Exercise 7: “Interpretation of Measures of Central Tendency and Volatility”


In a Data Audit node, the calculation of the different measures can be activated.
Figure 3.87 gives an overview. Certain measures of central tendency, as well as
volatility, are available.

1. Which options in a Data Audit node have to be activated to calculate the median
and the mode?
2. Explain the following measures, also shown in Fig. 3.87: mean, median, mode, range, and standard deviation. Create a table with the columns “Name of measure” and “Explanation”.

Exercise 8: “Theory of Data Transformation”


Explain the difference between variable normalization and standardization as well
as transformation of a variable to normality.

Exercise 9: “Transforming Data Towards Normality”


A lot of algorithms assume the variables to be normally distributed. In Sect. 3.2.5, we discussed how to evaluate a distribution regarding its normality and how to transform the values towards normality. In this exercise, the variables in the dataset

Fig. 3.87 Measures offered by the Data Audit node

“customer_bank_data_calculated” should be analyzed and transformed. For details


regarding the variables see also Sect. 10.1.7.
The results of the transformation are used in Exercise 3 of Sect. 7.4.3 as well as
7.4.4 to apply the TwoStep Algorithm to cluster the data.

1. Open the “Template-Stream_Customer_Bank”. It includes a Variable File node,


a Type node to define the scale types, and a Table node to show the records.
2. Calculate the ratio of the debt (credit card debt and other debt) to the income as a new variable “DEBTINCOMERATIO”, expressed in percent.
3. Based on this template stream assess the distribution of the variables.
4. Using an appropriate node transform the variables towards more normally
distributed variables.

Exercise 10: “Reclassification of Multiple Variables”


In Sect. 3.2.6, we discussed the procedure to reclassify the values of a single
variable. For this we used the dataset “IT_user_satisfaction.sav” and re-organized
the values of the variable “starttime”. In this dataset, however, we can find more
than just this variable. Altogether, 11 variables have the same scale. The aim of this exercise is to reclassify all of these variables at once.
To do so, please have a look at Fig. 3.88. In the upper part of the dialog window, you can see the parameter “Mode”. Please use the option “Multiple” and do the
following:

1. Open the stream “Template-Stream IT_user_satisfaction”.


2. Save it with another name.
3. Now reclassify or recode the variables “starttime”, “system_availability”, . . .,
“slimness” as described in Table 3.11.
4. Show your results using a Data Audit node.

Exercise 11: “Outlier Analysis”


In Sect. 3.2.7, we discussed the relationship between mean and the standard
deviation of a distribution. See especially Table 3.9. Furthermore, we introduced
the 2s-/3s-rule and its usage for identifying outliers. Now we will use the stream
“binning_continuous_data”. Please open this stream.

1. Explain the meaning of the 2s- and the 3s-rule in your own words.
2. Using the 3s-rule, determine the number of people that are relatively old. In
other words, those who are outliers in the dataset in terms of their age.

Exercise 12: “Creating a Flag Variable”


In Sect. 3.2.7, we discussed several methods to bin values of a continuous variable.
In this exercise we want to determine the values of a flag variable depending on the
income of a person or respondent.

1. Open the stream “binning_continuous_data” and save it using another name.


The solution can be found in “create_flag_variable”.

Fig. 3.88 Parameters of the reclassify node, option “Multiple” selected

Table 3.11 Summary of the reclassification procedure

Variables to recode: “starttime”, “system_availability”, . . ., “slimness”; the new variable names carry the extension “_Reclassify”.

Value label   Value   Reclassified value   Value label
Excellent     7       1                    At least fair
Fair          5       1                    At least fair
Good          3       2                    At most good
Poor          1       2                    At most good

2. Add a Derive node and connect it with the Type node. Using the Derive node,
calculate values of a new variable “Income_binned” which tells you if the
income is smaller than 2.5. This should be a flag variable as shown in Fig. 3.89.
3. By using the expression builder on the right in Fig. 3.89, define a condition for
the calculation of the new variable.
4. Finally, define labels “<2.5” and “≥2.5” for the new variable “Income_binned”.
5. Show the results in a Table as well as in a Data Audit node.

Fig. 3.89 Option of the derive node to calculate a flag variable



Exercise 13: “Multiple Choice Test Measures of Volatility”


1. Mark the statistical measures of central tendency and volatility that can be directly compared, based on their calculation and unit.
   ☐ Median
   ☐ Mean
   ☐ Standard deviation
   ☐ Variance
2. What is the smallest value of the standard deviation of a variable X with its values xi?
   ☐ min(xi)
   ☐ Average x
   ☐ Zero
   ☐ max(xi)
3. Consider that the mean, the mode, and the median of the variable “rental price per square meter” in a town are 5 EUR. Now all prices rise abruptly by 2 EUR/sqm. How is the variance/the standard deviation affected?
   ☐ Variance/standard deviation rises
   ☐ Variance/standard deviation remains the same
   ☐ Variance/standard deviation decreases
4. Consider that the mean, the mode, and the median of the variable “rental price per square meter” in a town are 5 EUR. Now all prices rise abruptly by 10 %. How is the variance/the standard deviation affected?
   ☐ Variance/standard deviation rises
   ☐ Variance/standard deviation remains the same
   ☐ Variance/standard deviation decreases
5. Consider a variable in two samples with the same size. One of the samples has a unimodal and symmetrical distribution and the other one has a uniform distribution for the values of this variable. Both samples have the same mean, median, and mode. Which of these samples has the largest variance?
   ☐ The sample with the unimodal, symmetrical distribution
   ☐ The sample with the uniform distribution
   ☐ The variance of both samples is identical
6. You can choose exactly four values out of the interval between 1 and 9. Which of the following combinations has the largest standard deviation?
   ☐ 1,1,1,1
   ☐ 1,2,8,9
   ☐ 1,1,9,9
   ☐ 9,9,9,9
7. Consider a symmetrical and unimodal frequency distribution with the mean 10 and a standard deviation of 1.5. Which proportion of the values can we find at least in the interval between 7 and 13?
   ☐ 0 %
   ☐ 95.45 %
   ☐ 88.89 %
   ☐ 75.00 %

Are the following statements correct? Answer “Yes” or “No”.
8. The unit of the variance equals the unit of the values to the power of 2. ☐ Yes ☐ No
9. The accuracy of the arithmetic mean (average) of a sample can only be assessed when the measure of volatility (i.e., standard deviation) is known. ☐ Yes ☐ No
10. The variance of a variable is influenced by outliers. ☐ Yes ☐ No
11. The standard deviation and the variance of a variable have the same units. ☐ Yes ☐ No
12. The standard deviation of a variable is zero if all values are the same. ☐ Yes ☐ No
13. The more skewed a distribution is, the bigger the variance/standard deviation. ☐ Yes ☐ No

3.2.9 Solutions

Exercise 1: “Sorting Values”


Name of the solution streams sorting_values.str

Before we start we would like to show the initial stream “car_sales_modified” in


Fig. 3.90. Nodes should be added to this stream to answer the questions.
To do so we suggest the following steps:

1. Figure 3.90 shows the original stream.


2. To have a look at the original dataset, we can use a Table node connected with
the Statistics node on the left-hand side of the stream. Figure 3.91 shows the
result of this step. When we open the table node and click “Run”, we will get the
result shown in Fig. 3.92.
3. To sort the values, a Sort node from the category “Record Ops” should be added and connected with the Type node to the right of the stream (see Fig. 3.93 on the right-hand side).
To sort the records in descending order, related to the frequency of sales, we double-click on this Sort node. In the dialog window, we click on the first button

Fig. 3.90 Initial template stream “car_sales_modified”

Fig. 3.91 Expanded template stream “car_sales_modified” (step 2)



Fig. 3.92 Initial order of the records

Fig. 3.93 Expanded template stream “car_sales_modified” (step 3)

on the right-hand side, to define the variable used to sort the values. This is
shown with an arrow in Fig. 3.94.
We use the variable “sales” and add it to the dialog box (Fig. 3.95).
Finally, we modify the order by clicking on the text “Ascending”. Alternatively, we can first set the default option to “Descending” at the bottom of the dialog window, to get the same result. Figure 3.96 shows the result.
If we would also like to use other variables to sort the records, we have the
opportunity here to add additional criteria.
4. To show the result, or rather the records, in the new order, the stream should be expanded by another Table node on the right-hand side. This is shown in Fig. 3.97. Figure 3.98 clearly shows the records reordered by sales in descending order. When we compare this figure with Fig. 3.92, we can verify the result.

Fig. 3.94 Sort node parameter definition

Fig. 3.95 Select the variable to sort the values



Fig. 3.96 Sort node parameter definition

Fig. 3.97 Final stream to reorder the records

Fig. 3.98 Reordered records of dataset “car_sales_modified”



Table 3.12 Examples for different scale types


Scale type Examples
Nominal Color of bags, gender of a person
Ordinal Assessment of a product, category of hotels
Continuous Size of an apartment, number of car accidents (see explanation in Sect. 3.1.1)

Table 3.13 Matching of scale and diagram types

Nominal: Pie chart: X. Bar chart: there is no natural order in the values and so a bar chart cannot be used. Curve chart: values are just discrete, so a curve is inappropriate.
Ordinal: Pie chart: X. Bar chart: X. Curve chart: values are just discrete, so a curve is inappropriate.
Continuous: Pie chart and bar chart: a classification of the values is necessary before these chart types can be used. Curve chart: X.

Fig. 3.99 Stream “pm_customer_train1”

Exercise 2: “Scale Types Versus Diagram Types”


Theory discussed in section Section 3.1
Section 3.2.2
Section 3.2.3

1. Practically relevant examples for each scale type can be found in Table 3.12.
2. For an appropriate explanation of each scale type, see Sect. 3.1.2.
3. For the matching of the scale of measurement and the chart types, see Table 3.13.

Exercise 3: “Create a Diagram for Discrete Values”


Name of the solution streams pm_customer_train1.str
Theory discussed in section Section 3.2.2

Figure 3.99 shows the structure of the stream used to graphically analyze the
variable “campaign”. The stream can be found in the solutions, with the name
“pm_customer_train1.str”.
It is important to assign the scale type “ordinal” to this variable, in the Type node of
the stream. Figure 3.100 shows the frequency distribution in the form of a bar chart.

Fig. 3.100 Bar chart of the frequency distribution of the variable “campaign”

Exercise 4: “Analyzing IT-Data”


Name of the solution streams analyzing IT-Data.str
Theory discussed in section Section 3.2.2

1. To open a new stream, use “File\New Stream” from the toolbar. Then add a Variable File node from the sources section.
2. To load the records from the data file “IT-projects.txt”, the parameters of the node have to be modified as shown in Fig. 3.101 (an equivalent import with pandas is sketched after this list).
3. Figure 3.102 shows the results of the operation using a Table node.
4. There are four variables that should be discussed here. Table 3.14 shows the
scale type of each variable. For an explanation of several scale types see
Sect. 3.1.
5. Figure 3.103 shows the scale type definitions of the type node as described
above. Figure 3.104 also shows the final stream.

" Definition of the scale type depends often on how the variable in the
streams will be used. The scale type is particularly essential if
diagrams are to be created. The Distribution node needs discrete
(nominal, ordinal, or categorical) variables, whereas the Histogram
node only accepts continuous variables. Therefore, sometimes we
need to add a Type node to the stream and modify the scale types.

6. Figures 3.105 and 3.106 show diagrams of frequency distributions. The best way to get a rough overview of the distributions is to use a Data Audit node, as described in Sect. 3.2.4. The shape of the distributions depends heavily on the scale of measurement of the variables, however. If, for example, the variable KDSI is defined as continuous in the Data Audit node, then the values will be binned. If

Fig. 3.101 Parameter of the Variable file node to load data records of “IT-projects.txt”

Fig. 3.102 Table with some records from “IT-projects.txt”



Table 3.14 Variables in the dataset “IT-projects.txt”

Project number: Discrete and ordinal. The scale type is not metrical because it does not make sense to calculate with the values.
TDEV, PM: Continuous and metrical. If we consider that the necessary number of months can be determined with an infinite number of decimal places, then the variable is metrically scaled. If we think this precision makes no sense, we should define the variable as discrete. In either case, it is a metrical variable. To use a distribution node to draw a graph of the frequency distribution afterwards, we should determine the scale type as discrete.
KDSI: The number of source instructions is discrete. Nevertheless, to work with the values it can sometimes be a good idea to define the scale type as “continuous”. If a bar chart as shown in Fig. 3.104 should be used, the scale of measurement must be defined as “ordinal” or “categorical”.

Fig. 3.103 Scale type definitions for “IT-projects.txt” in the Type node

the scale type of the variable is discrete, a binning is impossible. We can verify
that by adding a Data Audit node and connecting it to the Variable File node. The
result is different when we modify the scale of measurement definition in the
Type node, e.g., for KDSI.
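As announced above, the same import (variable names in the first line, tab-delimited values) can be sketched outside the Modeler with pandas. This is only an illustrative sketch; the file name is taken from the exercise:

    import pandas as pd

    # Tab-delimited text file with the variable names in the first row
    projects = pd.read_csv("IT-projects.txt", sep="\t")

    print(projects.head())
    print(projects.dtypes)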

Fig. 3.104 Final stream “Analyzing IT-Data.str”

Fig. 3.105 Frequency distribution of “KDSI”

Exercise 5: “Comfortable Distribution Analysis”


Name of the solution streams comfortable distribution analysis.str
Theory discussed in section Section 3.2.4

The solution can be found in the stream “comfortable distribution analysis.str”.
There are two important aspects to keep in mind: firstly, the option “advanced statistics” has to be activated in the Data Audit node (see Fig. 3.19); secondly, the options for the measures to calculate have to be activated, as shown in Fig. 3.21.
Then we get the same results as shown in Fig. 3.86.

Fig. 3.106 Histogram with the frequency distribution of “TDEV”

Exercise 6: “Multiple Choice Test Measures of Central Tendency”

1. The number of classes, the width of the classes and the start of the classing are
all relevant to the appearance of a distribution in a histogram.
2. The measures of skewness, dispersion, location, and kurtosis characterize a
distribution.
3. They are calculated to describe the maximum distribution, the location of the distribution relative to the x-axis, and the labeling of outliers.
4. It should be considered whether the data is classified or not.
5. The values in ascending order are 2, 3, 7, 10, so the median is (3 + 7)/2 = 5.
6. The class midpoints, the absolute/relative frequency per class, and, if applica-
ble, the sample size are used to calculate the arithmetic mean of classified
values.
7. The arithmetic mean is 4. There are two modes, 4 and 5. The median is 4.
8. If n = odd or n = 1.
9. The distribution is skewed to the right, indeed extremely skewed.
10. The median income is . . . EUR. The mode of the income is . . . EUR.
11. No, this is wrong; income distributions are typically skewed to the right.
12. Yes, the arithmetic mean is sensitive to outliers.
13. Yes, it lies on the right-hand side of the maximum of the distribution.

Table 3.15 Explanation of statistical measures

Mean: A typical measure of the central tendency of a distribution. It is calculated by summing up all the values and dividing the sum by the sample size. The mean is very sensitive to outliers.
Median: Due to the sensitivity of the mean to outliers, the median is often used to describe a frequency distribution. It is calculated as follows: we order all the values in ascending order and then use the value in the middle as the median. If the sample size is even, the average of the two values either side of the middle is used.
Mode: The value with the highest frequency.
Range: A rough measure of the volatility of a distribution, calculated as the difference between the maximum and the minimum of all the values.
Standard deviation: The standard measure of volatility and one of the most important measures. It is the square root of the average squared distance of the values from the mean. Although the median often represents the distribution better, the standard deviation can only be interpreted in combination with the mean. For a very useful method to identify outliers using a combination of mean and standard deviation, see Sect. 3.2.7, particularly Table 3.9.

14. No, this is wrong, because the relative frequencies are already divided by the sample size.
15. Yes, the arithmetic mean can be greater or less than most values.
16. No, this is wrong; at most half of the values can lie below (or above) the median.
17. No, this is wrong; the median can also be affected by outliers.

Exercise 7: “Interpretation of Measures of Central Tendency and Volatility”


Theory discussed in section Section 3.2.4

1. Firstly, in the Data Audit node the option “advanced statistics” has to be activated (see Fig. 3.19). Secondly, the options for the measures to calculate have to be activated, as shown in Fig. 3.21.
2. See Table 3.15; a short code sketch of these measures follows below.
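As a complement to Table 3.15, these measures can also be reproduced with a few lines of Python. This is only a minimal sketch with invented values (the data of multiple choice question 7), not part of the book's solution streams:

    import statistics

    values = [4, 4, 2, 5, 5]  # invented example: "visitors to the library per hour"

    mean = statistics.mean(values)            # arithmetic mean, sensitive to outliers
    median = statistics.median(values)        # middle value of the ordered data
    modes = statistics.multimode(values)      # most frequent value(s); here 4 and 5
    value_range = max(values) - min(values)   # rough measure of volatility
    std_dev = statistics.stdev(values)        # sample standard deviation

    print(mean, median, modes, value_range, std_dev)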

Exercise 8: “Theory of Data Transformation”


Theory discussed in section Section 2.7.6
Section 3.2.5

For details see Sects. 2.7.6 and 3.2.5 as well as Table 3.16.

Table 3.16 Summary of terms for transformations of variables

Normalization: Using the formula x_norm = (x_i − x_min) / (x_max − x_min), the transformed variable has a range between [0; 1]. (Details can be found in Sect. 2.7.6.)
Standardization: The formula z_i = (x_i − x̄) / s is used to transform the values x_i. The result spreads around zero and can be interpreted in terms of the standard deviation of the variable. Outlier detection is possible using the 2s- or 3s-rule. (Details can be found in Sect. 2.7.6 and, for the 2s-/3s-rule, in Sect. 3.2.6.)
Transformation to normality: One of the Box–Cox transformations is used to move the frequency distribution towards the shape of the normal distribution. (Details can be found in Sect. 3.2.5.)
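The three transformation types in Table 3.16 can be illustrated outside the Modeler with a short Python sketch. The values are invented, and scipy's Box–Cox function is used here only as one possible power transformation, not as the Modeler's internal implementation:

    import numpy as np
    from scipy import stats

    x = np.array([25.0, 31.0, 29.0, 45.0, 62.0, 38.0])  # invented example values

    # Normalization: rescale the values to the interval [0, 1]
    x_norm = (x - x.min()) / (x.max() - x.min())

    # Standardization: z-scores spread around zero, interpretable in standard deviations
    z = (x - x.mean()) / x.std(ddof=1)

    # Transformation towards normality, e.g., a Box-Cox power transformation
    x_bc, lam = stats.boxcox(x)  # requires strictly positive values

    print(x_norm.round(2), z.round(2), round(lam, 2))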

Fig. 3.107 Initial template stream “Template-Stream_Customer_Bank”

Exercise 9: “Transforming Data Towards Normality”


Theory discussed in section Section 3.2.4
Section 3.2.5

1. We open the template stream “Template-Stream_Customer_Bank” and save it


with another name, e.g., “transform_Customer_Bank”. Figure 3.107 shows the
initial stream.
2. To calculate a new variable in the form of the ratio of credit card debt and other
debt to income, we add a Derive node from the “Field Ops” tab of the Modeler
(see Fig. 3.108). Using its expression builder, as outlined in Sect. 2.7.2, we
define the parameters as shown in Fig. 3.109. The name of the new variable is
“DEBTINCOMERATIO” and the formula “(CARDDEBT + OTHERDEBT)/
INCOME * 100”. So “DEBTINCOMERATIO” equals the total debt of the customer as a percentage of the income.
3. We should add a Data Audit node to assess the distributions (see Fig. 3.110). As we can see in Fig. 3.111, “AGE” is positively skewed. This is also the case for “DEBTINCOMERATIO” and “EDUCATIONReclassified”.

Fig. 3.108 Added Derive node in the stream

Fig. 3.109 Parameters of the Derive node to calculate the debt–income ratio

Fig. 3.110 Added Data Audit node in the stream

Fig. 3.111 Variables and their distribution in the Data Audit node

Fig. 3.112 A transform node is added to the template stream


4. To find a good power transformation, we need to add a Transform node from
the Modeling tab of the SPSS Modeler. We connect it with the Derive node
(see Fig. 3.112).
We double-click the new Transform node to open the dialog window. We
add the variables to transform to the list by using the button on the right. In

Fig. 3.113 Variables are added in the Fields tab of the Transform node

Fig. 3.114 Transform node results

the Fields tab, we add the variables “AGE”, “YEARSEMPLOYED”, and


“DEBTINCOMERATIO” as also shown in Fig. 3.113.
We can run the Transform node.
In the second column of Fig. 3.114, the Transform node shows the current
distribution of the variable. Under the curve, we can find the mean and the
standard deviation. If we double-click on one of these charts, a new window
appears that shows us the details of the frequency distribution selected and an
added normal curve (see Fig. 3.115).
For the distribution, both the mean and the standard deviation of the selected
variable are calculated. The SPSS Modeler determines the expected frequencies
and draws the bell-shaped curve. By assessing the deviation of both curves, we
have the chance to decide whether the distribution is normal or not (see
Fig. 3.115). We can close this window.

Fig. 3.115 Histogram of “AGE” with normal curve

Fig. 3.116 Final stream with a SuperNode and a Data Audit node

Starting at the third column in Fig. 3.114, we can see the charts of the
frequency distributions, depending on different transformations. Here, we can
decide which transformation to use, to shift the distribution of the original
variable more towards a normal curve. To see more details, we can also
double-click on one of those diagrams.
Assessing the distributions, we find that none of the variables has a perfectly bell-shaped form, even after transformation.
For “AGE” and “DEBTINCOMERATIO”, we can see that the square root-transformed values look much better. We select this transformation for both variables by clicking once on the corresponding distributions.
As explained in Sect. 3.2.5, we now define a SuperNode and connect it to the rest of the stream. Finally, we add a Data Audit node to evaluate the result (see Figs. 3.116 and 3.117). A short code sketch of these calculations follows below.
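The derivation of “DEBTINCOMERATIO” and the square-root transformation can also be sketched in Python. The records below are invented; only the column names and the formula follow the exercise:

    import numpy as np
    import pandas as pd

    # Invented sample records with the column names used in the exercise
    df = pd.DataFrame({
        "INCOME": [45.0, 120.0, 30.0],
        "CARDDEBT": [1.2, 4.5, 0.3],
        "OTHERDEBT": [2.1, 6.0, 1.0],
        "AGE": [28, 45, 33],
    })

    # Derive node equivalent: debt-income ratio in percent
    df["DEBTINCOMERATIO"] = (df["CARDDEBT"] + df["OTHERDEBT"]) / df["INCOME"] * 100

    # Transform node equivalent: square-root transformation towards normality
    df["AGE_sqrt"] = np.sqrt(df["AGE"])
    df["DEBTINCOMERATIO_sqrt"] = np.sqrt(df["DEBTINCOMERATIO"])

    print(df)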

Fig. 3.117 Transformed variables

Fig. 3.118 Stream “reclassify IT_user_satisfaction multiple” part I

Exercise 10: “Reclassification of Multiple Variables”


Name of the solution stream reclassify IT_user_satisfaction multiple
Theory discussed in section Section 3.2.6

Here, we used the stream “Template-Stream IT_user_satisfaction” and saved it


under a new name.

1. To show the original values, we added a Table and a Data Audit node to the
stream, as shown in Fig. 3.118.
2. Now we should add a Reclassify node and connect it with the Type node.
Figure 3.119 shows the parameters for reclassifying the variables “starttime”,
“system_availability”, . . ., “slimness”. In addition, we defined the recoding
procedure itself, as summarized in Table 3.11.

Fig. 3.119 Reclassify node parameters

It is absolutely vital to make sure that the last row of the reclassify definition has no value and is blank! This row is marked in Fig. 3.119 with an arrow. To be certain that this row is completely empty, we can also click the red “delete selection” option on the right. This button is also marked with an arrow.

Fig. 3.120 Parameters of the second Type node

3. As shown in the final stream in Fig. 3.122, we have to add another Type node to
define the value labels for the reclassified variables. If we double-click on the
second Type node, we can reorder the variables by clicking at the column head
of the first column “name” (see Fig. 3.120). For each of the first eleven variables, we have defined a new reclassified variable with the name extension “_Reclassify”. The extension is defined in the dialog window of the Reclassify node (see Fig. 3.119 in the middle).
If the reclassification does not work, we should try to set up the definition for just one variable first and then add the other variables in the dialog box “reclassify fields”.
4. In the column “Values” of Fig. 3.120, we can now specify the value labels as
shown in Fig. 3.121. Don’t forget we must click on “Specify values and labels”
before we define the new labels.
5. Unfortunately, we must implement this definition for all of the new variables
with the extension “_Reclassify” shown in Fig. 3.120.
6. Finally, we should add a Data Audit node. As shown in Fig. 3.122, we can now find 27 variables in the stream. This is because we added 11 new variables with the implementation of the Reclassify node. In the data source, 16 variables are defined (see the Data Audit node in Fig. 3.122).
7. Figure 3.123 shows the results of the reclassification procedure in the Data Audit node. A pandas sketch of the same recoding follows below.
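Outside the Modeler, the same “multiple” reclassification could be sketched with pandas. The mapping follows Table 3.11; the records and the shortened variable list are assumptions for illustration only:

    import pandas as pd

    # Invented example records for three of the eleven variables
    df = pd.DataFrame({
        "starttime": [7, 5, 3, 1],
        "system_availability": [5, 7, 1, 3],
        "slimness": [3, 3, 7, 5],
    })

    # Recoding rule from Table 3.11: 7/5 -> 1 ("At least fair"), 3/1 -> 2 ("At most good")
    mapping = {7: 1, 5: 1, 3: 2, 1: 2}

    for col in ["starttime", "system_availability", "slimness"]:
        df[col + "_Reclassify"] = df[col].map(mapping)

    print(df)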

Fig. 3.121 Value label definitions in the Type node

Fig. 3.122 Final stream “reclassify IT_user_satisfaction multiple”



Fig. 3.123 Results of the reclassification procedure

Fig. 3.124 Stream to find outliers with the Binning node

Exercise 11: “Outlier Analysis”


Name of the solution streams outlier analysis.str
Theory discussed in section Section 3.2.7

Part 1
For an explanation of the 2s- and 3s-rule, the interested reader is referred to Sect. 3.2.7 and Table 3.9. There, the functionality of the Binning node was discussed, including the option to bin values using the mean and standard deviation.
Part 2
The solution can be found in the file “outlier analysis”. Figure 3.124 shows the
complete stream. So far, we have not discussed the Sort node on the right-hand side
of the dotted line in Fig. 3.124. In any case, this node is easy to use and there is an
exercise, “Sorting values”, with a detailed solution (see Exercise 1).
Here, we describe the most important steps.

1. First we should determine which variable should be used for the calculation. In the Binning node, we choose the variable “age”.
2. We also use the option “Mean/standard deviation” with “3 standard deviations” (see Fig. 3.125). In Fig. 3.126, we can see that the 3s-threshold is 33.82 + 3 * 8.54 = 59.4 years. Persons that are older can be described as “outliers” in the context of this dataset.

Fig. 3.125 Outlier analysis with the Binning node

3. In the Data Audit node, we can find the distribution of the binned values. A double-click on the frequency distribution of “Age_SDBIN” shows the details in Fig. 3.127. Here, we can see there are seven persons that are relatively old.

The interested user can try to find out the concrete records of the persons who have an age above the threshold mean + 3 * standard deviation. To sort the records, the Sort node can be used, as explained in the solution to Exercise 1 in this chapter. Figure 3.128 shows the concrete parameters in the stream discussed here. Furthermore, a Table node will then show the records (see Fig. 3.129). A short code sketch of the 3s-rule follows below.
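The 2s-/3s-rule itself can be written down in a few lines of Python. This is a sketch with invented ages, not the original “binning_continuous_data” values:

    import numpy as np

    age = np.array([25, 31, 29, 45, 62, 38, 27, 70, 33, 30,
                    28, 26, 35, 24, 32, 29, 31, 27, 30, 34])  # invented example values

    mean, sd = age.mean(), age.std(ddof=1)
    upper_2s = mean + 2 * sd
    upper_3s = mean + 3 * sd   # in the exercise: 33.82 + 3 * 8.54 = 59.4 years

    print("2s threshold:", round(upper_2s, 1), "-> outliers:", age[age > upper_2s])
    print("3s threshold:", round(upper_3s, 1), "-> outliers:", age[age > upper_3s])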

Fig. 3.126 Thresholds to bin the variable “age”

Fig. 3.127 Frequency distribution of the binned values “AGE_SDBIN”



Fig. 3.128 Settings in the Sort node

Fig. 3.129 Sorted records with the highlighted outliers

Exercise 12: “Creating a Flag Variable”


Name of the solution stream create_flag_variable

Figure 3.130 shows the parameters of the Derive node. Figure 3.131 shows the
final stream.
To define the value labels, we double-click the Type node and then double-click the new variable “Income_binned”. Here, we can define the value labels as shown in Fig. 3.132. Figure 3.133 depicts the frequency distribution of the flag variable “Income_binned”.
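The Derive node's flag logic can be reproduced in a few lines of Python. The threshold of 2.5 and the labels follow the exercise; the income values are invented:

    import pandas as pd

    df = pd.DataFrame({"Income": [1.8, 2.5, 3.1, 2.2, 4.0]})  # invented example values

    # Flag variable: True where the condition "Income < 2.5" holds
    df["Income_binned"] = df["Income"] < 2.5

    # Optional value labels, analogous to the Type node definitions
    df["Income_binned_label"] = df["Income_binned"].map({True: "<2.5", False: ">=2.5"})

    print(df)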

Fig. 3.130 Option of the derive node to calculate a flag variable

Exercise 13: “Multiple Choice Test Measures of Volatility”

1. Only the mean and the standard deviation can be directly compared, based on
their calculation and unit. Looking at the formula for the standard deviation, we
find that it is being calculated using the mean/average. So we cannot compare
the result with the median.
2. The smallest value for standard deviation is zero. We will get this result if all
the values are the same.
3. The standard deviation is the square root of the variance. Both remain the same in this example. That's because the curve of the frequency distribution is moved to the right, but the shape of the curve is not(!) affected in this scenario. Therefore, the standard deviation also remains the same.
4. The variance rises, and so the standard deviation rises too. Consider an apartment with a rental price of 3 Euro per square meter. The new price is then 3.30 Euro per square meter. If we look at the rental price of a more expensive

Fig. 3.131 Final stream “create_flag_variable”

apartment, e.g., 8 Euro per square meter, it will increase by 0.80 Euro. These examples show us that the frequency distribution of the variable will have longer tails. Consequently, the deviation of the values from the mean will be larger than before. So the standard deviation is higher.
5. The sample with the uniform distribution has the larger variance. That's because in this case we can find many more values near the interval boundaries. That means that the number of large deviations from the mean is greater. As the standard deviation measures this dispersion of the values, the standard deviation is much larger in the case of the uniformly distributed values.
6. For 1,1,9,9 we get the largest standard deviation. See questions 4 and 5 for an
explanation.
7. As a rule of thumb, approximately 88.89–95.45 % of the values can be found in the so-called “2s-interval” from 7 to 13. For an explanation see Sect. 3.2.7 and particularly Table 3.9.
8. Yes.
9. Yes, additional knowledge of standard deviation as a measure of dispersion is
necessary to assess the accuracy of the mean.

Fig. 3.132 Value label definition in the Type node

Fig. 3.133 Frequency distribution of “Income_binned” with value labels



10. Yes, this is true. The variance measures the deviation of the values from the mean. In the formula of the standard deviation, the difference between each value and the mean is squared. This difference is large for an outlier, and squaring boosts the effect, so the standard deviation will definitely be influenced by this outlier.
11. No, this is wrong; the variance carries the squared unit of the values, whereas the standard deviation has the original unit.
12. Yes, because all the differences between the values and their average are zero.
13. Yes, the more skewed a distribution is, the bigger the variance and/or standard deviation. For an explanation see also question 10.

Literature
Aczel, A. D., & Sounderpandian, J. (2009). Complete business statistics (7th ed.). Boston:
McGraw-Hill Higher Education.
Anderson, D. R., Sweeney, D. J., & Williams, T. A. (2014). Essentials of statistics for business and economics (7th ed.).
Box, G. E. P., & Cox, D. R. (1964). An analysis of transformations. Journal of the Royal Statistical
Society Series B (Methodological), 26(2), 211–252.
Chapman, C., & Feit, E. M. (2015). R for marketing research and analytics, use R! Berlin:
Springer.
Fox, J., & Weisberg, S. (2015). Package ‘car’. Accessed June 5, 2015, from http://cran.r-project.
org/web/packages/car/car.pdf
IBM. (2014). SPSS modeler 16 source, process, and output nodes. Available at: ftp://public.dhe.
ibm.com/software/analytics/spss/documentation/modeler/16.0/en/modeler_nodes_general.pdf
Schulz, L. O., Bennett, P. H., Ravussin, E., Kidd, J. R., Kidd, K. K., Esparza, J., et al. (2006).
Effects of traditional and western environments on prevalence of type 2 diabetes in Pima
Indians in Mexico and the USA. Diabetes Care, 29(8), 1866–1871.
Tukey, J. W. (1957). On the comparative anatomy of transformations. The Annals of Mathematical
Statistics, 28(3), 602–632.
Zimmerman, D. W. (1998). Invalidation of parametric and nonparametric statistical tests by
concurrent violation of two assumptions. The Journal of Experimental Education, 67(1),
55–68.
4 Multivariate Statistics

After finishing this chapter, the reader is able to . . .

1. Explain in detail the tools and the process to determine and assess the depen-
dency between variables
2. Explain the difference between a correlation and contingency table
3. Create and analyze correlation matrices and contingency tables and finally
4. Analyze dependencies between variables, as well as explain why the statistical analysis is not intended to replace the analysis of practically relevant facts

So the successful reader is familiar with the process to analyze two or more
variables regarding their dependencies. Furthermore, she or he understands and can
describe the specific steps to produce reliable results and to exclude spurious
correlations.

4.1 Theory

First of all, we should understand the difference between univariate, bivariate, and
multivariate analysis. The univariate analysis we discussed in the previous chapter
enables us to analyze and characterize the frequency distribution of each variable in
a dataset separately. We therefore come to know a lot of measures, e.g., the mean
and the standard deviation. See Sect. 3.2, Exercises 6 and 13.
Dependencies between two or more variables are also of interest, however.
When considering the unquestionable correlation between the price and the sales
volume of a product, or between the income of an employee and the rental price of
his/her flat, we can also show those relationships in a graph.
For the last mentioned correlation, the process of moving from a univariate to a
bivariate statistic is shown in Fig. 4.1. Combining two frequency distributions, we
get a so-called scatterplot. This gives us the chance to measure the type (positive or


Fig. 4.1 From univariate to bi-/multivariate analysis

negative) as well as the strength of the dependency and to determine the parameter
of, e.g., a linear regression model.
We suggest the steps shown in Fig. 4.2 as a general approach to finding out if
there is a dependency between variables and for describing the strength of the
relationship. At the end of this analysis we have created a valid multivariate model.
The steps are explained in this section. Figure 4.3 shows the different topics and the
numbers of the sections.
At this point, we wish to add a word of warning. The concept of correlation, and all the models that are based on it, is often applied to very different scenarios, e.g., to time series. To understand the difficulties, we outline here the difference between a cross-sectional and a longitudinal study.
In a cross-sectional study, a researcher analyzes the objects of interest at only
one point in time. That means, for instance, she/he asks the respondents about their
number of cinema visits in the last year or about their constant monthly payback on
the debt for a recently bought new car. These are snapshots of the person’s behavior
(“statistical objects”) taken just at one point in time.
In a longitudinal study, the researcher conducts the observations many times,
over a period of time. At the end she/he gets a series of values—a time series. So
this is obviously a totally different approach. Table 4.1 shows a summary of both
types of studies.

Fig. 4.2 Analyzing multivariate data

Fig. 4.3 Topics discussed in this chapter

Table 4.1 Cross-sectional vs. longitudinal study

Short definition
  Cross-sectional study: One observation per object at one and the same point in time.
  Longitudinal study: Observations for the same objects but at different points in time.

Example: “How often did you visit a cinema in the last year?”
  Cross-sectional study: Respondents in different cities questioned at the end of year 2014.
  Longitudinal study: Respondents in the same city questioned at the end of year 2005, 2006, 2007, . . . 2014.

Average net income per month
  Cross-sectional study: The data just for April should be analyzed, coming from the statistical office in the form of secondary data.
  Longitudinal study: The (secondary) data from the statistical office for each month in 2014 should be analyzed.

Data representation
  Cross-sectional study, data for 2014: City A: avg. cinema visits 5.4, avg. income 2212; City B: avg. cinema visits 3.1, avg. income 1845.
  Longitudinal study, data for City A: 2005: avg. cinema visits 4.1, avg. income 1798; 2006: avg. cinema visits 3.7, avg. income 1950; . . .

Now the question is, when can the concept of correlation be applied? The answer
is that in general we can calculate and interpret the correlation coefficient in a cross-
sectional study. Otherwise the results cannot be interpreted very easily. Granger
and Newbold showed in many articles that time series tend to have an “apparently

high” correlation. See Granger and Newbold (1974). But despite the probably
moderate or very often strong correlation of two time series, this does not mean
that these variables are “connected”. This is our word of warning. Even in cases of
completely independent series, we can normally find large correlation coefficients. Hence, correlation, as well as other methods based on it, e.g., regression, should be used with caution when the variables are represented by a time series.

" To measure the “correlation” of two variables represented in the form


of time series, we should always keep the concept of stationarity in
mind. This means that the expected value, normally expressed by the
average (and the standard deviation) do not change over time. Usu-
ally, the correlation can only be correctly calculated for the first or the
second differences of the time series values.
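This warning can be illustrated with a small simulation, which is only a sketch and not one of the book's streams: two completely independent random walks often show a sizeable correlation in their levels, while the correlation of their first differences is close to zero.

    import numpy as np

    rng = np.random.default_rng(42)

    # Two independent random walks (non-stationary time series)
    x = np.cumsum(rng.normal(size=500))
    y = np.cumsum(rng.normal(size=500))

    r_levels = np.corrcoef(x, y)[0, 1]                   # often "apparently high"
    r_diffs = np.corrcoef(np.diff(x), np.diff(y))[0, 1]  # usually close to zero

    print(round(r_levels, 2), round(r_diffs, 2))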

4.2 Scatterplot

Description of the model


Stream name test_scores_analysis.str
Based on dataset test_scores.sav
Stream structure

Related exercises: 1

Theoretical Background
With Data Mining in particular, we have to deal with huge datasets and sample
sizes. Scrolling through the records is obviously not the best way to get an idea of
the characteristics of the data. Therefore, we suggest creating diagrams to get a

rough overview of the data structure as well as of the outliers. Here, we want to
demonstrate the possible steps for such an analysis with the SPSS Modeler.
As outlined in Fig. 4.1, multivariate analysis means dealing with at least two
variables. So we will start with a simple example. In the dataset “test_scores”, we
can find data from schools that have been using traditional as well as new/experimental
teaching methods. Furthermore, the dataset includes the pretest and the posttest results
of the students as well as some variables to characterize the students themselves. See
also Sect. 10.1.31. The dependencies between the variables should be examined.

Creating a Scatterplot
First of all a simple scatterplot should be used to get an idea if the pretest and the
posttest results of the students could be dependent. To get access to the data, we use
the template stream “Template-Stream test_scores”. See Fig. 4.4.

1. We open the stream “Template-Stream test_scores” and save it using


another name.
2. We add a node of type “Graphboard” from the “Graphs” section of the SPSS Modeler toolbar. This is a multifunctional graph node that gives us a very good overview of the different chart types the SPSS Modeler offers. This node is also very easy to use.
3. We must make sure that the new node “Graphboard” is connected with the Type node. See Fig. 4.5.
4. We double-click the node “Graphboard”. The dialog window as shown in
Fig. 4.6 appears.
5. The advantage of the Graphboard node is that the settings depend upon the
selected variables. When we click on the variables to the left, the diagram type
on the right side changes. Here, we would like to examine the relationship
between the pre- and the posttest results.

Fig. 4.4 Template-stream “test_scores”

Fig. 4.5 Added Graphboard node to the template stream

Fig. 4.6 Initial Graphboard node settings



Fig. 4.7 Graphboard node settings to create a simple Scatterplot

We select “pretest” and “posttest” on the left side using the Ctrl-Key and the
left mouse-button. On the right side, we scroll down to the simple “Scatterplot”.
We activate this plot by clicking it once. The arrow in Fig. 4.7 shows the actual
status of the dialog window.
6. We click “Run” to close the window and draw the graph. Figure 4.8 shows the
result. In this scatterplot, we can see that there is probably a strong dependency
between the pre- and the posttest.
7. There is also the possibility to create a more valuable scatterplot in the case of
two variables that may be dependent. Using the same dataset, we will show the
usage of the Plot node here. From the section “Graph” of the toolbar, we add a
Plot node and connect it with the Type node. See Fig. 4.9.
8. To define the correct dependency in the graph, we should think about the possible direction of the dependency between both variables: pre- and posttest. Here, we know the causality: the posttest result depends on the pretest result. Therefore, the variable posttest should be assigned to the vertical axis and the pretest to the horizontal axis.

Fig. 4.8 Scatterplot for Pre- and Posttest results

Fig. 4.9 Stream with the Plot node added



" Before creating a scatterplot the direction of the dependency


between the variables should be analyzed. The independent variable
should always be assigned to the horizontal axis and the dependent
variable to the vertical axis in a diagram.

Now we double-click at the Plot node and assign the variables as outlined above.
See Fig. 4.10.

9. We click “Run” to create the scatterplot shown in Fig. 4.11.

The difference from the scatterplot in Fig. 4.8 is that we can see here much more
clearly the concentration of the points in the middle of the plot. Of course, it is
possible to modify the setting of the Graphboard node to get a similar result. What
we would like to show here is that there are two different nodes that can be used to
create scatterplots for two variables. The Graphboard node is the easiest to handle,
but the Plot node shows more details without modifications.
The regression model based on that identified linear relationship between pre-
and posttest can be found in Sect. 5.2.2.
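For readers working outside the Modeler, an equivalent scatterplot can be sketched with pandas and matplotlib. The CSV file name is an assumption (an export of the SPSS dataset used in the stream); the column names follow the dataset description:

    import pandas as pd
    import matplotlib.pyplot as plt

    scores = pd.read_csv("test_scores.csv")  # assumed export of "test_scores.sav"

    # Independent variable (pretest) on the x-axis, dependent variable (posttest) on the y-axis
    scores.plot.scatter(x="pretest", y="posttest", alpha=0.3)
    plt.title("Posttest vs. pretest results")
    plt.show()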

Fig. 4.10 Parameters of the Plot node



Fig. 4.11 Plot node result

4.3 Scatterplot Matrix

Description of the model


Stream name Scatterplot matrix.str
Based on dataset test_scores.sav
Stream structure

Related exercises: 2, 3

Theoretical Background
Here we address the second step of the suggested approach to create valid bi-/
multivariate models. See Fig. 4.12.
When we get a dataset, we usually do not know the dependencies of the
variables. Additionally, datasets usually have more than two variables that we are
interested in. In general, we can use the simple scatterplots outlined in Sect. 4.2, but
to examine a lot of variables we would have to create a lot of scatterplots. Here we
want to show a more efficient procedure.
The idea of a scatterplot matrix (see, e.g., Fig. 4.17) is to create multiple
scatterplots at once. The user is not interested in any detail of the bivariate
distribution. Instead, the aim of such a diagram matrix is to get a “feeling” for the
dependencies and the type of the dependencies.
We use here once more the dataset “test_scores”. In this dataset, we can find
several variables, e.g., pretest, posttest results and teaching method, as well as the
number of students in the class. See also Sect. 10.1.31 for more details. Following
from the idea of the bivariate distribution discussed in Sect. 4.2, we can expect
dependencies between these variables.
The scatterplot matrix is a squared matrix with diagrams in the cells. We will
show how to create and how to interpret this diagram type.

Create and Interpret a Scatterplot Matrix


We start with the template stream “Template-Stream test_scores” also shown in
Fig. 4.13.

1. We open the stream “Template-Stream test_scores”.


2. We add a node of type “Graphboard” from the “Graphs” section of the SPSS Modeler toolbar. See Fig. 4.14.
3. We have to make sure that the new node “Graphboard” is connected with the
Type node. See Fig. 4.14.
4. We double-click the node “Graphboard” and we can see the dialog window in
Fig. 4.15. The details are also outlined in Sect. 4.2.

Fig. 4.12 Steps in multivariate analysis



Fig. 4.13 Template stream “test_scores”

Fig. 4.14 Added Graphboard node to the template stream

5. We would like to examine the relationship between the variables, which are the
number of students in the classroom and the pretest and the posttest results. We
select all these variables on the left side of the dialog window.
Now, on the right side, we see that only diagrams that can visualize more than two variables are available. In particular, the remaining plot “Scatterplot matrix (SPLOM)” can be used to examine the dependency between more than two variables. We select this plot by clicking it. See also the arrow in Fig. 4.16.
6. We click “Run” to close the dialog window. The operation is often time-
consuming, so it takes a while until the plot appears. Figure 4.17 shows the result.

The order of the variables (from left to right: number of students, pretest and
posttest) on the horizontal and the vertical axis is the same. The diagonal represents

Fig. 4.15 Initial Graphboard node settings

the frequency distribution of the different variables. Indeed, it makes no sense to create a scatterplot here, because the dependency of a variable upon itself is obvious; the only thing we would get are points on a diagonal in the diagram. Instead, we can inspect the shape of the distribution here and spot possible outliers.
Additionally, we can see that the scatterplots above and below the diagonal, from the upper left corner to the bottom right corner of the matrix, are identical. Normally, it would be enough to create and inspect just one half of this matrix.

" A scatterplot matrix is a squared matrix with diagrams in the cells. The
order of the variables on the horizontal and the vertical axis is the
same. Additionally, the scatterplots above and below the diagonal
from the left upper corner to the right bottom corner of the matrix are
identical (symmetrical matrix). The main diagonal of the matrix
consists of the frequency distribution of the variables.

We find the expected pattern within the diagrams of pre- and posttest
relationships, e.g., in the middle of the last row and the last column of the matrix.
This is identical with the scatterplot we created in Sect. 4.2.

Fig. 4.16 Graphboard node settings for a simple Scatterplot

In addition, we can find the relationship between the number of students in the
classroom and the pretest or the number of students in the classroom and the
posttest. Indeed, it makes sense that the correlation is negative. See the diagram
in the middle of the first column of Fig. 4.17 for pretest vs. number of students and
the diagram in the bottom left corner for posttest vs. number of students.

" To have a better overview of the details of the analysis, as well as to


have documentation of the research process, we suggest printing the
scatterplot and marking the “strong” or “interesting” correlations. In
this first analysis step, we do not ask if the correlation is meaningful or
even spurious. This procedure simply ensures that we can also find
unexpected correlations or dependencies.

Fig. 4.17 Scatterplot matrix for number of students vs. pre- vs. posttest results

We found out that it is a good idea to create scatterplots for all variables, so that
we can identify (hidden or unknown) correlations. Of course it is also necessary to
inspect them using mathematical measures, e.g., the correlation coefficient. Nevertheless, we strongly suggest creating such diagrams to get a first rough overview of
the data/information. You can find a linear regression model of the relationship
between pre- and posttest in Sect. 5.2.2.
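A comparable SPLOM, with the frequency distributions on the main diagonal, can be sketched with pandas; again, the file name is an assumed export of the SPSS dataset:

    import pandas as pd
    import matplotlib.pyplot as plt
    from pandas.plotting import scatter_matrix

    scores = pd.read_csv("test_scores.csv")  # assumed export of "test_scores.sav"

    # Scatterplot matrix with histograms of each variable on the main diagonal
    scatter_matrix(scores[["n_student", "pretest", "posttest"]],
                   diagonal="hist", alpha=0.3, figsize=(7, 7))
    plt.show()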

4.4 Correlation

Description of the model


Stream name correlation calculation.str
Based on dataset test_scores.sav
Stream structure

Related exercises: 1, 2, 3, 4

Theory
Here we address the second step of our suggested approach to analyze datasets and to create valid bi-/multivariate models. See Fig. 4.12.
In addition to a graphical analysis of the data using scatterplots, it is helpful and
necessary to calculate measures for correlations. Usually in statistical programs, a
correlation matrix would be used. As the SPSS Modeler focuses upon data mining and therefore large datasets with many variables, these matrices would be hard to handle because of their size. Probably due to this fact, IBM chose to calculate the correlations but to present them in the form of a list. We will show how to create and interpret these results.
Let us also add some notes related to the statistics. In Sect. 3.1.2, we distin-
guished between nominal, ordinal, and metric variables. Here we should remember
that theory. It is easy to determine the difference between two values measured on a
metric scale (at least interval scaled). The dependency between two metric
variables can be measured by using the co-variance. This measure can be
standardized using the product of the standard deviations of each variable. At the
end of this calculation, we get the so-called “Pearson’s correlation coefficient”.

Table 4.2 Value of the correlation coefficient and the strength of the correlation

Absolute value of the correlation coefficient     Interpretation
1      Perfect correlation
0.8    Strong correlation
0.5    Moderate correlation
0.3    Weak or very low correlation, probably no dependency between the variables
0      No correlation

That is the measure determined by the Modeler. See IBM (2015a, p. 268). It measures only the linear part of the relationship between both (metric) variables. However, it is a standard measure for metric values and is also often used when only one of the variables involved is metric, although the other variable then has to be at least ordinal. Ordinal input variables normally ask for “Spearman's rho” as an appropriate measurement. See for instance Scherbaum and Shockley (2015, p. 92). The Pearson's correlation coefficient, however, is then an approximation based on the assumption that the distances between the scale items are equal. Details of how to deal with ordinal variables and how to use in particular
the so-called “polychoric correlations”, for ordinal scaled variables, can be found in
Drasgow (2004). The algorithm tries to determine normally distributed continuous
variables (latent) behind the ordinal scaled variables and then to measure the
association between them.

" The Modeler determines Pearson’s correlation coefficients that are


normally appropriate for metric (interval or ratio) scaled variables
only. Assuming constant distances between the scale items, the
measure can also be used for ordinal scaled variables, but this is
only an approximation.

The correlation coefficient can take on values in the interval −1 to +1. If the correlation coefficient is zero, there is no (linear) correlation between both variables. In the case of a correlation coefficient of exactly −1 or +1, all the values are on a straight line. Table 4.2 shows typical interpretations of the strength of the correlation coefficient.
The SPSS Modeler calculates the Pearson’s correlation. The following example
demonstrates how to calculate this measure. We use the dataset “test_scores”. In
this dataset we can find several variables, e.g., pretest/posttest results and the
number of students in a class. Following the idea of bivariate dependency we can
expect other dependencies between these variables.

Calculation of the Correlation Coefficient


We start with the template stream “Template-Stream test_scores”. See Fig. 4.18.

1. We open the stream “Template-Stream test_scores”.


2. Now we add from the SPSS Modeler toolbar in section “Output”, a node entitled
“Statistics”. See Fig. 4.19.
3. We have to be sure that the new node “Statistics” is connected with the Type
node. See Fig. 4.19.
4. We double-click the Statistics node. Now we can see the dialog window as
shown in Fig. 4.20.

Fig. 4.18 Template stream “test_scores”

Fig. 4.19 Statistics node is added to the template stream



Fig. 4.20 Initial status of the Statistics node

5. The correlations between the variables can be calculated in the second part of the
window at the bottom. Firstly, we suggest adding all the variables of interest into
the upper window. To do this we click the button marked with an arrow in
Fig. 4.20. Holding down the keyboard’s Ctrl key, we select the variables
“n_student”, “pretest”, and “posttest”, and we click “OK”. We then get
the dialog window as shown in Fig. 4.21.
6. In the next step, we must define the variables with which the correlations are
calculated. We click the button on the right side of the dialog window as marked
in Fig. 4.21 with an arrow.
7. Then we add the same three variables “n_student”, “pretest”, and “posttest” to
the list and we click “OK”. Figure 4.22 shows the result of this operation.
8. We click “Run” to close the dialog window and to get the results. Figure 4.23
shows two of three correlation coefficients. If we scroll down we can find the
third.
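The same pairwise Pearson correlations can also be reproduced outside the Modeler, for example in Python. The following sketch assumes the column names n_student, pretest, and posttest used above; the file path is only a placeholder, and reading .sav files with pandas requires the pyreadstat package.

import pandas as pd
from scipy import stats

# Read the SPSS file (placeholder path; requires pyreadstat for .sav files)
df = pd.read_spss("test_scores.sav")

# Pairwise Pearson correlations, analogous to the list produced by the Statistics node
pairs = [("n_student", "pretest"), ("n_student", "posttest"), ("pretest", "posttest")]
for a, b in pairs:
    r, p = stats.pearsonr(df[a], df[b])
    print(f"{a} vs. {b}: r = {r:.3f} (p = {p:.4f})")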

In relation to Table 4.2, the question at hand is why the correlations in Fig. 4.23
are described as strong. There are two options offered in the Statistics node to
determine the strength of a correlation. If we click the button “Correlation Settings
. . .”, also shown in Fig. 4.22 at the bottom, then a new dialog window will be
opened (Fig. 4.24).

Fig. 4.21 The first step in calculating the correlation coefficient
By default, the Modeler labels the correlations based on inverse significance
(1 − p-value, a number between 0 and 1). The null hypothesis is that the correlation is zero.
The larger the 1 − p-value, the larger the chance that both variables are correlated.
Table 4.3 shows the details, based on IBM (2015b, p. 296).
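To make the default labelling rule concrete, the following small Python sketch applies the thresholds of Table 4.3 to a given p-value. The thresholds are taken from Table 4.3; everything else is purely illustrative.

def label_correlation(p_value):
    """Label a correlation the way the Statistics node does by default (1 - p, see Table 4.3)."""
    inverse_significance = 1.0 - p_value
    if inverse_significance > 0.95:
        return "Strong"
    elif inverse_significance > 0.9:
        return "Medium"
    return "Weak"

print(label_correlation(0.02))   # Strong: 1 - p = 0.98
print(label_correlation(0.08))   # Medium: 1 - p = 0.92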

" The Modeler offers two options in the Statistics node for assessment
of correlations. By default the values are labeled based upon their
inverse significance. The larger the value, the more reliable the determined correlation. One can decide to label the correlations based on
their absolute value too. We recommend using the default option.

The final stream can be found as stream “correlation calculation.str”.


The way the SPSS Modeler presents the results makes it difficult to interpret
them. Usually, a researcher expects to get a correlation matrix as shown in
Table 4.4.
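If the result is wanted directly in matrix form, as in Table 4.4, a data frame method such as pandas’ corr() produces it in one step. This is a sketch outside the Modeler, assuming the same three columns as above and a placeholder file path.

import pandas as pd

df = pd.read_spss("test_scores.sav")  # placeholder path; requires pyreadstat
corr_matrix = df[["n_student", "pretest", "posttest"]].corr(method="pearson")
print(corr_matrix.round(3))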

Fig. 4.22 The second step in calculating the correlation coefficient

We should scroll through the Modeler results and identify all correlations that
are at least approximately moderate, that is, with an absolute value equal to or
larger than 0.5. See Table 4.2 for details. Here all the correlations are at least
approximately moderate, so we should have a look at all of them. The next step is
to identify and exclude spurious correlations.

" We suggest creating a scatterplot matrix and marking the candidates


of paired variables that could have a dependency. In the next step,
the correlation coefficients between all the variables should be calcu-
lated. Then at least moderate coefficients should be marked. Only at
the end of this process should spurious correlations be excluded
based upon expert knowledge. See Sect. 4.6. This procedure
ensures that hidden correlations are identified. It is not a good idea
to examine pairs of variables where a dependency is expected.
Unknown correlations would then remain undiscovered.

Fig. 4.23 Results for the Pearson’s correlation coefficients



Fig. 4.24 Settings to assess the strength of correlations

Table 4.3 Labels assigned by default in the Modeler’s Statistics node to the correlations

Inverse significance 1 − p       Label and interpretation
0 up to 0.9                      Weak: The correlation between both variables is questionable
Larger than 0.9 up to 0.95       Medium: The correlation between both variables can exist
Larger than 0.95                 Strong: Both variables are correlated

Table 4.4 Correlation matrix for dataset “test_scores”

            n_student   Posttest   Pretest
n_student               0.505      0.499
Posttest                           0.951
Pretest

4.5 Correlation Matrix

Description of the model


Stream name correlation matrix calculation
Based on dataset test_scores.sav
Stream structure

Related exercises: 3, 4

In Sect. 4.4, we discussed how to calculate the correlations and to summarize
the findings in the form of a correlation matrix. The outlined procedure can be used
in all cases. However, calculating the correlation coefficients and arranging them
manually in the form of a matrix is cumbersome; we would therefore like to propose
another, more convenient method.
We would like to improve the correlation calculation stream described in
Sect. 4.4. Therefore we start by using the results of the previous section and the
stream we created there.

" The correlation matrix can be calculated by using a Sim Fit node, but
there are some restrictions, especially if many values are missing.
Furthermore, the approximation of the distributions as determined
can result in misleading correlation coefficients. The Sim Fit node is
therefore a good tool, but the user should verify the results by using
other functionalities, e.g., the Statistics node.

Fig. 4.25 Template stream “test_scores”

Fig. 4.26 Sim Fit node is added to the stream

1. We open the stream “correlation calculation.str”. See Fig. 4.25.


2. We use the “File\Save Stream as . . .” option to save it with another name and to
avoid changes in the original stream. Our stream, with the solution, can be found
in “correlation matrix calculation.str”.
3. Now we add from the SPSS Modeler toolbar in section “Output” a node of type
“Sim Fit”. We connect the node with the Type node. See Fig. 4.26.

Fig. 4.27 Parameter of the Sim Fit node

4. Now we double-click the Sim Fit node to open its parameter dialog, as shown in
Fig. 4.27.
The Sim Fit node tries to find a distribution that best fits the values of each
variable. If we have a dataset with a huge sample size, we can restrict the
number of values used to estimate the distribution. The option “number of cases
to sample” in the node parameters gives us the chance to avoid long sampling
procedures by setting a different sample size. See the option marked by an arrow
in Fig. 4.27.
Furthermore, we can determine the statistic used to find the “correct”
type of distribution. Here, the Anderson–Darling and the Kolmogorov–Smirnov
criteria are offered by the Modeler. Both statistics can be used in the case of
continuous variables. The Anderson–Darling statistic generally gives the best
results, however, especially in the tails of the distribution. See also Vose (2008,
p. 292). We use this criterion here and close the dialog window with “OK”.
5. Now we can run the new part of the stream. We click with the right mouse button
on the Sim Fit node and choose “Run”.
6. The Modeler determines the distribution of values that fits best, in terms of the
criterion activated above. A Sim Gen node is added to the stream. See Fig. 4.28.
7. We can inspect the results by double-clicking on the Sim Gen node. Figure 4.29
shows the distributions determined by the Modeler in the fourth column.

Fig. 4.28 Sim Gen Node is added to the stream

Fig. 4.29 Determined distributions in the Sim Gen node



Fig. 4.30 Determined correlation matrix in the Sim Gen node

8. We would now like to inspect the correlations. After activating the option
“correlations” on the left side of the dialog, the Modeler shows us the result.
The symmetrical matrix is shown in Fig. 4.30. The matrix created here by the
Modeler equals the matrix presented in Table 4.4.

The advantage of the Sim Fit node is that it easily calculates the correlation
matrix. Nevertheless, there are some pitfalls because of the approximation of the
frequency distributions, particularly when there are a lot of missing values.
Interested readers are referred to IBM (2014, pp. 275–276). The approximation
criteria of Anderson–Darling and Kolmogorov–Smirnov are explained by
Vose (2008, pp. 292–295). This author also prefers the Anderson–Darling test
statistic.
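The goodness-of-fit checks mentioned above can also be reproduced outside the Modeler. The following Python sketch fits a normal distribution to one column and evaluates the fit with the Kolmogorov–Smirnov and Anderson–Darling statistics from scipy; the column name and the choice of distribution are only examples, and the file path is a placeholder.

import pandas as pd
from scipy import stats

df = pd.read_spss("test_scores.sav")   # placeholder path; requires pyreadstat
x = df["posttest"].dropna()

# Fit a normal distribution and check the fit with two classical test statistics
mu, sigma = stats.norm.fit(x)
ks_stat, ks_p = stats.kstest(x, "norm", args=(mu, sigma))
ad_result = stats.anderson(x, dist="norm")

print(f"KS statistic = {ks_stat:.3f}, p-value = {ks_p:.3f}")
print(f"Anderson-Darling statistic = {ad_result.statistic:.3f}")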

4.6 Exclusion of Spurious Correlations

In this section, we do not present a new statistical method or SPSS Modeler
functionality. In the previous sections, we discussed methods to examine datasets
and relationships between variables. Alongside the source of the values and their
meaning, we also marked noticeable dependencies based on the scatterplot.
Furthermore, we identified at least moderate correlations based on the mathematical
measure “Pearson’s correlation coefficient”.

Comparing the results of the scatterplot and the correlation matrix, we can
identify pairs of variables that are possibly dependent. Unexpected dependencies
will probably also become apparent here.

" Spurious correlations can only be removed with expert knowledge.


There is no procedure to find them using statistics.

The methods used are based on the values alone, but correlation does not imply causation.
That is why we have to ask whether the identified pairs of variables could really be
dependent. In other words: is there any chance of a logical dependency between the
variables? Or, in this example: is there any chance that the number of students in the
classroom influences the test results? For all the correlations that are identified and
summarized as at least moderate in Table 4.4, the answer is yes. They are not
spurious. We do not have to exclude any dependency here.
We will show in the following exercises that this is very often not the case,
however. We will find examples of spurious, or at least questionable, correlations.
An example of such a questionable dependency is the possible correlation between the number of
people with cancer in a region and the presence of a nuclear power plant or the
number of technical failures at that plant.

4.7 Contingency Tables

Description of the model


Stream name contingency_tables
Based on dataset test_scores.sav
Stream structure

Related exercises: 5

Theory
In Sect. 4.4, we discussed how to calculate Pearson’s correlation coefficient. This
measure gives us the chance to determine the strength of a linear relationship
between two variables, if they are either continuous or at least ordinal (Fig. 4.31).
In the case of discrete values, and especially nominal or ordinal variables, we
can count the values and calculate their pairwise frequency. Alternatively, we can
bin metric values as described in Sect. 3.2.7. At the end of the process we get a
contingency table. To determine if both variables are independent, the Chi-square
test can be used.
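For reference, the test statistic compares the observed cell frequencies $O_{ij}$ of the contingency table with the frequencies $E_{ij}$ expected under independence; this is the standard textbook definition:

$$ \chi^2 \;=\; \sum_{i}\sum_{j}\frac{\left(O_{ij}-E_{ij}\right)^2}{E_{ij}}, \qquad E_{ij} \;=\; \frac{(\text{row total}_i)\,(\text{column total}_j)}{n} $$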

" The bivariate frequency of nominal or ordinal variables can be


represented the form of a table. Then one variable would appear in
the rows and the other one in the columns. Such cross tabulations or
crosstabs are also referred to as contingency tables. Metric variables
can also be used after a binning procedure has been applied.

" To test the independency or the dependency of variables, a


Chi-square test of independence must be used as follows:

1. The null hypothesis is that both variables are statistically independent;


2. The Chi-square value and its significance level in the form of a probability
must then be obtained.
Then we compare the determined probability X with 5 %;
3. If the probability exceeds 5 %, the null hypothesis should be rejected.
In other words, we would make a mistake with X % probability if we
reject the hypothesis.

" Important remarks:

– This test should only be used if the expected frequencies in all or at


least 80 % of the cells are larger than five.

Fig. 4.31 Steps in multivariate analysis



– The test cannot be used to determine the direction and the strength
of the relationship between the variables.

The test as described in the box tells us “only” whether we have to drop the assumption
that both variables are independent. If we cannot reject the null hypothesis, we can say
that the data give us no reason to expect a relationship. If we reject it, we can assume
that both variables are dependent, or in other words, that there is a contingency.
We cannot use the chi-square test of independence to determine the direction of
the contingency, however, and additionally we cannot measure the
strength of the contingency. The Modeler also has limitations in this regard. If we
would like to determine Cramer’s V or the Phi coefficient as a measure of the
strength of contingency, we can, for example, use R to calculate them.

Using the Chi-square Test of Independence


Once more we want to use the “test score” example. In the dataset “test_scores.sav”,
we can find several nominal and therefore discrete variables as a basis for a
contingency table. Table 4.5 shows the variable names and their meaning.
Here we want to examine whether the teaching method and the type of school are
independent. That means we ask: is there a dependency between the teaching method
(standard or experimental) and the type of school (public or non-public)?

1. We start with the stream “Template-Stream test_scores”. See Fig. 4.32.


2. First let’s examine the scale types of the different variables determined in
the Type node. Figure 4.33 shows us that the variables “school_type” and
“teaching_method” are nominal.
3. To determine the bivariate frequency distribution in the form of a contingency
table, we add from the Output tab of the Modeler a Matrix node and connect it
with the Type node. See Fig. 4.34.
4. To modify the parameter of the Matrix node, we double-click it. First of all we
select in the Settings tab the variables “teaching_method” and “school_type”.
See Fig. 4.35.

Table 4.5 Selection of variables included in the dataset “test_scores.sav”

Field name          Description
school_setting      School setting: 1 = Urban, 2 = Suburban, 3 = Rural
school_type         School type: 1 = Public, 2 = Non-public
teaching_method     Teaching method: 0 = Standard, 1 = Experimental
gender              Gender: 0 = Male, 1 = Female
lunch               Reduced/free lunch: 1 = Qualifies for reduced/free lunch, 2 = Does not qualify

Fig. 4.32 Template stream “test_scores”

Fig. 4.33 Types of variables in the stream “test_scores”

5. No other modifications are necessary and so we can run the node.


6. In Fig. 4.36, the contingency table can be analyzed. To show the variable and
the value labels, we use the label button in the middle of the dialog window. In
Fig. 4.36, this button is marked with an arrow.

Fig. 4.34 Matrix node is added to determine the contingency table

Fig. 4.35 Determination of the variable names in a Matrix node

As we can see, 1087 schools are public and use standard teaching methods.
Later we will explain a better method to assess the dependency. For now we
want to look at the test results for the chi-square test of independence.

Fig. 4.36 Absolute frequencies and the Chi-square test results in a Matrix node

7. At the bottom of the dialog window in Fig. 4.36, we can find the results of this
test. Let’s focus here on the interpretation of the probability.
(i) The null hypothesis of the test is that there is no dependency between the
school type and the teaching method.
(ii) We compare the determined probability of 0 % with the 5 % significance level.
(iii) As the determined probability of 0 % is smaller than 5 %, we can reject the
null hypothesis. In other words, we make a mistake with a probability of 0 %
by rejecting it. At a confidence level of 95 %, we can therefore reject the
hypothesis of independence.
So the data give us reason to assume a dependency between both variables.
Figure 4.37 once more summarizes the procedure for using the Chi-square test
of independence. The option to determine the expected frequencies is shown in
Fig. 4.39.
8. To have the chance to look at the data from a different angle, we want to add
here another Matrix node to the stream. The variable settings are the same as
those shown in Fig. 4.35. Figure 4.38 shows the final structure of the stream.
9. In the Matrix node at the bottom, we make the following changes to the
parameters: as shown in Fig. 4.39, we activate the tab “Appearance”. Here, we
choose the options “Percentage of column” and “Include row and column
totals”. After that we can run the Matrix node.

Fig. 4.37 Contingency table and the Chi-square test statistics

Fig. 4.38 Final stream “contingency_tables”

Fig. 4.39 Determining relative frequencies in a contingency table



10. Of course the test result from the Chi-square test is the same in Fig. 4.40 as it is
in Fig. 4.36, but we can now find the relative frequencies of the teaching
method dependent on the school type in each column. The process is also
explained in Fig. 4.41.

Fig. 4.40 Relative frequencies in a contingency table

Fig. 4.41 Determining dependencies using a frequency table



If the teaching method were independent of the school type, we should find in
each column approximately the same percentages in the same order. As we can
easily see here, this is not the case. So we can guess, just from looking at this
type of contingency table with relative frequencies, that there is a dependency
between the teaching method and the school type. The Chi-square test gives us
the statistical justification for that conclusion.
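Such a contingency table with column percentages can also be reproduced outside the Modeler, for example with pandas. The column names follow the variables used above; the file path is only a placeholder.

import pandas as pd

df = pd.read_spss("test_scores.sav")   # placeholder path; requires pyreadstat

# Contingency table of teaching method vs. school type with column percentages,
# analogous to the Matrix node with "Percentage of column"
table = pd.crosstab(df["teaching_method"], df["school_type"], normalize="columns") * 100
print(table.round(1))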

4.8 Exercises

Exercise 1: Understanding the Pearson’s Correlation Coefficient


The aim of this exercise is to understand the meaning of Pearson’s Correlation
Coefficient. To find the correct interpretation, please complete the following:

1. Open the stream “Pearson’s Correlation EXERCISE”. Two variables x and y are
included in the dataset “pearson_correlation.sav”. Open the stream and make
sure you have access to the dataset. Use the Table node to check the connection.
Additionally examine the values. Both variables x and y are probably “dependent”. This should be examined here step-by-step.
2. Following the procedure outlined in Fig. 4.2 in Sect. 4.1, one diagram should be
created, of the frequency distribution of x and y separately. Please use the
appropriate node and create these diagrams now. Describe your findings.
3. The next step in Fig. 4.2 is to create a diagram with bivariate distribution,
referred to as a scatterplot. Add a node to the stream and create the diagram.
Try to identify if there is a dependency between x and y and describe it.
4. Now in the last step, the Pearson’s Correlation Coefficient must be calculated.
Interpret your findings. Outline how the consequences of this analysis show the
necessity of the different steps suggested and summarized in Fig. 4.2.

Exercise 2: Correlation of Benchmark Test Results of Computer Processors


Benchmark tests are used to identify the performance of computer processors. The
file “benchmark.xlsx” contains a list of AMD and Intel processors with their prices in
Euro and the results of a benchmark test performed with the test program
Cinebench 10. The variable name for the Cinebench result is “CB”.
For details see also Sect. 10.1.3.

1. Create a new stream and import the data.


2. Inspect the different values of the variable “firm”.
3. Visualize the bivariate distribution of “CB” and “price” in a scatterplot. Use
different colors for AMD and Intel processors.
4. The correlation between the performance (“CB”) and the price should now be
determined. Please determine the Pearson’s Correlation Coefficients separately
for both firms. A selection procedure should probably be used.
5. Summarize your findings.

Exercise 3: Dependencies between IT-Project Characteristics


Before starting an IT project and implementing the source code, a company has to
estimate the costs of the project and the time it will take. The file “IT-projects.txt”
contains the data of several projects: the delivered source instructions in thousands
(KDSI = “Kilo Delivered Source Instructions”), the time for development (TDEV),
and the cost in person months (PM). For details see also Sect. 10.1.19. We should
now examine the dependencies between the variables:

1. Plan the steps of the analysis using the outlined process in Sect. 4.1.
2. Create a stream, import and examine the data. Outline their meaning.
N.B. Alternatively the template stream “Template-Stream IT_Projects.str”
can be used.
3. Based on the meaning of the variables try to find an important relationship that
can be used to determine the time for development. Then try to identify a
relationship to determine the number of person months.
4. Examine the relationship between the three variables. Create a correlation
matrix. Outline your findings.

Exercise 4: Determining a Correlation Matrix


In the previous exercise, we determined the correlation coefficients for the dataset
“IT projects”. This “IT-projects.txt” contains the data of several projects: the
delivered source instructions in thousands (KDSI), the time of development
(TDEV), and the cost in person months (PM). See also Sect. 10.1.19. We used
the Statistics node and calculated the correlations step-by-step.
Now this stream “Correlation IT-project variables.str” should be modified. Open
the stream and save it with another name. Then modify the stream so that the
correlation matrix will be calculated.

Exercise 5: Analyzing Dependencies


Table 4.6 shows the nominal scaled variables in the dataset “test_scores.sav”.
For details see also Sect. 10.1.31.

1. Please use an appropriate test to assess the dependency or independence of


i. Teaching method and gender,

Table 4.6 Selection of variables included in the dataset “test_scores.sav”

Field name          Description
school_setting      School setting: 1 = Urban, 2 = Suburban, 3 = Rural
school_type         School type: 1 = Public, 2 = Non-public
teaching_method     Teaching method: 0 = Standard, 1 = Experimental
gender              Gender: 0 = Male, 1 = Female
lunch               Reduced/free lunch: 1 = Qualifies for reduced/free lunch, 2 = Does not qualify

ii. School type and gender and


iii. School type and reduced fees for lunch.
2. Explain your findings in detail. Additionally, use the dependent relative
frequencies as an argument for your results. See Sect. 4.7.

4.9 Solutions

Exercise 1: Understanding the Pearson’s Correlation Coefficient


Name of the solution stream Pearson’s Correlation SOLUTION
Theory discussed in section Section 4.2
Section 4.4

Figure 4.42 shows the final stream. It is the basis for answering the questions in this
exercise. We will outline the findings below.

1. Figure 4.43 shows the values of both variables. With a mathematical eye, we can probably
spot the formula used to calculate the y-values: obviously it is y = x². So there is a strong
connection that can be expressed by a formula, and both variables are dependent.
2. In Fig. 4.44 we can see the histogram of variable y.

Fig. 4.42 Stream “Pearson’s Correlation SOLUTION”



Fig. 4.43 Table with the records of the dataset

Fig. 4.44 Histogram of variable y



3. Using a Graphboard node we get the diagram in Fig. 4.45. We can also find a
strong relationship between x and y.
4. Using the Statistics node we can calculate the Pearson’s Correlation. Figure 4.46
shows the settings of the Statistics node parameter to calculate the correlation
and Fig. 4.47 shows the result. The correlation is zero. To interpret the result
correctly, we have to remember the definition of the coefficient that we discussed
in Sect. 4.4. It measures only the linear proportion of the relationship between
both variables. In this example, we have a very strong quadratic dependency
between x and y.
In order to use the correlation coefficient, it is obviously necessary to add a
graphical analysis into the research process. Hence, we have suggested the
combination of mathematical and graphical analysis in our approach. Using
this procedure, we make sure that we reveal “hidden” dependencies between
variables.

Fig. 4.45 Scatterplot of x and y



Fig. 4.46 Statistics node setting for Pearson’s Correlation between x and y

Fig. 4.47 Pearson’s Correlation Coefficient



Exercise 2: Correlation of Benchmark Test Results of Computer Processors


Name of the solution stream Correlation_processor_benchmark
Theory discussed in section Section 4.3
Section 4.4

1. Figure 4.48 shows the stream to find all the solutions. On the left side we can find
the Excel File node that gives access to the data.

Fig. 4.48 Stream “Correlation_processor_benchmark”



2. To inspect the different values of the variable “firm”, we can use the Table node
below the Excel File node in Fig. 4.48. Figure 4.49 shows the results. The
variable “firm” in the first column can have two values “Intel” and “AMD”.
We can also use the Type node. Figure 4.50 shows the result.
3. To create a scatterplot for “CB” vs. “price” we can use the scatterplot as outlined
in Exercise 2 in Sect. 5.2.6. Here we use a functionality of the Graphboard node
to determine the different colors of the symbols.
As a first step, in the dialog window of the Graphboard node we select the
variables “CB” and “EUR”, as well as scatterplot for the diagram type. See
Fig. 4.51. Additionally, we have to click on the “Detailed” tab, to determine the
different colors. Figure 4.52 shows the parameters. In the upper right corner of
the dialog window we have to select the variable “firm” to determine the colors
of the circles within the graph.

Fig. 4.49 Table node results, e.g., for the variable “firm”

Fig. 4.50 Type node and the different values for “firm”

In Fig. 4.53, we can see a clear correlation, or more precisely a linear dependency,
between the price and the performance measured in Cinebench points. Additionally,
we can identify at least one outlier. This is an Intel processor with an
exorbitant price.
4. To calculate the Pearson’s Correlation Coefficient separately for both firms we
use Select nodes. To be sure to define the correct selection statements, we
addressed the different values of the variable “firm” in question one of this
exercise. Now we know that there are data for “Intel” and “AMD”.
Figure 4.54 shows an example of the selection statement of the first Select
node in Fig. 4.48. To make sure the selection statement really works, we added a
Table node. Figure 4.55 shows the correct result. We used the same procedure
for the second Select node, where the selection statement is firm = “AMD”.
Furthermore, we added the known Statistics node to each part of the stream.
So we get the correlation coefficient between the performance (“CB”) and the
price of the processors. Figure 4.56 shows the parameter of the Statistics node. In
Fig. 4.57, we can find the result for the correlation coefficient. As expected from
the graphical analysis, we get a strong correlation with 0.809 for Intel. With the

Fig. 4.51 Graphboard node settings for the scatterplot

same procedure and the second Statistics node behind the other Select node in Fig. 4.48,
we find a correlation of 0.986 for the AMD processors.
5. As expected because of the outlier among the Intel processors, the correlation for
AMD is stronger. Finally, we could exclude the outlier and determine the values
once more. This is part of Exercise 2 in Sect. 5.2.6, where we use the
dependency in regression models.
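Outside the Modeler, the per-firm correlations of this exercise could be reproduced roughly as follows in Python. The column names firm, CB, and EUR follow the variables mentioned above; the file name is a placeholder, and reading Excel files with pandas requires openpyxl.

import pandas as pd

df = pd.read_excel("benchmark.xlsx")   # placeholder path; requires openpyxl

# Pearson correlation between benchmark result and price, separately per manufacturer
for firm, group in df.groupby("firm"):
    r = group["CB"].corr(group["EUR"], method="pearson")
    print(f"{firm}: r = {r:.3f}")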

Exercise 3: Dependencies Between IT-Project Characteristics


Name of the solution stream Correlation IT-project variables
Theory discussed in section Section 4.3
Section 4.4
Section 4.5

Fig. 4.52 Graphboard node settings for different colors in a scatterplot

1. Figure 4.2 provides the steps of an analytical process. We will follow these steps
here also. First we examine the data by using charts. Then we try to measure the
strength of the correlation.
2. To import the data that are given in the form of a simple text file we use the
Variable File node. See also Exercise 5 in Sect. 3.2 for details.
To show the variables and their values we add a Table node. After that we
typically use a Type node to determine the correct scale type of the variables.
Figure 4.58 shows the final stream. Figure 4.59 shows the variables and some
values.

Fig. 4.53 Scatterplot Cinebench results vs. price

Fig. 4.54 Parameter of the Select node



Fig. 4.55 Result of the selection procedure shown with the Table node

Fig. 4.56 Parameter of the Statistics node to determine the correlation coefficient

Fig. 4.57 Results of the statistics node for Intel-processors

To examine the relationship between the three variables, i.e., the time for development
of the software (TDEV), the person months (PM), and the delivered source
instructions in thousands (1 KDSI = 1000 instructions), we first have to understand their
meaning. See Table 4.7.
3. If we now think about the necessary steps to calculate a price for the software,
we have to describe the development process. First of all, we have to understand
the customer’s requirements. Based on these insights, we can try to estimate
the number of source instructions (expressed in KDSI) we have to write. Using this
approximation, we can create the project plan with the parallel steps in the development

Fig. 4.58 Final stream to examine the IT project data

Fig. 4.59 Variables in the IT project dataset



Table 4.7 Variables in “IT-projects.txt” and their meaning


Project number ID of the project. This variable is not needed for an analysis,
because we do not expect a relationship between this and any
other variable.
Time for development of the The time that is necessary to write the code/instructions and to
software (TDEV) deliver the application to the customer.
Person months (PM): The number of person months the development process will take, i.e., the total effort.
PM is different from the time for development (TDEV) and typically larger, because more than one
employee works in the development process. Example: Three programmers and one project manager need
3 months to write the code. Then TDEV (time for development) is 3 months, or 3 × 20 = 60 working days.
If all four work 80 % of their time on the project, PM equals approximately
4 employees × 3 × 20 × 0.8 = 192 person-days.
1000 (K) delivered source Number of source instructions to write. Normally, this number
instructions (KDSI) equals the sum of the number of commands and the number of
comments.

Fig. 4.60 Univariate Analysis of the variables in the IT project dataset

process. As a result, we get the number of days the programmers and the project
manager will need to write the code. Based on this data, we can estimate
the delivery date. We can summarize the dependencies as follows:
KDSI → PM → TDEV.
4. Analysis of the dependencies between the variables:
To get a first impression of the data, we can use a Data Audit node. Figure 4.60
shows the univariate analysis of the variables. Here, we can also find some
measures of central tendency, volatility, and skewness. The longer right tails of

Fig. 4.61 Scatterplot of the variables in the IT project dataset

the distributions of PM and KDSI and their maxima are particularly interesting. We can see that there are some outliers that we have to deal with. For details,
see also the regression models in Exercise 3 of Sect. 5.2.6.
5. We create a scatterplot with the Graphboard node. See Fig. 4.61. In the middle of
the last row, we can definitely see an almost linear relationship between KDSI
and PM. Additionally, we have concerns about the linearity of the relationship
between PM and TDEV (first diagram in row 2). As we will see in Exercise 3 of
Sect. 5.2.6, a polynomial relationship describes that dependency best.
That’s because the chance to write code in parallel depends on the complexity of
the software. The more complex the program, the more modules can be written
at the same time, and the number of months needed to develop the software decreases.
Later we will use a logarithmic transformation of TDEV to create a valuable
model. Alongside these findings, for now we can only measure the linear proportion
of the correlation. For details of the correlation coefficient see Sect. 4.4 and
Exercise 1. Figure 4.62 shows the result of the Statistics node. Table 4.8 shows
the rearranged correlations in the form of a correlation matrix.

Fig. 4.62 Pearson’s correlations between the variables

Table 4.8 Correlation matrix for dataset “IT-projects.txt”

        KDSI    PM      TDEV
KDSI            0.999   0.936
PM                      0.936
TDEV

Exercise 4: Determining a Correlation Matrix


Name of the solution stream Correlation IT-project variables correlation matrix
Theory discussed in section Section 4.4
Section 4.5

1. We open the stream “Correlation IT-project variables” that should be modified.


See Fig. 4.63.
2. Now we add from the SPSS Modeler toolbar in section “Output” a node of type
“Sim Fit”. We connect the node with the Type node. See Fig. 4.64.
3. We click with the right mouse button at the Sim Fit node and choose “Run”.
A Sim Gen node will be generated. See Fig. 4.65.
4. We open the new Sim Gen node with a double-click. We choose on the left side
the item “correlations”. The correlations we can find in Fig. 4.66 are the same as
in Table 4.8. The variable “Project number” can be ignored. We explained in
Sect. 2.7 two methods to exclude this variable from the stream. Either we use an
additional Filter node just behind the Variable File node or we exclude the
variable in the Variable File node options.

Fig. 4.63 Stream “correlation IT-project variables”

Fig. 4.64 Added Sim Fit node to stream “Correlation IT-project variables”

Fig. 4.65 Generated Sim Gen node in stream “correlation IT-project variables”

Fig. 4.66 Example of a correlation matrix



Exercise 5: Analyzing Dependencies


Name of the solution stream analyzing_dependencies
Theory discussed in section Section 4.7

Figure 4.67 shows the final structure of the stream. We added Matrix nodes to
show the results of the Chi-square tests of independence in detail. Table 4.9
explains the results.

Fig. 4.67 Stream “analyzing_dependencies”

Table 4.9 Test results of the chi-square test of independence

Variables: Teaching method vs. gender
Test probability/significance: 0.354
Description: See Fig. 4.68. The null hypothesis cannot be rejected. The data give us no reason to conclude that there is a relationship between the teaching method used and the gender of the student. We would make a mistake with a probability of 35.4 % if we rejected the null hypothesis.

Variables: School type vs. gender
Test probability/significance: 0.416
Description: See Fig. 4.69. The null hypothesis cannot be rejected. The data give us no reason to conclude that there is a relationship between the school type and the gender of the students. We would make a mistake with a probability of 41.6 % if we rejected the null hypothesis.

Variables: School type vs. reduced fee for lunch
Test probability/significance: 0.000
Description: See Fig. 4.70. We would make a mistake with a probability of 0 % if we rejected the null hypothesis, so we reject it. The data indicate a relationship between the school type and the number of students who receive financial support for the lunch fees.

Fig. 4.68 Teaching method vs. gender

Fig. 4.69 School type vs. gender



Fig. 4.70 School type vs. reduced fee for lunch

Literature
Drasgow, F. (2004). Polychoric and polyserial correlations. In S. Kotz & N. Johnson (Eds.),
Encyclopedia of statistical sciences. New York: John Wiley & Sons, Inc.
Granger, C., & Newbold, P. (1974). Spurious regressions in econometrics. Journal of Economet-
rics, 2(2), 111–120.
IBM. (2014). SPSS modeler 16 source, process, and output nodes. Accessed September 18, 2015, from ftp://public.dhe.ibm.com/software/analytics/spss/documentation/modeler/16.0/en/modeler_nodes_general.pdf
IBM. (2015a). SPSS modeler 17 modeling nodes. Accessed September 18, 2015, from ftp://public.dhe.ibm.com/software/analytics/spss/documentation/modeler/17.0/en/ModelerModelingNodes.pdf
IBM. (2015b). SPSS modeler 17 source, process, and output nodes. Accessed March 19, 2015, from ftp://public.dhe.ibm.com/software/analytics/spss/documentation/modeler/17.0/en/ModelerSPOnodes.pdf
Scherbaum, C., & Shockley, K. M. (2015). Analysing quantitative data: For business and
management students, Mastering business research methods. London: Sage.
Vose, D. (2008). Risk analysis: A quantitative guide (3rd ed.). Chichester: Wiley.
5 Regression Models
In the previous chapters, we introduced the SPSS Modeler and described how to use
it for basic data operations, descriptive analytics, and visualization of data. In the
remaining chapters, we will look at explorative and inductive models, as well as
data mining methods that allow us to identify hidden structures in the data. We start
in this chapter by introducing the first and one of the most popular classes of models in
data mining, the regression model. The main purpose of this model is to determine
the relationship between a target variable and some input variables. Regression
models are very intuitive, easy to handle, and easy to interpret. For these reasons, they
are popular and widely used in many different areas, e.g., medicine, finance,
physics, and web analytics, to mention a few.
Building regression models with the SPSS Modeler is quite simple. This chapter
aims to introduce the different regression types and to show how these models
are built with the SPSS Modeler. So after finishing this chapter, the reader . . .

1. Knows the differences between the various regression model types and is able to pick a
proper model type for his/her current problem.
2. Is able to build a regression model with the SPSS Modeler for different problems
and is able to apply it to new data for prediction.
3. Can evaluate the quality of the trained regression model and interpret its
outcome.

Before going into detail, we wish to outline some motivating examples, to give
an impression of the type of problems and data that suit the application of a
regression model.


5.1 Introduction to Regression Models

5.1.1 Motivating Examples

Usually, regression models are used to determine the unknown relationship


between one or multiple (independent) input variable(s) and a (dependent) target
variable, the latter being influenced by the former. If a regression model can
identify the dependency between these variables, then it can easily predict the
value of the target variable for a new set of input values. The following examples
illustrate some applications to give a better understanding of the relevance and
functionalities of regression models.

Example 1
Determining the linear relationship between two variables (Gravity constant of the
earth).
Let’s think back to physics class in school. There, we learned that the gravitation
of the earth accelerates the speed of every falling object by a constant g of about

g = 9.801 m/s²

near the equator. This means that a freely falling object increases its velocity by
about 9.801 m/s for every second it falls. School was a long time ago for most of us,
however, and so now we check this basic constant again with the help of regression analysis.
According to physics, there is a linear relationship, h = g · S, where h is the
height from which an object is falling, and S is the squared time in seconds that the
object needs to reach the ground. Figure 5.1 shows the data of such an experiment,
and Fig. 5.2 shows the corresponding scatterplot (see Sect. 4.2 for a description of
how to perform a scatterplot). The meters are plotted at the abscissa and the
measured seconds to the power of two are plotted at the ordinate.
Despite some noise and measurement errors, we see that the data points lie
approximately on a straight line, and thus the presumption of a linear relationship is
justifiable. Estimating the slope of this line can easily be done with a linear
regression model, and the result should be close to the real constant.
In this example, there is only a single input variable, and so these kinds of
models are called univariate or simple; but regression models can also be used to
estimate more complex linear variable connections. In these cases, the models are
called multiple linear regression models.

Fig. 5.1 Exemplary data from an experiment to determine earth’s gravity constant

Fig. 5.2 Scatterplot of the data from Fig. 5.1

Example 2
Determination of variable relations and prediction of variables (Analysis of pretest
and final exam results).
Students often prepare for exams by taking tests in advance. This gives students
the opportunity to become familiar with the type and complexity of questions
asked, as well as a chance to check their degree of readiness for the final exam.
To find out if pretesting is helpful, we can inspect the relationship between the
performances in both exams.
As well as measuring with an appropriate correlation coefficient, we can model
the exact relationship using linear regression. In addition, we are interested in a
prediction of the final exam scores from future students, based on their pretest
scores. This can be done by applying the build regression model on this new dataset
of pretest results.
IBM provides a dataset called “test_scores.sav” (see Sect. 10.1.31), which
comprises data for the kind of analysis described here. In Sect. 5.2.2, a simple
linear regression modeling process will be discussed in detail, based on this
example.

Example 3
Determining the relationship between multiple variables and building a model for
variable prediction (Prediction of house prices).
We can also use regression analysis to model cases where the target variable
depends on more than just one input variable. These models are called multiple

(linear) regression models. A good example would be the determination of a


suitable house price. Consider the following situation: we own a single-family
home that we want to sell. The job at hand is to find a good market price for our
property. The price obviously depends on a number of variables, such as the size of
the site, year of construction, number of bathrooms, and other parameters. This is a
perfect candidate for the application of regression modeling, as multiple input
variables influence the target variable, in this case the price of the house. In this
situation, multiple regression can be used to identify the relationships between these
variables, and the influence of each input variable on the target variable.
To find an appropriate price, we typically look at other houses, their features, and
the price for which they were sold. For such a dataset, also called training data, we
use a fitting model to determine the coefficients for the input variable. With a record
of the parameters for our single-family home, we can now estimate a realistic
selling price for our house.
In Sect. 5.3.2, we model this situation using multiple regression; the steps to
build the model with the SPSS Modeler are discussed in detail using the Boston
housing dataset (Harrison and Rubinfeld 1978; Gilley and Pace 1996). Alongside
the best-fitting multiple regression model, the SPSS Modeler will also tell us the
significance of each input variable’s contribution.

Other Examples and Areas Where Regression Is Used

• Industrial processes. The goal is to predict the quality of the produced product or
the waste/defective rate in an industrial process based on some process
parameters, such as temperature, acid concentration, and so on.
• Medical research. In a medical study, various blood pressure lowering drugs are
tested on patients with high blood pressure. With regression, the effectiveness of
each drug can be quantified.
• In econometrics, regression analysis is one of the major tools for estimating
important relationships, for example Okun’s law. Okun’s law describes the
relationship between the unemployment rate and losses in a country’s production
rate (Abel and Bernanke 2008).
• Demographic assessment. For many institutions, such as the government, monitoring of population size is important, in order to modify policy with a view to
influencing the birth rate.

In this chapter, we introduce various regression models and describe the


corresponding nodes provided by the SPSS Modeler, used to build regression
models that will handle the examples above. In Fig. 5.3, the discussed models
and nodes are listed, with their corresponding chapters in brackets.

5.1.2 Concept of the Modeling Process and Cross-Validation

Following data exploration and preparation, the actual model building process can
begin. When building a model using data mining, the model will only apply to the

Fig. 5.3 Regression models and nodes discussed in this chapter

specific data being used in it. In particular, this means that the model only knows this
dataset. A general assumption is that the built model also fits unknown data, but often
this is not the case; often the model is unable to predict the values of new data
records. This phenomenon is known as overfitting (see the section below). To validate
the model and prevent overfitting, the built model is typically applied to unknown
data to verify its universality. This process is called cross-validation. In particular,
this method is used to evaluate classification models, such as Neural Networks or
Support Vector Machines (see Chap. 8). We refer to James et al. (2013) for further
information on the concept and variants of cross-validation.
The key idea behind cross-validation and creating a correct modeling process is
to split data into a training dataset and a test dataset, typically in the proportion
70–30 %. Whereas the training dataset is used to estimate the model parameters and
build the model, the smaller test dataset is used for testing the exactitude of these
calculated parameters. This gives us the chance to measure a model’s ability to
generalize.
There are some important aspects to keep in mind when using a partitioning
method to separate the different subsets:

1. The values in both the training and the test set should have the characteristics of
the original sample. Sampling bias has to be avoided. Therefore, a random
selection process should be used to separate these subsets. For sampling
methods, see Sect. 2.7.8.
2. There is a trade-off between the exactness of the training results and the test
results. When the data are separated into the training and the test sets, the number
of records that are used for training the model is reduced.
3. The records of the test dataset cannot be used in the training procedure. The
model has to be processed using only the training dataset.

Often, the precise parameters of the model are unknown and have to be found
during the modeling process. In this case, the dataset is split into three parts: the
training, testing, and validation sets. After fitting the model for each of the
parameter selections using the training data, the validation dataset is then used to
find the model that will most precisely predict the target value of unknown data.
This so-called “best” model has to then be tested again with the testing dataset, to

Fig. 5.4 Workflow of the modeling and cross-validation process

verify that it is actually suitable for use with independent data. Here, another dataset
must be utilized, since the validation set is now biased, having already been used to
find the optimal parameters. A typical partitioning of the dataset is training data
60 %, validation data 20 %, and test data 20 %.
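Outside the Modeler, such a 60/20/20 partition can be sketched in Python, for example with scikit-learn’s train_test_split applied twice; the synthetic data and random seed below are arbitrary and only serve as an illustration.

import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic example data: 100 samples with one input variable and one target
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 1))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=100)

# First split off 20 % as the test set, then 25 % of the rest as the validation set
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.20, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=1)
# 0.25 of the remaining 80 % equals 20 % of the full data, i.e., a 60/20/20 partition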
Figure 5.4 visualizes the modeling and cross-validation process. A dataset can be
split into different parts with the Partition node, see Sect. 2.7.7. In Sect. 5.4.4, we
will use this partitioning method by fitting a polynomial regression.
As mentioned above, cross-validation is also a very convenient way of
preventing overfitting and is therefore often used in the modeling process for data
mining tasks. For these reasons, we recommend always using a cross-validation
setting when building a prediction model.

Overfitting
In statistics, overfitting occurs when a statistical model describes errors or noise
instead of the underlying relationship. In other words, the model describes the
training data very well, but independent and unknown data cannot be modeled. This
results in poor prediction performance. Overfitting typically occurs when a model is
excessively complex or has too many parameters. Overfitting in regression is
illustrated in Fig. 5.5. Of course, there always exists a polynomial function on which all
data points lie. This is demonstrated in the left graphic. Although this is
very precise for the present data, it does not represent the actual structure of the
data, since the measurement noise is modeled as well. So a linear relation describes the
data structure better and is therefore the more suitable model (see the right graphic in
Fig. 5.5). We refer to Kuhn and Johnson (2013) and James et al. (2013) for more
information on overfitting.

Fig. 5.5 Overfitting in regression. A complex regression function can tend to overfitting and fails
to model the data structure

Root Mean Squared Error: Evaluation Measure for Regression Models


A measure which indicates how well a regression model fits the data is the Root
mean squared error (RMSE). This is the square root of the sum of the squared
residuals divided by the number of observations. In other words, it is essentially the
standard deviation of the differences between the observed and predicted values
(see also Sect. 5.2.1). The RMSE is a good measure of model accuracy and is used
in cross-validation to compare models and their forecasting errors with each other,
in order to decide which of them fits the data best. For more information on the
RMSE, we refer to Hyndman and Koehler (2006) and Kuhn and Johnson (2013).
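Written as a formula, with $y_i$ the observed and $\hat{y}_i$ the predicted values of the $n$ observations:

$$ \text{RMSE} \;=\; \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^2} $$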
In a well-fitting model, the RMSE should be small. If in cross-validation, the
RMSE of the testing or validation data is substantially higher than in the training
data, this is an indicator of a deficient model. If however these values are nearly the
same, the model can be used as a good predictor for independent datasets.

5.2 Simple Linear Regression

We start by introducing the (simple) linear regression model (SLR), which is the
easiest one in this class of regression models, since it comprises only two variables
and the linear connection between them. We need to mention that a linear regression model is only suitable if a linear relationship between the two variables exists.
To ensure this is the case, we suggest performing a correlation analysis first, to
verify if regression is suitable. This correlation procedure is described in Sect. 4.4.

5.2.1 Theory

Consider two variables, a predictor x and an observation variable y, for which we
assume a linear relationship. Suppose we have a number of realizations x1, . . ., xn of
the input variable and the corresponding observations y1, . . ., yn; we can
combine both into the data points (xi, yi). These points do not lie exactly on a straight line

Fig. 5.6 Two linear functions that describe the data. Line A fits the data better than line B

(see, e.g., Figs. 5.2 and 5.6). This is due to noise or measurement errors within the
experiments.
SLR can now be used to quantify and estimate the linear dependency between
these two variables. In other words, the goal is to find a straight line

$$ \hat{y} = \beta_0 + \beta_1 \cdot x $$

that best describes the data. For that, the unknown coefficients β0 and β1 have to be
estimated in an “optimal” way. The estimated coefficients and the (linear) relationship
between these variables are called the (linear) regression model. This model can then
be used to predict the value of the target variable y_{n+1} for a new input x_{n+1}.
To perform a linear regression, there are a couple of necessary requirements,
which should be checked before trusting the results of the regression.

Necessary Conditions

1. The random noise for each sample has zero mean and the same variance
(homoscedasticity). This means that the noise does not contain any additional
information, but is totally random.
2. The errors are pairwise uncorrelated. This guarantees no systematic effect within
the errors. Instead, the errors are purely random.
3. The number of samples and data points is large enough.
4. Another useful assumption is the Gaussian distribution of noise. In this case,
measures for the goodness of fit for the models can be calculated. This gives us
the opportunity to find out if the linear model fits the data well.

Ordinary Least Square Algorithm


We mentioned above that a linear regression finds a line that best describes the data.
To understand how the parameters could be determined, let’s reconsider the table in
Fig. 5.1 of Example 1, which contains measured data to determine the acceleration
of the earth. In Fig. 5.6, two lines are included in the scatterplot of the data points.
Everybody would immediately agree that line A fits the data better than line B. This
is because the average vertical distance between the points and the line is smaller

Fig. 5.7 Prediction errors for line A and B

than for line B. In other words, the error made by predicting the measured values
with line A is much smaller than with line B. This is illustrated in Fig. 5.7.
The squared sum of the distances gives us a measure of the total distance from
each line to the observations. The smaller this value, the better the data are
approximated by the line.
The task of linear regression is now to find the parameters βi of the straight line
that minimize the total squared distance to the data points, i.e.,

$$ \sum_{i=1}^{n} \left(y_i - (\beta_0 + \beta_1 \cdot x_i)\right)^2 \;\to\; \min. $$

The standard method for this task is the method of least squares. A detailed
description of this method can be found in Fahrmeir (2013). The line with estimated
coefficients is called the regression line, and is line A in Fig. 5.6. The distances
from the observations to the values predicted by the regression line are called
residuals.

Coefficient of Determination
The Coefficient of determination, denoted by R2, is a measure that indicates how
well the data fits the estimated regression model. It describes the proportion of the
total variation that can be explained by the regression model. R2 is always a number
between 0 and 1, and the higher the R2, the better the model represents the data. If
R2 = 1, then all points lie exactly on a straight line with no variance. In this case, the
correlation coefficient is +1 or −1. In fact, R2 equals the squared (Pearson’s) correlation
coefficient [see Fahrmeir (2013) for more information on the Coefficient of
determination].
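In formula form, R2 relates the residual sum of squares to the total sum of squares of the observations (standard definition):

$$ R^2 \;=\; 1-\frac{\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i-\bar{y}\right)^2} $$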

Root Mean Squared Error


Another measure which indicates the accuracy of the model is the Root mean
squared error (RMSE) [see Sect. 5.1.2 and Hyndman and Koehler (2006)].

5.2.2 Building the Stream in SPSS Modeler

There are two different nodes that can be used to build a stream for a simple linear
regression (SLR) in the SPSS Modeler. They are the Linear node and the Regression node, each of which has advantages and disadvantages. For the SLR, the
choice of node does not essentially matter. Here we will present how to build a
stream for SLR with the Linear node. The Regression node is described later in
Exercise 4, in Sect. 5.3.7, where we invite the reader to get to know this node in a
self-study, while building a stream for a multiple linear regression.
We will set up a stream for a simple linear regression with the Linear node based
on the dataset “test_scores.sav” (see Sect. 10.1.31). The question we are trying to
answer is whether a good pre-exam result is an indicator for a good final-exam score
(see Example 2).

Description of the model


Stream name Simple linear regression
Based on dataset test_scores.sav (see Sect. 10.1.31)
Stream structure

Important additional remarks


We need to define the types of the variables as described in Sect. 3.1.1. This can be done in the
data file node. We recommend using an extra separate Type node, however
The target variable should be continuous. If it is categorical, linear regression may not be
adequate to model the data. A logistic regression would be more appropriate in this case (see
Sect. 8.3)
Related exercises: All exercises in Sect. 5.2.6

Fig. 5.8 Definition of the variable roles in the Type node

1. To build the stream, we first open a new empty stream, add a Source node, and
import the data file. Here, we will use the Statistics File node, as the imported file
is in SAV format (see Sect. 2.1 how to import data files).
2. Now, we add a Type node and connect the data file node to it. If necessary, the
types of the relevant variables have to be modified. Furthermore, we can define
the role of the variables, e.g., the target and input variable (see Fig. 5.8). For an
SLR, there will be one target variable and one input variable. The roles can be
declared later in the model node if preferred.

" A linear regression model is only suitable for data that have a linear
relationship. Thus, before proceeding to the actual regression analy-
sis, we recommend verifying the linear assumption with the following
additional statistical calculations and plots:

1. Calculate the correlation coefficient (see Sect. 4.4).


2. Create a Scatterplot of the raw data, to check the linearity of the
variables (see Sect. 4.2).

3. Next, we add the Linear node to the stream and connect it with the Type node.
Figure 5.9 shows how the stream should look.

Fig. 5.9 Regression stream

Fig. 5.10 Defining the input and target variables for the regression model

4. Next, we have to set the parameters of the regression model and so we open the
Linear node with a double-click.
5. In the Fields tab, we declare the target and input variable; in this case, there are
exactly one of each in the SLR. This can be done by simply dragging and
dropping. If the target and input variables were already defined, as described
in Step 2, the SPSS Modeler will identify these variables automatically (see
Fig. 5.10).

Fig. 5.11 Defining the confidence level in the Linear node

6. In the Build Options tab, we can choose between “Building a new model” and
“Continue training existing model”. The former option creates a new model
based on the imported data, whereas the latter one updates an existing model. For
further information, see IBM (2015b). Here, we will use the first option, since we
are starting from scratch and want to build a new model.
In the “Basics” section, we can choose whether or not the data should be
automatically preprocessed and adjusted. This procedure is selected by default,
and we recommend leaving it enabled to improve the prediction power. We can
also define the confidence level for the estimated model parameters. The stan-
dard value is a 95 % confidence level (see Fig. 5.11).
7. All other tabs and options in this node are irrelevant for the SLR. They are
explained in the multiple regression, Sect. 5.3.3.
8. We run the stream to build the model. The estimated model is displayed in a new
node that now appears, called the model nugget.
At the bottom right of Fig. 5.12, the model nugget containing the estimated
model parameter is displayed. The two nodes at the bottom left are the
recommended pre-analyses. They calculate the correlation coefficient and
draw a scatterplot of the data.

Fig. 5.12 Final stream of a simple linear regression

Fig. 5.13 Output of the model

9. To get the estimated values of the regression model, we can add an output node,
such as the Table node (see Fig. 5.13).

5.2.3 Identification and Interpretation of the Model Parameters

The model built in the previous section is represented in the SPSS Modeler by the
model nugget (see Fig. 5.12).
The details of the model can be inspected by double-clicking on the model
nugget. Here, we will find the estimated coefficients, goodness of fit statistics, and
other useful information shown in the Model tab. Let’s take a closer look at the most
important ones. Firstly, we will show how to find the estimated coefficients of the
model and identify an equation for predicting the unknown values of the target
variable using these parameters.
In the graphic “Coefficients” (see Fig. 5.14), the SPSS Modeler provides the
information on the estimated coefficients for the regression model. In this simple

Fig. 5.14 Visualization of the estimated coefficients of the linear regression model

Fig. 5.15 Table of the coefficients of the model

regression model, there are only two coefficients, the intercept β0 and the coeffi-
cient β1 for the input variable, in this case, the pretest results.
If the mouse cursor is moved onto the line of a parameter, a new window pops
up showing the estimated coefficient. The color of the line indicates the sign of the
coefficient. In our example, both coefficients are positive, in particular the coefficient
of the pretest results. This means that a high pretest score will predict a high final
exam score, and a low pretest score will predict a low final exam score.
We can get a more detailed overview of the coefficients (see Fig. 5.15), by
changing the style at the bottom of the dialog window shown in Fig. 5.14 to
“Table”. In the example of the pretest and final exam scores, the estimated regres-
sion equation is

$$\hat{y} = 13.212 + 0.981 \cdot x .$$

With this equation, we are now able to predict the outcome of the final exam if we
know the student’s pretest result.
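
For example, for a hypothetical student with a pretest score of 60, the predicted final exam score would be

$$\hat{y} = 13.212 + 0.981 \cdot 60 \approx 72.1 .$$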

5.2.4 Assessment of the Goodness of Fit

Once we have found the equation of the model, we can predict the final score of an
exam participant based on his/her pretest result, but we should first ensure the
quality of the model, its goodness of fit. We will now discuss several parameters that
quantify the model.

Summary of the Model


First, a short summary of the model is displayed with the accuracy of the fitted
model (see Fig. 5.16). The accuracy is the R2 value, which describes how well the model
explains the variation in the data (see Sect. 5.2.1).

Predicted by Observed
In the scatterplot “Predicted by Observed” (see Fig. 5.17), the values of the target
variable are plotted against the values predicted by the model. The graphic does not
show every single point, but instead shows colored circles that represent a set of

[Accuracy gauge: 90.4 % on a scale from 0 % (worse) to 100 % (better)]

Fig. 5.16 Visualization of the coefficient of determination in the model nugget

[Scatterplot: Predicted Value (y-axis) versus Post-test (x-axis)]

Fig. 5.17 Observations versus the predicted values



Fig. 5.18 Frequency distribution of the residuals

points. The number of data points located in the area of the circle is indicated by the
intensity of the color.
If the model fits the data well, the circles should be arranged in a straight line
along the diagonal. Any other arrangement would indicate that the data do not have
a linear relationship.

Residuals
One of the requirements for linear regression is a Gaussian distribution of the error.
This is verified in the next graphic, which shows the distribution of normalized
residuals in a histogram and the density of a Gaussian distribution (see Fig. 5.18). If
the histogram is approximately the shape of the Gaussian density, the residuals can
be said to follow Gaussian law. This is another indicator of a good fit with the
model; it reflects the assumed distribution of the error. If the residuals are not
Gaussian distributed, then the model is inappropriate for the data, or the data does
not follow a linear structure.
Using the drop-down list in Fig. 5.18, which is marked with an arrow, it is possible
to switch to another graphic, the so-called P–P Plot (see Fig. 5.19). This is another
plot used to verify the Gaussian distribution assumption. If the dots are located
around the diagonal, the residuals are distributed normally; if they are not, then the
residuals follow a different law. For more detailed information on the P–P Plot, see
Thode (2002).
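
A comparable residual check can be sketched outside the Modeler in Python. The snippet below uses simulated data (not the test-score file) and the Shapiro–Wilk test from scipy as a simple stand-in for the Modeler's histogram and P–P plot diagnostics.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Simulated data with Gaussian noise around a true line y = 10 + 2x.
x = rng.uniform(0, 10, size=200)
y = 10 + 2 * x + rng.normal(scale=1.5, size=200)

# Fit the regression line and compute the residuals.
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (intercept + slope * x)

# Shapiro-Wilk test: a large p-value gives no evidence against normality.
statistic, p_value = stats.shapiro(residuals)
print(f"Shapiro-Wilk W = {statistic:.3f}, p-value = {p_value:.3f}")
```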

Fig. 5.19 P–P Plot of the residuals

Significance of the Coefficients


Let us recall the graphic “Coefficients” in Fig. 5.14. Besides the estimated
coefficients, the SPSS Modeler provides additional information on the significance
of coefficients for the regression model. In this simple regression model, there are
only two coefficients, the intercept and the coefficient for the input variable. The
width of the lines corresponds to the significance levels; that is, the thicker the line,
the more significant the parameters. In this case, both variables are highly signifi-
cant, which means, that they are relevant for the data description and cannot be
removed from the model.
If we move the cursor onto the lines of a parameter, a pop-up window shows the
estimated coefficients and their significance level. Remember, the closer the signif-
icance level is to zero, the more significant the parameter is to the fitted model.
We can expand the table view of the coefficients (see Sect. 5.2.3) by clicking on
the coefficient field (see Fig. 5.20). Then, besides the significance level, additional
values are displayed, such as t-statistics and confidence intervals, used to test the
significance of the estimated coefficients.

Interpretation of the Model (Relationship Between Pretest and Final Exam


Scores)
In this example, the fitted model has an accuracy of R2 = 90.4 % (see Fig. 5.16),
which indicates that the model describes the data very well. This impression is
supported by the diagonal arrangement of the points in the scatterplot in Fig. 5.17, which plots the
final exam scores against the predicted scores. Furthermore, the residuals have a
Gaussian distribution (see Figs. 5.18 and 5.19), which confirms that the
assumptions of a regression model are fulfilled and it is another indicator of a
well-fitting model.

Fig. 5.20 Expanded table view of the coefficients with the significance level and additional
parameter

Fig. 5.21 Visualization of the test score data with the regression line

The estimated coefficient for the “pretest” variable in Fig. 5.20 is 0.981, which is
the slope of the regression line. Since this slope is close to 1, the improvement in the final score does
not depend on the pretest result. Instead, each student improves their score by
roughly the same constant, namely the intercept of 13.212. Figure 5.21 shows the data with
the estimated regression line in the SPSS Modeler.

5.2.5 Predicting Unknown Values

We have built a model that fits the test score data well and answers the question posed
in the motivating Example 2. Now, the determined regression equation can be used to
predict final exam scores from the pretest results of other students, e.g., in the next
semester, the final results can be estimated after the students have written the
pretest.

Prediction of values with a regression model


Stream name Simple linear regression
Based on dataset test_scores.sav (see Sect. 10.1.31)
Stream structure

Important additional remarks


Make sure that the variable names in the new dataset are the same as in the model! Otherwise,
the stream generates an error.

Fig. 5.22 Input values for the prediction stream

1. Copy the regression model nugget and paste it onto the stream canvas.
2. Then, select an appropriate Source node to import the data. Since in this case the
data are in a standard text file, we choose the Var. File node. Figure 5.22 shows
the table of the input values. The variable names have to be exactly the same as
those in the model, as otherwise, the stream does not work and running it will
generate an error. If the variable names in the new dataset do not correspond to
the names in the model, they can be modified later with an additional Filter node
(see Sect. 2.7.5).
3. Connect the data file node to the model nugget.
4. Now, add a Table node to the stream and connect the model nugget to that. The
stream should look like the one in Fig. 5.23.
5. Run the stream. The output for this example is displayed in Fig. 5.24. A new
column containing the predicted values is added, called $L-posttest. This is the
same value that can also be calculated with the regression equation.

Fig. 5.23 Final stream for the prediction of variable values

Fig. 5.24 Output of the prediction. The column $L-posttest contains the predicted values

5.2.6 Exercises

Exercise 1: Determination of Earth’s Gravity Constant


The data in Fig. 5.1 can be found in the file “gravity_constant_data.csv” (see Sect.
10.1.16). Use this file to build a linear regression model and determine the gravity
constant of the earth by performing the following steps:

1. Load the data by using an appropriate data node.


2. Verify the linear relationship between the two variables (height, seconds) by
plotting the data and calculating the correlation coefficient (see Sect. 4.4).
3. Build a linear regression model with the Linear node.
4. Determine the estimated coefficients and identify the regression equation.
5. Now, determine the Coefficient of determination and the estimated gravity
constant.
6. Are the coefficients significant for the model and are the residuals distributed
normally?
7. What is the time an object needs, falling from 275 m, until it reaches the ground,
as predicted by your regression model?

Exercise 2: Benchmark Test of Two Computer Processors


Benchmark tests are used to identify the performance of computer processors. The
file “benchmark.xlsx” contains a list of AMD and Intel processors with their prices
in euro and the results of a benchmark test performed with the test program
Cinebench 10 (see Sect. 10.1.3). See Exercise 2 in Sect. 4.8 for an additional
exercise on this data and calculation of the correlation coefficient.

1. Import the data file into SPSS Modeler.


2. Visualize the data in a scatterplot. Use different colors or point shapes for AMD
and Intel processors. Inspect the data for outliers and remove them from the
dataset.
3. Perform a separate regression analysis for each company (AMD, Intel), for the
benchmark results versus the price. Plot the data with the regression line.
4. Interpret the results. Which company has the better cost-performance ratio?

Exercise 3: Estimation of the Cost of IT Projects


Before starting an IT project and implementing the source code, an IT company
must estimate the cost of the project and the time it will take, in order to give a
reasonable offer to the customer. A common method uses the COCOMO
algorithms (see Boehm 1981). There, the cost in person-months (PM) is estimated
from the kilo delivered source instructions (KDSI), and from this result the time
of development (TDEV) is estimated. The file “IT-projects.txt” contains data from
the three mentioned variables for several projects (see Sect. 10.1.19). Build two
regression models to estimate PM from KDSI and TDEV from PM. See Exercise 3
in Sect. 4.8 for an examination of the variables and their relationships.

1. Import the data file into SPSS Modeler.


2. Examine the relationship between the three variables.
3. Build two linear regression models (TDEV versus KDSI and PM versus KDSI).
Don’t forget to first transform the variables into a linear relationship, based on
our results from Step 2.
4. Predict the cost and time for an IT project with these build models, when the
number of code lines is 350 and 420, respectively.

Exercise 4: Multiple Choice Questions


Question 1: Linear regression is only suitable if the data follow a linear relationship.
Which pre-analysis can be done to verify this assumption?
☐ A scatterplot of the two variables
☐ A boxplot of the variables
☐ Calculation of the correlation coefficient
☐ Compare the variances

Question 2: The purpose of a simple linear regression is to . . .
☐ Estimate the parameters of a regression function
☐ Estimate the confidence intervals for the parameters of a regression function
☐ Test if the estimated values are statistically significant
☐ Build a model that reproduces the values from the original dataset
☐ Build a model to predict values of a target variable for new input data

Which of these statements are correct? Please answer “Yes” or “No”.
3. The significance level of the coefficients can be ignored for the interpretation of
the regression results. ☐ Yes ☐ No
4. The residuals represent the “vertical” distance between the observed data point
and the fitted curve. ☐ Yes ☐ No
5. The method of least squares is used to fit a regression line. ☐ Yes ☐ No
6. Each regression model has its limitations, i.e., it is only applicable within a
restricted range of the input variables. ☐ Yes ☐ No
7. The coefficient of determination is the proportion of variability in a dataset that
is NOT accounted for by the statistical model. ☐ Yes ☐ No
8. The coefficient of determination (R2) is exactly 1 if, and only if, the regression
line passes through each and every data point. ☐ Yes ☐ No
9. The P–P Plot can be used to indicate that the residuals have a Gaussian
distribution. ☐ Yes ☐ No
10. To predict values with the estimated regression model in the Modeler, the new
input variable must have the same name as the predictor in the model. ☐ Yes ☐ No

5.2.7 Solutions

Exercise 1: Determination of earth’s gravity constant


Name of the solution stream: gravity_regression
Theory discussed in: Sect. 5.1.1 and Sect. 5.2

The final stream for this exercise is shown in Fig. 5.25.

1. The data are given in a csv table, so we use the Var. File node to import the data.
2. To calculate the correlation coefficient, we use the Statistics node, and for the
scatterplot, the easiest way is to use the Plot node. The correlation coefficient is
0.995 (see Fig. 5.26).
The data from the gravity example were already plotted at the beginning of
this chapter (see Fig. 5.2). Both the correlation coefficient and the scatterplot of
the data indicate a linear relationship. Hence, a linear regression is adequate for
this problem.

Fig. 5.25 Stream from the gravity example

Fig. 5.26 Correlation coefficient from the gravity example



Fig. 5.27 Variable selection of the linear regression of the gravity example

Fig. 5.28 Coefficients of the linear regression model for the gravity data

3. We use the Linear node to build a regression model with the target variable
“seconds squared” and predictor “height”. The variable selection is shown in
Fig. 5.27.
4. After running the stream, the estimated coefficients can be looked up in the
Coefficients view of the model nugget (see Fig. 5.28).
Here, the estimated coefficients are β0 = −0.680 and β1 = 0.103. Hence, the
regression equation is

$$\hat{y} = -0.680 + 0.103 \cdot x .$$

5. The Coefficient of determination is R2 = 0.987 (see Fig. 5.29).



Fig. 5.29 The coefficients of determination of the linear regression model for the gravity data

Fig. 5.30 P–P Plot of the residuals of the linear regression model for the gravity data

You may recall from the motivating Example 1 at the beginning of this chapter
that the gravity constant is given by the equation h = g · S, where h is the height
from which an object is falling, and S is the squared time in seconds that it needs
to hit the ground. Thus, we get the gravity constant from our regression equation
by ignoring the intercept β0 and inverting the slope β1, hence,

$$g = \frac{1}{\beta_1} = \frac{1}{0.103} = 9.709,$$

which is close to the real gravity constant (see motivating Example 1).
6. The coefficient β1 (the slope) is significant in this model (see Fig. 5.28). The
significance level of the intercept, however, is very high at 0.499; but since the
intercept is not very important for the regression model, it can be ignored. The
model still represents the data adequately.
Since we have only a few data points, the distribution of the residuals cannot
be assessed very accurately. The P–P Plot of the residuals indicates a Gaussian
distribution, however (see Fig. 5.30).
7. To predict the time an object needs, falling from 275 m, until it reaches the
ground, we build a prediction stream by copying the model nugget and adding a
User Input node, where we define the new predictor value as 275 (see Fig. 5.31).
We connect these two nodes and add a Table node for the output. The stream
should look like Fig. 5.32.
The predicted value of the time an object needs to reach the ground from a
height of 275 m is 27.592 square seconds, hence, 5.253 s (see Fig. 5.33).

Fig. 5.31 Input of the new height for the prediction of its falling time

Fig. 5.32 Prediction stream for the gravity example

Fig. 5.33 Predicted squared seconds of an object falling from a height of 275 m

Exercise 2: Benchmark Test of Two Computer Processors


Name of the solution stream: Processor_benchmark
Theory discussed in: Sect. 5.2.1 and Sect. 5.2

Fig. 5.34 Stream of the benchmark exercise

The complete stream for this exercise looks like Fig. 5.34.

1. First, we import the data from the xlsx file (see Exercise 2 in Sect. 2.5 for details).
To visualize the data, we use the Graphboard node and select the scatterplot
graphic. To distinguish the two companies by color and shape, we select the
variable “firm” for both (see Figs. 5.35 and 5.36 for the parameters and output of
the scatterplot).
2. The point at the top right of Fig. 5.36 is located far away from the other data.
Thus, we omit this outlier from further analysis. This is done with the Select
node, using the formula given in Fig. 5.37. The scatterplot now looks like
Fig. 5.38, with the points located around an imaginary straight line.
After these preprocessing steps, the stream should look like Fig. 5.39.

Fig. 5.35 Selection of the scatterplot options

3. To perform a regression for each company individually, we use the Select node
to split the dataset into two disjointed sets (See Fig. 5.40 for selection of the
AMD processor data). We proceed in the same way with the Intel data.
For each processor dataset, we add a Linear node to perform a linear regres-
sion. Select “CB” for the target and “EUR” for the predictor variable, as shown
in Fig. 5.41.
Now, run the stream. Afterwards, it should look like Fig. 5.42.
To visualize the data and the regression line, we add two Plot nodes and
connect them to the two model nuggets. As an overlay, we insert the regression
line formula in the function field (See Fig. 5.43 for a definition of the plot
parameter and Fig. 5.44 for the actual plot of the AMD data). The values of the
coefficients can be found in the coefficients view in the model nugget.

Fig. 5.36 Scatterplot of the benchmark data. The companies are displayed with different colors
and shapes

Fig. 5.37 Removal of the outlier from the top right corner

Fig. 5.38 Scatterplot of the data without the outlier

Fig. 5.39 Stream after data preprocessing

4. The estimated slopes of the two regression lines are 30.559 for the AMD
processors and 28.707 for the Intel processors. Both coefficients are significant.
Thus, we conclude that the AMD processors have a better cost-performance
ratio, based on this particular benchmark test, since performance levels grow
faster as the cost of a processor increases. To visualize this result, we need to

Fig. 5.40 Selection of the AMD processor data

Fig. 5.41 Variable selection for the linear regression

combine the estimated data from both regression models. This can be done with
the Append node. We add this node to the stream and connect it with both model
nuggets. Afterwards, we add another Plot node and connect it with the Append
node. We select “EUR” as the X and “L-CB” as the Y variable. Furthermore, we
select “firm” as the color overlay, to plot the regression lines in a different color
(see Fig. 5.45). In the Options tab, we can further choose between a line or dot
plot.

Fig. 5.42 Stream of the benchmark data after regression

Fig. 5.43 Parameter selection of the scatterplot with regression line



Fig. 5.44 Visualization of the AMD data with the regression line

Fig. 5.45 Parameter selection of the comparison plot of the regression lines

Fig. 5.46 Visualization of both regression lines

The plot should look like Fig. 5.46, which shows that the cost-performance
ratio of the AMD processors is better than the ratio for Intel.

Exercise 3: Estimation of the Cost of IT Projects


Name of the solution stream: IT projects
Theory discussed in: Sect. 5.2.1 and Sect. 5.2

The final stream for this exercise looks like Fig. 5.47.

1. First, we import the data from “IT-projects.txt” (see Sect. 10.1.19), and we
define the types of the variable.
2. We have already examined the relationships between the three variables in
Exercise 3 in Sect. 4.8, using diverse scatterplots and correlation coefficients.
Refer to that for the first part of the stream. The Pearson’s correlation coefficient
is in all cases quite high, and thus, the correlation is strong (see Fig. 5.48). This
indicates linearity in all cases.

Fig. 5.47 Stream of the IT-project exercise

Fig. 5.48 Pearson’s correlation of the three variables KDSI, TDEV, and PM

Fig. 5.49 Scatterplot of the three variables KDSI, TDEV, and PM

It should be noted, however, that the scatterplot of the variables (see Fig. 5.49)
shows a linear relationship between KDSI and PM, but a more polynomial
relationship between KDSI and TDEV and/or PM and TDEV. Thus, we assume
that the variables PM and TDEV follow a polynomial relationship, i.e.,

$$TDEV = b \cdot PM^{m},$$

with b and m being positive real numbers. This is also the assumption of the
COCOMO model. To build a suitable linear regression model, we transform this
equation into a linear one by taking logarithms, i.e.,

$$\log(TDEV) = \log(b) + m \cdot \log(PM).$$

Now, we can estimate log(b), and hence b and m, with a linear regression.
3. Before building the actual model, we need to take the logarithm of
the variables PM and TDEV. This is done with the Derive node. We add this
node to the stream after the Type node and select the Multiple mode (see
Fig. 5.50). Also select the two variables, TDEV and PM, which we want to
transform, choose a suitable Field name extension, and insert the formula
log(@FIELD) into the corresponding field.

Fig. 5.50 Setup of the Derive node to calculate the logarithm of the variables PM and TDEV

The Derive node now takes each of the
two variables, calculates its logarithm, and adds a new variable to the data with
the extension we selected, in this case, “_trans” (see Fig. 5.51).
These new variables have a linear dependency, as evidenced by a scatterplot
(see Fig. 5.52).
We also have the option of omitting the one outlier that is located above the
line, to improve the regression. We can remove the outlier with the Select node
and the formula shown in Fig. 5.53. This increases the correlation coefficient to
almost one (see Fig. 5.54), which confirms linearity.

Fig. 5.51 Data with the new logarithmic variables PM_trans and TDEV_trans

Fig. 5.52 Scatterplot of PM_trans and TDEV_trans



Fig. 5.53 Removing the outlier in the data

Fig. 5.54 Correlation coefficient of the logarithmic data

To finish these data preparation steps, we add another Type node to the stream
and define the correct data types (continuous) for the variables PM_trans and
TDEV_trans.
Now, we are ready to perform the two linear regressions. We add two Linear
nodes to the stream and connect each of them with the last Type node. For each
regression, we select the appropriate variables, i.e., KDSI and PM for the first

Fig. 5.55 Selection of the variables for the regression KDSI versus PM

one and PM_trans and TDEV_trans for the second (see Figs. 5.55 and 5.56).
Then run the stream and the two model nuggets will appear.
All estimated coefficients are significant and have the values displayed in
Table 5.1.

" Warning: The coefficients for the second regression model describe
the linear relationship of the logarithmic variables. To predict the
original variable TDEV from PM, the predicted values have to be
retransformed via the exponential function.

4. Figure 5.57 shows the complete stream for predicting the TDEV and PM of
KDSI 350 and KDSI 420.
First, we use the User Input node to insert the variable KDSI with the new
values of 350 and 420, followed by the usual Type node. To predict PM, we
copy the model nugget for the regression model KDSI vs. PM and add it to the
stream. As in the stream for building the model, we need to add the new variable
PM_trans, containing the logarithm of the just estimated values of PM. We use the
Derive node for that purpose (see previous step). We copy the second regression
model nugget and add it to the stream. This will predict the logarithmic value of

Fig. 5.56 Selection of the variables for the regression PM versus TDEV

Table 5.1 Coefficients of the two regressions

                          Intercept    Linear coefficient
KDSI vs. PM               14.231       3.317
log(PM) vs. log(TDEV)     0.974        0.370

Fig. 5.57 Stream to predict the TDEV and PM from KDSI

TDEV. To get the prediction of TDEV and not of log (TDEV), we have to add a
further Derive node, which will transform the estimated values back to the
original ones by calculating their exponentials (see Fig. 5.58 for the setup of
this Derive node).

Fig. 5.58 Retransformation of the logarithmic values to the original variable TDEV

Fig. 5.59 Prediction of PM and TDEV for KDSI = 350 and 420

Finally, we add a Table node to output the predicted values. The predictions of
PM and TDEV for KDSI = 350 and 420, respectively, are displayed in Fig. 5.59.

Exercise 4: Multiple Choice Questions


Theory discussed in: Sect. 5.2

1. Both, a scatterplot and correlation coefficients (see Sect. 4.2 and 4.4), can be
used to indicate a linear relationship. A boxplot and variances can only give
insight into the distribution of the variables. Furthermore, two variables can
have the same variance, but have no relationship to each other.
2. All answers except the fourth are correct. A regression model does not repro-
duce the original data. Instead, it is used to predict values for new input values,
based on the training data. This means that even for the original input data, the
model only outputs an approximation.
3. No. A high significance level indicates that the coefficient is irrelevant for the
model fit, and thus, it is probably better to remove this variable to
improve the model fit. Hence, the significance levels of the coefficients are crucial
for the interpretation of the model.
4. Yes (see Sect. 5.2.1).
5. Yes. The method of least squares is the most common method used in this
context. There are other goodness of fit measures that can be considered
however, for example, the method of least absolute deviations.
6. No. The whole range of the input variable can be considered.
7. No. It is exactly the opposite. The coefficient of determination gives the
proportion of variability that can be explained by the model (see Sect. 5.2.1).
8. Yes (see Sect. 5.2.1).
9. Yes (see Sect. 5.2.4).
10. Yes.

5.3 Multiple Linear Regression

A generalization of the Simple Linear Regression Model is Multiple Linear Regres-


sion (MLR), which is used to determine the linear relationship between a target
variable and multiple input variables. As in the simple linear case, the SPSS
Modeler assumes normally distributed residuals. Furthermore, this model doesn’t
consider correlations between input parameters. If the data contains such
dependencies between input variables, these should be eliminated by variable
reduction (e.g., via principal component analysis, see Sect. 6.3), or a General Linear
(mixed) Model should be used (see Sect. 5.4).

5.3.1 Theory

Multiple Linear Regression (MLR) is actually quite similar to Simple Linear


Regression. It differs only in the number of predictors, i.e.,
 
$$h\left(x_1, \ldots, x_p\right) = \beta_0 + \beta_1 \cdot x_1 + \ldots + \beta_p \cdot x_p .$$

This equation can also be described in the shorter vector form, i.e.,

$$h(\mathbf{x}) = \boldsymbol{\beta}^{t} \cdot \mathbf{x},$$

where β and x are (p+1)-dimensional column vectors, with entries β0, . . ., βp and
1, x1, . . ., xp, respectively. Furthermore, β^t denotes the transpose of β, and · is the
usual scalar product for vectors.
Thus, instead of a regression line, here, we have to find a hyperplane in a high
dimensional space that fits the data in an optimal way. In other words, we have to
estimate the coefficients β0, . . ., βp such that h minimizes the squared distance to the
observations.
The estimation of these coefficients is done with the method of least squares, just
as in SLR. We recommend Fahrmeir (2013) for a precise description of how the
coefficients are estimated in cases of multiple regression.
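
As a purely illustrative sketch of the least-squares estimate in the multiple case (independent of the SPSS Modeler, and using simulated data rather than the Boston housing file), the coefficients can be obtained by solving the normal equations with numpy:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated dataset: 100 observations, 3 input variables.
n, p = 100, 3
X = rng.normal(size=(n, p))
true_beta = np.array([4.0, 1.5, -2.0, 0.5])           # beta_0, ..., beta_3
y = true_beta[0] + X @ true_beta[1:] + rng.normal(scale=0.3, size=n)

# Prepend a column of ones so that the intercept beta_0 is estimated as well.
X1 = np.column_stack([np.ones(n), X])

# Least-squares estimate: solve (X^t X) beta = X^t y.
beta_hat = np.linalg.solve(X1.T @ X1, X1.T @ y)
print(np.round(beta_hat, 3))   # close to the true coefficients
```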

Necessary Conditions
The conditions that are necessary for MLR are similar to the ones for SLR.

1. The random noise for each sample should have zero mean and the same variance
(homoscedasticity). This means that the noise does not contain any additional
information, but is totally random.
2. Errors should be pairwise uncorrelated. This guarantees no systematic effect
from the errors. Instead, the errors are purely random.
3. The number of samples should be large enough so that the coefficients can be
calculated.
4. The last assumption is a Gaussian distribution of noise. In this case, measures for
the goodness of fit for the models can be calculated. This gives us an opportunity
to find out how well the linear model fits the data.

Adjusted R2
You may recall from SLR that R2, the coefficient of determination, gives a measure
of how well the model fits the data. R2 however has the drawback that it grows as
the number of input variables increases, even if these variables do not have any
effect on the output variable. Due to this fact, MLR often uses an adjusted R2
instead, which is a slight modification of R2 and takes the number of input variables
into account. In the case of one input variable, the adjusted R2 is exactly the
coefficient of determination, as described in the SLR section (see James
et al. (2013) and Fahrmeir (2013) for information on adjusted R2).
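
One common form of this adjustment, with n observations and p input variables, is

$$R^2_{\text{adj}} = 1 - \left(1 - R^2\right)\frac{n-1}{n-p-1},$$

so that an additional predictor only increases the adjusted value if it improves the fit by more than would be expected by chance.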

Theory for Variable Selection


In many applications, a huge number of potential impact variables are available, but
not all these variables have a significant influence on the observation, and so some
should be excluded from the regression model. The selection of the input variables
is important, and the wrong choice can lead to a nonoptimal and incorrect model
estimation. So, before estimating the coefficients in the MLR model, the selection
of the best set of input variables must be considered.

Table 5.2 Variable selection methods for linear regression in the Linear and Regression node

Include all predictors / Enter (Linear node, Regression node): The naive approach,
where all input variables are included.

Forwards (Regression node): An iterative method. Starting with no variables, in each
step the variable that improves the model the most is added, until no further
improvement is possible. The improvement is quantified by a selection criterion.

(Forward) Stepwise (Linear node, Regression node): Like the Forwards method, but
variables can be both added and removed in each step.

Best subset (Linear node): The subset of variables that gives the best criterion value
is selected as the input variables for the regression model.

Backwards (Regression node): The opposite of the Forwards method. Starting with
the complete model, the input variables are removed stepwise, according to the
selection criterion, until the model can no longer be improved.

There are a number of algorithms and methods for selecting the variables for
inclusion in the final regression model. The SPSS Modeler provides the selection
methods that are listed in Table 5.2. There are many other techniques however,
which can be looked up in Fahrmeir (2013).
The main step in all of these selection methods is the validation of models with
different subsets of the input variables. There are a variety of validation criteria.
The following are implemented in the SPSS Modeler: Information criteria (AICC),
F-Statistics, Adjusted R2, and Averaged squared errors. From these validation
criteria, the AICC is the most common and thus typically consulted. For a detailed
description of these methods and further criteria, see Fahrmeir (2013).
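
To give a rough impression of how such a forward selection based on an information criterion proceeds (this is a simplified sketch, not the exact algorithm of the Linear node), the following Python code adds, in each step, the variable that lowers the AIC of an ordinary least-squares model the most; the data are simulated and the variable names are made up.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)

# Simulated data: only x1 and x3 actually influence the target.
n = 200
X = pd.DataFrame(rng.normal(size=(n, 4)), columns=["x1", "x2", "x3", "x4"])
y = 2.0 * X["x1"] - 1.5 * X["x3"] + rng.normal(scale=0.5, size=n)

def aic(columns):
    """AIC of an OLS model with an intercept and the given predictor columns."""
    design = sm.add_constant(X[list(columns)]) if columns else np.ones((n, 1))
    return sm.OLS(y, design).fit().aic

selected, remaining = [], set(X.columns)
best_aic = aic(selected)
while remaining:
    # Evaluate every remaining candidate and keep the best improvement, if any.
    cand_aic, cand_var = min((aic(selected + [v]), v) for v in remaining)
    if cand_aic >= best_aic:
        break
    selected.append(cand_var)
    remaining.remove(cand_var)
    best_aic = cand_aic

print("selected predictors:", selected)
```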

5.3.2 Building the Model in SPSS Modeler

Just as in SLR (see Sect. 5.2.2), there are two nodes that can be used to build a Multiple
Linear Regression (MLR) stream: the Linear node and the Regression
node. Both these nodes assume a Gaussian distributed target variable and no
interactions between the input variables. If one of these conditions is deemed
invalid, a GenLin or GLMM node may be more appropriate since they can be
used for more general models. They are capable of considering different
distributions of the target variable and correlations between the input parameters.
In this section, we will show how to perform an MLR with the Linear node, which is
much simpler and more intuitive than the GLMM node. For a description of the
GLMM and GenLin nodes, see Sect. 5.4.

We set up a stream for an MLR with the Linear node using the Boston housing
dataset (see Sect. 10.1.17). This dataset describes the house prices (MEDV) in
certain neighborhoods of Boston. The Regression node is described in Exercise 4 of
Sect. 5.3.7, and we recommend doing the exercise in order to get to know that node
too, and to identify differences with the Linear node, and the advantages and
disadvantages of its use.

Description of the model


Stream name Multiple linear regression
Based on dataset Boston housing dataset, housing.data.txt (see Sect. 10.1.17)
Stream structure

Important additional remarks


We need to define the types of the variables. This can be done with the Type node
The target variable should be continuous. If it is categorical, the MLR model may not be
adequate to represent the data. A logistic regression is more appropriate in this case (see
Sect. 8.3).
If a predictor is categorical, a coefficient for each value of this variable is estimated.
Related exercises: 1, 2, 3, 6

1. We open a new empty stream, add a Source node, and import the data file. Here,
we use the Var. File node, since the imported file is in .txt format.
2. To perform an MLR, we need to assign a target variable and input variables. This
can be done in the Type node (see Fig. 5.60), or later in the Linear node, where
the model parameters are defined.
3. We add the Linear node to the stream and connect it to the Type node.
Figure 5.61 shows the stream.
4. Now, we open the Linear node with a double-click, to set the parameters of the
regression model.

Fig. 5.60 Detailed view of the Type node. The variable MEDV is the target variable and
describes the house prices in a neighborhood of Boston

Fig. 5.61 Stream of an MLR before running it

5. In the Fields tab, the target and input variable, which should be considered when
building the model, are declared. This can be done by simply dragging and
dropping. If the target and input variables are defined as described in step 2, the
SPSS Modeler automatically identifies the correct roles of the variables
(Fig. 5.62).
6. As in SLR, we can choose between “Building a new model” or “Continue
training existing model” in the Build Options tab. Here, we chose to build a
new model.

Fig. 5.62 Selection of the target variable and input variables that should be considered when
building the model. The meaning of the variables can be looked up in Sect. 10.1.17

Moreover, we can also define the model selection method and the validation
criteria (see Fig. 5.63). Possible methods are “Include all predictors”, “Forward
stepwise”, and “Best subset”. The entry/removal criteria (validation criteria) for
the variables offered by the SPSS Modeler are “Information Criterion (AICC)”,
“F statistics”, “Adjusted R2”, and “Overfit Prevention Criterion (ASE)”. For
further information, see Sect. 5.3.1 and Table 5.2. Here, we decide to use the
selection method “Best subset” and the “Information criteria (AICC)”.
7. We run the stream to build the model. The model nugget appears and is included
in the stream (see Fig. 5.64).

Fig. 5.63 Options within the model selection method and validation criteria

Fig. 5.64 Final stream of the MLR



5.3.3 Final MLR Model and Its Goodness of Fit

As in SLR, we can get a more detailed view of the estimated coefficients, accuracy
values, and other useful information about the model by double-clicking on the
model nugget. In the following explanations of the model nugget, we only focus on
relevant and interesting results and information that differ from SLR, or are not
explained in Sect. 5.2.4, because of unimportance there. The parameters and
graphics, which are exactly the same as in the simple linear case, are, for example,
“Summary of the model” or “Predicted by Observed”. We refer to Sect. 5.2.4 for a
description of these graphics and views. Please note that the accuracy in the model
summary is the adjusted R2.
We start by showing where the estimated coefficients can be found, and how to
determine the regression equation.

Coefficients and the Model Equation


As in SLR, the estimated coefficients can be looked up in the Coefficients view (see
Fig. 5.65).

Fig. 5.65 Visualization of the coefficients of an MLR in the model nugget



Fig. 5.66 List of the estimated coefficients in the MLR model

As before, the color of the line indicates the sign of the variables’ coefficient.
Here, a darker color means that the variable has a positive influence, and a lighter
line indicates a negative effect on the predicted variable. More precisely, an
increase in the crime rate (CRIM) will decrease the price of the house, whereas a
better accessibility to rapid highways (RAD) will increase the selling price of the
house.
This graphic only displays the variables included in the model by the selection
method. Hence, we can get a first impression and list of the final chosen variables of
the MLR model.
We can get a better overview of the coefficients by changing the style at the
bottom to “Table” (see Sect. 5.2.4). The table that appears shows a list of the
coefficients and their estimates (see Fig. 5.66).
We would like to point out the case where an ordinal/categorical variable is
included in the model. In this case, for each possible value of this variable, a
coefficient is estimated and displayed in the list (see the rectangle in Fig. 5.66 for
an example of the CHAS variable in the Boston housing dataset).

The regression equation of an MLR can be easily determined by inserting the


estimated coefficients into the equation presented in Sect. 5.3.1, but we have to pay
attention to whether a variable is categorical. In this case, we get several regression
equations, one for each category of this variable; the number of input variables in
these equations is reduced by one, and the Intercept also changes by the estimated
coefficient for this value. More precisely, for the Boston housing dataset and the
MLR model calculated here, the Intercept of the regression equation is β0 ¼ 37:501
if the Charles River dummy variable (CHAS) is 1 and β0 ¼ 34:915 if CHAS is
0. We refer to Fahrmeir (2013) for more information on how to determine the
regression equation.
With this equation, we can now easily predict the price of a house, by inserting
the value of each variable for the new house into this equation.
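
Writing the predicted house price as MEDV with a hat and abbreviating all remaining terms, the two intercepts quoted above can equivalently be expressed with a single coefficient for the 0/1 dummy:

$$\widehat{MEDV} = 34.915 + (37.501 - 34.915) \cdot CHAS + \ldots = 34.915 + 2.586 \cdot CHAS + \ldots$$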

Automatic Data Preparation


The SPSS Modeler includes automatic data preprocessing, involving, amongst other
functions, the detection and trimming of outliers. It also changes the type of a
variable if it finds another type more appropriate for regression analysis. For
example, the Modeler changes the type from numeric to ordinal if the predictor
has only a small number of distinct values. This has been done for the CHAS variable in the
Boston dataset (see Fig. 5.67).
These data preparations are listed in the Automatic Data Preparation view of the
linear regression model nugget (see Fig. 5.67).

Model Building Summary


In the Model Building Summary view, we can find a table that gives an overview of
the variable selection process. The selection method and criterion can be chosen in
the Linear node (see Sect. 5.3.2). More precisely, each row in the table represents an
input variable, and the different models involved in the selection
process are ordered in columns. For each variable considered in a
model, a tick is displayed in the corresponding row. At the top of each column,
the criteria value is shown, and the optimal model that is then used for the
regression is framed in black.
Figure 5.68 shows the model selection for the Boston housing dataset with the
best subset method and the AICC criterion. The optimal model found is the first one
in the table, and it contains 11 of the 13 input variables, which were included at the
beginning of the stream and the variable selection process.

Predictor Importance
In the Predictor Importance view, the relative importance of the predictors in the
model is visualized (see Fig. 5.69). This is the relative effect of the various variables
on the prediction of the observation value in the estimated model. The length of the
bars indicates the importance of the variables in the regression model. The impor-
tance values of the variables are positive and sum up to 1.

Fig. 5.67 List of trimming and outlier detection and elimination operations for each variable

The predictor importance values are not simply the relative proportions of the
variable coefficients, even though the coefficients describe the effect on the target variable if the input
variables change. The input variables are scaled differently, however, which means
that increasing one variable by one unit has a different effect than increasing
another variable by one unit. Thus, to compare the effects of the variables with
each other, and to assign a meaningful (relative) importance to the predictors, more
effort has to be put into the calculation, which is done by the Modeler automatically.
For a detailed description of the algorithm used here, we refer to IBM (2015a).

Significance of the Model and Variable Effects


The Modeler automatically calculates several F-statistics (see Fahrmeir (2013);
James et al. (2013)) to quantify the significance of the whole regression model and
the effect of every variable. The results are visualized in the Effects slide (see
Fig. 5.70). Here, the variables are ordered by their importance, and the width of the
lines corresponds to the effect significance, which pops up when sliding over the
line with a mouse.

Fig. 5.68 Summary of the “best subset” variable selection process for the Boston housing data

Fig. 5.69 Predictor importance of MLR on the Boston housing data



Fig. 5.70 Coefficient effects view of the regression model

As with the Coefficients view, we can switch to a Table view at the bottom of the
window (see arrow in Fig. 5.70), which then shows the significance levels and other
statistical values of the ANOVA, such as the F-statistic, of each variable effect and
of the whole model (Fig. 5.71). For information on ANOVA, see Kutner
et al. (2005) and Winer et al. (1991). The first small table can be expanded into a
detailed one, with all the above-mentioned statistics, by clicking on the “Corrected
Model” field.

Significance of the Coefficients


As in the SLR, the graphic “Coefficients” (Fig. 5.65) provides information of the
estimated coefficients and their significance to the regression model build. The
thickness of the various lines represents the significance of the variable in the model
(see Sect. 5.2.4 for a detailed description of this visualization and its options).
We can get a detailed overview of the coefficients and their significance
parameters by expanding the table view of the coefficients (see Sect. 5.2.4) with a
click on the Coefficient field (see Fig. 5.72). As before, additional parameters
relevant to the significance level are displayed, such as t-statistics and confidence
intervals.

Fig. 5.71 Table view of the coefficient effects of the regression model

Fig. 5.72 Overview of the coefficients and their significance level with additional statistical
parameters

5.3.4 Prediction of Unknown Values

Building a stream to predict unknown values with the MLR model is done in
exactly the same way as described for SLR. Thus, we omit a detailed description
here and refer to Sect. 5.2.5. Do recall, however, that the variable names in the new
dataset must coincide with the names of the model variables.

5.3.5 Cross-Validation of the Model

Up until now, the estimated model was only tested on how well it fits the data it was
built on. The model already knows these data records, whereas usually a
model is used to predict values from unknown data. So a model should only really
be deemed suitable if it gives a good approximation for general and unknown data,
and hence, overfitting of the model to the training data should be avoided. The
necessary test for this suitability is called cross-validation. It is the process of
assessing how well the regression generalizes to an independent dataset (see
Sect. 5.1.2).
We can utilize the Partition node to perform a cross-validation in the SPSS
Modeler. With this node, we split the data into two disjointed datasets, the training
data and the testing data. Afterwards, the model is built with the training data and
validated with the unknown test dataset (see Sect. 5.1.2).
The multiple regression model with cross-validation is presented for the Boston
Housing dataset. This stream can be found under the name “cross-validation MLR”.

Description of the model


Stream name cross-validation MLR
Based on dataset housing.data.txt (see Sect. 10.1.17)
Stream structure

Related exercise: 5, 6

Fig. 5.73 Partitioning the data into training and testing data

1. We open the “008_Template-Stream_Boston_data.str” and save it under the


name “cross-validation MLR.str”. We add a Partition node and split the data
into two separate datasets, as described in Sect. 2.7.7. Afterwards, we select the
training data and the test data with Select nodes. The total data is commonly split
into 70 % for training and 30 % for the test data (see Fig. 5.73). This step was
already discussed in Exercise 9 in Sect. 2.7.11.
The separate datasets can be selected via the Select node as displayed in
Fig. 5.74 for the training data.

" The select nodes can be automatically created by the SPSS Modeler.
To do this, we use the “Generate” field in the Partition node (see
upper arrow in Fig. 5.73 and Exercise 9 in Sect. 2.7.11.

2. We build a linear regression model using the training data as described in


Sect. 5.3.2.

Fig. 5.74 Selection of the training set

" Warning, a regression model depends highly upon generated training


data. Therefore, the model and its predictions may differ upon multi-
ple runs. To get the same sampling of the training data and thus the
same results for every run, the sampling seed can be fixed in the
Partition node (see Fig. 5.73).

3. To validate the estimated model, we copy the model nugget, paste it into the
stream canvas, and connect it to the test data.
4. We add an Analysis node to each of the model nuggets and run the stream. The
Analysis nodes compare the predicted values with the original values and
calculate various statistics for their difference or errors (see Figs. 5.75 and 5.76).
To decide if the regression model is applicable to unknown data and capable
of predicting its values, compare the error statistics of both the training
and test data, particularly the standard deviation of the errors, which corresponds
to the RMSE and is an indicator of the accuracy of the model (see Sect. 5.1.2). The
standard deviation for the test set (5.432) is a bit higher, which is usual, but not
far away from that of the training set (4.413). So, the model can be used as a
good predictor for independent datasets.
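
The same train/test logic can be reproduced outside the Modeler. The following Python sketch uses scikit-learn with simulated data (not the Boston housing file): the data are split 70/30, a linear model is fitted on the training part, and the RMSE is compared between the two partitions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)

# Simulated data standing in for a housing-style dataset.
n = 500
X = rng.normal(size=(n, 5))
y = 20 + X @ np.array([3.0, -2.0, 0.0, 1.5, 0.5]) + rng.normal(scale=2.0, size=n)

# 70/30 split, mirroring the Partition node; a fixed random_state mirrors a fixed seed.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

model = LinearRegression().fit(X_train, y_train)

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

print("training RMSE:", round(rmse(y_train, model.predict(X_train)), 3))
print("test RMSE:    ", round(rmse(y_test, model.predict(X_test)), 3))
```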

5.3.6 Boosting and Bagging (for Regression Models)

Theory
Instead of a standard model, we can also create ensemble models to improve
stability and accuracy. Ensemble models, or methods, combine multiple models,
each predicting the target variable; these predictions are afterwards aggregated
into one prediction. Typical aggregation functions are the mean and the median. In Fig. 5.77
this concept of ensemble model prediction is illustrated. With this procedure,

Fig. 5.75 Analysis of the training data errors

Fig. 5.76 Analysis of the test data errors



Fig. 5.77 Structure of an ensemble model

ensemble models are more likely to reduce the bias or prevent overfitting of the
training data, thus obtaining better predictions. These models are not only useful for
regression, but are even more commonly applied in classification problems, for
example in decision trees (see Sect. 8.8). See Tuffery (2011) and Zhou (2012) for
more information on ensemble methods.
The SPSS Modeler provides us with two ensemble methods: bagging (bootstrap
aggregation) and boosting.

Bagging (Bootstrap Aggregation)


This method generates multiple models on subsamples of the dataset of equal size,
which are sampled with replacement from the original dataset. Each subsample
created by the SPSS Modeler has the same size as the original dataset. Then, the
models built on each subsample form the ensemble model. This concept of bagging
is shown in Fig. 5.78. The ensemble model then predicts the target values by taking
the mean or median of the model predictions. See Tuffery (2011) for a more
detailed description on bagging and IBM (2015a) for the algorithm used by the
SPSS Modeler.
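
The bagging idea can be sketched manually in Python (this is a simplified illustration with simulated data, not the SPSS Modeler's internal implementation): bootstrap samples of the same size as the original data are drawn, one linear model is fitted per sample, and the predictions are aggregated by the mean.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)

# Simulated regression data for illustration only.
n = 300
X = rng.normal(size=(n, 4))
y = 5 + X @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=1.0, size=n)

# Fit one linear model per bootstrap sample (same size as the original data).
models = []
for _ in range(10):
    idx = rng.integers(0, n, size=n)          # sampling with replacement
    models.append(LinearRegression().fit(X[idx], y[idx]))

# Ensemble prediction: aggregate the component predictions, here by the mean.
new_X = X[:3]
component_preds = np.array([m.predict(new_X) for m in models])
ensemble_pred = component_preds.mean(axis=0)   # the median would be the alternative
print(np.round(ensemble_pred, 3))
```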

Boosting
This method generates a sequence of models to increase the accuracy of the
predictions. To build the successive model, the records are weighted based on the
previous model’s residuals. Records with large residuals get a higher weight than
ones with small residuals. So, the next component model focuses on the records
with large residuals, so that these records are also predicted well. All component models
are built on the entire dataset. Models generated in this way form the ensemble
model. Since boosting is commonly used in classification, we refer to Sect. 8.8 for
further explanations, or see Tuffery (2011) for a more detailed description on
boosting and IBM (2015a) for the algorithm used by the SPSS Modeler.
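
The reweighting idea behind boosting can be caricatured in a few lines of Python. The sketch below is deliberately simplified and is not the algorithm used by the SPSS Modeler: after each round, the records are reweighted proportionally to the absolute residuals of the previous component model, and the component predictions are finally averaged.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(5)

# Simulated data for illustration only.
n = 300
X = rng.normal(size=(n, 3))
y = 1 + X @ np.array([2.0, 0.5, -1.0]) + rng.normal(scale=1.0, size=n)

models, weights = [], np.full(n, 1.0 / n)
for _ in range(5):
    model = LinearRegression().fit(X, y, sample_weight=weights)
    models.append(model)
    residuals = np.abs(y - model.predict(X))
    # Records with large residuals receive a higher weight in the next round.
    weights = residuals / residuals.sum()

# A simple aggregation of the component predictions (here: the mean).
preds = np.mean([m.predict(X[:3]) for m in models], axis=0)
print(np.round(preds, 3))
```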

Fig. 5.78 The concept of bagging. On each newly sampled dataset a model is built, which is then
integrated into the ensemble model

Ensemble Methods in the Linear Node


These ensemble methods are implemented in the Linear node, and here we describe
how they are applied in the Modeler, based on the Boston Housing data.

Description of the model


Stream name Bagging_MLR
Based on dataset housing.data.txt (see Sect. 10.1.17)
Stream structure

Related exercise: 6

Fig. 5.79 Selection option of bagging and boosting in the Linear node

1. We open the “cross-validation MLR.str” stream of the previous section about


the cross-validation process and save it under the name “bagging_MLR.str”.
Then, open the Linear node.
2. We can select these methods in the Linear node in the Build options tab, as
shown in Fig. 5.79. Here, we selected the bagging method. Note that building
ensemble models can take much longer than building a standard model, since
multiple component models have to be created.
3. The method used for the aggregation, as well as the number of component
models, can also be chosen in the Linear node, as displayed in Fig. 5.80.
Possible aggregation functions for bagging are mean and median. Here, we
chose the mean option.
4. Now, we run the stream and open the model nugget to inspect the model
quality.

Fig. 5.80 Aggregation methods and number of component models

5. In the Model Summary view, we see the quality of the ensemble model and the
reference model, measured by accuracy, i.e., the coefficient of determination.
The reference model is the one that would be created if the standard model
option were chosen (see Fig. 5.79). We observe that the ensemble
method, i.e., bagging, has increased the quality of the prediction (see Fig. 5.81).
6. The calculated predictor importance differs naturally from the importance of
the standard model, since these measures are now a combination of several
single models. This can even result in a different order of variable importance.
The predictor importance is displayed in the Predictor Importance view.
7. How often each input variable was picked in the selection process for the
different models is displayed in the Predictor Frequency view (see Fig. 5.82).
With the slider at the bottom, the number of displayed variables can be
changed.
8. The accuracy of each component model in the ensemble can be inspected in the
Component Model Accuracy view. These are visualized through the dots on the
left of Fig. 5.83. Furthermore, the accuracy of the ensemble model (aggregation
of each component model) and the reference model are plotted. We see that the
ensemble model fits the data better than the reference model and every component model, which seems surprising at first. But this is a great example of how aggregating several models can increase the quality (accuracy), since the errors of the single components compensate each other, which leads to better and more robust predictions.

Fig. 5.81 Quality of the ensemble model compared to the reference model

Fig. 5.82 Visualization of how often each input variable was selected for a model
9. A more detailed list of each component model, including the accuracy and
number of input variables, is shown in the Component Model Details view (see
Fig. 5.84).
10. At the top of the model nugget window, we can declare which model should
be used for prediction: the ensemble or the reference model. In the case of the
ensemble model, we are even able to choose the aggregation method. Here,
we select the mean aggregation method (see arrows in Fig. 5.84).

Fig. 5.83 Graphic of the accuracy of the model in comparison with the reference model

Fig. 5.84 Detailed list of the several models in the ensemble

11. Figures 5.85 and 5.86 show the statistics of the ensemble model predictions
calculated by the Analysis node for the training and the testing set. Comparing
these with the errors (RMSE) of the standard model (Figs. 5.75 and 5.76), we
see that the ensemble model improves the fit of the model to the data and its
prediction of unknown data.
Fig. 5.85 Output of the Analysis node for the training data

Fig. 5.86 Output of the Analysis node for the testing data

5.3.7 Exercises

Exercise 1: Prediction of the Prize Money Potential of LPGA Golfers


The dataset “LPGA2009.csv” (see Sect. 10.1.22) contains performance and success
statistics from 146 female golfers who competed in the LPGA tour, 2009. Build an
MLR model that predicts the prize money a player could earn based on her
performance stats. The prize money is given on a logarithmic scale [prize money
(log)], which will be the target variable in the model.

1. Import the data file by using an appropriate data node.


2. Divide the dataset into a training set and a test set. The training dataset should
comprise 70 % of the entire data.
3. Use the Linear node to build an MLR model with the training set generated in
step 2. Select the variables using the Forward stepwise selection method and the
AICC criteria.
4. Which variables are included in the model and what is the value of the model
validation criteria?
5. Determine the estimated coefficients and identify the regression equation.
6. Are the coefficients significant for the model and are the residuals normally
distributed? What is the value of the Coefficient of determination?
7. Use your model to predict the potential prize money of the golfers from the test
data. Compare your results with the actual prize money they won.

Exercise 2: Determination of the Effect of a New Teaching Method


You may recall the data from the pretest and final exam scores “test_scores.sav”
(see Sect. 10.1.31). The dataset contains more than just these two variables. In
particular, the variable “teaching_method” is included, indicating which of two
teaching methods was used. All these additional variables can also have an influ-
ence on the outcome of the final exam. Perform an MLR to identify the effect of
these variables.

1. Import the dataset with an appropriate data node.


2. Build an MLR model with the Linear node. Use the “Best subset” method for
variable selection.
3. Determine the selected variables in the model. What is their influence on or
importance to the final exam score?
4. Interpret the model. Does the teaching method have an influence on the final test score?

Exercise 3: Multiple Choice Questions


Question Possible answers
1. Identify the steps needed to build ☐ The selection of only useful variables based on
a valid regression model. a valid theory
☐ The selection of all variables with a correlation
greater than 0.5 or smaller than −0.5
☐ Producing and interpreting the scatterplot and
calculating the correlation matrix
☐ Estimating the parameters after formulation of
the model type
☐ Checking model assumptions
2. What are proper partition rates ☐ 50–50 %
for training and test sets, in order ☐ 70–30 %
to perform a cross-validation? ☐ 80–20 %
☐ 30–70 %
3. The purpose of multiple linear ☐ Estimate only the parameters of simple linear
regression is to . . . models
☐ Determine a regression function for every
possible combination of two variables
☐ Estimate the parameters and confidence
intervals of a regression function
☐ Test if the estimated values are statistically
significant
☐ Build a model that reproduces the values from
the original dataset
☐ Build a model to predict values of a target
variable for new input data
4. Which of the following are ☐ Adjusted R2
variable selection criteria? ☐ Best subset method
☐ Overfit prevention criterion (ASE)
☐ F-statistics
☐ Information criterion (AICC)
Are the following statements correct? Yes No
Please mark “Yes” or “No”
5. If the model has n input variables, ☐ ☐
the regression line is an area of
dimension n − 1 a.
6. Knowledge of the problem field ☐ ☐
is critically important in the
initial selection of variables a.
7. The objective of correlation ☐ ☐
analysis is to gain insight into the
strength of the relationship of two
or more variables a.
8. Ensemble models can be used to ☐ ☐
improve accuracy and robustness
of the regression model a.
9. If your model is lacking due to ☐ ☐
overfitting, a solution is to
include more input variables a.
10. The variable importance ☐ ☐
describes how the target variable
is affected, in comparison to all
other input variables a.

Exercise 4: Linear Regression with the Regression Node


In this exercise, we build a regression model for the Boston housing data, housing.data.txt (see Sect. 10.1.17), with the Regression node.

1. Import the data and specify the variable types with the Type node.
2. Add a Regression node to the stream and select MEDV as the target variable and
all other variables as the input.
3. Choose the Backwards method to find the significant input variables and then run
the stream. This can be done in the Model tab.
4. Inspect the model nugget and identify the estimated coefficients and the regres-
sion equation. Which variables are included in the final model, and which
variable has a coefficient of 3.802?
5. What is the value of R2 and the adjusted R2?

The following steps are optional and include adding a cross-validation to the stream.

6. Include a Partition node in the stream and divide the dataset into 70 % training
data and 30 % test data.
7. Select the partition field in the Fields tab of the Regression node, setting it to use
only the training data in the model building procedure.
8. Add an Analysis node to the model nugget and run the stream again. Is the model
suitable for processing unknown data?

Exercise 5: Polynomial Regression with Cross-Validation


The dataset “mtcars.csv” (see Sect. 10.1.23) contains performance and design data
from 1974 on 32 automobiles, including their fuel consumption. The task is to
perform a polynomial regression analysis to compare and predict the miles per
gallon (mpg) from the horse power (hp) of a car. The polynomial relationship
thereby means that

mpg = β0 + β1 · hp + β2 · hp^2 + . . . + βp · hp^p,



where the degree p of the polynomial term is unknown and has to be determined by
cross-validation.
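For readers who want to see this model family outside the Modeler, the following minimal Python sketch fits polynomials of degree 1 and 2 and reports their training RMSE; the hp and mpg values below are synthetic stand-ins, not the real mtcars data.

import numpy as np

rng = np.random.default_rng(0)

# synthetic stand-in for the hp and mpg columns of mtcars.csv
hp = rng.uniform(50, 300, size=32)
mpg = 35 - 0.12 * hp + 0.0002 * hp**2 + rng.normal(scale=1.5, size=32)

# np.polyfit returns the coefficients beta_p, ..., beta_0 (highest power first)
for p in (1, 2):
    coef = np.polyfit(hp, mpg, deg=p)
    pred = np.polyval(coef, hp)
    rmse = np.sqrt(np.mean((mpg - pred) ** 2))
    print(f"degree {p}: coefficients {coef}, RMSE {rmse:.3f}")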

1. Import the data file and familiarize yourself with the data by performing
descriptive analysis. What relationship can you surmise between the variables
“hp” and “mpg”?
2. Divide the dataset into training, validation, and test datasets with the proportion
60–20–20, respectively.
3. Build a polynomial regression model of the training data for several degrees
p with the Regression node. These should include at least the degrees 1 and 2.
4. Compare the R2 and AIC values of the models. Which of our models fits the training
data best? Perform a cross-validation of the models by adding Analysis nodes to the
model nuggets and comparing their output to the validation set. Which of the models
is the best candidate to predict the “mpg”? Use the statistics of the test set to decide if
the model is appropriate for prediction of “mpg” based on the “hp”.
5. Reflect on the previous step. The cross-validation was performed on a dataset
independent of the training data, in order to find the best model. Why is the test
step on the selected model still important, in order to guarantee a proper
prediction model?

Exercise 6: Boosted Regression


We once again use the dataset “mtcars.csv” from the previous exercise. The goal of this
exercise is to build a boosting regression model to predict the miles per gallon
(mpg) from the displacement (disp) and weight (wt) of a car.

1. Import the dataset and divide it into training and testing sets with the proportion
70–30, respectively.
2. Build a boosting regression model with forward selection mechanism.
3. Compare the ensemble model with the standard model. Which one is more accurate?

5.3.8 Solutions

Exercise 1: Prediction of the Prize Money Potential of LPGA Golfers


Name of the solution streams LPGA tour MLR
Theory discussed in section Section 5.2.1
Section 5.3.1
Section 5.3

The final stream for this exercise is shown in Fig. 5.87. The stream can be found in
the solutions, under the name “LPGA tour MLR”.

1. We import the data with the Var. File node. Pay attention to the quotes on the
column names. We select the value “Pair and discard” for the Double quotes
variable in the source node. The data consist of 13 variables.

Fig. 5.87 Stream of the MLR of the LPGA data

2. We divide the dataset into a training dataset (70 %) and a test dataset (30 %) with
the Partition node, as shown in Fig. 5.73. Afterwards, select the training data and
test data with Select nodes as described in Sect. 5.3.5.
Warning: the selection of included variables and the regression model itself
depends highly upon the generated training data. Therefore, the model and your
results may differ from the ones presented here. To get the same sampling of the
training data, and thus the same results as presented here, choose the same seed
in the Partition node as we did in this solution, which is 1234567.
3. To build an MLR model, we add a Linear node to the stream and connect it to the
Select node of the training set. We choose the options as required by the
exercise, i.e., the “prize money (log)” is the target variable and all other variables
are predictors, except for the “golfers id”, since this is irrelevant information for
an MLR. The variable selection method is Forward stepwise with AICC criteria
(see Figs. 5.88 and 5.89).
We run the stream to build the model. Afterward, the stream should look like
Fig. 5.90.
4. The variables that are included in the final regression model are listed in “Model
Building Summary”. Here, these are rounds completed, average percentile in
tournaments, percent of fairway hits, green in regular putts per hole, percent of
greens reached in regulation, average drive, average strokes per round. This
can be seen in detail in Fig. 5.91. The final selection for the model of the
predictor variables is displayed in the frame.
The AICC value of this model is

AICC = −157.252.

See the arrow in Fig. 5.91.



Fig. 5.88 Predictor and target variable definition of the MLR for the LPGA data

Fig. 5.89 Model selection method and criteria definition



Fig. 5.90 Stream after building the regression model

Fig. 5.91 Summary of the variable selection for the final model
Fig. 5.92 Coefficients of the regression model

5. The estimated coefficients can be viewed in the “Coefficients” field and are
listed in Fig. 5.92.
Note that the coefficient of the last variable (average strokes per round) is
displayed as −0. The true estimated coefficient is a bit smaller than 0; this is
indicated by the negative sign. Since the Modeler automatically rounds
every real number to the third decimal digit, the actual estimate is not visible to the user,
and so we interpret the last coefficient as 0.
Hence, the regression equation can be determined as

h(x) = β^t · x, with

β = (22.984, 0.03, 0.036, −0.037, −6.165, 0.073, −0.022, 0),

and the variables are numbered as listed from the top down in Fig. 5.92.
6. All coefficients except for average strokes per round are significant at the 1 %
level, as can be seen in Fig. 5.92. The coefficient of the mentioned variable is
still significant at the 10 % level. The residuals can be assumed to be Gaussian
distributed, judging by the graphs in the Residual field of the model nugget
(see Figs. 5.93 and 5.94). Finally, the coefficient of determination is 0.904 (see
Fig. 5.95).

Fig. 5.93 Histogram of the residuals

Fig. 5.94 P–P Plot of the residuals


Fig. 5.95 Coefficient of determination

Fig. 5.96 Prediction and comparison part of the LPGA stream

7. To predict the prize money potential of the golfers from the test data, we copy
the model nugget, paste it into the stream, and connect it to the Select node of the
test data. We add an output node to it, such as a Table node, to display the
predicted values.
To compare the estimated prize money with the actual prize money, we use a
scatterplot and the Analysis node. This looks like Fig. 5.96, which shows the
second part, with the prediction and comparison of the results of the LPGA
stream.
In Fig. 5.97, we see the distribution statistics of the difference between the
actual prize money and the predicted prize money. As can be seen, the model
predicts the true values very well: the difference varies only between −1 and 1 around 0, the mean difference is almost 0, and the variation (RMSE) is very small. This can also be seen in a scatterplot, which is not shown here but is easy for interested readers to create themselves.

Fig. 5.97 Distribution statistics of the predicted and actual prize money

Exercise 2: Determination of the Effect of a New Teaching Method


Name of the solution streams test scores MLR
Theory discussed in section Section 5.1.1
Section 5.2.1
Section 5.3.1
Section 5.3

Figure 5.98 shows the final stream for this exercise.

1. We use the Statistics File node to read the data from the “test_scores.sav” file
(see Sect. 10.1.31). Then, we add the usual Type node.
2. We connect the Type node to a Linear node, select appropriate predictors and the
target variable (posttest), and select the “Best subset” method for variable
selection (see Figs. 5.99 and 5.100). We choose the AICC as the variable
selection criterion. The reader can pick another one if he or she prefers.
Here, we excluded “school” and “classroom” from the set of predictors, as we
want to diminish a possible school- or classroom-related bias. We could also include
these variables in the model. As an exercise, we recommend building such a model
too and comparing it with the model presented here. What are the differences?
We run the stream!
3. The predictors included in the final model are school_setting, teaching_method,
lunch, n_students, pretest, school_type. This can be inspected in the Model
Building Summary view of the model nugget (see Fig. 5.101).

Fig. 5.98 Stream of the MLR of the test scores

Fig. 5.99 Variable selection for the MLR of the test scores

Fig. 5.100 Definition of the variable selection method

The estimated coefficients of the input variables, which influence the posttest
results, are displayed in Fig. 5.102. There, we can also read the relative impor-
tance of the variables for the prediction of the target variable (posttest score). We
can see that only the variables “pretest” and “teaching method” have a notice-
able importance for the prediction, while the other three predictors are quite
dispensable, as their importance factor is almost 0 (see Fig. 5.102).
4. The teaching method has a significant influence on the posttest score, and its
coefficient is approximately 6. This means that there is a difference in the
results of the posttest scores of about six between the two methods of teaching,
and thus, this has to be considered when picking an approach for explaining the
material to the pupils.

Fig. 5.101 Model building summary view. Variables included in the final model

Fig. 5.102 Coefficients and the relative variable importance of the MLR for test score

Exercise 3: Multiple Choice Questions


Theory discussed in section Section 5.3

1. The following steps are necessary for building a regression model:

– The selection of only useful variables based on a valid theory.


– Estimating the parameters after formulation of the model type.
– Checking the model assumptions.

2. 70–30 % and 80–20 %


3. The purpose of an MLR is

– Estimation of the parameters and confidence intervals of a regression


function.
– To test if estimated values are statistically significant.
– To build a model to predict values of a target variable for new input data.

4. Variable selection criteria are

– Adjusted R2
– Overfit prevention criterion (ASE)
– F-statistics
– Information criterion (AICC)

5. Yes
6. Yes
7. Yes
8. Yes
9. No
10. Yes

Exercise 4: Linear Regression with the Regression Node


Name of the solution streams multiple_linear_regression_REGRESSION_NODE

Figure 5.103 shows the final stream for this exercise.

Fig. 5.103 Stream of the MLR with the Regression node

Fig. 5.104 Stream before selection of variables and methods

Fig. 5.105 Selection of target and input variables in the Regression node

1. After importing the data, we add the Type node and the Regression node, as
shown in Fig. 5.104.
2. We open the Regression node and in the Fields tab choose the target variable
MEDV, and the input variables as in Fig. 5.105. Thereby, all variables except for
MEDV are input variables.

Fig. 5.106 Selection of Backwards method as the building method

3. In the Model tab, we choose the Backwards variable selection method from the
drop-down menu (see the arrow in Fig. 5.106). Now, run the stream.
4. The estimated coefficients and the regression equation are located in the Sum-
mary tab of the model nugget under the Analysis header. This is shown in
Fig. 5.107. In summary, there are 11 variables included in the final model, and
RM is the variable with the coefficient 3.802.
5. In the Advanced tab, several tables are displayed, which summarize the model
selection process for each step. Here, we can retrace each step of the variable
selection process, see which variables are included or removed, and the
associated assessment values and test statistics. In particular, in the first table,
the variable selection process can be reviewed, with the removal and inclusion
of the variables in each step (see Fig. 5.108). Here, AGE is removed in the
first step and the INDUS variable in the second, as we are using backwards
selection (a small sketch of this procedure outside the Modeler follows after this solution step).
The next table, labeled “ANOVA”, lists goodness of fit statistics for each
model in the variable selection step (see Fig. 5.109). We can see that the model
quality increases with every step, by considering the sum of squares, for
example.
The values of R2 and adjusted R2 can be viewed in the “Model Summary”
table in the Advanced tab of the model nugget
(see Fig. 5.110). R2 is 0.741 and the adjusted R2 parameter has the value 0.735.

Fig. 5.107 Regression equation and the estimated coefficients

In the last table, the “Coefficients” table, the estimated coefficients with their
level of significance are listed for every model in the variable selection process
(see Fig. 5.111 for the first part of this table).
In the Expert tab in the Regression node, further statistical outputs can be
chosen, which are then displayed here in the Advanced tab (see Fig. 5.121 for
possible additional output statistics).
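For comparison with the Regression node, here is a small Python sketch of a simplified backwards elimination on synthetic data: the least significant predictor is dropped until all remaining p-values fall below a threshold. Note that this is only an illustration; the Modeler's Backwards method uses an F-to-remove criterion, so the details differ. The adjusted R2 printed at the end follows the usual formula R2_adj = 1 − (1 − R2)(n − 1)/(n − p − 1); assuming the model above is fit on all n = 506 records of the Boston housing file with p = 11 predictors, this formula turns 0.741 into approximately 0.735, consistent with the reported values.

import numpy as np
import pandas as pd
import statsmodels.api as sm

def backward_elimination(X, y, alpha=0.05):
    # repeatedly drop the predictor with the largest p-value above alpha
    cols = list(X.columns)
    model = sm.OLS(y, sm.add_constant(X[cols])).fit()
    while True:
        pvals = model.pvalues.drop("const")       # p-values of the predictors only
        worst = pvals.idxmax()
        if pvals[worst] <= alpha or len(cols) == 1:
            break
        cols.remove(worst)                        # remove the least significant predictor
        model = sm.OLS(y, sm.add_constant(X[cols])).fit()
    return cols, model

# synthetic example; with the housing data, X would hold the predictors and y the MEDV column
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 4)), columns=["x1", "x2", "x3", "x4"])
y = 2 + 1.5 * X["x1"] - 0.8 * X["x3"] + rng.normal(scale=0.5, size=200)

selected, final = backward_elimination(X, y)
print(selected, final.rsquared, final.rsquared_adj)   # kept predictors, R^2 and adjusted R^2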

Solution of the Optional Subtasks


This stream is also included in the “multiple_linear_regression_REGRESSION_NODE” file and looks like Fig. 5.112.

6. We include the Partition node in the stream before the Type node, and divide the
dataset into training data (70 %) and a test dataset (30 %), as described in Sect. 2.7.7.

Fig. 5.108 Overview of the variable selection process

Fig. 5.109 Goodness of fit measures for the models considered during the variable selection
process

Fig. 5.110 Table with goodness of fit values, including R2 and adjusted R2, for each of the model
selection steps

Fig. 5.111 First part of the coefficients table



Fig. 5.112 The MLR stream with cross-validation and the Regression node

Fig. 5.113 Selection of the partition field in the Regression node

7. In the Fields tab, we select the partition field generated in the previous step as
the Partition. This results in the model being built using only the training data
(see Fig. 5.113).
8. After adding the Analysis node to the model nugget, we run the stream. The
window in Fig. 5.114 pops up. There, we see the error statistics for the training

Fig. 5.114 Output of the Analysis node. Separate evaluation of the training data and testing data

data and test data separately. Hence, the two datasets are evaluated indepen-
dently of each other.
The mean error is near 0, and the standard deviation (RMSE) is, as usual, a bit
higher for the test data. The difference between the two standard deviations is
not optimal, but still okay. Thus, the model can be used to predict the MEDV of
unknown data (see Sect. 5.1.2 for cross-validation in this chapter).

Exercise 5: Polynomial Regression with Cross-Validation


Name of the solution streams polynomial_regression_mtcars
Theory discussed in section Section 5.3
Section 5.1.2

Figure 5.115 shows the final stream of this exercise.

1. We import the mtcars data with the Var. File node. To get the relationship
between the variables “mpg” and “hp”, we connect a Plot node to the Source
node and plot the two variables in a scatterplot. The graph is displayed in
Fig. 5.116. In the graph, a slight curve can be observed, and so we suppose that
a nonlinear relationship is more reasonable. Hereafter, we will therefore build
two regression models with exponents 1 and 2 and compare them with each
other.
To perform a polynomial regression of degree 2, we need to calculate the
square of the “hp” variable. For that purpose, we add a Derive node to the stream

Fig. 5.115 Stream of the polynomial regression of the mtcars data via the Regression node

Fig. 5.116 Scatterplot of the variables “hp” and “mpg”



Fig. 5.117 Definition of the squared “hp” variable in the Derive node

and define a new variable “hp2”, which contains the squared values of “hp” (see
Fig. 5.117 for the setting of the Derive node).
2. To perform cross-validation, we partition the data into training (60 %), valida-
tion (20 %), and test (20 %) sets via the Partition node (see Sect. 2.7.7, for details
on the Partition node). Afterwards, we add a Type node to the stream and
perform another scatterplot of the variables “hp” and “mpg”, but this time we
color and shape the dots depending on their set partition (see Fig. 5.118). What
we can see once again is that, with only 32 records, the model can depend heavily on the explicit selection
of the training, validation, and test data. In order to get a more robust model, a bagging
procedure could also be used.
3. Now, we add two Regression nodes to the stream and connect each of them with
the Type node. One node is used to build a standard linear regression while the
other fits a degree polynomial function to the data. Here, we only present the
settings for the second model. The settings of the first model are analog. For that
purpose, we define “mpg” as the target variable and “hp” and “hp2” as the input
variables in the Fields tab of the Regression node. The Partition field is once
again set as Partition (see Fig. 5.119).
In the Model tab, we now enable “Use partitioned data” to include cross-
validation in the process and set the variable selection method to “Enter”. This
guarantees that both input variables will be considered in the model (see
Fig. 5.120).

Fig. 5.118 Scatterplot of “mpg” and “hp”, colored and shaped depending on the partition they
belong to

Fig. 5.119 Definition of the variable included in the polynomial regression



Fig. 5.120 Using partitioning in the Regression node

To calculate the Akaike information criterion (AIC, see Fahrmeir (2013)), which will be compared between the models in part 4 of this exercise, we open the Expert tab and switch from the Simple to the Expert options. Then, we click on Output and mark the Selection criteria option. This ensures the calculation of further variable selection criteria, including the AIC (see Fig. 5.121).
After setting up the options for the linear regression too, we then run the
stream and the model nuggets appear.
4. When comparing the R2 and AIC values of the models, which can be viewed in
the Advanced tab of the model nuggets, we see that the R2 of the degree 2 model
is higher than the R2 of the degree 1 model. Furthermore, the inverse holds true
for the AIC values of the models (see Figs. 5.122 and 5.123). Both statistical
measures indicate that the polynomial model, comprising a degree 2 term, fits
the data better.
To perform a cross-validation and evaluate the model on the validation set and
test set, we connect an Analysis node to each of the model nuggets, and then we
run the stream to get the error statistics. These are displayed in Figs. 5.124 and
5.125, for the model with only a linear term and the model with a quadratic term.
We see that the RMSE values are lower for all three sets, the training,
validation, and test sets, in the degree 2 model, in comparison with the linear
model. Hence, cross-validation suggests the inclusion of a quadratic term in the
regression equation. The large difference in the RMSE of the validation set

Fig. 5.121 Enabling of the selection criteria, which include the AIC

Fig. 5.122 Goodness of fit statistics and variable selection criteria for the degree 1 model

Fig. 5.123 Goodness of fit statistics and variable selection criteria for the degree 2 model

Fig. 5.124 Error statistics in the degree 1 model

Fig. 5.125 Error statistics in the degree 2 model



(4.651) compared to the other two sets (2.755 and 1.403) results from the very low
number of occurrences in the validation set.
Now, even the test set has a mean error close to 0 and a low standard deviation
for the degree 2 model. This indicates the universality of the model, which is thus
capable of predicting the “mpg” of a new dataset.
5. After training several models, the validation set is used to compare these models
with each other. The model that best predicts the validation data is then the best
candidate for describing the data. Thus, the validation dataset is part of the
model building and finding procedure. It is possible that the validation set
somehow favors a specific model, because of a bad partition sampling for
example. Hence, due to the potential for bias, another evaluation of the final
model should be done on an independent dataset, the test set, to confirm the
universality of the model. The final testing step is therefore an important part in
cross-validation and finding the most appropriate model.

Exercise 6: Boosted Regression


Name of the solution streams Boosting_regression_mtcars
Theory discussed in section Section 5.1.2
Section 5.3.6

Figure 5.126 shows the final stream of this exercise.

1. We import the mtcars data with the Var. File node and add a Partition node to
split the dataset into training (70 %) and testing (30 %) sets. For a description of
the Partition node and the splitting process, we refer to Sect. 2.7.7 and
Sect. 5.3.5. After adding the typical Type node, we add a Select node to restrict
the modeling to the training set only. The Selection node can be generated in the
Partition node. For details see Sect. 5.3.5 and Fig. 5.73.
2. To build the boosting model, we add a Linear node to the stream and connect it
with the Select node. Afterwards, we open it and select “mpg” as the target and
“disp” and “wt” as the input variables in the Fields tab (see Fig. 5.127).

Fig. 5.126 Stream of the boosting regression of the mtcars data with the Linear node

Fig. 5.127 Definition of the target and input variables

Next, we choose the boosting option in the Building Options tab to ensure a
boosting model building process (see Fig. 5.128). In the Ensemble options, we
define 100 as the maximum number of components in the final ensemble model (see
Fig. 5.129).
Now, we run the stream and the model nugget appears.
3. To inspect the quality of the ensemble model, we open the model nugget and
notice that the boosting model is more precise (R2 = 0.734) than the reference
model (R2 = 0.715) (see Fig. 5.130). Thus, boosting increases the fit of the
model to the data, and this ensemble model is chosen for prediction of the “mpg”
(see arrow in Fig. 5.130).
In the Ensemble accuracy, we can retrace the accuracy progress of the
modeling process. Here, we see that only 26 component models are trained

Fig. 5.128 Definition of the boosting modeling technique

since further improvement by additional models wasn’t possible, and thus, the
modeling process stopped here (Fig. 5.131).
For cross-validation and the inspection of the RMSE, we now add an Analysis
node to the stream and connect it with the model nugget. As a trick, we further
change the data connection from the Selection node to the Type node (see
Fig. 5.132). This will ensure that both the training and testing partitions are
considered simultaneously by the model. Hence, the predictions can be shown in
one Analysis node.
After running the stream again, the output of the Analysis node appears (see
Fig. 5.133). We can see that the RMSE of the training and
testing data do not differ much, hence indicating a robust model which is able to
predict the “mpg” for unseen data.

Fig. 5.129 Definition of the maximum components for the boosting model

Fig. 5.130 Quality of the boosting and reference model



Fig. 5.131 Accuracy progress while building the boosting regression

Fig. 5.132 Change of the data connection from the Select node to the Type node

Fig. 5.133 Statistics of the Analysis node



5.4 Generalized Linear (Mixed) Model

Generalized Linear (Mixed) Models (GLM and GLMM) are essentially a generali-
zation of the linear and multiple linear models described in the two previous
sections. They are able to process more complex data, such as dependencies
between the predictor variables and a nonlinear connection to the target variable.
A further advantage is the possibility of modeling data with non-Gaussian
distributed target variables. In many applications, a normal distribution is inappro-
priate and even incorrect to assume. An example would be when the response
variable is expected to take only positive values that vary over a wide range. In this
case, a constant change of the input leads to a large change in the prediction, which
can even become negative in the case of a normal distribution. In this particular example,
assuming an exponentially distributed variable would be more appropriate. Besides
these fixed effects, the Mixed Models also add additional random effects that
describe the individual behavior of data subsets. In particular this helps us to
model the data on a more individual level.
This flexibility and the possibility of handling more complex data comes with a
price: the loss of automatic procedures and selection algorithms. There are no black
boxes that help to find the optimal variable dependencies, link function to the target
variable, or its distribution. YOU HAVE TO KNOW YOUR DATA! This makes
utilization of GLM and GLMM very tricky, and they should be used with care and
only by experienced data analysts. Incorrect use can lead to false conclusions and
thus false predictions.
The Modeler provides two nodes for building these kinds of models, the GenLin
node and the GLMM node. With the GenLin node, we can only fit a GLM to the
data, whereas the GLMM node can model Linear Mixed Models as well. After a
short explanation of the theoretical background, the GLMM node is described here
in more detail. It comprises many of the same options as the GenLin node, but the
latter has a wider range of options for the GLM, e.g., a larger variety of link
functions, and is more elegant in some situations. A detailed description of the
GenLin node is omitted here, but we recommend completing Exercise 2 in the later
Sect. 5.4.5, where we explore the GenLin node while fitting a GLM model.

5.4.1 Theory

The GLM and GLMM can process data with continuous or discrete target variables.
Since GLMs and Linear Mixed Models are two separate classes of model, with the
GLMM combining both, we will explain the theoretical background of these two
concepts in two short sections.

Generalized Linear Model


You may recall the setting for the MLR. In it we had data records (xi1, . . ., xip), each
consisting of p input variables, and for each record an observation yi, but instead of
a normal distribution, y1, . . ., yn can be assumed to follow a different law, such as

Poisson-, Gamma-, or Exponential distribution. Furthermore, the input variables
fulfill an additive relationship, as in the MLR,

h(xi1, . . ., xip) = β0 + β1 · xi1 + . . . + βp · xip.

The difference from an ordinary MLR is that this linear term is linked to the mean
of the observation distribution via a so-called link function g, i.e.,

g(mean of the target variable) = h(xi1, . . ., xip).

This provides a connection between the target distribution and the linear predictor.
There are no restrictions concerning the link function, but its domain should
coincide with the support of the target distribution, in order to be practical. There
are, however, a couple of standard functions which have proved to be adequate. For
example, if the target variable is binary and g is the logistic function, the GLM is a
logistic regression, which is discussed in Sect. 8.3. We refer to Fahrmeir (2013) for
a more detailed theoretical introduction to GLM.
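As a small, hedged illustration outside the Modeler (synthetic data, not the GenLin or GLMM implementation), the following Python code fits a GLM with a Poisson-distributed target and the log link using statsmodels:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# synthetic data: Poisson counts whose log-mean is linear in two inputs
X = rng.uniform(0, 2, size=(200, 2))
mu = np.exp(0.3 + 0.8 * X[:, 0] - 0.5 * X[:, 1])
y = rng.poisson(mu)

# GLM with the Poisson family; the default link of this family is the log function
model = sm.GLM(y, sm.add_constant(X), family=sm.families.Poisson()).fit()
print(model.params)   # estimated coefficients (intercept, beta_1, beta_2)
print(model.aic)      # Akaike information criterion of the fit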

Linear Mixed Model


Understanding Linear Mixed Models mathematically is quite challenging, and so
here we just give a heuristic introduction, so that the idea of adding random effects
is understood, and we refer to Fahrmeir (2013) for a detailed description of Mixed
Models.
We start in the same situation and have the same samples as with MLR: a record
of input variables and an observed target variable value. With Linear Mixed Models
though, the samples form clusters where the data structure is slightly different.
Think, for example, of a medical study with patients receiving different doses of a
new medication. Patients with different medical treatments probably have a differ-
ent healing process. Furthermore, patients of different ages, and other factors, react
differently to the medication.
The Linear Mixed Models allow for these individual and diverse behaviors
within the cluster. The model consists of two different kinds of coefficients. First,
as in a typical regression, there are effects that describe the data or population in
total, when we think of the medical study example. These are called fixed effects.
The difference from ordinary MLR is an additional linear term of random effects
that are cluster or group specific. These latter effects now model the individual
behaviors within the cluster, such as different responses to the medical treatment in
different age groups. They are assumed to be independent and normally distributed
(see Fahrmeir (2013) for more information on the Linear Mixed Model).
The GLMM is then formed through a combination of the GLM with the Linear
Mixed Model.
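To make the idea of fixed and random effects a little more tangible, the following sketch fits a linear mixed model with one fixed effect and a random intercept per group on synthetic clustered data, using statsmodels; the GLMM node additionally supports non-Gaussian targets and link functions.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

# synthetic clustered data: 20 groups, each with its own random intercept
groups = np.repeat(np.arange(20), 10)
group_effect = rng.normal(scale=2.0, size=20)[groups]
x = rng.normal(size=200)
y = 1.0 + 0.5 * x + group_effect + rng.normal(scale=1.0, size=200)
df = pd.DataFrame({"y": y, "x": x, "group": groups})

# linear mixed model: fixed effect for x, random intercept for each group
result = smf.mixedlm("y ~ x", df, groups=df["group"]).fit()
print(result.summary())   # fixed-effect estimates and the group variance component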

Estimation of the Coefficients and the Goodness of Fit


The common method for estimating the coefficients that appear in these models is
the “maximum likelihood approach” (see Fahrmeir (2013) for a description of this

estimation). We should mention that other methods exist for estimating coefficients,
such as the Bayesian approach. For an example see once again Fahrmeir (2013).
A popular statistical measure for comparing and evaluating models is the “Akaike
criterion (AICC)”. As in the other regression models, the mean squared error is also a
commonly used value for determining the goodness of fit of a model. This parame-
ter is consulted during cross-validation in particular. Another criterion is the
“Bayesian information criterion (BIC)”, see Schwarz (1978), which is also calcu-
lated in the Modeler.
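For reference, the usual definitions of these criteria are, with L the maximized likelihood, k the number of estimated parameters, and n the number of observations:

AIC = −2 · ln(L) + 2k,
AICC = AIC + 2k(k + 1)/(n − k − 1),
BIC = −2 · ln(L) + k · ln(n).

In all three cases, smaller values indicate a better trade-off between goodness of fit and model complexity.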

5.4.2 Building a Model with the GLMM Node

Here, we set up a stream for a GLMM with the GLMM node using the “test_scores”
dataset. It comprises data from schools that used traditional and new/experimental
teaching methods. Furthermore, the dataset includes the pretest and the posttest
results of the students, as well as variables to describe the student’s characteristics.
The learning quality depends not only on the method but also, for example, on the
teacher and how he or she delivers the educational content to the pupils, and on the class
itself: if the atmosphere is noisy and twitchy, the class has a learning disadvantage.
These considerations justify a mixed model and the use of the GLMM node to
predict the posttest score.

Description of the model


Stream name Generalized linear mixed model
Based on dataset test_scores.sav (see Sect. 10.1.31)
Stream structure

Important additional remarks


The GLMM requires knowledge of the data, the variable dependencies, and the distribution of
the target variable. Therefore, a descriptive analysis is recommended before setting the model
parameters
Related exercises: 1

1. To start the process, we use the template stream “001 Template-Stream test_scores” (see Fig. 5.134). We open it and save it under a different name.
2. To get an intuition of the distribution of the target variable, we add a Histogram
node to the Source node and visualize the empirical distribution (see Sect. 3.2.3
for a description of the Histogram node). In Fig. 5.135, we see the histogram of

Fig. 5.134 Template stream “test_scores”

Fig. 5.135 Histogram of the target variable with a normal distribution



the target values with the curve of a normal distribution. This indicates a
Gaussian distribution of the target variable.
3. We add the GLMM node from the SPSS Modeler toolbar to the stream, connect
it to the Type node, and open it to define the data structure and model
parameters.
4. In the Data Structure tab, we can specify the dependencies of the data records, by
dragging the particular variables from the left list and dropping them onto the
subject field canvas on the right. First, we drag and drop the “school” variable,
then the “classroom” variable, and lastly the “student_id” variable, to indicate
that students in the same classroom correlate in the outcome of the test scores
(see Fig. 5.136).
5. The random effects, induced by the defined data structure, can be viewed in the
Fields and Effects tab. See Fig. 5.137 for the included random effects in our
stream for the test score data, which are “school” and the factor
“school*classroom”. These indicate that the performance of the students
depends on the school as well as on the classrooms and are therefore clustered.
We can manually add more random effects by clicking on the “Add Block. . .”
button at the bottom (see arrow in Fig. 5.137). See IBM (2015b, pp. 180–181) for
details on how to add further effects.

Fig. 5.136 Definition of the clusters for the random effects in the GLMM node

Fig. 5.137 Random effects in the GLMM node

6. In the “Target” options, the target variable, its distribution, and relationship to
the linear predictor term can be specified. Here, we choose the variable “post-
test” as our target. Since we assume a Gaussian distribution and a linear
relationship to the input variables, we choose the “Linear model” setting (see
arrow in Fig. 5.138).
We can choose some other models predefined by the Modeler for continuous,
as well as categorical target variables [see the enumeration in Fig. 5.138 or IBM
(2015b, p. 175)]. We can further use the “Custom” option to define the distribu-
tion and the link function manually if none of the above models are appropriate for
our data. A list of the distributions and link functions can be found in IBM
(2015b, pp. 178–179).
7. In the “Fixed Effects” options, we can specify the variables with deterministic
influence on the target variable, which should be included in the model. This can
be done by dragging and dropping. We select the particular variable in the left
list and drag it to the “Effect builder canvas”. There we can select multiple
variables at once and we can also define new factor variables to be included in
the model. The latter can be used to describe dependencies between single input
variables. The options and differences between the columns in the “Effect
builder canvas” are described in the Infobox below.
In our example, we want to include the “pretest”, “teaching_method”, and
“lunch” variables as single predictors. Therefore, we select these three variables
in the left list and drag them into the “Main” column on the right. Then we select
the “school_type” and “pretest” variable and drag them into the “2-way” col-
umn. This will bring the new variable “school_type*pretest” into the model
since we assume a dependency between the type of school and the pretest results.

Fig. 5.138 Definition of the target variable, its distribution, and the link function

To include the intercept in the model, we check the particular field (see left
arrow in Fig. 5.139).

There are four different possible drop options in the “Effect builder canvas”:
Main, *, 2-way, and 3-way. If we drop the selected variables, for example A,
B, and C, into the Main effects column, each variable is added individually to
the model. If, on the other hand, the 2-way or 3-way is chosen, all possible
variable combinations of 2 or 3 are inserted. In the case of the 2-way
interaction, this means for the example here, that the terms A*B, A*C, and
B*C are added to the GLMM. The * column adds a single term to the model,
which is a multiplication of all the selected variables.
Besides these given options, we can specify our own nested terms and add
them to the model with the “Add custom term” button on the right [see the
right arrow in Fig. 5.139 and for further information see IBM (2015b)].

8. In the “Build options” and “Model Options”, we can specify further advanced
model criteria that comprise convergence settings of the involved algorithms.
We refer the interested reader to IBM (2015b).
9. Finally, we run the stream and the model nugget appears.

Fig. 5.139 Definition of the fixed effects

5.4.3 The Model Nugget

Some of the tables and graphs in the model nugget of the GLMM are similar to the
ones of MLR, for example the Predicted by Observed, Fixed Effects, and Fixed
Coefficients views. For details, see Sect. 5.3.3, where the model nugget of the MLR
is described. It should not be surprising that these views equal the ones of the MLR,
since the fixed input variables form a linear term, as in the MLR. The only
difference to the MLR model is the consideration of co-variate terms, here
“pretest*school_type” (see Fig. 5.140). These terms are treated as normal input variables
and are therefore handled no differently.
The single effects are significant, whereas the product variable “school_type*pretest”
has a relatively high p-value of 0.266. Therefore, a more suitable
model might be one without this variable.

Model Summary
In the summary view, we can get a quick overview of the model parameters, the
target variable, its probability distribution, and the link function. Furthermore, two
model evaluation criteria are displayed, the Akaike (AICC) and a Bayesian criterion,

Fig. 5.140 Coefficients of the test scores GLMM with the co-variate term “pretest*school_type”

Fig. 5.141 Summary of the GLMM

which can be used to compare models with each other. A smaller value thereby
means a better fit of the model (Fig. 5.141).

Random Effects and Covariance Parameters


The views labeled “Random Effects Covariances” and “Covariance Parameters”
display the estimates and statistical parameters of the random effects variance
within the clusters, as well as covariances between the clusters; in our example,
these are the school and the classrooms.

Fig. 5.142 Covariance matrix of the random effects

The covariances of the separate random effects are in the “Random Effect
Covariances” view. The color indicates the direction of the correlation, darker
color means positively correlated and lighter means negatively correlated. If there
are multiple blocks, we can switch between them in the bottom selection menu (see
Fig. 5.142). A block is thereby the level of subject dependence, which was defined
in the GLMM node, in this case the school and the classroom (see Fig. 5.137). The
value, displayed in Fig. 5.142 for example, is a measure of the variation between the
different schools.
In the “Covariance Parameter” view, the first table gives an overview of the number
of fixed and random effects that are considered in the model. In the bottom table,
parameters of the residual and covariance estimates are shown. As well as the
estimates, these include the standard error and confidence interval, in particular (see
Fig. 5.143). At the bottom, we can switch between the residual and the block estimates.
In the case of the test score model, we find that the estimated effect of school
subjects is much higher (22.676) than the effect of the residuals (7.868), and the
“school*classroom” subject, which is 7.484 (see Fig. 5.144 for school subject
variation and Fig. 5.143 for the residual variation estimate). This indicates that
most of the variation in the test score set, which is unexplained by the fixed effects,

Fig. 5.143 Covariance Parameters estimate view in the model nugget

Fig. 5.144 Estimate of school subject variation

can be described by the between-school variation. However, since the standard


error is high, the actual size of the effect is uncertain and cannot be clearly specified.

5.4.4 Cross-Validation and Fitting a Quadratic Regression Model

In all previous examples and models, the structure of the data and the relationships
between the input and target variables were clear before estimating the model. In
many applications, this is not the case, however, and the basic parameters of an
assumed model have to be chosen before fitting the model. One famous and often
referenced example is polynomial regression, see Exercise 5 in Sect. 5.3.7. Polyno-
mial regression means that the target variable y is connected to the input variable
x via a polynomial term

y = β0 + β1 · x + β2 · x^2 + . . . + βp · x^p.

The degree p of this polynomial term is unknown however, and has to be deter-
mined by cross-validation. This is a typical application of cross-validation, to find
the optimal model, and afterwards test for universality (see Sect. 5.1.2). Recall
Fig. 5.4, where the workflow of the cross-validation process is visualized.
Cross-validation has an advantage over the usual selection criteria, such as forward
or backward selection, since the decision on the exponent is made on an
independent dataset, and thus overfitting of the model is prevented. Selection
via such criteria alone can fail in this respect.
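The following Python sketch illustrates this workflow on synthetic stand-ins for the temp and O3 variables (the real Ozone.csv values are not used here): polynomials of several degrees are fit on the training part, the degree with the smallest validation RMSE is selected, and the chosen model is then judged once on the untouched test part.

import numpy as np

rng = np.random.default_rng(1234567)

# synthetic stand-in for the temp and O3 columns of Ozone.csv
temp = rng.uniform(30, 100, size=330)
o3 = 0.004 * (temp - 40) ** 2 + rng.normal(scale=2.0, size=330)

# 60/20/20 split into training, validation, and test indices
idx = rng.permutation(len(temp))
train, val, test = idx[:198], idx[198:264], idx[264:]

def rmse(y, yhat):
    return np.sqrt(np.mean((y - yhat) ** 2))

def validation_error(p):
    coef = np.polyfit(temp[train], o3[train], deg=p)
    return rmse(o3[val], np.polyval(coef, temp[val]))

best_p = min(range(1, 5), key=validation_error)       # degree with the smallest validation RMSE
coef = np.polyfit(temp[train], o3[train], deg=best_p)
print(best_p, rmse(o3[test], np.polyval(coef, temp[test])))   # final check on the untouched test set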
In this section, we perform a polynomial regression with cross-validation, with the
GLMM node based on the “Ozone.csv” dataset, containing meteorology data and
ozone concentration levels from the Los Angeles Basin in 1976 (see Sect. 10.1.27).
We will build a polynomial regression model to estimate the ozone concentration from
the temperature.
We split the description into two parts, in which we describe first the model
building and then the validation of these models. Of course, we can merge the two
resulting streams into one single stream: we have saved this under the name
“ozone_GLMM_CROSS_VALIDATION”.

Building Multiple Models with Different Exponents

Description of the model


Stream name ozone_GLMM_1_BUILDING_MODELS
Based on dataset Ozone.csv (see Sect. 10.1.27)
Stream structure

Important additional remarks


Fix the seed in the Partition node so that the validation and testing sets in the second validation
stream are independent of the training data used here.

1. We start by importing the comma-separated “Ozone” data with a Var. File node.
2. Since we only need the ozone (O3) and temperature (temp) variable, we add a
Filter node to exclude all other variables from the further stream (see

Fig. 5.145 Filter node to exclude all irrelevant variables from the further stream

Fig. 5.145). For this we have to click on all the arrows from these variables,
which are then crossed out, so they won’t appear in the subsequent analysis.
3. Now, we insert the usual Type node, followed by a Plot node, in order to inspect
the relationship between the variables “O3” and “temp” (see Sect. 4.2 for
instructions on the Plot node). The relationship between the two variables can
be seen by the scatterplot output of the Plot node in Fig. 5.146.
As can be seen in the scatterplot in Fig. 5.146, the cloud of points suggests a slight
curve, which indicates a quadratic relationship between the O3 and temp
variables. Since we are not entirely sure, we choose cross-validation to decide
if a linear or quadratic regression is more suitable.
4. To perform cross-validation, we insert a Partition node into the stream, open it,
and select the “Train, test and validation” option to divide the data into three
parts (see Fig. 5.147). We choose 60 % of the data as the training set and split
the remaining data equally into validation and testing sets. Furthermore, we
should point out that we have to fix the seed of the sampling mechanism, so that
the three subsets are the same in the later validation stream. This is important,
as otherwise the validation would be with already known data and therefore

Fig. 5.146 Scatterplot of the “O3” and “temp” variables from the ozone data

Fig. 5.147 Partitioning of the test score data into training, validation and test sets

Fig. 5.148 Fields and Effects tab of the GLMM node. Target variable (O3) and the type of model
selection

inaccurate, since the model typically performs better on data it is fitted to. Here,
the seed is set to 1234567.
5. We generate a Selection node for the training data, for example with the
“Generate” option in the Partition node (see Sect. 2.7.7). Add this node to the
stream between the Type and the GLMM node.
6. Now, we add a GLMM node to the stream and connect it to the Select node.
7. Open the GLMM node and choose “O3” as the target variable in the Fields and
Effects tab (see Fig. 5.148). We also select the Linear model relationship, since
we want to fit a linear model.
8. In the Fixed Effects option, we drag the temp variable and drop it into the main
effects (see Fig. 5.149). The definition of the linear model is now complete.
9. To build the quadratic regression model, copy the GLMM node, paste it into the
stream canvas, and connect it with the Select node. Open it and go to the Fixed
Effects option. To add the quadratic term to the regression equation, we click on
the “Add a custom term” button on the right (see arrow in Fig. 5.149). The
custom term definition window pops up (see Fig. 5.150). There, we drag the
temp variable into the custom term field, click on the “By*” button, and drag
and drop the temp variable into the custom field once again. Now the window
should look like Fig. 5.150. We finish by clicking on the “Add Term” button,

Fig. 5.149 Input variable definition of the linear model. The variable temp is included as a
linear term

Fig. 5.150 Definition of the quadratic term of the variable temp



Fig. 5.151 Input variables of the quadratic regression

which adds the quadratic temp term to the regression as an input variable. The
Fixed Effects window should now look like Fig. 5.151.
10. Now we run the stream and the model nuggets appear. The stream for building
the three models is finished, and we can proceed with validation of those
models.

Cross-Validation and Testing


This is the second part of the stream, to perform a cross-validation on the Ozone
data. Here, we present how to validate the models from part one and test the best of
these models. Since this is the sequel to the above stream, we continue with the
enumeration in the process steps.

Description of the model


Stream name ozone_GLMM_2_VALIDATION
Based on dataset Ozone.csv (see Sect. 10.1.27)
Stream structure

Important additional remarks


Use the same seed in the Partition node as in the building model stream, so that the validation
and testing are independent of the training data.

11. We start with the just built stream “ozone_GLMM_1_BUILDING_MODELS”


from the previous step and save it under a different name.
12. Delete the Selection and both GLMM nodes, but keep the model nuggets.
13. Open the Partition node and generate a new Selection node for the validation
data. Connect the Selection node to the Partition node and both model nuggets.
14. To analyze the model performances, we could add an Analysis node to each of
the nuggets and run the stream, but we wish to describe an alternative way to
proceed with the validation of the models, which gives a better overview of all
the statistics. This will make it easier to compare the models later.
We add a Merge node to the canvas and connect each nugget to it. We open
it and go to the “Filter” tab. To distinguish the two output variables, we rename
them from just “$L-O3” to “$L-O3 quadric” and “$L-O3 linear”, respectively.
Now, we switch off one of each pair of remaining duplicate fields by clicking on
the corresponding arrows in the “Filter” column (see Fig. 5.152). After the merge,
each input variable is retained only once, alongside both target variables.
15. Now, we connect the Merge node with an Analysis node that was already put
on the canvas.
16. We run the stream. The output of the Analysis node is displayed in Fig. 5.153.
There, the statistics from both models are displayed in one output window,
which makes comparison easy.
As we can see, the mean errors of all three are close to 1, and the standard
deviations of the models are very close to each other. This indicates that

Fig. 5.152 Merge of the model outputs. Switching off duplicate fields

Fig. 5.153 Validation of the models with the output of the Analysis node

quadratic regression does not strongly improve the goodness of fit, but when
computing the RMSE (Sect. 5.1.2), the error of the quadratic regression is
smaller than the error of the linear model. Hence, the former outperforms
the latter and describes the data a bit better (a short sketch of this RMSE comparison follows these steps).
17. We still have to test this by cross-validating a chosen model with the
remaining testing data. The stream is built similarly to the other testing
streams (see, e.g., Sect. 5.2.5 and Fig. 5.154).
18. To test the model, first, we generate a Select node, connect it to the Type node,
and select the testing data.
19. We copy the model nugget of the best-performing model from the validation and paste it into the
canvas. Then, we connect it to the Select node of the test dataset.
20. We add another Analysis node at the end of this stream and run it. The output
is shown in Fig. 5.155. We can see that the mean error is almost 0 and the
RMSE is of the same order as the standard deviation of the training and
validation data. Hence, a quadratic model is suitable for predicting ozone
concentration from temperature.
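The RMSE comparison used in this cross-validation can be written down in a few lines. The sketch below uses toy numbers only; in practice, the observed and predicted values of the validation partition would be inserted, and all variable names here are hypothetical.

# Sketch of the RMSE (Sect. 5.1.2) used to compare the linear and quadratic models
import numpy as np

def rmse(observed, predicted):
    observed, predicted = np.asarray(observed), np.asarray(predicted)
    return np.sqrt(np.mean((observed - predicted) ** 2))

y_valid = np.array([3.0, 5.5, 8.1, 12.4])         # toy observed O3 values
pred_linear = np.array([4.1, 5.0, 9.0, 11.0])     # toy predictions of the linear model
pred_quadratic = np.array([3.4, 5.6, 8.4, 11.9])  # toy predictions of the quadratic model
print("RMSE linear   :", rmse(y_valid, pred_linear))
print("RMSE quadratic:", rmse(y_valid, pred_quadratic))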

Fig. 5.154 Test stream for the selected model during cross-validation

Fig. 5.155 Validation of the final model with the test set. Analysis output

5.4.5 Exercises

Exercise 1: Comparison of a GLMM and Ordinary Regression


The dataset “Orthodont.csv” (see Sect. 10.1.26) contains measures of orthodontic
change (distance) over time of 27 teenagers. Each subject thereby has different
tooth positioning and thus an individual therapy prescription, which will result in
movements of various distances. So, the goal of this exercise is to build a
Generalized Linear Regression Model with the GLMM node to model the situation
on an individual level and to predict the distance of tooth movement using the age
and gender of a teenager.

1. Import the data file and perform a boxplot of the “distance” variable. Is the
assumption of adding individual effects suitable?
2. Build two models with the GLMM node, one without random effects and one
with. Then, use the “Age” and “Sex” variables as fixed effects in both
models, while adding an individual intercept as a random effect in the latter
model.
3. Does the random effect improve the model fit? Inspect the model nuggets to
answer this question.

Exercise 2: The GenLin Node


In this exercise, we introduce the GenLin node and give readers the opportu-
nity to familiarize themselves with this node and learn the advantages and
differences, compared with the GLMM node. The task is to set up a stream of
the test scores dataset (see Sect. 10.1.31) and perform cross-validation with the
GenLin node.

1. Import the test score data and partition it into three parts, training, validation, and
test set, with the Partition node.
2. Add a GenLin node to the stream and select the target and input variables.
Furthermore, select the partitioning field as Partition, which indicates to which
subset each record belongs.
3. In the Model tab, choose the option “Use partitioned data” and select “Include
intercept”.
4. In the Expert tab, we set the normal distribution for the target variable and
choose Power as the link function with parameter 1.
5. Add two additional GenLin nodes to the stream by repeating steps 2–4. Choose
0.5 and 1.5 as power parameters for the first and second models, respectively.
6. Run the stream and validate the model performances with the Analysis nodes.
Which model is most appropriate for describing the data?
7. Familiarize yourself with the model nugget. What are the tables and statistics
shown in the different tabs?

Exercise 3: Generalized Model with Poisson-Distributed Target Variable


The “ships.csv” set (see Sect. 10.1.30) comprises data on ship damage caused
by waves. Your task is to model the counts of these incidents with a Poisson
regression, through the predictors ship type, year of construction, and
operation period. Furthermore, the aggregated months of service are stated for
each ship.

1. Reflect why a Poisson regression is suitable for this kind of data. If you are not
familiar with it, inform yourself about Poisson distribution and the kind of
randomness it describes.
2. Use the GenLin node to build a Poisson regression model with the three
predictor variables mentioned above. Use 80 % for the training dataset and
20 % for the test data. Is the model appropriate for describing the data and
predicting the counts of ship damage? Justify your answer.
3. Inform yourself on the “Offset” of a Poisson regression, for example, on
Wikipedia. What does this field actually model, in general and in the current
dataset?
4. Update your stream by a logarithmic transformation of the “months of
service” and then add these values as offset. Does this operation increase
the model fit?

5.4.6 Solutions

Exercise 1: Comparison of a GLMM and Ordinary Regression


Name of the solution streams glmm_orthodont
Theory discussed in section Section 5.4

Figure 5.156 shows the final stream for this exercise.

1. We import the “Orthodont.csv” data (see Sect. 10.1.26) with a Var. File node.
Then, we add the usual Type node to the stream.
Next, we add a Graphboard node to the stream and connect it with the Type
node. After opening it, we mark the “distance” and “subject” variables and click
on the Boxplot graphic (see Fig. 5.157). After running the node, the output
window appears, which displays the distribution of each subject’s distance via a
boxplot (see Fig. 5.158). As can be seen, the boxes are not homogeneous, which
confirms our assumption of individual effects. Therefore, the use of a GLMM is
highly recommended.
2. Next, we add a GLMM node to the stream canvas and connect it with the Type
node. We open it and select the “distance” variable as the target variable and
choose the linear model option as the type of the model (see Fig. 5.159).

Fig. 5.156 Complete stream of the exercise to fit a GLMM to the “Orthodont” data with the
GLMM node

Fig. 5.157 Options in the Graphboard node for performing a boxplot



Fig. 5.158 Boxplot for each subject in the Orthodont dataset

Now, we add the “Age” and “Sex” variables as fixed effects to the model, in
the Fields and Effects tab of the GLMM node (see Fig. 5.160). This finishes the
definition of the model parameters for the nonrandom effects model.
Now, we copy the GLMM node and paste it into the stream canvas. After-
wards, we connect it with the Type node and open it. In the Data Structure tab,
we drag the “Subject” field and drop it into the Subject canvas (Fig. 5.161). This
will include an additional intercept as a random effect in the model and finishes
the parameters definition of the model with random effects.
We finally run the stream, and the two model nuggets appear.
3. We open both model nuggets and take a look at the “Predicted by Observed”
scatterplots. These can be viewed in Figs. 5.162 and 5.163. As can be seen, the
points of the model with random effects lie in a straight line around the diagonal,
whereas the points of the model without random effects have a more cloudy

Fig. 5.159 Target selection and model type definition in the GLMM node

Fig. 5.160 Definition of the fixed effects in the GLMM node



Fig. 5.161 Adding an individual random effect to the GLMM model

Fig. 5.162 Predicted by Observed plot of the model without random effects

Fig. 5.163 Predicted by Observed plot of the model with random effects

Fig. 5.164 Complete stream of the exercise to fit a regression into the test score data with the
GenLin node

shape. Thus, the model with random effects explains the data better and is more
capable of predicting the distance moved, using the age and gender of a young
subject.
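As a cross-check outside the Modeler, both models of this exercise can be sketched with statsmodels: an ordinary regression with fixed effects only, and a mixed model with an additional random intercept per subject. The column names (distance, age, Sex, Subject) are assumptions about Orthodont.csv and may differ in the actual file.

# Sketch: fixed-effects-only model vs. random-intercept model for the Orthodont data
import pandas as pd
import statsmodels.formula.api as smf

orthodont = pd.read_csv("Orthodont.csv")              # assumed file layout

fixed_only = smf.ols("distance ~ age + C(Sex)", data=orthodont).fit()
random_intercept = smf.mixedlm("distance ~ age + C(Sex)",
                               data=orthodont,
                               groups=orthodont["Subject"]).fit()
print(fixed_only.summary())
print(random_intercept.summary())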

Exercise 2: The GenLin Node


Name of the solution streams Test_scores_GenLin
Theory discussed in section Section 5.4

In this exercise, we get to know the GenLin node. Figure 5.164 shows the complete
stream of the solution.

Fig. 5.165 Partitioning of the test score data into three parts: training, testing and validation

1. The “001 Template-Stream test_scores” is the starting point for our solution of
this exercise. After opening this stream, we add a Partition node to it and select a
three part splitting of the dataset with the ratio 60 % training: 20 % testing: 20 %
validation (see Fig. 5.165).
2. Now, we add a GenLin node to the stream and select “posttest” as the target
variable, “Partition” as the partitioning variable, and all other variables as input
(see Fig. 5.166).
3. In the Model tab, we choose the option “Use partitioned data” and select
“Include intercept” (see Fig. 5.167).
4. In the Expert tab, we set the target variable as normal distribution and choose
“Power” as the link function with parameter 1 (see Fig. 5.168).
5. We add two additional GenLin nodes to the stream and connect the Partition
node to them. Afterwards, we repeat steps 2–4 for each node and set the same
parameters, except for the power parameters, which are set to 0.5 and 1.5 (see
Fig. 5.169 for how the stream should look).
6. Now, we run the stream and the three model nuggets should appear in the
canvas. We then add an Analysis node to each of these nuggets and run the

Fig. 5.166 Specification of the target, input and partitioning variables

Fig. 5.167 The model tab of the GenLin node. Enabling of partitioning use and intercept inclusion

Fig. 5.168 Expert tab in the GenLin node. Definition of the target distribution and the link
function

Fig. 5.169 Stream after adding three different GenLin models



Fig. 5.170 Output of the Analysis node for the GenLin model with power 1

Fig. 5.171 Output of the Analysis node for the GenLin model with power 0.5

stream a second time to get the validation statistics of the three models. These
can be viewed in Figs. 5.170, 5.171, and 5.172.
We see that the standard deviations are all very close to each other, both across
the models and across the dataset partitions. This suggests that there is no real
difference in the performance of the models for this data. The model with

Fig. 5.172 Output of the Analysis node for the GenLin model with power 1.5

power 1 has the smallest values of these statistics, however, and is thus the most
appropriate for describing the test score data.
7. To get insight into the model nugget and its parameters and statistics, we will
only inspect the model nugget of the GenLin power 1 model here. The other
nuggets have the same structure and tables.
First, we observe that the GenLin model nugget can also visualize the predictor
importance in a graph. This is displayed in the Model tab. Not surprisingly, the
“Pretest” variable is by far the most important variable in predicting the “posttest”
score (see Fig. 5.173).
Now, we take a deeper look at the “Advanced” tab. Here, multiple tables are
displayed, but the first four (Model Information, Case Processing Summary, Categorical
Variable Information, Continuous Variable Information) summarize the
data used in the modeling process.
The next table is the “Goodness of Fit” table; it contains a couple of measures for
validating the model using the training data. These include, among others, the
Pearson Chi-Squared value and Akaike’s Information Criterion (AICC) (see
Fig. 5.174). For these measures, the smaller the value, the better the model fit. If
we compare, for example, the AICC of the three models with each other, we get the
same picture as before with the Analysis nodes. The model with a power of
1 explains the training data better than the other two models.
The next table shows the significance level and test statistics from comparing the
model with the naive model, that is, the model consisting only of the intercept.
As displayed in Fig. 5.175, our model is significantly better than the naive model.
The second-to-last table contains the results of the significance tests of the individual
input variables. The significance level is located in the rightmost column (see

Fig. 5.173 Predictor importance graphic in the GenLin model nugget

Fig. 5.174 Goodness of fit parameter for the GenLin model

Fig. 5.175 Test of the GenLin model against the naive model

Fig. 5.176 Test of the effects of the GenLin model

school_setting, school_type, and gender variables. If we compare this with Exercise 2 in Sect. 5.3.7, where the “posttest” score is estimated by an MLR, we see that there the variables school_type and gender were omitted by the selection method (see Fig. 5.101). Thus, the results are consistent across the related models.
The last table finally summarizes the estimated coefficients of the variables, with
the significance test statistics. The coefficients are in column B and the significance
levels are in the last column of the table (see Fig. 5.177). As in the previous table,
the school_setting, school_type, and gender coefficients are not significant. This
means that a model without these input variables would be a more suitable fit for
this data. We encourage the reader to build a new model and test this.
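A comparable generalized linear model with a power link can also be sketched outside the Modeler, for example with statsmodels. The file name, the formula, and the column names below are assumptions about the test score dataset; the power exponent corresponds to the parameter set in the Expert tab of the GenLin node.

# Sketch: Gaussian GLM with a power link (exponent 1 equals the identity link)
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

scores = pd.read_csv("test_scores.csv")               # assumed file name
glm_power1 = smf.glm("posttest ~ pretest + C(school_setting) + C(school_type) + C(gender)",
                     data=scores,
                     family=sm.families.Gaussian(sm.families.links.Power(power=1.0))).fit()
print(glm_power1.summary())                           # coefficients, deviance and AIC-type measures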

Fig. 5.177 Parameter estimate and validation for the GenLin model

Exercise 3: Generalized Model with Poisson-Distributed Target Variable


Name of the solution streams Ships_genlin
Theory discussed in section Section 5.1.2
Section 5.4

Figure 5.178 shows the complete solution stream.

1. Poisson regression is typically used to model count data and assumes a Poisson-distributed target variable. Furthermore, the expected value is linked to the linear predictor term via a logarithmic function, hence

   log(mean of the target variable) = h(x_{i1}, ..., x_{ip}),

   with the target variable, here “incidents”, following a Poisson law. A Poisson distribution also describes rare events and is ideal for modeling random events that occur infrequently. We refer to Fahrmeir (2013) and Cameron and Trivedi (2013) for more information on Poisson regression.
   Since ship damage caused by waves is very rare, the assumption of a Poisson-distributed target variable is reasonable, and so Poisson regression is suitable. Plotting a histogram of the “incidents” variable also affirms this assumption (see Fig. 5.179). A small code sketch of such a model, including the offset discussed below, follows these solution steps.
2. To build a Poisson regression with the GenLin node, we first split the data into a
training set and a test set using a Partition node in the desired proportions
(Fig. 5.180).
After adding the usual Type node, we insert the GenLin node to the stream by
connecting it to the Type node. Now, we open the GenLin node and define the

Fig. 5.178 Complete stream of the exercise to fit a Poisson regression

Fig. 5.179 Histogram of the “incidents” variable, which suggests a Poisson distribution

target, Partition, and input variables in the Fields tab (see Fig. 5.181). In the
Model tab, we enable the “Use partitioned data” and “Include intercept” options,
as in Fig. 5.167.
In the Expert tab, we finally define the settings of a Poisson regression. That is,
we choose “Poisson” distribution as the target distribution and “logarithmic” as
the link function (see Fig. 5.182). Afterwards, we run the stream and the model
nugget appears.
To evaluate the goodness of our model, we add an Analysis node to the
nugget. Figure 5.183 shows the final stream of part 2 of this exercise. Now, we

Fig. 5.180 Partitioning of the ship data into a training set and a test set

Fig. 5.181 Definition of the target, partition, and input variables for Poisson regression of the
ship data

Fig. 5.182 Setting the model settings for the Poisson regression. That is, the “Poisson” distribu-
tion and “logarithmic” as link function

Fig. 5.183 Final stream of part 2 of the exercise



Fig. 5.184 Output of the Analysis node for Poisson regression on the ship data

run the stream again, and the output of the Analysis node pops up. The output can
be viewed in Fig. 5.184. We observe that the standard deviations of the training
and test data differ a lot: 5.962 for the training data, but 31.219 for the test data.
This indicates that the model describes the training data very well, but is not
appropriate for independent data. Hence, we have to modify our model, which is
done in the following parts of this exercise.
3. Often, occurrences of events are counted over different timescales, and so they appear to
happen equally often, although this is not the case. For example, in our ship data,
the variable “service” describes the aggregated months of service, which differ
for each ship. So, the discovered damage is based on a different timescale for each
ship and is thus not directly comparable. The “offset” is an additional tool that
balances this disparity in the model, so that the damage counts are compared over nearly
the same time intervals. For more information on the offset, we refer to
Hilbe (2014).
4. To update our model with an offset tool, we have to calculate the logarithm of
the “service” variable. Therefore, we insert a Derive node between the Source
and the Partition node. In this node, we set the formula “log(service)” to calculate
the offset term and name this new variable “log_month_service” (see
Fig. 5.185).
Then, we open the GenLin node and set the “log_month_service” variable as
the Offset field. This is displayed in Fig. 5.186. Now, we run the stream again,

Fig. 5.185 Calculation of the offset term in the Derive node

Fig. 5.186 Specification of the offset variable. Here, the log_month_service



Fig. 5.187 Output for the Poisson regression with offset term

and the output of the new model pops up (see Fig. 5.187). We observe that both
standard deviations do not differ much from each other. Thus, by setting an
offset term, the accuracy of the model has increased, and the model is now able
to predict ship damage from ship data not involved in the modeling process.
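The sketch announced in step 1 of this solution: a Poisson regression with log link and the logarithm of the months of service as offset, written with statsmodels. The column names (incidents, type, construction, operation, service) are assumptions about ships.csv and may need to be renamed.

# Sketch: Poisson regression of ship damage counts with an exposure offset
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

ships = pd.read_csv("ships.csv")                      # assumed file layout
ships = ships[ships["service"] > 0]                   # the offset requires a positive exposure

poisson_model = smf.glm("incidents ~ C(type) + C(construction) + C(operation)",
                        data=ships,
                        family=sm.families.Poisson(), # log link is the Poisson default
                        offset=np.log(ships["service"])).fit()
print(poisson_model.summary())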

5.5 The Auto Numeric Node

The SPSS Modeler provides us with an easy way to build multiple models in one
step, using the Auto Numeric node. The Auto Numeric node considers various
models, which use different methods and techniques, and ranks them according to a
quantification measure. This is very advantageous to data miners for two main
reasons. First, we can easily compare the different settings of a mining method,
such as the variable selection method or validation criteria, within a single stream
instead of running multiple streams with diverse settings. This helps to quickly find
the best method setup. A second reason to use the Auto numeric node is that it
comprises models from different mining approaches, such as regression models, as
described in this chapter, but also neural networks and decision trees. See
Fig. 5.188 and Kuhn and Johnson (2013) for descriptions of regression trees and
other regression modeling techniques which are also provided by the Auto numeric

Fig. 5.188 Nodes included within the Auto numeric node. The darker circles are the nodes for
regression models which are described in this chapter. The lighter circles are further regression
nodes of other models within the Auto numeric node

node. This can help find the most appropriate approach for understanding the data.
Either way, the Auto numeric node takes the best-fitting models and joins them
together into a model ensemble. That means, when predicting the target variable
value, each of the models in the ensemble processes the data and predicts the
output, and these predictions are then aggregated into one final prediction using the mean. This
aggregation of values predicted by different models has the advantage
that data points treated as outliers or given high leverage by one model type are smoothed out
by the output of the other models. Furthermore, overfitting becomes less likely.
We would like to point out that building a huge number of models is very time
consuming. Considering a large number of models in the Auto Numeric node may
therefore take a very long time to calculate, up to a couple of hours.
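The ensemble idea itself is simple and can be illustrated with a few lines of Python: several models are fitted, each predicts the target, and the predictions are averaged. The models and the generated data below are placeholders, not the nodes the Auto Numeric node actually runs.

# Illustration of mean aggregation over an ensemble of regression models
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=1)
models = [LinearRegression(),
          DecisionTreeRegressor(max_depth=4, random_state=1),
          RandomForestRegressor(n_estimators=50, random_state=1)]
predictions = np.column_stack([m.fit(X, y).predict(X) for m in models])
ensemble_prediction = predictions.mean(axis=1)        # one aggregated output per record
print(ensemble_prediction[:5])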

5.5.1 Building a Stream with the Auto Numeric Node

In the following, we show how to effectively use the Auto numeric node to build an
optimal model for our data and mining task. A further advantage of this node is its
capability of running a cross-validation within the same stream. This results in more
clearly represented streams. In particular, no additional stream for validation of the
model has to be constructed. We include this useful property in our example stream.

Description of the model


Stream name Auto numeric node
Based on dataset housing.data.txt (see Sect. 10.1.17)
Stream structure

Important additional remarks


The target variable must be continuous in order to use the Auto numeric node
Related exercises: All exercises in Sect. 5.5.3

1. First, we import the data, in this case the Boston housing data, housing.data.txt.
2. To cross-validate the models within the stream, we have to split the dataset into
two separate sets, the training data and the test data. Therefore, we add the
Partition node to the stream and partition the data appropriately: 70 % of the
data to train the models and the rest for the validation. The Partition node is
described in more detail in Sect. 2.7.7. Afterwards, we use the Type node to
assign the proper variable types.
3. Now, we add the Auto numeric node to the canvas and connect it with the Type
node, then open it with a double-click. In the “Fields” tab, we define the target
and input variables, where MEDV describes the median house price and is set as
the target variable, and all other variables, except for the partitioning field, are
chosen as inputs. The partition field is selected in the Partition drop-down menu
for the Modeler, to indicate that this field defines both the training and the test
sets (see Fig. 5.189 for details).

Fig. 5.189 Definition of target and input variables and the partition field

4. In the “Model” tab, we enable the “Use partitioned data” option (see the top
arrow in Fig. 5.190). This option will lead the model to be built based on the
training data alone.
In the “Rank models by” selection field, we can choose the score that
validates the models and compares them with each other. Possible measures
are:

– Correlation: This is the correlation coefficient between the observed target
values and the target values estimated by the model. A coefficient near 1 or
−1 indicates a strong linear relationship and that the model fits the data
potentially better than a model with a coefficient close to 0.
– Number of fields: Here, the number of variables included in the
model is considered. A model with fewer predictors might perform more
efficiently and have a smaller chance of overfitting.
– Relative error: This is a measure of how well the model predicts the data,
compared to the naive approach that just uses the mean as an estimator. In

Fig. 5.190 Model tab with the criteria that models should be included in the ensemble

particular, the relative error equals the variance of the observed values from
those predicted, divided by the variance of the observed values from
the mean.
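The relative error described in the last bullet point can be written as a one-line formula; the small sketch below states it explicitly (values near 0 indicate a model that clearly beats the naive mean estimator, values near 1 a model that does not).

# Sketch of the relative-error measure: residual variance divided by variance around the mean
import numpy as np

def relative_error(observed, predicted):
    observed, predicted = np.asarray(observed), np.asarray(predicted)
    return np.var(observed - predicted) / np.var(observed - observed.mean())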

With the “rank” selection, we can choose if the models should be ranked by
the training or the test partition, and how many models should be included in
the final ensemble. Here, we select that the ensemble should have four models
(see the bottom arrow in Fig. 5.190). At the bottom of the tab, we can define
more precise exclusion criteria, to ignore models that are unsuitable for our
purposes. If these thresholds are too strict however, we might end up with an
empty ensemble, that is, no model fulfills the criteria. If this happens, we should
loosen the criteria.
In the Model tab, we can also choose to calculate the predictor importance,
and we recommend enabling this option each time. For predictor importance,
see Sect. 5.3.3.
5. The next tab is the “Expert” tab. Here, the models that should be calculated and
compared can be specified (see Fig. 5.191). Besides the known Regression,

Fig. 5.191 Selection of the considered data mining methods

Linear and GenLin nodes, which are the classical approaches for numerical
estimation, we can also include decision trees, neural networks, and support
vector machines as candidates for the ensemble. Although
these models were originally invented for other types of data mining tasks,
such as classification analysis, which are their typical applications, they
are also capable of estimating numeric values. We omit a description of the
particular models here and refer to the corresponding Chap. 8 on classification
and IBM (2015a) and Kuhn and Johnson (2013) for further information on the
methods and algorithms.
We can also specify multiple settings for one node, in order to include more
model variations and to find the best model of one type. Here, we used two
setups for the Regression node and eight for the Linear node (see Fig. 5.191).
We demonstrate how to pick settings with the Linear node. For all other nodes,
the steps are the same, but the options are of course different for each node.

Fig. 5.192 Specification of the Linear model building settings

6. To include more models of the same type in the calculations, we click on the
“Model parameters” field next to the particular model. We choose the option
“Specify” in the opening selection bar. Another window pops up in which we
can define all model variations. For the Linear node, this looks like Fig. 5.192.
7. In the opened window, we select the “Expert” tab and, in the “Options” field
next to it, we click on each parameter that we want to change (see the arrow in
Fig. 5.192). Another, smaller window opens with the selectable
parameter values (see Fig. 5.193). We mark all parameter options that we
wish to be considered, here the “Forward stepwise” and “Best subset” model
selection methods, and confirm by clicking the “ok” button. For all other
parameters, we proceed in an analogous fashion.

Fig. 5.193 Specification of the variable selection methods to consider in the Linear node model

8. If we have set all the model parameters of our choice, we run the model, and the
model nugget should appear in the stream. For each possible combination of
selected parameter options, the Modeler now generates a statistical model and
compares it to all other build models. If it is ranked high enough, the model is
included in the ensemble.
We would again like to point out that running a huge number of models can
lead to time-consuming calculations.
9. We add an Analysis node to the model nugget to calculate the statistics of the
predicted values by the ensemble, for the training and test sets separately. We
make sure that the “Separate by partition” option is enabled, as displayed in
Fig. 5.194.

Fig. 5.194 Analysis node to deliver statistics of the predicted values. Select “Separate by
partition” to process the training and test sets separately

10. Figure 5.195 shows the output, with distribution values from the
training data and the test data. The subsequent section describes how these
statistics are calculated. To evaluate whether the ensemble model can be used to process
independent data however, we have to compare the statistics and especially the
standard deviation. Since this is not much higher for the test data than for the
training, the model passes the cross-validation test and can be used for
predictions of further unknown data.

Fig. 5.195 Analysis output with statistics from both the training and the test data

5.5.2 The Auto Numeric Model Nugget

In this short section, we take a closer look at the model nugget, generated by the
Auto numeric node, and the options it provides.

Model Tab and the Selection of Models Contributing to the Ensemble


In the Model tab, the models suggested for the ensemble by the Auto numeric node
are listed (see Fig. 5.196). In our case, we wanted the ensemble model to comprise
four models: these are two decision trees (C&R and CHAID trees), a neural
network, and a linear regression model. The models are ordered by their correlation,
as this is the rank chosen in the node options (see previous section). The order can
be manually changed in the drop-down menu on the left, labeled “Sort by”, in
Fig. 5.196.
Here, the model statistics, which are the correlation and relative error, are
calculated for the test set, and the models are thus ranked according to these values (see the right
arrow in Fig. 5.196). We can change the basis of the calculations to the training
data, on which all ranking and fitting statistics will then be based. The test set,
however, has the advantage that the performance of the models is verified on
unknown data. To check whether each model fits the data well, we recommend
opening the individual model nuggets manually and inspecting the parameter values.
Double-clicking on a model itself will allow the inspection of each model,
its included variables, and the model fitting process, with its quality measures such as

Fig. 5.196 Model tab of the Auto numeric model nugget. Specification of the models in the
ensemble, which are used to predict output

R2. This is highlighted by the left arrow in Fig. 5.196. This will open the model
nugget of each particular node in a new window, and the details of the model will be
displayed. Since each of the model nuggets is introduced and described separately
in the associated chapter, we refer to these chapters, as we omit a description of
each individual node here.
In the leftmost column, labeled “Use?”, we can choose which of the models
should process data for prediction. More precisely, each of the enabled models
takes the input data and estimates the target value individually. Then, all outputs are
averaged to one single output. This process of aggregating can prevent overfitting
and minimize the impact of outliers, which will lead to more trustworthy
predictions.

Predictor Importance and Visualization of the Accuracy of Prediction


In the “Graph” tab, the predicted values are plotted against the observations on the
left-hand side (see Fig. 5.197). There, not every single data point is displayed, but
instead bigger colored dots that describe the density of data points in the area. The
darker the color, the higher the density, as explained in the legend right next to the
scatterplot. If the points form an approximate line around the diagonal, the ensemble
describes the data properly.
In the graph on the right, the importance of the predictors is visualized in the
known way. We refer to Sect. 5.3.3 and Fig. 5.69 for explanations of the predictor
importance and the shown plot. The predictor importance of the ensemble model is
calculated on the averaged output data.

Fig. 5.197 Graph tab of the Auto numeric model nugget. Predictor importance and scatterplot of
observed and predicted values

Fig. 5.198 Settings tab of the Auto numeric model nugget. Specification of the output

Setting of Additional Output


The “Settings” tab provides us with additional output options (see Fig. 5.198). If the
first option is checked, the predictions of each individual model are removed from
the prediction output. If we want to display or further process the
non-aggregated values instead, we uncheck this option.

Furthermore, we can add the standard error to our output, estimated for each
prediction (see the second check box in Fig. 5.198).
We recommend playing with these options and previewing the output data, to
understand the differences between each created output. For more information, we
recommend consulting IBM (2015b).

5.5.3 Exercises

Exercise 1: Longley Economics Dataset


The “longley.csv” data (see Sect. 10.1.21) contains annual economic statistics from
1947 to 1962 of the United States of America, including, among other variables:
gross national product; the number of unemployed people; population size; and the
number of employed people. The task in this exercise is to find the best possible
regression model with the Auto numeric node, for predicting the number of
employed people from the other variables in the dataset. What is the best model
node, as suggested by the Auto numeric procedure? What is the highest correlation
achieved?

Exercise 2: Cross-Validation with the Auto Numeric Node


Recall Exercise 2 from Sect. 5.4.5, where we built a stream with cross-validation in
order to find the optimal Generalized Linear Model, to predict the outcome of a test
score. There, three separate GenLin nodes were used to build the different models.
Use the Auto numeric node to combine the model building and validation processes
in a single step.

5.5.4 Solutions

Exercise 1: Longley Economics Dataset


Name of the solution streams Longley_regression
Theory discussed in section Section 5.5

For this exercise, there is no definite solution, and it is just not feasible to consider
all possible model variations and parameters. This exercise serves simply as a
practice tool for the Auto numeric node and its options. One possible solution
stream can be seen in Fig. 5.199.

1. To build the “Longley_regression” stream of Fig. 5.199, we import the data and
add the usual Type node. The dataset consists of 7 variables and 16 data records,
which can be observed with the Table node. We recommend using this node in
order to inspect the data in the data file and to get an impression of the data.
2. We then add the Auto numeric node to the stream, define the variables, with the
“Employed” variable as the target, and all other variables as predictor variables
(see Fig. 5.200).

Fig. 5.199 Stream of the regression analysis of the Longley data with the Auto numeric node

Fig. 5.200 Definition of the variables for the Longley dataset



Fig. 5.201 Model build options

In this solution to the exercise, we use the default settings of the Auto numeric
node. That is, using the partitioned data for ranking the models, with the
correlation as indicator. We also want to calculate the importance of the
predictors (see Fig. 5.201). Furthermore, we include only the standard models,
which are uniformly specified by the Auto numeric node, in the evaluation and
comparison process (see Fig. 5.202). We strongly recommend also testing other
models and playing with the parameters of these models, in order to find a more
suitable model and to optimize the accuracy of the prediction.
3. After running the stream, the model nugget pops up. When opening the nugget,
we see in the “model rank” tab that the best three models and nodes are the
Regression node, the GLMM node, and the Linear model (see Fig. 5.203). There
the GenLin node tried to estimate the data relationship using a line: the default
setting. We observe that for all three models, the correlation coefficient is
extremely high, at 0.998. The rankings differ only in further decimal
places, which are not displayed. Moreover, the model stemming from the Linear
node is slimmer, insofar as fewer predictors are included in the final model, i.e.,
4 instead of 6.

Fig. 5.202 Definition of the models included in the fitting and evaluation process

As mentioned above, this is not the definite solution. We recommend experimenting with the parameters of the models, in order to increase the correlation coefficient.
4. High correlation between the predicted values and the observed values in the
Longley dataset is also affirmed by the Scatterplot in the “Graph” tab, displayed
in Fig. 5.204. The plotted points form an almost perfect straight line.
Furthermore, Fig. 5.204 visualizes the importance of the predictors. All
predictors are almost equally important. There is no input variable that outranks
the others.

Fig. 5.203 Evaluation overview of the best models selected by the Auto numeric node

Fig. 5.204 Scatterplot to visualize the model performance and the predictor importance

Exercise 2: Cross-Validation with the Auto Numeric Node


Name of the solution streams test_scores_Auto_numeric_node
Theory discussed in section Section 5.1
Section 5.1.2
Section 5.5

Fig. 5.205 Complete stream of the exercise to fit a regression into the test score data with the
Auto numeric node

Fig. 5.206 Definition of the input, target, and partitioning variables



Figure 5.205 shows the final stream of this exercise.

1. To build the above stream, we first open the “001 Template-Stream test_scores”
stream and save it under a different name. Now, add a Partition node to the
stream and split the dataset into training, test, and validation sets, as described in
Fig. 5.165.
2. To include both the model building and the validation process in a single node,
we now add the Auto numeric node to the stream and select the “posttest” field as
the target variable, the Partition field as the partition and all other variables as
inputs (see Fig. 5.206).
3. In the “Expert” tab, we specify the models that should be used in the modeling
process. We only select the Generalized Linear Model (see Fig. 5.207). Then, we

Fig. 5.207 Definition of the GLM that is included in the modeling process

Fig. 5.208 Definition of the Link function

click on the “Specify” option to specify the model parameters of the different
models that should be considered. This can be viewed in Fig. 5.207.
4. In the opened pop-up window, we go to the “Expert” tab and click on the right
“Option” field in the Link function, to define its parameters (see Fig. 5.208).
Then, we select the “Power” function as the link function, see Fig. 5.209, and
0.5, 1, and 1.5 as the exponents (see Fig. 5.210). After confirming these
selections by clicking on the “ok” button, all the final parameter selections are
shown (see Fig. 5.211). Now, three models are considered in the Auto numeric
node, each having the “Power” function as link function, but with different
exponents.

Fig. 5.209 Definition of the Power function as Link function

Fig. 5.210 Definition of the exponents of the link functions

5. After running the stream, the model nugget appears. In the “Model” tab of the
nugget, we see that the three models all fit the training data very well. The
correlations are all high, around 0.981. As in Exercise 2 in Sect. 5.4.5, the
highest ranked is the model that has exponent 1 (see Fig. 5.212). The differences
between the models are minimal, however, and thus can be ignored.
6. If we look at the ranking on the test set, the order of the models changes. Now, the
best-fitting model is model number 3, with exponent 1.5 (see Fig. 5.213). We
double-check that all the models are properly fitted, and then we choose model
3 as the final model, based on the ranking of the test data.

Fig. 5.211 Final definition of the model parameters

7. After adding an Analysis node to the model nugget, we run the stream again (see
Fig. 5.214 for the output of the Analysis node). We see once again that the model
fits the data very well (low and nearly equal RMSE values), which in particular means that the
final model performs as well on the validation set as on the testing and training sets. This
coincides with the results of Exercise 2 in Sect. 5.4.5.

Fig. 5.212 Build models ranked by correlation with the training set

Fig. 5.213 Build models ranked by correlation with the testing set

Fig. 5.214 Output of the Analysis node

Literature
Abel, A. B., & Bernanke, B. (2008). Macroeconomics (Addison-Wesley series in economics).
Boston: Pearson/Adison Wesley.
Boehm, B. W. (1981). Software engineering economics (Prentice-Hall advances in computing
science and technology series). Englewood Cliffs, NJ: Prentice-Hall.
Cameron, A. C., & Trivedi, P. K. (2013). Regression analysis of count data (Econometric society
monographs 2nd ed., Vol. 53). Cambridge: Cambridge University Press.
Fahrmeir, L. (2013). Regression: Models, methods and applications. Berlin: Springer.
Gilley, O. W., & Pace, R. (1996). On the Harrison and Rubinfeld data. Journal of Environmental
Economics and Management, 31(3), 403–405.
Harrison, D., & Rubinfeld, D. L. (1978). Hedonic housing prices and the demand for clean air.
Journal of Environmental Economics and Management, 5(1), 81–102.
Hilbe, J. M. (2014). Modeling of count data. Cambridge: Cambridge University Press.
Hyndman, R. J., & Koehler, A. B. (2006). Another look at measures of forecast accuracy.
International Journal of Forecasting, 22(4), 679–688.
IBM. (2015a). SPSS modeler 17 algorithms guide. Accessed September 18, 2015, from ftp://public.
dhe.ibm.com/software/analytics/spss/documentation/modeler/17.0/en/AlgorithmsGuide.pdf
IBM. (2015b). SPSS modeler 17 modeling nodes. Accessed September 18, 2015, from ftp://public.dhe.
ibm.com/software/analytics/spss/documentation/modeler/17.0/en/ModelerModelingNodes.pdf
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning
(Vol. 103). New York: Springer.
Kuhn, M., & Johnson, K. (2013). Applied predictive modeling. New York: Springer.

Kutner, M. H., Nachtsheim, C., Neter, J., & Li, W. (2005). Applied linear statistical models (The
McGraw-Hill/Irwin series operations and decision sciences 5th ed.). Boston: McGraw-Hill
Irwin.
Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6(2), 461–464.
Thode, H. C. (2002). Testing for normality, statistics, textbooks and monographs (Vol. 164).
New York: Marcel Dekker.
Tuffery, S. (2011). Data mining and statistics for decision making (Wiley series in computational
statistics). Chichester: Wiley.
Winer, B. J., Brown, D. R., & Michels, K. M. (1991). Statistical principles in experimental design
(McGraw-Hill series in psychology 3rd ed.). New York: McGraw-Hill.
Zhou, Z.-H. (2012). Ensemble methods: Foundations and algorithms (Chapman & Hall/CRC
machine learning & pattern recognition series). Boca Raton, FL: Taylor & Francis.
6 Factor Analysis

After finishing this chapter, the reader is able to . . .

1. Evaluate data using more complex statistical techniques such as factor analysis
2. Explain the difference between factor and cluster analysis
3. Describe the characteristics of principal component analysis and principal factor analysis
4. Apply principal component analysis in particular and explain the results

Ultimately, the reader will be called upon to propose well thought-out and
practical business actions from the statistical results.

6.1 Motivating Example

Factor analysis is used to reduce the number of variables in a dataset, identify patterns, and reveal hidden variables. There is a wide range of applications: in social science, factor analysis is used to identify hidden variables that can explain, or that are responsible for, behavioral characteristics. The approach can also be used in complex applications, such as face recognition.
Various types of factor analysis are similar in terms of calculating the final results. The steps are generally the same, but the assumptions, and therefore the interpretation of the results, are different. In this chapter, we want to present the key idea of factor analysis. Statistical terms are discussed if they are necessary for understanding the calculation and help us to interpret the results.
We use an example that represents the dietary characteristics of 200 respondents, as determined in a survey. The key idea of the dataset and some of the interpretations that can be found in this book are based on explanations from Bühl (2012). Here though, we use a completely new dataset.
Using the categories “vegetarian”, “low meat”, “fast food”, “filling”, and
“hearty”, the respondents were asked to rate the characteristics of their diet on an


Fig. 6.1 The dietary characteristics of three respondents

ordinal scale. The concrete question was “Please rate how the following dietary
characteristics describe your preferences . . .”.
As depicted in Fig. 6.1, the answer of each respondent can be visualized in a
profile chart (columns). Based on these graphs, we can find some similarities
between the variables “filling” and “hearty”.
If we analyze the profile charts row by row, we find that the variables are
somehow similar because the answers often go in the same direction. In statistical
terms, this means that the fluctuation of the corresponding answers (the same items)
is “approximately” the same.

" Factor analysis is based on the idea of analyzing and explaining


common variances in different variables.

" If the fluctuation of a set of variables is somehow similar, then behind


these variables a common “factor” can be assumed. The aim of factor
analysis is to determine these factors, to define subsets of variables
with a common proportion of variance.

" The factors are the explanation and/or reason for the original fluctua-
tion of the input variables and can be used to represent or substitute
them in further analyses.

6.2 General Theory of Factor Analysis

If we think about fluctuation in terms of the amount of information that is in the data, we can typically divide the volatility into two components:

1. The part or percentage of fluctuation that different variables have in common,
because a common but hidden variable exists in the “background” that is
responsible for this volatility.
2. The residual fluctuation that is not related to the fluctuation of the other variables
and therefore cannot be explained.

So in the example of analyzing the characteristics of respondents’ diets, we want
to extract and describe the reasons for fluctuation and to explain and understand the
habits of different types of consumers. Therefore, let us define what we want to call
a factor loading, as well as the communality.

" Factor analysis can be used to reduce the number of variables. The
algorithm determines factors that can explain the common variance
of several variable subsets. The strength of the relationship between
the factor and a variable is represented by the factor loadings.

" The variance of one variable explained by all factors is called the
communality of the variable. The communality equals the sum of the
squared factor loadings.

" The factors determined by a Principal Factor Analysis (PFA) can be


interpreted as “the reason for the common variance”. Whereas, the
factors determined by a Principal Component Analysis (PCA) can be
described as “the general description of the common variance” (see
Backhaus 2011, p. 357). PCA is used much more often than PFA.

" Both methods, PFA and PCA, can be also differentiated by their
general approaches used to find the factors: the idea of the PCA is
that the variance of each variable can be completely explained by the
factors. If there are enough factors there is no variance left. If the
number of factors equals the number of variables, the explained
variance proportion is 100 %.
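The relationship between factor loadings and communalities stated above can be illustrated with a tiny numeric example; the loading values below are made up purely for illustration.

# Communality of each variable = sum of its squared factor loadings (made-up loadings)
import numpy as np

loadings = np.array([[0.82, 0.10],    # e.g., a "filling"-type variable on two factors
                     [0.78, 0.05],    # e.g., a "hearty"-type variable
                     [-0.12, 0.74]])  # e.g., a "vegetarian"-type variable
communalities = (loadings ** 2).sum(axis=1)
print(communalities)                  # explained variance proportion per variable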

We want to demonstrate the PCA as well as the PFA in this chapter. Figure 6.2
outlines the structure of the chapter.

Fig. 6.2 Structure of the factor analysis chapter

Fig. 6.3 Using correlation or covariance matrix as the basis for factor analysis

Assessing the Quality of a Factor Analysis


In the SPSS Modeler, factor analysis can be done using a PCA/Factor node. In the
expert settings of the node that is shown in Fig. 6.3, the correlation or the covariance
matrix can be defined as the basis of the calculations. The correlation matrix is used

in the majority of applications. This is because the covariance depends on the units of the input variables.
The quality of the factor analysis heavily depends on the correlation matrix;
because of this, different measures have been developed to determine whether a matrix is
appropriate and a reliable basis for the algorithm. Here, we want to give an
overview of the different aspects that should be assessed:

Test for the Significance of Correlations


The elements of the matrix are the bivariate correlations. These correlations should
be significant. The typical test of significance should be used.

Bartlett-Test/Test of Sphericity
Based on Dziuban and Shirkey (1974), the test tries to ascertain whether the sample
comes from a population in which the input variables are uncorrelated, or in other
words, whether the correlation matrix differs only incidentally from the unit matrix.
The Bartlett test, which is based on a chi-square statistic, is however necessary but not
sufficient. Dziuban and Shirkey (1974, p. 359) stated: “That is, if one fails to reject
the independence hypothesis, the matrix need be subjected to no further analysis.
On the other hand, rejection of the independence hypothesis with the Bartlett test is
not a clear indication that the matrix is psychometrically sound”.

Inspecting the Inverse Matrix


Many authors recommend verifying that the non-diagonal elements of the inverse
matrix are nearly zero (see Backhaus 2011, p. 340).

Inspecting the Anti-image-Covariance Matrix


Based on Guttman (1953), the variance can more generally be divided into an
image and an anti-image. The anti-image represents the proportion of the variance
that cannot be explained by the other variables (see Backhaus 2011, p. 342).
For a factor analysis, the anti-image matrix should be a diagonal matrix. That
means that the non-diagonal elements should be nearly zero (see Dziuban and
Shirkey 1974). Some authors prefer values smaller than 0.09.

Measure of Sampling Adequacy (MSA): Also Called Kaiser–Meyer–Olkin Criterion (KMO)
Each element of the correlation matrix represents the bivariate correlation of two
variables. Unfortunately, this correlation can also be influenced by other variables.
For example, the correlation between price and sales volume typically depends also
on marketing expenditures. With partial correlations, one can try to eliminate such
effects of third variables and try to measure the strength of the correlation between
only two variables. The MSA/KMO compares both types of correlation based on
the elements of the anti-image-correlation matrix. For details see IBM (2011).

Based on Kaiser and Rice (1974), values should be larger than 0.5, or better still,
larger than 0.7 (see also Backhaus 2011, p. 343).
In particular, the MSA/KMO criterion is widely used and strongly recommended
in literature on the subject. Unfortunately, the SPSS Modeler does not offer this or any
of the other statistics or measures described above. The user has to be sure that the
variables and the correlation matrix are a reliable basis for a factor analysis. In the
following section, we will show possibilities for assessing the quality of the matrix.
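Because the Modeler does not report these measures, they have to be computed outside it if needed. The following sketch implements the common chi-square approximation of Bartlett’s test of sphericity from a correlation matrix R of p variables and n observations; the KMO/MSA can be obtained, for example, from third-party packages such as factor_analyzer.

# Sketch: Bartlett's test of sphericity for a correlation matrix R based on n observations
import numpy as np
from scipy.stats import chi2

def bartlett_sphericity(R, n):
    R = np.asarray(R)
    p = R.shape[0]
    statistic = -(n - 1 - (2 * p + 5) / 6.0) * np.log(np.linalg.det(R))
    dof = p * (p - 1) / 2.0
    return statistic, chi2.sf(statistic, dof)          # test statistic and p-value

# Example: a unit matrix yields a p-value of 1, i.e., no evidence of any correlation
print(bartlett_sphericity(np.eye(5), n=200))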

Number of Factors to Extract


Thinking about the aim of a factor analysis, the question at hand is how to
determine the number of factors that will best represent the input variables.
A pragmatic solution is to look at the cumulative variance explained by the
number of factors extracted. Here, the user can decide whether the proportion is
acceptable. For this purpose, the Modeler offers specific tables.
Eigenvalues can help to determine the number of components to extract too, as
illustrated in Fig. 6.4, for a dataset with just three different values. We can consider
a plane with the objects as points pinned on it. After determining this plane, we can
orientate by using just two vectors. These vectors are called eigenvectors. For each
eigenvector exists an eigenvalue. The eigenvalue represents the volatility that is in
the data in this direction. So if we order the eigenvalues in descending order, we can
stepwise extract the directions with the largest volatility. We call these the principal
components.

Fig. 6.4 Dimension reduction and eigenvectors



Normally, principal components with an eigenvalue larger than one should be


used (see Kaiser 1960, p. 146). This rule is also implemented and activated in the
SPSS Modeler, as shown in Fig. 6.3.
Based on the eigenvalues and the number of factors extracted, a so-called “scree
plot” can be created (see Cattell 1966). Other statistical software packages offer this
type of diagram. The SPSS Modeler unfortunately does not. We will show in an
example how to create it manually.
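The eigenvalue criterion and the data behind a scree plot are easy to reproduce outside the Modeler. In the sketch below, the survey answers are replaced by a random placeholder DataFrame, so the numbers are illustrative only.

# Sketch: eigenvalues of the correlation matrix, Kaiser criterion and scree-plot values
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
answers = pd.DataFrame(rng.integers(1, 4, size=(200, 5)),
                       columns=["vegetarian", "low_meat", "fast_food", "filling", "hearty"])

eigenvalues = np.sort(np.linalg.eigvalsh(answers.corr()))[::-1]   # descending order
n_components = int((eigenvalues > 1).sum())                       # Kaiser criterion
for number, value in enumerate(eigenvalues, start=1):             # component number vs. eigenvalue
    print(number, round(value, 3))
print("components with eigenvalue > 1:", n_components)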

6.3 Principal Component Analysis

6.3.1 Theory

Principal Component Analysis (PCA) is a method for determining factors that can
be used to explain common variance in several variables. As the name of the
method suggests, the PCA tries to reproduce the variance of the original variables (principal components) after determining subsets of them. As
outlined in the previous section, the identified factors or principal components can
be described as “the general description [not reason!] of the common variance” in
the case of a PCA (see Backhaus 2011, p. 357).
In this chapter, we want to extract the principal components of variables and
therefore reduce the actual number of variables. We are more interested in finding
“collective terms” (see Backhaus 2011, p. 357) or “virtual variables” than in the
causes of the dietary habits of the survey respondents. That is why we use
the PCA algorithm here.
There are several explanations of the steps in a PCA calculation. The interested
reader is referred to Smith (2002).

" PCA identifies factors that represent the strength of the relationship
between the hidden and the input variables. The squared factor
loadings equal the common variance in the variables. The factors
are ordered by their size or by the proportion of variance in the
original variables that can be explained.

" PCA can be used to . . .

1. Identify hidden factors that can be used as “collective terms” to describe (not explain) the behavior of objects, consumers, etc.
2. Identify the most important variables and reduce the number of
variables.
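A hedged sketch of such a PCA outside the Modeler, using scikit-learn: the answers are standardized and the first components then serve as the “collective terms” that replace the original variables. The DataFrame is again a random placeholder, not the real survey file.

# Sketch: PCA on standardized (placeholder) survey answers
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
answers = pd.DataFrame(rng.integers(1, 4, size=(200, 5)),
                       columns=["vegetarian", "low_meat", "fast_food", "filling", "hearty"])

pca = PCA(n_components=2)
component_scores = pca.fit_transform(StandardScaler().fit_transform(answers))
print(pca.explained_variance_ratio_)   # proportion of variance explained per component
print(pca.components_)                 # weights linking the components to the input variables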

6.3.2 Building a Model in SPSS Modeler

Description of the model


Stream name Related to the explanation in this chapter:
pca_nutrition_habits.str
Extended version with standardized values:
pca_nutrition_habits_standardized.str
Based on dataset pca_nutrition_habites.sav
Stream structure

Related exercises: all exercises in Sect. 6.3.3

The data we want to use in this section describe the answers of respondents to
questions on their dietary habits. Based on the categories “vegetarian”, “low meat”,
“fast food”, “filling”, and “hearty”, the respondents rated the characteristics of their
diets on an ordinal scale. The concrete question was “Please rate how the following
dietary characteristics describe your preferences . . .”. The scale offered the
values “1 = never”, “2 = sometimes”, and “3 = (very) often”. For details see also
Sect. 10.1.24.
We now want to create a stream to analyze the answers and to find common
factors that help us to describe the dietary characteristics. We also will explain how
PCA can help to reduce the number of variables (we call that variable clustering),
and help cluster similar respondents, in terms of their dietary characteristics.

1. We open the template stream “Template-Stream_nutrition_habits”, which is


also shown in Fig. 6.5. This stream is a good basis for adding a PCA calculation,
and interpreting the results graphically.
To become familiar with the data, we run the Table node. In the dialog
window with the table, we activate the option “Display field and value labels”,
which is marked in Fig. 6.6 with an arrow.

Fig. 6.5 Template-Stream_nutrition_habits

Fig. 6.6 Records from the dataset “nutrition_habites.sav”

As we can see, the respondents answered five questions regarding the


characteristics of their diets. The allowed answers were “never”, “sometimes”,
and “(very) often”.
Checking the details of the Filter node predefined in the stream, we realize
that here the respondent’s ID is removed from the calculations. This is because
the ID is unnecessary for any calculation (Fig. 6.7).

" There are at least three options for excluding variables from a stream.

1. In a (Statistics) File node, the variables can be disabled in the tab “Filter”.
2. The role of the variables can be defined as “None” in a Type node.
3. A Filter node can be added to the stream to exclude a variable from
usage in any nodes that follow.

Fig. 6.7 Filter node and its parameter

" The user should decide which option to use. We recommend creating
transparent streams that each user can understand, without having to
inspect each node. So we prefer Option 3.

Now we want to reduce the number of variables and give a general description of
the behavior of the respondents. Therefore, we perform a PCA. To do so we should
verify the scale types of the variables and their role in the stream. We double-click
the Type node. Figure 6.8 shows the result. The ID was excluded with a Filter node.
All the other variables with their three codes, “1 = never”, “2 = sometimes”, and
“3 = (very) often” are ordinally scaled.
As explained in detail in Sect. 4.5, we want to calculate the correlations between
the variables. That means we determine the elements of the correlation matrix.
Here, we have ordinal input variables that ask for Spearman’s rho as an appropriate
measurement (see for instance Scherbaum and Shockley 2015, p. 92). Pearson’s
correlation coefficient is an approximation, based on the assumption that the
distance between the scale items is equal.
The number of correlations that are at most weak (smaller than 0.3) should be relatively
low. Otherwise, the dataset or the correlation matrix is not appropriate for a PCA.

" The SPSS Modeler does not provide the typical measures used to
assess the quality of the correlation matrix, e.g., the inverse matrix,
the Anti-Image-Covariance Matrix, and the Measure of Sampling
Adequacy (MSA)—also called Kaiser–Meyer–Olkin criterion (KMO).

Fig. 6.8 Scale types and the roles of variables in “nutrition_habites.sav”

" After reviewing the scale types of the variables, the correlation matrix
should be inspected. The number of correlations that are at most
weak or very low (below 0.3) should be relatively small.

" It is important to realize that the Modeler determines Pearsons


correlation coefficient, which is normally appropriate for metrically
(interval or ratio) scaled variables. Assuming constant distances
between the scale items, the measure can also be used for ordinally
scaled variables, but this would only be an approximation.
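If the data are exported from the Modeler, e.g., as a flat file, both measures can be cross-checked quickly with pandas. The file name below is a hypothetical export, not one of the datasets provided with this book:

import numpy as np
import pandas as pd

df = pd.read_csv("nutrition_habits_export.csv")     # hypothetical flat-file export of the five items

pearson = df.corr(method="pearson")                 # the measure the Modeler reports
spearman = df.corr(method="spearman")               # rank-based measure suited to ordinal items

# count the off-diagonal correlations that are at most weak (absolute value below 0.3)
mask = ~np.eye(len(spearman), dtype=bool)
weak_pairs = int((spearman.where(mask).abs() < 0.3).sum().sum() / 2)

print(spearman.round(3))
print(pearson.round(3))
print(weak_pairs, "of", int(mask.sum() / 2), "correlation pairs are weaker than 0.3")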

To calculate the correlation matrix, we can use a Sim Fit node as explained in
Sect. 4.5. We add this type of node to the stream from the Output tab of the Modeler
and connect it with the Type node (Fig. 6.9).

2. Running the Sim Fit node and reviewing the results, we can find the correlation
matrix as shown in Fig. 6.10. With 200 records, the sample size is not too large.
The number of correlations that are at most weak (smaller than 0.3) is 4 out of
10, or 40 %, which is not very small. However, reviewing the correlations shows
that they can be explained logically. For example, the small correlation of
0.012 between the variables “fast_food” and “vegetarian” makes sense. We
therefore accept the matrix as a basis for a PCA.
3. As also explained in Sect. 4.5, the Sim Fit node calculates the correlations based
on an approximation of the frequency distribution. The determined approxima-
tion of the distributions can result in misleading correlation coefficients. The

Fig. 6.9 Sim Fit node is added to calculate the correlation matrix

Fig. 6.10 Correlation matrix is determined with the Sim Fit node

Sim Fit node is therefore a good tool, but the user should verify the results by
using other functions, e.g., the Statistics node.
To be sure that the results reflect the correlations correctly, we want to
calculate the Pearson correlation with a Statistics node. We add this node to
the stream and connect it with the Type node (see Fig. 6.11).
4. The parameter of the Statistics node must be modified as follows:

– We have to add all the variables in the field “Examine” by using the drop-
down list button on the right-hand side. This button is marked with an arrow

Fig. 6.11 Statistics node is added to the stream

in Fig. 6.12. Now all the statistics for the frequency distributions of the
variables can be disabled.
– Finally, all the variables must also be added to the field “Correlate”. Once
more the corresponding button on the right-hand side of the dialog window
must be used.
Now we can run the Statistics node.

5. The results shown in Fig. 6.13 can be found in the last row or the last column
of the correlation matrix in Fig. 6.10. The values of the matrix are the same.
So we can accept our interpretation of the correlation matrix and close the
window.
N.B.: We outlined that there are several statistics to verify and to ensure the
correlation matrix is appropriate for a PCA. In the download section of this book,
an R-Script for this PCA is also available. Here, the following statistics can be
determined additionally:

KMO statistic: 0.66717 (>0.5 necessary)
Bartlett significance: 4.8758e−152 (<0.001)

This shows that the matrix is indeed quite appropriate for demonstration
purposes. For more details on using R with the Modeler, see Sect. 9.
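For readers who prefer Python over R, both measures can be approximated directly from the correlation matrix. The following is a rough sketch of the standard formulas, not the R script from the download section:

import numpy as np
from scipy.stats import chi2

def bartlett_sphericity(R, n):
    # Bartlett's test of sphericity for a correlation matrix R estimated from n cases
    p = R.shape[0]
    statistic = -(n - 1 - (2 * p + 5) / 6) * np.log(np.linalg.det(R))
    return statistic, chi2.sf(statistic, p * (p - 1) / 2)

def kmo(R):
    # Kaiser-Meyer-Olkin measure based on correlations and partial correlations
    inv = np.linalg.inv(R)
    partial = -inv / np.sqrt(np.outer(np.diag(inv), np.diag(inv)))
    np.fill_diagonal(partial, 0.0)
    r_off = R - np.eye(R.shape[0])                  # off-diagonal correlations only
    return (r_off ** 2).sum() / ((r_off ** 2).sum() + (partial ** 2).sum())

# with R as the 5 x 5 correlation matrix of the diet items and 200 respondents:
# print(kmo(R)); print(bartlett_sphericity(R, n=200))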

Fig. 6.12 Parameter of the Statistics node

Fig. 6.13 Correlations determined with a Statistics node



6. Now we can apply the concept of the PCA to the given data. For this, we add a
PCA/Factor node from the SPSS Modeler tab “Modeling” and connect the node
to the Type node (see Fig. 6.14).
7. By double-clicking on the new node, we can modify the details of the calculation
procedure. As we have carefully checked the available variables, as well as their
scale type in the stream, we can use the option “Use type node settings” in the
first tab of the dialog window shown in Fig. 6.15.

Fig. 6.14 PCA/Factor node is added to the stream

Fig. 6.15 Variables used in the PCA/Factor node are defined



Fig. 6.16 Determination of the communality extraction method in the PCA/Factor node

8. We can find the most important options in the tab “Model” of the parameter
dialog (see Fig. 6.16). As decided before, we want to perform a Principal
Component Analysis (PCA) here. It computes components that extract the maximum
of the variance of the input variables. Further details related to all other algorithms offered by
the Modeler can be found in IBM (2015, p. 163). We will explain the application of
“Principal Axis Factoring” (PAF) in the next section.
9. In the third parameter tab of the PCA/Factor node, shown in Fig. 6.17, the
calculation procedure can be determined in detail. We suggest using the correlation
matrix here. Theoretically, we can also determine the results based just on the
covariance matrix. In the majority of applications, however, the correlation matrix
should be preferred, because the covariance depends on the units of the input
variables, especially when they have different dimensions or clearly different
variances (see Jackson 2003, pp. 64–65; Jolliffe 2002, p. 24).
Additionally, in the “Expert” tab we can determine how many factors to
extract. The aim of a PCA is dimension reduction. Therefore, eigenvectors are
determined to transpose the given data (see Fig. 6.4). Every eigenvector has an
eigenvalue. This eigenvalue measures the amount of variance that is in the data
in the direction of the eigenvector.
Normally, principal components with an eigenvalue larger than one should be
extracted (see Kaiser 1960, p. 146). This rule is also implemented and activated
in the SPSS Modeler. The Modeler does not offer a wide range of statistics,

Fig. 6.17 Expert settings tab of the PCA/Factor node

however, as in the case of the PCA/Factor node. For instance, a scree plot would
be helpful here but is not available. We will show how to create this plot later on
(see Fig. 6.18). Meanwhile, we recommend extracting all eigenvalues in the case
of a PCA. To do this, we set the parameter “Eigenvalues” in Fig. 6.17 to zero.
Based on the results of the PCA, we can easily determine the number of factors
to use.

" The PCA should be applied based on the correlation matrix. This is
especially useful for input variables with different units or
dimensions.

" All eigenvalues (above zero) should be extracted in the first run.
Reviewing the results, the user can determine the number of factors
to use later on. The Kaiser criterion and the scree plot shown in
Fig. 6.18 can help to determine an appropriate number of factors.

Fig. 6.18 Scree plot (not offered by the SPSS Modeler)

Fig. 6.19 Model nugget is added after running a PCA/Factor node

10. We run the PCA and we get the additional model nugget node with the
determined model and its parameters (see Fig. 6.19). Usage of the model
nugget is explained in detail in Sect. 5.
11. If we double-click on the model nugget node, the Modeler shows us the results
in the tab “Advanced” (see Fig. 6.20). The initial communalities—the amount
of common variance of the variables explained—are 1.0. That’s because here we
performed a PCA. The PCA is based on the assumption that the whole variance
can be explained by the factors (see Sect. 6.1).

Fig. 6.20 PCA non-rotated solution

In Fig. 6.20, we can find the cumulative percentage of variance explained.


The left and the right part of the table are equal because we did not rotate the
factors. The most interesting part is the column “Cumulative %”, with the
cumulated percentage of variance in the input variables, which can be
explained by the different number of factors.
The scree plot—not offered by the Modeler—shown in Fig. 6.18 visualizes
the result based on the unrotated components. The number of extracted
components versus the cumulated percentage of variance explained is drawn
here. To do this, we exported the table “Total Variance Explained” in Fig. 6.20
to Microsoft Excel and created a diagram. The file “pca_nutrition_habits_
scree_plot.xlsx” can be found in the solution folder.
We recommend extracting all components and then determining how many
of them to use. The scree plot, as a rule of thumb, tells us to use as many
components as can be found before and including the elbow. Contrary to this,
the Kaiser criterion suggests extracting all components with an eigenvalue
larger than one. In this case, there would then be two components. Figure 6.20
shows that the third component adds 5.129 % of explained variance, but with
two components, 89.167 % is already explained.

" There are many rules when determining the number of factors to
extract.

" First the user can determine the number of factors by deciding how
large the proportion of cumulative variance explained should be.

" At least all factors with an eigenvalue larger than one should be
extracted. This is called the Kaiser criterion (see, e.g., Guttman 1954).

" A Scree plot can be created, based on the unrotated solution. The plot
visualizes the number of components/factors extracted vs. the per-
centage of explained variance. A rule of thumb is to use as many
factors as eigenvalues before and including the elbow. Sometimes
the scree plot tends to encourage using too many factors (see Patil
et al. 2010). The Modeler does not provide the option of creating a
scree plot, but it can be produced easily with the results shown in the
“Advanced” tab of the factor analysis model nugget.

" Both the Kaiser criterion and the scree plot help to determine a useful
number of components to extract. The researcher should find the
best option with this information.
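Both rules can be checked mechanically once the eigenvalues are available. The values in the following sketch are illustrative only:

import numpy as np

eigenvalues = np.array([2.70, 1.80, 0.26, 0.14, 0.10])   # illustrative, sorted descending

kaiser_k = int((eigenvalues > 1).sum())                   # Kaiser criterion
cum_share = np.cumsum(eigenvalues) / eigenvalues.sum()    # cumulative proportion explained
target_k = int(np.argmax(cum_share >= 0.90) + 1)          # smallest number reaching 90 %

print(kaiser_k, target_k, np.round(cum_share, 3))

With two dominant components, both rules point to the same number here; with real data they can disagree, and the researcher has to weigh them as described above.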

An explained percentage of approximately 90 % should be enough in this


example. So we can focus on the first two components shown in the second part
of the analysis in Fig. 6.21.
We can see that the variables are clearly associated with the first two
components, but the variable “low_meat” loads to both components. That’s because
both values are larger than 0.5. This shows us that the results of a PCA cannot

Fig. 6.21 Results of the PCA model



Fig. 6.22 Variables vs. unrotated principal components

always be interpreted that easily. Besides this uncertainty, we can also draw a
diagram as shown in Fig. 6.22.

" It is important to note that the factors load to the variables and not
the other way around!

" To analyze the component matrix (see Fig. 6.21) row by row means to
determine which factor or component loads to which variable. If one
variable is associated with more than one factor (more than one value
is equal or larger 0.5), then the interpretation of the corresponding
factors is not that simple.

Finally, the components should be rotated to get a better result. Varimax is an
appropriate procedure for doing this.

12. Before we interpret the components, we want to rotate them and improve the
solution. This means that the coordinate system will be rotated. Based on
Backhaus (2011, p. 363), we can distinguish two types of rotation:

Fig. 6.23 Rotation settings in the Expert Tab of a PCA/factor node

– If the factors (not the input variables) are not correlated with each other, then
an orthogonal rotation such as Varimax can be used.
– If the rotation should be done in respect to a correlation between the factors,
then an oblique-angled rotation should be used.
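The Modeler performs the rotation internally. Purely for illustration, an orthogonal Varimax rotation of a copied loading matrix could be sketched in Python as follows; this is a common textbook implementation, not the Modeler's code:

import numpy as np

def varimax(loadings, max_iter=100, tol=1e-6):
    # iterative Varimax rotation of a p x k loading matrix (orthogonal rotation)
    L = np.asarray(loadings, dtype=float)
    p, k = L.shape
    R = np.eye(k)
    criterion = 0.0
    for _ in range(max_iter):
        Lr = L @ R
        u, s, vt = np.linalg.svd(L.T @ (Lr ** 3 - Lr @ np.diag((Lr ** 2).sum(axis=0)) / p))
        R = u @ vt
        if s.sum() < criterion * (1 + tol):       # stop when the criterion no longer improves
            break
        criterion = s.sum()
    return L @ R

# unrotated = np.array([...])   # component matrix copied manually from the model nugget
# rotated = varimax(unrotated)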

In the Modeler we once more double-click on the PCA/Factor node and


activate the “Expert” tab. Here, we should now set the number of factors for
extraction to 2. Then we click the button “Rotation . . .” as also shown in
Fig. 6.23.
13. As explained above we use here the Varimax rotation (see Fig. 6.24). That’s
because we do not expect a correlation between the components.

Fig. 6.24 Rotation settings of a PCA/factor node

Fig. 6.25 Rotated components in PCA/factor node

" It is important to define finally the number of factors to extract. As


outlined in Fig. 6.23 the Expert tab of the PCA/Factor node should be
used to do so.

14. Finally, we can run the PCA node once more. Figure 6.25 shows the result.
Here, we can see a better association of all factors with the input variables. See
also Fig. 6.26 for the association of the factors to the variables. It is important to
note that the factors load to the variables.

Interpretation of Factor Loadings


We exported the results by copying and pasting to the Microsoft Excel file,
“pca_eating_habits_index_coefficients”. The following calculations can also be
found there.
In general, a squared factor loading equals the percentage of the variance of an input
variable explained by this factor. For instance, factor 2 explains 0.960² = 92.2 % of
the variance of the vegetarian preference. Or in other words, it cannot explain 1 −
0.9216 = 7.8 %. Both factors together explain (−0.041)² + 0.960² = 92.3 % of the
variance. So the communality of the variable “vegetarian” is 92.3 %.
In Fig. 6.26, we only used the best association of a factor with a variable. The
percentage of explained variance by the selected component/factor is assigned to
the arrows. For instance, factor 2 explains 0.960² = 92.2 %, as mentioned above.

" The squared loadings represent the proportion of variance of an input


variable that can be explained by that factor. The sum of the squared
loadings of all factors equals the total variance that can be explained
and is called the communality of a variable.

The sum of all squared factor loadings per component equals the sum of the
squared values per column in Fig. 6.25. This sum equals the eigenvalues and can
be found in Fig. 6.27 in the column “Rotation sums of Squared Loadings/Total”.
For instance,

Fig. 6.26 Variables vs. rotated principal components



Fig. 6.27 PCA non-rotated solution

(−0.041)² + 0.241² + 0.961² + 0.913² + 0.913² = 2.65

The proportion of the variance explained by each factor equals its eigenvalue
divided by the number of variables. Since the variables are standardized, the total
variance equals the number of variables, so the eigenvalues must be divided by it. In this
case, factor 1 can explain
((−0.041)² + 0.241² + 0.961² + 0.913² + 0.913²)/5 = 2.65/5 = 53 %, and factor
2 can explain (0.960² + 0.924² + 0.018² + 0.121² + 0.125²)/5 = 36.1 %. All in all,
53 % + 36.1 % = 89 %. This is what we determined earlier in the PCA (see Fig. 6.20).

" The squared factor loadings equal the percentage of the variance of
the input variable that can be explained by that factor.

" The sum of all squared factor loadings equals the eigenvalues of this
factor.

" The proportion of variance explained by each factor equals the


eigenvalues divided by the number of variables.
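These relationships can be verified with the rotated loadings reported in Fig. 6.25. The rounded values below are taken from the calculations above; the short numpy check is ours, not part of the stream:

import numpy as np

L = np.array([                 # rotated loadings (component 1, component 2), rounded
    [-0.041, 0.960],           # vegetarian
    [ 0.241, 0.924],           # low_meat
    [ 0.961, 0.018],           # fast_food
    [ 0.913, 0.121],           # filling
    [ 0.913, 0.125],           # hearty
])

communalities = (L ** 2).sum(axis=1)      # explained variance per input variable
eigenvalues = (L ** 2).sum(axis=0)        # sum of squared loadings per component
explained = eigenvalues / L.shape[0]      # proportion of the total variance per component

print(np.round(communalities, 3))         # about 0.923 for "vegetarian", and so on
print(np.round(eigenvalues, 2))           # about 2.65 and 1.81
print(np.round(explained, 3))             # about 0.53 and 0.36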

Interpretation of the results of a PCA involves describing the determined factors


using the association of the input variables to them. Keeping in mind the aim of
finding “the general description [not reason!] for the common variance” in subsets
of variables, we want to describe component 1 with “heartiness of diet” and
component 2 with “nativeness/naturalness of nutrition”. Here also the description
“health-seeking diet” would be possible, but this could be perceived as a “reason”,
and we do not want to determine reasons using a PCA.

Extension of the Stream to Visualize the Results


In the PCA, we found that two components can explain approximately 90 % of the
volatility of the input variables. We want to show now a two-dimensional visuali-
zation of the components, based on the so-called “factor scores” of each record.
With the diagram, we can also demonstrate how to interpret a PCA, to identify
clusters of objects that are related (Fig. 6.28).
To show the data in detail, we add a Table node and connect it to the model
nugget node.

Fig. 6.28 Table node is added to the stream

Fig. 6.29 Results of a PCA in the Table node

In the last two columns of the table in Fig. 6.29, we can see the so-called “factor
scores” for each case. This is a linear combination of the input variables calculated
with the PCA components. So each case is expressed in terms of the determined
components.
Using these results here, the answers given by the respondents can be
represented with only a loss of 100 % − 89.16 % = 10.84 % accuracy. The formula
given by the Modeler in the “Model” tab of the model nugget is used to calculate
these factor scores (see Fig. 6.30). For respondent number 1, the following equation
is used

0.1452*1 + 0.0132*2 + 0.464*1 + 0.4102*2 + 0.4094*2 + (−1.837) = 0.147

In reference to Janssen and Laatz (2010, pp. 571–573), the factor scores can be
precisely determined for a PCA, but for all other factor analysis types, a multivari-
ate regression model is used. The SPSS Modeler manual does not provide any
detailed information here (see IBM 2015, p. 165). It is obvious though that a

Fig. 6.30 Equations for factor score calculation in the model nugget node

multivariate regression, based on the rotated factor loadings, is used in PCA cases.
See the structure of the equation in Fig. 6.30.

" The factor scores or better still the principal component scores express
the input variables in terms of the determined factors, by reducing the
amount of information represented. The loss of information depends
on the number of factors extracted and used in the formula.

" The factor scores are standardized. So they have a mean of zero and a
standard deviation of one.

" Other multivariate methods, such as cluster analysis, can be used


based on the factor scores. The reduced information and the reduced
number of variables (factors) can help the more complex algorithms
to converge or to converge faster.
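Outside the Modeler, standardized principal component scores can be reproduced from an eigen decomposition as introduced at the beginning of this chapter. The sketch below again uses simulated answer codes and ignores rotation and the Modeler's own score formula; it only illustrates the standardization property:

import numpy as np

rng = np.random.default_rng(42)
X = rng.integers(1, 4, size=(200, 5)).astype(float)       # simulated answer codes, 200 cases
R = np.corrcoef(X, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(R)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)          # standardized input variables
k = 2                                                     # number of retained components
scores = Z @ eigenvectors[:, :k] / np.sqrt(eigenvalues[:k])   # standardized component scores

print(scores.mean(axis=0).round(3))                       # close to 0 for each component
print(scores.std(axis=0, ddof=1).round(3))                # close to 1 for each component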

To visualize the factor scores, we add a Plot node from the Modeler’s Graph tab
to the stream and connect the model nugget node to this new node (see Fig. 6.31).

Fig. 6.31 Plot node is added to the stream

Fig. 6.32 Parameter of the Plot node for a simple 2D Plot

In the parameter section of the Plot node, we first want to assign factor 1 to the
x-axis and factor 2 to the y-axis (see Fig. 6.32).
Running the Plot node, we get the diagram in Fig. 6.33.

Fig. 6.33 Simple 2D Plot of the factor scores

Figure 6.34 is an extended version of Fig. 6.33. It shows three clusters and the
numbers of some records corresponding to the Table node in Fig. 6.29. It is
important to note that these records or row numbers are not representative. Due
to the discrete character of the questionnaire data, or more to the point, because of
the ordinal scale, the data points are overlapping.
This example shows that PCA can also be used as a clustering algorithm. PCA
can do more than just find “clusters of related variables”, as depicted in Fig. 6.26.
Clusters of objects can also be identified. We will discuss this in Exercise 4 of Sect.
7.4.3. This PCA-object-clustering characteristic can only be visualized in cases
with two or three extracted components, however. As we want to show in the
following chapter, a cluster algorithm can also be used to identify clusters based on
the factor scores of the records.

Fig. 6.34 Clustered cases based on the factor scores

Fig. 6.35 Second Plot node is added to the stream

To add more information to the graph, we can modify its parameters or add a
second Plot node to the stream (see Fig. 6.35).
We double-click on the new Plot node and modify the parameters as follows (see
Fig. 6.36):

Fig. 6.36 Parameters of the Plot node

– Assign Factor 1 to the x-axis.


– Assign Factor 2 to the y-axis.
– The color should depend on the answers to the item “vegetarian”.
– The size of the points should depend on the answers to the item “fast_food”.
– We want to add a slider to the graph to determine when to show the points. Here,
it should depend on the answer to the item “low_meat”.

Figure 6.37 shows the complete stream. The animated diagram produced by the
second Plot node is depicted in Fig. 6.38. We can see that . . .

– Respondents with similar dietary characteristics are “clustered” (see also


Fig. 6.34).
– As the size of the points represents the answers for the item “low_meat”, and
factor 1 represents the “heartiness of diet”, we find the respondents with an
affinity to meat on the right side.
– Component 2 stands for the “nativeness/naturalness of nutrition”. The more the
respondents prefer this type of nutrition, the larger the value for factor 2 and the
more likely that points will be found at the end of the y-axis.
– Additionally, the color represents the amount of vegetarian food preferred. The
darker the color, the more often the respondent eats vegetarian.

Fig. 6.37 Final stream

Fig. 6.38 Animated diagram with the factor scores



Fig. 6.39 Diagram with the PCA results (part 1), “low_meat = never”

Fig. 6.40 Diagram with the PCA results (part 2), “low_meat = sometimes”

Depending on the slider position, and the answer to the preference regarding
“low_meat”, we get results as depicted in Figs. 6.39, 6.40, and 6.41.
Summarizing all our findings, we can state that visualization of the factor scores
helps find clusters of respondents with similar dietary characteristics. Figure 6.42
shows three different respondents. Each of them represents one of the clusters. For
an overview of the clusters, see Fig. 6.43 and Table 6.1.

Fig. 6.41 Diagram with the PCA results (part 3), “low_meat = (very) often”

Fig. 6.42 Respondents that represent the cluster



Fig. 6.43 Simple 2D Plot with cluster numbers

Table 6.1 Description of the clusters shown in Fig. 6.43


Number of cluster related to Fig. 6.43   Description
1. A consumer that does not like vegetarian food and avoids “low meat meals”.
2. Respondents that like meat but sometimes eat vegetarian food.
3. Respondents with a preference for vegetarian food. Nevertheless, some of them eat hearty meals also, but without meat.

6.3.3 Exercises

Exercise 1: PCA Versus Cluster Algorithms


In this chapter, we explained the usage of the PCA algorithm in detail. The aim of
factor analysis is to find subsets of variables, whereas the aim of cluster analysis is
to cluster objects. Explain the difference between both approaches in your own
words.

Exercise 2: Index Construction Using PCA Results


We discussed usage of the PCA algorithm, based on the dataset “nutrition_habites.
sav”. The results are in the model nugget included in the stream “pca_nutri-
tion_habits.str” (see also Fig. 6.44).

Fig. 6.44 Variables vs. rotated principal components

If the components in the diet example represent the “heartiness” or the “nativeness/
naturalness” of the food consumed by the respondents, then an index for each
type of preference can be calculated. The indices then measure the level of
“heartiness” or “nativeness/naturalness” a respondent prefers.
The key idea behind index calculation is to use a linear combination for each subset
of the input variables identified by the PCA. The coefficients represent the weight or
the importance of each variable for the index (see also Wendler 2004, pp. 187–196).
To determine the coefficients, factor loadings can be used.
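As a hint, the calculation itself only needs elementary operations. The following Python sketch uses hypothetical loadings for one component; the concrete values have to come from your own PCA result:

import numpy as np

loadings = np.array([0.96, 0.91, 0.91])           # hypothetical loadings of the variables behind one component
unstandardized = loadings ** 2                    # variance explained per variable
weights = unstandardized / unstandardized.sum()   # standardized coefficients, they sum to one

answers = np.array([1, 2, 3])                     # one respondent's answers to the three items
index_value = float(weights @ answers)            # stays within the 1-3 range of the inputs

print(weights.round(4), round(index_value, 3))
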
Answer the following questions . . .

1. In Fig. 6.44, the percentage of explained variance (squared factor loadings) is


assigned to each input variable. This percentage can be used as a basis for the
calculation of the weight or coefficients. Calculate the coefficients. Keep in mind
that the coefficients have to be “standardized”, so that the range of index values
equals the range of the input variables and can be interpreted in the same way. N.
B. For the calculation of the factor loadings, use the result of the stream
“pca_nutrition_habits.str”. Export the correct table to a Microsoft Excel file,
by using a right mouse-click.
2. Based on the template stream “Template-Stream_nutrition_habits.str”, and the
calculation of the indices, create a new stream where the “heartiness” and
“nativeness/naturalness” indices are being calculated. Use a Derive node to
do this.

3. Merge the calculation results for both indices using a Merge node. Don’t forget
to disable duplicated variables in the options of the Merge node. After merging
the results, finally visualize the index values.
4. Interpret your findings.

Exercise 3: Calculating IT Satisfaction Indices Using PCA


In a survey, the satisfaction of IT users with their IT system was determined.
Questions were asked, relating to several characteristics, such as “How satisfied
are you with the amount of time your IT system takes to be ready to work (from
booting the system to the start of daily needed applications)?”. The users could
rate aspects using the scale “1 = poor, 3 = fair, 5 = good, 7 = excellent”. See also
Sect. 10.1.20.
You can find the results in the file “IT user satisfaction.sav”. The aim of this
exercise is to determine the components that help explain the user satisfaction.
Additionally, satisfaction indices should be calculated. As the algorithm is similar
to the solution presented in the previous exercise, we strongly recommend solving
that exercise first.
The data can be found in the file “IT_user_satisfaction.sav”. The template stream
“Template-Stream IT_user_satisfaction” helps you get access to that file.

1. Open the template stream. Analyze the data, as well as the variables.
Based on the following variables, use the PCA algorithm to determine two
components from those below, which represent more than 50 % of the volatility:
“starttime”, “system_availability”, “performance”, “training_quality”,
“user_orientation”, “data_quality”.
2. Explain your findings from the PCA in detail.
(a) Try especially to explain the components that have been determined.
(b) The variable “user_quali” represents the answer to the question “How do
you evaluate your knowledge, abilities, and accomplishments when dealing
with the IT system and the provided applications?”. Explain why this
variable could not be used in the PCA.
3. The input values of the variables named above can now be used to determine the
values of satisfaction indices. Using a separate calculation if needed, e.g., a
Microsoft Excel spreadsheet, calculate the coefficients for each of the variables.
Standardize the coefficients so that the indices have the same scale as the input
variables.
4. Using the coefficients, extend the stream to calculate and visualize the values of
the satisfaction indices.
5. Summarize your findings in the form of a “Management Summary”.

Exercise 4: PISA Study


In the dataset “pisa2012_math_q45.sav”, you can find questionnaire responses
related to an OECD program called “Program for International Student Assess-
ment” (PISA). The responses are related to the question “Thinking about mathe-
matical concepts: how familiar are you with the following terms?” The respondents

would then evaluate mathematical terms such as “exponential function” and “divi-
sor”. A detailed description can be found in Sect. 10.1.28. Based on the template
stream “Template-Stream_OECD_PISA_Questionnaire_Question_45”, a PCA
should be performed.

1. Open the template stream and save it under a new name. Analyze the variables
included, as well as their scale.
2. Assess the quality of the correlation matrix.
3. Modifying the parameter of an appropriate Modeler node, perform a PCA.
Discuss especially the number of factors to extract. To do this you should also
create a scree plot.
4. Describe the identified components.
5. Assess the quality of the PCA and the result.

6.3.4 Solutions

Exercise 1: PCA Versus Cluster Algorithms


Finding subsets of variables means, in terms of a factor analysis, that we are trying
to determine hidden factors that each load to a subset of input variables. The factors
explain some variance that the input variables have in common, but their number is
smaller than the number of input variables. The PCA in this way clusters the
variables. That is why, for further analyses, it makes sense to use the factors instead
of the original input variables.
Using this result of the PCA, the factors can be used to combine the input
variables (linear combination) so that factor scores can be calculated for each
case. They can be interpreted in terms of the factors. The determined factor scores
for each case will then represent a (hopefully large) portion of the information from
the input variables. Based on the factor scores, the researcher can now try to find
clusters of objects with the same characteristics. So the subset of variables com-
bined as factor scores can help to cluster the objects (see also Fig. 7.1).

Exercise 2: Index Construction Using PCA Results


Name of the solution stream pca_nutrition_habits_indices.str
Microsoft Excel spreadsheet for index calculation:
pca_eating_habits_index_coefficients.xlsx
Theory discussed in chapter Section 6.1
Section 6.3

1. We exported the rotated factor loadings that we determined to a Microsoft Excel


file (see Table 6.2). The calculation of the index coefficients is shown in Table 6.3.
Here, we calculated the squared factor loadings and got the unstandardized
coefficients. The standardized index coefficients equal the unstandardized
coefficients divided by the sum of all unstandardized coefficients, which were

Table 6.2 Factor loadings in a Microsoft Excel spreadsheet


PCA factor loadings determined by the Modeler
             Component 1 “heartiness of nutrition”   Component 2 “nativeness/naturalness of nutrition”
vegetarian   0.0412                                  0.9604
low_meat     0.2411                                  0.9244
fast_food    0.9608                                  0.0177
filling      0.9131                                  0.1207

Table 6.3 Calculation of the index coefficients


             Variance explained by one component                              Standardized index coefficients
             (= unstandardized index coefficients)
             Component 1              Component 2                             Component 1              Component 2
             “heartiness of           “nativeness/naturalness                 “heartiness of           “nativeness/naturalness
             nutrition”               of nutrition”                           nutrition”               of nutrition”
vegetarian                            0.9223                                                           0.5191
low_meat                              0.8545                                                           0.4809
fast_food    0.9231                                                           0.3562
filling      0.8338                                                           0.3218
hearty       0.8344                                                           0.3220
Sum          2.5913                   1.7768                                  1.0000                   1.0000

Fig. 6.45 Stream for calculating the values of the indices

assigned to the same component/index. This standardization is necessary, to


ensure that the index values are on the same scale as the input variables and
therefore can be interpreted in the same way.
2. Figure 6.45 shows the stream to calculate the indices. Figures 6.46 and 6.47
depict the formulas in the Derive nodes of the stream.

Fig. 6.46 Parameters of the Derive node for calculating the “heartiness” index

Fig. 6.47 Parameters of the Derive node for calculating the “nativeness/naturalness” index

Fig. 6.48 Parameters of the Merge node

Fig. 6.49 Final stream

3. In the Modeler, a Derive node has to be used to calculate each index. After that,
the results have to be consolidated using a Merge node. In the options for this
node, duplicated input variables must be disabled, as shown in Fig. 6.48.
Now the stream can be extended with a table, as well as a Plot node, to
visualize the result (see Fig. 6.49 and 6.50). Figure 6.51 shows the parameters of
the Plot node, and Fig. 6.52 shows the diagram with the index values.
4. This example shows how to use squared factor loadings to create indices
generally. The idea is to standardize the squared loadings and to use them as
coefficients of a linear combination of the corresponding input variables.

Fig. 6.50 Calculated index values shown in the Table node

Fig. 6.51 Parameters of the Plot node

The result helps to cluster the respondents by their preferences. Figure 6.52 is
similar to Fig. 6.43, and the algorithm to calculate the indices can be used
generally for all PCA results. For instance, we can calculate satisfaction indices
based on consumer survey results, by clustering questionnaire items. So we can
see if the satisfaction increases or decreases year by year.
But there are also some disadvantages:

Fig. 6.52 Calculated index values shown in the Plot node

– For purposes of calculation, the determined factor loadings are cumbersome


to use in the Modeler, as one has to copy them “manually”. Additionally,
some external calculations help to determine the standardized coefficients
efficiently.
– In this example, the dimensions of the coefficients are similar, and therefore
the input variables have nearly the same influence on the indices.
– Due to the ordinal character of the input variables, the indices are also
categorical.

Exercise 3: Calculating IT Satisfaction Indices Using PCA


Name of the solution stream pca_it_user_satisfaction.str
Microsoft Excel Spreadsheet for index calculation:
pca_IT_user_satisfaction_index_coefficients.xlsx
Theory discussed in chapter Section 6.1
Section 6.3

Fig. 6.53 Template-Stream IT_user_satisfaction

1. Figure 6.53 shows the template stream. The meaning of the input variables is
explained in detail in Sect. 10.1.20.
To perform the PCA, the template stream has to be extended. First of all, we
recommend using a Filter node to select the necessary variables named. Fig-
ure 6.54 shows the details. It is important to remember to disable the variables
that are not necessary for the PCA.
Additionally, we add a PCA/Factor node. Figure 6.55 shows the extended
template stream and the model nugget with the PCA. The factor scores are
shown in the Table node.
As described in the exercise, we should extract two components. Figure 6.56
shows the parameters of the PCA/Factor node. The number of components to
extract is at least 2, and the “Varimax” algorithm is used to rotate the solution.
2. Figures 6.57 and 6.58 show the most important details from the PCA results.
With the two components, we can explain 60.255 % of the variance of the used
input variables. This is not too much, but with only six input variables, a larger
number of components would be even harder to interpret.
(a) The factor loadings in Fig. 6.57 allow us to assign each variable to a
component. The start time, the system availability, as well as the perfor-
mance are more technical characteristics of an IT system. Therefore, the
component behind these variables will be called “technical satisfaction”.
Later, this will also be the name of the index.
Component 2 refers to the quality of the training offered by the firm, the
user orientation of the system, and the quality of the data that can be
accessed. We will call this component “organizational satisfaction”.

Fig. 6.54 Parameters of the Filter node

(b) The variable “user_quali” represents the answer to the question “How do
you evaluate your knowledge, abilities, and accomplishments when dealing
with the IT system and the provided applications?” So this is a user self-
assessment and doesn’t represent any aspect of satisfaction with the IT
system. We should not include this variable in any of the calculations.
3. As explained in Sect. 6.3.1, as well as in the previous exercise, the squared factor
loadings equal the explained variance of a variable by a factor. We can use these
numbers as coefficients, to cumulate the input variables into an index. The index
is a linear combination of the variables that are related to one of the components.
For straightforward calculation, we use Microsoft Excel and export the table
in Fig. 6.58 into a separate file with a right mouse-click. Several easy
calculations can be done in this spreadsheet. Standardization is necessary, to

Fig. 6.55 Template-Stream IT_user_satisfaction

Fig. 6.56 Parameters of the PCA/Factor node



Fig. 6.57 Result of PCA (part 1)

Fig. 6.58 Result of PCA (part 2)

ensure that the index has the same scale as the input variables. We divide the
squared factor loadings by their sum. For details see Fig. 6.59. The results can be
found in file “pca_IT_user_satisfaction_index_coefficients.xlsx”.

Fig. 6.59 Calculation of the index coefficients in Microsoft Excel

Fig. 6.60 Index calculation for “technical_satisfaction” in the Derive node

4. Now we can extend the stream to calculate the values of the indices. We use two
Derive nodes. In the formula, we have to define the coefficients manually, as
calculated in Fig. 6.59. Figures 6.60 and 6.61 show the parameters of the nodes.
To have a chance to assess the index values, we have to combine the results of
both parts of the stream. Here, we use a Merge node. The parameters are shown
in Fig. 6.62. Again it is important to disable the duplicates at the bottom of the
dialog window.

Fig. 6.61 Index calculation for “organisational_satisfaction” in the Derive node

To review the results, we add a Table node, a Data Audit node, and a Plot
node, as shown in Fig. 6.63. In the last two columns of the Table node in
Fig. 6.64, the index values for the first five respondents are shown. For a more
detailed analysis, we can use the Data Audit node. The statistical measures of
central tendency for the indices (see Fig. 6.65) show us there is no respondent
that answered all questions with “1 = poor”. That’s because the minimum of
both indices is larger than 1. Theoretically, we could rescale the indices, but the
frequency distributions, as well as the 2D scatterplot of the indices in Fig. 6.66,
show us their practical importance.
5. Management Summary: Technical aspects, as well as organizational aspects,
determine the satisfaction of IT users. In the analysis, we use three variables for
each of these categories. With statistical analysis (PCA), we can find two
important factors, which help us to summarize user opinions. Despite the fact
that the determined components do not represent all the information details
(at least 60 %), we can calculate values for a technical and an organizational
satisfaction index. As the survey will be repeated at adequate intervals, we can
measure satisfaction over time. Regardless of technical details, the firm can
evaluate the effect of IT expenditures based on the indices.

Fig. 6.62 Merge node parameters

Fig. 6.63 Final stream with PCA and index calculations



Fig. 6.64 Calculated satisfaction indices

Fig. 6.65 Measures of the central tendency and volatility of the indices

Fig. 6.66 2D scatterplot of the indices



Exercise 4: PISA Study


Name of the solution stream pca_pisa_question_45.str
Microsoft Excel Spreadsheet for index calculation:
pca_eating_habits_index_coefficients.xlsx
Theory discussed in chapter Section 6.1
Section 6.3

1. The details of the dataset are outlined in Sect. 10.1.28. Here, the description of
the variables and their coding are explained. Figure 6.67 shows the structure of
the Modeler’s template stream, which is the basis for performing a PCA at the
end. Verifying the settings in the Type node, we can see that all the variables are
ordinally scaled. This is correct given the scale of the answers.
2. To inspect the correlation matrix, we use a Sim Fit node. As this node
approximates the frequency distributions of the variables in the background,
we also add a Statistics node to the stream. The extended template stream is
depicted in Fig. 6.68.
The node at the bottom shows the result of the Sim Fit node. Double-clicking
on this node, the correlation matrix in Fig. 6.69 appears. The correlations
determined by the Sim Fit node are correct, as we can see in Fig. 6.70, which
shows the results of the statistics node. The correlations in Fig. 6.70 can be found
in the first row of the correlation matrix.
We discussed the output of the Statistics node in detail in Sect. 4.4. The
absolute values of the correlations can be found in Table 4.2, but here the
correlations are assessed by the Modeler, using their inverse significance. As
shown in Table 4.3, there is a good chance that the variables are correlated if the
inverse significance is larger than 0.95. Scrolling through the results of the

Fig. 6.67 Template stream “010 Template-Stream_OECD_PISA_Questionnaire_Question_45”

Fig. 6.68 Extended template stream

Fig. 6.69 Correlation matrix determined by the Sim Fit node



Fig. 6.70 Correlations determined by the Statistics node

Statistics node, we can see that all correlations (bar one) are strong. That means
that the values are reliable. The absolute values, however, tell us that the
correlation matrix is not a good basis for a PCA.
N.B.: We outlined that there are several statistical tools for verifying if the
correlation matrix is appropriate for a PCA. An R-Script for this PCA is also
available in the download section of this book. Here, the following statistics can
also be determined:

Fig. 6.71 Parameters of the PCA/factor node

KMO statistic: 0.87079 (>0.5 necessary)
Bartlett significance: 0 (<0.001)

These statistics let us assume that we can use the correlation matrix as the
basis for a quite reliable PCA.
3. We add a PCA/Factor node with the parameters shown in Fig. 6.71. Rotation of
the solution is not yet enabled. We run the PCA/Factor node and add the model
nugget to the stream (see Fig. 6.72).
Assessing the Modeler’s output in Fig. 6.73, we can see that 55.36 % of the
variance in the responses can be explained by just three factors. Figure 6.74
depicts the scree plot. Also here we can see that three factors should be extracted.
Now we can modify the parameters of the PCA/Factor node so that the three
factors are extracted and the Varimax rotation will be used. Figure 6.75 shows
the result.

Fig. 6.72 Added PCA/Factor node to the template stream

Fig. 6.73 PCA unrotated results

4. In this example, interpretation of the identified components is more difficult.


Figure 6.76 shows the identified components that help to create subsets of
variables. Additionally, the proportion of variance in the input variables that
can be explained by one of the related components is assigned to the
corresponding arrow.
Finding an appropriate description of the components is difficult in that
example. For instance, association of the “Polygone” with the “Vectors” and

Fig. 6.74 Scree plot for PISA dataset PCA

the “Complex numbers” is difficult to interpret. The results in general make


sense from a mathematical perspective, however.
5. The SPSS Modeler is not as smooth to use for PCA/Factor analysis as for other
statistical procedures. This is particularly due to problems associated with the
most important factor: assessment of the quality of the correlation matrix. Here,
we recommend the usage of other statistical programs, such as R. See also Sect. 9.
This PCA, based on a real questionnaire dataset, reveals some difficulties. For
instance, interpretation of the identified components is not that clear. Regardless,
the characteristic of the PCA algorithm for finding subsets of related variables
can help to interpret the questionnaire responses.

6.4 Principal Factor Analysis

6.4.1 Theory

The most important details of the factor analysis are explained in Sect. 6.2. Addi-
tionally, we discussed the steps of a factor analysis in Sect. 6.3.1, where we used the
Principal Component Analysis (PCA). In the end, the PCA tries to identify factors
or principal components that are used as “a general description [not reason!] for the
common variance” (see Backhaus 2011, p. 357).
In this chapter, we want to extend our knowledge by looking at Principal Factor
Analysis (PFA). We will use the dataset with responses from 200 consumers
regarding their dietary preferences. For details see Sect. 10.1.24.

Fig. 6.75 Rotated components of PISA PCA

Respondents were asked to rate the characteristics of their diet on an ordinal


scale, under the categories “vegetarian”, “low meat”, “fast food”, “filling”, and
“hearty”. The concrete question was “Please rate how the following dietary
characteristics describe your preferences . . .”.

" To understand the PFA concept, we recommend first reading the


theory of the factor analysis approach in Sect. 6.2. As the PFA and the
PCA are based on the same algorithm, one should also read Sect. 6.3.

Fig. 6.76 Components vs. input variables with the variance proportion explained

" PFA is a type of a factor analysis. It tries to identify hidden principal


components. In most applications, PCA will be preferred, because of
its more practical usefulness.

Confirmatory Versus Exploratory Factor Analysis


We are now interested in finding reasons for the dietary habits of respondents (see
Backhaus 2011, p. 357). That is why we want to use the PFA algorithm. The
starting point is a theory about subsets of variables that tend to explain the same part
of the behavior. Based on this theory, we will create a stream and perform the PFA.
If we realize that the theory is not optimal and we modify the model, we then no
longer follow a confirmatory approach. Instead, we would then use an exploratory
procedure (see also Backhaus et al. 2013, pp. 125–126).

" Using a PFA means finding reasons for the values of variables or the
behavior of respondents. PFA should only be used when there is a
theory on the dependency between variables.

" The aim of the PFA is to confirm that theory. This procedure is called a
confirmatory factor analysis.

Fig. 6.77 Estimated communalities to start the PCA algorithm

" If it turns out that the theory is inappropriate, and has to be updated,
then an exploratory factor analysis approach is performed. Statisti-
cally speaking, that means determining factor loadings that best
reproduce the empirical correlation matrix (fundamental theorem)
(see Backhaus et al. 2013, p. 125).

Building a Theory on Dependencies Between Variables


In this example, interpretation of the variables is straightforward and is also
presented in Sect. 10.1.24. We can formulate a theory on the relationships between
the variables: the variables “vegetarian” and “low_meat” represent the “healthi-
ness” of the food, whereas, the other variables “fast_food”, “filling”, and “hearty”
reflect the “price–quality ratio”.

Distinguishing PFA from PCA


The calculation procedure with PFA is the same as with the PCA. Based on the
correlation matrix, loadings will be calculated and the squared loadings will be the
communalities and equal the explained variance.
Understanding the difference between PFA and PCA means starting from the
beginning of this procedure. Figure 6.77 shows once more the percentage of
variance that should be explained by the factors in a PCA. This figure is shown
also in Fig. 6.20. Initially, a PCA algorithm should determine components that
explain 100 % of the volatility. If we use all the extracted components (here five),
we can also then indeed explain all the volatility.

The assumption with PFA is different (e.g., Tacq 1997, pp. 298–301): The
volatility of a variable can be divided into the communality and a residual variance
that cannot be explained using hidden variables. So the initial communality will
always be smaller than 1. Theoretically, we can determine the proportion of
variance that should be reproduced, based on our knowledge of the relationship.
The PFA will determine factors that can reproduce the given proportion. In fact, the
software contains an algorithm that tries to determine the common variance.
Assuming that the SPSS Modeler algorithm and the procedure implemented in
IBM SPSS Statistics are the same, the initial communalities are determined as
multiple determination coefficients of a multivariate regression, with each variable
as the target variable to reproduce.
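This estimate, the squared multiple correlation of each variable with all the others, can be computed directly from the inverse of the correlation matrix. A small Python sketch, assuming the empirical correlation matrix R is already available as a numpy array:

import numpy as np

def initial_communalities(R):
    # squared multiple correlations: each variable regressed on all remaining variables
    return 1.0 - 1.0 / np.diag(np.linalg.inv(R))

# with R as the empirical 5 x 5 correlation matrix of the diet items:
# print(initial_communalities(R).round(3))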

6.4.2 Building a Model

Description of the model


Stream name pfa_nutrition_habits.str
Based on dataset nutrition_habites.sav
Stream structure

Related exercises: all exercises in Sect. 6.4.3

Here, we also want to use data that represent the answers of respondents in relation
to their dietary habits. Based on the categories “vegetarian”, “low meat”, “fast food”,
“filling”, and “hearty”, the respondents should rate the characteristics of their diet on
an ordinal scale. The concrete question was “Please rate how the following dietary
characteristics describe your preferences . . .”. The scale offers the values
“1 = never”, “2 = sometimes”, and “3 = (very) often”. See also Sect. 10.1.24.
In Sect. 6.3.2, we explained in detail how to create a stream to perform a PCA. As
discussed above, the PFA and the PCA calculations are similar, except for the initial
communalities that are used and will be reproduced by the factors. Because the
calculations are similar, we don’t have to create a new stream here. Instead, we will use

Fig. 6.78 Initial PCA stream that should be modified

the stream “pca_nutrition_habits.str” and modify some parameters to perform a PFA


instead of a PCA. Then we will discuss the meaning of the factors extracted.

1. We open the stream “pca_nutrition_habits.str” and save it under another name.


The authors used “pfa_nutrition_habits.str” for the solution stream, which can be
found in the streams folder of this book. You should be careful and use
another name.
If you want to focus on the result without doing all the modifications, you can
open the stream “pfa_nutrition_habits.str” and go straight to the interpretation of
the results below.
Figure 6.78 shows the initial stream with the PCA calculation that should be
modified here.
We assessed the correlation matrix in Sect. 6.3.2 and saw that the matrix is
appropriate. So we do not have to assess the correlations once more here using
the Sim Fit and the Statistics node.
To perform a PFA, we double-click on the PCA/Factor node. In the “Model”
tab, we select the extraction method “Principal Axis Factoring” (PAF), which is
similar to the PFA (see IBM 2015, p. 163) (Fig. 6.79).
Figure 6.80 shows the other parameters of the PFA to perform. They are similar
to the PCA shown in Fig. 6.23. The rotation type “Varimax” must be activated.
We can start the calculation of the factor loadings using the button “Run” in the
PCA/Factor node.
Figure 6.81 shows the initial communalities, as well as the extracted ones. At the
end of all iterations done in the background, the extracted communalities are
larger than the initial values.
2. As shown in Fig. 6.82, the extracted two factors account for 82.389 % of the
variance in the input variables. This is a very good result.

Fig. 6.79 Activating the PAF/PFA in the PCA/Factor node

Fig. 6.80 Rotation settings in the Expert Tab of a PCA/factor node



Fig. 6.81 Initial and extracted communalities using a PFA

Fig. 6.82 Explained variance shown in the PCA/Factor node

3. The rotated factor loadings are shown in Fig. 6.83. In comparison with the PCA
results in Fig. 6.84, the loadings, and therefore also their communalities, are
smaller. That is because of the restriction caused by the initial communalities, at
the starting point of a PFA extraction.
4. As it does not make sense to plot the factor scores here, we should remove both
Plot nodes. Figure 6.85 shows the final stream.

Fig. 6.83 Rotated factor loadings of the PFA

Fig. 6.84 Rotated factor loadings of the PCA (not PFA!)



Fig. 6.85 Final stream for performing the PFA

Fig. 6.86 Reproduced correlations and the residuals determined using SPSS Statistics

Interpretation of Factor Loadings


Based on our theory that the variables “vegetarian” and “low_meat” represent the
“healthiness” of the food, and the variables “fast_food”, “filling”, and “hearty”
reflect the “price–quality ratio”, we were able to extract considerable
communalities.
We recall that the squared factor loadings equal the communalities, and there-
fore by definition, the percentage of variance explained. For example, the volatility
of the input variable “vegetarian” can be explained by the factor “healthiness” to
0.899² = 80.82 %.
The aim of a PFA is to best reproduce the correlations between the variables.
Unfortunately, the SPSS Modeler does not offer the matrix of these reproduced
correlations, so the user can barely assess the quality of the PFA. Figure 6.86

shows the values and the deviation from the original values, called residuals. These
results are calculated with IBM SPSS Statistics. All residuals are smaller than 0.05,
so the factors can be used to explain the original variables.
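If the rotated loadings and the empirical correlation matrix are copied out of the output, the reproduced correlations and their residuals can also be checked by hand. A sketch, assuming the loadings L (5 x 2) and the correlation matrix R (5 x 5) are available as numpy arrays:

import numpy as np

def residual_matrix(L, R):
    reproduced = L @ L.T                      # model-implied correlations for orthogonal factors
    residuals = R - reproduced
    np.fill_diagonal(residuals, 0.0)          # the diagonal is not assessed here
    return residuals

# residuals = residual_matrix(L, R)
# print(np.round(residuals, 3), (np.abs(residuals) < 0.05).all())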

6.4.3 Exercises

Exercise 1: Explaining the PFA Algorithm and Results

1. Explain the difference between the PCA and the PFA algorithm in your own
words.
2. Remember the PCA and PFA diet example shown here. Figures 6.83 and 6.84
show the determined factor loadings. Give a detailed interpretation of the
“practical” meaning of both results.
3. The right-hand side of Fig. 6.81 shows the extracted communalities. Figure 6.83
shows the rotated factor loadings.
(a) Interpret the communality 0.857 for the variable “low_meat”.
(b) Explain the calculation of the value 0.857 for the variable “low_meat”,
using the factor loadings shown in Fig. 6.83.

Exercise 2: Building a Small PFA Stream


In the file “beer.sav”, characteristics of different beers can be found (see also
Sect. 10.1.2).

1. Using the dataset, create a new stream from scratch. There is no template stream
given. Load the dataset using an appropriate node. Show the data and analyze the
variables, with regard to the possibility of using them in a PFA.
2. Now add all nodes that are necessary to perform a PFA to the stream.
3. Interpret your results.

6.4.4 Solutions

Exercise 1: Explaining the PFA Algorithm and Results


Name of the solution streams: pca_nutrition_habits.str, pfa_nutrition_habits.str
Theory discussed in chapter: Sect. 6.1, Sect. 6.3.2, Sect. 6.4.1

1. The difference between the PCA and the PFA algorithm is described in
Sects. 6.1, 6.3.2, and 6.4.1.
2. The factors determined by a Principal Factor Analysis (PFA) can be interpreted
as “the reason for the common variance”, whereas the factors determined by a

Principal Component Analysis (PCA) can be described as “a general description


for the common variance” (see Backhaus 2011, p. 357). PCA is used much more
often than PFA.
Keeping in mind the aim of the PCA (finding “a general description [not
reason!] for the common variance” of subsets of variables), we described
component 1 as “heartiness of diet” and component 2 as “providence of diet”.
Based on our theory for the PFA, we interpreted the factor behind the
variables “vegetarian” and “low_meat” as the “healthiness” of the food, and
the factor behind the variables “fast_food”, “filling”, and “hearty” as the “price–
quality-ratio”.
3. The answers are . . .
a. The value 0.857 shown in Fig. 6.81 equals the proportion of the variance of the
variable "low_meat" that can be explained by using both factors.
b. In Fig. 6.83, we can find the rotated factor loadings. The squared factor
loadings are the communalities, but here these are the communalities
separated for each factor. The calculation behind this is: 0.241^2 + 0.894^2
= 0.857. So the sum of all the squared factor loadings equals the communality
extracted by all factors.
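The same check can be carried out outside the Modeler. The following small Python
sketch is our own illustration; the two loadings are simply the values read from
Fig. 6.83:

# Communality of "low_meat" recovered from its rotated factor loadings (Fig. 6.83)
loadings = (0.241, 0.894)
communality = sum(l ** 2 for l in loadings)
print(round(communality, 3))   # 0.857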

Exercise 2: Building a Small PFA Stream


Name of the solution streams: pfa_beer_solution.str
Theory discussed in chapter: Sect. 6.1, Sect. 6.3.1, Sect. 6.4.1

1. Figure 6.87 shows the complete stream. To import the dataset, we add a
Statistics File node and a Table node. As we can see in Fig. 6.88, four variables
are defined. The variable "name" cannot be used in a PFA. The other three are
metrically scaled and therefore provide a possible basis for a PFA.
To create a transparent stream, we recommend adding a Type node. As
depicted in Fig. 6.89, the variables "price", "calories", and "alcohol" are contin-
uous. Additionally, the role for the variable "name" is "None".

Fig. 6.87 Complete stream “pfa_beer_solution.str” for performing the PFA



Fig. 6.88 Some records of “beer.sav” shown in Table node

Fig. 6.89 Type node parameter

In addition, here we can add a Filter node to entirely block the variable
“name” from usage in any other nodes that follow. This is not really necessary
as the role was already set to “None”, but in our experience, one should do
everything possible to create streams that can be easily understood by each user.
You can decide which option to use.
Here, we add a Filter node to exclude the variable “name” (see Fig. 6.90).

Fig. 6.90 Filter node parameter

Fig. 6.91 PCA/Factor node parameters (part 1)

2. To perform a PFA, we need to add a PCA/Factor node at the end of the stream. In
the options of the node, we define "Principal Axis Factoring" as the extraction
method and, in the Expert tab, choose to extract all factors with eigenvalues larger
than 0. Finally, it is important to define the rotation method. Here, we use
"Varimax" (Figs. 6.91 and 6.92).

Fig. 6.92 PCA/Factor node parameters (part 2)

3. We can identify two factors. Factor 2 loads only the variable "price", and factor
1 loads only "calories" and "alcohol". Obviously, factor 2 is not very useful,
because its factor loading is 0.410 and so the communality is only 0.168 (see
Fig. 6.93); the proportion of variance explained by factor 2 is negligible. Factor 1,
in contrast, can be expected to be useful: looking at the communalities, we can
explain a significant proportion of the variance of both input variables.
Here, we can identify a "common" proportion of variance between "calories"
and "alcohol". Knowing the aim of a PFA, we can describe factor 1 as the
"heaviness" or "strength" of a beer. This example is useful for training purposes
only, however.

Fig. 6.93 PCA/Factor analysis results

Literature
Backhaus, K. (2011). Multivariate Analysemethoden: Eine anwendungsorientierte Einführung,
Springer-Lehrbuch (13th ed.). Berlin: Springer.
Backhaus, K., Erichson, B., & Weiber, R. (2013). Fortgeschrittene multivariate
Analysemethoden: Eine anwendungsorientierte Einführung, Lehrbuch (2nd ed.). Berlin:
Springer Gabler.
Cattell, R. B. (1966). The scree test for the number of factors. Multivariate Behavioral Research,
1, 245–276.
Dziuban, C. D., & Shirkey, E. C. (1974). When is a correlation matrix appropriate for factor
analysis? Some decision rules. Psychological Bulletin, 6(81), 358–361.
Guttman, L. (1953). Image theory for the structure of quantitative variates. Psychometrika, 18(4),
277–296.
Guttman, L. (1954). Some necessary conditions for common-factor analysis. Psychometrika, 19
(2), 149–161.
IBM. (2011). Kaiser-Meyer-Olkin measure for identity correlation matrix – United States.
Accessed March 18, 2015, from http://www-01.ibm.com/support/docview.wss?uid=swg21479963
IBM. (2015). SPSS modeler 17 modeling nodes. Accessed September 18, 2015 ftp://public.dhe.
ibm.com/software/analytics/spss/documentation/modeler/17.0/en/ModelerModelingNodes.pdf
Jackson, J. E. (2003). A user’s guide to principal components. New York: Wiley.
Janssen, J., & Laatz, W. (2010). Statistische Datenanalyse mit SPSS: Eine anwendungsorientierte
Einführung in das Basissystem und das Modul Exakte Tests [Zusatzmaterial online] (7th ed.).
Berlin: Springer.

Jolliffe, I. T. (2002). Principal component analysis, Springer series in statistics (2nd ed.).
New York: Springer.
Kaiser, H. F. (1960). The application of electronic computers to factor analysis. Educational and
Psychological Measurement, 20, 141–151.
Kaiser, H. F., & Rice, J. (1974). Little Jiffy, Mark IV. Educational and Psychological Measure-
ment, 34, 111–117.
Patil, V. H., McPherson, M. Q., & Friesner, D. (2010). The use of exploratory factor analysis in
public health: A note on parallel analysis as a factor retention criterion. American Journal of
Health Promotion, 24(3), 178–181.
Scherbaum, C., & Shockley, K. M. (2015). Analysing quantitative data: For business and
management students, mastering business research methods. London: Sage.
Smith, L. (2002). A tutorial on principal components analysis. Accessed March 13, 2015, from
http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf
Tacq, J. J. A. (1997). Multivariate analysis techniques in social science research: From problem to
analysis. London: Sage.
Wendler, T. (2004). Modellierung und Bewertung von IT-Kosten: Empirische Analyse mit Hilfe
multivariater mathematischer Methoden, Wirtschaftsinformatik. Wiesbaden: Deutscher
Universitäts-Verlag.
7 Cluster Analysis

After finishing this chapter the reader is able to . . .

1. Evaluate data using more complex statistical techniques such as cluster analysis,
2. Explain the difference between several approaches to dealing with large datasets in
cluster analysis, using the TwoStep or K-Means algorithm,
3. Describe the advantages and the pitfalls of the cluster analysis methods,
4. Apply TwoStep or K-Means and explain the results, as well as
5. Describe the usage of the Auto Clustering node of the SPSS Modeler and its
pitfalls.

Ultimately, the reader will be called upon to propose well thought-out and
practical business actions from the statistical results.

7.1 Motivating Examples

Cluster analysis is a collection of many different multivariate statistical methods
that consider a more or less large dataset describing objects or persons with many
variables. A cluster analysis is used to identify groups of objects that are "similar".
As shown in Fig. 7.1, the algorithms are used to find objects, represented by the rows
in the dataset, that belong to the different clusters, whereas the PCA and PFA factor
analysis approaches try to find variables that can be described by a common factor
or component.
Examples where, in practice, cluster analysis helps to find groups are:

• Marketing and Strategy—segmentation of a department store’s customers.


In a dataset, the items bought, as well as the individual characteristics of the
customers (e.g., gender, age, time of visits, etc.), are recorded. With cluster
analysis, statisticians can identify customer subgroups. In the case of a




Fig. 7.1 Distinguishing between cluster and factor analysis

successful segmentation, the department store can forecast the possible needs of
a customer group and deliver attractive offers.
• Banking—identification of groups in nonperforming loans.
Examining a dataset with firm-specific information (type of firm, business
sector, number of employees, rated experience of owner, etc.) on nonperforming
loans, a bank can define firm categories. In the case of a new enquiry for a loan,
the bank is able to predict the risk, based on the data of the firm. This process is
also called rating.
• Medicine—finding diagnostic clusters.
Based on data, a risk evaluation with respect to the carcinogenic qualities of
certain substances can be performed.
• Education—Identifying groups of students with special needs.
Using data on socioeconomic background, combined with the performance of
the students in several courses or programs, the university staff can identify
student subgroups with special needs, such as more intensive support, additional
exercises, or advanced services, such as use of the electronic services offered on
a university-wide online platform.

In this chapter, we explain the general procedure for determining clusters of


similar objects. To do this, we discuss how to measure the similarity or the

dissimilarity of two objects. To understand the different approaches, we will use


examples or datasets, such as the following:

Car segmentation based on the price


The data represent prices of six cars in different categories. The dataset includes the
name of the manufacturer, the type of car, and a price. Using the prices of the cars,
we explain the clustering procedure and determine step-by-step segments of similar
cars, based on their price. For details of the dataset see also Sect. 10.1.4.

Customer segmentation
Customer segmentation, also called market segmentation, is one of the main fields
where cluster analysis is often used. A market is divided into subsets of customers
who have typical characteristics. The aim of this segmentation is to identify target
customers or to reduce risks.
In the banking sector, this technique is used to improve the profitability of
business and to avoid risks. If a bank can identify customer groups with lower or
higher risk of default, the bank can define better rules for money lending or credit
card offers. We apply cluster algorithms for customer segmentation purposes. See
also Sect. 10.1.7.

7.2 General Theory of Cluster Analysis

As outlined in the previous section, the term “cluster analysis” stands for a set of
different algorithms for finding subgroups in a dataset. Before we start to explain
the three algorithms available in SPSS Modeler, we want to give an overview of the
different approaches. Figure 7.2 shows their names, as well as the structure of this
section.
The big picture allows us to characterize the procedures by their advantages and
disadvantages, so that later on we can identify the correct procedure for the data
given or the problem to solve.
Considering a given dataset with two variables, as depicted in Fig. 7.3, we can
imagine different procedures for finding subgroups of objects. The variables here
are metrical and we can find the subgroups by determining the distance between the
objects pairwise. That means, determining their dissimilarity. We will show this
approach later in detail. For now, it is important to note that measuring the
similarity or the dissimilarity/distance of objects is the basis of all cluster
algorithms. For the first step, we want to focus here on procedures for assigning
the objects to the subgroups. Later we will discuss the measures in detail.
Figure 7.4 shows the big picture: how to categorize the clustering algorithms.
Hierarchical approaches can be divided into agglomerative or divisive algorithms.
Both are easy to understand. The agglomerative algorithms measure the distance
between all objects. In the next step, objects that are close are assigned to one
subgroup. In a recursive procedure, the algorithms now calculate the distances
between the more or less large subgroups and merge them stepwise by their
distance.

Fig. 7.2 Structure of Cluster Analysis chapter

Fig. 7.3 2-Dimensional representation of a clustering problem

Fig. 7.4 Overview of Clustering algorithms



The divisive algorithms assign all objects to the same cluster. This cluster is then
divided step-by-step, so that in the end homogeneous subgroups are produced.

" Cluster analysis represents a collection of multivariate statistical


methods. The aim is to identify subgroups/clusters of objects within
the data. Each given object will be assigned to a cluster, based on
similarity or dissimilarity/distance measures. As group membership
assignment is not known in advance, and so the algorithm cannot
learn the characteristics of the groups, these algorithms are called
unsupervised learning procedures.

" The scale type of the variables is crucial for the algorithm and the
result. The clusters should be highly homogeneous internally (intra-cluster
homogeneity) and clearly separated from each other (inter-cluster
separability).

The disadvantage of hierarchical clustering is that the assignment of one object


to a group cannot be changed afterwards. Additionally, the agglomerative
algorithms in particular have to deal with a huge number of different distances.
That is because we have to compare the distances between all the objects. These
procedures therefore are not normally applied to large datasets, such as those often
given for data mining. Table 7.1 shows the advantages and the disadvantages of the
different cluster algorithm categories.

Table 7.1 Characteristics of clustering algorithms by category

Hierarchical clustering (TwoStep)
Pros:
– The number of clusters must not be defined in advance.
– Clusters are defined stepwise. The order of clusters defined can be visualized in a
  tree. See explanation of dendrogram (Fig. 7.10).
– Clustering results are deterministic (non-random). For example, the result does not
  depend on the order of the objects/data records analyzed in the dataset.
– The quality of the solution can be determined by using problem-specific distance
  functions (single-linkage, complete-linkage, etc. in the case of agglomerative
  algorithms).
Cons:
– Assignment of the objects to a specific cluster can never be changed/revised.
– It is assumed that continuous variables are normally distributed. Transformation is
  necessary in advance. See Sect. 3.2.5.

Partitioning clustering (K-Means)
Pros:
– More flexible than agglomerative algorithms, based on the reassignment of the
  objects to other clusters.
– Non-normally distributed variables can be used.
Cons:
– The number of clusters must be defined.
– Initial clustering is often based on heuristic methods, but initial clusters determine
  the quality of the final solution.
– The order of cluster definition can't be visualized in a tree (called flat clustering).

Fig. 7.5 Detailed Overview of Clustering algorithms

The first step in a partitioning clustering method is to assign each object to an


initial cluster. Then a quality index for each cluster will be calculated. By
reassigning the objects to other clusters, the overall quality of the classification
should now be improved. After checking all possible cluster configurations, by
reassigning the elements to any other cluster, the algorithm will end when no
improvement to the quality index is possible.
Partitioning or divisive clustering can be divided into two classes (see Fig. 7.5).
Algorithms where only one variable assigns objects to the cluster (e.g., the
y-component in a two-dimensional diagram) are called monothetic. If more than
one variable is used (e.g., the x- and the y-component), the algorithms are called
polythetic. For details, see Murty and Devi (2011), p. 222–225.

Measuring similarities or dissimilarities/distances with proximity measures


Each of the algorithms discussed above is based on a method to measure the
similarity of each object. Based on this measure, similar objects can be assigned
to a subgroup called a cluster. In the following paragraphs, we will focus on the
question of how to measure the similarity or the dissimilarity of objects. In general,
we call such measures “proximity measures”.
To cluster objects into homogeneous subgroups, we have to compare them
pairwise. The measure used to determine the degree of similarity or dissimilarity
depends on the scale type of the variables given. Remembering the scale types
discussed in Sect. 3.1.2, we can easily see that different measures have to be used
for quantitative and qualitative variables.
For qualitative variables, we should focus on measuring similarities between the
objects. Table 7.2 shows an example of three products and five binary variables. If
the variables of two objects have the same expression of characteristics, then they
are somehow similar. In Table 7.2, these are products A and B.

Table 7.2 Similarities among products based on binary variables

            Special offer   Customer    Supported by   Offered guarantee   Recyclable
Variable    last month      satisfied   manufacturer   > 1 year            packaging
Product A   1               0           1              1                   1
Product B   1               1           0              1                   0
Product C   0               1           0              0                   1

Table 7.3 Contingency table for Tanimoto-coefficient formula

                       Object 2
                       1    0
Object 1     1         a    b
             0         c    d

Table 7.4 Contingency table for products A and B presented in Table 7.2

                       Product B
                       1    0
Product A    1         2    2
             0         1    0

Considering such binary variables, several similarity functions can be used, e.g., the
Tanimoto, simple matching, or Russel & Rao coefficients. Nonbinary qualitative
variables must be recoded into a set of binary variables. This is discussed in more
detail in exercise 2 of Sect. 7.2.1. The interested reader is referred to Timm
(2002), p. 519–522 and Backhaus (2011), p. 402.

To present the principal procedure, we want to calculate the Tanimoto coefficient
here. Its formula

s_ij = a / (a + b + c)
is based on a contingency table shown in Table 7.3.
As we can see, the Tanimoto coefficient is the proportion of the common
characteristics a among the characteristics that are present in at least one object,
represented by a + b + c. Other measures, e.g., the Russel–Rao coefficient, use
other proportions. See Timm (2002), p. 521.
Given the example of products A and B in Table 7.2, we determine the
frequencies as shown in Table 7.4. The Tanimoto coefficient is then

s_AB = 2 / (2 + 2 + 1) = 0.4

Fig. 7.6 2D-plot of objects represented by two variables

The solution can also be found in the Microsoft Excel file, “cluster dichotomous
variables example.xlsx”.
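For readers who prefer to check such values programmatically, a small Python sketch
(our own illustration, using the binary profiles of products A and B from Table 7.2)
could look as follows:

# Tanimoto coefficient for two binary profiles (products A and B of Table 7.2)
A = [1, 0, 1, 1, 1]
B = [1, 1, 0, 1, 0]

a = sum(1 for x, y in zip(A, B) if x == 1 and y == 1)   # characteristic present in both
b = sum(1 for x, y in zip(A, B) if x == 1 and y == 0)   # present only in product A
c = sum(1 for x, y in zip(A, B) if x == 0 and y == 1)   # present only in product B

s_AB = a / (a + b + c)
print(s_AB)   # 0.4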
In the case of quantitative metrical variables, the geometric distance can be used
to measure the dissimilarity of objects. The larger the distance, the less similar or
the more dissimilar they are.
Considering two prices of products, we can use the distance, e.g., given by the
absolute value of the difference. If two or three metrical variables are available to
describe the product characteristics, then we can create a diagram as shown in
Fig. 7.6 and measure the distance by using the Euclidean distance, known from
school math. This approach can also be used in n-dimensional vector space. To
have a more outlier-sensitive measure, we can improve the approach by looking at
the squared Euclidean distance. Table 7.5 gives an overview of the different
measures, depending on the scale type of the variables.

" Proximity measures are used to identify objects that belong to the
same subgroup in a cluster analysis. They can be divided into two
groups: similarity and dissimilarity measures. Nominal variables are
recoded into a set of binary variables, before similarity measures are
used. Dissimilarity measures are mostly distance-based. Different
approaches/metrics exist to measure the distance between objects
described by metrical variables.

" The SPSS Modeler offers the log-likelihood or the Euclidean distance
measures. In the case of the log-likelihood measure, the variables
have to be assumed as independent. The Euclidean distance can only
be calculated for continuous variables.


Table 7.5 Overview of proximity measures

Nominally scaled variables
Similarity measures are used to determine the similarity of objects.
Examples: Tanimoto, simple matching, and Russel & Rao coefficients.

Metrical variables (at least interval scaled)
Distance measures are used to determine the dissimilarity. See also exercise 4 in
Sect. 7.2.1.
Examples: Object x and object y are described by
(variable_1, variable_2, ..., variable_n) = (x_1, x_2, ..., x_n) and (y_1, y_2, ..., y_n).
Using the vector components x_i and y_i, the metrics are defined as follows:

Minkowski metric (L-metric)
d = ( \sum_{i=1}^{n} |x_i - y_i|^r )^{1/r}
Considering two components per vector, this is
d = ( |x_1 - y_1|^r + |x_2 - y_2|^r )^{1/r}
With a specific value of r, this becomes:

City-block metric (L1-metric with r = 1)
d = \sum_{i=1}^{n} |x_i - y_i|

Euclidean distance (L2-metric with r = 2), see also Fig. 7.6
d = \sqrt{ \sum_{i=1}^{n} (x_i - y_i)^2 }
This measure can only be used if all variables are continuous.

Squared Euclidean distance
d = \sum_{i=1}^{n} (x_i - y_i)^2
This measure can only be used if all variables are continuous.

Log-likelihood distance measure
As explained in Sect. 4.7, a chi-squared distribution can be used in contingency
tables to determine a probability, if the observed frequency in a contingency table
lets us assume that the data comes from a specific distribution. The log-likelihood
distance measure is a probability-based distance. The decrease in the likelihood, by
combining two objects/clusters, is a measure of the distance between them. See
IBM (2015a), p. 398–399 for details. The method assumes that all variables are
independent, but it can deal with metrical and categorical variables at the
same time.
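As a small illustration of the distance measures above (our own sketch, not part of the
book's streams), the Minkowski family can be implemented in a few lines of Python;
r = 1 gives the city-block metric and r = 2 the Euclidean distance. The two points used
here are purely illustrative:

# Minkowski metric for two n-dimensional points; r = 1 -> city block, r = 2 -> Euclidean
def minkowski(x, y, r):
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1 / r)

def squared_euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

x, y = (1, 2), (4, 6)                 # two illustrative points
print(minkowski(x, y, 1))             # 7.0  (city-block distance)
print(minkowski(x, y, 2))             # 5.0  (Euclidean distance)
print(squared_euclidean(x, y))        # 25   (squared Euclidean distance)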

The SPSS Modeler implements two clustering methods in the classical sense:
TwoStep and K-Means. Additionally, the Kohonen algorithm, as a specific neural
network type, can be used for classification purposes. The Auto Cluster node
summarizes all these methods and helps the user to find an optimal solution.
Table 7.6 includes an explanation of these methods.
To understand the advantages and disadvantages of the methods mentioned, it is
helpful to understand, in general, the steps involved in clustering algorithms.
Therefore, we will explain the theory of cluster algorithms in the following section.

Table 7.6 Clustering methods implemented in the SPSS Modeler

TwoStep: The algorithm handles datasets by using a tree in the background. Based on a
comparison of each object with the previous objects inspected, TwoStep, in the first
step, assigns each object to an initial cluster. The objects are organized in the form of
a tree. If this tree exceeds a specific size, it will be reorganized. Due to the
step-by-step analysis of each object, the result depends on the order of the records in
the dataset. Reordering may lead to other results. See IBM (2015b), p. 201. Next,
hierarchical clustering is used to merge the predefined clusters stepwise. We will
explain and use the TwoStep algorithm in Sect. 7.3. The technical documentation can be
found in IBM (2015b), p. 201–207 and IBM (2015a), p. 397–401.

K-Means: The theory of clustering algorithms shows that comparing all objects
one-by-one with all other objects is time-consuming and especially hard to handle with
large datasets. As a first step, K-Means determines cluster centers within the data.
Then each object is assigned to the cluster center with the smallest distance. The
cluster centers are recalculated and the clusters are optimized, by rearranging some
objects. The process ends if an iteration does not improve the quality, e.g., no object
is assigned to another cluster. For more details and applications, see Sect. 7.4, as
well as IBM (2015b), p. 199–200.

Kohonen: This is an implementation of a neural network. Based on training data, a
self-organizing map is created. This map is used to identify similar objects. New
objects presented to the network are compared with the learned pattern. The new object
will be assigned to a class where other objects are most similar. So an automated
classification is implemented. Details of this algorithm are presented in Sect. 7.5.1.
An application can be found in exercise 2 of Sect. 7.5.3. See also IBM (2015b),
p. 196–199.

Auto Cluster: The Auto Cluster node uses TwoStep, K-Means, and the Kohonen algorithm.
Here, the SPSS Modeler tries to determine models that can be used for clustering
purposes. The theory and application of this node is explained in Sect. 7.5. Further
details can also be found in IBM (2015b), p. 67–69.
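To make the K-Means idea from Table 7.6 more concrete, here is a deliberately minimal
Python sketch of the assign/recompute loop for one-dimensional data. This is our own
illustration only; the Modeler's K-Means node uses more refined seeding and stopping
rules:

# Minimal K-Means loop for 1-D data: assign objects to the closest center,
# then recompute each center as the mean of its cluster, and repeat.
import random

def kmeans_1d(values, k, iterations=20):
    centers = random.sample(values, k)                  # initial cluster centers
    clusters = []
    for _ in range(iterations):
        # assignment step: each object goes to the closest center
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: (v - centers[i]) ** 2)
            clusters[nearest].append(v)
        # update step: recompute each center as the mean of its cluster
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters

print(kmeans_1d([13, 19, 27.5, 28, 39, 44], k=2))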

After that, we will come back to TwoStep and K-Means. Based on the theory, we
can understand the difficulties in dealing with clustering algorithms and how to
choose the most appropriate approach for a given dataset.

7.2.1 Exercises

Exercise 1 Recap cluster analysis


Please answer the following questions:

1. Explain the difference between hierarchical and partitioning clustering methods.


For each method, name one advantage and one disadvantage.
2. Consider a given dataset with six objects, as shown in Fig. 7.7.
(a) How many variables are defined per object?
(b) Show the difference between a divisive and an agglomerative approach.

Fig. 7.7 Given dataset with six objects

Exercise 2 Similarity measures in cluster analysis

1. Define in your own words what is meant by proximity, similarity, and distance
measure.
2. Name one example for a measure of similarity as well as one measure for
determining dissimilarity/distances.
3. Explain why we can’t use distance measures for qualitative variables.
4. Similarity measures can typically only deal with binary variables. Consider a
nominal or ordinal variable with more than two values. Explain a procedure to
recode the variable into a set of binary variables.
5. In the theoretical explanation, we discussed how to deal with binary and with
quantitative/metrical variables in cluster analysis. For binary variables, we use
similarity measures as outlined in this section by using the Tanimoto coefficient.
In the case of metrical variables, we also have a wide range of measures to choose
from. See Table 7.5. Consider a dataset with binary and metrical variables that
should be the basis of a cluster analysis. Outline at least two possibilities of how to
use all of these variables in a cluster analysis. Illustrate your approach using an
example.

Exercise 3 Calculating Tanimoto Similarity measure


Table 7.2 shows several variables and their given values for three products. In the
theory, we explained how to use the Tanimoto similarity measure. In particular, we
calculated the similarity coefficient for products A vs. B.
Please calculate the Tanimoto similarity coefficient for products A vs. C and
additionally for products B vs. C.

Exercise 4 Distance measures in cluster analysis

1. Figure 7.6 illustrates two objects represented by two variables. Consider a firm
with employees P1 and P2. Table 7.7 shows their characteristics, depending on
employment with a company and their monthly net income. Using the formulas
in Table 7.5, calculate the City-block metric (L1-metric), the Euclidean distance
(L2-metric), and the squared Euclidean distance.
2. In the theoretical part, we said that the squared Euclidean distance is more
outlier-sensitive than the Euclidean distance itself. Explain!

Table 7.7 Employee dataset

Object   Employment with a company x_1 [months]   Monthly net income x_2 [USD]
P1       10                                        2400
P2       25                                        3250

Fig. 7.8 Comparison between agglomerative and divisive clustering procedures

7.2.2 Solutions

Exercise 1 Recap cluster analysis

1. See Sect. 7.1.


2. For each object, two variables are given. We know this because in Fig. 7.7, we
can see a 2-dimensional plot with one variable on each axis. For a segmentation
result, using an agglomerative as well as a divisive method, see Fig. 7.8.
Here, we will get three clusters, however, because the "star" object has a
relatively large distance from all the other objects.

Exercise 2 Similarity measures in cluster analysis

1. A similarity measure helps to quantify the similarity of two objects. Such measures
are often used for qualitative (nominal or ordinal) variables. A distance measure
calculates the geometrical distance between two objects and can therefore be
interpreted as a dissimilarity measure. Similarity and distance measures together
are called proximity measures.

Table 7.8 Comparison of the characteristics of qualitative variables

Nominal scale
Example: nonperforming loan (yes or no)
Comment: Nominally scaled variables have no inherent order that can be used to order
the values ascending or descending. Distances can never be measured.

Ordinal scale
Example: categories of hotels (2-star, 3-star, ...); satisfaction of customers with a
product (very satisfied, satisfied, ...)
Comment: Ordinally scaled variables have a naturally defined inherent order. In the
case of the mentioned hotel categories, it is clear that there is a distance between a
2-star and 3-star hotel, but we cannot specify further details. Sometimes we can find a
scale representation 1, 2, ..., 5 (often called Likert-type scale). Nevertheless the
distances should be interpreted with care, or not at all. This fact has created
controversy in literature. See Vogt et al. (2014), p. 34. The researcher can't be sure
that the distance between 1 and 2 equals the distance between 2 and 3 etc.

2. See Table 7.5 for examples.


3. As explained in Sect. 3.1.1 and 3.1.2, nominally and ordinally scaled variables
can be called qualitative. The characteristics of both scales are that the values
can’t be ordered (nominal scale) and that we can’t calculate a distance between
the values (nominal and ordinal scale). For an example, as well as further
theoretical details, see Table 7.8. Due to the missing interpretation of distances,
we can’t use distance measures such as the Euclidean distance.
4. Considering an ordinal variable (e.g., hotel categories 2-star, . . ., 5-star),
Table 7.9 shows how to recode it into a set of binary variables. Given n categories,
we need n - 1 binary variables.
5. Given a binary variable, we can use, e.g., the Tanimoto coefficient. Unfortu-
nately, this measure can’t be used for metrical values, e.g., the price of a product.
Now we have two options:
(a) Recode the metrical variable into a dichotomous variable:
By defining a threshold, e.g., the median of all prices, we can define a new
binary variable: 0 ¼ below median and 1 ¼ above median. This variable can
then be used in the context of the other variables and the similarity measure.
A disadvantage is the substantial reduction of information in this process.
(b) Recode the metrical variable into an ordinal variable:
In comparison with the approach using only a binary variable, we can
reduce the loss of precision by defining more than one threshold. We get
more than two price classes (1, 2, 3, . . .) for the products, which represent
an interval or ordinally scaled variable. This variable can’t be used with the
similarity measure either. As described in the answer to the previous
question, we have to recode the variable into a set of binary variables.

Table 7.9 Scheme for recoding a nominal or ordinal variable into a set of binary variables

Category   Binary variable 1   Binary variable 2   Binary variable 3
2-star     0                   0                   0
3-star     1                   0                   0
4-star     1                   1                   0
5-star     1                   1                   1
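A possible way to automate this recoding in Python (our own sketch; the category labels
follow Table 7.9, and the function name is of course only illustrative):

# Recode an ordinal hotel category into a set of binary variables (scheme of Table 7.9)
categories = ["2-star", "3-star", "4-star", "5-star"]

def recode(value):
    level = categories.index(value)                         # 0, 1, 2 or 3
    return [1 if level > i else 0 for i in range(len(categories) - 1)]

print(recode("2-star"))   # [0, 0, 0]
print(recode("4-star"))   # [1, 1, 0]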

Exercise 3 Calculating the Tanimoto Similarity measure


To calculate the Tanimoto similarity coefficient for products A vs. C, as well as for
products B vs. C, the contingency tables must be determined. Then the formula can
be used.
The solution can also be found in the Microsoft Excel file “cluster dichotomous
variables example.xlsx”.

                       Product C
                       1    0
Product A    1         1    3
             0         1    0

The Tanimoto coefficient is

s_AC = 1 / (1 + 3 + 1) = 0.2

                       Product C
                       1    0
Product B    1         1    2
             0         1    1

The Tanimoto coefficient is

s_BC = 1 / (1 + 2 + 1) = 0.25

Exercise 4 Distance measures in cluster analysis

1. Table 7.10 shows the solution.

Table 7.10 Distances based on different metrics

City-block metric (L1-metric), d = \sum_{i=1}^{n} |x_i - y_i|:
d = |10 - 25| + |2400 - 3250| = 865

Euclidean distance (L2-metric), d = \sqrt{ \sum_{i=1}^{n} (x_i - y_i)^2 }:
d = \sqrt{ (10 - 25)^2 + (2400 - 3250)^2 } = 850.13

Squared Euclidean distance, d = \sum_{i=1}^{n} (x_i - y_i)^2:
d = (10 - 25)^2 + (2400 - 3250)^2 = 722725

2. The Euclidean distance (L2-metric) can be calculated by

d = \sqrt{ \sum_{i=1}^{n} (x_i - y_i)^2 },

whereas the squared Euclidean distance follows from

d = \sum_{i=1}^{n} (x_i - y_i)^2

Considering the 1-dimensional case of two given values 1 and 10, we can see
that the Euclidean distance is 9 and the squared Euclidean distance is 81. The
reason for the outlier-sensitivity, however, has nothing to do with getting a larger
number! If we have two objects with a larger distance, we have to square the
difference of their components, but in the case of the Euclidean distance, we
reduce this effect by calculating the square root at the end. See also the explanation
of the standard deviation. There, we can find the same approach for defining an
outlier-sensitive measure for volatility in statistics.

7.3 TwoStep Hierarchical Agglomerative Clustering

7.3.1 Theory of Hierarchical Clustering

Example vs. Modeler functionalities


To understand cluster analysis, we think it would be helpful to discuss a simple
example. We want to show the several steps for finding a cluster of objects that are
“similar”. Going through each of the steps, parameters and challenges can be
identified. In the end, it will be easier to understand the results of the clustering
algorithms that the SPSS Modeler provides.
We recommend studying the details presented in the following paragraphs, but if
more theoretical information is not of interest, the reader can also proceed onto
Sect. 7.3.2.

Clustering sample data


Remark: the data, as well as the distance measures and matrices explained in this
section, can also be found in the Microsoft Excel spreadsheet
"Car_Simple_Clustering_distance_matrices.xlsx".

Table 7.11 Prices of different cars

ID   Manufacturer    Model            Dealer              (Possible) price in 1000 USD
1    Nissan          Versa            ABC motors          13
2    Kia             Soul             California motors   19
3    Ford            F-150            Johns test garage   27.5
4    Chevrolet       Silverado 1500   Welcome cars        28
5    BMW             3 series         Four wheels fine    39
6    Mercedes-Benz   C-Class          Best cars ever      44

So far, we have discussed the aim and general principle of cluster analysis
algorithms. The SPSS Modeler offers different clustering methods. These methods
are certainly advanced; however, to understand the advantages and the challenges
of using clustering algorithms, we want to explain the idea of hierarchical
clustering here in more detail. The explanation is based on an idea presented in Handl
(2010), p. 364–383, but we will use our own dataset.
The data represent prices of six cars in different categories. The dataset includes
the name of the manufacturer, the type of car, and a price. Formally, we should
declare that the prices are not representative for the models and types mentioned.
Table 7.11 shows the values.
The only variable that can be used for clustering purposes here is the price. The
car ID is nominally scaled, and we could use a similarity measure (e.g., the Tanimoto
coefficient) for such variables; but remember that the IDs are arbitrarily assigned to
the cars.
Based on the data given in this example, we can calculate the distance between
the objects. We discussed distance measures for metrical variables in the previous
section. The Euclidean distance between cars 1 and 2 is

d_Euclidean(1; 2) = \sqrt{ (13 - 19)^2 } = 6

Due to increased outlier-sensitivity (see exercise 4, question 2 in the previous
section), a lot of algorithms use the squared Euclidean distance, which in this case is

d_squared Euclidean(1; 2) = (13 - 19)^2 = 36

The K-Means algorithm of the SPSS Modeler is based on this measure. See IBM
(2015a), p. 229–230. We will discuss this procedure later. For now, we want to
explain with an example how a hierarchical cluster algorithm works in general.
The following steps are necessary:

1. Calculating the similarities/distances (also called proximity measures) of the


given objects.
2. Arranging the measures in a similarity/distance matrix.
3. Determining the objects/clusters that are “most similar”.
4. Assigning the identified objects to a cluster.
5. Calculating the similarities/distances between the new cluster and all other
objects. Updating the similarity/distance matrix.
6. If not all objects are assigned to a cluster, go to step 3, otherwise stop.

Table 7.12 Overview of initial clusters and the objects assigned

Cluster description   Objects/cars that belong to the cluster
Cluster 1             1
Cluster 2             2
Cluster 3             3
Cluster 4             4
Cluster 5             5
Cluster 6             6

Fig. 7.9 Visualization of the given data
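Before walking through the iterations by hand, the complete loop can be sketched in a
few lines of Python (our own illustration, not part of the book's streams). It uses the
car prices from Table 7.11, the squared Euclidean distance, and the smallest pairwise
distance between clusters (the "single-linkage" idea explained later in this section),
and it prints the same merge order and distances that we derive step-by-step below:

# Agglomerative single-linkage clustering of the car prices (squared Euclidean distance)
prices = {1: 13, 2: 19, 3: 27.5, 4: 28, 5: 39, 6: 44}
clusters = [frozenset([i]) for i in prices]        # initially, each car is its own cluster

def dist(c1, c2):
    # single linkage: smallest pairwise distance between the members of both clusters
    return min((prices[a] - prices[b]) ** 2 for a in c1 for b in c2)

while len(clusters) > 1:
    # steps 3 and 4: find the two closest clusters and merge them
    c1, c2 = min(
        ((a, b) for i, a in enumerate(clusters) for b in clusters[i + 1:]),
        key=lambda pair: dist(*pair),
    )
    print(f"merge {sorted(c1)} and {sorted(c2)} at distance {dist(c1, c2)}")
    clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]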

Initially, the objects are not assigned to a specific cluster, but mathematically we
can say that each of them forms a separate cluster. Table 7.12 shows the initial status
of the clustering.

Step 1: Calculating the similarities/distances


In the example of the car prices used here, the distance measure is quite simple to
determine because we have just one dimension. Figure 7.9 visualizes the original
data given.
The squared Euclidean distance for each pair of objects can be calculated easily.
Although we have six cars here, we do not have to calculate 36 distances: the
distance between each object and itself is zero and the distances are symmetrical,
e.g.,

d(1; 2) = d(2; 1)

For the distances we get:

d(1; 2) = (13 - 19)^2 = 36
d(1; 3) = (13 - 27.5)^2 = 210.25
d(1; 4) = (13 - 28)^2 = 225
d(1; 5) = (13 - 39)^2 = 676
d(1; 6) = (13 - 44)^2 = 961
d(2; 3) = (19 - 27.5)^2 = 72.25
d(2; 4) = (19 - 28)^2 = 81
...
d(6; 3) = (44 - 27.5)^2 = 272.25
d(6; 4) = (44 - 28)^2 = 256
d(6; 5) = (44 - 39)^2 = 25

Step 2: Arranging the measures in a similarity/distance matrix.


The rows of the distance matrix represent the object at which the distance
measurement starts; the columns stand for the end point of the measurement.
Using the notation d(1; 2) = 36, the distance is assigned to the cell in the first
row, second column of the distance matrix. Table 7.13 shows the final matrix. All
matrices discussed in this section can also be found in the Microsoft Excel spread-
sheet "Car_Simple_Clustering_distance_matrices.xlsx".

Start of Iteration 1
We can see in the steps of the algorithm that more than one iteration will be
necessary, most of the time. In the following paragraphs, we extend the description
of steps 3 to 6 with the number of the iteration.

Iteration 1, Step 3: Determining the objects/clusters that are “most similar”


To keep this example simple, we want to call the objects that have the smallest
distance the "most similar". We will find out later that this is not always the
best definition of "similarity", but for now it is an appropriate approach.
The minimum distance is d(3; 4) = 0.25.

Table 7.13 Initial distance matrix of the car data

               To
          1     2      3        4      5        6
From  1   0     36     210.25   225    676      961
      2         0      72.25    81     400      625
      3                0        0.25   132.25   272.25
      4                         0      121      256
      5                                0        25
      6                                         0

Table 7.14 Overview of clusters and the objects assigned after iteration 1

Cluster description   Objects/cars that belong to the cluster
Cluster 1             1
Cluster 2             2
Cluster 3 new         3; 4
Cluster 5             5
Cluster 6             6

Table 7.15 Distance matrix cluster step 1

                To
           1     2      3; 4     5      6
From  1    0     36     210.25   676    961
      2          0      72.25    400    625
      3; 4              0        121    256
      5                          0      25
      6                                 0

Iteration 1, Step 4: Assigning the identified objects to a cluster


For that reason, the cars with ID 3 and 4 are assigned to one cluster. We call it
“cluster 3 new”. Table 7.14 gives an overview.

Iteration 1, Step 5: Calculating the similarities/distances of the new cluster,


Update of the similarity/distance matrix
In the cluster, we now find cars 3 and 4. The distances from the other cars 1; 2; 5; 6
to this cluster can be determined pairwise. Based on the distances in Table 7.13, we
have

d(3; 1) = 210.25
d(4; 1) = 225

Once more based on the principle that the smallest distance is used, it follows that

d({3; 4}; 1) = 210.25

Table 7.15 shows the new distance matrix.

Iteration 1, Step 6: Check if algorithm can be finished


The clustering algorithm ends if all objects are assigned to exactly one cluster.
Otherwise, we have to repeat steps 3–6. Here, the objects 1; 2; 5; 6 are not assigned
to a cluster.
So we have to go through the procedure once again.

Start of Iteration 2
Iteration 2, Step 3: Determining the objects/clusters that are “most similar”
The minimum distance is d(5; 6) = 25.

Table 7.16 Overview of clusters and the objects assigned after iteration 2

Cluster description   Objects/cars that belong to the cluster
Cluster 1             1
Cluster 2             2
Cluster 3             3; 4
Cluster 4 new         5; 6

Iteration 2, Step 4: Assigning the identified objects to a cluster


For that reason, the cars with the IDs 5 and 6 are assigned to a new cluster (see
Table 7.16).

Iteration 2, Step 5: Calculating the similarities/distances of the new cluster,


Update of the similarity/distance matrix
In cluster 4, we find cars 5 and 6. The distance from this cluster to cars 1 and 2 can
be determined using Table 7.15. For the distance from car 1, we get:

d(5; 1) = 676
d(6; 1) = 961

The minimum distance is

d({5; 6}; 1) = 676

For the distance from the new cluster to car 2, we get:

d(5; 2) = 400
d(6; 2) = 625

The minimum distance is

d({5; 6}; 2) = 400

Table 7.17 shows the new distance matrix so far, but we also have to calculate the
distances between the cluster with cars 3; 4 and the new cluster with cars 5 and
6. We use Table 7.15 and find that

d({3; 4}; 5) = 121

and

d({3; 4}; 6) = 256

So the distance is

d({3; 4}; {5; 6}) = min(121; 256) = 121



Table 7.17 Distance matrix cluster iteration 2 (part 1)

                To
           1     2      3; 4     5; 6
From  1    0     36     210.25   676
      2          0      72.25    400
      3; 4              0        ?
      5; 6                       0

Table 7.18 Distance matrix cluster iteration 2 (part 2)

                To
           1     2      3; 4     5; 6
From  1    0     36     210.25   676
      2          0      72.25    400
      3; 4              0        121
      5; 6                       0

Table 7.19 Overview of clusters and the objects assigned after iteration 3

Cluster description   Objects/cars that belong to the cluster
Cluster 1 new         1; 2
Cluster 3             3; 4
Cluster 4             5; 6

See also Table 7.18.

Iteration 2, Step 6: Check if algorithm can be finished


The algorithm does not end because objects 1 and 2 are not assigned to a cluster.

Iteration 3, 4, and 5
The minimum distance in Table 7.18 is 36. So the objects or cars 1 and 2 are
assigned to a new cluster.
The distances from the new cluster {1; 2} to the other existing clusters can be
determined using Table 7.18. They are

d(1; {3; 4}) = 210.25

and

d(2; {3; 4}) = 72.25

So the distance is

d({1; 2}; {3; 4}) = min(210.25; 72.25) = 72.25

Furthermore, the distances from the new cluster {1; 2} to the existing cluster {5; 6}
are

d(1; {5; 6}) = 676
d(2; {5; 6}) = 400

So the distance is

d({1; 2}; {5; 6}) = min(676; 400) = 400

Table 7.20 shows the new distance matrix.


The remaining iterations are very similar to the ones explained in detail above. We
show the resulting distance matrices in Tables 7.21 and 7.22.
This result is typical for a hierarchical clustering algorithm. All the objects are
assigned to one cluster. Statistical software normally does not provide the huge
amount of detail we have presented here, but this example should allow us to
understand what happens in the background. Furthermore, we can identify the
parameters of clustering algorithms, such as the method to determine the clusters
that have to be merged in the next step. For now, we want to pay attention to the
so-called “dendrogram” in Fig. 7.10. This is produced with SPSS statistics. Unfor-
tunately, the SPSS Modeler does not offer this helpful diagram type.
The dendrogram shows us the steps of the cluster algorithm in the form of a tree.
The horizontal axis annotated with “Rescaled Distance Cluster Combine”
represents the minimum distances used in each step, for identifying which objects
or clusters to combine.
The minimum distance (measured as the squared Euclidean distance) between
objects 3 and 4 was 0.25 (see Table 7.13). The maximum distance was 121 in
iteration 5 (see Table 7.21). If the maximum in the dendrogram is 25, then we get

Table 7.20 Distance matrix iteration 3

                 To
            1; 2    3; 4     5; 6
From  1; 2  0       72.25    400
      3; 4          0        121
      5; 6                   0

Table 7.21 Distance matrix iteration 4

                      To
                 1; 2; 3; 4    5; 6
From  1; 2; 3; 4 0             121
      5; 6                     0

Table 7.22 Distance matrix iteration 5

                         To
                         1; 2; 3; 4; 5; 6
From  1; 2; 3; 4; 5; 6   -

Fig. 7.10 Dendrogram for simple clustering example using the car dataset

Table 7.23 Rescaled distances for drawing a dendrogram

Distances   Rescaled distances
0.25        0.05
25.00       5.17
36.00       7.44
72.25       14.93
121.00      25.00

(25 / 121) · 0.25 = 0.0517
Table 7.23 shows all the other values that can be found in the dendrogram in
Fig. 7.10. This calculation can also be found in the Microsoft Excel spreadsheet
“Car_Simple_Clustering_distance_matrices.xlsx”.

" A dendrogram can be used to show the steps of a hierarchical cluster


analysis algorithm. It shows the distance between the clusters, as well
as the order in which they are joined. Depending on the horizontal
distance between the visualized cluster steps, the researcher can
decide how many clusters are appropriate for best describing the
dataset.
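The rescaling of the merge distances for the dendrogram axis can also be reproduced
with a few lines of Python (our own sketch, assuming the scheme used above in which the
largest merge distance is mapped to 25):

# Rescale the merge distances for the dendrogram axis: the largest distance maps to 25
merge_distances = [0.25, 25.0, 36.0, 72.25, 121.0]
rescaled = [round(d / max(merge_distances) * 25, 2) for d in merge_distances]
print(rescaled)   # [0.05, 5.17, 7.44, 14.93, 25.0]  (compare Table 7.23)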

Single-linkage and other algorithms to identify clusters to merge


To determine the distance between two objects (or cars) or clusters, we used the
squared Euclidean distance measure. Other so-called “metrics” or “proximity

measures” are shown in Table 7.5. So far, we have defined how to measure the
distance between the objects or clusters.
The next question to answer with a clustering algorithm is which objects or
clusters to merge. In the example presented above, we used the minimum distance.
This procedure is called “Single-linkage” or the “nearest neighbor” method. In the
distance matrices, we determine the minimum distance and merge the
corresponding objects by using the row and the column number. This means that
the distance from a cluster to another object “A” equals the minimum distance
between each object in the cluster and “A”.

d(object A; {object 1; object 2}) = min( d(object A; object 1); d(object A; object 2) )

The disadvantage of single-linkage is that groups which are not well separated can
normally not be detected. Furthermore, outliers tend to remain isolated points
(see Fig. 7.11).
Table 7.24 shows a summary of other algorithms and their characteristics.

Determining the number of clusters


The general approach of clustering algorithms can be explained using this more or
less simple example. Certainly, if we analyze the given dataset in Table 7.11, the
results are not a surprise. Comparing the dendrogram in Fig. 7.10 with the car prices,
the steps of the clustering algorithm make sense; for instance, the price difference
between the Ford F-150 and the Chevrolet Silverado 1500 in this dataset is very small.
Both cars belong to the same cluster. However, additional information (e.g., the type
of car) is not presented to the algorithm, so the decision is based only on the price.
In the case of using a hierarchical cluster algorithm, the number of clusters need
not be determined in advance. In the case of an agglomerative algorithm, the cluster
will be identified and then merged. We have to decide how many clusters should be

Fig. 7.11 Example for two clusters with an extremely small distance between them
small distance between them

Table 7.24 Hierarchical clustering algorithms

Single-Linkage
– Smallest distance between objects is used.
– All proximity measures can be used.
– Outliers remain separated; identification is possible.
– Groups that are close to each other are not separated.

Complete-Linkage/Maximum distance
– Largest distance between objects is used.
– All proximity measures can be used.
– Outliers are merged into clusters; identification is not possible.
– Tends to build many small clusters.

Average linkage/Average distance/Within groups linkage
– The average of the object distances between two clusters is used. All possible
  object distances will be taken into account.
– All proximity measures can be used.

Centroid/Average group linkage/Between groups linkage
– Similar to centroid, but the distances between cluster centers are taken into
  account. The compactness of the clusters is also relevant.
– Squared Euclidean distance is used.

Median
– Similar to centroid, but the cluster centers are determined by the average of the
  centers, taking the number of objects per cluster into account (the number of
  objects is known as the weight).
– Only distance measures can be used.

Ward's method
– Cluster centers are determined. The distances between all objects in each cluster
  and the center are determined and cumulated.
– If clusters are merged, then the distance from the objects to the new cluster center
  increases. The algorithm identifies the two clusters where the distance increment is
  the lowest. See Bühl (2012), p. 650.
– Cluster sizes are approximately the same.
– Only distance measures can be used.

used to describe the data best, however. This situation is similar to PCA or PFA
factor analysis. There also, we had to determine the number of factors to use (see
Sect. 6.3). In the dendrogram in Fig. 7.10, the cutoff shows an example using two
clusters.
A lot of different methods exist to determine the appropriate number of clusters.
Table 7.25 shows them with a short explanation. For more details, the interested
reader is referred to Timm (2002), p. 531–533. In our example, the rule of thumb
tells us that we should analyze clustering results with
\sqrt{number of objects / 2} = \sqrt{6 / 2} ≈ 2 (clusters)
Later on we will also use the Silhouette plot of the SPSS Modeler.

Table 7.25 Methods to determine the number of clusters

\sqrt{number of objects / 2}
In reference to Mardia et al. (1979), this rule of thumb approximates the number of
clusters.

Elbow criterion
As explained in Sect. 6.3.2 and Fig. 6.18, the dependency of a classification criterion
vs. the number of clusters can be visualized in a 2D-chart. In the case of PCA/PFA,
this is called a screeplot. In cluster analysis, the sum of squares of distances/errors
can be used on the vertical axis. In reverse, the percentage of variance/information
explained can be assigned to the vertical axis. See Figs. 7.12 and 7.13. An example can
be found in exercise 5 of Sect. 7.4.3.

Silhouette charts
To measure the goodness of a classification, the silhouette value S can be calculated:
1. Calculation of the average distance from the object to all objects in the nearest
   cluster.
2. Calculation of the average distance from the object to all objects in the same
   cluster the object belongs to.
3. Calculation of the difference between the average distances (1)-(2).
4. This difference is then divided by the maximum of both those average distances.
   This standardizes the result:
   S = (avg. dist. to nearest cluster - avg. dist. to objects in same cluster) /
       max(avg. dist. to nearest cluster, avg. dist. to objects in same cluster)
More details can be found in Struyf et al. (1997), p. 5–7.
S can take on values between -1 and +1. If S = -1, the object is not well classified.
If S = 0, the object lies between two clusters, and if S = +1 the object is well
classified. The average of the silhouette values of all objects represents a measure of
goodness for the overall classification.
As outlined in IBM (2015b), p. 77 and p. 209, the IBM SPSS Modeler calculates a
Silhouette Ranking Measure based on the silhouette value. It is a measure of cohesion
within the clusters and separation between the clusters. Additionally, it provides
thresholds for poor (up to +0.25), fair (up to +0.5), and good (above +0.5) models.

Information criterion approach
Information criteria such as the Bayesian Information Criterion (BIC) or Akaike's
Information Criterion (AIC) are used to determine the appropriate number of clusters.
For more details, see Tavana (2013), p. 61–63.

" In cases using a hierarchical clustering algorithm, specifying the


number of clusters to determine is unnecessary. The maximum
compression of the information included in the data can be found,
if all objects belong to one cluster. The maximum accuracy can be
realized, by assigning each object to a separate cluster.

" To find the optimal number of clusters, a different rule of thumb


exists. The SPSS Modeler provides the Silhouette chart as a measure
of cohesion within the clusters and of separation between the
clusters.
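Outside the Modeler, silhouette values can also be computed quickly, for example with
scikit-learn (assumed to be installed); as an illustration, we use the two-cluster
solution of the car example as the labels:

# Average silhouette value for the two-cluster car solution (clusters {1;2;3;4} and {5;6})
from sklearn.metrics import silhouette_score

prices = [[13], [19], [27.5], [28], [39], [44]]
labels = [0, 0, 0, 0, 1, 1]
print(silhouette_score(prices, labels, metric="euclidean"))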

Fig. 7.12 Elbow criterion (II) to identify the number of clusters

Fig. 7.13 Elbow criterion (I) to identify the number of clusters

Interpretation of clustering results


As mentioned above, in the example presented here we should analyze the results
for the two clusters determined by the algorithm. It is important to note that the
clustering methods do not provide any help with finding an appropriate description
for each cluster. The researcher has to figure out by herself/himself how to best
describe each cluster.
Looking at the results in Table 7.21, or in the dendrogram of Fig. 7.10, we can
see that cars with the IDs 1; 2; 3; 4 are assigned to one cluster and cars 5; 6 to
another cluster. The former cluster can probably be described as “cheap or

moderately priced cars" and the latter cluster with two cars can be described as
“luxury cars”.

Disadvantages of hierarchical clustering methods


A lot of distances must be calculated in order to find clusters that can be merged.
Furthermore, the distances are not determined globally; if one object is assigned to
a cluster, it cannot be removed and assigned to another one. This is because the
distances within the clusters are irrelevant.

7.3.2 Characteristics of the TwoStep Algorithm

As mentioned in the introduction to the clustering algorithms in Sect. 7.1, and


especially summarized in Table 7.6, TwoStep is an implementation of a hierarchi-
cal clustering algorithm.
It tries to avoid difficulties caused by huge datasets and the necessity to compare
the objects pairwise. Measuring their similarity, or determining their distance, is
time consuming and requires excellent memory management.
In the so-called “pre-clustering”, TwoStep uses a tree to assign each object
to a cluster. The tree is built by analyzing the objects one-by-one. If the tree
grows and exceeds a specific size, the tree is internally reorganized. So the
procedure can handle a huge number of objects. The disadvantage of
pre-clustering is that assignment of the objects to the (smaller) clusters is fixed.
Based on the characteristics of hierarchical clustering methods, we can easily
understand that the result from TwoStep depends also on the order of the objects
in the dataset.
In the second step, the pre-clusters are the basis for a hierarchical clustering
algorithm. This is feasible because the number of pre-clusters is much smaller than
the sample size of the original data. The pre-clusters are merged stepwise.
The Modeler offers the Euclidean distance or the log-likelihood distance, to
determine the dissimilarity of objects or clusters. For the log-likelihood method, the
variables must be continuous and normally distributed. The Euclidean distance
measure often leads to inappropriate clustering results. Details can be found in the
solution to exercise 3 in Sect. 7.4.4.
As explained in Sect. 4.7, “Contingency Tables”, a chi-square distribution can
be used to determine the probability that the observed frequencies in a contingency table
come from a specific distribution. The log-likelihood
distance measure is a probability-based distance, which uses a similar approach.
See IBM (2015a), p. 398–399 for details. The decrease in the likelihood caused by
combining two clusters is a measure of the distance between them. Mathematical
details can be found in IBM (2015a), p. 397–401.
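The exact TwoStep procedure is proprietary to IBM, but the underlying two-stage idea, compressing the data into sub-clusters with a tree first and then merging these sub-clusters hierarchically, can be imitated with open-source tools. The following rough sketch assumes Python with scikit-learn (an assumption outside the Modeler workflow); its Birch implementation builds a CF tree, comparable to the pre-clustering step, and then applies agglomerative clustering to the sub-clusters. Note that it uses the Euclidean distance rather than the log-likelihood distance, and synthetic data.

import numpy as np
from sklearn.cluster import Birch, AgglomerativeClustering
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (500, 3)),      # synthetic group 1
               rng.normal(6, 1, (500, 3))])     # synthetic group 2
X = StandardScaler().fit_transform(X)           # standardize continuous inputs

two_stage = Birch(
    threshold=0.5,                              # controls the size of the CF tree
    n_clusters=AgglomerativeClustering(n_clusters=2)  # stage 2: merge the sub-clusters
)
labels = two_stage.fit_predict(X)
print(np.bincount(labels))                      # sizes of the two final clusters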

Advantages

– TwoStep can deal with categorical and metrical variables at the same time. In
these cases, the log-likelihood measure is used.
– By using the Bayesian Information Criterion (BIC) or Akaike’s Information Criterion (AIC), the
TwoStep algorithm implemented in the SPSS Modeler determines the “optimal”
number of clusters.
– The minimum or maximum number of clusters that come into consideration can
be defined.
– The algorithm tends not to produce clusters with approximately the same size.

Disadvantages

– TwoStep clustering assumes that continuous variables are normally distributed
if the log-likelihood distance is used. A transformation of the input variables towards
normality may therefore be necessary; the log transformation can often be used for
continuous input variables.
– The clustering result depends on the order of the objects in the dataset.
Reordering might lead to other results.
– TwoStep results are relatively sensitive to mixed-type attributes in the
dataset. Different scales or codes for categorical variables can result in different
clustering results. See, e.g., Bacher et al. (2004), p. 1–2.

" TwoStep is a hierarchical clustering algorithm that uses pre-clustering


to assign objects to one of the smaller clusters. Larger datasets can be
used because of this. The pre-assignment of the objects is not revised
in the following steps, however, and the order of the objects in the
dataset influences the clustering result.

" Important assumptions when using TwoStep clustering are that


continuous variables are assumed to be normally distributed, and
categorical variables are multinomially distributed. Therefore, they
should be transformed in advance. Often the log transformation of
continuous values can be used.

" Alternatively, the Euclidean distance measure can be used, but it


often produces unsatisfying results.

7.3.3 Building a Model in SPSS Modeler

In Sect. 7.3.1, we discussed the theory of agglomerative clustering methods using the
single-linkage algorithm. Additionally, we learned in the previous section that
the TwoStep algorithm is an improved version of an agglomerative procedure.

Here, we want to use TwoStep to demonstrate its usage for clustering objects in the
very small dataset “car_simple”. We know the data from the theoretical section. They
are shown in Table 7.11. For more details see also Sect. 10.1.4. First we will build the
stream and discuss the parameters of the TwoStep algorithm, as well as the results.

Description of the model


Stream name car_clustering_simple
Based on dataset car_simple.sav
Stream structure

Related exercises: All exercises in 7.3.4

Creating the stream

1. We open the “Template-Stream_Car_Simple”. It gives us access to the SPSS
dataset “car_simple.sav” (Fig. 7.14).
2. Then we check the data by clicking on the Table node. We find the records as
shown in Fig. 7.15. These cars will be the basis for our clustering approach.
3. Checking the variables in Fig. 7.15, we can see that the ID can’t be used to
cluster the cars. That’s because the ID does not contain any car-related infor-
mation. The variables “manufacturer”, “model”, and “dealer” are nominally
scaled. Theoretically, we could use the variables with the binary coding
procedure and the Tanimoto similarity measure explained in Sect. 7.1. So the

Fig. 7.14 Stream “Template-Stream_Car_Simple”

Fig. 7.15 Dataset “car_simple.sav” used in “Template-Stream_Car_Simple”

type of the variables is not the reason to exclude them from clustering, but the
variables “manufacturer” as well as “model” and “dealer” do not provide any
valuable information for the algorithm. Only a researcher with domain knowledge could
cluster the cars based on some of these variables, using knowledge about the reputation
of the manufacturer and the typical price ranges of the cars.
So the only variable that can be used here for clustering is the price. We now
should check the scale of measurement.
4. Using the Type node, we can see that all the variables are already defined as
nominal, except the price with its continuous scale type (Fig. 7.16). Normally,
we can define the role of the variables here and exclude the first three, but we
recommend building transparent streams and so we will use a separate node to
filter the variables.
We add a Filter node to the stream, right behind the Type node (see
Fig. 7.17).
We find out that apart from the variable “price”, no other variable can be
used for clustering. As we know from the theoretical discussion in Sect. 7.1
though, a clustering algorithm only provides cluster numbers. The researcher
must find useful descriptions for each cluster, based on the characteristics of the
assigned objects.
So any variable that helps us to identify the objects should be included in the
final table alongside the cluster number determined by the algorithm. In our
dataset, the manufacturer and the model of the car is probably helpful. The
name of the dealer does not give us any additional input, so we should exclude
this variable. We exclude it by modifying the Filter node parameter, as shown
in Fig. 7.18.

Fig. 7.16 Defined scale types of variables in dataset “car_simple.sav”

Fig. 7.17 A Filter node is added to the template stream

5. For the next step, we add a TwoStep node from the Modeling tab of the SPSS
Modeler. We connect this node to the Filter node (see Fig. 7.19). So the only
input variable for this stream is the price of the cars. Only this information
should be used to find clusters.
6. We double-click on the TwoStep node. In the Fields tab, we can choose to use
variables based on the settings in the Type node of the stream. Here, we will
add them manually. To do this, we use the button on the right marked with an
arrow in Fig. 7.20.

Fig. 7.18 Filtered variables

Fig. 7.19 Stream with added TwoStep node

" For transparency reasons, we recommend adding a Filter node


behind the Type node of the stream, when building a stream to
cluster objects. All variables that do not contribute any additional
information, in terms of clustering the objects, should generally be
excluded.

" The researcher should keep in mind, however, that the clusters must
be described based on the characteristics of the objects assigned. So
it is helpful not to filter the object ID’s along with the names. This is
helpful even if they are not used in the clustering procedure itself.

" The variables used for clustering should be determined in the clus-
tering node, rather than using the Type node settings.

Fig. 7.20 Definition of the variables used in the TwoStep node

7. In the Model tab, we can find other parameters as shown in Fig. 7.21. We will
explain them in more detail here.
By default, numeric fields are standardized. This is very important for the
cluster procedure. Different scales or different codings of the input variables lead
to attribute values of very different magnitudes. To make the values
comparable, they must be standardized. We outlined the z-standardization in
Sect. 2.7.6. Here, this method is automatically activated and should be used.
Using the option “Cluster label”, we can define whether a cluster is a string
or a number; see the arrow in Fig. 7.21. Unfortunately, there is a bug in the
Modeler software in version 17 (and probably earlier versions). If the option “Cluster
label” is changed to “Number”, then the clustering result also changes for no
reason. We have reported this bug, and it will be fixed in the following releases.
For now, we do not recommend using this option, despite the fact that it makes
handling cluster numbers easier.

" The option “Cluster label” must be used carefully. The clustering
results can change for no reason. It is recommended to use the
option “string” instead.

The advantage of the TwoStep implementation in the Modeler is that it tries to
automatically find the optimal number of clusters. For technical details, see
IBM (2015b), p. 202. After the first trial, we will probably have to modify the
predefined values by using the methods outlined in Table 7.25.
The SPSS Modeler offers the log-likelihood or the Euclidean distance
measure. In the case of the log-likelihood measure, the variables have to be

Fig. 7.21 Options in the TwoStep node

assumed to be independent. It is also assumed that continuous variables are
normally distributed. Therefore, we would have to use an additional Transform
node in advance. The Euclidean distance, in contrast, can only be calculated for
continuous variables (see also Table 7.5), so we recommend using the log-likelihood
distance.
As also mentioned in Table 7.25, the last option, “Clustering criterion”,
relates to the automatic detection of the number of clusters. Regarding the cluster
labels, numbers would normally be easier to handle, but we found that the clustering
result changes (or is sometimes wrong) if we use the option “Number”, so we suggest using “String”.
We don’t have to modify the options in the TwoStep node yet. Neither do we
need an outlier exclusion nor should we try to determine the number of clusters.
We click on “Run” to start the clustering.
8. Unfortunately, we get the message “Error: Too few valid cases to build
required number of clusters”. This is because, with the very small sample size of
six records, the automatic detection in the TwoStep node cannot determine the
number of clusters.
Here, we use the rule of thumb mentioned in Table 7.5 and used in
Sect. 7.3.1. The simple calculation

Fig. 7.22 Options in the TwoStep node, with two clusters specified
$$ \sqrt{\frac{\text{number of objects}}{2}} = \sqrt{\frac{6}{2}} \approx 2 $$

tells us to try two clusters, so we fix this number manually. We modify the options in the TwoStep node
as shown in Fig. 7.22 and run the node once again with these new parameters.
9. We get a new model nugget as shown in Fig. 7.23.
10. Before we start to show the details of the analysis, we finish the stream by adding
a Table node behind the model nugget node. Figure 7.24 shows the final stream.

" Using the TwoStep algorithm, the following steps are recommended

1. The scale types defined in the Type node are especially impor-
tant for the cluster algorithms. That’s because the usage of the
distance measure (log-likelihood or Euclidean), for example,
depends on these definitions.
2. In a Filter node, variables can be excluded that are unnecessary
for the clustering itself. The object ID (and other descriptive
variables) should not be filtered, because they can help the user
to identify the objects.

Fig. 7.23 TwoStep cluster model nugget node is added to the stream

Fig. 7.24 Final TwoStep cluster stream for simple car example

3. For transparency reasons, it is optimal to select the variables
used for clustering directly in the clustering node itself, rather
than defining the variable roles in the Type node.
4. Standardization of numerical variables is recommended.
5. An important assumption, when using TwoStep clustering
(with the log-likelihood distance measure), is that continuous
variables are normally distributed. Therefore, they should be
transformed by using a Transform node in advance.
6. The algorithm tries to identify the optimal number of clusters.
The user should start the node by using this option. If this fails,
methods outlined in Table 7.25 should be used to determine
the number of clusters manually and set the minimum and the
maximum in the TwoStep node options. The option “Cluster-
ing criterion” is related to automatic cluster number detection.
7. The log-likelihood distance measure is recommended. It can
also be applied if not all variables are continuous.
8. If the results are unsatisfying or hard to interpret, outlier
exclusion can be activated.

Fig. 7.25 Cluster assigned to the cars

Interpretation of the determined clusters


To analyze the results of the TwoStep clustering method, we open the Table node
connected to the model nugget. Figure 7.25 shows the clusters assigned to each of
the cars. The number in the variable “$T-TwoStep” can be determined by using the
options “Cluster label” and “Label prefix”, as shown in Fig. 7.22.
A double-click on the model nugget opens a new dialog window with many
options. In the model summary—also called Model viewer—on the left in Fig. 7.26,
we can see the silhouette measure of cohesion and separation. We explained this in
Sect. 7.3.1 and in particular in Table 7.25. A value between fair (above +0.25) and good
(above +0.5) lets us assume that the clustering was successful.
If the number of clusters is reduced, the silhouette value will generally (though not always)
decrease, and vice versa. For more details, see exercise 5 in Sect. 7.4.3.

" The silhouette value helps to assess the goodness of a classification. It


measures the average distance from each object to the other objects
belonging in the same cluster it is assigned to, as well as the average
distance from the other clusters. The values range from 1 to +1.
Above +0.5, it is a (quite) “good model”.

" The SPSS Modeler shows the silhouette value in the model summary
of the Model Viewer on the left. Moving the mouse over the diagram
is one way to get the silhouette value. This is depicted in Fig. 7.26 in
the middle. To get a more precise result, one can use the option
“Copy Visualization Data”. To do this, the second button from left in
the upper part of the Model Viewer should be clicked. Then the
copied values must be pasted using simple word processing soft-
ware. Table 7.26 shows the result.

Fig. 7.26 Model summary of a TwoStep node

Table 7.26 Silhouette measure with full precision—copied with the option “Copy Visualization Data”

Category   Silhouette measure of cohesion and separation   V3
1          0.7411                                          0.7

On the right in Fig. 7.26, the Modeler shows us that we have two clusters, with
four and two elements, respectively. The ratio of the largest to the smallest cluster size is therefore two.
In the left corner of Fig. 7.26, we select the option “Clusters” from the drop-
down list, instead of the “Model Summary”.
We can then analyze the clusters as shown in Fig. 7.27. By selecting cluster
1 (marked with an arrow in Fig. 7.27) on the left, we will get a more detailed output.
In the drop-down list on the right, field “View”, we can choose “Cell Distribution”.
By selecting the cluster on the left, we can see the frequency distribution of the
objects on the right.
Another valuable analysis is offered in the model viewer, if we use the symbols
in the left window below the clustering results. We can activate the “cell distribu-
tion” button, marked with an arrow in Fig. 7.28. The distribution of each variable in
the clusters then appears above.

Fig. 7.27 Detailed cluster profiles in the model viewer

Focusing once more on Fig. 7.25, we can see that the clustering result is exactly
the same as the result achieved in Sect. 7.3.1. Comparing the results of the manual
calculation with the visualization in Fig. 7.10, in the form of a dendrogram, we see
that also here cars 1–4 are assigned to one cluster and cars 5 and 6 to another one.

Summary
With the TwoStep algorithm, we used a small dataset to find clusters. We did this in
order to show that the result that we got based on the “single-linkage method”, from
the theory section, is the same. Unfortunately, we had to decide the number of
clusters in advance. Normally, this is not necessary when using the TwoStep
algorithm.
After determining “segments” of the objects, the silhouette plot helps to assess
the quality of the model generally. Furthermore, we learn how to analyze the
different clusters step-by-step, using the different options from the model viewer.

Fig. 7.28 Input variable distribution per cluster on the left

7.3.4 Exercises

Exercise 1 Theory of cluster algorithms

1. Outline why clustering methods, e.g., TwoStep, belong to the category of


“unsupervised learning algorithms”.
2. In Sect. 2.7.7 “Partitioning dataset”, we explained how to divide a dataset into
training and test partitions. A validation subset could, in principle, be separated from
the original data before starting the training. Show that it is unnecessary to use
this method for the TwoStep algorithm and that, in fact, it should be avoided.
3. We explained the meaning of the silhouette chart in Sect. 7.3.1, and particularly
in Table 7.25. A value between fair (above +0.25) and good (above +0.5) lets us assume
that the clustering was successful. Explain how this chart—or rather the theory
this chart is based on—can be used to determine the optimal number of clusters
to use.

Exercise 2 IT user satisfaction based on PCA results


In a survey, the satisfaction of IT users with an IT system was determined.
Questions were asked relating to several characteristics, such as “How satisfied

are you with the amount of time your IT system takes to be ready to work (from
booting the system to the start of daily needed applications)?” The users could rate
the aspects using the scale “1 = poor, 3 = fair, 5 = good, 7 = excellent”. See also
Sect. 10.1.20.
You can find the results in the file “IT user satisfaction.sav”. In Sect. 6.3.3,
exercise 3, we used results of a PCA to determine technical and organizational
satisfaction indices. Now the users should be divided into groups based on their
satisfaction with both aspects.

1. Open the stream “pca_it_user_satisfaction.str” and save it under another name.


2. In the lower sub-stream, the indices for technical and organizational satisfaction are
calculated based on the PCA results. Open the 2D-plot created by the Plot node and
interpret the chart. Determine the number of possible clusters you expect to find.
3. Using a TwoStep node, determine clusters of users based on the indices. Don’t
forget to add a Type node first. Explain your findings.
4. Optional: The TwoStep algorithm assumes normally distributed continuous
values. We explained in Sect. 3.2.5 how to assess and transform variables to
meet this assumption. The stream created above can now be modified so that
transformed PCA results are used for clustering. Outline your findings.

Exercise 3 Consumer segmentation using habits


Using the options “vegetarian”, “low meat”, “fast food”, “filling”, and “hearty”,
consumers were asked “Please indicate which of the following dietary characteristics
describe your preferences. How often do you eat . . .”. The respondents had the
chance to rate their preferences on the scale “(very) often”, “sometimes”, and “never”.
The variables are coded as follows: “1 = never”, “2 = sometimes”, and “3 = (very)
often”. They are ordinally scaled. See also Sect. 10.1.24.

1. The data can be found in the SPSS Statistics file “nutrition_habites.sav”. The
Stream “Template-Stream_nutrition_habits” uses this dataset. Please open the
template stream and make sure the data can be loaded. Save the stream with
another name.
2. Use a TwoStep clustering algorithm to determine consumer segments. Assess
the quality of the clustering, as well as the different groups of consumers
identified. Use a table to characterize them.
3. The variables used for clustering are ordinally scaled. Explain how proximity
measures are used to deal with such variables in clustering algorithms.

Remark
Please note that in exercise 2 in Sect. 7.5.3 and also in the solution in Sect. 7.5.4, an
alternative Kohonen and K-Means model will be presented using the Auto
Cluster node.

7.3.5 Solutions

Exercise 1 Theory of cluster algorithms

1. Unsupervised learning methods do not need a target field to learn how to handle
objects. In contrast to the tree methods for example, these algorithms try to
categorize the objects into subgroups, by determining their similarity or
dissimilarity.
2. First of all, in clustering we find more than one correct solution for segmentation
of the data. So different parameters, e.g., proximity measures or clustering
methods, should be used and their results should be compared. This is also a
“type of validation”, but there are many reasons not to divide the original dataset
when using TwoStep:
(a) Partitioning reduces the information presented to the algorithm for finding
“correct” subgroups of objects. Reducing the number of records in cases of
small samples will often lead to worse clustering results. We recommend
using the partitioning option only in cases of huge sample size.
(b) Clusters are defined by inspecting each object and measuring the distance
from or similarity to all other objects. If objects are excluded from cluster-
ing, by separating them in a test or validation partition, the algorithm will
not take them into account and will not assign any cluster number to these
objects.
(c) TwoStep does not produce a “formula” for how to assign objects to a
cluster. Sure, we can assign a completely new object to one of the clusters,
but this process is only based on our “characterisation” of the clusters, using
our knowledge of the data’s background, where the data came from.
3. To measure the goodness of a classification, we determine the average distance
from each object to the points of the cluster it is assigned to and the average
distance to the other clusters. The silhouette plot shows this measure. If we
assume that the silhouette is a measure of the goodness of the clustering, we can
compare different models using their silhouette value. For instance, we can
modify the number of clusters to determine in a TwoStep model, and then use
the model with the highest silhouette value. This method is used by the Auto-
clustering node, as we will see later.
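The idea described in point 3 can also be sketched in a few lines outside the Modeler. The following hedged example assumes Python with scikit-learn (not part of the Modeler workflow); X stands for an already standardized data matrix, and K-Means is used only because it lets us set the number of clusters directly.

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def best_k_by_silhouette(X, k_range=range(2, 9)):
    # Fit one model per cluster count and keep the highest average silhouette
    scores = {}
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return max(scores, key=scores.get), scores

This is essentially the comparison that the Auto Cluster node automates when the silhouette is used as the ranking measure.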

Exercise 2 IT user satisfaction based on PCA results


Name of the solution streams clustering_pca_it_user_satisfaction
Theory discussed in section Sect. 6.3 for PCA theory
Sect. 6.3.3, exercise 3 for indices calculation based on PCA
Sect. 7.3.2 for TwoStep algorithm
Sect. 3.2.5 for transformation towards normal distribution

Fig. 7.29 Stream “clustering_pca_it_user_satisfaction”

1. We opened the stream “pca_it_user_satisfaction.str” and saved it under the name
“clustering_pca_it_user_satisfaction.str”. Figure 7.29 shows the stream.
2. We open the 2D-plot created by the Plot node and marked with an arrow in
Fig. 7.29. As we can see in Fig. 7.30, the scatterplot shows a point cloud with only a few
separated points. The points are arranged along the bisecting line of the diagram. So
in the lower left-hand corner, we can find the users who are less satisfied, and in
the upper right-hand corner, the users who are satisfied. In this exercise, we use a
clustering algorithm to assign the users automatically to these “satisfaction
groups”. We expect to have two or three groups. Outliers on the left and on
the right are probably separated.
3. We must add a Type node before we add a TwoStep node to the stream, to
determine clusters of users based on the indices. Figure 7.31 shows the settings
in the “fields” tab of the TwoStep node. Only the technical and organizational
satisfaction indices are used to determine the clusters. We do not modify other
parameters in this node. We run the TwoStep node and get the model in nugget
form.
Figure 7.32 shows the last sub-stream. Finally, we add a Plot node to visualize
the result of the TwoStep node clustering. Figure 7.33 shows the parameters of
the Plot node. Here, we used the option “Size” to show the cluster number. Of
course the option “Color” can also be used.
As expected, Fig. 7.34 shows two subgroups of users. So each user is assigned
to exactly one “satisfaction group”. A more detailed analysis in the Model

Fig. 7.30 Organizational vs. technical satisfaction indices

Fig. 7.31 TwoStep node settings



Fig. 7.32 Clustering sub-stream is added

Fig. 7.33 Parameter of Plot node shown in Fig. 7.32



Fig. 7.34 Clustered users by their satisfaction indices

viewer tells us that each segment contains exactly 50 % of the users, i.e., the two
segments have the same size.
If we should need to generate a unique user ID for each record, we could do
that by adding a Derive node with the @INDEX function.
4. The solution for this part of the exercise can be found in the stream “cluster_i-
t_user_satisfaction_transformed”. The stream will not be explained in detail
here. The reader is referred to Sect. 3.2.5 “Transform node and SuperNode”.
As explained in Sect. 3.2.5, the Transform node should be used to assess the
normality of data and to generate Derive nodes to transform the values. We add a
Transform node. Assessing the data, we find that using a Log transformation
could help to move the distributions towards normality. We generate a
SuperNode for these transformations. A Shapiro Wilk test would show whether
or not the original, rather than the transformed, variables are normally
distributed. The transformation helps to improve the quality though, as we will
see when assessing the clustering results.
We add a TwoStep node and use the transformed two variables to cluster the
user according to his/her level of satisfaction. Plotting the original (and not the

Fig. 7.35 Clustered users by their satisfaction indices, based on transformed data

transformed) variables against each other in a Plot node, we get the result shown
in Fig. 7.35. The algorithm can separate the groups, but the result is unsatisfying.
The two users with technical satisfaction = 5 at the bottom belong more to the
cluster in the middle than to the cluster on the left.
If we restrict the number of clusters produced by TwoStep to exactly two, we
get the solution depicted in Fig. 7.36. Also here, 50 % of the users are assigned to
each of the clusters.
This solution shows that, for the given dataset, the automatic identification of the number
of clusters determined more segments than appropriate. Generally
though, the algorithm works fine on these skewed distributions.
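The assessment of normality and the log transformation mentioned in this solution can also be sketched outside the Modeler. The following minimal example assumes Python with NumPy and SciPy and uses synthetic, right-skewed values in place of the satisfaction indices; it illustrates the idea only, not the Transform node or SuperNode themselves.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
index = rng.lognormal(mean=1.0, sigma=0.5, size=100)   # synthetic, right-skewed "index"

for name, values in [("original", index), ("log-transformed", np.log(index))]:
    w, p = stats.shapiro(values)                        # Shapiro-Wilk test of normality
    print(f"{name}: W={w:.3f}, p={p:.4f}")              # larger p: no evidence against normality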

Fig. 7.36 Two users clusters, determined from transformed data

Exercise 3 Consumer segmentation using dietary habits


Name of the solution streams cluster_nutrition_habits
Theory discussed in section Sect. 7.3.2

Remarks
The TwoStep algorithm assumes that non-continuous variables are multinomially
distributed. We do not verify this assumption here.
For the dependency of the silhouette measure, and the number of clusters
determined by TwoStep or K-Means, see also the solution to exercise 5 in
Sect. 7.4.4. The solution can be found in the Microsoft Excel file “kmeans_clus-
ter_nutrition_habits.xlsx”.
Please note that in exercise 2 in Sect. 7.5.3, and also in the solution in Sect. 7.5.4,
an alternative Kohonen and K-Means model will be presented using the Auto
Cluster node.

1. We open the stream “Template-Stream_nutrition_habits” and save it under the
new name “cluster_nutrition_habits.str”.

Fig. 7.37 Parameters of the Filter node

Fig. 7.38 Scale type settings in the Type node

2. Before we start to cluster, we have to understand the settings in the stream. For
this, we open the Filter node and the Type node, to show the scale type (see
Figs. 7.37 and 7.38). Here, it can be helpful to enable the ID in the Filter node to
identify the consumers related to their assigned cluster number.
We add a TwoStep node to the stream. To be sure the correct variables are
used for the cluster analysis, we can add them in the Fields tab of the TwoStep

Fig. 7.39 Fields tab in the TwoStep node

Fig. 7.40 Model tab in the TwoStep node

node (see Fig. 7.39). Additionally, we must make sure that the ordinal variables
are standardized in the Model tab, as shown in Fig. 7.40. Running the TwoStep
node, we get the final stream as shown in Fig. 7.41.

Fig. 7.41 Final stream “cluster_nutrition_habits”

Fig. 7.42 Summary of TwoStep clustering in the Model Viewer

The advantage of using a TwoStep node is that it determines the number of
clusters automatically. Double-clicking the model nugget, we get the model
summary in the Model Viewer, as in Fig. 7.42.
The quality of clustering is good, based on the silhouette plot. As described in
Sect. 7.3.3, we get a more precise silhouette value by using the “Copy

Fig. 7.43 Detailed assessment of the clusters

Table 7.27 Characterization of clusters


Cluster number Description
Cluster-1 Avoids vegetarian food but does not prefer lots of meat
Cluster-2 Respondents who eat hearty, filling, and fast food but sometimes vegetarian
Cluster-3 Mainly vegetarian food preferred, but also eats “low meat”
Cluster-4 Avoids vegetarian, meat, and fast food. Sometimes eats “low meat”
Cluster-5 Sometimes vegetarian and sometimes also “low meat”

Visualization Data”, also highlighted with an arrow in Fig. 7.42. Here, the
silhouette value is 0.7201.
Using the option “clusters” from the drop-down list in the left-hand corner of
the Model Viewer in Fig. 7.42, we get the frequency distribution per variable and
cluster, as shown in Fig. 7.43. Table 7.27 shows a short assessment of the
different clusters. Cluster 4 is particularly hard to characterize. The TwoStep
algorithm should probably be used to determine only four clusters.

3. In this case, the proximity measure is a similarity measure. Ordinal variables are
recoded internally into several binary (dual) variables. For details, see exercise 2 in
Sect. 7.2.2. To determine the similarity between the dual variables, the Tanimoto
coefficient can be used. For details see exercise 3 in Sect. 7.2.2, as well as the
explanation in Sect. 7.1.
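For readers who want to see the Tanimoto coefficient in action, the following small sketch assumes Python with NumPy and that the ordinal answers have already been recoded into 0/1 dummy variables; for binary data the Tanimoto coefficient coincides with the Jaccard coefficient.

import numpy as np

def tanimoto(a, b):
    # Number of attributes present in both objects, divided by the number of
    # attributes present in at least one of the two objects
    a, b = np.asarray(a, dtype=bool), np.asarray(b, dtype=bool)
    both = np.sum(a & b)
    either = np.sum(a | b)
    return both / either if either else 1.0

print(tanimoto([1, 0, 1, 1, 0], [1, 0, 0, 1, 0]))   # 2 shared attributes of 3 -> 0.667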

7.4 K-Means Partitioning Clustering

7.4.1 Theory

In hierarchical clustering approaches, the distance between all objects must be
determined to find the clusters. As described in Sect. 7.3.2, TwoStep tries to
avoid the difficulties of handling a large number of values and objects by using a
tree to organize the objects based on their distances. Additionally, TwoStep results
are relatively sensitive to mixed-type attributes in the dataset.
Dealing with large datasets in cluster analysis is challenging. With the K-Means
algorithm, it is not necessary to analyze the distance between all objects. For this
reason, the algorithm is used often.
In this section, we will outline the theory K-Means is based upon, as well as the
advantages and the disadvantages that we can conclude, based on this knowledge.
The steps of the K-Means algorithm can be described as follows (see also IBM
(2015a), p. 227–232):

1. The user specifies the number of clusters k.


2. Metrical variables are transformed to have values between 0 and 1, using the
following formula:

$$ x_{i,\text{new}} = \frac{x_i - x_{\min}}{x_{\max} - x_{\min}} $$
Nominal or ordinal variables (also called symbolic fields) are recoded, using
binary coding as outlined in Sect. 7.2.1, Exercise 2 and especially in Table 7.9.
Additionally, the SPSS Modeler uses a scaling factor to avoid having these
variables overweighted in the following steps. For details see also IBM
(2015a), p. 227–228. Normally, the factor equals the square root of 0.5, i.e.,
approximately 0.70711, but the user can define his/her own value in the Expert tab of the
K-Means node.
3. The k cluster centers are defined as follows (see IBM (2015a), p. 229):
(a) The values of the first record in the dataset are used as the initial cluster
center.
(b) Distances are calculated from all records to the cluster centers so far
defined.
(c) The values from the record with the largest distance to all cluster centers are
used as a new cluster center.

(d) The process stops if the number of clusters equals the number predefined by
the user, i.e., until k cluster centers are defined.
4. The squared Euclidean distance (see Tables 7.5 and 7.10) between each record
or object and each cluster center is calculated. The object is assigned to the
cluster center with the minimal distance.
5. The cluster centers are updated, using the “average” of the objects assigned to
this cluster.
6. The process stops when either the maximum number of iterations has been reached
or the recalculation no longer changes the cluster centers. Instead of “no change in the
cluster centers”, the user can define another threshold for the change that will
stop the iterations.
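To make the listed steps more concrete, the following minimal NumPy sketch implements them outside the Modeler. It is an illustration only: it handles continuous variables without the symbolic-field encoding and scaling factor, assumes non-constant columns for the rescaling, and the Modeler's initialization details may differ.

import numpy as np

def kmeans_sketch(X, k, max_iter=20):
    X = np.asarray(X, dtype=float)
    # Step 2: rescale every column to values between 0 and 1
    X = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

    # Step 3: the first record is the first center; then the record farthest from
    # the centers chosen so far (largest distance to its nearest center) is added
    centers = [X[0]]
    while len(centers) < k:
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers)

    for _ in range(max_iter):
        # Step 4: assign each record to the nearest center (squared Euclidean distance)
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Step 5: update each center with the mean of its assigned records
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        # Step 6: stop when the centers no longer change (or max_iter is reached)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers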

" K-Means is a clustering algorithm suitable for large datasets in partic-


ular. First k cluster centers are determined and then the objects are
assigned to the nearest cluster center. The number of clusters k must
be defined by the user. In the following iterations, the clustering is
improved by assigning objects to other clusters and updating the
cluster centers. The process stops if a maximum number of iterations
are obtained or there is no change in the clusters, or the change is
smaller than a user-defined threshold.

" The Auto-Clustering node also uses K-Means clustering. Here, the
number of clusters does not have to be determined in advance.
Based on several goodness criteria, as defined by the user, “the
best” model will be determined.

Advantages

– Each cluster consists of at least one item.


– The clusters do not overlap.
– Using large datasets, K-Means is faster than hierarchical algorithms. K-Means
can probably be used if other algorithms crash because of insufficient memory.

Disadvantages

– The number of clusters k must be defined by the user.


– It tends to produce clusters of approximately the same size.
– Finding an appropriate number of clusters is difficult.
– The result depends to some extent on the order of the objects in the dataset. This
is because the first record is used as the initial cluster center.

7.4.2 Building a Model in SPSS Modeler

Customer segmentation, or market segmentation, is one of the main fields where
cluster analysis is often used. A market is divided into subsets of customers who
have typical characteristics. The aim of this segmentation is to identify target
customers or to reduce risks.
In the banking sector, this technique is used to improve the profitability of the
business and to avoid risks. If a bank can identify customer groups with a lower or
higher risk of default, it can define better rules for money lending or
credit card offers.
We want to apply the K-Means algorithm for customer segmentation purposes
here too. The dataset comes from the IBM Website (2014). See Sect. 10.1.7.

Description of the model


Stream name customer_bank_segmentation_K_means
Based on data set customer_bank_data.csv
Stream structure

Related exercises: All exercises in Sect. 7.4.3

Creating the stream

1. We open the “Template-Stream_Customer_Bank”. It includes a Variable File
node, a Type node to define the scale types, and a Table node to show the
records (Fig. 7.44).
2. To check if the records are imported and to understand the data, let’s open the
Table node first (see Fig. 7.45).
3. Now we have to be sure the correct scale types are assigned to the variables, so
we open the Type node as shown in Fig. 7.46. There is no need to define the role
of the variables here. That’s because we will add the variables we want to use
later one-by-one to the clustering node. We think this method is more transpar-
ent than using options that have side effects on other nodes.
4. Interpreting the variables shown in Fig. 7.46 and additionally described in
detail in Sect. 10.1.7, we can select variables that help us to find subgroups
of customers, in terms of risk of default. We assume:

Fig. 7.44 Stream “Template-Stream_Customer_Bank”

Fig. 7.45 Records of “customer_bank_data.csv”

Fig. 7.46 Scale types of variables defined for “customer_bank_data.csv”



Fig. 7.47 Added Reclassify node to the stream

(a) “ADDRESS” and “CUSTOMERID” are not helpful and can be excluded.
(b) “DEFAULTED” should be excluded too, because it is the result, and the
clustering algorithm is an unsupervised method, which should identify the
pattern by itself and should not “learn” or “remember” given facts.
(c) “AGE”, “CARDDEBT”, “EDUCATION”, “INCOME”,
“OTHERDEBT”, and “YEARSEMPLOYED” are relevant variables for
the segmentation.
5. The “EDUCATION” variable is defined as nominal because it is a string, but in
fact it is ordinal. To define an order, we use a Reclassify node as described in
Sect. 3.2.6. After adding this node from the “Field Ops” tab of the Modeler to
the stream (see Fig. 7.47), we can define its parameters as shown in Fig. 7.48.
We define here a new variable “EDUCATIONReclassified”. Later in this
stream, we have to add another Type node, for assigning the correct scale type
to this new variable. For now, we want to check all variables to see if any of
them can be used in clustering.
6. Interpreting the potentially useful variables, we can see that some of them are
correlated. A customer with a high income can afford higher credit card and
other debt, so using these original variables separately doesn’t add much input to the model.
Let’s first of all check the correlation coefficient of the variables by using a
Sim Fit node, however. We add the node to the stream and execute it. We
explained how to use this node in Sect. 4.4. Figure 7.49 shows the actual
stream. Figure 7.50 shows the correlations.
We can see that the variables “CARDDEBT” and “OTHERDEBT”, as well
as “INCOME” and “CARDDEBT”, and “INCOME” and “OTHERDEBT” are
correlated.
7. So it is definitely necessary to calculate a new variable in the form of the ratio of
credit card and other debt to income (a small illustrative sketch of this data preparation
is given after this step list). To do so, we add a Derive node from
the “Field Ops” tab of the Modeler. Using its expression builder, as outlined in

Fig. 7.48 Assigning ordinal scaled values to the variable “EDUCATION”

Sect. 2.7.2 we define the parameters as shown in Fig. 7.51. The name of the new
variable is “DEBTINCOMERATIO” and the formula “(CARDDEBT +
OTHERDEBT)/INCOME * 100”. So “DEBTINCOMERATIO” equals the
summarized debt of the customer vs. the income in percent.
8. To assign the correct scale types to the reclassified educational characteristics
in “EDUCATIONReclassify”, and the new derived variable
“DEBTINCOMERATIO”, we must add another Type node at the end of the
stream. We define “DEBTINCOMERATIO” as continuous and
“EDUCATIONReclassify” as ordinal (see Figs. 7.52 and 7.53).

Fig. 7.49 Reclassify and Sim Fit nodes are added to the stream

Fig. 7.50 Correlation of metrical variables included in “customer_bank_data.csv”

9. Now we have finished the preliminary work and we are ready to try to cluster
the customer. We add a K-Means node to the stream from the “Modeling” tab
and open its parameter dialog window (Fig. 7.54).

Fig. 7.51 Parameters of the Derive node to calculate the debt-income ratio

Fig. 7.52 Stream with another added Type node at the end

Fig. 7.53 Assigning scale types to “DEBTINCOMERATIO” and “EDUCATIONReclassified”

Fig. 7.54 Variables used in K-Means node



Fig. 7.55 Cluster parameters used in K-Means node

In the Fields tab, we add the variables previously discussed. To do so, we
enable the option “Use custom settings” and click the variable selection button
on the right. Both are marked with an arrow in Fig. 7.54.
Using the K-Means algorithm, we have to define the number of clusters in
the tab “Model” of the node. The rule of thumb explained in Table 7.25 of
Sect. 7.3 was
$$ \sqrt{\frac{\text{number of objects}}{2}} = \sqrt{\frac{850}{2}} \approx 20.62 $$
The derived 21 clusters are definitely too many, because we have to describe
the characteristics of each customer cluster based on our knowledge. The four
variables used here do not allow us that precision. We should therefore start
with a lower number and decide to use five clusters first (see Fig. 7.55).
The option “Generate distance field“ would give us the opportunity to
calculate the Euclidean distance between the record and its assigned cluster
center. These values are assigned to each record and appear in a Table node
attached to the model nugget, in a variable called “$KMD-K-Means”. We don’t
want to use this option here. For details, see IBM (2015a), p. 231–232.
10. We can start the K-Means clustering with the “Run” button. A model nugget
will be added to the stream (Fig. 7.56).
11. Double-clicking on the node, we can assess the clustering results (see
Fig. 7.57). We can see on the left in the Silhouette plot that the clustering

Fig. 7.56 K-Means node and Model nugget in the stream

Fig. 7.57 Model summary for the five-cluster solution

quality is just “fair”. On the right, the Modeler shows that there is a cluster
3 with 5.8 % of the records.
To assess the model details, we choose the option “Clusters” in the left-hand
corner of Fig. 7.57. In the left part of the window in the Model viewer, we can
find the details, as shown in Fig. 7.58. Obviously, the difference between the
clusters is not that remarkable. So the first conclusion of our analysis is that we
have to reduce the number of clusters.

Fig. 7.58 Cluster details for the five-cluster solution

We also want to assess the quality of the predictors used to build the model,
however. The SPSS Modeler offers this option in the drop-down list on the
right-hand side of the window. It is marked with an arrow in Fig. 7.57.
Figure 7.59 shows us that the clustering is dominated by the education of the
customer. This is not very surprising in itself, so the practical conclusion is
not that useful. Furthermore, in terms of clustering quality, we want to have the
significant influence of more than one variable in the process. The second
conclusion of our analysis is that we should try to exclude the variable
“EDUCATIONReclassified” from the K-Means clustering.
We can close the Model viewer with “OK”.

" Predictor importance, determined by the SPSS Modeler, can help


identify the best variables for clustering purposes, but it is not in
itself proof of the importance of a variable, in terms of improving the
model accuracy. The importance value is just rescaled so that the
sum of all predictors is one. A larger importance means one variable
is more appropriate than another.

Fig. 7.59 Predictor importance in the five-cluster solution

" For categorical variables, the importance is calculated using Pearson’s


chi-square. If the variables are continuous, an F-test is used. For
calculation details, see IBM (2015a), p. 89–91.

12. In the K-Means node, we remove the “EDUCATIONReclassified” variable from
the variable list and define the number of clusters the algorithm should identify
as four. The final settings are shown in Figs. 7.60 and 7.61.
We click “Run” to start the modeling process again.
13. Figure 7.62 shows the summary of the model in the Model Viewer. The cluster
quality has improved, as we can see in the silhouette plot; the smallest cluster
represents 15.5 % of the 850 records.
14. If we use the drop-down list on the left for an assessment of the clusters, we get
the values shown in Fig. 7.63. No cluster has a similar characteristic to another
one. Also, the importance of the predictors in Fig. 7.64 is well-balanced.
15. The Model Viewer offers a wide range for analysis of the cluster. If we click on
the button for the absolute distribution in the middle of the left window, marked
with an arrow in Fig. 7.65, we can analyze the distribution of each variable per
cluster on the right. To do so, we have to click one of the table cells on the left.

Interpretation of the determined clusters


The clustering algorithms do not produce any description of the identified clusters,
but in our example, Figs. 7.63 and 7.65 help us to characterize them very well.
Table 7.28 summarizes the findings.

Fig. 7.60 Modified list of variables in the K-Means node

Fig. 7.61 Final model tab options in the K-Means node

The practical conclusions are more difficult, however, as is often the case.
Based on these results, we could think that customers assigned to cluster 4 are
uninteresting to the bank. A more detailed analysis can help us be more precise.
To get more information, we add a Table node and a Data Audit node to the
stream and connect them with the model nugget (see Fig. 7.66). In the Table node of
Fig. 7.67, we can find a column with the assigned cluster number.

Fig. 7.62 Model summary of the four-cluster solution

Fig. 7.63 Cluster details from the four-cluster solution



Fig. 7.64 Predictor importance in the four-cluster solution

Fig. 7.65 Details of Predictor importance

We remember that in the original dataset, a variable “Defaulted” was included
(see Fig. 7.45). We should use these past defaults to find out more details related to
our segmentation, based on the logic of the K-Means algorithm. Finally, we add a
Matrix node and connect it with the Model nugget. The node is shown in Fig. 7.66
at the end of the stream.
We discussed how to use a Matrix node in Sect. 4.7. There we also explained
how to conduct a Chi-Square test of independence. Here, we assign the variable
“DEFAULTED” to the rows and the cluster number to the columns (see Fig. 7.68).
To find out details of the dependency, we enable the option “Percentage in column”
in the “Appearance” tab of the Matrix node (see Fig. 7.69).

Table 7.28 Cluster description of the customer segmentation


Cluster number as shown
in Fig. 7.63 Characteristics of customer segment
3 Young customers with an average age of 27 years and therefore a
relatively low number of years of employment (average 3.3 years).
Remarkable debt-income ratio of 10 %. This means 10 % of the
income is required to pay credit card or other debt.
1 Middle-aged or approximately 36–37 years old and with a very low
debt-income ratio of 6 % on average.
4 Older customers with 44 years and a long employment history.
Average debt-income ratio of 9 %.
2 Middle-aged customers but slightly older than those in cluster 1. In
contrast, the number of years of employment is lower than in
cluster 1. The average debt-income ratio is above 20 %.

Fig. 7.66 Final K-Means customer segmentation stream

As shown in Fig. 7.70, the relative frequency of customers for whom we have no
information regarding their default is approximately the same in each cluster. So
this information gap does not affect our judgement very much.
More surprisingly, we can see that the default rate in cluster 1 and cluster 4 is
relatively low; because of this, customers in cluster 1 are a good target group for the
bank. First private banking may generate profit, and the number of loans given to
them can be increased.
We described cluster 2 as the class of middle-aged customers with an average
debt-income ratio of above 20 %. These may be the customers who bought a house
or a flat and have large debts besides their credit card debt. The loss in case of a
default is high, but contrary to this the bank would be able to generate high profit.
Because of the very high default rate of above 43 %, however, we recommend separating
these customers and paying close attention to them.

" Cluster analysis identifies groups in data. To analyze the


characteristics of these groups, or the differences between groups,
the following approaches can be used:

Fig. 7.67 Records with the assigned cluster number

Fig. 7.68 Variables are assigned to columns and rows in the Matrix node

1. If the variables that should be used for further analysis are
nominal or ordinal, a Matrix node must be used. A Chi-square
test of independence can be performed here also.
2. Often it is also necessary to calculate several additional
measures, e.g., the mean of credit card debts per cluster.
Here, the Means node can be used to determine the averages
and additionally to perform a t-test or a one-way ANOVA. See
exercise 1 in Sect. 7.4.3 for details.
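The cross-tabulation produced by the Matrix node, including the column percentages and the Chi-square test of independence, can also be reproduced outside the Modeler. The following sketch assumes Python with pandas and SciPy and a DataFrame df that holds the scored records; the column names “cluster” and “DEFAULTED” are assumptions for the illustration.

import pandas as pd
from scipy.stats import chi2_contingency

def default_by_cluster(df):
    counts = pd.crosstab(df["DEFAULTED"], df["cluster"])
    col_pct = pd.crosstab(df["DEFAULTED"], df["cluster"],
                          normalize="columns") * 100       # "Percentage in column"
    chi2, p, dof, _ = chi2_contingency(counts)             # Chi-square test of independence
    return counts, col_pct.round(1), chi2, p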

Fig. 7.69 Percentage per column should be calculated in the Matrix node

Fig. 7.70 Result of Default analysis per cluster

The characteristics of cluster 3, with its young customers, its lower debt-income
ratio, and the default rate of 32 %, are different from cluster 2, where in some
specific cases we found it better to discontinue the business relationship. These
customers may default very often, but the loss to the bank is relatively low.
Probably, payments for loans with lower rates are delayed. The bank would do
very well to support this group because of the potential to generate future profit.

Summary
We separated groups of objects in a dataset using the K-Means algorithm. To do
this, we assessed the variables with regard to their adequacy for clustering purposes.
This process is more or less a decision of the researcher, based on knowledge of the
practical background. New measures must be calculated, however, to condense
information from different variables into one indicator.
The disadvantage of the algorithm is that the number of clusters must be defined
in advance. Using statistical measures, e.g., the silhouette, as well as practical
knowledge or expertise, we found an appropriate solution. In the end, the clusters
must be described by assessing their characteristics in the model viewer of the
K-Means nugget node. Knowledge of the practical background is critically impor-
tant for finding an appropriate description for each cluster.

7.4.3 Exercises

Exercise 1 Calculating means per cluster


Using the K-Means algorithm, we identified four groups of customers. Assessing
the results and characterizing the clusters, we found that customers in cluster 2 are
middle-aged with an average debt-income ratio of above 20 %. We speculated
about credit card and other debts. Probably, these customers bought a house or a flat
and have large “other” debts in comparison to their credit card debt.

1. Open the stream “customer_bank_segmentation_K_means.str” we created in


Sect. 7.4.2. Determine the average credit card debt and the average other debt,
using a Means node from the Output tab of the Modeler.
2. Assess the result and explain if cluster 2 does indeed have remarkably higher
other debts.

Exercise 2 Improving clustering by calculating additional measures


In Sect. 7.4.2, we created a stream for customer segmentation purposes. We
assessed all variables and defined a new debt-income ratio. The idea behind this
measure is that a customer with a higher income is able to service higher debts. In
the end, we can interpret the customer segmentation result; however, the silhouette
plot in Fig. 7.62 lets us assume that the fair model quality can be improved. Here,
we want to show how additional well-defined measures can improve the clustering
quality.

1. Open the stream “customer_bank_segmentation_K_means.str” we created in


Sect. 7.4.2. Assess the variables included in the original dataset.
2. Obviously, older customers also normally have longer employment history and
therefore also a higher number of years in employment. So the idea is to
calculate the ratio of years employed to the age of the customer. Add this new
variable (name “EMPLOY_RATIO”) to the stream.

3. Now update the clustering, using the new variable instead of the variables
“AGE” and “YEARSEMPLOYED” separately. Explain your findings using
the new variable.

Exercise 3 Comparing K-means and TwoStep results


In the previous section, we used the K-Means algorithm to create a stream for
customer segmentation purposes. In this exercise, we want to examine if the
TwoStep algorithm discussed in Sect. 7.3 leads to the same results.

1. Open the stream “customer_bank_segmentation_K_means.str” we created in


Sect. 7.4.2. Modify the stream so that a cluster model based on TwoStep will
also be calculated. Bear in mind the assumption that the TwoStep algorithm has
normally distributed values.
2. Now consolidate the results of both models so they can be analyzed together.
Use a Merge node.
3. Now add the necessary nodes to analyze the results. Also add nodes to show the
default rates per cluster, depending on the cluster method used.
4. Assess the clustering calculated by the TwoStep algorithm and compare the
results with the findings presented in Sect. 7.4.2, using K-Means.

Exercise 4 Clustering on PCA results


In Sect. 6, we discussed principal component analysis (PCA) and principal factor
analysis (PFA), as different types of factor analyses. PCA is used more often. The
factors determined by a Principal Component Analysis (PCA) can be described as
“a general description of the common variance”.
If the fluctuation of a set of variables is somehow similar, then behind these
variables a common “factor” can be assumed. The factors are the explanation
and/or reason for the original fluctuation of the input variables and can be used to
represent or substitute them in further analyses.
In this exercise, PCA and K-Means should be combined to find clusters of people
with homogeneous dietary characteristics.

1. Open the stream “pca_nutrition_habits.str”, created and discussed intensively in


Sect. 6.3. Save it under another name.
2. Recap the aim of factor analysis and especially describe the meaning of the
factor scores. Furthermore, outline the aim of cluster analysis in your own words
and explain the difference from factor analysis in general.
3. Based on the factor scores, a cluster analysis should now be performed. In the 2D
plot of factor scores in Fig. 6.43, we found out that three clusters are most
appropriate for describing the structure of the customers based on their dietary
characteristics.
Now please extend the stream to perform the cluster analysis using K-Means.
Show the result in an appropriate diagram.
4. The factor scores, or more accurately the principal component scores, express
the input variables in terms of the determined factors, by reducing the amount of

information represented. The loss of information depends on the number of


factors extracted and used in the formula.
The factor scores are standardized. So they have a mean of zero and a standard
deviation of one.
Other multivariate methods, such as cluster analysis, can be used based on the
factor scores. The reduced information and the reduced number of variables
(factors) can help more complex algorithms to converge or to converge faster.
Outline why it could be helpful and sometimes necessary to use PCA, and then
based on the factor scores, a clustering algorithm. Create a “big picture” to
visualize the process of how to decide when to combine PCA and K-Means or
TwoStep.

Exercise 5 Determining the optimal number of clusters


K-Means does not determine the number of clusters to use for an optimal model. In
Sect. 7.3.1, we explained the theory of clustering and discussed several methods for
determining rules or criteria that can help to solve this problem. Table 7.25 shows
different approaches.
To measure the goodness of a clustering, we can determine the average
distance of each object from the other objects in the same cluster and the average
distance from the other clusters. Based on these values, the silhouette S can be
calculated. More details can be found in Struyf et al. (1997), p. 5–7.
As outlined in IBM (2015b), p. 77 and IBM (2015b), p. 209, the IBM SPSS
Modeler calculates a Silhouette Ranking Measure based on this silhouette value. It
is a measure of cohesion in the cluster and separation between the clusters.
Additionally, it provides thresholds for poor (up to +0.25), fair (up to +0.5), and good (above +0.5) models. See also Table 7.26.
The SPSS Modeler shows the silhouette value in the model summary of the
Model Viewer on the left. To get the precise silhouette value, the option “Copy
Visualization Data" can be used. To do this, the second button from the left in the upper part of the Model Viewer should be clicked. Then the copied values must be pasted into simple word processing software. Table 7.26 shows the result.
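For readers who want to reproduce the measure outside the Modeler: for each object i, with a(i) the average distance to the other objects in its own cluster and b(i) the smallest average distance to the objects of another cluster, the silhouette is s(i) = (b(i) - a(i))/max(a(i), b(i)), and the reported value is the average over all objects. The following minimal Python sketch computes this average silhouette with scikit-learn; the file and column names are placeholders, not part of any stream in this book.

```python
# Minimal sketch: average silhouette value of a K-Means solution computed
# outside the Modeler with scikit-learn. File and column names are hypothetical.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

df = pd.read_csv("customers.csv")                 # hypothetical input file
X = df[["var_1", "var_2"]].values                 # hypothetical input variables

labels = KMeans(n_clusters=4, random_state=0).fit_predict(X)

# +1 = well separated, around 0 = on the border, negative = probably misassigned
print("average silhouette:", silhouette_score(X, labels))
```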
The aim of this exercise is to use the silhouette value to assess the dependency
between the silhouette value and the number of clusters or to find an appropriate
number of clusters when using the K-Means algorithm.

1. The data can be found in the SPSS Statistics file “nutrition_habites.sav”. The
Stream “Template-Stream_nutrition_habits” uses this dataset.
Please open the template stream. Save the stream under another name.
Using the dietary types, “vegetarian”, “low meat”, “fast food”, “filling”, and
“hearty”, the consumers were asked “Please indicate which of the following
dietary characteristics describes your preferences. How often do you eat . . .”.
The respondents had the chance to rate their preferences on a scale “(very)
often”, “sometimes”, and “never”. The variables are coded as follows:
"1=never", "2=sometimes", and "3=(very) often". See also Sect. 10.1.24.

The K-Means clustering algorithm should be used to determine consumer


segments, based on this data. Add a K-Means node to the stream.
2. Now the dependency of the silhouette value and the number of clusters should be
determined. Create a table and a diagram that shows this dependency, using
spreadsheet software, e.g., Microsoft Excel. Start with two clusters.
3. Explain your findings and determine an appropriate number of clusters, also
keeping the background of the data in mind.
4. OPTIONAL: Repeat the steps by using the TwoStep algorithm and explain your
findings.

7.4.4 Solutions

Exercise 1 Calculating means per cluster


Name of the solution streams: customer_bank_segmentation_K_means_extended_1, and Microsoft Excel file with an ANOVA in "customer_bank_data_ANOVA.xlsx"
Theory discussed in section: Sect. 7.4.2

1. As described in the exercise, we add a Means node to the existing stream and
connect it with the Model nugget node. Figure 7.71 shows the extended stream,
and Fig. 7.72 shows the parameters of the Means node. Running the node, we get
a result as shown in Fig. 7.73.
2. The results in Fig. 7.73 confirm our suspicion that customers in cluster 2 have
high other debts. The Means node performs a t-test in the case of two different clusters and a one-way ANOVA if there are more than two clusters or groups.
See IBM (2015c), p. 298–300. The test tries to determine if there are differences
between the means of several groups. In our case, we can find either 100 % or

Fig. 7.71 Means node is added to the stream



Fig. 7.72 Parameters of the Means node

Fig. 7.73 Averages between clusters in the Means node



practically a 0 % chance that the means are the same. The significance levels can be defined in the Options tab of the Means node. More detailed statistics can also be found in the Microsoft Excel file "customer_bank_data_ANOVA.xlsx".
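As a purely illustrative aside, a comparable test can be run outside the Modeler. The sketch below performs a one-way ANOVA per variable across the clusters with SciPy; it is not the Means node's implementation, and the file name and the column names "OTHERDEBT" and "cluster" are assumptions.

```python
# Illustrative sketch (not the Means node itself): one-way ANOVA testing whether
# the mean of a variable differs between the clusters. Names are hypothetical.
import pandas as pd
from scipy import stats

df = pd.read_csv("clustered_customers.csv")   # hypothetical export of the stream

groups = [g["OTHERDEBT"].values for _, g in df.groupby("cluster")]
f_stat, p_value = stats.f_oneway(*groups)

# a small p-value indicates that at least one cluster mean differs from the others
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
```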

Exercise 2 Improving clustering by calculating additional measures


Name of the solution streams: customer_bank_segmentation_K_means_extended_2
Theory discussed in section: Sect. 7.4.2

1. Using the right Table node in the final stream, depicted in Fig. 7.66, we get the variables included in the original dataset and the cluster numbers as shown in Fig. 7.74. The variables of interest here are "AGE" and "YEARSEMPLOYED".
2. To add a Derive node to the stream, we remove the connection between the
Derive node for the “DEBTINCOMERATIO” and the second Type node. The
name of the new variable is “EMPLOY_RATIO” and the formula
“YEARSEMPLOYED/AGE”. This is also shown in Fig. 7.75.
We connect the new Type node with the rest of the stream. Figure 7.76 shows
the final stream and the new Derive node in its middle.
3. Now we update the clustering node by removing the variables “AGE” and
“YEARSEMPLOYED”. Subsequently, we add the new variable
“EMPLOY_RATIO” (see Fig. 7.77).

Running the stream, we can see in the model viewer that the quality of the
clustering could be improved, based on assessment of the silhouette plot in
Fig. 7.78. In comparison to the previous results presented in Fig. 7.62, cluster 1 is
larger with 37.4 % (previously 31.5 %), whereas all other clusters are smaller.
Using ratios or calculated measures can improve the quality of clustering, but the
result may be harder to interpret because of the increased complexity in the new
calculated variable. Additionally, the segmentation can be totally different, as we
can see here: although the percentage of records per cluster in Fig. 7.78 lets us
assume that there are probably some records now assigned to other clusters,
detailed analysis shows another picture.
Summarizing the results in Figs. 7.79 and 7.80, we can find two interesting
groups, in terms of risk management. The younger customers in cluster 2 have a
remarkable debt-income ratio and a default ratio of 28 %. Every second customer
assigned to cluster 4 defaulted in the past. The customers here are older and have a

Fig. 7.74 Variables and cluster numbers



Fig. 7.75 Parameters of the added Derive node

Fig. 7.76 Stream is extended with the added variable “EMPLOY_RATIO”

Fig. 7.77 Parameters of the updated K-Means node



Fig. 7.78 New cluster result overview

Fig. 7.79 Details per cluster in the Model Viewer



Fig. 7.80 Default rate per cluster

lower debt-income ratio of 17 %. Clusters 1 and 3 consist of customers that are good
targets for new bank promotions.

Exercise 3 Comparing K-means and TwoStep results


Name of the solution streams: customer_bank_segmentation_clustering_comparison
Theory discussed in sections: Sect. 7.3 and Sect. 7.4

In the given stream, we used “AGE”, “YEARSEMPLOYED”, and


“DEBTINCOMERATIO” as input variables. The TwoStep algorithm expects
these variables to be normally distributed. Therefore, we have to assess and transform them using a Transform node. The detailed procedure is described in Sect. 3.2.5, as well as in exercise 9 in Sects. 3.2.8 and 3.2.9.
At the end of the exercise, we can verify whether using the Euclidean distance measure helps us to produce better clustering results, since this measure avoids the normality assumption required by the log-likelihood distance measure.
We did not modify the variables mentioned above for the K-Means node,
because interpreting the clustering results is much easier.
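As an illustration of what the Transform node and the SuperNode accomplish, the sketch below checks the skewness of the three input variables and applies a non-standardized log10 transformation in Python. The input file is a placeholder, and the offset of +1 (to avoid taking the logarithm of zero) is an assumption, not necessarily the exact formula generated by the Modeler.

```python
# Rough analogue of the Transform node / SuperNode step: inspect skewness and
# log-transform right-skewed variables. File name and the +1 offset are assumptions.
import numpy as np
import pandas as pd

df = pd.read_csv("customer_bank_data.csv")

for col in ["AGE", "YEARSEMPLOYED", "DEBTINCOMERATIO"]:
    print(col, "skewness before:", round(df[col].skew(), 3))

# non-standardized log10 transformation of the two skewed variables
for col in ["AGE", "DEBTINCOMERATIO"]:
    df[col + "_Log10"] = np.log10(df[col] + 1)
    print(col, "skewness after:", round(df[col + "_Log10"].skew(), 3))
```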

1. First we open the stream “customer_bank_segmentation_K_means” created in


Sect. 7.4.2 that we need to modify here. We remove the three nodes at the end
of the stream, depicted in Fig. 7.81. Then we add a Transform node and a Data
Audit node (Fig. 7.82).

Fig. 7.81 Original K-Means stream

Fig. 7.82 Transform and Data Audit nodes are added

2. As described in detail in Sect. 3.2.5, we assess the distribution of the variables


“AGE”, “YEARSEMPLOYED”, and “DEBTINCOMERATIO” in the Trans-
form node and the skewness in the Data Audit node. To do this, we add the
three variables in the Transform node (see Fig. 7.83).
3. The assessment and the steps for adding a SuperNode, for the transformation of
“AGE” and “DEBTINCOMERATIO”, can be found in Sect. 3.2.5 as well as in
exercise 9 in Sect. 3.2.7. Figure 7.84 shows the main aspects. We use a
non-standardized transformation and create the SuperNode.

Fig. 7.83 Parameters of the Transform node

Fig. 7.84 Distribution analysis in the Transform node

4. Now we are able to use the transformed variable for clustering purposes in a
TwoStep node, so we connect the SuperNode with the original Type node. To
make sure the scale definitions for the transformed variables are correct, we
must add another Type node after the SuperNode. Finally, we can add a
TwoStep node. All these steps are shown in Fig. 7.85.
5. Additionally, we have to change the number of clusters the TwoStep node
should determine. The number of clusters should be four, as shown in Fig. 7.86.
Normally, TwoStep would offer only three clusters here.
6. We run the TwoStep node to get its model nugget.
7. To analyze the results of both streams, we must merge the results of both
sub-streams. We add a Merge node to the stream from the Record Ops tab (see
Fig. 7.87).
8. As outlined in Sect. 2.7.9, it is important to disable duplicates of variables in the settings of this node. Otherwise the stream will no longer work correctly. Figure 7.88 shows that we decided to remove the variables coming from the K-Means sub-stream section (a programmatic sketch of this merging and comparison step follows after this list).

Fig. 7.85 Clustering stream is extended

Fig. 7.86 Parameters of the TwoStep node are modified



Fig. 7.87 Stream with added Merge node for comparing K-Means and TwoStep results

9. Theoretically, we could now start to add the typical nodes for analyzing the
data, but as we have a lot of different variables, reordering them is helpful.
Therefore, we add a Field Reorder node from the Field ops tab (see Fig. 7.89).
Figure 7.90 shows the parameters of this node.
10. Finally, we add a Table node, a Data Audit node, and two Matrix nodes. The
parameters of the matrix node are the same as shown in Figs. 7.68 and 7.69, as
well as the TwoStep results saved in the variable “$T-TwoStep”.
11. The Table node in Fig. 7.91 shows us that probably only the cluster names have
been rearranged.
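The merging and comparison carried out in steps 7-11 can also be sketched programmatically, as mentioned in step 8. The pandas snippet below assumes that both sub-streams have been exported with the records in the same order; the K-Means label column name "$KM-K-Means" is an assumption about the Modeler's naming, while "$T-TwoStep" follows the text above.

```python
# Programmatic analogue of steps 7-11: combine the cluster labels of both models
# and cross-tabulate them. Files are hypothetical exports of the two sub-streams.
import pandas as pd

km = pd.read_csv("kmeans_result.csv")    # assumed to contain "$KM-K-Means"
ts = pd.read_csv("twostep_result.csv")   # assumed to contain "$T-TwoStep"

# keep the duplicated input variables only once, as in the Merge node settings
merged = pd.concat([km, ts["$T-TwoStep"]], axis=1)

# contingency table of the two cluster assignments
print(pd.crosstab(merged["$KM-K-Means"], merged["$T-TwoStep"]))
```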

Using the Data Audit node, however, we can find that the frequency distributions
that are dependent on cluster numbers are slightly different for both methods.
Comparing the distributions in Fig. 7.92 for K-Means and Fig. 7.93 for TwoStep,
we can see differences in the size of the clusters.
Figure 7.94 shows once more the default rates for K-Means in a Matrix node,
also previously analyzed in Fig. 7.70. As we explained in Sect. 7.4.2, analyzing the
percentage per column is useful for checking whether or not the default rates are independent of the cluster numbers. In Figs. 7.94 and 7.95, we can see that this holds for both methods.
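The statistical question behind this column-percentage comparison is a chi-square test of independence between cluster membership and the default flag. Purely as an outside-the-Modeler illustration, the following sketch runs this test with SciPy on a hypothetical export of the clustered records; the column names are assumptions.

```python
# Sketch of the chi-square test of independence behind the Matrix node analysis:
# are default rates independent of the cluster assignment? Names are hypothetical.
import pandas as pd
from scipy.stats import chi2_contingency

df = pd.read_csv("clustered_customers.csv")

table = pd.crosstab(df["cluster"], df["DEFAULTED"])
chi2, p_value, dof, expected = chi2_contingency(table)

# a small p-value means defaults and cluster membership are not independent
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}")
```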

Fig. 7.88 Settings of the Merge node for removing duplicated variables

Inspecting the details of the distribution for each variable, Fig. 7.96 shows us, in
comparison with Fig. 7.64, that the importance of the variable
“YEARSEMPLOYED” has increased. Previously, this variable was ranked as
number 2.
Summarizing all our findings that are also shown in Fig. 7.97 for the TwoStep
algorithm, we can say that the usage of different methods leads to significantly
different results in clustering. To meet the assumption of normally distributed variables, TwoStep requires us to transform the input variables, but even then they do not exactly satisfy this assumption.

Fig. 7.89 Final Stream to compare K-Means and TwoStep results

Fig. 7.90 Parameters of the Field Reorder node

Fig. 7.91 Cluster numbers in the Table node

Fig. 7.92 Frequency per cluster determined by K-Means

Fig. 7.93 Frequency per cluster determined by TwoStep

Fig. 7.94 Default analysis per cluster, based on K-Means



Fig. 7.95 Default analysis per cluster, based on TwoStep, with log-likelihood distance measure

Fig. 7.96 Predictor importance in the TwoStep model viewer

As outlined in Sect. 7.3.2, we can use the Euclidean distance measure instead.
Activating this distance measure in the TwoStep node dramatically decreases the
quality of the clustering result. Here, we get a silhouette measure of 0.2 and one
very small cluster that represents 3.9 % of the customers. These results are unsatisfactory. We therefore decided to use the log-likelihood distance measure, even though we cannot completely meet the assumption of normally distributed variables here.

Fig. 7.97 Detailed analysis of the TwoStep result in the Model Viewer

To decide which model is the best, we need to assess the clusters by inspecting in
detail the records assigned. This is beyond the scope of this book. The higher
importance of “YEARSEMPLOYED” seems to address the risk aspect of the model
better than the higher ranked “AGE” using K-Means, but the smaller differences in
“DEBTINCOMERATIO” per cluster for TwoStep separate the subgroups less
optimally.

Exercise 4 Clustering based on PCA results


Name of the solution streams: k_means_nutrition_habits
Theory discussed in sections: Sect. 6.3 and Sect. 7.4.2

1. The solution can be found in the stream “k_means_nutrition_habits”.


2. The user tries to determine factors that can explain the common variance of
several subsets of variables. Factor analysis can be used to reduce the number of
variables.

Fig. 7.98 Factor Scores defined by the PCA

Fig. 7.99 Extended stream “pca_nutrition_habits.str”

The factor scores, or more accurately the principal component scores, shown
in Fig. 7.98, express the input variables in terms of the determined factors, by
reducing the amount of information represented. The loss of information
depends on the number of factors extracted and used in the formula.
The reduced information and the reduced number of variables (factors) can
help more complex algorithms, such as cluster analysis, to converge or to
converge faster.
Cluster analysis represents a class of multivariate statistical methods. The aim is to identify subgroups/clusters of objects in the data. Each given object will be assigned to a cluster based on similarity or dissimilarity/distance measures.
So the difference between, let's say, PCA and K-Means is that PCA helps us to reduce the number of variables, whereas K-Means reduces the number of objects we have to look at, by defining consumer segments and their nutrition habits.
3. Figure 7.99 shows the extended stream. First we added a Type node on the right.
This is to ensure the correct scale type is assigned to the factor scores defined by
the PCA/Factor node.

Fig. 7.100 Parameters in the Field tab of the K-Means node

Figures 7.100 and 7.101 show the parameters of the K-Means node. First we
defined the factor scores to be used for clustering purposes, and then we defined
the number of clusters as three. This is explained in Sect. 6.3.2 and shown in
Fig. 6.43.
Figure 7.102 shows the results of the K-Means clustering. Judging from the
silhouette plot, the model is of good quality. The determined clusters are shown
in a 2D-plot using a Plot node. Figure 7.103 shows the parameters of this node. Here, we used the size of the bubbles to visualize the cluster number, due to printing restrictions in this book. There is of course the option to use different colors instead.
As expected and previously defined in Fig. 6.43, however, K-Means finds
three clusters of respondents (see Fig. 7.104). The cluster description can be
found in Table 6.1.
4. Clustering data is complex. The TwoStep algorithm is based on a tree, to manage
the complexity of the huge number of proximity measures and to make
comparisons of the different objects. K-Means is based on a more pragmatic
approach that determines the cluster centers before starting to cluster.
In cluster analysis, however, using either approach can often lead to computer
performance deficits. Here, the PCA can help to reduce the number of variables
or more to the point determine the variables that are most important. Figure 7.105
shows the process for combining PCA and clustering algorithms. Using original
data to identify clusters will lead to precise results, but PCA can also help with
using these algorithms, in the case of large datasets. Details can be found in Ding
and He (2004).
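The process of Fig. 7.105 can be summarized in a few lines of code. The following scikit-learn sketch is only a rough analogue of the PCA/Factor and K-Means nodes used in this exercise; the input file is a placeholder, and the choice of two components and three clusters follows the discussion above.

```python
# Illustrative sketch of the PCA -> K-Means process (not the Modeler's nodes):
# reduce the dietary variables to factor scores, then cluster the respondents.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

df = pd.read_csv("nutrition_habits.csv")          # hypothetical export of the data

X = StandardScaler().fit_transform(df.values)     # standardize the input variables

scores = PCA(n_components=2).fit_transform(X)     # step 1: factor/component scores
labels = KMeans(n_clusters=3, random_state=0).fit_predict(scores)  # step 2: clusters

print(pd.Series(labels).value_counts())           # cluster sizes
```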

Fig. 7.101 Parameters in the Model tab of the K-Means node

Fig. 7.102 Results of the K-Means algorithm shown in the model viewer

Fig. 7.103 Parameters of the Plot node to visualize the results of K-Means

Fig. 7.104 Clusters of respondents based on their dietary habits



Fig. 7.105 Process for combining PCA and cluster algorithms

Exercise 5 Determining the optimal number of clusters


Name of the solution files: Stream "kmeans_cluster_nutrition_habits.str" and Microsoft Excel file "kmeans_cluster_nutrition_habits.xlsx"
Theory discussed in sections: Sect. 7.3.1 (Table 7.25) and Sect. 7.4.2

Remark
In this exercise, we specify the number of clusters to determine manually and fit a
model for each new cluster number. In the solution to exercise 1 in Sect. 7.5.4, we
demonstrate how to use the Auto Cluster node for the same procedure.

1. We open the stream “Template-Stream_nutrition_habits” and save it under the


new name “kmeans_cluster_nutrition_habits.str”. Before we start to add the
K-Means node, we should check the scale type defined in the Type node (see
Fig. 7.106).
Now we can add a K-Means node to the stream. To be sure the correct
variables are used for the cluster analysis, we can add them in the Fields tab of
the K-Means node (see Fig. 7.107).
2. We should determine the dependency between the number of clusters and the
silhouette value. To do that, we start with two clusters as shown in Fig. 7.108.
Running the K-Means node, we get the final stream with the model nugget as
depicted in Fig. 7.109.
By double-clicking on the model nugget, we get a model summary in the
Model Viewer, as in Fig. 7.110.
The quality of clustering is fair, judging from the silhouette plot. As described
in Sect. 7.3.3, we get the precise silhouette value by using the button “Copy
Visualization Data”, highlighted with an arrow in Fig. 7.110. We paste the

Fig. 7.106 Scale type settings in the Type node

Fig. 7.107 Fields tab of the K-Means node

copied data into a text processing application, e.g., Microsoft Word. Here, the
silhouette value is 0.4395.
There is no need to assess the clusters here in detail. We are actually only
interested in the silhouette values depending on the number of clusters. So we

Fig. 7.108 Model tab of the K-Means node

Fig. 7.109 Final stream “kmeans_cluster_nutrition_habits.str”

repeat the procedure for all other cluster numbers from 3 to 13. In the solution to
exercise 1 in Sect. 7.5.4, we demonstrate how to use the Auto Cluster node for
the same procedure.
Table 7.29 shows the dependency of silhouette measure vs. the number of
clusters determined. Figure 7.111 shows the 2D-plot of the data.
3. In the solution to exercise 3 in Sect. 7.3.5, we found that five clusters are difficult
to characterize. The graph for K-Means tells us that there is an approximately

Fig. 7.110 Summary of K-Means clustering in the Model Viewer

Table 7.29 Dependency of silhouette measure and the number of clusters determined by
K-Means
Number of clusters Silhouette measure of cohesion and separation
2 0.4395
3 0.5773
4 0.5337
5 0.7203
6 0.7278
7 0.8096
8 0.8494
9 0.8614
10 0.9116
11 0.9658
12 0.9708
13 1.0000

linear dependency between the number of clusters and the silhouette value.
Using a simple regression function, we find that with each additional cluster,
the quality of the clustering improves by 0.0494, in terms of the silhouette
measure.

Fig. 7.111 Graph of silhouette measure vs. number of clusters determined by K-Means

Fig. 7.112 Graph of silhouette measure vs. number of clusters determined by TwoStep

Reducing the number of clusters from five to four, however, results in a very
low silhouette measure of 0.5337. So when using K-Means, the better option is
to determine three clusters.
4. Figure 7.112 shows results from using the TwoStep algorithm for clustering the
data. The average difference in the silhouette value is 0.0525 when increasing
the number of clusters by one. The linear character of the curve is clear. Here, we
can modify the number of clusters based also on the background of the applica-
tion of the clustering algorithm. As suggested in exercise 3 of Sect. 7.3.5, we can
also use three or four clusters, if we think this is more appropriate.

7.5 Auto Clustering

7.5.1 Motivation and Implementation of the Auto Cluster Node

General motivation
In the previous sections, we intensively discussed using the TwoStep and K-Means
algorithms to cluster data. An advantage of TwoStep implementation in the SPSS
Model is its ability to identify the optimal number of clusters to use. The user will

get an idea of which number will probably best fit the data. Although K-Means is
widely used, it does not provide this convenient option.
Based on practical experience, we believe the decision on how many clusters to
determine should be made first. Additionally, statistical measures such as the
silhouette value give the user the chance to assess the goodness of fit of the
model. We discussed the dependency of the number of clusters and the clustering
quality in terms of silhouette value in more detail in exercise 5 of Sect. 7.4.3.
Determining different models with different cluster numbers, and assessing the
distribution of the variables within the clusters, or profiling the clusters, leads
eventually to an appropriate solution.
In practice, realizing this process takes a lot of experience and time. Here, the
idea of supporting the user by offering an Auto Cluster node seems to be a good one,
especially if different clustering algorithms will be tested. We will show how to
apply this node here and then summarize our findings.

Implementation details
The Auto Cluster node summarizes the functionalities of the TwoStep, the
K-Means node, and a node called Kohonen. The functionalities of the TwoStep
as a hierarchical agglomerative algorithm are intensively discussed in Sect. 7.3. The
K-Means algorithm and its implementation are also explained in Sect. 7.4.
The Auto Cluster node also uses the partitioning K-Means clustering algorithm. Here too, the user must define the number of clusters in advance. Models can be selected based on several goodness criteria.
Kohonen is the only algorithm that we have not discussed so far in detail.
Table 7.6 outlines some details. In this special type of neural network, an unsupervised learning procedure is performed. So here no target variable is necessary.
Input variables defined by the user build an input vector. This input vector is then
presented to the input layer of a neural network. This layer is connected to a second
output layer. The parameters in this output layer are then adjusted using a learning
procedure, so that they learn the different patterns included in the input data.
Neurons in the output layer that are unnecessary are removed from the network.
After this learning procedure, new input vectors are presented to the model. The
output layer tries to determine a winning neuron that represents the most similar
pattern previously learned.
The procedure is also depicted in Fig. 7.113. The output layer is a
two-dimensional map. Here, the winning neuron is represented by its coordinates

Fig. 7.113 Visualization of Kohonen’s SOM algorithm used for clustering purposes

X and Y. The different combinations of the coordinates of the winning neuron are
the categories or the clusters recognized by the algorithm in the input data.
Interested readers are referred to Kohonen (2001) for more details.
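To make this idea more concrete, the following toy sketch implements a very small self-organizing map in NumPy. It is a didactic simplification and not the Kohonen node's implementation: a 3 x 3 output layer is trained on random example vectors, and the coordinates of the winning neuron serve as the cluster label.

```python
# Toy self-organizing map (SOM) in NumPy, illustrating the mechanism described
# above. This is a didactic sketch, not the Kohonen node's implementation.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))            # 200 hypothetical input vectors, 5 features

w, h, d = 3, 3, X.shape[1]               # 3 x 3 output layer
weights = rng.normal(size=(w, h, d))     # one weight vector per output neuron
grid = np.array([[i, j] for i in range(w) for j in range(h)])   # neuron coordinates

for epoch in range(20):
    lr = 0.5 * (1 - epoch / 20)          # decaying learning rate
    radius = 0.5 + 1.5 * (1 - epoch / 20)
    for x in X:
        # winning neuron = neuron whose weight vector is closest to x
        dist = np.linalg.norm(weights.reshape(-1, d) - x, axis=1)
        win = grid[np.argmin(dist)]
        # pull the winner and its grid neighbours towards x
        influence = np.exp(-np.sum((grid - win) ** 2, axis=1) / (2 * radius ** 2))
        weights += (lr * influence).reshape(w, h, 1) * (x - weights)

# the cluster label of a new vector is given by the coordinates of its winner
x_new = rng.normal(size=d)
dist = np.linalg.norm(weights.reshape(-1, d) - x_new, axis=1)
print("winning neuron (X, Y):", tuple(grid[np.argmin(dist)]))
```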

" The Kohonen network is the implementation of an unsupervised


learning algorithm, in the form of a neural network. Input vectors
of an n-dimensional space are mapped onto a two-dimensional
output space. This is called a “self-organizing” map.

" The network tries to learn patterns included in the input data. After-
wards, new vectors can be presented to the algorithm, and the
network determines a winning neuron that represents the most
similar pattern learned. The different combinations of the coordinates
of the winning neuron equal the number or the name of the cluster.
The number of neurons can be determined by restricting the width
and the length of the output layer.

" The Auto Cluster node offers the application of TwoStep, K-Means,
and Kohonen node functionalities for use with data. TwoStep and the
Kohonen implementations in the SPSS Modeler determine the "optimal" number of clusters automatically. For K-Means, the user must choose the number of clusters to determine.

" Using the Auto Cluster node allows the user steer three algorithms at
the same time. Several options allow the user to define selection
criterions for the models tested and presented.

7.5.2 Building a Model in SPSS Modeler

We would like to show the Auto Cluster node in use, by clustering a dataset
representing diabetes data from a Pima Indian population near Phoenix, Arizona.
A detailed description of the variables included in the dataset can be found in
Sect. 10.1.8.
Using several variables from this dataset, we should be able to make predictions
about the individual risk of suffering from diabetes. Here, we would like to cluster
the population, so we look for typical characteristics. It is not the aim of this section
to go into the medical details. Rather, we would like to identify the best possible
algorithm for clustering the data and identify the advantages and the disadvantages
of the SPSS Modeler’s Auto Cluster node.
As we extensively discussed in Sect. 7.3.2, and in the solution to exercise 3 in
Sect. 7.4.4, the TwoStep algorithm also implemented in the Auto Cluster node
needs normally distributed variables. Using the Euclidian distance measure instead

of the log-likelihood measure seems to be a bad choice too, as shown at the end of
exercise 3 in Sect. 7.4.4. That’s because we want to create a stream for clustering
the data based on the findings in Sect. 3.2.5 “Concept of ‘SuperNodes’ and
Transforming Variable to Normality”. Here, we assessed the variables
“glucose_concentration”, “blood_pressure”, “serum_insulin”, “BMI”, and
“diabetes_pedigree” and transformed them into normal distribution.

Remark
The solution can be found in the stream named “cluster_diabetes_auto.str”.

1. We open the stream “transform_diabetes” and save it under another name.


Figure 7.114 shows the initial stream.
2. The stream offers the option of scrolling through the records using the
Table node on the right, as well as assessing the scale type and the frequency
distribution in the Data Audit node, also on the right. As we can see in
Fig. 7.115, the variable in the column “test result” (named “class_variable”)
shows us the medical test results. The variable is binary, as depicted in
Fig. 7.116. Based on this variable, we can verify the cluster quality by deter-
mining the frequency distribution of class_variable/test result in the different
clusters.
3. To use the transformed variables in the Auto Cluster node later, we must add a
Type node behind the SuperNode (see Fig. 7.117). In the Type node, no
modifications are necessary.
4. We now have to define the variables used for clustering purposes, however. The
class_variable that represents the test result as the target variable does not have
to be included as an input variable in the cluster node. That’s because cluster
analysis is an unsupervised learning procedure.
We add an Auto Cluster node from the Modeling tab of the SPSS Modeler
and connect it with the Type node (see Fig. 7.117). As we do not want to use the

Fig. 7.114 Initial template stream “transform_diabetes”



Fig. 7.115 Sample records in the Table node

Fig. 7.116 Frequency distributions in the Data Audit node

Fig. 7.117 An Auto Cluster node is added to the stream



Fig. 7.118 Fields tab in the Auto Cluster node

settings of the Type node for the role of the variable, we open the Auto Cluster
node by double-clicking on it.
5. In the fields tab of the Auto Cluster node, we activate the option “Use custom
settings" and determine the "class_variable" as the evaluation variable (see Fig. 7.118).
6. By analyzing the meaning of the variables found in Sect. 10.1.8 we can state
that the following variables should be helpful for determining segments of
patients according to their risk of suffering from diabetes:

– transformed plasma glucose concentration in an oral glucose tolerance test


(“glucose_concentration_Log10”),
– diastolic blood pressure (mm Hg, “blood_pressure”),
– transformed 2-Hour serum insulin (mu U/ml, “serum_insulin_Log10”)
– transformed body mass index (weight in kg/(height in m)^2, "BMI_Log10"), and
– transformed diabetes pedigree function (DBF, “diabetes_pedigree_Log10”).

This result is based on different pre-tests. We can also imagine using the
other variables in the dataset. The variable settings in the Auto Cluster node in
Fig. 7.118, however, should be a good starting point.
7. We don't have to modify any options in the Model tab of the Auto Cluster node
(see Fig. 7.119). It is important to note, however, that the option “Number of

Fig. 7.119 Model tab in the Auto Cluster node

models to keep” determines the number that will be saved and presented to the
user later in the model nugget.

" The option “number of models to keep” determines the number of


models that will be saved. Depending on the criteria for ranking the
models, only this number of models will be presented to the user.

8. The Expert tab (Fig. 7.120) gives us the chance to disable the usage of
algorithms and determine stopping rules, in the case of huge datasets. We
want to test all three algorithms, TwoStep, K-Means, and Kohonen, as outlined
in the introduction to the section.
We must pay attention to the column "Model parameters" though. For the K-Means algorithm, we need to define the number of clusters to identify.
We click on the “default” text in the second row, which represents the
K-Means algorithm (see Fig. 7.120). Then we use the “Specify” option, and a
new dialog window will open (see Fig. 7.121). Here, we can determine the
number of clusters. As we have patients that tested both positive and negative,
we are trying to determine two clusters in the data. We define the correct
number as shown in Fig. 7.122. After that, we can close all dialog windows.
9. We can start the clustering by clicking on “Run”. We then get a model nugget,
as shown in the bottom right corner of Fig. 7.123.

Fig. 7.120 Expert tab in the Auto Cluster node

Fig. 7.121 Parameters of the K-Means algorithm in the Auto Cluster node

Fig. 7.122 Number of clusters that should be determined by K-Means



Fig. 7.123 Model nugget of the Auto Cluster node

Fig. 7.124 Model results in the Auto Cluster node

10. We double click on the model nugget. As we can see in Fig. 7.124, three models
are offered. The K-Means algorithm determines the model with the best
silhouette value of 0.431. The model has two clusters that are equal to the
number of different values in the “class_variable”, which represents the test
result "0 = negative test result" and "1 = positive test result" for diabetes. This
is also true for the model determined by TwoStep.
11. We can now decide which model should be assessed. We discussed the
consequences of the variable transformation towards normality for the cluster-
ing algorithms in exercise 3 of Sect. 7.4.4. To outline here the consequences
once more in detail, we will discuss the TwoStep model. We enable it in the
first column in the dialog window shown in Fig. 7.124.
As we know from discussing TwoStep and the K-Means in Sects. 7.3 and
7.4, we can assess the cluster by double-clicking on the symbol in the second
column, as in Fig. 7.124.
Figures 7.125 and 7.126 show details from the TwoStep model. The model
quality is fair, and the averages of the different variables in the clusters are

Fig. 7.125 Overview of results for TwoStep clustering in the Model Viewer

Fig. 7.126 Model results from TwoStep clustering in the Model Viewer

different. Using the mouse, we can see that this is also true for
“diabetes_pedigree”, with 0.42 in cluster 1 on the left and 0.29 in cluster
2 on the right. The importance of the predictors is also good.

" In the Auto Cluster nugget node, the determined models will be listed
if they meet the conditions defined in the “Discard” dialog of the Auto
Cluster node.

" There are many dependencies between the parameters regarding the
models to keep, to determine, and to discard; for instance, when the
determined number of models to keep is three and the number of
models defined to calculate in K-Means is larger. Another example is
when models with more than three clusters should be discarded, but
in the K-Means node, the user defines models with four or more
clusters to determine. So, if not all expected models are presented
in the model nugget, the user is advised to verify the restrictions
“models to keep” vs. the options in the “discard section”.

" A model can be selected for usage in the following calculations of the
stream, by activating it in the first column of the nugget node.

12. The variable "test_result", also shown as a predictor on the right in Fig. 7.126, is definitely not a predictor. This is obviously an incorrect representation of the settings in the Modeler and not caused by inappropriate parameters in Fig. 7.118, where we excluded the variable "test_result". In our model, it is the evaluation variable. If we close the Model Viewer and click on the first column of Fig. 7.124, we get a bar chart as shown in Fig. 7.127. Obviously, the patients that tested positive are assigned to cluster 2. So the classification is not very good, but satisfactory enough.
13. Finally, we can verify the result by adding a Matrix node to the stream and
applying a chi-square test of independence to the result. Figure 7.128 shows the
final stream. Figure 7.129 shows the parameters of the Matrix node, using a
cross tabulation of the cluster result vs. the clinical test result. The resulting
chi-square test of independence in Fig. 7.130 proves our result: the cluster
results are not independent from the clinical test result.
14. We found out so far that the Auto Cluster node offers a good way to apply
different clustering algorithms in one step. K-Means and TwoStep algorithms
seem to produce fair clustering results for the given dataset. By modifying the
Auto Cluster node parameters, we want to hide models, such as the Kohonen
model, which are not appropriate.
We double-click on the Auto Cluster node and open the Discard tab as shown
in Fig. 7.131. As we know, we want to produce a model that distinguishes two
groups of patients. So we set the parameter “Number of clusters is greater than”
at the value 2.

Fig. 7.127 Distribution of the evaluation variable “class_variable”, depending on the cluster

Fig. 7.128 Final stream with an added Matrix node

15. We run the Auto Cluster node with the new settings and get the results shown in
Fig. 7.132.
The Auto cluster node can help to apply the TwoStep, K-Means, and
Kohonen algorithms to the same variables at the same time. Furthermore, it
allows testing of different model parameters. For instance, in Figs. 7.120,
7.121, and especially Fig. 7.122, a multiple number of clusters for testing can
be defined for the K-Means algorithm. The user can determine the parameters
of all these models at the same time. So using the Auto Clustering node can
help to determine many models and select the best.

Fig. 7.129 Matrix node settings for producing a contingency table

Fig. 7.130 Chi-square test of independence



Fig. 7.131 Discard tab in the Auto Cluster node

Fig. 7.132 Model results in the Auto Cluster node

The Auto Cluster node can also be used to produce a series of models with
different cluster numbers at the same time. Then the user can compare the
models and select the most appropriate one. This functionality will be
demonstrated in exercise 1.
It is important to note here, however, that all the algorithms must use the
same input variables. This is a disadvantage, because the TwoStep algorithm
needs transformed variables to meet the assumption of normally distributed
values. Using the Euclidean distance measure for TwoStep instead produces a
bad model. Interested users can test this. We came to the same conclusion in
exercise 3 of Sect. 7.4.4. In summary, the user has to decide whether he/she

should avoid the assumption of normally distributed values for the TwoStep or
produce a model based on this assumption. The consequence is that the
K-Means algorithm, which uses the same variables in the Auto Cluster node,
often cannot perform well. We will show in exercise 1 that using untransformed
data will lead to better results for K-Means.
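The general idea of the Auto Cluster node, fitting several clustering algorithms on the same inputs and ranking them by a quality measure, can be mimicked outside the Modeler. The sketch below uses scikit-learn algorithms as rough stand-ins (they are not the Modeler's TwoStep or Kohonen implementations); the file name is a placeholder, while the variable names follow the diabetes dataset described above.

```python
# Rough analogue of the Auto Cluster idea: fit several clustering algorithms on
# the same inputs and rank them by silhouette value. For illustration only.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score

df = pd.read_csv("diabetes.csv")                  # hypothetical export of the data
X = StandardScaler().fit_transform(
    df[["glucose_concentration", "blood_pressure", "serum_insulin",
        "BMI", "diabetes_pedigree"]])

models = {
    "K-Means": KMeans(n_clusters=2, random_state=0),
    "Agglomerative": AgglomerativeClustering(n_clusters=2),
    "Gaussian mixture": GaussianMixture(n_components=2, random_state=0),
}
for name, model in models.items():
    labels = model.fit_predict(X)
    print(name, round(silhouette_score(X, labels), 3))
```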

" The Auto Cluster node allows access details inside the data, by
determining a set of models, all with different parameters at the
same time. When applying this node to data, the following aspects
should be kept in mind:

Advantages
– The node allows testing for the application of algorithms K-Means,
TwoStep, and Kohonen, at the same time, to the same selected
variables.
– The user gets possible appropriate models in reference to the selected
variables.
– Several options can be defined for the algorithms individually.
– Restrictions for the identified models, such as number of clusters
and size of clusters can be determined in the discard tab of the
Auto Cluster node.
– The Auto Cluster node can be used to produce a series of models
with different cluster numbers at once. Then the user can compare
the models and select the most appropriate.

Disadvantages
– To meet the TwoStep algorithm's assumption of normally distributed input variables, the variables must be transformed. As all algorithms must deal with the same input variables, the K-Means node then
does not perform very well. So the user must often deal with the
trade-off between avoiding the normality assumption for TwoStep
or producing probably better results with K-Means.
– Usage of the Auto Cluster node needs experience, because a lot of
parameters, such as “different numbers of clusters to test”, can be
defined separately for each algorithm. For experienced users, it is a
good option for ranking models. Nevertheless, caution is needed to
keep track of all parameters and discard unhelpful options.

7.5.3 Exercises

Exercise 1 K-means for Diabetes dataset


In this section, we used the Auto Cluster node to find an appropriate algorithm that
determines two segments of patients in the diabetes dataset. Judging from these

results, the K-Means algorithm produces the best result in terms of silhouette value.
In this exercise, you should assess the model. Furthermore, you should examine the
dependency of the silhouette value and the number of clusters to separate.

1. Open the stream “cluster_diabetes_auto.str” and save it under another name.


2. Using the Auto Cluster node, you should determine the best K-Means model
with two clusters and assess the result. Bear in mind the findings related to the
transformed variables in Sect. 7.5.2.
3. Based on the model identified in step 2, now determine K-Means models with
between 2 and 10 clusters, and compare their quality in terms of silhouette value.

Exercise 2 Auto Clustering the Diet Dataset


In this exercise, the functionality of the Auto cluster node should be applied to the
diet dataset (see Sect. 10.1.24). We discussed TwoStep clustering in great detail in
Sect. 7.3.4 in exercise 3. The solution details for the application of TwoStep to the
diet dataset can be found in Sect. 7.3.5.
Using the diet types, “vegetarian”, “low meat”, “fast food”, “filling”, and
“hearty”, consumers were asked “Please indicate which of the following dietary
characteristics describe your preferences. How often do you eat . . .”. The
respondents had the chance to rate their preferences on a scale “(very) often”,
"sometimes", and "never". The variables are coded as follows: "1=never", "2=sometimes", and "3=(very) often". They are ordinally scaled.

1. The data can be found in the SPSS Statistics file “nutrition_habites.sav”. The
Stream “Template-Stream_nutrition_habits” uses this dataset. Save the stream
under another name.
2. Use an Auto Cluster node to determine useful algorithms for determining
consumer segments. Assess the quality of the different cluster algorithms.

7.5.4 Solutions

Exercise 1 K-means for Diabetes dataset


Name of the solution files: Stream "cluster_diabetes_K_means" and Microsoft Excel file "cluster_diabetes_K_means.xlsx"
Theory discussed in section: Sect. 7.4.2

1. The name of the solution stream is “cluster_diabetes_K_means”.


The parameter of the Auto Cluster included in the stream can also be used
here, but first we should disable the TwoStep and the Kohonen models
(Fig. 7.133).
2. In Sect. 7.5.2, we assessed the TwoStep model. The quality was fair and could be
improved by using other input variables, as determined in Fig. 7.118. That’s
because the K-Means algorithm doesn’t need normally distributed variables.

Fig. 7.133 Modified parameters in the Auto Cluster node

Fig. 7.134 Modified variable selection in the Auto Cluster node

So we replace the transformed variables with the original variables (see Fig. 7.134). With only K-Means enabled, we run the stream to determine the segmentation.
As shown in Fig. 7.135, we get just one model to assess. The quality of the
model based on the untransformed variables is better. The silhouette value

Fig. 7.135 K-Means model with two clusters in the model nugget

Fig. 7.136 Model summary in the K-Means node

increases from 0.373, found in Fig. 7.132, to 0.431. To start the model viewer,
we double-click on the model nugget in the second column.
More detailed K-Means results are shown in Fig. 7.136. Based on the silhouette value of 0.431, the clustering quality is assessed as good. The importance
of the variables in Fig. 7.137 changes, in comparison to the results found in
Fig. 7.126.
We now can investigate the model quality using the Matrix node. In
Fig. 7.138, we illustrate the results for the frequency distribution of the diabetes
test results per cluster. In comparison with the results of the Auto Cluster node
shown in Fig. 7.130, we can state that here also cluster 1 represents the patients
that tested negative. With its frequency of 212 patients, against 196 in the

Fig. 7.137 Model details from K-Means clustering in the Model Viewer

Fig. 7.138 Frequency distribution of “class_variable” per cluster

TwoStep model, the K-Means model is of better quality in terms of test results
correctly assigned to the clusters.
3. To make the stream easier to understand, we can add a comment to the Auto
Cluster node (see Fig. 7.139). Then we copy the Auto Cluster node and paste

Fig. 7.139 Stream with copied Auto Cluster node

Fig. 7.140 Model dialog in the Auto Cluster node

it. After that, we connect the new node to the type node. Figure 7.139 shows the
actual status of the stream.
We double-click on the pasted Auto Cluster node and activate the “Model”
tab. It is important to define the number of models to keep. Here, we want to
determine nine models. So this number must be at least 9 (see Fig. 7.140).
In the original stream, we only determined models with two clusters. Here, we
have to remove this restriction in the Discard tab of the Auto Cluster node.
Otherwise, other models will not be saved (see Fig. 7.141).
To determine the models with different cluster numbers, we activate the
Expert tab (see Fig. 7.142). Then we click on “Specify . . .”. As shown in

Fig. 7.141 Modified Discard options in the Auto cluster node

Fig. 7.142 Expert tab in the Auto Cluster node

Fig. 7.143, we open the dialog window in Fig. 7.144. Here, we define models
with between 2 and 10 clusters to determine.
This option has an advantage over the possibilities offered in the standalone K-Means node. There, we can only specify one model to determine at a

Fig. 7.143 Defining the number of clusters to determine in the K-Means part of the Auto
Cluster node

Fig. 7.144 Number of clusters to determine

time. In that way, the Auto Cluster node is more convenient for comparing
different models with different cluster numbers.
We can close all dialog windows.
We do not need to add a Matrix node, because we only want to compare the
different models based on their silhouette values. Figure 7.145 shows the final
stream, with an additional comment added. We can run the second Auto Cluster
node now.
Assessing the results in Fig. 7.146, we can see that the model quality decreases
if we try to determine models with more than two clusters. This proves the model
quality determined in part 2 of the exercise. Furthermore, we can say that the
algorithm indeed tries to determine segments that describe patients suffering or
not suffering from diabetes.

Fig. 7.145 Final stream with two Auto Cluster nodes

Fig. 7.146 Auto Cluster node results for different numbers of clusters

Finally, Fig. 7.147 illustrates the dependency of the clustering quality, in terms of silhouette value, on the number of clusters determined. We explained in exercise 5 of Sect. 7.4.4 how to get the precise silhouette value in the Model Viewer. The data can be found in the Excel file "cluster_diabetes_K_means.xlsx". In the chart, we can see that the quality decreases dramatically the more clusters are determined.
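The loop behind Fig. 7.147 can be expressed compactly in code. The sketch below fits K-Means for 2 to 10 clusters and prints the silhouette value for each; it only mirrors the procedure of this exercise, and the input file and column names are placeholders.

```python
# Sketch of the loop behind Fig. 7.147: silhouette value vs. number of clusters.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

df = pd.read_csv("diabetes.csv")                  # hypothetical export of the data
X = df[["glucose_concentration", "blood_pressure", "serum_insulin",
        "BMI", "diabetes_pedigree"]].values

for k in range(2, 11):
    labels = KMeans(n_clusters=k, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))
```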

Fig. 7.147 Silhouette value vs. number of clusters

Exercise 2 Auto Clustering the Nutrition Dataset


Name of the solution streams: cluster_nutrition_habits_auto
Theory discussed in sections: Sect. 7.5 (Auto Cluster node usage), Sect. 7.3.4 (exercise 3, clustering with TwoStep), and Sect. 7.3.5 (solution to exercise 3, characteristics of clusters determined with the TwoStep algorithm)

Remark
The TwoStep node is implemented in the Auto Cluster node. The log-likelihood
distance measure needs normally distributed continuous variables and multinomially distributed categorical variables. In Sect. 7.5.2, we found that using the Euclidean distance measure does not result in better models. Therefore, we use untransformed data in combination with the log-likelihood measure here.

1. We open the template stream “Template-Stream_nutrition_habits” and save it


under another name. The solution has the name “cluster_nutrition_habits_auto”.
2. We add an Auto Cluster node to the stream from the Modeling tab and connect it
with the Type node. In the Fields tab of the node, we define all five variables
available as input variables. Figure 7.148 shows this too. No other modifications
in the Auto Cluster node are necessary, so we can start the clustering process.
3. Double-clicking on the model nugget shown in Fig. 7.149, we get the results.
The three models presented in Fig. 7.150 have five or six clusters. So the model
determined with TwoStep equals the solution that we extensively discussed in
exercise 3 of Sect. 7.3.5. An assessment of the clustering can be found there.
Table 7.27 shows a summary of the cluster descriptions.
Of special interest here, is the result produced with the Kohonen algorithm.
This shows the frequency distribution of this model in the first column of
Fig. 7.150, as well as the model viewer in Fig. 7.151. Normally, the Kohonen

Fig. 7.148 Fields tab in the Auto Cluster node

Fig. 7.149 An Auto Cluster node is added to the template stream

algorithm tends to produce models with too many clusters. Here, with its six
clusters and a silhouette value of 0.794, it’s a good and useful model.
The predictor importance on the right of Fig. 7.151, and characteristics of the
different clusters shown on the left in Fig. 7.151, can also be used for consumer
segmentation purposes. Based on this short assessment, we have found an
alternative model to the TwoStep clustering presented in the solution to exercise
3 of Sect. 7.3.5.

Fig. 7.150 Auto Cluster results

Fig. 7.151 Results of Kohonen clustering in the model viewer

7.6 Summary

Cluster algorithms represent unsupervised learning procedures for identifying


segments in data with objects that have similar characteristics. In this section, we
explained in detail how the similarity or dissimilarity of two objects could be
determined. We called the measures used “proximity measures”. Based on a
relatively simple example of car prices, we saw how clustering works in general.
The TwoStep algorithm and the K-Means algorithm were then used to identify
clusters in different datasets. TwoStep represents the hierarchical agglomerative,
and K-Means represents the partitioning clustering algorithms. Using both
procedures, and the Kohonen procedure, we introduced usage of the Auto Clustering node in the Modeler.

Summarizing all our findings, we can state that clustering data often entails more
than one attempt at finding an appropriate model. Not least, the number of clusters
must be determined based on practical knowledge.

" The Modeler offers K-Means, TwoStep, and the Kohonen nodes for
identifying clusters in data. K-Means is widely used and partitions the dataset by assigning the records to identified cluster centers. The user has to define the number of clusters in advance.

" This is also necessary using the Kohonen node. This node is an imple-
mentation of a two-layer neural network. The Kohonen algorithm
transforms a multidimensional input vector into a two-dimensional
space. Vectors presented to the network are assigned to a pattern
first recognized in the data during a learning period.

" TwoStep implementation in the SPSS modeler is the most easy-to-use


clustering algorithm. It can deal with all scale types and standardizes
the input values. Moreover, the TwoStep node identifies the most
appropriate number of clusters that will represent the data structure
best. This recommended number is initially automatically generated
and can be used as a good starting point to find out the optimal
clustering solution. The user must bear in mind however, that the
log-likelihood distance measure assumes normally distributed
variables, but transforming variables often leads to unsatisfying
results.

" The Auto Cluster node can be used with the TwoStep, K-Means, and
Kohonen algorithm all steering in the background. This gives the user
a chance to find the best algorithm in reference to the structure of the
given data. This node is not as easy to use as it seems, however. Often
not all appropriate models are identified, so we recommend using the
separate TwoStep, K-Means, and Kohonen model nodes instead.

Literature
Bacher, J., Wenzig, K., & Vogler, M. (2004). SPSS TwoStep Cluster—A first evaluation.
Accessed 07/05/2015, from http://www.statisticalinnovations.com/products/twostep.pdf
Backhaus, K. (2011). Multivariate Analysemethoden: Eine anwendungsorientierte Einführung,
Springer-Lehrbuch (13th ed.). Berlin: Springer.
Bühl, A. (2012). SPSS 20: Einführung in die moderne Datenanalyse, Scientific tools (13th ed.).
München: Pearson.
Ding, C., & He, X. (2004). K-means Clustering via Principal Component Analysis. Accessed
18/05/2015, from http://ranger.uta.edu/~chqding/papers/KmeansPCA1.pdf

Handl, A. (2010). Multivariate Analysemethoden: Theorie und Praxis multivariater Verfahren


unter besonderer Berücksichtigung von S-PLUS, Statistik und ihre Anwendungen (2nd ed.).
Heidelberg: Springer.
IBM. (2015a). SPSS Modeler 17 Algorithms Guide. Accessed 18/09/2015, from ftp://public.dhe.
ibm.com/software/analytics/spss/documentation/modeler/17.0/en/AlgorithmsGuide.pdf
IBM. (2015b). SPSS Modeler 17 Modeling Nodes. Accessed 18/09/2015, from ftp://public.dhe.
ibm.com/software/analytics/spss/documentation/modeler/17.0/en/ModelerModelingNodes.
pdf
IBM. (2015c). SPSS Modeler 17 Source, Process, and Output Nodes. Accessed 19/03/2015, from
ftp://public.dhe.ibm.com/software/analytics/spss/documentation/modeler/17.0/en/
ModelerSPOnodes.pdf
IBM Website. (2014). Customer segmentation analytics with IBM SPSS. Accessed 08/05/2015,
from http://www.ibm.com/developerworks/library/ba-spss-pds-db2luw/index.html
Kohonen, T. (2001). Self-Organizing Maps, Springer Series in Information Sciences, Vol. 30, 3rd
ed. Berlin: Springer.
Mardia, K. V., Kent, J. T., & Bibby, J. M. (1979). Multivariate analysis.
Murty, M. N., & Devi, V. S. (2011). Pattern recognition: An algorithmic approach, Undergradu-
ate topics in computer science. London, New York: Springer, Universities Press (India) Pvt.
Ltd.
Struyf, A., Hubert, M., & Rousseeuw, P. J. (1997). Integrating robust clustering techniques in
S-PLUS.
Tavana, M. (2013). Management theories and strategic practices for decision making. Hershey,
PA: Information Science Reference.
Timm, N. H. (2002). Applied multivariate analysis, Springer texts in statistics. New York:
Springer.
Vogt, W. P., Vogt, E. R., Gardner, D. C., & Haeffele, L. M. (2014). Selecting the right analyses for
your data: Quantitative, qualitative, and mixed methods.
8 Classification Models

In Chap. 5, we dealt with regression models and applied them to datasets with
continuous, numeric target variables, the common data for these kinds of models.
We then dedicated Chap. 7 to unsupervised data and described various cluster
algorithms. In this chapter, we turn back to the supervised methods and attend to
the third big group of data mining methods, the classification algorithms.
Here, we are confronted with the problem of assigning a category to each input
variable vector. As the name suggests, the target variable is categorical, e.g.,
"Patient is ill" or "Patient is healthy". As in this example, the possible values of a classification problem's target variable often cannot be ordered. These kinds of problems are very common in all kinds of areas and fields, such as Biology, Social Science, Economics, Medicine, and Computer Science. Almost any problem you can think of that involves deciding between different kinds of discrete outcomes is a classification problem.
In the first section, we describe some real-life classification problems in more
detail, where data mining and classification methods are useful. After these
motivating examples, we explain in brief the concept of classification in data
mining, using the basic mathematical theory that all classification algorithms
have in common. Then in the remaining sections, we focus on the most famous
classification methods, which are provided by the SPSS Modeler individually, and
describe their usage on data examples. Figure 8.1 shows the outline of the chapter
structure. A detailed list of the classification algorithms discussed here is
displayed in the subsequent Fig. 8.6.
After finishing this chapter, the reader . . .

1. Is familiar with the most common challenges when dealing with a classification problem and knows how to handle them.
2. Possesses a large toolbox of different classification methods and knows their advantages and disadvantages.
3. Is able to build various classification models with the SPSS Modeler and is able to apply them to new data for prediction.


4. Knows various validation methods and criteria and can evaluate the quality of the trained classification models within the SPSS Modeler stream.

Fig. 8.1 Outline of the chapter structure

8.1 Motivating Examples

Classification methods are needed in a variety of real-world applications and fields, and in many cases they are already utilized. In this section, we present some of these applications as motivating examples.

Example 1
Diagnosis of breast cancer
To diagnose breast cancer, breast mass is extracted from numerous patients by
fine needle aspiration. Each sample is then digitized into an image from which different features can be extracted. Among others, these include the size of the cell
core or the number of mutated cells. The feature records of the patients, together
with the target variable (cancer or no cancer), are then used to build a classification
model. In other words, the model learns how the features have to look, for the tissue
sample to be tumorous or not. Now for a new patient, doctors can easily decide
whether or not breast cancer is present, by establishing the above-mentioned
features and putting these into the trained classifier.
This is a classical application of a classification algorithm, and logistic regres-
sion is a standard method used for this problem. See Sect. 8.3 for a description of
logistic regression.
An example dataset is the Wisconsin Breast Cancer Data (see Sect. 10.1.35).
Details on the data, as well as the research study, can be found in Wolberg and
Mangasarian (1990).

Example 2
Credit scoring of bank customers
When applying for a loan at a bank, your credit worthiness is calculated based on
some personal and financial characteristics, such as age, family status, income,
number of credit cards, or amount of existing debt. These variables are used to
estimate a personal credit score, which indicates the risk to the bank when giving
you a loan.

An example of credit scoring data is the “tree_credit” dataset. For details, see
Sect. 10.1.33.

Example 3
Mathematical component of a sleep detector (Prediction of sleepiness in EEG
Signals)
Drowsiness brings with it a reduced ability to concentrate, which can be dangerous and should be avoided in some everyday situations. For example, when driving a car, severe drowsiness is the precursor of microsleep, which can be life-threatening. Moreover, we can think of jobs where high concentration is essential and a lack of concentration can lead to catastrophes, for example, the work of airline pilots, surgeons, or operators monitoring nuclear reactors. This is one reason why
scientists are interested in detecting different states in the brain, to understand its
functionality.
For the purpose of sleep detection, EEG signals are recorded in different sleep
states, drowsiness and full consciousness, and these signals are analyzed to identify
patterns that indicate when drowsiness may occur. The EEG_Sleep_Signals.csv
(see Sect. 10.1.10) is a good example, which is analyzed with an SVM in
Sect. 8.5.2.

Example 4
Handwritten digits and letter recognition
When we send a letter by mail, it is gathered together with a pile of other letters. The challenge for the post office is to sort this huge mass of letters by their destination (country, city, zip-code, street). In former days, this was done by hand, but nowadays computers do this automatically. These machines scan the address on each letter and allocate it to the right destination. The main challenge is that handwriting differs noticeably between different people. Today’s sorting machines use an algorithm that is able to recognize alphabetical characters and numbers written by individual people, even if it has never seen the handwriting before. This is a fine example of a machine learning model, trained on a small subset and able to generalize to the entire collective.
Other examples where automatic letter and digit identification can be relevant
are signature verification systems or bank-check processing. The problem of
handwritten character recognition falls into the more general area of pattern recog-
nition. This is a huge research area with many applications, e.g., automatic identifi-
cation of the correct image of a product in an online shop or analysis of satellite
images.
The optical recognition of handwritten digits data, obtained from the UCI Machine Learning Repository (1998), is a good example. See also Sect. 10.1.25.

Other examples and areas where classification is used:

• Sports betting. Prediction of the outcome of a sports match, based on the results
of the past.
• Determining churn probabilities. For example, a telecommunication company
wants to know if a customer has a high risk of switching to a rival company. In
this case, they could try to avoid this with an individual offer to keep the
customer.
• In the marketing area: to decide if a customer in your database has a high potential of responding to an e-mail marketing campaign. Only customers
with a high response probability are contacted.

8.2 General Theory of Classification Models

As described in the introduction to this chapter, classification models are dedicated to categorizing samples into exactly one category. More precisely, let y be the observation or target variable that can take one of a finite number of values A, B, C, . . . Recalling the motivating examples in the previous section, these values can, for example, describe a medical diagnosis, indicate whether an e-mail is spam, or discern the possible meaning of a handwritten letter. As these examples show, the observation values do not have to be in any kind of order or even numeric.
Based on some input variables, xi1, . . ., xip, a classification model now tries to
determine the value of the observation yi, and thus, categorize the data record. For
example, if some breast sample tissue has unusual cell sizes, this could likely be
cancer. Or, if the subject of an e-mail contains some words that are common in
spam, this e-mail is probably a spam mail too.
In this section, we give a brief introduction to the general theory of classification modeling.

8.2.1 Process of Training and Using a Classification Model

The procedure for building a classification model follows the same concept as
regression modeling (see Chap. 5). As described in the regression chapter, the
original dataset is split into two independent subsets, the training and the test
set. A typical partition is 70 % training data and 30 % test data. The training data is
used to build the classifier, which is then applied to the test set for evaluation. Using
a separate dataset is the most common way to measure the goodness of the
classifier’s fit to the data and its ability to predict. This process is called cross-validation; see Sect. 5.1.2 or James et al. (2013). We thus recommend always using
this training and test set framework when building a classifier.
Often, some model parameters have to be predefined. These however are not
always naturally given and have to be chosen by the data miner. To find the optimal
parameter, a third independent dataset is used, the validation set. A typical

partition of the original data is then 60 % training, 20 % validation and 20 % test set.
After training several classifiers with different parameters, the validation set is used
to evaluate these models in order to find the one with best fit (see evaluation
measures in Sect. 8.2.5). Afterwards, the winner of the previous validation is
applied to the test set, to measure its predicting performance on independent data.
This last step is necessary to eliminate biases that might have occurred in the
training and validation steps. For example, it is possible that a particular model
performs very well on the validation set, but is not very good on other datasets. In
Fig. 8.2, the process of cross-validation and building a prediction model is
illustrated.
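The partitions mentioned above are also easy to reproduce outside the Modeler. The following minimal Python sketch, using scikit-learn and a made-up dummy dataset, performs the 60/20/20 split into training, validation, and test sets; the DataFrame and column names are hypothetical stand-ins for real data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Dummy data standing in for a real dataset (hypothetical columns)
df = pd.DataFrame({"x1": range(100), "target": [0, 1] * 50})

# 60 % training, then split the remaining 40 % in half: 20 % validation, 20 % test
train, rest = train_test_split(df, train_size=0.6, random_state=1)
valid, test = train_test_split(rest, train_size=0.5, random_state=1)

print(len(train), len(valid), len(test))  # 60 20 20
```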
The general idea behind a classification model is pretty simple. When training a
classification model, the classification algorithm inspects the training data and tries
to find regularities in data records with the same target value and differences
between data records of different target values. In the simplest case, the algorithm
converts these findings into a set of rules, such that the target classes are
characterized in the best possible way through these “if . . . then . . .” statements.
In Fig. 8.3, this is demonstrated with a simple data example. There, we want to
predict if we can play tennis this afternoon based on some weather data. The
classification algorithms transform them into a set of rules that define the classifier.
For example, if the outlook is “Sunny” and the temperature is greater than 15 °C, we will play tennis after work. If, on the other hand, the outlook is “Rain” and the wind
prognosis is “strong”, we are not going to play tennis.
Applying a classifier to new and unseen data now simply becomes the assign-
ment of a class (target value) to each data record using the set of rules. For example,
in Fig. 8.4, the classifier that suggests whether we can play tennis is applied to new data. For day 10, where the Outlook = “Sunny”, Temperature = 18 °C and Wind = “strong”, the classifier predicts that we can play tennis in the afternoon, since these variable values fulfill the first of its rules (see Fig. 8.3).

Fig. 8.2 Overview of the steps when building a classification model

Fig. 8.3 Training of a classification model

Fig. 8.4 Prediction with a classification model
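To make the “if . . . then . . .” idea concrete, here is a minimal Python sketch that encodes the two example rules from Fig. 8.3 as a function and applies it to day 10 of Fig. 8.4. The rules and the 15 °C threshold come from the text; the function name, the default branch, and the return labels are our own illustrative assumptions.

```python
def play_tennis(outlook: str, temperature: float, wind: str) -> str:
    """Toy rule-based classifier built from the two rules in Fig. 8.3."""
    if outlook == "Sunny" and temperature > 15:
        return "play tennis"   # rule 1: sunny and warmer than 15 °C
    if outlook == "Rain" and wind == "strong":
        return "no tennis"     # rule 2: rain and strong wind
    return "no tennis"         # hypothetical default for all other cases

# Day 10 from Fig. 8.4: Outlook = "Sunny", Temperature = 18 °C, Wind = "strong"
print(play_tennis("Sunny", 18, "strong"))  # -> "play tennis"
```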
Many classification models pre-process the input data via a numeric function,
which is used to transform and quantify each data record. Each data record is then
assigned a score, which can be a measure of the probabilities of each class, or some
distances. When training the model, this scoring function and its parameters are
determined, and a set of decision rules for the function’s value are generated. The
class of unseen data is now predicted by calculating the score with the scoring
function and assigning the target class suggested by the score. See Fig. 8.5 for an
illustration of a classification model with an internal scoring function.
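The following tiny Python sketch illustrates such an internal scoring function; the scoring formula, its weights, and the 0.5 cut-off are invented for the illustration and are not the Modeler's internal implementation.

```python
import math

def score(x1: float, x2: float) -> float:
    """Toy scoring function mapping a data record to a probability-like value."""
    return 1.0 / (1.0 + math.exp(-(0.8 * x1 - 0.5 * x2)))

def predict(x1: float, x2: float) -> int:
    """Decision rule on the score: assign class 1 if the score exceeds 0.5."""
    return 1 if score(x1, x2) > 0.5 else 0

print(predict(2.0, 1.0), predict(-1.0, 3.0))  # -> 1 0
```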

8.2.2 Classification Algorithms

A variety of classification algorithms and concepts exist, with different approaches and techniques to handle the data and find classification rules. These methods can be roughly grouped into three types: linear, nonlinear, and rule-based algorithms.

Fig. 8.5 Classification model with scoring function

Fig. 8.6 Overview of the most important classification algorithms

The models of the first two types follow a mainly mathematical approach and
generate functions for scoring and separation of the data and classes. Whereas
linear methods try to separate the different classes with linear functions, nonlinear
classifiers can construct more complex scoring and separating functions. In contrast
to these mathematical approaches, the rule-based models search the input data for
structures and commonalities without transforming the data. These models generate
“if . . . then . . .” clauses on the raw data itself. Figure 8.6 lists the most important
and favored classification algorithms that are also implemented in the SPSS Mod-
eler. Chapters where the particular models are discussed are shown in brackets.
Each classification model has its advantages and disadvantages. There is no
method that outperforms every other model for every kind of data and classification
problem. The right choice of classifier strongly depends on the data type. For
example, some classifiers are more applicable to small datasets, while others are
more accurate on large data. Other considerations when choosing a classifier are the
properties of the model, such as accuracy, robustness, speed, and interpretability of
the results. A very accurate model is never bad, but sometimes a robust model,
which can be easily updated and is insensitive to strange and unusual data, is more
important than accuracy. In Table 8.1, some advantages and disadvantages of the
classification algorithms are listed to give the reader guidelines for the selection of
the right method. See also Tuffery (2011).

Table 8.1 Advantages and disadvantages of the classification algorithms

Logistic regression
  Advantages: robust (barely influenced by noisy data); (typically) a high precision; works with small sample sizes; probabilistic interpretation of results
  Disadvantages: problems with high collinearity; Gaussian distributed residuals

Linear discriminant analysis
  Advantages: very fast; works with small samples; optimal if the data assumptions are fulfilled
  Disadvantages: needs data preparation; requires class-wise Gaussian distributed data; sensitive to outliers; only applicable to linear problems

Support vector machine
  Advantages: robust (barely influenced by noisy data); high precision; not prone to overfitting
  Disadvantages: slow in training; results are hard to understand (blackbox algorithm); sensitive with regard to the choice of kernel function; choosing the wrong kernel and parameters can risk overfitting

Neural network
  Advantages: good for large samples; can handle very complex interactions between variables; nonparametric, no distribution assumptions needed; resistant to defective data
  Disadvantages: cannot handle too many variables; results are hard to understand (blackbox algorithm); not always an optimal solution

K-nearest neighbor
  Advantages: speed (no training needed); complex concepts can be learned through simple procedures
  Disadvantages: performance depends on the number of dimensions (curse of dimensionality); problems with highly imbalanced data; no interpretation of results possible

Decision trees
  Advantages: no problem with outliers; results and rules are understandable; works with many data types; fast in prediction; no assumptions on variable distributions needed
  Disadvantages: slow in training a model; tends to overfit; detects local, not global, optima; prefers variables with many categories or numerical data

8.2.3 Classification vs. Clustering

Recalling the discussion in Sect. 7.2, a clustering algorithm is an unsupervised learning method. This means that the data points are not labeled; hence, no target variable is needed. Clustering is therefore used to group the data points, based
on their similarities and dissimilarities. The purpose is to find new patterns and
structures within the data. Recall the motivating example of Customer segmenta-
tion in Sect. 7.1; customer segmentation is often used by companies to identify high
potential and low potential customers. For example, banks cluster their customers
into groups to identify the customers with high or low risk of loan default.

Fig. 8.7 Clustering vs. classification

As described in the previous sections, classification has a different purpose.


When using a classifier, the data points are labeled. In the previous example of loan
default risk, the bank already has a dataset of customers, each of them labeled
“risky” or “non-risky”, based on the customers’ ability to repay his/her loan. Based
on this training data of bank customers, a classification model learns to separate the
“risky” from the “non-risky” customers. Afterwards, new customers can be
categorized into risky or non-risky, so the bank has a better basis on which to
decide to give a customer a loan. Due to this labeled training data, classification
models belong to the supervised learning algorithms. The model learns from some
training data and then predicts the outcome of the response variable based on this
training data.
In Fig. 8.7, the difference between classification and clustering is visualized.

8.2.4 Making a Decision and the Decision Boundary

In the case of a binary target variable, i.e., a “yes” or “no” decision, four possible
events can occur when assigning a category to the target variable via classification.

– True positive (TP). The true value is “yes” and the classifier predicts “yes”.
A patient has cancer and cancer is diagnosed.
– True negative (TN). The true value is “no” and the classifier predicts “no”.
A patient is healthy and no cancer is diagnosed.
– False positive (FP). The true value is “no” and the classifier predicts “yes”.
A patient is healthy but cancer is diagnosed.
– False negative (FN). The true value is “yes” and the classifier predicts “no”.
A patient has cancer but no cancer is diagnosed.

These four possible classification results are displayed below in Table 8.2.
A model is said to be a good classifier if it predicts the true value of the outcome
variable with high accuracy, that is, if the proportion of TP’s and TN’s is high, and
the number of FP’s and FN’s is very low. Unfortunately, a perfect classifier with no
misclassification is pretty rare, and it is almost impossible to find an optimal classifier.
Now, one can justifiably interject that nearly every dataset of differently labeled points can be separated perfectly using a function, which is also called a decision boundary. This is undoubtedly true. However, a complex decision boundary might perfectly classify the existing data, but if a new and unknown data point has to be classified, it can easily lie on the wrong side of the decision boundary, and thus be misclassified. The perfect classifier no longer exists. The problem here is that the classification model overfitted the data and is inappropriate for independent data, even though it perfectly classified the training data. Thus, we have to reduce our expectations and tolerate a few misclassifications in favor of a simpler and more universal decision boundary. See Kuhn and Johnson (2013) for more information on the decision boundary and overfitting, as well as Sect. 5.1.2.
These considerations are illustrated in Fig. 8.8. There, the decision boundary in
the graph to the left separates the squares from the circles perfectly. In the middle
graph, a new data point is added, the filled square, but this lies on the circle side of
the boundary. Hence, the complex decision boundary is no longer optimal; it overfits the data. The linear boundary in the graph on the right classifies the data about as well as the complex boundary, but is much simpler.
So, in order to avoid overfitting, we recommend always using the training/testing
model setting and potentially the validation model setting too.

Table 8.2 Overview of the possible results when performing a classification

                                   True category: Yes       True category: No
Predicted category: Yes            True positive (TP)       False positive (FP)
Predicted category: No             False negative (FN)      True negative (TN)

Fig. 8.8 Illustration of the problem of overfitting, when choosing a decision boundary

8.2.5 Performance Measures of Classification Models

Various measures and methods exist for the evaluation of models. Here, we discuss
the most common ones implemented in the SPSS Modeler.

Classification error
The obvious measure for evaluating a classifier is the rate of misclassified instances, i.e.,

$$\frac{\text{number of FP} + \text{number of FN}}{\text{number of data points}}.$$

This quotient is called the classification error and should be small for a well-fitted model.
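As a quick numerical illustration, the following Python sketch counts the four outcome types of Table 8.2 and computes the classification error for a small set of true and predicted labels; the labels are made up for the example.

```python
# Hypothetical true and predicted labels (1 = "yes", 0 = "no")
y_true = [1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

error = (fp + fn) / len(y_true)   # rate of misclassified instances
print(tp, tn, fp, fn, error)      # -> 3 3 1 1 0.25
```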

ROC and AUC


Other popular measures for evaluating the performance of a classification model
with binary classes are the Receiver Operating Characteristic curve (ROC) and
the Area Under this Curve (AUC). The ROC curve visualizes the false positive
rate (FPR) against the true positive rate (TPR), i.e.,

$$FPR = \frac{FP}{FP + TN} \quad \text{and} \quad TPR = \frac{TP}{TP + FN}.$$
More precisely, consider a binary classifier (0 and 1 as target categories) with a
score function that calculates the probability of the data point belonging to class
1. Then each of these scores (probabilities) t is considered as a threshold for the decision whether a data point is of class 1 or 0; that is, a data point is considered to be of class 1 if its predicted probability is larger than t. From these predictions, the FPR(t) and TPR(t) are calculated. Thus, the FPR and TPR are actually vectors with values between 0 and 1. Of course, this works analogously for every other form of score.
As an example, consider the following probability predictions (Fig. 8.9):
To calculate the ROC, the FPR and TPR are determined for thresholds 1, 0.8,
0.6, 0.5, 0.2, 0. These are displayed in Fig. 8.10, together with the coincidence
matrix that contains the TP, FP, FN, and TN values in the different cases.
Fig. 8.9 Example predictions for ROC

Fig. 8.10 TPR and FPR for different thresholds

Fig. 8.11 Example predictions for ROC of threshold 0.2

As an example, let us take a look at t = 0.2. There are three data points with a probability for class 1 larger than 0.2; hence, they are assumed to belong to class 1. Two of them are actually of class 1, while one is misclassified (the one with probability 0.6) (see Fig. 8.11). The last data point with probability 0.2 is assigned to class 0, since its predicted probability is not larger than t. From these predicted classes, compared with the true classes, we can construct the coincidence matrix and then easily calculate the TPR and FPR. See the second column in Fig. 8.10.
The ideal classifier has TPR = 1 and FPR = 0, which means that a good classification model should have an ROC that goes from the origin in a relatively straight line to the top left corner, and then to the point (1,1). The diagonal symbolizes the random model, and if the curve coincides with it, the model doesn’t perform any better than random guessing. In cases where the ROC is under the diagonal, it is most likely that the model classifies the inverse, i.e., 0 is classified as 1 and vice versa. A typical and an ideal ROC are displayed in Fig. 8.12.
This ROC curve can be transformed into a single goodness of fit measure for
the classification model: the AUC. This is simply the area under the ROC curve
and can take values between 0 and 1. With the AUC, two models can be compared,
and a higher value indicates a more precise classifier and thus a more suitable
model for the data.
For more detailed information on the ROC and AUC, we recommend Kuhn
and Johnson (2013) and Tuffery (2011).
In the SPSS Modeler, the ROC can be drawn with the Evaluation node. This is
demonstrated in Sect. 8.3.2.

Gini index
Another common goodness of fit measure for the classifier is the Gini index, which
is just a transformation of the AUC, i.e.,

$$\text{Gini} = 2 \cdot \text{AUC} - 1.$$

Fig. 8.12 A typical ROC curve is illustrated in the first graph and an ideal ROC curve in the
second graph

As before with the AUC, a higher Gini index indicates a better classification model.
The shaded area in the first graph in Fig. 8.12 visualizes the Gini index and shows
the relationship with the AUC.
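For readers who want to reproduce such numbers outside the Modeler, here is a minimal Python sketch that computes the ROC points, the AUC, and the Gini index with scikit-learn; the class labels and probabilities below are invented and are not the values of Fig. 8.9.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical true classes and predicted probabilities for class 1
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
p_hat = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1])

fpr, tpr, thresholds = roc_curve(y_true, p_hat)  # points of the ROC curve
auc = roc_auc_score(y_true, p_hat)               # area under the ROC curve
gini = 2 * auc - 1                               # Gini index as defined above

print(fpr, tpr)
print(auc, gini)
```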

8.2.6 The Analysis Node

These presented measures are implemented in the SPSS Modeler and displayed
in the output of the Analysis node. To get these statistics on the model’s
goodness of fit, the Analysis node has to be added to the model nugget and then
executed.
The statistics that can be calculated and displayed with the Analysis node can be
seen in Fig. 8.13. The three most important additional statistics and measures that
are available for a classifier model are listed in Table 8.3, and the output of the
Analysis node for a binary target variable can be viewed in Fig. 8.14. The output in
the multiple classification case is basically the same, except for the missing AUC
and Gini values. For an example of such a multiple classifier output, see Fig. 8.89.
In the output window of the Analysis node, the accuracy is shown at the top,
followed by the coincidence matrix. The Gini and AUC are displayed at the bottom
of the Analysis node output. All statistics are individually calculated for the training
and test set, respectively. See Fig. 8.14 for the example of the Analysis node
statistics for a classifier on the Wisconsin breast cancer data.

Fig. 8.13 Options in the Analysis node for a classification node

Table 8.3 List of the available statistics for a classification model

Coincidence matrices: Display of matrices that show the predicted target values against the true target categories for each data record. This helps to identify systematic prediction errors.

Evaluation metric (AUC & Gini): Display of the AUC and Gini index. This option can only be chosen for a binary classifier. To calculate the values of these measures, the target variable has to be set to Flag as a measurement in the Type node. See Fig. 8.15.

Confidence figures: If the classification model generates a prediction field for which confidence statistics can be calculated, e.g., logistic regression and the probability of a sample being in one category, then this option shows some confidence values and their connection to the prediction. For a detailed description of these statistics, see IBM (2015c).

Fig. 8.14 Output of the Analysis node of a classifier with two target categories, on the example of
the Wisconsin Breast Cancer Data (see Sect. 10.1.35) and a logistic regression classifier

8.2.7 Exercises

Exercise 1 Building a classifier and prediction for new data

1. Explain in your own words the steps for building a classifier.


2. Consider a classifier for the assignment of a “good” or “bad” credit rating to
bank customers. The classifier has the following set of rules:

Fig. 8.15 Defining the target variable measurement as “Flag” in the Type node. This ensures
correct calculation of the AUC and Gini

Use this classifier to predict the credit rating of the following customers:

Exercise 2 Calculation of goodness of fit measures


Recall the example of credit card rating from Exercise 1. Assume now that the
classifier uses a scoring function that calculates the probability that a customer has a

“good” credit rating. The probabilities of the customers and the true credit ratings
are as follows:

1. Predict the credit rating in cases where the classifier suggests a “good” rating, if
the probability is greater than 0.5, and a “bad” rating otherwise. What are the
number of TP, FP, TN, and FN and the error rate in this case?
2. Calculate the ROC and the AUC from the probabilities of a “good” credit rating
and a true credit rating.

Exercise 3 Pitfalls in classification


When building a classification model, there are many factors you have to keep in
mind, which can negatively influence the ability of the prediction model and lead to
worse models. These include:

– Imbalanced data: The target variable is very unevenly distributed in your training data. For example, a binary target variable may contain only 5 % 1s and 95 % 0s. This can happen in fraud data, for example, since fraud is rare compared to normal use.
– Outliers: These are data points or variable values that are located far away from
all the others. These can result from measurement errors, for example.
– Missing values: Data may not always be complete, and some data records contain missing values. That means not every attribute of a data record has a proper value, i.e., the data of this variable is missing. This incompleteness can occur if, for example, the data are unavailable, like customer information for sales transaction data, or the variable wasn’t considered to be important in the past.
– A huge number of predictor variables: The number of input variables is pretty high compared to the number of observations. One example is gene data, which comprise a huge number of measured values, but samples and tissues of a particular disease are often rare, so the number of observations is very small.

Give reasons why these effects can worsen your model and prevent a good prediction. Explain the term “overfitting” in your own words.

8.2.8 Solutions

Exercise 1 Building a classifier and prediction for new data


Theory discussed in Sect. 8.2.1

1. This has been explained in Sect. 8.2.1, and we refer to that section for the answer to this first question.
2. The predicted credit ratings are listed in Fig. 8.16.

Exercise 2 Calculation of goodness of fit measures


Theory discussed in Sects. 8.2.4 and 8.2.5

1. The predicted credit ratings are the same as in exercise 1 and can be viewed in
Fig. 8.16. Table 8.4 shows the credit card rating values of TP, FP, TN, FN.
With these values, we obtain the error rate

$$\text{Error rate} = \frac{\text{number of FP} + \text{number of FN}}{\text{number of data points}} = \frac{1 + 2}{13} = 0.231.$$

2. The values of TPR and FPR for the relevant thresholds are listed in Fig. 8.17, and
they are calculated as described in the ROC part of Sect. 8.2.5. The ROC then
looks like Fig. 8.18.

Fig. 8.16 Predicted credit ratings

Table 8.4 Values of TP, FP, TN, FN for the credit card rating

                                         True credit card rating: Yes    True credit card rating: No
Predicted credit card rating: Yes        TP = 6                          FP = 1
Predicted credit card rating: No         FN = 2                          TN = 4

Fig. 8.17 FPR and TPR for the relevant threshold of the ROC

Fig. 8.18 ROC of the credit rating prediction

Exercise 3 Pitfalls in classification

– Imbalanced data: Classifier algorithms can prefer the majority class and opti-
mize the model to predict this class very accurately. As a consequence, the
minority class is then poorly classified. This leaning towards the larger target
class is often the optimal choice for highest prediction accuracy, when dealing
with highly imbalanced data.

Fig. 8.19 Visualization of the imbalanced data problem. The minority class is highly misclassified

In Fig. 8.19, the imbalanced data problem is visualized. The dots are under-
represented compared with the rectangles, and thus the decision boundary is
matched to the latter class of data points. This leads to high misclassification in
the dots’ class; hence, 5 out of 7 dots are misclassified, as they lie on the wrong side of the boundary. See Haibo He and Garcia (2009) for details on this issue
and a variety of methods for dealing with this problem.
– Outliers: An outlier in the data can vastly change the location of the decision
boundary and, thus, highly influence the classification quality. In Fig. 8.20, the
calculated decision boundary is heavily shifted, in order to take into account the
outlier at the top. The dotted decision boundary might be a better choice for
separation of the classes.
– Missing values: There are two pitfalls when it comes to missing values. First, a data record with missing values lacks information. For some models, these data are not usable for training, or they can lead to incorrect models, due to
misinterpretation of the distribution or importance of the variables with missing
values. Moreover, some models are unable to predict the target for incomplete
data. There are several methods to handle missing values and assign a proper
value to the missing field. See Han et al. (2012) for a list of concepts to deal with
missing values.
– A huge number of predictor variables: If the ratio of input variables to the
number of observations is extremely high, classifiers tend to overfit. The more
input variables that exist, the more possibilities for splitting the data. So, in this
case we can definitely find a way to distinguish between the different samples

Fig. 8.20 Visualization of the outlier problem. The decision boundary is not chosen optimally, due to the outlier at the top

through at least one variable. One possible technique for dealing with this
problem is dimension reduction, via PCA or factor analysis (see Chap. 6). For
further information, we refer to James et al. (2013).

Overfitting is a phenomenon that occurs when the model is too optimized to the training data, such that it is very accurate on the training data, but inaccurate when predicting unseen data. See Sect. 8.2.4, Fig. 8.8, and Sect. 5.1.2.

8.3 Logistic Regression

Logistic regression (LR) is one of the most famous classification models and is used
for many problems in a variety of fields. It is of such relevance and importance, especially in the financial sector, that main contributors to the theory received the Nobel Prize in Economics in 2000 for their work. LR is pretty similar to
Linear Regression as described in Chap. 5. The main difference is the scale of the
target value. Linear regression assumes a numeric/continuous target variable and tries
to estimate the functional relationship between the predictors and the target variables,
whereas in a classification problem, the target variable is categorical and linear
regression models become inadequate for this kind of problem. Hence, a different
approach is required. The key idea is to perform a transformation of the regression
equation to predict the probabilities of the possible outcomes, instead of predicting
the target variable itself. This resulting model is then the “LR model”.

8.3.1 Theory

The setting for an LR is the following: consider $n$ data records $x_{i1}, \ldots, x_{ip}$, each consisting of $p$ input variables, and each record having an observation $y_i$. The observations $y_1, \ldots, y_n$ thereby are binary and take the values 0 or 1.
Instead of predicting the categories (0 and 1) directly, the LR uses a different approach and estimates the probability of the observation being 1, based on the co-variables, i.e.,

$$P\left(y_i = 1 \mid x_{i1}, \ldots, x_{ip}\right),$$

with a regression function

$$h\left(x_{i1}, \ldots, x_{ip}\right) = \beta_0 + \beta_1 \cdot x_{i1} + \ldots + \beta_p \cdot x_{ip}.$$

However, using the naive approach and estimating the probability directly with this regression, that is, $P\left(y_i = 1 \mid x_{i1}, \ldots, x_{ip}\right) = h\left(x_{i1}, \ldots, x_{ip}\right)$, is not feasible, since the regression function is not bounded between 0 and 1. More precisely, for particular values of the input variables, $h\left(x_{i1}, \ldots, x_{ip}\right)$ can be greater than 1 or even negative.
To solve this problem, the regression function is transformed with a function F, so that it can only take values in the interval [0, 1], i.e.,

$$P\left(y_i = 1 \mid x_{i1}, \ldots, x_{ip}\right) = F\left(\beta_0 + \beta_1 \cdot x_{i1} + \ldots + \beta_p \cdot x_{ip}\right),$$

where F is the logistic (distribution) function

$$F(t) = \frac{\exp(t)}{1 + \exp(t)}.$$

See Fig. 8.21 for the graph of the logistic function. Hence,

$$P\left(y_i = 1 \mid x_{i1}, \ldots, x_{ip}\right) = \frac{\exp\left(h\left(x_{i1}, \ldots, x_{ip}\right)\right)}{1 + \exp\left(h\left(x_{i1}, \ldots, x_{ip}\right)\right)},$$

and by taking the inverse of the logistic function,

$$\log\left(\frac{P\left(y_i = 1 \mid x_{i1}, \ldots, x_{ip}\right)}{1 - P\left(y_i = 1 \mid x_{i1}, \ldots, x_{ip}\right)}\right) = h\left(x_{i1}, \ldots, x_{ip}\right),$$

we get the usual (linear) regression term on the right-hand side. This equation is referred to as the log-odds or logit, and it is usually stated as the defining equation of the logistic regression.
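As a small numerical illustration (not the Modeler's internal code), the following Python sketch evaluates the logistic function and turns a linear predictor into a probability; the coefficients and the data record are invented for the example.

```python
import math

def logistic(t: float) -> float:
    """Logistic (distribution) function F(t) = exp(t) / (1 + exp(t))."""
    return math.exp(t) / (1.0 + math.exp(t))

# Invented coefficients beta_0, beta_1, beta_2 and one data record (x1, x2)
beta = [-1.5, 0.8, 0.3]
x = [2.0, 1.0]

h = beta[0] + beta[1] * x[0] + beta[2] * x[1]  # linear predictor h(x)
p = logistic(h)                                # P(y = 1 | x)
print(h, p)                                    # -> 0.4 and about 0.599
```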

Fig. 8.21 Graph of the logistic (distribution) function

Odds
Recall that in the linear regression models, the coefficients $\beta_0, \ldots, \beta_p$ give the effect of the particular input variable. In LR, we can give an alternative interpretation of the coefficients by looking at the following equation,

$$\frac{P\left(y_i = 1 \mid x_{i1}, \ldots, x_{ip}\right)}{P\left(y_i = 0 \mid x_{i1}, \ldots, x_{ip}\right)} = \exp(\beta_0) \cdot \exp(\beta_1 x_{i1}) \cdot \ldots \cdot \exp\left(\beta_p x_{ip}\right),$$

which is derived from the former equation by taking the exponential. The quotient of probabilities on the left-hand side of this equation is called the odds and gives the weight of the probability of the observation being 1, compared to the probability that the observation is 0. So, if the predictor $x_{ik}$ increases by 1, the odds change by a factor of $\exp(\beta_k)$. This particularly means that a coefficient $\beta_k > 0$, and thus $\exp(\beta_k) > 1$, increases the odds and therefore the probability of the target variable being 1. On the other hand, if $\beta_k < 0$, the target variable tends to be of the category 0. A coefficient of 0 does not change the odds, and the associated variable therefore has no influence on the prediction of the observation.
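The following short Python sketch illustrates this interpretation with invented coefficients: it computes the odds before and after increasing one predictor by 1 and shows that their ratio equals exp(βk).

```python
import math

beta0, beta1 = -1.0, 0.7          # invented intercept and coefficient

def odds(x1: float) -> float:
    """Odds P(y = 1 | x) / P(y = 0 | x) = exp(beta0 + beta1 * x1)."""
    return math.exp(beta0 + beta1 * x1)

o1, o2 = odds(2.0), odds(3.0)     # increase the predictor by 1
print(o2 / o1, math.exp(beta1))   # both are exp(0.7), about 2.014
```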
This model is called the Logit model, due to the transformation function F, and
it is the most common regression model for binary target variables. Besides the
Logit model, there are other models that use the same approach but different
transformation functions, and we refer interested readers to Fahrmeir (2013) for
further information.
To perform an LR, some requirements are needed, which should be checked
before trusting the results of the model.

Necessary conditions

1. The observations should be independent.
2. The predictor variables should have low collinearity, as otherwise the impact of the individual variables can be difficult to differentiate.
3. The number of samples should be large enough, so that the coefficients can be calculated reliably.

Multinomial logistic regression


If the domain of the target variable has more than two elements, the logistic regression model described above can be extended. To this end, a base category of the domain is chosen, and for every other value a binary logistic regression model is built against the base category. More precisely, if the target variable takes k different values, there are k−1 independent binary LR models fitted to predict the probability of each of the possible k−1 outcomes. With these k−1 models, the probabilities of all k elements in the domain can be described by logistic regression functions.
For new data, the category with the highest probability estimated by the regres-
sion equations is chosen as the prediction. See Azzalini and Scarpa (2012) for more
details.
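One way to sketch this idea in code is the following Python snippet for a three-class target with base category "C". The coefficients are invented, and the shared-denominator form used here is the standard multinomial logit parameterization; it is shown only to illustrate how the class with the highest estimated probability is chosen.

```python
import math

# Invented logit coefficients (intercept, slope) for classes "A" and "B"
# against the base category "C"
coef = {"A": (-0.5, 0.9), "B": (0.2, -0.4)}

def predict(x: float) -> str:
    """Multinomial logistic prediction: choose the class with the highest probability."""
    scores = {cls: math.exp(b0 + b1 * x) for cls, (b0, b1) in coef.items()}
    denom = 1.0 + sum(scores.values())      # the base category contributes exp(0) = 1
    probs = {cls: s / denom for cls, s in scores.items()}
    probs["C"] = 1.0 / denom
    return max(probs, key=probs.get)

print(predict(2.0), predict(-2.0))          # -> A B
```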

Goodness of fit measures
Besides the common measures of fit introduced in Sect. 8.2.5, there are a variety of
other measures and parameters to quantify the goodness of fit of the logistic
regression model to the data. First, we mention the Pseudo R-Square Measures,
in particular the Cox and Snell, the Nagelkerke, and the McFadden measures, which compare the fitted model with the naïve model, i.e., the model which includes only the intercept. See Tuffery (2011) and Allison (2014). These
measures are similar to the coefficient of determination R2, but each of them has
its limitations. We would further like to point out that we have to pay attention
when interpreting the Cox and Snell R2 measure since its maximum value is always
less than 1.
Another measure for the goodness of fit used by the Modeler is the likelihood
ratio test, which is also a measure that compares the fitted model with the model
including only the intercept. This statistic is asymptotically Chi-square distributed.
If the value of the likelihood ratio test is large, the predictors significantly improve
the model fit. For details see Azzalini and Scarpa (2012).
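To make these criteria tangible, here is a small Python sketch that computes McFadden's pseudo R-square and the likelihood ratio statistic from the log-likelihoods of a fitted model and the intercept-only model; the two log-likelihood values are invented for illustration.

```python
# Invented log-likelihoods of the fitted model and the intercept-only (naive) model
ll_model = -45.2
ll_null = -120.8

mcfadden_r2 = 1.0 - ll_model / ll_null      # McFadden's pseudo R-square
lr_statistic = -2.0 * (ll_null - ll_model)  # likelihood ratio statistic (Chi-square distributed)

print(mcfadden_r2, lr_statistic)            # -> about 0.626 and 151.2
```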

8.3.2 Building the Model in SPSS Modeler

A logistic regression model can be built with the Logistic node in the SPSS
Modeler. We will present how to set up the stream for a logistic regression, using
this node and the Wisconsin breast cancer dataset, which comprises data from
breast tissue samples for the diagnosis of breast cancer. A more detailed description
of the variables and the medical experiment can be gleaned from motivating
Example 1, Sect. 10.1.35 and Wolberg and Mangasarian (1990).

Description of the model

Stream name: Logistic regression
Based on dataset: WisconsinBreastCancerData.csv (see Sect. 10.1.35)
Stream structure
Important additional remarks: The target variable should be categorical. If it is continuous, a (multiple) linear regression might be more appropriate for this problem (see Chap. 5).
Related exercises: All exercises in Sect. 8.3.7

Fig. 8.22 Template stream of the Wisconsin breast cancer data

1. We open the template stream “016 Template-Stream_WisconsinBreastCancer” (see Fig. 8.22) and save it under a different name. The target variable is called “class” and takes the values 2 for “benign” and 4 for “malignant” samples.
2. Open the Type node to specify the variable types. See Sect. 2.1 for the descrip-
tion of this node. Figure 8.23 shows an example of how a Type node should look
in this case.
3. We add the Logistic node to the canvas and connect it with the Type node. We
open the Logistic node with a double-click, to set the parameters of the logistic
regression model.
4. In the Fields tab, we can select the target and input variables. For this model, the
target variable can only be categorical since we intend to build a classification
model. Here, the target variable “class” indicates if a sample is tumorous (4 for
malignant) or not (2 for benign). See Fig. 8.24 for the selection of variables for

Fig. 8.23 Detailed view of the Type node for the Wisconsin Breast Cancer data

this example. As input variables, we select all but the “SampleID”, as this
variable only names the samples, and so is irrelevant for any classification.
5. In the Model tab, we can select if the target variable is binary or has more than
two categories. See top arrow in Fig. 8.25. We recommend using the Multino-
mial option, as this is more flexible and the procedure used to estimate the model
doesn’t differ much from the Binomial, while the results are equally valid.
As in the other regression models (see Chap. 5), we can choose between
several variable selection methods. See Sect. 5.3.1 for more information on these
methods. Table 8.5 shows which variable selection methods are available for
Multinomial and Binomial logistic regression.
In the example of the breast cancer data, we choose the Backwards stepwise method, see the middle arrow in Fig. 8.25. That means that the selection process starts with a complete model, i.e., all variables included, and then removes variables step by step, until the resulting model cannot be improved upon any further.
We can also specify the base category, which is the category of the target
variables that all other variables are compared with. In other words, the base
category is interpreted as 0 for the logistic regression, and the probability of
non-occurrence in the base category is estimated by the model.
By default, each input variable will be considered separately, with no
dependencies or interactions between each other. This is the case when we

Fig. 8.24 Selection of the criterion variable and input variables for logistic regression, in the case
of the Wisconsin Breast Cancer data

select the model type “Main Effects” in the model options (see Fig. 8.25). On the
other hand, if “Full Factorial” is selected as the model type, all the possible
interactions between predictor variables are considered. This will lead to a more
complex model that can describe more complicated data structures. The model may be likely to suffer from overfitting in this situation, however, and the computational effort may increase significantly due to the number of new coefficients that have to be estimated. If we know the interactions between variables, we can
also declare them manually. This is shown in detail in Sect. 8.3.3.
If we select the Binomial logistic model, it is also possible to specify the
contrast and base category for each categorical input. This can sometimes be
useful if the categories of a variable are in a certain order, or a particular value is
the standard (base) category to which all other categories are compared. As this
is a feature for experienced analysts, however, we omit giving a detailed
description of the options here, and refer interested readers to IBM (2015b).

Fig. 8.25 Options in the model selection method

Table 8.5 List of the variable selection methods for Multinomial and Binomial logistic regression

                        Multinomial   Binomial
Enter (no selection)    X             X
(Forwards) Stepwise     X             X
Forwards                X
Backwards               X
Backwards stepwise      X             X

6. We recommend including the predictor importance calculations in the model, which can be chosen in the “Analyze” tab. See Fig. 8.26 and the references given in Sect. 5.3.3 for information on predictor importance measures. This option is only available for binary target variables.
7. Run the stream to build the model. The model nugget appears and is included in
the stream. We suggest adding an Analysis node to quickly view the hit rate and
goodness of fit statistics and thus evaluate the quality of the model. See Fig. 8.27

Fig. 8.26 The “Analyze” tab, with the option “Calculate predictor importance”

Fig. 8.27 Accuracy of the final logistic regression model



for the accuracy and Gini/AUC values of the model for the Wisconsin Breast
cancer data. Here, we have 97 % correctly classified samples and a very high
Gini value of 0.992. Both indicate a well-fitted model using the training data.
8. As we are faced with a binary classification problem, we add an Evaluation node
to the model nugget, to visualize the ROC in a graph and finish the stream. Open
the node and select the ROC graph as the chart type (see Fig. 8.28).
After clicking on the “Run” button at the bottom, the graph output pops up in a
new window. This is displayed in Fig. 8.29. The ROC is visualized with a line above the diagonal and has nearly the optimal form (recall Sect. 8.2.5), whereas the diagonal symbolizes the purely random prediction model.
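Readers who want to double-check such figures outside the Modeler can fit a comparable logistic regression in a few lines of Python. The sketch below assumes the CSV file and the column names “class” and “SampleID” as described above; the exact header spelling in the file may differ, so treat it as a template rather than a ready-made script.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score

# Column names are assumptions based on the description in the text
df = pd.read_csv("WisconsinBreastCancerData.csv")
X = df.drop(columns=["SampleID", "class"])   # drop the ID and the target
y = (df["class"] == 4).astype(int)           # 4 = malignant -> 1, 2 = benign -> 0

model = LogisticRegression(max_iter=1000).fit(X, y)
p_hat = model.predict_proba(X)[:, 1]

print("Accuracy:", accuracy_score(y, model.predict(X)))
print("AUC:", roc_auc_score(y, p_hat))
print("Gini:", 2 * roc_auc_score(y, p_hat) - 1)
```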

Fig. 8.28 Options for the ROC in the evaluation node



Fig. 8.29 Graph of the ROC of the logistic regression model on the Wisconsin breast cancer data

8.3.3 Optional: Model Types and Variable Interactions

There are three possible model types for a logistic regression. Two of these are:
“Main Effects”, where the input variables are treated individually and indepen-
dently with no interactions between them, and “Full Factorial”, where all possible
dependencies between the predictors are considered for the modeling. In the latter
case, the model describes more complex data dependencies, and itself becomes
more difficult to interpret. Since there is an increase of terms in the equation, the
calculations may take much longer, and the resulting model is likely to suffer from
overfitting.
With the third model type, i.e., “Custom” (see Fig. 8.30), we are able to
manually define variable dependencies that should be considered in the modeling
process. We can declare these in the Model tab of the Logistic node, in the bottom
field, as framed in Fig. 8.30.

1. To add a new variable combination, which should be included in the model, we select “Custom” as the model type and click on the button to the right. See the arrow in Fig. 8.30.
2. A new window pops up where the terms that should be added can be selected
(see Fig. 8.31). There are five different options for adding a term: Single interaction, Main effects, All 2-way interactions, All 3-way interactions, All 4-way interactions. Their properties are described below:

Fig. 8.30 Model type specification area in the Logistic node

– Single interaction: a term is added that is the product of all selected variables.
For example, if A, B, and C are chosen, then the term A*B*C is included in
the model.
– Main effects: each variable is added individually to the model, hence, A, B,
and C separately, for the above example.
– All *-way interaction: All possible products with variable combinations of *,
which stands for 2, 3, or 4, are inserted. In the case of “All 2-way” interactions, for example, this means that the terms A*B, A*C, and B*C are added to the logistic regression model.

3. We choose one of the five term types and mark the relevant variables for the term
we want to add, by clicking on them in the field below. In the Preview field, the
selected terms appear and by clicking on the Insert button, these terms are
included in the model. See Fig. 8.32, for an example of “All 2-way” interaction.
The window closes, and we are back in the options view of the Logistic node.
4. The previous steps have to be repeated until every necessary term is added to the
model.

Fig. 8.31 Variable interaction selection window

Fig. 8.32 Insert new term to the model



8.3.4 Final Model and Its Goodness of Fit

The estimated coefficients and the parameters describing the goodness of fit can be
viewed by double-clicking on the model nugget.

Model equation and predictor importance


The first window that opens shows the regression equations, that is, the log-odds
equation with the estimated coefficients of the input variables on the left-hand side
(see Fig. 8.33). In our example, these are two equations, one for each of the two
target categories. If the target variable has more values that it can take, then the
Modeler would estimate more equations and display them on the left-hand side,
one for each possible output. We note that for the base category, no equation is estimated, since its probability is determined by the probabilities of all other categories. It therefore follows that logistic regression will only predict the probabilities of the “non-base” categories.
In our example, the regression equation is the following:

$$\log\left(\frac{P(Y = 4 \mid x)}{P(Y = 2 \mid x)}\right) = 0.5387 \cdot \text{Clump thickness} + 0.3469 \cdot \text{Cell shape} + \ldots + 0.5339 \cdot \text{Mitosis} - 9.691$$

In the right-hand field, the predictor importance of the input variables is displayed.
The predictor importance gives the same description as it would with linear
regression; that is, the relative effect of the variables on the prediction of the
observation in the estimated model. The length of the bars indicates the importance
of the variables in the regression model, which add up to 1. For more information on

Fig. 8.33 Model equations and predictor importance



Fig. 8.34 Estimated model parameters and significance level

predictor importance, we refer to the “Multiple Linear Regression” Sect. 5.3.3, and
for a complete description of how these values are calculated, read IBM (2015a).

Estimated coefficients and their significance level


The advanced tab contains a detailed description of the modeling process, e.g., the
variable selection process, and criteria for assessing the goodness of fit of the
regression model. Additionally, this tab contains the estimated coefficients of the
final model and the significance level, together with further information, such as the
confidence interval. These latter values are summarized in a table at the bottom of
the advanced tab, and they are shown when the tab is first opened (see Fig. 8.34).
The first column on the left shows the category for which the probability is
estimated, (in this instance, it is 4). The rest of the table is built as follows: the
columns are dedicated to the input variables (here, generally named with indices)
considered in the model, and the rows describe the various statistical parameters of
the variables. The estimated coefficients are in row “B”, and the significance level is
in row “Sig.”. See Fig. 8.34. Here, most of the coefficients are clearly significant. The last two variables, Normal nucleoli (clem_var9_) and Mitosis (clem_var10_), have a significance level of about 0.09.
We would further point to the row “Exp(B)”. These values give the factors by which the odds change if the variable increases by 1. For example, the odds increase by a factor of 1.714 if the clump thickness is one unit higher. The equation to estimate the odds change therefore is

$$\frac{P(Y = 4 \mid x)}{P(Y = 2 \mid x)} = \exp(-9.591) \cdot 1.714^{x_{i1}} \cdot 1.415^{x_{i2}} \cdot \ldots \cdot 1.706^{x_{ip}}.$$
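As a small plausibility check, the following Python sketch evaluates the fitted log-odds equation for a hypothetical data record. Only the three coefficients quoted above are used; the omitted terms (indicated by “. . .” in the equation) are ignored here, so the resulting probability is purely illustrative.

```python
import math

# Coefficients quoted in the text (the remaining coefficients are omitted in this sketch)
b_clump, b_shape, b_mitosis, intercept = 0.5387, 0.3469, 0.5339, -9.691

# Hypothetical feature values of one tissue sample
clump_thickness, cell_shape, mitosis = 8, 7, 2

log_odds = intercept + b_clump * clump_thickness + b_shape * cell_shape + b_mitosis * mitosis
p_malignant = math.exp(log_odds) / (1.0 + math.exp(log_odds))  # P(Y = 4 | x)
print(log_odds, p_malignant)
```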

Fig. 8.35 Case summary statistics of logistic regression in the Wisconsin breast cancer data

Fig. 8.36 Model-finding summary of logistic regression

Summary of variable selection process and model fitting criteria


At the top of the Advanced tab, we can find a summary of the data processed when building the model. In Fig. 8.35, we see the categories (classes) together with the number of sample records that belong to them. Here, class 2 has 458 samples, and the set with 4 as the target value has 241 samples. All these records are valid, which means they have no missing data and can be used to build a regression model. Hence, the model considers 699 observations. The “Subpopulation” indicates the number of different combinations of input variables used for the model. Here, 463 different combinations of the predictors are in the whole dataset, and the footnote indicates that if there are multiple records with the same predictor values, then these records have the same target value, that is, 100 %.
In the table in Fig. 8.36, a summary of the variable selection procedure is shown.
There, for each step of this process we can see the variables that are removed in
the particular step; recall that we use the backwards method. For each of these

Fig. 8.37 Model-fitting criteria and values

variables, the model-fitting criteria value is displayed, together with test statistics
that were determined in order to evaluate if the variable should be contained in the
model or removed. Here, the variables “cell size” (clem_var3_) and “single epithe-
lial cell size” (clem_var6_) were removed, based on the −2 log-likelihood model-fitting criterion. For details, we refer to Fahrmeir (2013) and James et al. (2013).
The model-fitting information of the final model can be viewed in an additional
table, also contained in the Advanced tab. See the first table in Fig. 8.37. Here, the
model-fitting criteria is visualized, alongside scores from the model validation test
and the selected variables.
Beneath this overview, further model-fitting criteria values are listed, the pseudo
R2 values. These comprise the Cox and Snell, Nagelkerke, and McFadden R2
values. These characteristic numbers are described in the theory section (Sect.
8.3.1), and the references given there. Here, all R2 values indicate that the regres-
sion model describes the data well.

Output setting
In the Settings tab, we can specify which additional values, besides the estimated
class or category, should be included in the output of a prediction (see Fig. 8.38).
These can be “Confidences”, that is the probability of the estimated target category,
or the “Raw propensity score”, which is the estimated probability actually calcu-
lated by the model. This is the probability of the occurrence in the non-base
category. Alongside these options, all probabilities can also be appended to the
output. An example of how such an output might look is pictured in Fig. 8.39. Next
to the estimated class ($L-class), the probability for the outcome of this class ($LP-
class), the probabilities of all possible outcomes ($LP-2, $LP-4), and the probability
estimated by the model ($LRP-class) are added to the output of each inserted
sample record.

Fig. 8.38 Output definition of the model, when used for prediction

Fig. 8.39 Output calculated by logistic regression

Fig. 8.40 Prediction of categories with logistic regression

8.3.5 Classification of Unknown Values

Predicting classes for new records, i.e., applying the logistic regression model to an unknown dataset, is done just as in linear regression modeling (see Sect. 5.2.5). We
copy the model nugget and paste it into the modeler canvas. Then, we connect it to
the Source node with the imported data that we want to classify. Finally, we add an
Output node to the stream, e.g., a Table node, and run it to obtain the predicted
classes. The complete prediction stream should look like Fig. 8.40.

8.3.6 Cross-Validation of the Model

Cross-validation is a standard concept for validating the goodness of fit of the model when processing new and unknown data. More precisely, a model might
describe the data it is based on very well, but be unable to predict the correct
categories for unknown data, which are independent of the model. This is a classic
case of overfitting (see Sect. 5.1.2), and cross-validation is needed to eliminate this
phenomenon.
If the test data are in a separate data file, then cross-validation can be performed by classifying the unknown values, see Sect. 8.3.5, but instead of using a Table node for output, we should use the Analysis node to get the hit counts.
If our initial dataset is large enough however, such that it can be divided into a
training set and a test set, we can include the cross-validation in the model building
process. Therefore, only the Partition node has to be added to the stream in
Sect. 8.3.2.
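Readers who want to mimic this partitioning logic outside the Modeler can do so with a few lines of Python; the sketch below uses scikit-learn's built-in breast cancer data as a stand-in for the book's CSV file and fixes the random seed, the analogue of fixing the seed in the Partition node.

# Sketch: 70/30 partition with a fixed seed and a simple hold-out validation.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=42, stratify=y)   # fixed seed -> reproducible split

model = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_train, y_train)
print("hit rate (training):", round(model.score(X_train, y_train), 3))
print("hit rate (test):", round(model.score(X_test, y_test), 3))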

Description of the model


Stream name Logistic regression cross_valildation
Based on dataset WisconsinBreastCancerData.csv (see Sect. 10.1.35)
Stream structure

Related exercises: 1, 2

1. We consider the stream for the logistic regression as described in Sect. 8.3.2 and
add a Partition node to the stream in order to split the dataset into a training
set and a test set. This can be, for example, placed before the Type node. See
Sect. 2.7.7 for a detailed description of the Partition node. We recommend using
70–80 % of the data for the training set and the rest for the test set. This is a
typical partition ratio. Additionally, let us point out that the model and the hit
rates depend on the randomly selected training data. To get the same model
in every run, we fix the seed of the random number generator in the Partition
node.

Fig. 8.41 Definition of the Partitioning field in the Logistic node

2. In the Logistic node, we have to select the field that indicates an affiliation to the
training set or test set. This is done in the Fields tab (see Fig. 8.41).
3. Afterwards, we mark the “Use partitioned data” option in the Model tab (see
Fig. 8.42). Now, the modeler builds the model only on the training data and uses
the test data for cross-validation. All other options can be chosen as in a normal
model building procedure and are described in Sect. 8.3.2.
4. After running the stream, the summary and goodness of fit information can be
viewed in the model nugget that now appears. These are the same outputs as in Sect.
8.3.4, although fewer data were used in the modeling procedure. See
Fig. 8.43 for a summary of the data used in the modeling process. Here, we see
that the total number of samples has been reduced to 491, which is about 70 % of the whole
Wisconsin dataset.
Since fewer data are used to train the model, all parameters, the number of
included variables and the fitting criteria values change. See Fig. 8.44 for the
parameters of this model. We note that, compared with the model built on the
whole dataset (see Sect. 8.3.4), this model, trained on 70 % of the data, contains
only 6 rather than 7 predictor variables.

Fig. 8.42 Use the partitioned data option in the Logistic node

Fig. 8.43 Summary of the data size used in the modeling process with the training data

5. In the Analysis node, we mark the "Separate by partition" option to calculate the
hit rates for each partition separately, see Fig. 8.45. This option is enabled by
default. The output of the Analysis node can be viewed in Fig. 8.14, which
shows the hit rates and further statistics for both the training set and the test set.

Fig. 8.44 Model parameters of the model built on a subset of the original data

Fig. 8.45 Options in the Analysis node with “Separate by partition” selected

In our example of the Wisconsin Breast Cancer data, the regression model
classifies the data records in both sets very well, with a hit rate of over
97 % and a Gini of over 0.98.

Fig. 8.46 Settings in the Evaluation node. The “Separate by partition” option is enabled, to treat
the training data and test data separately

6. In the Evaluation node, we also enable the "Separate by partition" option, so that
the node draws an ROC curve for the training set and the test set separately (see
Fig. 8.46). These plots can be viewed in Fig. 8.47.
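The separate evaluation of the two partitions can also be reproduced outside the Modeler; the following self-contained sketch (again with scikit-learn's built-in breast cancer data standing in for the book's dataset) draws one ROC curve per partition, which is the idea behind the "Separate by partition" option.

# Sketch: ROC curves for the training and the test partition, drawn separately.
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.7, random_state=42, stratify=y)
model = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_tr, y_tr)

for name, Xp, yp in [("training", X_tr, y_tr), ("test", X_te, y_te)]:
    scores = model.predict_proba(Xp)[:, 1]
    fpr, tpr, _ = roc_curve(yp, scores)
    plt.plot(fpr, tpr, label=f"{name} (AUC = {roc_auc_score(yp, scores):.3f})")

plt.plot([0, 1], [0, 1], linestyle="--")   # chance line
plt.xlabel("false positive rate")
plt.ylabel("true positive rate")
plt.legend()
plt.show()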

Fig. 8.47 Output of the Evaluation node for the logistic regression of Wisconsin breast cancer
data. ROC for the training sets and test sets separately

8.3.7 Exercises

Exercise 1 Prediction of credit ratings


Banks use demographic and loan history data to decide if they will offer credit to a
customer. Use the tree_credit.sav data, which contains such information, to predict
a “good” or “bad” credit rating for a customer (see Sect. 10.1.33).

1. Import the data and prepare a training—testing situation.


2. Build a logistic regression model, with a stepwise variable selection mechanism,
to predict the credit rating.
3. What are the variables included in the model and which one of them has the most
importance? Determine the logistic regression equation for the log-odds.
4. Calculate the performance evaluation measures accuracy and Gini for both the
training set and the test set. Is the model overfitted?
5. Visualize the Gini measure with the ROC. Furthermore, plot the Gains curve
with the Evaluation node. What does this node visualize? You can search the
internet to find the answer.

Exercise 2 Prediction of Titanic survival and dealing with missing data


The titanic.xlsx file contains data on Titanic passengers, including an indicator
variable “survived”, which indicates if a particular passenger survived the Titanic
sinking (see Sect. 10.1.32). Your task in this exercise is to build a model with cross-
validation, which decides from the ticket and personal information of each passen-
ger, whether they survived the Titanic tragedy.

1. Import and inspect the Titanic data. How many values are missing from the
dataset and in which variable fields?
2. Build a logistic regression model to predict if a passenger survived the Titanic
tragedy. What is the accuracy of the model and how are the records with missing
values handled?
3. To predict an outcome for the passengers with missing values, the blanks have to
be replaced with appropriate values. Use the Auto Data Prep node to fill in the
missing data. Which values are inserted in the missing fields?
4. Build a second logistic regression model on the data without missing values. Has
the accuracy improved?
5. Compare the model with and without missing values, by calculating the Gini and
plotting the ROC curve. Use the Merge node to show all measures in an Analysis
node output and draw the ROC of both models in one graph.

Exercise 3 Multiple Choice questions


1. Let the coefficient of the input variable x be β = 2. What is the factor with which the odds change when x is increased by 1?
□ 2   □ 0.5   □ exp(2)   □ exp(0.5)   □ log(2)   □ log(0.5)
2. Which of the following are valid variable selection methods for Logistic Regression?
□ Stepwise   □ Information criterion (AICC)   □ Backwards   □ Forwards
3. What are properties of a Logistic Regression Model?
□ It is a linear classifier   □ It is robust to outliers   □ Has a probabilistic interpretation   □ No problems with collinearity   □ The distribution of the target variable is irrelevant for the modelling process
4. Consider a Multinomial logistic regression with 3 target categories A, B, C, and C as base class. Let 0.34 be the odds of class A and 0.22 the odds of class B. Which are the correct probabilities of the target categories (pA, pB, pC)?
□ pA = 0.22   □ pA = 0.14   □ pB = 0.22   □ pB = 0.64   □ pC = 0.64   □ pC = 0.22
The following questions are yes or no questions. Please mark the correct answer.
5. Logistic Regression is a special case of a Generalized Linear Model (GLM).   □ Yes   □ No
6. The Logistic Regression outputs the target class directly.   □ Yes   □ No
7. Logistic Regression is a nonparametric classifier.   □ Yes   □ No
8. If the target variable has K categories, then the Multinomial logistic regression model consists of K−1 single binary models.   □ Yes   □ No

8.3.8 Solutions

Exercise 1 Prediction of credit ratings


Name of the solution streams tree_credit_logistic_regression
Theory discussed in section Sect. 8.2
Sect. 8.3.1
Sect. 8.3.6

The final stream for this exercise is shown in Fig. 8.48.

1. We start by opening the stream “000 Template-Stream tree_credit”, which


imports the tree_credit data and already has a Type node attached to it, and
save it under a different name (see Fig. 8.49).

Fig. 8.48 Stream of the credit rating via logistic regression exercise

Fig. 8.49 Template stream for the tree_credit data



To set up a cross-validation with training data and testing data, we add a


Partition node to the stream and place it between the Source node and Type
node. Then, we open the node and define 70 % of the data as training data and the
remaining as test data. See Sect. 2.7.7 for the description of the Partition node.
2. We start the modeling process by opening the Type node and defining the
variable “Credit rating” as the target variable. Furthermore, to calculate the
performance measures and in particular the Gini at the end, we set the measure-
ment type of the target as “Flag”. See Fig. 8.50.
Now, we add a Logistic node to the stream, by connecting it to the Type node.
We observe that the name of the node is “Credit rating”, which shows us that the
node has identified this variable as its target.
We open the Logistic node and choose the “Stepwise” variable selection
method in the Model tab. Furthermore, to enable the cross-validation process,
we make sure the “Use partitioned data” option is activated (see Fig. 8.51).
Since the input variables’ importance is needed for the next task, we check the
box “Calculate predictor importance” in the Analysis tab (see Fig. 8.52).
Now we run the stream and the model nugget appears.
3. To inspect the training model and identify the included variables, we open the
model nugget. In the Model view, we see that the three variables "Income
level", "Number of credit cards", and "Age" are included in the model. Of these
variables, the income level has the highest predictor importance. See the right-
hand side of Fig. 8.53.
On the left-hand side of Fig. 8.53, we identify the regression equation in terms
of the log-odds. The equation is, in our case

Fig. 8.50 Definition of the variable “Credit rating” as the target variable in the Type node

Fig. 8.51 Definition of the stepwise variable selection method and cross-validation process

Fig. 8.52 Enabling of predictor importance calculations



Fig. 8.53 Predictor importance and regression equation

log( P(CR = Good | x) / P(CR = Bad | x) ) = −1.696 + 0.1067 · Age
    + { 1.787, if IL = High;   0, if IL = Medium;   −1.777, if IL = Low }
    + { −2.124, if NCC = 5 or more;   0, if NCC = Less than 5 }

where CR represents the “Credit rating”, IL the “Income level”, and NCC the
“Number of credit cards”.
All these predictors are significant for the model, as can be seen in the
Advanced tab at the bottom of the table (see Fig. 8.54).
4. To calculate the performance measures and to evaluate our model, we add an
Analysis node to the stream and choose the coincidence matrix and AUC/Gini
options. See Sect. 8.2.6 for a description of the Analysis node. We run the
stream. See Fig. 8.55 for the output of the Analysis node and the evaluation
statistics.
We note that the hit rates in both the training set and the test set are nearly
the same, at a bit over 80 %. Furthermore, the Gini and AUC are close to each
other in both cases, at about 0.78. All these statistics suggest good prediction
accuracy and a non-overfitted model (a sketch of how these measures can be
computed outside the Modeler follows at the end of this solution).
5. We visualize the ROC by adding an Evaluation node to the stream and
connecting it to the model nugget. We open it and choose the ROC chart option,

Fig. 8.54 Significance of the predictor variables in the model

Fig. 8.55 Evaluation statistics of the logistic regression model for predicting credit rating

as shown in Fig. 8.28. Figure 8.56 shows the ROC of the training dataset and test
dataset. Both curves are similarly shaped.
Now we add a second Evaluation node to the model nugget, open it, and
choose “Gains” as our chart option. Furthermore, we mark the box “Include best

Fig. 8.56 ROC of the credit card rating classifier

line”. See Fig. 8.57. Then, we click the “Run” button at the bottom. The output
graph is displayed in Fig. 8.58.
The gain/lift chart visualizes the effectiveness of a model and shows the
amount of positive samples obtained with, and lost by, the model. More pre-
cisely, the gain chart in Fig. 8.58 displays three lines: the diagonal, also called
baseline, and two lines representing the gains of the logistic regression model
(middle line) and the optimal model (top line). The gain chart is drawn as
follows. All customers are sorted according to their score assigned by the logistic
regression (or another classification model). On the x-axis, the percentage of top
scored customers are shown, while the y-axis displays the percentage of all
“Good” rated customers which are in the top scored customers. If we take, for
example, the top 40 % scored customers from the test set, we have identified
about 60 % of all “Good” rated customers with the regression model, while in the
optimal case, we would detect about 65 %. So, we gained 60 %, but lost 5 % of
the “Good” rated customers. In the test set, there are 61 % “Good” rated
customers, but when taking the best 61 % scored, we identified about 83 % of
them, which is still very good.
If we think the other way around and we want to contain 80 % of the “Good”
rated customers in our sub-sample of all customers, we have to select the 57 %
customers that were scored best.
For additional information on the gain/lift chart and its interpretation, we refer
to Kuhn and Johnson (2013).
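As announced in step 4, the following sketch shows how accuracy, AUC, Gini (= 2·AUC − 1), and a cumulative gains value could be computed by hand from predicted scores. The scores here are simulated, since the actual tree_credit scores live inside the Modeler; with real data, they would be the probabilities produced by the fitted classifier.

# Sketch: accuracy, AUC, Gini and cumulative gains computed from (simulated) scores.
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)            # 1 = "Good" rating (simulated)
scores = 0.3 * y_true + 0.7 * rng.random(1000)    # higher scores for "Good", with overlap

acc = accuracy_score(y_true, scores > 0.5)
auc = roc_auc_score(y_true, scores)
print(f"accuracy = {acc:.3f}, AUC = {auc:.3f}, Gini = {2 * auc - 1:.3f}")

# Cumulative gains: sort by score, then take the cumulative share of positives found.
order = np.argsort(-scores)
gain = np.cumsum(y_true[order]) / y_true.sum()
print("share of 'Good' customers found in the top 40 % scored:",
      round(gain[int(0.4 * len(y_true)) - 1], 3))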

Fig. 8.57 Option to draw a Gains chart in the Evaluation node

Fig. 8.58 Gains of the credit card rating classifier



Exercise 2 Prediction of Titanic survival and dealing with missing data


Name of the solution streams Titanic_missing_data_logistic_regression
Theory discussed in section Sect. 8.2
Sect. 8.3.1
Sect. 8.3.6
Sect. 3.2.5 (SuperNode)

The final stream for this exercise is shown in Fig. 8.59. Here, the two models that
are built in this exercise are compressed into SuperNodes, so the stream is more
structured and clear.

1. We open the “017 Template-Stream_Titanic” and save it under a different name.


This stream already contains the Partition node, which splits the data into a
training set (70 %) and a test set (30 %). See Fig. 8.60.
We run the stream, and the data screening output of the Data Audit node
appears. In the Quality tab, we see that missing values appear in the fields
"Age", "Fare", "Cabin", and "Embarked". See Fig. 8.61. The missing values in
the "Cabin" field, however, are empty strings rather than truly absent entries,
so they are not really missing values. The variables "Fare" and "Embarked"

Fig. 8.59 Stream of the Titanic survival prediction exercise

Fig. 8.60 Template stream of the Titanic dataset



Fig. 8.61 Quality output with missing value inspection of the Titanic data

Fig. 8.62 Stream of the logistic regression classifier with missing values in the Titanic data

furthermore have only very few missing values, but the “Age” field is only 80 %
filled. This latter variable can become problematic in our further analysis.
2. We now build the logistic regression classifier on the data with missing values.
Figure 8.62 shows the complete part of this stream. This is the part that is
wrapped up in a SuperNode named “with missing data”.
First, we open the Type node and declare the “Survived” variable as the target
field and set the measurement of it to “Flag”. This will ensure that the “Sur-
vived” variable is automatically selected as the target variable in the Modeling
node. See Fig. 8.63. Then, we add a Logistic node to the stream and connect it to
the Type node. After opening it, we specify “stepwise” as our variable selection
method and enable the use of partitioned data and the predictor importance
calculations. See Figs. 8.51 and 8.52 for the set up of these options in the
Logistic node.
Now, we run the stream and the model nugget appears, which we then open.
In the Model tab (see Fig. 8.64), we note that the variables “Sex”, “Pclass”,
“Embarked”, and “Age” are selected as predictors by the stepwise algorithm.

Fig. 8.63 Definition of the “Survived” field as the target and its measurement as “Flag”

Fig. 8.64 Input variables and their importance for predicting the survival of a passenger on the
Titanic

Fig. 8.65 Performance measures of the Titanic classifier with missing values

The most important of them is the variable “Sex”, and we see by looking at the
regression equation on the left that women had a much higher probability of
surviving the Titanic sinking, as their coefficient is 2.377 and the men’s coeffi-
cient is 0. Furthermore, the variables “Age” and “Embarked”, which contain
missing values, are also included in the classifier.
To evaluate the model, we add the usual Analysis node to the model nugget, to
calculate the standard goodness of fit measures. These are the coincidence
matrix, AUC and Gini. See Sect. 8.2.6. After running the stream, these statistics
can be obtained in the opening window. See Fig. 8.65. We observe that the accuracies
in the training set and test set are similar (about 63 %). The same holds true for
the Gini (0.53). This is a good indicator of non-overfitting. In the coincidence
matrix, however, we see a column named $null$. These are the passenger
records with missing values. All these records are non-classifiable, because a
person's age, for example, is necessary for the prediction; such a record is therefore
treated as 0, i.e., as a non-survivor. This obviously reduces the
goodness of fit of our model. So, we have to assign a value to the missing fields,
in order to predict the survival status properly.

Fig. 8.66 Renaming the prediction fields for the model with missing values

Fig. 8.67 Stream of the logistic regression classifier and missing value data preparation in the
Titanic data

To distinguish between the predictions of the model with missing values and those of
the model built afterwards without missing values, we change the name of the prediction
fields with a Filter node. To do this, we add a Filter node to the model nugget
and rename these fields, as seen in Fig. 8.66.
3. Now, we use the Auto Data Prep node to replace the missing values and predict
their outcome properly. Figure 8.67 shows this part of the stream, which is
wrapped in a SuperNode named “mean replacement”.
We add an Auto Data Prep node to the Type node and open it. In the Settings
tab, the top box should be marked in order to perform data preparation. Then, we
tick the bottom three boxes under “Inputs”, so the missing values in continuous,

Fig. 8.68 Auto Data Prep node to replace missing values

nominal, and ordinal fields are replaced with appropriate values. Since the
variables “Age” and “Fare” are continuous, the missing values are replaced by
the mean of the field’s values, and the missing embarked value is replaced by the
field's mode. See Fig. 8.68. Furthermore, we enable the standardization of
continuous fields. This is not obligatory, but recommended, as it typically
improves the prediction and at least does not worsen it (a sketch of this
preparation step outside the Modeler is given at the end of this solution).
After finishing these settings, we click the Analyze Data button at the top,
which starts an analysis process on the data-prepared fields. In the Analysis tab,
the report of the analyzed data can be viewed. We see, e.g., for the “Age”
variable, that in total 263 missing values were replaced by the mean, which is
29.88. Moreover, the age variable was standardized, as can be seen in the
top right histogram. See Fig. 8.69.
4. To build a classifier on the prepared data, we add another Type node to the
stream, to ensure that the Model node recognizes the input data correctly.
Afterwards, we add a Logistic node to the stream, connect it to the Type node
and choose the same options as in the first Logistic node. See step 2 for details.
Afterwards, we run the stream and the Model nugget appears.
We note that “Sex”, “Pclass”, and “Age” are still included as input variables
in the model, with nearly the same prediction importance. The Embarked
variable is now replaced by “sibsp”, however. This is due to the more powerful
“Age” variable, which is now able to influence more passenger records, as it now
contains complete data. See Fig. 8.70.

Fig. 8.69 Analysis of the “Age” field after data preparation by the Auto Data Prep. node

Fig. 8.70 Input variables and their importance for predicting the survival of a passenger on the
Titanic after data preparation

Fig. 8.71 Performance measures of the Titanic classifier without missing values

To view the Gini and accuracy, we add an Analysis node to the stream and set
up the usual options (see step 2 and Sect. 8.2.6). The output of the Analysis node
can be viewed in Fig. 8.71. We see that no missing values are left in the data and
thus all passengers can be classified with higher accuracy. Consequently, both
the accuracy and Gini have improved, when compared to the model with missing
values (see Fig. 8.65). So, the second model has a higher prediction power than
the one that ignores the missing values.
5. As in the previous model, we add a Filter node to the stream and connect it to the
model nugget; this Filter node allows us to change the name of the prediction
outcome variables by adding “mean repl” as a suffix. See Fig. 8.72.
Now, we add a Merge node to the stream canvas and connect the two Filter
nodes to it. In the Filter tab of the Merge node, we cross out every duplicate field,
to eliminate conflicts from the merging process. See Fig. 8.73.
Now, we add an Evaluation node and an Analysis node to the stream and
connect them to the Merge node. See Fig. 8.74. The settings of these nodes are as
usual. See Sect. 8.2.6 for a description of the Analysis node. In the Evaluation
node, we choose ROC as the chart type, see Fig. 8.28.

Fig. 8.72 Renaming of the prediction fields for the model without missing values

Fig. 8.73 Elimination of duplicate fields in the Merge node



Fig. 8.74 Output nodes are connected to the Merge node

Fig. 8.75 ROC of the logistic regression classifiers with and without missing values

In Fig. 8.75, the ROC of the model is shown, with and without missing values
considered. As can be seen, the ROC of the model without missing values lies
noticeably above the ROC of the classifier that ignores the missing values. Hence, the
model trained on the prepared data has the better prediction ability.
In the Analysis output, the performance measures of both models are stated in
one output. Additionally, the predictions of both models are compared with each
other. This part of the Analysis output is displayed in Fig. 8.76. The first table
shows the percentage of equal-predicted classes. In other words, the table shows
how many passengers are classified as survivor or non-survivor by both
classifiers. The second table then takes the commonly classified records
(passengers) and compares their prediction with the actual category. Here, of
the equally classified passengers, about 80 % are correct.
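As mentioned in step 3, the replacement of missing values by the field mean (or mode) and the standardization of continuous fields can be reproduced with standard Python tools; in the sketch below, the column names follow the Titanic fields referred to in the text, while the file path and the exact spelling of the column names are assumptions.

# Sketch of the Auto Data Prep behaviour: mean imputation plus standardization for
# continuous fields, mode imputation for a categorical field. Path and column
# spellings are assumptions.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

df = pd.read_excel("titanic.xlsx")

numeric = ["Age", "Fare"]
categorical = ["Embarked"]

prep = ColumnTransformer([
    ("num", make_pipeline(SimpleImputer(strategy="mean"), StandardScaler()), numeric),
    ("cat", SimpleImputer(strategy="most_frequent"), categorical),
])
prepared = prep.fit_transform(df[numeric + categorical])
print(prepared[:5])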

Fig. 8.76 Comparison of the two models in the Analysis output

Exercise 3 Multiple Choice questions


Theory discussed in section Sect. 8.3.1

1. The odds change by a factor of exp(2).


2. The variable selection methods are Forward, Stepwise, and Backwards. The
AICC is a criterion that compares models with each other and is thus involved in
the variable selection, but not a method itself.
3. The correct answers are:

– It is a linear classifier.
– It is robust to outliers
– Has a probabilistic interpretation.

If the input variables are highly correlated, the performance of the logistic
regression can suffer.
4. The right probabilities are:

pA = 0.22, pB = 0.14 and pC = 0.64.

The calculation is as follows. Starting with the odds equations

pA / pC = 0.34 and pB / pC = 0.22,

we obtain

pA = 0.34 · pC and pB = 0.22 · pC.

Since pA + pB + pC = 1, we get

1 = 0.34 · pC + 0.22 · pC + pC = 1.56 · pC

and finally

pC = 0.64.

Putting the value of pC into the odds equations, we get

pA = 0.22 and pB = 0.14.

(A short numerical check of this calculation is sketched after these answers.)

5. Yes. The Logistic regression is a GLM. See Sect. 5.4.


6. No. The logistic regression calculates a probability for each target class instead
of predicting the class directly.
7. No. Logistic regression falls in the category of parametric classifiers, as
coefficients have to be estimated which then define the model.
8. Yes. For each non-base target category, a binary model is calculated versus the
base category.
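The calculation in answer 4 can be checked with a few lines of Python, converting the odds relative to the base class into probabilities:

# Numerical check of answer 4: odds relative to base class C -> class probabilities.
odds = {"A": 0.34, "B": 0.22}        # odds p_A/p_C and p_B/p_C
p_C = 1 / (1 + sum(odds.values()))   # follows from p_A + p_B + p_C = 1
probs = {k: round(v * p_C, 2) for k, v in odds.items()}
probs["C"] = round(p_C, 2)
print(probs)                          # -> {'A': 0.22, 'B': 0.14, 'C': 0.64}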

8.4 Linear Discriminant Classification

Linear discriminant analysis is one of the oldest classifiers and goes back to Fisher
(1936) and Welch (1939) and their work in biology. It is still one of the best-known
and most widely used classifiers. For example, the linear discriminant
classifier (LDC) is very popular with banks, since it works well in the area of credit
scoring. When used correctly, the LDC provides great accuracy and robustness. The
LDC only works properly, however, if the data are normally distributed within the target
categories. So, the LDC makes strict assumptions and is therefore not applicable
to all data. The LDC furthermore uses a linear approach and tries to find linear
functions that describe an optimal separation of the classes. The theory of the LDC is
described in more detail in the subsequent section.

8.4.1 Theory

The idea of the LDC goes back to Fisher (1936) and Welch (1939), who developed the
method independently of each other and with different approaches. We here describe
Fisher's approach, called Fisher's linear discriminant method. The key idea
behind the method is to find linear combinations of the input variables that separate
the classes in an optimal way. See, e.g., Fig. 8.77. The LDC chooses a linear
discriminant such that it maximizes the distance of the classes from each other,

Fig. 8.77 Separation of


classes with linear decision
boundaries

while at the same time minimizing the variation within. In other words, a linear
function is estimated that best separates the distributions of the classes from each
other. This algorithm is outlined in the following binary case. For a more detailed
description, we recommend Runkler (2012), Kuhn and Johnson (2013), Tuffery
(2011) and James et al. (2013).
Consider that the target variable is binary. This method now finds the optimal
linear separator of the two classes in the following way: First, it calculates the mean
of each class. The linear separator now has to go through the mid-point between
these two means. See the top left graph in Fig. 8.78. There, a cross symbolizes the
mid-point, and the decision boundary has to go through this point, just like the two
possibilities in the graph. To find the optimal discriminant function, the data points
are projected along the candidate directions. On the projected data, the "within classes"
variance and "between classes" variance are calculated, and the linear separator
with minimal variance within the projected classes, and simultaneously
high variance between the projected classes, is picked as the decision boundary.
This algorithm is visualized in Fig. 8.78. In the top right graph, the data are
projected on the two axes. As can be seen, the class distributions of the projected
data overlap and thus the classes are not clearly distinguishable. So the discriminant
functions, parallel to the axes, are not optimal separators. In the bottom left graph,
however, the data points are projected onto the dotted line. This separates the
distributions better, as can be seen in the bottom right graph. Hence, the solid
black line is chosen as the linear discriminant function.
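In the binary case, the direction onto which Fisher's method projects can be written in closed form as w ∝ S_W^(−1)(m_1 − m_0), where S_W is the pooled within-class scatter matrix and m_0, m_1 are the class means. The following small NumPy sketch illustrates this on simulated two-dimensional data; it is purely illustrative and not the Modeler's implementation.

# Illustrative sketch of Fisher's rule in the binary case.
import numpy as np

rng = np.random.default_rng(0)
cov = [[1.0, 0.3], [0.3, 1.0]]
X0 = rng.multivariate_normal([0, 0], cov, size=100)   # class 0
X1 = rng.multivariate_normal([2, 1], cov, size=100)   # class 1

m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
Sw = np.cov(X0, rowvar=False) * (len(X0) - 1) + np.cov(X1, rowvar=False) * (len(X1) - 1)
w = np.linalg.solve(Sw, m1 - m0)       # discriminant direction
threshold = w @ (m0 + m1) / 2          # boundary passes through the mid-point of the means

pred = (np.vstack([X0, X1]) @ w > threshold).astype(int)
truth = np.r_[np.zeros(len(X0), int), np.ones(len(X1), int)]
print("training accuracy of the Fisher rule:", (pred == truth).mean())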
Compared to logistic regression, LDC makes more strict assumptions on the
data, but if these are fulfilled it typically gives better results, particularly in

Fig. 8.78 Visualization of the LDA algorithm

accuracy and stability. For example, if the classes are well separated, then logistic
regression can have a surprisingly unstable parameter estimate and thus perform
particularly poorly. On the other hand, logistic regression is more flexible, and thus
it is still preferred in most cases. As an example, linear discriminant analysis
requires numeric/continuous input variables, since the covariance, and thus the
distance between data points, has to be calculated. This is a drawback of the LDC
when compared with logistic regression, which can also process categorical input data.

Necessary conditions
Besides requiring numeric input variables, Fisher's discriminant algorithm,
and thus the LDC, requires the data in each category to be approximately
normally distributed in order to work properly. If they are not, the data first have
to be transformed towards normality before the LDC can be applied.

Predictor selection methods


Instead of including all input variables in the final discriminant model, the Modeler
also provides the "Stepwise" selection method, in order to find the most relevant
predictors for the model. See Sect. 5.3.1 for a description of the Stepwise method.

8.4.2 Building the Model with SPSS Modeler

Description of the model


Stream name Linear discriminate analysis
Based on dataset Iris.csv (see Sect. 10.1.18)
Stream structure

Important additional remarks:


The target variable should be categorical. If it is continuous, a (multiple) linear regression might
be more appropriate for this problem (see Chap. 5). Furthermore, the distribution of the data
within each category has to be approximately Gaussian.
Descriptive pre-analysis should be done before building the model, in order to get insight into the
data, the distribution of the input variables and the possible importance of classifying variables.
Cross-validation is included in the Model node, and it can be performed as described in the
Logistic Regression section of Sect. 8.3.6.
Related exercises: All exercises in Sect. 8.4.4

The Iris dataset contains data on 150 flowers from three different Iris species,
together with width and length information on their petals and sepals. We want to
build a discriminant classifier that can assign the species of an Iris flower, based
on its sepal and petal width and length. The Discriminant node can automatically
perform a cross-validation when a training set and test set are specified. We will
make use of this feature. This is not obligatory however; if the test set comes from
external data, then a cross-validation process will not be needed.

1. To build the above stream, we first open the stream “012 Template-
Stream_Iris” (see Fig. 8.79) and save it under a new name. This stream already
contains a Partition node that divides the Iris dataset into two parts, a training
and a test set in the ratio 70 % to 30 %.

Fig. 8.79 Template stream


of the Iris data

Fig. 8.80 Variable inspection of the Iris data

2. Next, we run the stream and observe the output of the Audit node. We see that
all the continuous variables are nearly Gaussian distributed, which is necessary
for the discriminant analysis to work properly. See Fig. 8.80.
3. To get a deeper insight into the Iris data, we add several Graphboard nodes
to the stream by connecting them to the Type node. We use these nodes to
visualize the data in scatterplots and boxplots. Further descriptive analysis
could be considered, however. Here, we just inspected the two petal variables
in one scatterplot and two boxplots. See Figs. 8.81, 8.82, and 8.83. As can be
seen in the scatterplot, the petal width and length cluster the data by species.
Furthermore, the petal widths and lengths of the diverse species vary in
different ranges. This is evidenced by the boxplots. These descriptive analyses
indicate that a species-detecting classifier can be built.
4. We add the Discriminant node to the Modeler canvas and connect it to the
Partition node. After opening the former, we specify the target, partition, and
input variables in the Field tab. Thereby, the species variable is chosen as the
target, the Partition field as the partition and all four remaining variables are
chosen as input variables. See Fig. 8.84.
5. In the Model tab, we enable the usual “Use partition data” option, in order to
ensure that the training set and the test set are treated differently in their
purpose, for the cross-validation technique. See Fig. 8.85.
Furthermore, we choose the “Stepwise” method, to find the optimal input
variables for our classification model. See the bottom arrow in Fig. 8.85.
6. In the Analyze tab, we also enable the predictor importance calculation option.
See Fig. 8.86.
7. In the Expert tab, further parameters can be specified, in particular, additional
outputs, which will be displayed in the model nugget. These include the

Fig. 8.81 Scatterplot of the Iris data. Species’ are visualized with different shapes and colors

Fig. 8.82 Boxplot of the petal length



Fig. 8.83 Boxplot of the petal width

Fig. 8.84 Specification of the input, partition, and target variables

parameters and statistics of the feature selection process and estimated discrim-
inant functions. See Fig. 8.87 for the output options. We omit here the descrip-
tion of every possible additional output and refer the reader to IBM (2015b) for
more information. We recommend trying several of these options, however, in

Fig. 8.85 Model options for


the discriminant analysis

Fig. 8.86 Predictor


importance in the
discriminant node

order to see the additional information they provide about the model and its
building process.
8. Now we run the stream, and the model nugget appears on the canvas.
9. We add an Analysis node to the model nugget, to evaluate the goodness of
fit of our discriminant classifier. Since the Iris data has a triple-valued
target variable, we cannot calculate the AUC and Gini index (see
Sect. 8.2.5), but we intend to evaluate the model based on all other measures
(Fig. 8.88).
10. After pressing the “Run” button in the Analysis node, the output statistics
appear and are shown in Fig. 8.89. A look at these model evaluation statistics
shows that the model has very high accuracy: 97 % of the Iris flowers in the
training set and 100 % in the test data are correctly classified. In absolute
numbers, only three of the 150 flowers are miscategorized.
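For comparison, essentially the same experiment can be run outside the Modeler in a few lines; the sketch below fits scikit-learn's LinearDiscriminantAnalysis to its built-in Iris data (standing in for Iris.csv) with a 70/30 hold-out split.

# Sketch: linear discriminant classifier on the Iris data with a 70/30 hold-out split.
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.7, random_state=1, stratify=y)

lda = LinearDiscriminantAnalysis().fit(X_tr, y_tr)
print("accuracy (training):", round(lda.score(X_tr, y_tr), 3))
print("accuracy (test):", round(lda.score(X_te, y_te), 3))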

Fig. 8.87 Advanced output


options in the
discriminant node

Fig. 8.88 The Analysis node


and the selected output
variables

Fig. 8.89 The output statistics in the Analysis node for the Iris dataset

8.4.3 The Model Nugget and the Estimated Model Parameters

The main information on the built model is provided in the Model tab and the
Advanced tab, the contents of which are roughly described hereafter.

Model tab
When opening the model nugget, first the Model tab is shown with the predictor
importance, provided its calculation was enabled in the Discriminant node. As
presumed by descriptive analysis of the Iris data (see Figs. 8.81, 8.82, and 8.83),
the variables “Petal Length” and “Petal Width” are the most important for
differentiating between the Iris species. See Fig. 8.90.

Advanced tab
The Advanced tab comprises statistics and parameters from the estimated model
and the final parameters. Most of the outputs in the Advanced tab are very technical

Fig. 8.90 Predictor importance of the linear discriminant analysis for the Iris data

and are only interpretable with extensive background and mathematical knowledge.
We therefore keep the explanations of the parameters and statistics rather short and
refer the reader to IBM (2015b) and Tuffery (2011) for further information if
desired.
The outputs chosen in the output options of the Expert tab in the Discriminant
node are displayed in this tab of the model nugget (see Sect. 8.4.2). Here, we will
briefly describe just the standard reports and statistics of the Advanced tab. There
are several further interesting and valuable statistical tables that can be displayed,
however, and we recommend trying some of these additional outputs.
The first two tables in this tab give a summary of the groups, i.e., the samples
with the same category. These include, e.g., the number of valid sample records in
each group and the number of missing values.
The next two tables describe the estimated discriminant functions: their number, their
quality, and their contribution to the classification. Here, two linear
functions are needed to separate the three Iris species properly. See Fig. 8.91. The
calculation of the parameters of the linear functions can be traced back to an eigenvalue
problem, which then has to be solved; see once again Runkler (2012) and Tuffery
(2011). The first table shows the parameters of the eigenvalue estimation (see
Fig. 8.91), where the eigenvalue quantifies the discriminating ability.

Fig. 8.91 Quality measures and parameters of the estimated discriminants, including the
eigenvalues and significance test

Fig. 8.92 Matrix of the estimated coefficients of the discriminant function

In the column “% of Variance”, the percentage of the discriminating ability is


stated, which is basically the proportion of the eigenvalue compared with the sum of
eigenvalues. Here, the first discriminant function explains 99.3 % of the discrimi-
nation and is therefore by far the most important of the two functions.
The last column in this table holds the canonical correlations of the predictor
variables and the grouping variable.
The second table shows statistics of the hypothesis test, which tests, for each
function, whether its canonical correlation and all canonical correlations of the succes-
sive functions are equal to zero. This is done via Wilks' lambda and a Chi-square
test, see Tuffery (2011). The significance levels are displayed in the last column.
Here, both linear discriminants are significant and, thus, cannot be omitted in the
classification model.
The table “Standardized Canonical Discriminant Function Coefficients”
contains the estimated coefficients of the linear equation, with which the discrimi-
nant scores can be calculated via the values of the predictor variables. See Fig. 8.92.
In the case of our Iris data, we have the following two equations:

Score1 = 0.432 · Sepal Length − 0.514 · Sepal Width + 0.943 · Petal Length + 0.563 · Petal Width

Score2 = 0.265 · Sepal Length + 0.629 · Sepal Width − 0.445 · Petal Length + 0.583 · Petal Width.

The coefficients are stated so that the scores are standardized, meaning that
their distributions have zero mean and a standard deviation of one. The
coefficients themselves specify the magnitude of the effect of each discriminating
variable on the score. For example, "Petal Length" has the largest effect of all
predictors on the first score.
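In other words, the discriminant scores are nothing but linear combinations of the standardized predictors; the sketch below shows this computation in matrix form. The coefficient matrix B contains placeholder values standing in for the entries of Fig. 8.92, so only the mechanics, not the exact numbers, should be taken from it.

# Sketch: discriminant scores as linear combinations of the standardized predictors.
import numpy as np
from sklearn.datasets import load_iris

X = load_iris().data                                  # sepal length/width, petal length/width
X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # standardize each predictor

B = np.array([[ 0.43,  0.27],                         # placeholder coefficients, one column
              [-0.51,  0.63],                         # per discriminant function
              [ 0.94, -0.45],
              [ 0.56,  0.58]])
scores = X_std @ B                                    # columns: Score1, Score2
print(scores[:3].round(3))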

8.4.4 Exercises

Exercise 1 Decide if breast tissue is cancerous via a linear discriminant model


We recall that the "WisconsinBreastCancerData.csv" comprises breast tissue
samples, see Sect. 10.1.35 and Wolberg and Mangasarian (1990). Your task is to
build a linear discriminant model that decides whether a tumor is benign or malig-
nant. The variable "class" in the dataset is the target variable and indicates either a
benign (2) or malignant (4) sample.

1. Import the data with an appropriate Source node and divide the dataset into
training data and test data.
2. Add a Type node and specify the scale type as well as the role of each variable.
3. Build a linear discriminant classification model with the Discriminant node.
Include the afore done partitioning and the calculation of the predictor impor-
tance in your model.
4. Survey the model nugget. What is the most important predictor variable?
Determine the equation for calculating the discriminant score. Does the model
have a discriminant ability?
5. Add an Analysis node to the nugget and run the stream. Is the model able to
classify the samples? What are the hit rate and AUC of the test set?

Exercise 2 Comparison of LDA and Logistic Regression and building


an Ensemble Model
In this exercise, for each of the two linear classifier methods, LDA and Logistic
Regression, a model is trained on the “diabetes_data_reduced_sav” dataset (see
Sect. 10.1.8), to predict if a female patient suffers from diabetes
(class_variable = 1). These two models are then compared with each other, as
different classification models vary in their interpretation of the data and, for
example, favor different target classes.

1. Import the dataset “diabetes_data_reduced_sav” with an appropriate Source


node and divide the data into a training set and test set.
2. Build an LDA model that separates the diabetes patients from the non-diabetes
patients. What are the most important predictor variables?

3. Train a Logistic Regression model on the same training data as the Linear
Discriminant model. To do this, use the forwards stepwise variable selection
method. What are the variables included in the model?
4. Compare the performance and goodness of fit of the two models with the
Analysis and Evaluation node.
5. Use the Ensemble node to combine the two models into one model. What are the
performance measures of this new ensemble model?

Exercise 3 Mining in high-dimensional data and dimensional reduction


The dataset “gene_expression_leukemia” contains genomic sample data from sev-
eral patients suffering from one of four different leukemia types (ALL, AML, CLL,
CML) and a healthy control group (see Sect. 10.1.14). The gene expression data are
measured at 851 locations on the human genome and correspond to known cancer
genes. Genomic data are multidimensional with a huge number of features, i.e.,
measurement points on the human genome, and often come from only a few
samples. This can create problems for many classification algorithms, since they
are typically designed for situations with a small number of input variables and
plenty of observations. In this exercise, we use PCA for dimension reduction to
overcome this obstruction.

1. Import the dataset “gene_expression_leukemia.csv” and extract the subset that


contains only healthy and ALL patients. How many patients are left in this
subset and what are the sizes of each patient group? What can be a problem when
building a classifier on this subset to separate the two patient groups?
2. Build an LDA model that separates the "ALL" from the "healthy" patients.
Calculate the accuracy and Gini for the training data and the test data. Explain
the results.
3. Perform a PCA on the gene expression data of the training set. Plot the first two
factors determined by the PCA against each other. What can you say about the
separability of the two patient groups ("ALL", "healthy") based on their location in the plot?
4. Build a second LDA model on the first 5 factors. What are the accuracy and Gini
for this model? Explain the difference from the first model. What are the
advantages of a dimensional reduction in this case?

8.4.5 Solutions

Exercise 1 Decide if breast tissue is cancerous via a linear discriminant model


Name of the solution streams WisconsinBreastCancer_LDA
Theory discussed in section Sect. 8.2
Sect. 8.4.1

Figure 8.93 shows the final stream for this exercise

1. First, import the dataset with the File Var. File node and connect it to a Partition
node, to divide the dataset into two parts in a 70:30 ratio. See Sect. 2.7.7 for
partitioning datasets.

Fig. 8.93 Stream of the LDA classifier for the Wisconsin Breast Cancer dataset

Fig. 8.94 Definition of the scale level and role of the variables

2. Now, add a Type node to the stream and open it. The variable “Sample code”
just labels the samples and thus has no further meaning for the model building
process; hence, its role is declared as "None". Now, assign the role "Target" to
the "class" variable. Since it contains flag values, set its measure-
ment to "Flag". See Fig. 8.94. Afterwards, click the "Read Values" button to
make sure that the stream knows all the scale levels of the variables. Otherwise,
the further discriminant analysis might fail and produce an error.
3. Now, build a linear discriminant classification model by adding a Discriminant
node to the stream. Open the node, enable the predictor importance calculation
in the Analyze tab and choose the “Use type node settings” option in the Fields
tab. The latter now uses the roles of the variables as declared in the Type node.

Fig. 8.95 Predictor importance of the LDA for the Wisconsin breast cancer data

Fig. 8.96 Significance test on the built model

Fig. 8.97 Coefficients of the


LDA for the Wisconsin
Breast Cancer data

Furthermore, make sure the “Use partitioned data” option is marked in the
Model tab. Now run the stream to build the model.
4. Open the model nugget. In the Model tab, you can see the predictor importance
plot. The "Bare nuclei" variable, with an importance of 0.42, is the most important for
classifying the cancer samples. See Fig. 8.95.
In the Advanced tab, in the Wilks' lambda table, we can see that the discrimi-
nating ability is very high, since the significance level is almost zero. See
Fig. 8.96. In the Standardized Canonical Discriminant Function Coefficients,
you can find a list of the coefficients of the linear equation, from which the
discriminant scores are calculated. See Fig. 8.97 for these coefficients.

Fig. 8.98 Output of the Analysis node for the LDA

5. Now add an Analysis node to the nugget and select at least the options “Coinci-
dence matrices” and “Evaluation metric” in the node. Then run the stream again.
A window similar to Fig. 8.98 opens with the evaluation statistics. We see that
the LDA performs very well, as the accuracy is greater than 96 % in both the
training data and testing data. Furthermore, the AUC is extremely high, with
0.996 for the training data and 0.992 for the test data.

Exercise 2 Comparison of LDA and Logistic Regression and building


an Ensemble Model
Name of the solution streams Diabetes_logit_lda
Theory discussed in section Sect. 8.2
Sect. 8.4.1
Sect. 8.3.1
Sect. 5.3.6 (Ensembles)

Figure 8.99 shows the final stream for this exercise

1. We start by opening the template stream “015 Template-Stream_Diabetes” and


saving it under a different name. See Fig. 8.100 for the stream.
If the data type of the target variable is not defined as “Flag” in the Type node,
we change the type to “Flag”, so we are able to calculate the Gini and AUC

Fig. 8.99 Stream of the LDA and Logistic Regression classifiers for the diabetes dataset

Fig. 8.100 Template stream for the diabetes data



Fig. 8.101 Type node with the measurement type of the target variable (class_variable) is set
to “Flag”

evaluation measures of the later trained models. See Fig. 8.101. Furthermore, we
make sure that the role of the “class_variable” is “Target” and all other roles are
“Input”. Now we add the usual Partition node and split the data into training
(70 %) and test (30 %) sets. See Sect. 2.7.7 for a detailed description of the
Partition node.
2. We add an LDA node to the stream, connect it to the Partition node and open
it. As the roles of the variables are already set in the Type node (see Fig. 8.101),
the node automatically identifies the roles of the variables and thus nothing has
to be done in the Fields tab. In the Model tab however, we select the “Stepwise”
variables selection method (see Fig. 8.102) and in the Analyze tab, we enable the
variable importance calculations (see Fig. 8.103).
Now we run the stream and open the model nugget that now appears.
We observe in the Model tab that the variable “glucose_concentration” is
by far the most important input variable, followed by “age”, “BMI”, and
“times_pregnant”. See Fig. 8.104.
3. We now add a Logistic node to the stream and connect it to the Partition node, to
train a logistic regression model. Afterwards, we open the node and select the

Fig. 8.102 Model tab of the LDA and definition of the stepwise variable selection method

Fig. 8.103 Analyze tab in the LDA node and the enabling of the predictor importance calculation

forwards stepwise variable selection method in the Model tab. See Fig. 8.105.
There, we choose the Binomial procedure, since the target is binary. The
multinomial procedure with the stepwise option is also possible and results in
the same model.
Before running the node, we further enable the importance calculation in the
Analyze tab. We then open the model nugget and note that the most important

Fig. 8.104 Predictor importance in the LDA model

variables included in the LDA are also the most important ones in the logistic
regression model, i.e., “glucose_concentration”, “age”, and “BMI”. These three
are the only variables included in the Model when using the variable selection
method (Fig. 8.106).
4. Before comparing both models, we rearrange the stream by connecting the two
model nuggets in series. See Fig. 8.107. This can easily be done by dragging a
part of the connecting arrow of the Partition node to the logistic regression model
nugget on the LDA model nugget.
Now, we add an Analysis and an Evaluation node to the stream and connect
them to the logistic regression nugget. Then, we run these two nodes to calculate
the accuracy and Gini values. The setting of these two nodes is explained in
Sect. 8.2.5 and Fig. 8.28. In Fig. 8.108 the model statistics of the LDA and
logistic regression can be viewed. We see that the AUC/Gini values are slightly
better for the linear discriminant model; this is also visualized in Fig. 8.109,
where the ROC of the LDA model is located a bit above the Logistic Regression
ROC. Accuracy in the test set is higher in the logistic regression model,
however. This is a good example of how a higher Gini doesn’t have to go

Fig. 8.105 Model tab in the Logistic node. Definition of the variable selection method

along with higher accuracy and vice versa. The decision of a final model is thus
always associated with the choice of the performance measure. Here, the LDA
would be preferred when looking at the Gini, but when taking accuracy as an
indicator, the logistic regression would be slightly preferred.
To analyze where these differences arise from, we inspect the coincidence
matrices. We note that the two models differ in their preference for the target
classes. The logistic regression model has a higher tendency toward the non-diabetes
class (class_variable = 0) than the linear discriminant model. The former thus
predicts a non-diabetes diagnosis more often than the LDA, for both the training
set and the test set. Moreover, the LDA predicts diabetes more often than it
occurs in the data; that is, in the training set the LDA predicts 98 patients will
have diabetes although there are only 90 diabetes patients in the set. This is
similar for the test set. Thus, to compensate for this over-prediction of diabetes
patients and create a more robust prognosis, one possibility is to combine the two
models into an ensemble model.

Fig. 8.106 Predictor importance in the Logistic Regression model

Fig. 8.107 Rearranging the stream by connecting the model nuggets in series

5. To combine the two models into an ensemble model, we add the Ensemble node
to the stream and connect it to the logistic regression model nugget. We open the
node and choose “class_variable” as the target field for the ensemble model. See
Fig. 8.110. Additionally, we set the aggregation method of the ensemble to
“Average raw propensity”. When running this node now, the predictions,

Fig. 8.108 Performance measures in the LDA and Logistic Regression model for the
diabetes data

namely the probabilities for the target classes, of the two models LDA and
Logistic Regression are averaged, and the target class with the higher probability
wins and is therefore predicted by the ensemble.
For the averaging to work properly, we have to ensure that the propensities of
the models are calculated. This can be enabled in the model nuggets in the

Fig. 8.109 ROC of the LDA Logistic Regression model

Fig. 8.110 Settings of the Ensemble node



Settings tab. See Fig. 8.111 for the LDA model nugget. This can be done analogously
for the Logistic Regression nugget.
We add an Analysis node to the stream and connect it to the Ensemble node.
In Fig. 8.112, the accuracy and Gini are displayed in the ensemble model. We

Fig. 8.111 Enabling of the propensity calculations

Fig. 8.112 Evaluation measures in the ensemble model, consisting of the LDA and Logistic
Regression models

see that the Gini of the test set has increased by 0.004 points compared to the
LDA (see Fig. 8.108). Furthermore, the accuracy in the training set and test set
has slightly improved. In conclusion, the prediction power of the LDA improves
slightly when it is combined with a Logistic Regression model within an
ensemble model.
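The "Average raw propensity" aggregation can be mimicked directly in code: average the class-1 probabilities of the two models and predict the class whose averaged probability is larger. The sketch below uses scikit-learn with its built-in breast cancer data standing in for the diabetes file; scikit-learn's VotingClassifier with voting="soft" would be the more idiomatic equivalent of the Ensemble node.

# Sketch of the "Average raw propensity" ensemble of an LDA and a logistic regression.
from sklearn.datasets import load_breast_cancer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.7, random_state=0, stratify=y)

lda = LinearDiscriminantAnalysis().fit(X_tr, y_tr)
logit = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)

avg_propensity = (lda.predict_proba(X_te)[:, 1] + logit.predict_proba(X_te)[:, 1]) / 2
ensemble_pred = (avg_propensity >= 0.5).astype(int)
print("ensemble accuracy (test):", round((ensemble_pred == y_te).mean(), 3))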

Exercise 3 Mining in high-dimensional data and dimensional reduction


Name of the solution streams leukemia_gene_expression_lda_pca
Theory discussed in section Sect. 8.2
Sect. 8.4.1

Figure 8.113 shows the final stream for this exercise.

1. We start by opening the template stream “018 Template-Stream gene_expres-


sion_leukemia” and saving it under a different name. See Fig. 8.114. The
template already comprises a Partition node with a data split, 70 % training
and 30 % test. Furthermore, the roles in the type node are already defined, that is,
the “Leukemia” variable is set as the target and all genomic positions are set as
inputs.
We then add a Select node and place it in the stream between the Source node
and Partition node. We open the Select node and enter the formula that selects
only the “ALL” or “healthy” patients in the dataset. See Fig. 8.115 for the
formula inserted in the Select node.
Afterwards, we add a Distribution node to the Select node to draw the
frequency of the two patients groups. Figure 8.116 shows the distribution of

Fig. 8.113 Solution stream for the exercise of mining gene expression data and dimensional
reduction

Fig. 8.114 Template stream of the gene expression leukemia all data

Fig. 8.115 Selection of the subset that contains only ALL or healthy patients

Fig. 8.116 Distribution of the Leukemia variable within the selected subset

the Leukemia variable in the subset. There are in total 207 patients in the subset,
of whom 73 are healthy and the others suffer from ALL.
When building a classifier on this subset, the large number of input variables
(851) compared with the number of observations (207) is a problem. In this case,
many classifiers tend to overfit.

We add a Discriminant node to the stream and connect it to the Type node. As
the roles of the variables are already defined in the Type node, the Modeler
automatically detects the target, input, and Partition fields. In the Model tab of
the Discriminant node, we further choose the stepwise variable selection method
before running the node. After the model nugget appears, we add an Analysis
node to the stream and connect it to the nugget. For the properties of the Analysis
node, we recall Sect. 8.2.5. The evaluation statistics, including accuracy and
Gini, are shown in Fig. 8.117. We see that the statistics are very good for the training
set, but clearly worse for the test set. The Gini in particular indicates this, since
it is only about half as large in the test set (0.536) as in the training set (1.0). This
signals an overfitted model.
So, the LDA is unable to handle the gene expression data, with its disproportionately
large number of input variables compared to the number of patients. The reason for this
may be that the huge number of features overlays the basic structure in the data.
2. We add a PCA node to the stream and connect it to the Type node in order to
consolidate variables and identify common structures in the data. We open the
PCA node and select all genomic position variables as inputs in the Fields tab.
See Fig. 8.118.
In the Model tab, we mark the "Use partitioned data" option in order to use only
the training data for the factor calculations. The method we intend to use is
the "Principal Component" method, which we also select in this tab. See
Fig. 8.119.

Fig. 8.117 Analysis output for the LDA on the raw gene expression data

Fig. 8.118 Input variable definition in the PCA node

Fig. 8.119 Definition of the extraction method and usage of only the training data to calculate the
factors

Fig. 8.120 Setup of the scatterplot to visualize factors 1 and 2, determined by the PCA

Now, we run the PCA node and add a Plot node to the appearing nugget. In the
Plot node, we select factor 1 and factor 2 as the X and Y field. Furthermore, we
define the “Leukemia” variable as a coloring and shape indicator, so the two
groups can be distinguished in the plot. See Fig. 8.120.
In Fig. 8.121, the scatterplot of the first two factors is shown. As can be seen,
the “ALL” patients and the “healthy” patients are concentrated in clusters and
the two groups can be separated by a linear boundary.
3. We add another Type node to the stream and connect it to the PCA model
nugget, so the following Discriminant node is able to identify the new variables
with their measurements. As just mentioned, we add a Discriminant node and
connect it to the new Type node. In the Fields tab of the Discriminant node, we select
“Leukemia” as the target, “Partition” as the partition variable and all 5 factors,
which were determined by the PCA, as input variables. See Fig. 8.122. In the
Model tab, we further choose the stepwise variable selection method before
running the stream.
4. We connect the model nugget of the second LDA to a new Analysis node and
choose the common evaluation statistics as the output. See Sect. 8.2.5. The
output of this Analysis node is displayed in Fig. 8.123. We immediately see
the accuracy, as well as the Gini, has improved for the test set, while the stats in
the training set have not decreased. Both values now indicate extremely good
separation and classification power, and the problem of overfitting has faded.

Fig. 8.121 Scatterplot of the two factors determined by the PCA

Fig. 8.122 Selection of the factors as inputs for the second LDA

Fig. 8.123 Analysis output for the LDA with factors as input variables

An explanation of why the LDA on the PCA calculated factors performs better
than on the raw data is already given in previous steps of this solution. The huge
number of input variables has supposedly hidden the basic structure that
separates the two patient groups, and the model trained on the raw data has
therefore overfitted. The PCA has now uncovered this basic structure and the
two groups are now linearly separable, as visualized in Fig. 8.121. Another
advantage of this dimensional reduction is an improvement in the time it takes to
build the model and predict the classes. Furthermore, less memory is needed to
save the data in this reduced form.
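For readers who want to retrace this two-step idea outside the SPSS Modeler, the following minimal R sketch combines a PCA with an LDA on the first factors. The data frame "genes" and its target column "Leukemia" are assumed placeholder names, not objects from the book's streams, and no cross-validation is included.

    # Minimal sketch of PCA followed by LDA, assuming a data frame 'genes'
    # with gene expression columns and a factor column 'Leukemia'.
    library(MASS)

    x   <- genes[, setdiff(names(genes), "Leukemia")]
    pca <- prcomp(x, center = TRUE, scale. = TRUE)     # principal component analysis

    factors <- as.data.frame(pca$x[, 1:5])             # keep the first 5 factors
    factors$Leukemia <- genes$Leukemia

    fit  <- lda(Leukemia ~ ., data = factors)          # LDA on the reduced factor space
    pred <- predict(fit)$class
    mean(pred == factors$Leukemia)                     # resubstitution accuracy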

8.5 Support Vector Machine

In the previous two sections, we introduced the linear classifiers, namely, logistic
regression and linear discriminant analysis. From now on, we leave linear cases and
turn to the nonlinear classification algorithms. The first one in this list is the
Support Vector Machine (SVM), which is one of the most powerful and flexible
classification algorithms and is discussed below. After a brief description of the
theory, we attend to the usage of SVM within the SPSS Modeler.

8.5.1 Theory

The SVM is one of the most effective and flexible classification methods, and it can
be seen as a connection between linear and nonlinear classification. Although the
SVM separates the classes via a linear function, it is often categorized as a nonlinear
classifier due to the following fact: the SVM comprises a preprocessing step, in
which the data are transformed so that previously nonlinearly separable data can now
be divided via a linear function. This transformation technique makes the SVM
applicable to a variety of problems, by constructing highly complex decision
boundaries. We refer to James et al. (2013) and Lantz (2013).

The support vectors


The SVM constructs a linear function (a hyper-plane in higher dimensions) to
separate the different target classes. It thereby chooses the decision boundary
with the following approach: Consider a set of data containing two classes, circles
and rectangles, which are perfectly separable by a linear function. See the left graph
in Fig. 8.124, where two possible decision boundaries are displayed. The SVM now
chooses the one with the largest margin. The margin is the distance between the
decision boundary and the nearest data points. Hence, the SVM chooses as its
decision boundary the linear function with the largest distance to both classes. The
closest data points characterize the decision boundary uniquely and they are called
support vectors. In the right graph in Fig. 8.124, the classification boundary is
shown as a solid line, and the two support vectors, marked with arrows, uniquely
define the largest possible margin, which is indicated by the dashed lines.
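For reference, this margin maximization can be stated as the standard hard-margin optimization problem (a common textbook formulation, not a formula from the SPSS Modeler): for training points x_i with class labels y_i in {-1, +1},

    \min_{\omega,\, b} \ \tfrac{1}{2}\,\lVert \omega \rVert^{2}
    \quad \text{subject to} \quad
    y_i \left( \omega^{T} x_i + b \right) \ge 1, \qquad i = 1, \dots, n,

where the resulting margin equals 2 / \lVert \omega \rVert and the constraints hold with equality exactly at the support vectors.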

Mapping of data and the kernel function


In the case of nonlinearly separable classes, as in the left graph in Fig. 8.125 for
example, the SVM uses a kernel trick by mapping the data into a higher dimen-
sional space, in which the data are then linearly classifiable. This shifting of the
data into a higher dimension reduces the complexity and thus simplifies the

Fig. 8.124 Illustration of the decision boundary discovery and the support vectors

Fig. 8.125 Transformation of the data via a kernel function

classification problem. This is an unusual approach compared with other
classification methods, as the data are transformed in a way that suits the method,
instead of fitting a complex separator to the training data in the form in which
they enter the modeling process. The process of data transformation is
demonstrated in Fig. 8.125.
The mapping function is defined by the choice of kernel function, and there are
several standard kernel function types that are commonly used. The ones supported
by the SPSS Modeler are:

• Linear
• Polynomial
• Sigmoid
• Radial basis function (RBF)

See Lantz (2013) for a more detailed description of the kernels. Among these,
the RBF kernel is the most popular for SVM, as it performs well on most data types.
So this kernel is always a good starting point.
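For orientation, the commonly used forms of these kernel functions are listed below (standard textbook definitions; the parameter names may differ slightly from those in the Modeler's Expert tab):

    K_{\text{linear}}(x, x') = x^{T} x'
    K_{\text{polynomial}}(x, x') = \left( \gamma\, x^{T} x' + c \right)^{d}
    K_{\text{sigmoid}}(x, x') = \tanh\!\left( \gamma\, x^{T} x' + c \right)
    K_{\text{RBF}}(x, x') = \exp\!\left( -\gamma\, \lVert x - x' \rVert^{2} \right)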
The right choice of the kernel guarantees robustness and a high accuracy in
classification. The kernel function and its parameters must be chosen carefully
however, as an unsuitable kernel can cause the model to overfit the training data.
For this reason, a cross-validation with the training set and test set is always
strongly recommended. Furthermore, an additional validation set can be used to
find the optimal kernel parameters (Sect. 8.2.1). See Schölkopf and Smola (2002)
and Lantz (2013) for additional information on the kernel trick.

8.5.2 Building the Model with SPSS Modeler

In this section, we introduce the SVM node of the Modeler and explain how it is
used in a classification stream. The model we are building here refers to the sleep
detection problem described as a motivating example in Sect. 8.1.

The dataset EEG_Sleep_signal.csv contains EEG signal data from a single


person in a “drowsiness and awake” state (see Sect. 10.1.10). The electrical
impulses of the brain are measured every 10 ms, and the data is split into segments
of 30 s. The task is then to classify a 30 s EEG signal as either a drowsiness or an
awake state. The problem now is that the states cannot be classified based on the
raw signals, for EEG signals have a natural volatility and can fluctuate between
different levels (Niedermeyer et al. (2011)). So, the structure of a signal is more
important than its actual measured value. See Fig. 8.126 for an excerpt of the EEG
signals. Every row is a 30 s signal segment.
In summary, before building a classifier on the EEG signals, we generate a new
data matrix that contains features calculated from the EEG data. See Fig. 8.127 for
the complete stream, named “EEG_Sleepdetection_svm_COMPLETE”. The first

Fig. 8.126 Excerpt from the EEG signals

Fig. 8.127 EEG_Sleepdetection_svm_COMPLETE stream, which builds a classifier to detect sleepiness in EEG signals

part is dedicated to the feature calculation, and the model is built in the second part
of the stream.
The feature calculation is performed in R (R Core Team (2014)) via an R node.
Therefore, R has to be installed on the computer and included within the SPSS
Modeler. This process and the usage of the R node are explained in detail in
Chap. 9. We split the stream up into two separate streams, a feature calculation
stream and a model building stream. If one is not interested in the feature calcula-
tion, the first part can be skipped, as the model building process is described on the
already generated feature matrix. For the interested reader, we refer to Niedermeyer
et al. (2011) for detailed information on EEG signals, their properties, and analysis.
In this situation, the raw data has to be pre-processed and transformed into more
appropriate features on which the model is able to separate the data. This transfor-
mation of data into a more suitable form and generating new variables out of given
ones is common in data analytics. Finding new variables that will improve model
performance is one of the major tasks of data science. We will experience this in
Exercise 3 in Sect. 8.5.4, where new variables are obtained from the given ones,
which will increase the prediction power of a classifier.

Feature generation

Description of the model


Stream name: EEG_Sleep_calculate_features
Based on dataset: EEG_Sleep_Signals.csv (see Sect. 10.1.10)
Stream structure:
Important additional remarks: The calculation of features is done in the R node, since this is the quickest and easiest way. How to include R in the SPSS Modeler and how to use R nodes are explained in more detail in Chap. 9.
Related exercises: 3 and all exercises in Chap. 9

1. We start the feature calculations by importing the EEG signals with a Var. File
node. The imported data then look like Fig. 8.126.
2. Now, we add an R Transform node, in which the feature calculations are then
performed. Table 8.6 lists the features we extract from the signals.
The first three features are called Hjorth parameters. They are classic
statistical measures within signal processing and are often used for analytical
purposes, see Niedermeyer et al. (2011) and Oh et al. (2014).

Table 8.6 Features calculated from the EEG signals


Feature      Description
Activity     Variation in the signal
Mobility     Represents mean frequency
Complexity   Describes the change in frequency
Range        Difference between maximum and minimum values in the signal
Crossings    Number of x-axis crossings in the standardized signal

Fig. 8.128 The R Transform node in which the feature calculations are declared

We open the R Transform node and include in the R Transform syntax field
the R syntax that calculates the features. See Fig. 8.128 for the node and
Fig. 8.129 for the syntax inserted into the R node. The syntax is thereby
displayed in the R programming environment RStudio, RStudio Team (2015).
The syntax is provided under the name “feature_calculation_syntax.R”.

Fig. 8.129 R syntax that calculates the features and converts them back to SPSS Modeler format,
displayed in RStudio

Now we will explain the syntax in detail. The data inserted into the R
Transform node for manipulation is always named “modelerData”, so that the
SPSS Modeler can identify the input and output data of the R nodes, when the
calculations are complete.
Row 1: In the first row, the modelerData are assigned to a new variable named
“old_variable”.
Rows 3 + 4: Here, the signal data (the first 3000 columns) and the sleep
statuses of the signal segments are assigned to the variables “signals” and
“sleepiness”.
Row 6: Here, a function is defined that calculates the mobility of a signal
segment x.
Rows 8–15: Calculation of the features. These, together with the sleepiness
states, are then consolidated in a data.frame, which is in an R-matrix format.
This data.frame is then assigned to the variable “modelerData”. Now the Mod-
eler can further process the feature matrix, as the variable “modelerData” is
normally passed onto the next node in the stream.
Rows 17–24: In order for the data to be processed correctly, the SPSS
Modeler must know the fields and their measurement types, and so the fields
in the “modelerData” data.frame have to be specified in the data.frame variable
“modelerDataModel”. The storage type is defined for each field, which is “real”
for the features and “string” for the sleepiness variable.
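The following condensed R sketch mirrors the logic just described. It is not the book's exact syntax; the column layout (3000 signal values per row, sleepiness in column 3001) is taken from the description above, and the declaration of "modelerDataModel" is only hinted at in a comment.

    # Condensed sketch of the feature calculation inside an R Transform node
    # (assumed column layout: 3000 signal values per row, sleepiness in column 3001).
    signals    <- modelerData[, 1:3000]
    sleepiness <- modelerData[, 3001]

    mobility <- function(x) sqrt(var(diff(x)) / var(x))          # Hjorth mobility

    activity   <- apply(signals, 1, var)                         # Hjorth activity: variation of the signal
    mob        <- apply(signals, 1, mobility)
    complexity <- apply(signals, 1, function(x) mobility(diff(x)) / mobility(x))
    range_f    <- apply(signals, 1, function(x) max(x) - min(x)) # range of the signal
    crossings  <- apply(signals, 1, function(x) {
      z <- as.numeric(scale(x))                                  # standardize the signal
      sum(diff(sign(z)) != 0)                                    # number of x-axis crossings
    })

    modelerData <- data.frame(Activity = activity, Mobility = mob,
                              Complexity = complexity, Range = range_f,
                              Crossings = crossings, Sleepiness = sleepiness)
    # In the real node, modelerDataModel must additionally declare the new fields
    # and their storage types, as described in rows 17-24 above.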
3. We add a Data Audit node to the R Transform node, to inspect the new
calculated features. See Fig. 8.130 for the distributions and statistics of the
feature variables.
4. To save the calculated features, we add the output node Flat File to the stream
and define a filename and path, as well as a column delimiter and the further
structure of the file.

Fig. 8.130 Statistics on the new features

Building the support vector machine on the new feature data

Description of the model


Stream name: EEG_Sleepdetection_svm
Based on dataset: Features_eeg_signals.csv (see Sect. 10.1.13)
Stream structure:
Important additional remarks: This is a sub-stream of the EEG_Sleepdetection_svm_COMPLETE stream. Cross-validation is included in the model node, and it can be performed as described in the Logistic Regression section of Sect. 8.3.6.
Related exercises: All exercises in Sect. 8.5.4

By preprocessing the EEG signals, we are now able to build an SVM classifier
that separates sleepiness states from awake states.

1. We start by importing the data in the features_eeg_signal.csv file with a Var. File
node. The statistics obtained with the Data Audit node can be viewed in
Fig. 8.130. Afterwards, we add a Partition node to the stream and allot 70 %
of the data as a training set and 30 % as a test set. See Sect. 2.7.7 for how to use
the Partition node.
2. Now, we add the usual Type node to the stream and open it. We have to make
sure that the measurement type of the "sleepiness" variable, i.e., the target
variable, is “nominal” or “flag” (see Fig. 8.131).
3. We add an SVM node to the stream and connect it to the Type node. After
opening it, we declare the variable “sleepiness” as our target in the Fields tab,
“Partition” as the partitioning variable and the remaining variable, i.e., the
calculated features listed in Table 8.6, are declared as input variables (see
Fig. 8.132).
4. In the Model tab, we enable the “Use partitioned data” option so that cross-
validation is performed. In other words, the node will only use the training data
to build the model and the test set to validate it (see Fig. 8.133).
5. In the Expert tab, the kernel function and parameters for use in the training
process can be defined. By default, the simple mode is selected (see Fig. 8.134).
This mode utilizes the RBF kernel function, with its standard parameters for
building the model. We recommend using this option if the reader is unfamiliar
with the precise kernels, their tuning parameters, and their properties. We
proceed with the simple mode in this description of the SVM node.

Fig. 8.131 Setting of the target measurement as “Flag”



Fig. 8.132 Definition of the input and target variables

Fig. 8.133 Enabling of the cross-validation procedure

If one has knowledge and experience with the different kernels, then the
kernel function and the related parameters can be specified in the expert mode
(see Fig. 8.135). We omit a precise description of the kernel parameters and refer
interested readers to Lantz (2013) and Schölkopf and Smola (2002).
6. Now, we run the stream and the model nugget appears.
7. We connect the nugget to an Analysis node and an Evaluation node. See
Sect. 8.2.5 and Fig. 8.28 for the settings of these nodes. Figures 8.136 and
8.137 display the evaluation statistics and the ROC of the training set and test
set for the SVM.

Fig. 8.134 Standard kernel setting

Fig. 8.135 Expert tab. Definition of kernel type and parameters

Fig. 8.136 Evaluation statistics for the sleep detection classifier

Fig. 8.137 ROC of the sleep detection SVM classifier

We note that accuracy in both the training and test sets is very
high (over 90 %), and the Gini is also of high value in both cases. This is
visualized by the ROC curves. In conclusion, the model is able to separate
sleepiness from an awake state.
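As a plausibility check outside the Modeler, a comparable RBF-kernel SVM can be fitted in R with the e1071 package. The file name, the target column name, and the 70/30 split below are assumptions based on the description above, so the numbers will not match the Modeler output exactly.

    # Hedged sketch of an analogous RBF-kernel SVM in R (e1071 package).
    library(e1071)

    features <- read.csv("features_eeg_signals.csv")             # assumed file name
    features$Sleepiness <- as.factor(features$Sleepiness)        # assumed target column name

    set.seed(42)
    idx   <- sample(nrow(features), round(0.7 * nrow(features))) # 70 % training split
    train <- features[idx, ]
    test  <- features[-idx, ]

    fit <- svm(Sleepiness ~ ., data = train, kernel = "radial")  # RBF kernel, default parameters
    mean(predict(fit, test) == test$Sleepiness)                  # test set accuracy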

8.5.3 The Model Nugget

Statistics and goodness of fit measures within the SVM model nugget are very few
when compared to other model nuggets. The only statistic the SPSS Modeler
provides in the SVM model nugget is the predictor importance view (see Fig. 8.138). For
the sleep detection model, the “crossing0” feature is the most important variable,
followed by “Range”, “Complexity”, and “Activity”. The importance of x-axis
crossings suggests that the fluctuation around the mean of the signal is an indicator
of being asleep or being awake.

Fig. 8.138 Predictor importance view in the SVM model nugget



8.5.4 Exercises

Exercise 1 Detection of leukemia in gene expression data


The dataset “gene_expression_leukemia_short.csv” contains gene expression
measurements from 39 human genome positions of various leukemia patients (see
Sect. 10.1.15). This genomic data is the basis on which doctors obtain their
diagnosis of whether a patient has leukemia. Your task is to build an SVM classifier
that decides for each patient whether or not they have blood cancer.

1. Import the data and familiarize yourself with the gene expression data. How
many different types of leukemia are there, and how often do they occur in the
dataset?
2. Unite all leukemia types into a new variable value that just indicates the patient
has cancer. How many patients have cancer and how many are healthy?
3. Build an SVM classifier that can differentiate between a leukemia patient and a
non-leukemia patient. What is the accuracy and Gini value for the training set
and the test set? Draw an ROC to visualize the Gini.

Exercise 2 Classification of leukemia types—The choice of kernel problem


Again, consider the dataset “gene_expression_leukemia_short.csv” from Exercise
1 that contains gene expression measurements for 39 human genome positions of
various leukemia patients (see Sect. 10.1.15).

1. Import the data and set up a cross-validation stream.


2. Build a classification model with the SVM node. To do this, use the sigmoid
kernel to predict the type of leukemia, based on the gene expression data.
3. What is the accuracy of the model after the previous step? Is the model suitable
for distinguishing between the different leukemia types?
4. Build a second SVM model with RBF as the kernel function. Compare this
model with the sigmoid kernel model. Which one has more prediction power?

Exercise 3 Titanic survival prediction and feature engineering


Deriving new information from given variables is a major part of data science. The
titanic.xlsx file contains data on Titanic passengers, including an indicator variable
“survived”, which indicates if a particular passenger survived the Titanic sinking
(see Sect. 10.1.32). Your task in this exercise is to generate new variables from the
Titanic data that will improve the prediction of passenger survival with an SVM
classifier. Pay attention to the missing data within the Titanic dataset. See also
Sect. 8.3.7, Exercise 2, where a logistic regression classifier has to be built for this
problem, and missing values handling must be performed with the Auto Data Prep
node.

1. Import the Titanic data and inspect each variable. What additional information
can be derived from the data?
2. Create three new variables that describe the deck of the passenger’s cabin,
his/her (academic) title, and the family size. The deck can be extracted from
the cabin variable, as it is the first symbol in the cabin code. The passenger’s title
can be derived from the name variable; it is located between the symbols "," and
“.” in the name variable entries. The family size is just the sum of the variables
“sibsp” and “parch”. Use the Derive node to generate these new variables. What
are the values and frequencies of these variables? Adjoin the values that only
occur once with other similar values.
3. Use the Auto Data Prep node to normalize the continuous variables and replace
the missing data.
4. Build four classifiers with the SVM node to retrace the prediction accuracy and
performance when adding new variables. So, the first model is based on the
original variables, the second model includes the deck variable, and the third
also comprises the title of the passengers. Finally, the fourth model also takes the
family size of the passenger as an input variable.
5. Determine the most important input variables for each of the models. Are the
new variables (deck, title, family size) relevant for the prediction? Compare the
accuracy and Gini of the models. Have the new variables improved the predic-
tion power? Visualize the change in model fitness by plotting the ROC of all four
models with the Evaluation node.

8.5.5 Solutions

Exercise 1 Detection of leukemia in gene expression data


Name of the solution streams: Gene_expression_leukemia_short_svm
Theory discussed in section: Sect. 8.2, Sect. 8.5.1

The stream displayed in Fig. 8.139 is the complete solution of this exercise.

1. We open the template stream "019 Template-Stream gene_expression_leukemia_short", as shown in Fig. 8.140. This template already contains a Partition
node that splits the data into a training set and a test set in the usual ratio.
Now we run the stream, to view the statistics in the Data Audit node. The last
variable is “Leukemia”, which describes the cancer type of the patients. By
double-clicking on the graph symbol, the distribution of this variable is shown in
a new window. See Fig. 8.141 for the Data Audit node and Fig. 8.142 for the
distribution plot of the “Leukemia” variable. We learn from the graph that there
are patients with 4 types of leukemia in the dataset, and about 5.73 % of the
whole dataset consists of healthy people with no leukemia. The leukemia types are
ALL (10.53 % of the patients), AML (42.58 %), CLL (35.19 %), and CML
(5.97 %).

Fig. 8.139 Complete stream of the leukemia detection classifier

Fig. 8.140 Template stream of the gene expression data

Fig. 8.141 Data Audit node for the gene expression data

2. To assign all four leukemia types (AML, ALL, CLL, CML) to a joint “cancer”
class, we add a Reclassify node to the stream. In the node options, select the
variable “Leukemia” as the reclassify field and click on the “Get” button, which
will load the variable categories as original values. In the New value fields of the
four leukemia types, we put “Cancer” and “healthy” for the non-leukemia entry
(see Fig. 8.143). At the top, we then choose the option “reclassify into existing

Fig. 8.142 Distribution of the leukemia variable

Fig. 8.143 Reclassification node that assigns the common variable label “Cancer” to all leukemia
types

field”, to overwrite the original values of the “Leukemia” variable with the new
values.
Now, we add a Distribution node to the stream and connect it to the Reclassify
node. In the graph options, we choose “Leukemia” as the field and run the

stream. Figure 8.144 shows the distribution of the newly assigned values of the
“Leukemia” variable. 94.27 % of the patients in the dataset have leukemia,
whereas 5.73 % are healthy.
3. Before building the SVM, we add another Type node to the stream and define in
it the target variable, i.e., the “Leukemia” field (see Fig. 8.145). Furthermore, we
set the role of the "Patient_ID" field to None, as this field is just the patient's
identifier and irrelevant for predicting leukemia.
Now, we add the SVM node to the stream. As with the previous definitions
of the variable roles, the node automatically identifies the target, input, and
partitioning variables. Here, we choose the default settings for training the SVM.

Fig. 8.144 Distribution of the new leukemia variable classes

Fig. 8.145 Definition of the target variable



Hence, nothing has to be changed in the SVM node. We run the stream and the
model nugget appears.
To evaluate the model performance, we add an Analysis node and Evaluation
node to the stream and connect it to the nugget. The options in these nodes are
described in Sect. 8.2.5 and Fig. 8.28. After running these two nodes, the
goodness of fit statistics and the ROC pop up in a new window. These are
displayed in Figs. 8.146 and 8.147. We see that SVM model accuracy is pretty
high in both the training set and the test set, and the Gini also indicates good
prediction ability. The latter is visualized by the ROC in Fig. 8.147.

Although these statistics look pretty good, we wish to point out one fact that could
be seen as a little drawback of this model. When just looking at the prediction
performance of the “Healthy” patients, we note that only 11 out of 18 predictions
are correct in the test set. This is just 61 % correctness in this class, compared to
96 % overall accuracy. This could be caused by the high imbalance of the target
classes, and the SVM may slightly favor the majority class, i.e., “Cancer”. One
should keep that in mind when working with the model.

Fig. 8.146 Output of the Analysis node



Fig. 8.147 ROC of the training set and test set of the SVM leukemia classifier

Fig. 8.148 Complete stream of the leukemia detection classifier with sigmoid and RBF kernel

Exercise 2 Classification of leukemia types—The choice of kernel problem


Name of the solution streams: gene_expression_leukemia_short_KernelFinding_svm
Theory discussed in section: Sect. 8.2, Sect. 8.5.1

The stream displayed in Fig. 8.148 is the complete solution of this exercise.

1. We follow the first part of the solution to the previous exercise, and open the
template stream “019 Template-Stream gene_expression_leukemia_short”, as

seen in Fig. 8.140, and save it under a different name. This template already
contains a Partition node, which splits the data into a training set and a test set in
the usual ratio. We add a Distribution node to the Type node, to display the
frequencies of the leukemia types. This can be viewed in Fig. 8.142.
In the Type node, we set the roles of the variables as described in step 3 of the
previous solution. See also Fig. 8.145. Now, the SVM nodes can automatically
identify the target and input variables.
2. We add an SVM node to the stream and connect it to the Type node. As we
intend to use a sigmoid kernel, we open the model building node and go to the
Expert tab. There, we enable the expert mode and choose sigmoid as the kernel
function. See Fig. 8.149. Here, we work with the default parameter settings, i.e.,
Gamma is 1 and Bias equals 0. See IBM (2015a) and Lantz (2013) for the meaning and
influence of the parameters. Now, we run the stream.
3. We add an Analysis node to the model nugget and set the usual evaluation
statistics calculations. Note that the target variable is multinomial here, so no
Gini or AUC can be calculated. See Sect. 8.2.5. We run the Analysis node
and inspect the output statistics in Fig. 8.150. We observe that the training set
and the test set are about 72–73 % accurate, which is still a good value. When
looking at the coincidence matrix however, we see that the model only predicts
the majority classes AML and CLL. The minority classes ALL, CML, and
Non-leukemia are neglected and cannot be predicted by the SVM model with
a sigmoid kernel. Although the accuracy is quite good, the sigmoid kernel

Fig. 8.149 Choosing the sigmoid kernel type in the SVM node

Fig. 8.150 Accuracy and coincidence matrix in the SVM with sigmoid kernel

mapping of the data is therefore defective for the purpose of distinguishing between all five target classes.
4. We now train another SVM model that uses an RBF kernel. For that purpose, we
add another SVM node to the stream and connect it to the Type node. As we
want to apply the SVM to the RBF kernel and default parameters, no options
have to be changed in the SVM node. We can just use the default setting
provided by the SPSS Modeler. We run the SVM node so that the model nugget
appears.
After adding the usual Analysis node to the model nugget, we run the stream
again and compare the accuracy statistics with the ones of the sigmoid model.
See Fig. 8.151 for the coincidence matrix and the accuracy of the SVM model
with an RBF kernel. As can be seen, the overall accuracy of the RBF model has
increased (over 90 %) compared with the sigmoid model. Furthermore, the RBF
model takes all target categories into account and can predict all of these classes.
In conclusion, the sigmoid kernel type is an inappropriate data transformation
tool, as the SVM cannot identify the minority classes. The RBF on the other
hand is able to predict the majority as well as the minority classes. Hence, the

Fig. 8.151 Accuracy and coincidence matrix of the SVM with RBF kernel type

second model describes the data better and is therefore preferred to the sigmoid
one. This is an example of how the wrong choice of kernel function can lead to
corrupted and inadequate models.
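The same kernel comparison can be scripted in R as a quick check, assuming training and test data frames ("train" and "test") with the target column "Leukemia" as in Exercise 1; these names are placeholders, not objects from the solution stream.

    # Sketch of comparing a sigmoid and an RBF kernel with e1071 (names assumed).
    library(e1071)

    for (k in c("sigmoid", "radial")) {
      fit <- svm(Leukemia ~ ., data = train, kernel = k)
      acc <- mean(predict(fit, test) == test$Leukemia)
      cat(sprintf("kernel = %-7s  test accuracy = %.3f\n", k, acc))
    }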

Exercise 3 Titanic survival prediction and feature engineering


Name of the solution streams: titanic_data_feature_generation_SVM
Theory discussed in section: Sect. 8.2, Sect. 8.5.1, Sect. 8.5.2

The stream displayed in Fig. 8.152 is the complete solution to this exercise. The
stream contains commentaries that point to its main steps.

1. We start by opening the template stream “017 Template-Stream_Titanic” and


saving it under a different name. See Fig. 8.153. The template already comprises
a Partition node with a data split of 70 % to training and 30 % to test.
When thinking of a sinking ship, passengers in the upper decks have a better
chance of getting to the lifeboats in time. Therefore, the deck of the passenger’s
cabin can be a relevant variable. See Fig. 8.154 for an insight into the cabin
variable. When a cabin number is present, it occurs just a few times (once or
twice). Hence, the exact cabin number of a passenger is almost unique, and a
consolidation of cabins on the same deck (first letter of the cabin number) can
increase the prediction power, as it describes a more general structure.

Fig. 8.152 Complete stream of the Titanic survival prediction stream with new feature
generation

Fig. 8.153 Template stream of the Titanic data

Fig. 8.154 Insight into the cabin variable



The sex of the passenger is already one of the variables, as a woman is more
likely to survive a sinking ship than a man. There are further differences in survival
indicators, e.g., masters are normally rescued before their servants. Furthermore,
the probability of survival can differ for married or unmarried passengers, as for
example a married woman may refuse to leave her husband on the ship. This
information is hidden in the name variable. There, the civil status of the person,
academic status, or aristocratic title are located after the “,” (Fig. 8.155).
Furthermore, when thinking of the chaotic situation on a sinking ship, families
are separated, some get lost and fall behind and passengers are looking for
their relatives. Thus, it is reasonable to assume that the family size can have
an influence on the survival probability. The two variables “sibsp” and “parch”
describe the number of siblings/spouses and parents/children. So, the sum of
these two variables gives the number of relatives that were traveling with the
passenger.
2. Figure 8.156 shows the stream to derive these three new variables “deck”,
“title”, and “family size”. In Fig. 8.152, the complete stream is joined together
into a SuperNode.
First, we add a Derive node to the Type node, to extract the deck from the
cabin number. In this node, we set the field type to nominal, since the values are
letters. The formula to extract the deck can be seen in Fig. 8.157. If the cabin
number is present, then the first character is taken as the deck, and otherwise, the
deck is named with the dummy value “Z”. In Fig. 8.158, the distribution of the
new deck variable is displayed.

Fig. 8.155 Insight into the name variable



Fig. 8.156 SuperNode deriving the new features

Fig. 8.157 Derive node of the deck variable

Next, we extract the title from the name variable. For that purpose, we add
another Derive node to the stream, open it, and choose “nominal” as the field
type once again. The formula that dissects the title from the name can be seen in
Fig. 8.159. The title is located between the characters “,” and “.”. Therefore, the
locations of these two characters are established with the “locchar” statement,
and then the sub-string between these two positions is extracted. Figure 8.160
visualizes the occurrence of the titles in the names.
When reviewing Figs. 8.158 and 8.160, we note that there are some values of
the new variables, “deck” and “title”, that occur uniquely, in particular in the

Fig. 8.158 Frequencies of the deck values

Fig. 8.159 Derive node of the title variable

“title” variable. We assign these single values a similar, more often occurring
value. To do that, we first add a Type node to the stream and click on the “read
value” button, so the values of the two new variables are known in the
succeeding nodes. Now, we add a Reclassify node to the stream and open
it. We choose the deck variable as our reclassify field and check the reclassify

Fig. 8.160 Frequencies of the title values

into the existing field. The latter ensures that the new values are replaced and no
new field is created. Then, we click on the “Get” button to get all the values of
the deck variable. Lastly, we put all the existing values as new values, except for
the value “T” which is assigned to the “Z” category. See Fig. 8.161.
We proceed similarly with the title variable. We add a Reclassify node to the
stream, select “Title” as the reclassify field, and make sure that the values are
overwritten by the new variables and no additional variable is created. We click
on “Get” and assign the new category to the original values. Thereby, the
following values were reclassified:

Old value                     New value
Capt, Don, Major              Sir
Dona, Jonkheer, the Countess  Lady
Mme.                          Mlle.

All other values remain the same. See Fig. 8.162.


Finally, we add another Derive node to the stream that calculates the family
size by just adding the variables "Sibsp" and "Parch". See Fig. 8.163. In
Fig. 8.164, the distribution of the "famSize" variable is displayed separately for
the surviving and non-surviving passengers. We see that passengers with smaller
travelling families have a better chance of survival than passengers with a large
family.

Fig. 8.161 Reclassification of the deck values

Fig. 8.162 Reclassification of the title values

Fig. 8.163 Calculation of the famSize variable

Fig. 8.164 Histogram of the famSize variable
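The three derived variables from step 2 can also be sketched in R (the data frame "titanic" and its column names are assumed placeholders; the streams above use Derive nodes with CLEM formulas instead):

    # Hedged sketch of the deck, title, and family size features (column names assumed).
    titanic$deck <- ifelse(is.na(titanic$cabin) | titanic$cabin == "",
                           "Z",                                        # dummy value for missing cabins
                           substr(titanic$cabin, 1, 1))                # first letter = deck
    titanic$title <- sub(".*,\\s*([^.]+)\\..*", "\\1", titanic$name)   # text between "," and "."
    titanic$famSize <- titanic$sibsp + titanic$parch                   # relatives travelling with the passenger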

3. We add an Auto Data Prep node to the stream and select the standard options for
the data preparation, i.e., replacement of the missing values with the mean and
mode, and performing of a z-transformation for the continuous variables.
See Fig. 8.165 and Sect. 2.7.6 for additional information on the Auto Data
Prep node. After running the Auto Data Prep node, we add another Type node
to the stream, in order to determine all the variable values, and we define the

Fig. 8.165 Auto Data Prep node

“survival” variable as our target variable and set the measurement type to
“Flag”.
4. Now, we add four SVM nodes to the stream and connect them all to the last Type
node. We open the first one and in the Fields node we define the variable
“survived” as the target and the following variables as input: “sibsp_transformed”,
“parch_transformed”, “age_transformed”, “fare_transformed”, “sex_transformed”,
“embarked_transformed”, “pclass_transformed”. Furthermore, we put “Parti-
tion” as the partition field. In the Analyze tab, we then enable the predictor
importance calculations. See Fig. 8.166.
We proceed with the other three SVM nodes in the same manner, but add
successively the new established variables “deck”, “Title”, and “famSize”. We
then run all four SVM nodes and rearrange the appearing model nuggets by
connecting them into a series. See Fig. 8.167, where the alignment of the model
nuggets is displayed.
5. We open the model nuggets one after another to determine the predictor impor-
tance. These are displayed in Figs. 8.168, 8.169, 8.170, and 8.171. We observe
that in the model with only the original variables as input, the “sex” variable is
the most important for survival prediction, followed by “pclass” and
“embarked.”
If the “deck” variable is considered as an additional input variable, the
importance of the “sex” is reduced, but this variable is still the most important
one. The second most important variable for the prediction is the new variable

Fig. 8.166 Selection of the variable roles in the SVM node of the model without new established
features

Fig. 8.167 Sub-stream with alignment of the model nuggets in a series

“deck”, however. This means that this new variable describes a new aspect in
the data.
When the “Title” variable is also included in the SVM model, it becomes the
most important one, even before the “sex” variable. This could be due to the fact
that the Title variable describes the civil standing of a passenger as well as their
gender and therefore contains more information.
Therefore, “famSize” is the variable with the least predictor importance in the
model that includes all variables. See Fig. 8.171.

Fig. 8.168 Variable importance in the SVM with no new features included

Fig. 8.169 Variable importance in the SVM with the deck variable included

Fig. 8.170 Variable importance in the SVM, with deck and title included

Fig. 8.171 Variable importance in the SVM, with “deck,” “Title,” and “famSize” included

Fig. 8.172 Renaming of the prediction fields in the Filter node

Fig. 8.173 Evaluation statistics calculated by the Analysis node for the four Titanic survival
SVM classifiers

Fig. 8.174 ROC of the four Titanic survival SVM classifiers

We now add a Filter node to the stream and connect it to the last model
nugget. This is only done to rename the predictor fields. See Fig. 8.172.
We then add the Analysis and Evaluation nodes to the stream and connect
them to the Filter node. See Sect. 8.2.5 and Fig. 8.28 for the options in these
nodes. When inspecting the evaluation statistics from the Analysis node
(Fig. 8.173), we observe that accuracy as well as the Gini increase successively
in the training set and test set. There is just one exception in the test set statistics.
When adding the “deck” variable, the accuracy and Gini are both a bit lower than
when we exclude the “deck”. All in all, however, the newly generated features
improve the prediction performance, more precisely, from 0.723 Gini points to
0.737 points in the test set, and from 0.68 to 0.735 in the training data. This improvement
is visualized by the ROC of the four classifiers in Fig. 8.174. There, the model
including all new generated variables lies above the other ones.

8.6 Neuronal Networks

Neural networks (NN) are inspired by the functionality of the brain. They also
consist of many connected units that receive multiple inputs from other units,
process them, and pass new information onto yet other units. This network of
units simulates the processes of the brain in a very basic way. Due to the relation-
ship to the brain, the units are also called neurons, hence, neural network. An NN is

a black box algorithm, just like the SVM, since the structure and mechanism of data
transformation and the transfer between neurons are so complex and unintuitive. The
results of an NN are difficult to retrace and therefore hard to interpret. On the other
hand, its complexity and flexibility makes the NN one of the most powerful and
universal classifiers, which can be applied to a variety of problems where other
methods, such as rule-based ones, would fail. In the first section, we briefly describe
the theoretical background of an NN, following Lantz (2013),
before proceeding to look at its utilization in the SPSS Modeler.

8.6.1 Theory

The concept of a NN is motivated by the human brain and its functionality. An NN


is intended to simulate easy brain processes, and like its original, an NN consists of
multiple neurons or units that process and pass information between each other.

Functionality of one neuron and the activation function


During data processing, a neuron receives weighted signals from some other
neurons and transforms the sum of these weighted signals into new information
via an activation function φ.
For example, in the illustration of this mechanism in Fig. 8.175, the input signals
+1, x1, x2 are multiplied by the weights ω0, ω1, ω2 and then added up. This sum
is then transformed via the activation function φ and passed to the next neuron.
Hence, the output of the neuron in the middle is

    y = \varphi\left( \sum_{i=0}^{2} \omega_i \cdot x_i \right),

Fig. 8.175 Function of a neuron



where x0 = 1. The input x0 is added as a constant in the sum and is often called a
bias. The purpose of the weights is to control the contribution of each input signal
to the sum. Since every neuron has multiple inputs with different weights, this gives
huge flexibility in tuning the inputs individually for each neuron. This is one of the
biggest strengths of the NN, qualifying their application to a variety of complex
classification problems. The weights are not interpretable in their contribution to
the results however, due to the complexity of the network.
The activation function φ is typically the sigmoid function or the hyperbolic
tangent function, where the latter is used in the SPSS Modeler, see IBM (2015a). A
linear function is also plausible, but a function that is linear in a neighborhood of
0 and nonlinear at the limits, as the two above-mentioned functions are, is a more
suitable choice, since both situations can be modeled with these kinds of functions.
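For reference, these two activation functions have the standard forms (these are common textbook definitions, not reproduced from the Modeler's documentation)

    \varphi_{\text{sigmoid}}(z) = \frac{1}{1 + e^{-z}}, \qquad
    \varphi_{\tanh}(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}},

both of which are approximately linear around 0 and saturate towards their limits.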

The topology and layers of an NN


Besides the activation function, the topology, that is, the number of neurons and
their connections to each other, is also important for the definition and functionality of
the NN. In more detail, the neurons of an NN are structured in layers between which
the information gets passed through. There are three types of layers in an NN: an
Input layer, one or multiple Hidden layer(s), and an Output layer. See Fig. 8.176
for a sketch of a typical NN with three layers.
The input layer comprises the initial neurons, which receive unprocessed raw
data. Each neuron in the input layer thereby is responsible for handling one input
variable, which it transforms via the activation function and then passes the
outcome onto the next layer neurons. The neurons in the output layer, on the
other hand, receive the data, which were processed by multiple neurons in the
network, and calculate a final score, e.g., a probability, and prediction for each
target class. Each neuron in the output layer represents one target category and

Fig. 8.176 Sketch of a typical neural network



outputs the score for this category. Between the input and output layers can be one
or multiple hidden layers. The neurons of these layers get the data from neurons of
the previous layer and process them as described above. The manipulated data are
then passed to the neurons of the next hidden or the output layer.
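To get a feeling for the size of such a network, the number of estimated weights can be counted directly from the topology (a small worked example of our own, not taken from the book): with n_in input neurons, one hidden layer of n_hid neurons, and n_out output neurons, an MLP has

    (n_{\text{in}} + 1)\, n_{\text{hid}} + (n_{\text{hid}} + 1)\, n_{\text{out}}

weights, including the bias terms. A network with 64 inputs, 18 hidden units, and 10 output classes, for instance, already has (64 + 1) * 18 + (18 + 1) * 10 = 1360 weights.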

Necessary conditions and further remarks


The NN as described until now is the most common and well-known Multilayer
Perceptron (MLP) model. Other NN models exist, however, for example when the
activation functions follow a Gaussian structure. In this case, the network consists
of only one hidden layer and uses a distance measure instead of the weighted sum,
as the input for the activation function. This model is called a Radial Basis
Function (RBF) network and is also available in the SPSS Modeler. For more
information on differences between the two network types, we refer to
Tuffery (2011).
Recalling the complex structure of the network and the large number of weights,
and thus tuning parameters, the NN is one of the most flexible, powerful, and
accurate data mining methods. These many parameters do cause drawbacks, how-
ever, since the NN tends to overfit the training set. One has to be aware of this
phenomenon and always use a test set to verify the generalization ability of the
model. Due to this problem, many software applications, such as the SPSS Modeler,
have implemented overfitting prevention in the NN, where a small part of the data is
used to validate the model during training and thus warn if overfitting occurs.
One of the greatest dangers with NN is the possibility of a nonoptimal solution.
This results from the mechanism of parameter/weight estimation. The parameters are
calculated with an approximation algorithm, which can lead to a nonoptimal
solution.
Furthermore, all inputs have to be continuous for the neurons to process the data.
The SPSS Modeler automatically translates categorical and discrete variables into
numeric ones, however, so the user doesn’t have to worry about this source of error.
The NN can also be used for regression problems. We won’t describe those
situations here, but refer interested readers to Runkler (2012) and Cheng and
Titterington (1994).
For additional remarks and assumptions with the NN, in classification or regres-
sion cases, see Lantz (2013), Tuffery (2011) and Cheng and Titterington (1994).

8.6.2 Building a Network with SPSS Modeler

A neural network (NN) can be trained with the Neural Network node in the SPSS
Modeler. We now present how to set up the stream for an NN with this node, based
on the digits recognition data, which comprises data on handwritten digits from
different people. The goal now is to build a classifier that is able to identify the
correct digit from an arbitrary person’s handwriting. These classifiers are already in
use in many areas, as described in Sect. 8.1.

Description of the model


Stream name: Neural_network_digits_recognition
Based on dataset: Optdigits_training.txt, optdigits_test.txt (see Sect. 10.1.25)
Stream structure:
Important additional remarks: For a classification model, the target variable should be categorical. Otherwise, a regression model should be trained by the Neural Network node. See Runkler (2012) and Cheng and Titterington (1994) for regression with an NN.
Related exercises: All exercises in Sect. 8.6.4

The stream consists of two parts, the training and the validation of the model. We
therefore split the description into two parts also.
Training of an NN
Here, we describe how to build the training part of the stream. This is displayed in
Fig. 8.177.

1. We start by opening the template stream “020 Template-Stream_digits” and


saving it under a different name. The template stream consists of two parts, in

Fig. 8.177 Training part of the stream, which builds a digits identification classifier

Fig. 8.178 Template stream of handwritten digits data

which the training (“optdigits_training.txt”) and test sets (“optdigits_test.txt”)


are imported. See Fig. 8.178.
A Filter node is then attached to each Source node, to rename the field with the
digit labels, i.e., Field65, becomes “Digit”. This field is then assigned in the Type
nodes as the target variable. See Fig. 8.179 for definition of the target variable
and its values.
2. We now concentrate on the training stream and add a Distribution node to the
Type node, to display how often each digit occurs within the training set. For
the description of the Distribution node, see Sect. 3.2.2. The frequencies of the
handwritten digits can be viewed in Fig. 8.180. We note that the digits 0–9
appear almost equally in the training data.
3. Now we add a Neural Network node to the stream and connect it to the Type
node. In the Fields tab of the node options, the target and input variables for the
NN can be defined. Here, the “Digit” variable is the target variable that contains
the digits label for each handwritten data record. All other variables, i.e.,
“Field1” to “Field64”, are treated as inputs in the network. See Fig. 8.181.
In the Build Option tab, the parameters for the model training process are
defined. Firstly in the Objective options, we can choose between building a new
model and continuing to train an existing one. The latter is useful if new data are
available and a model has to be updated, to avoid building the new model from
scratch. Furthermore, we can choose to build a standard or an ensemble model.
For a description of ensemble models, boosting and bagging, we refer to Sect.
5.3.6. Here, we intend to train a new, standard model. See Fig. 8.182.

Fig. 8.179 Type node of the digits data and assignment of the target field and values

Fig. 8.180 Frequency of the digit within the training set



Fig. 8.181 Definition of the target and input variables in the Neural Network node

In the Basic options, the type of the network, with its activation function and
topology, has to be specified. The Neural Network node provides the two
network models MLP and RBF, see Sect. 8.6.1 for a description of these two
model types. We choose the MLP, which is the default setting and the most
common one. See Fig. 8.183. The number of hidden layers and the unit size can
be specified here too. Only networks with a maximum of 2 hidden layers can be
built with the Neural Network node, however. Furthermore, the SPSS Modeler
provides an algorithm that automatically determines the number of layers and
units. This option is enabled by default and we accept. See bottom arrow in
Fig. 8.183. We should point out that automatic determination of the network
topology is not always optimal, but a good choice to go with in the beginning.
With the next options, the stopping rules of network training can be defined.
Building a neural network can be time and resource consuming. Therefore, the
SPSS Modeler provides a couple of possibilities for terminating the training
process at a specific time. These include a maximum training time, a maximum
number of iterations of the coefficient estimation algorithm, and a minimum

Fig. 8.182 Selection of the model type

accuracy. The latter can be set if a particular accuracy has adequate prediction
power. We choose the default setting and fix a maximum processing time of
15 min. See Fig. 8.184.
In the Ensemble options, the aggregation function and number of models in
the ensemble can be specified. See Fig. 8.185. These options are only relevant if
an ensemble model is trained.
The available aggregation options for a categorical target, as in classifiers, are
listed in Table 8.7. For additional information on ensemble models, as well as
boosting and bagging, we refer to Sect. 5.3.6.
In the Advanced option view, the size of the overfitting prevention set can be
specified; 30 % is the default setting. Furthermore, an NN is unable to handle
missing values. Therefore, a missing values handling tool should be specified.
The options here are the deletion of data records with missing values or the
replacement of missing values. For continuous variables, the average of the
minimum and maximum value is imputed, while for categorical fields the most

Fig. 8.183 Definition of the network type and determination of the layer and unit number

Fig. 8.184 The stopping rules are set



Fig. 8.185 Definition of Ensemble model parameters

Table 8.7 Aggregation mechanism for the ensemble models of a classifier


Mechanism                 Description
Voting                    The category that is predicted most often by the single models wins.
Highest probability wins  The category with the highest probability over all models is predicted.
Highest mean probability  The probabilities for each category are averaged over all models, and the category with the highest average wins.

frequent category. See Fig. 8.186 for the Advanced option view of the Neural
Network node.
In the Model Options tab, the usual calculation of predictor importance can be
enabled, which we do in this example. See Fig. 8.187.
4. The Option setting for the training process is now completed and we run the
stream, thus producing the model nugget. The model nugget, with its graphic and
statistics, is explained in the subsequent Sect. 8.6.3.
5. We now add an Analysis node to the stream, connect it to the model nugget, and
enable the calculation of the coincidence matrix. See Sect. 8.2.5 for a description
of the Analysis node. The output of the Analysis node can be viewed in
Fig. 8.188. We see that accuracy is extremely high, with a recognition rate of
over 97 % on handwritten digits. On the coincidence matrix, we can also see that

Fig. 8.186 Overfit prevention and the setting of missing values handling

Fig. 8.187 The predictor importance calculation is enabled in the Model Option tab

Fig. 8.188 Analysis node statistics for the digits training and NN classifier

Fig. 8.189 Validation part of the stream, which builds a digits identification classifier

the prediction is very precise for all digits. In other words, there is no digit that
falls behind significantly in the accuracy of the prediction by the NN.

Validation of the NN
Now, we validate the trained model from part one with a test set. Figure 8.189
shows the stream.
Since the validation of a classifier is part of the modeling process, we continue
the enumeration of the model training here.

6. First, we copy the model nugget and paste it into the stream canvas. Afterwards,
we connect the new nugget to the Type node of the stream segment that imports
the test data (“optdigits_test.csv”).

Fig. 8.190 Analysis node statistics for the digits test set and NN classifier

7. We then add another Analysis node to the stream and connect it to the new
nugget. After setting the option in the Analysis node, see Sect. 8.2.5, we run it
and the validation statistics open in another window. See Fig. 8.190 for the
accuracy statistics of the test set. We observe that the NN still predicts the digits
very precisely, with over 94 % accuracy, without neglecting any digit. Hence, we
see that the digits recognition model is applicable to independent data and can
identify digits from unfamiliar handwritings.
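A comparable single-hidden-layer MLP can be sketched in R with the nnet package. The comma-separated file format, the label in column 65, and the hidden layer size of 18 are assumptions taken from the description in this section, and the results will not match the Modeler's exactly.

    # Hedged sketch of a digits MLP in R with nnet (file format and names assumed).
    library(nnet)

    train <- read.table("optdigits_training.txt", sep = ",")
    test  <- read.table("optdigits_test.txt",     sep = ",")
    names(train)[65] <- names(test)[65] <- "Digit"              # column 65 holds the digit label
    train$Digit <- as.factor(train$Digit)
    test$Digit  <- as.factor(test$Digit)

    set.seed(1)
    fit <- nnet(Digit ~ ., data = train, size = 18,             # one hidden layer with 18 units
                decay = 0.1, maxit = 300, MaxNWts = 5000)       # enough weights for 64 inputs and 10 classes
    pred <- predict(fit, test, type = "class")
    mean(pred == test$Digit)                                    # test set accuracy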

8.6.3 The Model Nugget

In this section, we introduce the contents of the Neural Network model nugget. All
graphs and statistics from the model are located in the Model tab, which is
described in detail below.

Model summary
The first view in the Model tab of the nugget displays a summary of the trained
network. See Fig. 8.191. There, the target variable and model type, here “Digit” and
MLP, are listed as well as the number of neurons in every hidden layer that was
included in the network structure. In our example, the NN contains one hidden layer
with 18 neurons. This information on the number of hidden layers and neurons is

Fig. 8.191 Summary of the NN

particularly useful when the SPSS Modeler automatically determined them. Fur-
thermore, the reason for stopping is displayed. This is important to know, as a
termination due to time limits or overfitting, instead of an “Error cannot be
further decreased” stopping, means that the model is not optimal and can be
improved by adjusting parameters or allowing a longer run-time.
Basic information on the accuracy of the NN on the training data is displayed
below. Here, the handwritten digits NN classifier has an accuracy of 97.9 %. See
Fig. 8.191.

Predictor importance
The next view displays the importance of the input variables in the NN. See
Fig. 8.192. This view is similar to the one in the Logistic node, and we refer to
Sect. 8.3.4 for a description of this graph. At the bottom of the graph there is a
sliding regulator, where the number of displayed input fields can be selected. This is
convenient when the model includes many variables, as in our case with the digits
data. We see that nearly all fields are equally important for digit identification.

Coincidence matrix
In the classification view, the predicted values against the original values are
displayed in a heat map. See Fig. 8.193. The background color intensity of a cell
thereby correlates with its proportion of cross-classified data records. The entries on
the matrix can be changed at the bottom, see arrow in Fig. 8.193. Depending on the
selected option, the matrix displays the percentage of correctly identified values for
each target category, the absolute counts, or just a heat map without entries.

Network structure
The Network view visualizes the constructed neural network. See Fig. 8.194. This
can be very complex with a lot of neurons in each layer, especially in the input
layer. Therefore, only a portion of the input variables can be selected, e.g., the most

Fig. 8.192 Predictor importance in the NN

Fig. 8.193 Heat map of the classification of an NN



Fig. 8.194 Visualization of the NN

Fig. 8.195 Visualization of the coefficients of the NN

important variables, by the sliding regulator at the bottom. Furthermore, the align-
ment of the drawn network can be changed at the bottom, from horizontal to vertical
or bottom to top orientation. See left arrow in Fig. 8.194.
Besides the structure of the network, the estimated weights or coefficients within
the NN can be displayed in network form. See Fig. 8.195. One can switch between
these two views with the select list. See bottom right arrow in Fig. 8.194. Each
connecting line of the coefficients network represents a weight, which is displayed
when the mouse cursor is moved over it. Each line is also colored; darker tones
indicate a positive weight, and lighter tones indicate a negative weight.

8.6.4 Exercises

Exercise 1 Prediction of chess endgame outcomes and comparisons with other classifiers
The dataset “chess_endgame_data.txt” contains 28,056 chess endgame positions,
with the white king, a white rook, and the black king only left on the board (see
Sect. 10.1.6). The goal of this exercise is to train an NN to predict the outcome of
such endgames, i.e., whether white wins or black achieves a draw. The variable
“Result for White” thereby describes the number of moves white needs to win or
reach a draw.

1. Import the chess endgame data with a proper Source node and reclassify the
“Result for White” variable into a binary field that indicates whether white wins
the game or not. What is the proportion of “draws” in the dataset?
2. Train a neural network with 70 % of the chess data and use the other 30 % as a
test set. What are the accuracy and Gini values for the training set and test set?
Does the classifier overfit the training set?
3. Build an SVM and a Logistic Regression model on the same training data. What
are the accuracy and Gini values for these models? Compare all three models
with each other and plot the ROC for each of them.

Exercise 2 Credit rating with a Neural network and finding the best network
topology
The “tree_credit” dataset (see Sect. 10.1.33) comprises demographic and historical
loan data from bank customers, as well as a prognosis on credit worthiness (“good”
or “bad”). In this exercise, a neural network has to be trained that decides if the bank
should give a certain customer a loan or not.

1. Import the credit data and set up a cross-validation scenario with training, test,
and validation sets, in order to compare two networks with different topologies
with each other.
2. Build a neural network to predict the credit rating of the customers. To do this,
use the default settings provided by the SPSS Modeler and, in particular, the
automatic hidden layer and unit determination method. How many hidden layers
and units are included in the model and what is its accuracy?
3. Build a second neural network with custom-defined hidden layers and
units. Try to improve the performance of the automatically determined network.
Is there a set-up with a higher accuracy on the training data and what is its
topology?
4. Compare the two models, automatic and custom determination of the network
topology, by applying the trained models to the validation and test sets. Identify
the Gini values and accuracy for both models. Draw the ROC for both models.

Exercise 3 Construction of neural networks simulating logical functions


Neural networks have their origin in calculating logical operations, i.e., AND, OR,
and NOT. We explain this in the example of the AND operator. Consider the simple
network shown in Fig. 8.175, with only a single neuron in the hidden layer and two
input variables x1, x2. In this case, these variables can take the values 0 or 1. The
activation function φ is the sigmoid function, i.e.,

\varphi(x) = \frac{1}{1 + e^{-x}}.
The task now is to assign values to the weights ω0, ω1, ω2, such that

\varphi(x) \approx \begin{cases} 1, & \text{if } x_1 = 1 \text{ and } x_2 = 1 \\ 0, & \text{otherwise.} \end{cases}

A proper solution for the AND operator is shown in Fig. 8.196. When looking at the
four possible input values and calculating the output of this NN we get

x1   x2   φ(x)
0    0    φ(−200) ≈ 0
1    0    φ(−50) ≈ 0
0    1    φ(−50) ≈ 0
1    1    φ(100) ≈ 1
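This calculation can be verified with a few lines of Python. The weights used below, ω0 = −200 and ω1 = ω2 = 150, are an assumption that reproduces the arguments φ(−200), φ(−50), φ(−50), and φ(100) from the table above; Fig. 8.196 may show other, equally valid values.

import numpy as np

def sigmoid(z):
    # logistic activation function
    return 1.0 / (1.0 + np.exp(-z))

# assumed weights, consistent with the truth table above: bias w0, input weights w1, w2
w0, w1, w2 = -200.0, 150.0, 150.0

for x1 in (0, 1):
    for x2 in (0, 1):
        z = w0 + w1 * x1 + w2 * x2                 # weighted sum entering the neuron
        print(x1, x2, f"{sigmoid(z):.3f}")         # prints approx. 0 or 1, the logical AND of x1 and x2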

Construct a simple neural network for the logical OR and NOT operators, by
proceeding in the same manner as just described for the logical AND. Hint: the
logical OR is not exclusive (i.e., not XOR), which means the output of the network
has to be nearly 0 if and only if both input variables are 0.

Fig. 8.196 Neural network for the logical AND

8.6.5 Solutions

Exercise 1 Prediction of chess endgame outcomes and comparisons with other classifiers
Name of the solution streams chess_endgame_prediction_nn_svm_logit
Theory discussed in section Sect. 8.2
Sect. 8.6.1

Figure 8.197 shows the final stream for this exercise.

1. First, we import the dataset with the File Var. File node and connect it to a Type
node. Then open the latter one and click the “Read Values” button, to make sure
the following nodes know the values and types of variables in the data. After-
wards, we add a Reclassify node and connect it to the Type node. In the
Reclassify node, we select the “Result for White” field, since we intend to
change its values. By clicking on the “Get” button, the original values appear
and can be assigned to another value. In the New value column, we write “win”
next to each original value that represents a win for white, i.e., those which are
not a “draw”. The “draw” value, however, remains the same in the newly
assigned variable. See Fig. 8.198.
We now connect a Distribution node to the Reclassify node, to inspect how
often a “win” or “draw” occurs. See Sect. 3.2.2 for how to plot a bar plot with the
Distribution node. In Fig. 8.199, we observe that a draw occurs about 10 % of the
time in the present chess dataset.
2. We add another Type node to the stream after the Reclassify node, and set the
“Result for White” variable as the target field and its measurement type to
“Flag”, which ensures a calculation of the Gini values for the following
classifiers. See Fig. 8.200.
We now add a Partition node to the stream to split the data into a training set
(70 %) and test set (30 %). See Sect. 2.7.7 for how to perform this step in the
Partition node. Afterwards, we add a Neural Network node to the stream and
connect it to the Partition node. Since the target and input variables are defined in
the preceding Type node, the roles of the variables should be automatically

Fig. 8.197 Stream of the chess endgame prediction exercise



Fig. 8.198 Reclassification of the “Results for White” variable to a binary field

Fig. 8.199 Distribution of the reclassified “Result for White” variable

identified by the Neural Network node and, hence, appear in the right role field. See
Fig. 8.201.
We furthermore use the default settings of the SPSS Modeler, which in
particular include the MLP network type, as well as an automatic determination
of the neurons and hidden layers. See Fig. 8.202. We also make sure that the

Fig. 8.200 Definition of the target variable and setting its measurement type to “Flag”

Fig. 8.201 Target and input variable definition in the Neural Network node

Fig. 8.202 Network type and topology settings for the chess outcome prediction model

Fig. 8.203 Summary of the neural network classifier that predicts the outcome of a chess game

predictor importance calculation option is enabled in the Model Options tab. See
how this is done in Fig. 8.187.
Now, we run the stream and the model nugget appears. In the following, the
model nugget is inspected and the results and statistics interpreted.
The first thing that strikes our attention is the enormous accuracy of 99.5 %
correct outcome predictions in the training data. See Fig. 8.203. To rule out
overfitting, we have to inspect the statistics for the test set later in the Analysis

Fig. 8.204 Importance of the pieces’ positions on the chessboard for outcome prediction

Fig. 8.205 Classification heat map of counts of predicted versus observed outcomes

node. These are almost as good as the accuracy here (see Fig. 8.206), however,
so we can assume that the model is not overfitting the training data. Moreover,
we see in the model summary overview, that one hidden layer with 7 neurons
was included in the network. See Fig. 8.203.
When inspecting the predictor importance, we detect that the positions of the
white rook and black king are important for prediction, while the white king’s
position on the board plays only a minor role in the outcome of the game in
16 moves. See Fig. 8.204.
In the classification heat map of absolute counts, we see that only 100 out of
19,632 game outcomes are misclassified. See Fig. 8.205.
To also calculate the accuracy of the test set, and the Gini values for both
datasets, we add an Analysis node to the neural network model nugget. See
Sect. 8.2.5 for the setting options in the Analysis node. In Fig. 8.206, the output
of the Analysis node is displayed, and we verify that the model performs as well

Fig. 8.206 Analysis node statistics of the chess outcome Neural Network classifier

on the test set as on the training set. More precisely, the accuracies are 99.49 %
and 99.48 %, and the Gini values are 0.999 and 0.998 for the training and test set,
respectively. Hence, the model does not overfit the training set.
3. To build an SVM and Logistic Regression model on the same training set, we
add an SVM node and a Logistic node to the stream and connect each of them to
the Partition node. In the SVM node, no options have to be modified, while in the
Logistic node, we choose the “Stepwise” variable selection method. See
Fig. 8.207. Afterwards, we run the stream.
Before comparing the prediction performances of the three models, NN,
SVM, and Logistic Regression, we rearrange the model nuggets and connect
them into a series. The models are now executed successively on the data.
Compare the rearrangement of the model nuggets with Fig. 8.107.
We now add an Analysis node and an Evaluation node to the last model
nugget in this series and run these two nodes. See Sect. 8.2.5 and Fig. 8.28 for a
description of these two nodes. The outputs are displayed in Figs. 8.208, 8.209,
and 8.210. When looking at the accuracy of the three models, we notice that the
NN performs best here; it has over 99 % accuracy in the training set and test set,

Fig. 8.207 Variable selection method in the Logistic node

followed by the SVM with about 96 % and 95 % accuracy, respectively, and lastly the Logistic
Regression, with still a very good accuracy of about 90 % in both datasets. The
coincidence matrix, however, gives a more detailed insight into the prediction
performance and reveals the actual inaccuracy of the last model. While NN and
SVM are able to detect both “win” and “draw” outcomes, the Logistic Regres-
sion model has classified every game as a win for white. See Fig. 8.208. This still
yields a good overall accuracy, since only 10 % of the games end with a draw
(recall Fig. 8.199), but it reflects the class imbalance rather than real discriminatory
power. The reason for this behavior could be that the chess problem is nonlinear,
and a linear classifier such as Logistic Regression is unable to separate the two
classes from each other, a problem intensified by the imbalance in the data. For
this chess problem, a nonlinear classifier performs
better, and this is also confirmed by the Gini values of the three models in
Fig. 8.209. The Gini of the NN and SVM are pretty high and nearly 1, while the
Gini of the Logistic Regression is about 0.2, noticeably smaller. This also
indicates an abnormality in the model. The Gini or AUC is visualized by the
ROC in Fig. 8.210. The ROC of the NN and SVM are almost perfect, while the
ROC of the Logistic Regression runs clearly beneath the other two.

Fig. 8.208 Accuracy of the Neural Network, SVM, and Logistic regression chess outcome
classifier

Fig. 8.209 AUC/Gini of the Neural Network, SVM, and Logistic regression chess outcome
classifier

Fig. 8.210 ROC of the Neural Network, SVM, and Logistic regression chess outcome classifier

In conclusion, the problem of predicting the outcome of a chess endgame is


very complex, and linear methods reach limits here. An NN, on the other hand, is
well suited for such problems and outperforms the other methods shown in this
exercise.
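As a side note, the Gini values reported by the Analysis node are directly related to the AUC via Gini = 2 · AUC − 1. The short sketch below shows how this relationship could be checked outside the Modeler with scikit-learn; the label and probability arrays are placeholders standing in for the outputs of the three classifiers.

from sklearn.metrics import roc_auc_score

def gini_from_auc(y_true, p_positive):
    # Gini coefficient derived from the area under the ROC curve
    return 2.0 * roc_auc_score(y_true, p_positive) - 1.0

# placeholder data: true outcomes (1 = win, 0 = draw) and predicted win probabilities
y_true = [1, 1, 0, 1, 0, 1, 1, 0]
p_win = [0.90, 0.80, 0.30, 0.70, 0.40, 0.95, 0.60, 0.20]

print("AUC :", round(roc_auc_score(y_true, p_win), 3))
print("Gini:", round(gini_from_auc(y_true, p_win), 3))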

Exercise 2 Credit rating with a Neural network and finding the best network
topology
Name of the solution streams tree_credit_nn
Theory discussed in section Sect. 8.2
Sect. 8.6.1

Figure 8.211 shows the final stream in this exercise.

1. We start by opening the stream “000 Template-Stream tree_credit,” which


imports the tree_credit data and already has a Type node attached to it, and
save it under a different name. See Fig. 8.212.
To set up a cross-validation with training, validation, and testing data, we add
a Partition node to the stream and place it between the Source node and Type
node. Then, we open the node and define 60 % of the data as training data, 20 %
as validation data, and the remaining 20 % as test data. See Sect. 2.7.7 for a
description of the Partition node.
2. We open the Type node and define the measurement type of the variable “Credit
rating” as “Flag” and its role as “Target”. This is done as in the previous solution,
see Fig. 8.200. Now, we add a Neural Network node to the stream and connect it
to the Type node. The variable roles are automatically identified, see Fig. 8.213,

Fig. 8.211 Stream of credit rating prediction with an NN exercise

Fig. 8.212 Template stream for the tree_credit data

and since we use the default settings, nothing has to be modified in the network
settings. In particular, we use the MLP network type and automatic topology
determination. See Fig. 8.214.
Now we run the stream.
In the Model summary view in the model nugget, we see that one hidden layer
with six neurons is included while training the data. See Fig. 8.215. We also
notice that the model has an accuracy of 76.4 % in the training data. The network
is visualized in Fig. 8.216.
3. We add a second Neural Network node to the stream and connect it to the Type
node. Unlike with the first Neural Network node, here we manually define the
number of hidden layers and units. We choose two hidden layers with ten and
five neurons as our network structure, respectively. See Fig. 8.217. The type
remains MLP.
We run this node and open the model nugget that appears. The accuracy of this
model with two hidden layers (ten units in the first one and five in the second),
has increased to 77.9 %. See Fig. 8.218. Hence, the model performs better on the

Fig. 8.213 Definition of target and input fields in the Neural Network node

Fig. 8.214 Default network type and topology setting in the Neural Network node

Fig. 8.215 Summary of the NN, with automatic topology determination that predicts credit
ratings

Fig. 8.216 Visualization of the NN, with automatic topology determination that predicts credit
ratings

Fig. 8.217 Manually define the network topology of the Neural Network node

Fig. 8.218 Summary of the NN, with manually defined topology that predicts credit ratings

training data than the automatically established network. Figure 8.219 visualizes
the structure of this network with two hidden layers.
4. To compare the two models with each other, we add a Filter node to each model
nugget and rename the prediction fields with a meaningful name. See Fig. 8.220
for the Filter node, after the model nugget with automatic network determina-
tion. The setting of the Filter node for the second model is analog, except for the
inclusion of the “Credit rating” variable, since this is needed to calculate the
evaluation statistics.

Fig. 8.219 Visualization of the NN, with manually defined topology that predicts credit ratings

Fig. 8.220 Filter node to rename the prediction fields of the NN

With a Merge node, we combine the predictions of both models. We refer to


Sect. 2.7.9 for a description of the Merge node. We add the usual Analysis and
Evaluation nodes to the Merge node and set the usual options for a binary target
variable, by recalling Sect. 8.2.5 and Fig. 8.28. In Fig. 8.221, the accuracy and

Fig. 8.221 Evaluation statistics from the two networks that predict credit ratings

Fig. 8.222 ROC of the two neural networks

Gini measures can be viewed for both models and all three subsets. We notice
that for the network with a manually defined structure, the accuracy is higher in
all three sets (training, validation, and test), as are the Gini values. This is also
visualized by the ROCs in Fig. 8.222, where the curves of the manually defined

topology network lie above the automatically defined network. Hence, a network
with two hidden layers having ten and five units describes the data slightly better
than a network with one hidden layer containing six neurons.
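For readers who want to experiment with network topologies outside the Modeler, the following sketch compares the two structures discussed above, one hidden layer with six neurons versus two hidden layers with ten and five neurons, using scikit-learn's MLPClassifier on a synthetic stand-in dataset. The data and the resulting accuracies are therefore only illustrative and will not match the tree_credit figures.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# synthetic binary classification data as a stand-in for the tree_credit dataset
X, y = make_classification(n_samples=2000, n_features=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)

for layers in [(6,), (10, 5)]:   # automatically determined-like vs. manually defined topology
    mlp = MLPClassifier(hidden_layer_sizes=layers, max_iter=1000, random_state=1)
    mlp.fit(X_train, y_train)
    print(layers, "test accuracy:", round(mlp.score(X_test, y_test), 3))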

Exercise 3 Construction of neural networks simulating logical functions


The logical “OR” network
A logical “OR” network is displayed in Fig. 8.223. When looking at the four
possible input values and calculating the output of the neuron, we get

x1   x2   φ(x)
0    0    φ(−100) ≈ 0
1    0    φ(100) ≈ 1
0    1    φ(100) ≈ 1
1    1    φ(300) ≈ 1

The logical “NOT” network


A logical “NOT” network has only one input variable and should output 1 if the
input is 0, and vice versa. A solution is displayed in Fig. 8.224. When looking at the
two possible input values and calculating the output of the neuron, we get

Fig. 8.223 Neural network for the logical “OR”

Fig. 8.224 Neural network for the logical “NOT”

x1   φ(x)
0    φ(100) ≈ 1
1    φ(−100) ≈ 0

8.7 k-Nearest Neighbor

The k-nearest neighbor (kNN) algorithm is nonparametric and one of the simplest
among the classification methods in machine learning. It is based on the assumption
that data points similar to each other are of the same class. So, the classification of
an object is simply done by majority voting within the data points in a neighbor-
hood. The theory and concept of kNN is described in the next section. Afterwards,
we turn to the application of kNN on real data with the SPSS Modeler.

8.7.1 Theory

The kNN algorithm is nonparametric, which means that model parameters don’t
have to be calculated. A kNN classifier is trained with just the set of training data
and the values of the involved features. This learning technique is also called lazy-
learning. So training of a model is pretty fast. In return, however, the prediction of
new data points can be very resource and time consuming. We refer to Lantz (2013)
and Peterson (2009) for information that goes beyond our short instruction here.

Description of the algorithm and selection of k


The kNN is one of the simplest methods among machine learning algorithms.
Classification of a data point is done by identifying the k nearest data points and
counting the frequency of each class among the k nearest neighbors. The class
occurring most often wins, and the data point is assigned to this class. In the left
graph of Fig. 8.225, this procedure is demonstrated for a 3-nearest neighbor
algorithm. The data point marked with a star has to be assigned to either the circle
or the rectangle class. The three nearest neighbors of this point are identified as one
rectangle and two circles. Hence, the star point is classified as a circle in the case of
k = 3.
The above example already points to a difficulty, however, namely that the
choice of k, i.e., the size of the neighborhood, massively affects the classification.
This is caused by the kNN algorithm’s sensitivity to the local structure of the data.
For example, in the right graph in Fig. 8.225, the same data point as before has to be
classified (the star), but this time, a 1-nearest neighbor method is used. Since the
nearest data point is a rectangle, the star point is classified as a rectangle. So, the
same data point has two different classifications, a circle or a rectangle, depending
on the choice of k.

Fig. 8.225 Visualization of the kNN method for different k, for k = 3 on the left and k = 1 on the right graph

Unfortunately, choosing the right k is not straightforward. A small k will take into account
only points within a small radius and thus give each neighbor a very high impor-
tance when classifying. In so doing, however, it makes the model prone to noisy
data and outliers. If, for example, the rectangle nearest to the star is an outlier in the
right graph of Fig. 8.225, the star will probably be misclassified, since it would
rather belong to the circle group. If k is large, then the model becomes more stable
and less affected by noise. On the other hand, more attention will then be given to
the majority class, as a huge number of data points are engaged in the decision
making. This can be a big problem for skewed data. In this case, the majority class
most often wins, suppressing the minority class, which is thus ignored in the
prediction.
Of course the choice of k depends upon the structure of the data and the number
of observations and features. In practice, k is typically set somewhere between three
and ten. A common choice for k is the square root of the number of observations.
This is just a suggestion that turned out to be a good choice of k for many problems,
but does not have to be the optimal value for k, and can even result in poor
predictions.
The usual way to identify the best k is via cross-validation. That is, various
models are trained for different values of k and validated on an independent set. The
model with the lowest error rate is then selected as the final model.

Distance metrics
A fundamental aspect of the kNN algorithms is the metric with which the distance
of data points is calculated. The most common distance metrics are the Euclidian
distance and City-block distance. Both distances are visualized in Fig. 8.226. The

Fig. 8.226 Distance between two data points in a 2-dimensional space

Table 8.8 Overview of the distance measures


Object x and object y are described by (variable_1, variable_2, ..., variable_n) = (x_1, x_2, ..., x_n) and (y_1, y_2, ..., y_n). Using the vector components x_i and y_i, the distance measures are defined as follows:

Euclidean distance: d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}

City-block metric (Manhattan metric): d(x, y) = \sum_{i=1}^{n} |x_i - y_i|

Euclidian metric describes the usual distance between data points and is the black
solid line between the two points in Fig. 8.226. The City-block distance is the sum
of the distance between the points in every dimension. This is indicated as the
dotted line in Fig. 8.226. The City-block distance can be also thought of as the way
a person has to walk in Manhattan to get from one street corner to another one. This
is why this metric is also called the Manhattan metric. Both distance formulas are
shown in Table 8.8, and we also refer to the Clustering Chap. 7 and IBM (2015a).
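Both formulas from Table 8.8 translate directly into code, for example:

import numpy as np

def euclidean(x, y):
    # d(x, y) = sqrt(sum_i (x_i - y_i)^2)
    return np.sqrt(np.sum((np.asarray(x) - np.asarray(y)) ** 2))

def city_block(x, y):
    # d(x, y) = sum_i |x_i - y_i|, the Manhattan metric
    return np.sum(np.abs(np.asarray(x) - np.asarray(y)))

p, q = [1.0, 2.0], [4.0, 6.0]          # two arbitrary points in 2-dimensional space
print(euclidean(p, q))                 # 5.0, the solid line in Fig. 8.226
print(city_block(p, q))                # 7.0, the dotted line in Fig. 8.226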

Feature normalization
A problem that occurs when calculating the distances of different features is the
scaling of those features. The values of different variables are located on differing
scales. For example, consider the following data in Fig. 8.227, of bank customers
who could be rated for credit worthiness, for example.

Fig. 8.227 Examples of bank customers

When calculating the distance between two customers, it is obvious that


“Income” dominates the distance, as the variations between customers in this
variable are on a much larger scale than the variation in the “Number of credit
cards” or the “Age” variable. The Euclidean distance between John and Frank for
example is
d(John, Frank) = \sqrt{(21 - 34)^2 + (2000 - 3500)^2 + (5 - 7)^2} \approx 1500.06.

So, the contribution of the “Income” difference to the squared distance is


 
\frac{(2000 - 3500)^2}{d(John, Frank)^2} = \frac{1500^2}{1500.06^2} \approx 0.99.

The consequence of this is that a change in the number of credit cards that John or
Frank own has nearly no effect on the distance between these two customers,
although this might have a huge influence on their credit scoring.
To prevent this problem, all features are transformed before being entered into
the kNN algorithm, so they all lie on the same scale and thus contribute equally to
the distance. This process is called normalization. One of the most common
normalizations is the min–max normalization, that is

x_{norm} = \frac{x - \min(X)}{\max(X) - \min(X)}

for a value x of the feature X. The SPSS Modeler provides an adjusted min–max
normalization, namely,

x_{norm} = \frac{2 \cdot (x - \min(X))}{\max(X) - \min(X)} - 1.

Whereas the min–max normalization maps the data between 0 and 1, the adjusted
min–max normalized data take values between −1 and 1.
Additionally, to calculate distances, all variables have to be numeric. To ensure
this, the categorical variables are transformed into numerical ones, by dummy coding
the categories of these variables with integers.
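The two normalizations can be sketched as follows. The income values 2000 and 3500 are taken from the John and Frank example above; the remaining values, and the fact that min and max are computed over the training data, are assumptions for illustration.

import numpy as np

def min_max(x, x_min, x_max):
    # maps the values of a feature to the interval [0, 1]
    return (x - x_min) / (x_max - x_min)

def adjusted_min_max(x, x_min, x_max):
    # variant used by the SPSS Modeler, mapping to the interval [-1, 1]
    return 2.0 * (x - x_min) / (x_max - x_min) - 1.0

# hypothetical income column; in practice min and max come from the training data
income = np.array([2000.0, 3500.0, 1200.0, 5000.0])
print(min_max(income, income.min(), income.max()))            # values in [0, 1]
print(adjusted_min_max(income, income.min(), income.max()))   # values in [-1, 1]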

Another way to address the feature-scaling problem, or simply to give more
important features a higher influence on the distance, is to weight the features
during the identification of the neighbors, e.g., by their importance. With this
technique, an input variable with high prediction power gets a bigger weighting.
This method is provided by the SPSS Modeler; it weights the features by their
predictor importance, see IBM (2015b).

Dimension reduction and the Curse of Dimensionality


The Curse of Dimensionality describes the phenomenon whereby in high-
dimensional space the Euclidian distance of a data point to all other points is nearly
the same. In high dimensions, multiple variables contribute to the distance and thus
equalize the tendencies of each other, so all data points can be thought of as lying on
the surface around the query point. See Beyer et al. (1999).
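This concentration effect can be made tangible with a small simulation: for uniformly distributed random points, the ratio between the smallest and the largest distance to a query point approaches 1 as the number of dimensions grows. The sketch below is purely illustrative.

import numpy as np

rng = np.random.default_rng(0)

for dim in (2, 10, 100, 1000):
    X = rng.random((500, dim))                # 500 uniformly distributed random points
    query = rng.random(dim)                   # a random query point
    d = np.linalg.norm(X - query, axis=1)     # Euclidean distances to the query point
    print(dim, round(d.min() / d.max(), 3))   # ratio tends towards 1 in high dimensions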
So when dealing with high-dimensional data (data with more than ten
dimensions), a data reduction process is usually performed prior to applying the
kNN algorithm. Therefore, often a dimensional reduction technique, such as PCA
(see Sect. 6.3), is pre-applied to the data, in order to reduce the feature dimensions
and consolidate the variables into fewer, but more meaningful, features.
Another option is to exclude “unimportant” input variables and only consider the
most relevant variables in the model. This process is also provided by the SPSS
Modeler and can be performed while training the model, whereas a PCA, for
example, has to be done before executing the kNN method.
Besides the curse of dimensionality, another reason to reduce dimensions is the
huge amount of resources, time, and memory that the calculations would otherwise
consume during the prediction process. There are some more efficient algorithms,
but in general, when predicting the class of a new data point, the distance to all
other data points has to be calculated. This can result in a huge number of computer
operations, which in turn can lead to memory problems or extremely long run times.
Fewer dimensions reduce the computational time and resources needed and thus keep
the prediction efficient. We will later see that this is a real issue for the SPSS Modeler.
For further information on the kNN method, we refer the reader to Lantz (2013)
and Peterson (2009).

8.7.2 Building the Model with SPSS Modeler

A k-nearest neighbor classifier (kNN) can be trained with the KNN node in the
SPSS Modeler. Here, we show how this node is utilized for classification of wine
data. This dataset contains chemical analysis data on three Italian wines, and the
goal is to identify the wine based on its chemical characteristics.

Description of the model


Stream name k-nearest neighbor—wine
Based on dataset Wine_data.txt (see Sect. 10.1.34)
Stream structure

Important additional remarks:


The KNN node is sensitive to variable names. In order for all the graphs to work properly, we
recommend avoiding special characters and blanks in the variable names.
Related exercises: All exercises in Sect. 8.7.5

1. First, we open the template stream “021 Template-Stream wine” (see Fig. 8.228)
and save it under a different name. The target variable is called “Wine” and can
take values 1, 2, or 3, indicating the wine type.
This template stream already contains a Type node, in which the variable
“Wine” is set as the target variable and its measurement type is set to nominal.
See Fig. 8.229. Additionally, a Partition node is already included that splits the
wine data into a training set (70 %) and a test set (30 %), so a proper validation of
the kNN is provided. See Sect. 2.7.7 for a description of the Partition node.
2. Next, we add a KNN node to the stream and connect it to the Partition node.
After opening the KNN node, the model options can be set.
3. In the Objectives tab, we will define the purpose of the KNN analysis. Besides
the usual prediction of a target field, it is also possible to identify just the nearest
neighbors, to get an insight into the data and maybe find the best data
representatives to use as training data. Here, we are interested in predicting the
type of wine and thus select the “predict target field” option. See Fig. 8.230.

Fig. 8.228 Template stream of the wine data



Fig. 8.229 Type node in the template stream

Fig. 8.230 Objectives tab in the KNN node



Fig. 8.231 Variable selection in the KNN node

Furthermore, predefined settings can be chosen in this tab, which are related to
the performance properties, speed, and accuracy. Switching between these
options changes the settings of the KNN, in order to provide the desired
performance. When changing the setting manually, the option changes automat-
ically to “custom analysis”.
4. In the Fields tab, the target and input variables can be selected. See Fig. 8.231 for
variable role definition with the wine data. If the roles of the fields are already
defined in a Type node, the KNN node will automatically identify them.
5. The model parameters are set in the Settings tab. If a predefined setting is chosen
in the Objectives tab, these parameters are preset according to the choice of objective.

Model
In the Model options, the cross-validation process can be initialized by marking the
usual “Use partitioned data” option. See top arrow in Fig. 8.232. Furthermore,
feature normalization can be enabled. See bottom arrow in Fig. 8.232. The SPSS
Modeler uses the adjusted min–max normalization method; see Sect. 8.7.1 for
details and the formula. We also recommend always normalizing the data, since
this improves the prediction in almost all cases.

Neighbors
In the Neighbors view, the value of k, i.e., the number of neighbors, is specified
along with the metric used to calculate the distances. See Fig. 8.233. The KNN node
provides an automatic selection of the “best” k. Therefore, a range for k has to be

Fig. 8.232 Model setting and normalization are enabled in the KNN node

Fig. 8.233 The neighborhood and metric are defined in the KNN node

defined, in which the node identifies the “optimal” k, based on the classification
error. The value with the lowest error rate is then chosen as k. Detection of the best
k is done either by feature selection or cross-validation. This depends on whether or
not the feature selection option is requested in the Feature Selection panel. Both
options are discussed later.

" The method used to detect the best k depends upon whether or not
the feature selection option is requested in the Feature Selection
panel.

1. If the feature selection method is enabled, then the model will be


built by identification and inclusion of the best features for each k.
2. If features selection is not in effect, a V-fold cross validation (see
subsequent section) will be performed for each k, in order to
identify the optimal number of neighbors.

" In both cases, the k with the lowest error rate will be chosen as the
size of the neighborhood. A combination of both options cannot be
selected, due to performance issues. These options are described within their
respective panels.

The number of neighbors can also be fixed to a specific value. See framed field in
Fig. 8.233. Here, we chose the automatic selection of k, with a range between 3 and
5.
As an option, the features, i.e., the input variables, can be weighted by their
importance, so more relevant features have a greater influence on the prediction, in
order to improve accuracy. See Sect. 8.7.1 for details. When this option is selected,
see bottom arrow in Fig. 8.233, the predictor importance is calculated and shown in
the model nugget. We therefore select this option in our example with the
wine data.

Feature Selection
In the Feature Selection panel, feature selection can be enabled in the model
training process. See arrow in Fig. 8.234. Thereby, features are added one by one
to the model, until one of the stopping criteria is reached. More precisely, the
feature that reduces the error rate most is included next. The stopping criteria will
either be a predefined maximum number of features or a minimum error rate
improvement. See Fig. 8.234. Feature selection has the advantage of reducing
dimensionality and just focusing on a subset of the most relevant input variables.
See Sect. 8.7.1 for further information on the “curse of dimensionality”.
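Conceptually, this forward selection can be sketched as follows: starting from an empty set, the feature whose inclusion lowers the cross-validated error rate the most is added, until the improvement falls below a threshold. The sketch uses scikit-learn for the error estimation and is only an approximation of the Modeler's internal procedure.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def forward_selection(X, y, k=4, min_improvement=0.01):
    # greedy forward selection of feature columns for a k-nearest neighbor model
    remaining = list(range(X.shape[1]))
    selected, best_error = [], 1.0
    while remaining:
        errors = {}
        for f in remaining:                       # try adding each remaining feature
            cols = selected + [f]
            accuracy = cross_val_score(KNeighborsClassifier(n_neighbors=k),
                                       X[:, cols], y, cv=5).mean()
            errors[f] = 1.0 - accuracy
        best_feature = min(errors, key=errors.get)
        if best_error - errors[best_feature] < min_improvement:   # stopping criterion
            break
        selected.append(best_feature)
        remaining.remove(best_feature)
        best_error = errors[best_feature]
    return selected                               # indices of the chosen feature columns

Applied to a numeric feature matrix X and labels y, the function returns the column indices of the selected features.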
We exclude feature selection from the model training process, as we intend to
use the cross-validation process to find the optimal k.

Fig. 8.234 Feature selection options in the KNN node

Cross-validation
This panel defines the setting of the cross-validation process for finding the best k in
a range. The method used here is V-fold cross-validation, which randomly
separates the dataset into V segments of equal size. Then V models are trained,
each on a different combination of V − 1 of these subsets, and the remaining subset
is not included in the training, but is used as a test set. The V error rates on the test
sets are then aggregated into one final error rate. V-fold cross-validation gives very
reliable information on model performance.

" The process of V-fold cross-validation entails the following:

1. The dataset is separated randomly into equally sized V subsets,


called folds.
2. These subsets are rearranged into V training and test set
combinations. Thereby, each of the subsets is treated as a test
set, while the other V − 1 subsets are the training set in one of
these new combinations. This is demonstrated in Fig. 8.235, where
the gray segments indicate the test data. The remaining white
subsets are the training data.
3. A model is trained and tested on each of these new combinations.
4. The error rates of the test set are then aggregated (averaged) into
one final error rate.

Fig. 8.235 Visualization of the V-fold cross-validation process. The gray boxes are the test sets
and the rest are subsets of the training data in each case

" Since each data record is used for test data, V-fold cross-validation
eliminates a “good luck” error rate that can occur by splitting the data
into training and test sets just once. For this reason, it gives very
reliable information on model performance.
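In code, the procedure for one candidate k could look like the sketch below, which uses scikit-learn's built-in wine dataset as a stand-in for the data at hand and V = 10 folds.

import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)        # stand-in dataset

errors = []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=1).split(X):
    model = KNeighborsClassifier(n_neighbors=4).fit(X[train_idx], y[train_idx])
    errors.append(1.0 - model.score(X[test_idx], y[test_idx]))   # error on the held-out fold

print("aggregated (mean) error rate:", round(float(np.mean(errors)), 3))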

5. The splitting method is defined in the Cross-Validation panel. Unless there is a


field that splits the data into groups, typically the way to go here is with random
partitioning into equally sized parts. See Fig. 8.236. If this option is chosen, as in
our case, the number of folds has to be specified. A 10-fold cross-validation is
very common, and we select this procedure in our case too.
6. Now, we run the KNN node and the model nugget appears. In this nugget, the
final selection of k can be reviewed. The graphs provided by the nugget,
including the final selection of k, are described in the next Sect. 8.7.3, along
with more information.
7. To evaluate the model performance, we add the usual Analysis node to the
stream and connect it to the model nugget. See Sect. 8.2.5 for information on the
Analysis node. The output of this node, with its evaluation statistics, is shown in
Fig. 8.237. We see that the KNN model has very high prediction accuracy, since
it has an error rate in the training set and test set of only about 2 %. The
coincidence tables further show that all three wines are equally well identified
from their chemical characteristics.
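The combination of adjusted min–max normalization, a k search between 3 and 5, and 10-fold cross-validation can be approximated outside the Modeler as in the following sketch. It uses scikit-learn's bundled wine dataset as a stand-in for "Wine_data.txt", so the selected k and the accuracies may differ slightly from the values reported above.

from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

pipe = Pipeline([("scale", MinMaxScaler(feature_range=(-1, 1))),     # adjusted min-max normalization
                 ("knn", KNeighborsClassifier(metric="euclidean"))])
search = GridSearchCV(pipe, {"knn__n_neighbors": [3, 4, 5]}, cv=10)  # 10-fold CV over k = 3..5
search.fit(X_train, y_train)

print("selected k   :", search.best_params_["knn__n_neighbors"])
print("test accuracy:", round(search.score(X_test, y_test), 3))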

Fig. 8.236 Cross-validation options in the KNN node

Fig. 8.237 Analysis output and evaluation statistics in the KNN wine classifier

8.7.3 The Model Nugget

The main tab in the model nugget of the KNN node comprises all the graphs and
visualizations from the model finding process. The tab is split into two frames. See
Fig. 8.239. In the left frame, a 3-dimensional scatterplot of the data is shown,
colored by the target classes. The three axes describe the most important variables
in the model. Here, these are “Alcohol”, “Proline”, and “color_intensity”. The
scatterplot is interactive and can be spun around with the mouse cursor. Further-
more, the axes variables can be changed. A detailed description would exceed the
purpose of this book, so we refer interested readers to IBM (2015b).

" The KNN node and the model nugget are pretty sensitive to variable
names with special characters or blanks, and they have problems
dealing with them. Therefore, in order for all graphs in the model
nugget to work properly and for predictions to work properly, we
recommend avoiding special characters and blanks in the variable
names. Otherwise, the graphic will not be displayed or parts will be
missing. Furthermore, predictions with the model nugget might fail,
and produce errors, as shown in Fig. 8.238.

The second graph in Fig. 8.239 shows error rates for each k considered in the
neighborhood selection process. To do this, the “k Selection” option is selected in
the drop-down menu at the bottom of the panel. See arrow in Fig. 8.239. In our
example of the wine data, the model with k = 4 has the lowest error rate, indicated
by the minimum of the curve in the second graphic, and is thus picked as the final
model.

Fig. 8.238 Error message when a variable with a special character, such as (, is present in the
model

Fig. 8.239 Main graphics view with a 3-dimensional scatterplot (left side) and the error rates of
the different variants of k (right side)

Fig. 8.240 Predictor importance graph in the KNN model nugget

When selecting the “Predictor Importance” option in the bottom right drop-
down menu, the predictor importance graph is shown, as we already know from the
other models. See Fig. 8.240 and Sect. 8.3.4, for more information on this chart. In

the 4-nearest neighbor model on the wine data, all the variables are almost equally
relevant with “Alcohol”, “Proline”, and “color_intensity” being the top three most
important in the model.

8.7.4 Dimensional Reduction with PCA for Data Preprocessing

As mentioned in Sect. 8.7.1, dimensional reduction is a common preprocessing step


prior to applying kNN to data. Here, we present this procedure on the multidimen-
sional leukemia gene expression data, which comprises 851 different variables. Our
goal is to build a nearest neighbor classifier that can differentiate between acute
(AML, ALL) and chronic (CML, CLL) leukemia patients, based on their gene
expression. Here, we use a PCA to reduce dimensionality, which is the standard
way to go in these situations.

Description of the model


Stream name knn_pca_gene_expression_acute_chronic_leukemia
Based on dataset gene_expression_leukemia.csv (see Sect 10.1.14)
Stream structure

Important additional remarks:


Dimensional reduction is a common way to reduce multidimensional data, saving memory space,
increasing computational speed, and counteracting the “curse of dimensionality” problem.

The stream is split into two parts, the data import and target setting part, and
the dimensional reduction and model building part. As the focus of this section is
dimension reduction with the PCA, we keep the target setting part short, as it is
unimportant for the purpose of this section.

Data importing and target setting

1. We open the template stream “018 Template-Stream gene_expression_


leukemia” and save it under a different name. See Fig. 8.114 in a prior exercise

Fig. 8.241 Reclassification node, which joins the acute and chronic leukemia types to groups

solution. The template already consists of the usual Type and Partition nodes,
which define the target variable and split the data into 70 % training data and
30 % test data.
2. We then add a second Type node, a Reclassify node, and a Select node to the
stream and insert them between the Source node and the Partition node. In the
Type node, we click the “Read Values” button, so that the nodes that follow
know the variable measurements and value types. In the Reclassify node, we
select the “Leukemia” variable as our reclassification field, enable value replace-
ment at the top, and click the “Get” button. Then, we merge AML and ALL into
one group named “acute” and proceed in the same manner with the chronic
leukemia types CML and CLL by relabeling both “chronic”. See Fig. 8.241.
3. As we are only interested in the treatment of leukemia patients, we add a Select
node to exclude the healthy people. See Fig. 8.242 for the formula in the Select
node.

Fig. 8.242 Healthy people are excluded with the Select node

Dimensional reduction with the PCA node


Here, we present in brief how to set up a PCA as a preprocessing step. For details on
PCA, and a complete description of the node, we refer to Sect. 6.3.

4. To perform dimension reduction with PCA, we add a PCA node to the stream
and connect it to the last Type node. In the PCA node, we then select all the
genomic positions as inputs and the partition indicator variable as the partition
field. See Fig. 8.243.
In the Model tab, we mark the box that enables the use of partitioned data, so
the PCA is only performed on the training data. Furthermore, we select the
principal components method, so that the node uses PCA. See Fig. 8.244.
In the Expert tab, we choose the predefined “Simple” set-up (Fig. 8.245). This
ensures a calculation of the first five factors. If more factors are needed, these
can be customized under the “Expert” options. Please see Sect. 6.3 for more
information on these options.
5. After running the PCA node, it calculates the first five factors, and we observe in
the PCA model nugget that these five factors explain 41.6 % of the data
variation. See Fig. 8.246.
6. An advantage of dimensional reduction with PCA is the consolidation of mutually
correlated variables into much more meaningful new variables. With these, we can get an
impression of the position of the groups (acute, chronic) and the potential for
classification. For that purpose, we add a Plot node to draw a scatterplot of the
first two factors and a Graphboard node to draw a 3D scatterplot of the first three

Fig. 8.243 Variables are selected in the PCA node

Fig. 8.244 Selection of the PCA method

factors. In both cases, the data points are colored and shaped according to their
group (acute, chronic). These plots are shown in Figs. 8.247 and 8.248, respec-
tively, and we immediately see that the two groups are separated, especially in
the 3D graph, which indicates that separation might be feasible.

Fig. 8.245 Expert tab and set-up of the PCA factors that are determined by PCA

Fig. 8.246 Variance explained by the first five factors calculated by PCA, on the gene expression
dataset

kNN on the reduced data

7. Now, we are ready to build a kNN on the five established factors with PCA.
Before we do, we have to add another Type node to the stream, so that the KNN
node is able to read the measurement and value types of the factor variables.

Fig. 8.247 Scatterplot of the first two factors

8. We finally add a KNN node to the stream and connect it to the last Type node.
In the KNN node, we select the 5 factors calculated by the PCA as input
variables and the “Leukemia” field as the target variable. See Fig. 8.249.
In the Settings tab, we select the Euclidian metric and set k to 3. See
Fig. 8.250. This ensures a 3-nearest neighbor model.
9. Now we run the stream and the model nugget appears. We connect an Analysis
node to it and run it to get the evaluation statistics of the model. See Sect. 8.2.5
for a description on the options in the Analysis node. See Fig. 8.251, for the
output from the Analysis node. As can be seen, the model has high prediction
accuracy for both the training data and the test data. Furthermore, the Gini
values are extremely high. So, we conclude that the model fits the reduced data
well and is able to distinguish between acute and chronic leukemia from factor
variables alone, which explain just 41.6 % of the variance, with a minimal error
rate.

Fig. 8.248 3-D scatterplot of the first three factors

Fig. 8.249 The factors are selected as input variables in the KNN node

Fig. 8.250 The metric and neighborhood size k are selected

The error rate, and with it the Gini values, can probably be improved further by
adding more factors, calculated by PCA, as input variables to the model.
10. To sum up, the PCA reduced the multidimensional data (851 features) into a
smaller dataset with only five variables. This smaller dataset, with 170 times fewer
data entries, contains almost the same information as the multidimensional
dataset, and the kNN model has very high prediction power on the reduced
dataset. Furthermore, the development and prediction speed have massively
increased for the model trained on the smaller dataset. Training and evaluation
of a kNN model on the original and multidimensional data takes minutes,
whereas the same process requires only seconds on the dimensionally reduced
data (five factors), without suffering in prediction power. The smaller dataset
obviously needs less memory too, which is another argument for dimension
reduction.
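The same preprocessing idea, projecting onto five principal components and then fitting a 3-nearest neighbor classifier, can be expressed as a short scikit-learn pipeline. The file name and the column layout (one "Leukemia" label column, all other columns numeric gene expression values) are assumptions that may need adjusting.

import pandas as pd
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

df = pd.read_csv("gene_expression_leukemia.csv")      # assumed file name and layout
y = df["Leukemia"].replace({"AML": "acute", "ALL": "acute",
                            "CML": "chronic", "CLL": "chronic"})
X = df.drop(columns=["Leukemia"])

mask = y.isin(["acute", "chronic"])                   # keep only the leukemia patients
X_train, X_test, y_train, y_test = train_test_split(X[mask], y[mask],
                                                    test_size=0.3, random_state=1)

model = Pipeline([("pca", PCA(n_components=5)),               # reduce to five factors
                  ("knn", KNeighborsClassifier(n_neighbors=3))])
model.fit(X_train, y_train)
print("test accuracy:", round(model.score(X_test, y_test), 3))

Wrapping the PCA inside the pipeline ensures that the components are estimated on the training portion only, mirroring the "use partitioned data" option of the PCA node.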

Fig. 8.251 Evaluation statistics of the kNN on dimensionally reduced data

8.7.5 Exercises

Exercise 1 Identification of nearest neighbors and prediction of credit rating


Consider the following credit rating dataset in Fig. 8.252, consisting of nine bank
customers, with information on their age, income, and number of credit cards held.
Figure 8.253 comprises a list of four bank customers that have received no credit
rating yet.
In this exercise, the credit rating for these four new customers should be established
with the k-nearest neighbor method.

1. Normalize the features of the training set and the new data, with the min–max
normalization method.
2. Use the normalized data to calculate the Euclidian distance between the
customers John, Frank, Penny, and Howard, and each of the customers in the
training set.
3. Determine the 3-nearest neighbors of the four new customers and assign a credit
rating to each of them.

Fig. 8.252 The training dataset of bank customers that already have a credit rating

Fig. 8.253 List of four new bank customers that need to be rated

4. Repeat steps two and three, with the City-block metric as your distance measure.
Has the prediction changed?

Exercise 2 Feature selection within the KNN node


In Sect. 8.7.2, a kNN classifier was trained on the wine data with the KNN node.
There, a cross-validation approach was used to find the best value for k, which
turned out to be 4. In this exercise, the 4-nearest neighbor classifier is revisited,
which is able to identify the wine based on its chemical characteristics. This time,
however, the feature selection method should be used. More precisely,

1. Build a 4-nearest neighbor model as in Sect. 8.7.2 on the wine data, but enable
the feature selection method, to include only a subset of the most relevant
variables in the model.
2. Inspect the model nugget. Which variables are included in the model?
3. What is the accuracy of the model for the training data and test data?

Exercise 3 Dimensional reduction and the influence of k for imbalanced data


Consider the dataset “gene_expression_leukemia”, which contains genomic sample
data from several patients suffering from one of four different types of leukemia
(ALL, AML, CLL, CML) and data from a small healthy control group (see Sect.
10.1.14). The gene expression data are measured at 851 locations in the human
genome and correspond to known cancer genes. The goal is to build a kNN
classifier that is able to separate the healthy patients from the ill patients. Perform
a PCA on the data in order to reduce dimensionality.

1. Import the dataset “gene_expression_leukemia.csv” and merge the data from all
leukemia patients (AML, ALL, CML, CLL) into a single “Cancer” group. What
is the frequency of both the leukemia and the healthy data records in the dataset?
Is the dataset skewed?
2. Perform a PCA on the gene expression data of the training set.
3. Build a kNN model on the factors calculated in the above step. Use the automatic
k selection method with a range of 3–5. What is the performance of this model?
Interpret the results.
4. Build three more kNN models for k equal to 10, 3, and 1, respectively. Compare
these three models and the one from the previous step with each other. What are
the evaluation statistics? Draw the ROC. Which is the best performing model
from these four and why?

Hint: Use the stream of Sect. 8.7.4 as a starting point.

8.7.6 Solutions

Exercise 1 Identification of nearest neighbors and prediction of credit rating

1. Figure 8.254 shows the min–max normalized input data, i.e., age, income,
number of credit cards. The values 0 and 1 indicate the minimum and maximum
values. As can be seen, all variables are now located in the same range, and thus
the effect of the large values and differences in Income is reduced.

Fig. 8.254 Normalized values of the input data

Fig. 8.255 Euclidean distance between John, Frank, Penny, and Howard and all the other bank
customers

Fig. 8.256 Final credit rating for John, Frank, Penny, and Howard

2. The calculated Euclidian distance between customers John, Frank, Penny, and
Howard and each of the other customers is shown in Fig. 8.255. As an example,
we show here how the distance between John and Sandy is calculated:
d(John, Sandy) = \sqrt{(0 - 0.29)^2 + (0.17 - 0.44)^2 + (0.67 - 0.67)^2} \approx 0.4.

3. The 3-nearest neighbors for each new customer (John, Frank, Penny, Howard)
are highlighted in Fig. 8.255. For example, John’s 3-nearest neighbors are
Andrea, Ted, and Walter, with distances of 0.3, 0.31, and 0.27.
Counting the credit ratings of these three neighbors for each of the four new
customers, we get, by majority voting, the following credit ratings, as listed in
Fig. 8.256.
4. Figure 8.257 displays the City-block distances between John, Frank, Penny, and
Howard and each of the training samples. The colored values symbolize the
3-nearest neighbors in this case. As an example, here we calculate the distance
between John and Sandy:

d(John, Sandy) = |0 - 0.29| + |0.17 - 0.44| + |0.67 - 0.67| = 0.56.



Fig. 8.257 City-block distance between John, Frank, Penny, and Howard and all the other bank
customers

We see that the nearest neighbors haven’t changed compared with those found
using the Euclidian distance. Hence, the credit ratings are the same as before and can be
viewed in Fig. 8.256.
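The distances in Figs. 8.255 and 8.257 can be double-checked with a few lines of code. The normalized values for John and Sandy below are the ones used in the worked calculations above; the remaining rows of Fig. 8.254 would be handled in exactly the same way.

import numpy as np

# normalized (age, income, number of credit cards) from the worked example above
john = np.array([0.00, 0.17, 0.67])
sandy = np.array([0.29, 0.44, 0.67])

euclidean = np.sqrt(np.sum((john - sandy) ** 2))    # approx. 0.40
city_block = np.sum(np.abs(john - sandy))           # 0.56
print(round(float(euclidean), 2), round(float(city_block), 2))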

Exercise 2 Feature selection within the KNN node


Name of the solution streams wine_knn_feature_selection
Theory discussed in section Sect. 8.2
Sect. 8.7.1
Sect. 8.7.2

Figure 8.258 shows the final stream for this exercise.

1. Since this is the same stream construct as in Sect. 8.7.2, we open the stream
“k-nearest neighbor—wine” and save it under a different name. We recall that
this stream imports the wine data, defines the target variables and measurement,
splits the data into 70 % training data and 30 % test data, and builds a kNN model
that automatically selects the best k; in this case k = 4. This is the reason why we
perform a 4-nearest neighbor search in this exercise.
To enable a feature selection process for a 4-nearest neighbor model, we open
the KNN node and go to the Neighbors panel in the Settings tab. There, we fix
k to be 4, by changing the automatic selection option to “specify fixed k” and
writing “4” in the corresponding field. See Fig. 8.259.
In the Feature Selection panel, we now activate the feature selection process
by checking the box at the top. See top arrow in Fig. 8.260. This enables all the
options in this panel, and we can choose the stopping criteria. We can choose to
stop when a maximum number of features is included in the model or when the
error rate cannot be lowered by more than a minimum value. We choose the
second stopping criterion and define the minimum improvement for each inserted
feature as 0.01. See Fig. 8.260.

Fig. 8.258 Stream of the wine identification exercise

Fig. 8.259 k is defined in the KNN node

Activation of the feature selection process disables the cross-validation procedure,
and all options in the Cross-Validation panel are shown as grayed out.
This finishes the option setting and we run the stream.
2. We open the model nugget and observe first that the scatterplot on the right has
changed, compared with Fig. 8.239, due to the different axes. See Fig. 8.261. In
the right drop-down menu, we select the “Predictor Selection” option, and the
visualization of the feature selection process is shown. The curve in the graph
shows the error rate for each further feature added to the model. The
corresponding name of the added variable is printed next to the point in the

Fig. 8.260 Activation of the features selection procedure in the KNN node

Fig. 8.261 Model view in the model nugget and visualization of the feature selection process

graph. See Fig. 8.261. In this case, seven features are included in the model. These
are, in order of their inclusion in the model, “Flavanoids”, “Color_intensity”,
“Alcohol”, “Magnesium”, “Proline”, “Proanthocyanins”, and “Ash”.

Fig. 8.262 Predictor importance in the case of feature selection on the wine data

Fig. 8.263 Evaluation statistics in the 4-nearest neighbor model on the wine data

Since not all variables are considered in the model, the predictor importance has changed, as we already recognized from the axis variables in the 3-D scatterplot. The importance of the variables in this case is shown in Fig. 8.262.
3. The output of the Analysis node is displayed in Fig. 8.263. As can be seen, the accuracy has not changed much, since the performance of the model in Sect. 8.7.2 was already extremely precise. The accuracy has thus not suffered from the feature selection, while the dimensionality of the data has been halved, which reduces memory requirements and speeds up prediction.
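As a side note to the forward feature selection configured in step 1, the following Python sketch illustrates the same stop-when-the-improvement-falls-below-0.01 idea outside the Modeler. It uses scikit-learn and its built-in wine data as a stand-in for the book's wine dataset; the manual selection loop and all names are our own illustration, not the Modeler's internal algorithm.

from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=4))

selected, best_acc = [], 0.0
remaining = list(range(X.shape[1]))
while remaining:
    # cross-validated accuracy when each remaining feature is added in turn
    scores = {f: cross_val_score(model, X[:, selected + [f]], y, cv=5).mean()
              for f in remaining}
    f_best = max(scores, key=scores.get)
    if scores[f_best] - best_acc < 0.01:   # minimum improvement criterion
        break
    selected.append(f_best)
    remaining.remove(f_best)
    best_acc = scores[f_best]

print(selected, round(best_acc, 3))  # indices of the kept features and their accuracy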

Exercise 3 Dimensional reduction and the influence of k on imbalanced data


Name of the solution streams gene_expression_leukemia_knn_imbalanced_data
Theory discussed in section Sect. 8.2
Sect. 8.7.1
Sect. 8.7.4

Figure 8.264 shows the final stream for this exercise.

1. Since we are dealing here with the same data as in Sect. 8.7.4, where a kNN model
was also built on PCA processed data, we use this stream as a starting point. We
therefore open the stream “knn_pca_gene_expression_acute_chronic_leukemia”
and save it under a different name. See Fig. 8.265 for the stream in Sect. 8.7.4.
This stream has already imported the proper dataset, partitioned it into a
training set and a test set, and reduced the data with PCA. See Sect. 8.7.4 for
details.
Starting from here, we first need to change the reclassification of target values.
For that purpose, we delete the Select node and connect the Distribution node to
the Reclassify node. The latter is then opened and the value of the “Leukemia”
variable representing all leukemia types (AML, ALL, CML, CLL) is set to
“Cancer”. See Fig. 8.266.

Fig. 8.264 Stream of the kNN leukemia and healthy patient identification exercise

Fig. 8.265 The stream “knn_pca_gene_expression_acute_chronic_leukemia”



Fig. 8.266 Reclassification of leukemia types into a general cancer group

Fig. 8.267 Distribution plot of the reclassified “Leukemia” variable

Next we run the Distribution node, to inspect the frequency of the new values
of the “Leukemia” variable. As can be seen in Fig. 8.267, the dataset is highly
skewed, with the healthy data being the minority class, comprising only 5.73 %
of the whole data.

Fig. 8.268 Explained variation in the first five components established by PCA

Fig. 8.269 Scatterplot matrix for all five factors calculated by PCA

2. The PCA node is already included in the stream, but since the healthy patients
are now added to the data, unlike in the stream from Fig. 8.265, we have to run
the PCA node again to renew the component computations. In the PCA model
nugget, we can then view the variance in the data, explained by the first five
components. This is 41.719, as shown in Fig. 8.268.
To get a feeling for the PCA components, we add a Graphboard node to the
PCA model nugget and plot a scatterplot matrix for all five factors. See
Fig. 8.269. We notice that in all dimensions, the healthy patient data seems to

Fig. 8.270 Automatic selection of k is activated in the KNN node

be located in a cluster. This is a slight indicator that classification will be successful; a clearer statement is not possible, however.
3. Now, we open the KNN node and switch the fixed definition of k to an automatic
calculation, with a range between 3 and 5. See Fig. 8.270.
We now run the KNN node to update the model nugget, in which the selection
process of k can be viewed. Figure 8.271 shows the graph in the nugget related to
this process, and we observe that four neighbors result in the lowest error in the training data.
Afterwards, we run the connected Analysis node, to view the accuracy and
Gini values. These statistics are presented in Fig. 8.272, and we observe that
accuracy is extremely high, as is the Gini, and at first glance this suggests a good
model. When inspecting the coincidence matrix, however, we see that only 20 %
of the healthy patients are classified correctly. See arrow in Fig. 8.272. This issue
results from the imbalance in the data. The minority class "Healthy" is ignored in
favor of the majority class “Cancer”.
4. To build a 10-nearest neighbor model, we copy the existing KNN node and paste
it onto the Modeler canvas. Then, we connect it to the last Type node. In the node
itself, we fix the number of neighbors to 10, as shown in Fig. 8.273.

Fig. 8.271 Visualization of the k selection process

Fig. 8.272 Evaluation statistics in the KNN model, with automatic neighborhood selection

Fig. 8.273 A 10-nearest neighbor model is defined in the KNN node

To build the 3-nearest and 1-nearest neighbor models, we proceed in the same
manner. Afterwards, we run all three models (10-, 3-, and 1-nearest neighbor).
To provide a clear overview and make comparison of the models easier, we
connect all KNN model nuggets into a series. See Fig. 8.107 as an example of
how this is performed. Next, we add a Filter node to the end of the stream, to
rename the predictor fields of the models with a proper name.
We then add an Analysis node and Evaluation node to the stream and connect
them to the Filter node. See Sect. 8.2.5 and Fig. 8.28 for options within these two
nodes. Figure 8.274 shows the accuracy and Gini, as calculated by the Analysis
node. We notice that the models improve according to Gini and accuracy
measures, by reducing the value of k. The smaller the neighborhood, the better
the model prediction. Thus, the statistics indicate that a 1-nearest neighbor
classifier is the best one for this data. This is evidenced by the perfect classification of this model, i.e., an accuracy of 100 %. The analysis further exposes that the output of the automatic k selection method is not always the best solution; k = 3 may be more appropriate than k = 4 in this situation.

Fig. 8.274 Accuracy and Gini for the four kNN models
Improvement of the models is also visualized by the ROCs in Fig. 8.275.
The coincidence matrices can also confirm improvement in the prediction
performance of models with a smaller k. A smaller k brings more attention to the very nearest data points, and therefore to the minority class. Hence, misclassification of "Healthy" patients is reduced when k is lowered. See Fig. 8.276.

Fig. 8.275 ROC of the four kNN models

Fig. 8.276 Coincidence matrices on the four kNN models



8.8 Decision Trees

We now turn to the rule-based classification methods, of which Decision Trees
(DT) are the most famous group of algorithms. In comparison with the classifica-
tion methods discussed so far, rule-based classifiers have a completely different
approach; they inspect the raw data for rules and structures that are common in a
target class. These identified rules then become the basis for decision making.
This approach is closer to real-life decision making. Consider a situation where
we plan to play tennis. The decision on whether to play tennis or not depends on the
weather forecast. We will go to the tennis court "if it is not raining and the temperature is over 15 °C". If "rainy weather or a temperature of less than 15 °C" is forecast, we decide to stay at home. These rules are shown in Fig. 8.277 and we
recall Sect. 8.2.1 for a similar example.
A decision tree algorithm now constructs and represents this logical structure in
the form of a tree, see Fig. 8.278. The concept of finding appropriate rules within the data is discussed in the following section. Then the building of a DT model with the SPSS Modeler is presented.
Rule-based classifiers have the advantage of being easily understandable and
interpretable without statistical knowledge. For this reason, DT, and rule-based
classifiers in general, are widely used in fields where the decision has to be transparent, e.g., in credit scoring, where the scoring result has to be explained to
the customer.

8.8.1 Theory

Decision trees (DT) belong to a group of rule-based classifiers. The main charac-
teristic of a DT lies in how it orders the rules in a tree structure. Looking again
at the tennis example, where a decision is made based upon the weather, the logical
rule on which the decision will be based is shown in Fig. 8.277. In a DT, these rules
are represented in a tree, as can be seen in Fig. 8.278.

Fig. 8.277 Rules for making the decision to play tennis



Fig. 8.278 Decision tree for the playing tennis example

A DT is like a flow chart. A data record is classified by going through the tree,
starting at the root node, and deciding in each tree node which of the conditions the
particular variable satisfies, and then following the branch of this condition. This
procedure is repeated until a leaf is reached that brings the combination of decisions
made thus far to a final classification. For example, let us consider an outlook “No
rain”, with a temperature of 13 C. We want to make our decision to play tennis
using the DT in Fig. 8.278. We start at the root node and see that our outlook is “No
rain” and we follow the left branch to the next node. This looks at the temperature
variable, is it larger or smaller than 15 C? As the temperature in our case is 13 C,
we follow the branch to the right and arrive at a “No” leaf. Hence, in our example,
we decide to stay home as the temperature is too cold.
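The walk through the tree can also be written down directly as nested rules. The following minimal Python sketch mirrors the tree in Fig. 8.278; the function name is our own.

def play_tennis(outlook, temperature_celsius):
    # Root node: the outlook
    if outlook == "Rain":
        return "No"
    # Second level: the temperature, split at 15 degrees Celsius
    return "Yes" if temperature_celsius > 15 else "No"

print(play_tennis("No rain", 13))  # "No", as in the example above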
As can be seen in this short example, our decision to cancel our plans to play
tennis was easily determined by cold weather. This shows the great advantage of
DT or rule-based classifiers in general. The cause of a certain decision can be
reconstructed and is interpretable without statistical knowledge. This is the reason for the popularity of DTs in a variety of fields and problems where the classification has to be interpretable and justifiable to other people.

The tree building process


Decision trees are built with a recursive partitioning mechanism. At each node, the
data is split into two or more distinct groups by the values of a feature, resulting in subsets,
which are then split again into smaller subsets, and so on. This process of splitting a
problem into smaller and smaller subproblems of the same type is also known as the
divide and conquer approach. See Cormen (2009), for a detailed description of the
divide and conquer paradigm.
The recursive process of building a DT is described below:

1. The DT consists of a single root node.


2. In the root node, the variable and the partitioning of its values with the "best" partitioning ability are determined, that is, a partitioning into groups of similar target classes. There are several methods to select this "best" split, and they are described in the section hereafter.
3. The data is split into distinct subgroups, based on the previously chosen splitting
criteria.
4. These new data subsets are the basis for the nodes in the first level of the DT.
5. Each of these new nodes again identifies the “best” variable for partitioning the
subset and splits it according to splitting criteria. The new subsets are then the
basis for the splitting nodes in the second level of the DT.
6. This node partitioning process is repeated until a stopping criterion is fulfilled for
that particular node.
Stopping criteria are:

– All (or almost all) training data in the node are of the same target category.
– The data can no longer be partitioned by the variables.
– The tree has reached its predefined maximum level.
– The node has reached the minimum occupation size.

7. A node that fulfills one of the stopping criteria is called a leaf and indicates the
final classification. The target category that occurs most often in the subset of the
leaf is taken as the predictor class. This means each data sample that ends up in
this leaf when passed through the DT is classified as the majority target category
of this leaf.

In Fig. 8.279, node partitioning is demonstrated with the “play tennis” DT


example, see Fig. 8.278, where the circles represent days we decide to play tennis,

Fig. 8.279 The DT nodes are partitioned for the “play tennis” example

and the rectangles indicate days we stay at home. In the first node, the variable
“Outlook” is chosen and the data are divided into the subsets “Non rain” and
“Rain”. See the first graph in Fig. 8.279. If the outlook is “Rain”, we choose not
to play tennis, while in the case of “No rain”, we can make another split on the
values of the temperature. See right graph in Fig. 8.279. Recalling Fig. 8.278, this
split is done at 15 C, whereby on a hotter day, we decide to play a tennis match.
As can be seen in the above example and Fig. 8.279, a DT is only able to perform
axis-parallel splits in each node. This divides the data space into several rectangles,
where each of them is assigned to a target category.
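The recursive divide and conquer procedure described above can be summarized in a compact sketch. The following Python code is a simplified illustration of the general idea, with binary axis-parallel splits on numeric inputs and a majority vote in the leaves; it is not the exact algorithm of any particular Modeler node.

from collections import Counter

def gini(labels):
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def build_tree(X, y, depth=0, max_depth=3, min_size=2):
    # Stopping criteria: pure node, maximum depth, or minimum occupation size
    if len(set(y)) == 1 or depth >= max_depth or len(y) < min_size:
        return Counter(y).most_common(1)[0][0]       # leaf: majority class
    best = None
    for j in range(len(X[0])):                       # search all features ...
        for t in sorted(set(row[j] for row in X)):   # ... and all split values
            left = [i for i, row in enumerate(X) if row[j] <= t]
            right = [i for i, row in enumerate(X) if row[j] > t]
            if not left or not right:
                continue
            # weighted impurity of the two child nodes; minimizing it is
            # equivalent to maximizing the impurity gain of the split
            score = (len(left) * gini([y[i] for i in left])
                     + len(right) * gini([y[i] for i in right])) / len(y)
            if best is None or score < best[0]:
                best = (score, j, t, left, right)
    if best is None:
        return Counter(y).most_common(1)[0][0]
    _, j, t, left, right = best
    return {"feature": j, "threshold": t,
            "left": build_tree([X[i] for i in left], [y[i] for i in left],
                               depth + 1, max_depth, min_size),
            "right": build_tree([X[i] for i in right], [y[i] for i in right],
                                depth + 1, max_depth, min_size)}

print(build_tree([[2.0], [3.0], [6.0], [7.0]], ["No", "No", "Yes", "Yes"]))
# {'feature': 0, 'threshold': 3.0, 'left': 'No', 'right': 'Yes'}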

Pruning
If the stopping criteria are set too narrowly, the finished DT is very small and tends
to underfit the training data. In other words, the tree is unable to describe particular
structures in the data, as the criteria are too vague. Underfitting can occur, for
example, if the minimum occupation size of each node is set too large, or the
maximum level size is set too small. On the other hand, if the stopping criteria are
too broad, the DT can continue splitting the training data until each data point is
perfectly classified, and then the tree will be overfitted. The constructed DT is
typically extremely large and complex.
Several pruning methods have therefore been developed to solve this dilemma, originally proposed by Breiman et al. (1984). The concept is basically the following:
instead of stopping the tree growth at some point, e.g., at a maximum tree level, the
tree is over-constructed, allowing the tree to overfit the data. Then, nodes and
“sub-branches” are removed from the overgrown tree, which do not contribute to
the general accuracy.
Growing the tree to its full size and then cutting it back is usually more effective
than stopping at a certain point, since determination of the optimal tree depth is
difficult without growing it first. As this method allows us to better identify
important structures in the data, this pruning approach generally improves the
generalization and prediction performance.
For more information on the pruning process and further pruning methods, see
Esposito et al. (1997).
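One concrete realization of this grow-then-prune idea is the cost-complexity pruning of Breiman et al. (1984), which scikit-learn exposes for its CART-style trees via the ccp_alpha parameter. The following sketch is only an illustration outside the Modeler; the dataset and the step width are arbitrary choices.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Grow the full tree first and inspect the pruning path ...
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)

# ... then refit with increasing pruning strength; stronger pruning gives smaller trees
for alpha in path.ccp_alphas[::10]:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_tr, y_tr)
    print(round(float(alpha), 4), tree.get_n_leaves(), round(tree.score(X_te, y_te), 3))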

Decision tree algorithms and separation methods


There are numerous implementations of decision trees that mainly differ in the
splitting mechanism, that is, the method of finding the optimal partition and the
number of new nodes that can be grown from a single node. We outline below the
most well-known decision tree algorithms (also provided by the SPSS Modeler) and
their splitting algorithms:

CART (Classification and regression tree)


The CART is a binary splitting tree. That means each non-leaf node has exactly two outgoing branches. Furthermore, it provides the pruning process as
described above, to prevent over- and underfitting.

The split in each node is selected with the Gini coefficient, sometimes also
called the Gini index. The Gini coefficient is an impurity measure and describes the
dispersion of a split. The Gini coefficient should not be confused with the Gini
index that measures the performance of a classifier, see Sect. 8.2.5. The Gini
coefficient at node σ is defined as

\[ \text{Gini}(\sigma) = 1 - \sum_j \left( \frac{N(\sigma, j)}{N(\sigma)} \right)^2, \]

where j is a category of the target variable, N(σ, j) the number of data in node σ with
category j, and N(σ) the total number of data in node σ. In other words,
$N(\sigma, j)/N(\sigma)$

is the relative frequency of category j among the data in node σ. The Gini coefficient
reaches its maximum, when the data in the node are equally distributed across the
categories. If, on the other hand, all data belong to the same category in the node,
then the Gini equals 0, its minimum value. A split is now measured with the Gini
Gain

\[ \text{GiniGain}(\sigma, s) = \text{Gini}(\sigma) - \frac{N(\sigma_L)}{N(\sigma)}\,\text{Gini}(\sigma_L) - \frac{N(\sigma_R)}{N(\sigma)}\,\text{Gini}(\sigma_R), \]

where σ_L and σ_R are the two child nodes of σ, and s is the splitting criterion. The binary split that maximizes the Gini Gain is chosen. In this case, the child
nodes deliver maximal purity, with regard to category distribution, and so it is best
to partition the data along these categories.
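The two formulas translate directly into code. The following small Python helpers (our own, not part of any Modeler node) compute the Gini coefficient of a node from its class counts N(σ, j) and the Gini Gain of a binary split.

def gini(counts):
    # counts: number of records per target category in the node, N(sigma, j)
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def gini_gain(parent_counts, left_counts, right_counts):
    n = sum(parent_counts)
    return (gini(parent_counts)
            - sum(left_counts) / n * gini(left_counts)
            - sum(right_counts) / n * gini(right_counts))

# Example: a node with 10 "Good" and 10 "Bad" records, split into (8, 2) and (2, 8)
print(round(gini([10, 10]), 3))                       # 0.5, the maximum for two classes
print(round(gini_gain([10, 10], [8, 2], [2, 8]), 3))  # 0.18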
When there are a large number of target categories, the Gini coefficient can
encounter problems. The CART therefore provides another partitioning selection
measure, called twoing. Briefly, twoing divides the data into two groups of equal
size, instead of trying to split the data so that the subgroups are as pure as possible.
We omit a detailed description of this measure within the CART here and refer to
Breiman et al. (1984) and IBM (2015a).

C5.0
The C5.0 algorithm was developed by Ross Quinlan and is an evolution of his own
C4.5 algorithm Quinlan (1993), which itself originated from the ID3 decision tree
Quinlan (1986). Its ability to split is not strictly binary, but allows for partitioning of
a data segment into more than two subgroups. As with the CART, the C5.0 tree
provides a pruning method after the tree has grown, and the splitting rules of the
nodes are selected via an impurity measure. The measure used is the Information Gain, based on the entropy. The entropy quantifies the homogeneity of categories in a node and is given by

\[ \text{Entropy}(\sigma) = - \sum_j \frac{N(\sigma, j)}{N(\sigma)} \log_2\!\left( \frac{N(\sigma, j)}{N(\sigma)} \right), \]

where σ symbolizes the current node, j is a category, N(σ, j) the number of data
records of category j in node σ, and N(σ) the total number of data in node σ (see the
description of CART). If all categories are equally distributed in a node segment, the entropy is maximal, and if all data are of the same class, then it takes its minimum value of 0. The Information Gain is now defined as

\[ \text{InformationGain}(\sigma, s) = \text{Entropy}(\sigma) - \sum_{\sigma_1} \frac{N(\sigma_1)}{N(\sigma)}\,\text{Entropy}(\sigma_1), \]

with σ_1 being one of the child nodes of σ resulting from the split. The Information Gain thus measures the change in purity in the data segments of the nodes. The splitting criterion s that maximizes the Information Gain is selected for this particular node.
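Analogously to the Gini example, the entropy and the Information Gain can be computed with a few lines. Since the C5.0 allows more than two child nodes, the helper below (again our own) takes a list of child class counts.

from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

def information_gain(parent_counts, child_counts_list):
    n = sum(parent_counts)
    return entropy(parent_counts) - sum(
        sum(child) / n * entropy(child) for child in child_counts_list)

# The same split as in the Gini example: (10, 10) divided into (8, 2) and (2, 8)
print(round(entropy([10, 10]), 3))                             # 1.0, the maximum for two classes
print(round(information_gain([10, 10], [[8, 2], [2, 8]]), 3))  # about 0.278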
In 2008, the C4.5 was named one of the top ten algorithms in data mining by Wu et al. (2008). More information on the C4.5 and C5.0 decision trees can be found in
Quinlan (1993) and Lantz (2013).

CHAID (CHi-squared Automatic Interaction Detector)


The CHAID is one of the oldest decision tree algorithms Kass (1980) and allows
splitting into more than two subgroups. Pruning is not provided with this algorithm,
however. The CHAID uses the Chi-square independence test, Kanji (2009), to
decide on the splitting rule for each node. As the Chi-square test is only applicable
to categorical data, all numerical input variables have to be grouped into categories.
The algorithm does this automatically. For each input variable, classes are merged into super-classes based on their statistical similarity and kept separate if they are statistically dissimilar. These super-class variables are then compared with the target variable for dependency, i.e., similarity, with the Chi-square independence test. The one with the highest significance is then selected as the splitting criterion for the node.
For more information, see Kass (1980) and IBM (2015a).
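The statistical core of the CHAID split selection, the Chi-square independence test between a (grouped) input variable and the target, can be illustrated with a few lines of Python; the contingency table below is made up purely for illustration.

from scipy.stats import chi2_contingency

# Made-up contingency table: rows = categories of a candidate input variable,
# columns = target categories (e.g., "Good" and "Bad" credit rating)
table = [[30, 10],
         [20, 40]]

chi2, p_value, dof, expected = chi2_contingency(table)
print(round(chi2, 2), round(p_value, 4))
# Among all candidate (super-)classes, CHAID favors the split with the most
# significant association with the target, i.e., the smallest p-value.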

QUEST (Quick, Unbiased, Efficient Statistical Tree)


The QUEST algorithm, Loh and Shih (1997), only constructs binary trees and has
been specially designed to reduce the execution time of large CART trees. It was
furthermore developed to reduce the bias toward input variables that allow more splits, i.e., numeric variables or categorical variables with many classes. For each
split, an ANOVA F-test (numerical) or Chi-square test (categorical variable), see
Kanji (2009), is performed, to determine the association between each input
variable and the target. QUEST further provides an automatic pruning method, to
avoid overfitting and improve the determination of an optimal tree depth. For more
detailed information, see Loh and Shih (1997) and IBM (2015a).

Table 8.9 Decision tree algorithms with the corresponding node in the Modeler

Decision tree | Method | SPSS Modeler node
CART | Gini coefficient; Twoing criteria | C&R Tree
C5.0 | Information Gain (Entropy) | C5.0
CHAID | Chi-squared statistics | CHAID
QUEST | Significance statistics | QUEST

In Table 8.9, the decision tree algorithms, their splitting methods, and the
corresponding nodes in the Modeler are displayed.
For additional information and more detailed descriptions of these decision
trees, we refer the interested reader to Mingers (1989); Rokach and Maimon.

Boosting (AdaBoost)
In Sect. 5.3.6, the technique of ensemble modeling and particularly Boosting was
discussed. Since the concept of Boosting originated with decision trees and is still
mostly used for classification models, we hereby explain the technique once again,
but in more detail.
Boosting was developed to increase prediction accuracy by building a sequence
of models. The key idea behind this method is to give misclassified data records a
higher weight and correctly classified records a lower weight, to point the focus of the next component onto the incorrectly predicted records. With this approach, the classification problem is shifted to the data records that would usually be lost in the analysis, and the records that are easy to handle and correctly classified anyway are neglected. All component models in the ensemble are built on the entire dataset, and
the weighted models are aggregated into a final prediction.
This process is demonstrated in Fig. 8.280. In the first step, the unweighted data
(circles and rectangles) are divided into two groups. With this separation, two
rectangles are located on the wrong side of the decision boundary, hence, they
are misclassified. See the circled rectangles in the top right graph. These two data
rectangles are now more heavily weighted, and all other points are down-weighted.
This is symbolized by the size of the points in the bottom left graph in Fig. 8.280.
Now, the division is repeated with the weighted data. This results in another
decision boundary, and thus, another model. See the bottom right graph in
Fig. 8.280. These two separations are now combined through aggregation, which
results in perfect classification of all points. See Fig. 8.281.
The most common and best-known boosting algorithm is the AdaBoost or
adaptive boosting Zhou (2012). We refer to Sect. 5.3.6 and Lantz (2013), Tuffery
(2011), Wu et al. (2008), Zhou (2012) and James et al. (2013) for further informa-
tion on boosting and other ensemble methods, such as bagging.
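As a brief illustration outside the Modeler, the following Python sketch uses scikit-learn's AdaBoostClassifier, which implements this reweighting scheme on top of shallow decision trees (by default, decision stumps). The dataset and parameter values are arbitrary illustration choices.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# A single weak tree versus a boosted sequence of such trees
single = DecisionTreeClassifier(max_depth=1, random_state=0)
boosted = AdaBoostClassifier(n_estimators=50, random_state=0)

print(round(cross_val_score(single, X, y, cv=5).mean(), 3))
print(round(cross_val_score(boosted, X, y, cv=5).mean(), 3))
# The boosted ensemble usually outperforms the single weak tree, because each new
# component concentrates on the records its predecessors misclassified.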

Fig. 8.280 Illustration of the boosting method

Fig. 8.281 The concept of boosting. Aggregation of the models and the final model

One additional remark


Nodes depend upon each other during the splitting process, as the data from which the next partition rule is selected are created by splitting the previous node. This is one reason why splitting one node into multiple segments can improve the prediction performance of a tree.

8.8.2 Building a Decision Tree with the C5.0 Node

There are four nodes that can be used to build one of the above described decision
trees in the SPSS Modeler. See Table 8.9. As their selection options are relatively
similar, we only present the C5.0 node in this section and the CHAID node in the subsequent section, and refer to the exercises for usage of the remaining nodes.
We show how a C5.0 tree is trained, based on credit rating data “tree_credit”, which
contains demographic and historic loan data from bank customers and their related
credit rating (“good” or “bad”).

Description of the model


Stream name C5.0_credit_rating
Based on dataset Tree_credit.sav (see Sect. 10.1.33)
Stream structure

Important additional remarks:


The target variable should be categorical (i.e., nominal or ordinal) for the C5.0 to work properly.
Related exercises: 3, 4

1. First, we open the stream “000 Template-Stream tree_credit”, which imports the
tree_credit data and already has a Type node attached to it. See Fig. 8.282. We
save the stream under a different name.
2. To set up a validation of the tree, we add a Partition node to the stream and place
it between the source and the Type node. Then, we open the node and define
70 % of the data as training data and 30 % as test data. See Sect. 2.7.7 for a
description of the Partition node.
3. Now we add a C5.0 node to the stream and connect it to the Type node.
4. In the Fields tab of the C5.0 node, the target and input fields have to be selected.
As in the other model nodes, we can choose between a manual setting and
automatic identification. The latter is only applicable if the roles of the variables
have already been defined in a previous Type node. Here, we select the “Credit
rating” variable as the target and “Partition” as the partition defining field. All
the other variables are chosen as the input. See Fig. 8.283.
5. In the Model tab, the parameters of the tree building process are set. We first
enable the “Use partitioned data” option, in order to build the tree on the training
data and validate it with the test data. Besides this common option, the C5.0 tree
offers two other output types. In addition to the decision tree, one can choose

Fig. 8.282 Template stream for the tree_credit data

Fig. 8.283 Variables are set in the C5.0 node

“Rule set” as the output. In this case, the set of rules is derived from the tree and
contains a simplified version of the most important information of the tree. Rule
sets are handled a bit differently, as now, multiple rules, or no rule at all, can
apply to a particular data record. The final classification is thus done by voting,

Fig. 8.284 The options for the C5.0 tree training process are set

see IBM (2015b). Here, we select a decision tree as our output; see arrow in
Fig. 8.284. For “Rule set” building, see exercise 4 in Sect. 8.8.4.
For the training process, three additional methods, which may improve the
quality of the tree, can be selected. See Fig. 8.284. These are:

– Group symbolics. This method attempts to combine categories of a variable that show a similar pattern with respect to the target variable.
– Boosting to improve the model's accuracy. See Sect. 8.8.1 for a description of this method.
– Cross-Validation, more precisely V-fold cross validation, which is useful if
the data size is small. It also generates a more robust tree. See Sect. 8.7.2, for
the concept of V-fold cross-validation.

In the bottom area of the Model tab, the parameters for the pruning process are
specified. See Fig. 8.284. We can choose between a “Simple” mode, with many
predefined parameters, and an “Expert” mode, in which experienced users are
able to define the pruning settings in more detail. We select the “Simple” mode
and declare accuracy as more important than generality. In this case, the pruning
process will focus on improving the model's accuracy, whereas if the "Generality" option were selected, trees that are less susceptible to overfitting would be favored.

Fig. 8.285 Misclassification cost options in the C5.0 node
If the proportion of noisy records in the training data is known, this informa-
tion can be included in the model building process and will be considered while
fitting the tree. For further explanation of the “Simple” options and the “Expert”
options, we refer to IBM (2015b).
6. In the Cost tab, one can specify the cost when a data record is misclassified. See
Fig. 8.285. With some problems, misclassifications are more costly than others.
For example, in the case of a pregnancy test, a diagnosis of non-pregnancy of a
pregnant woman might be more costly than the other way around, since in this
case the woman might return to drinking alcohol or smoking. To incorporate this
into the model training, the error costs can be specified in the misclassification
cost matrix of the Costs tab. By default, all misclassification costs are set to 1. To
change particular values, enable the “Use misclassification costs” option and
enter new values into the matrix below. Here, we stay with the default misclassification settings (a short numeric sketch of how such a cost matrix weights the error types follows after this walkthrough).
7. In the Analyze tab of the C5.0 node, we further select the "Calculate predictor importance" option.
8. Now we run the stream and the model nugget appears. Views of the model
nugget are presented in the Sect. 8.8.3.
9. We add the usual Analysis node to the stream, to calculate the accuracy and Gini
for the training set and test set. See Sect. 8.2.6 for a detailed description of the
Analysis node options. The output of the Analysis node is displayed in
Fig. 8.286. Both the training set and testing set have an accuracy of about
80 % and a Gini of 0.704 and 0.687, respectively. This shows quite good
prediction performance, and the tree doesn’t seem to be overfitting the training
data.
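As a numeric illustration of the misclassification costs discussed in step 6, the following Python sketch (with made-up counts, not output of the Modeler) shows how a custom cost matrix changes the evaluation of one and the same coincidence matrix.

import numpy as np

# Made-up coincidence (confusion) matrix: rows = actual, columns = predicted
confusion = np.array([[700,  50],    # actual "Good": 700 correct, 50 misclassified
                      [120, 130]])   # actual "Bad":  120 misclassified, 130 correct

default_costs = np.array([[0, 1],    # by default, every misclassification costs 1
                          [1, 0]])
custom_costs = np.array([[0, 1],     # here, missing a "Bad" customer costs five ...
                         [5, 0]])    # ... times as much as rejecting a "Good" one

print(int((confusion * default_costs).sum()))  # 170
print(int((confusion * custom_costs).sum()))   # 650
# With the custom matrix, a model that avoids the expensive error type is preferred,
# even if its plain accuracy is slightly lower.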

Fig. 8.286 Accuracy and Gini in the C5.0 decision tree on the “tree_credit” data

8.8.3 The Model Nugget

The model nuggets of all four decision trees C5.0, CHAID, C&R Tree, and QUEST
are exactly the same with the same views, graphs, and options. Here, we present the
model nuggets and graphs of these four trees, by inspecting the model nugget of the
C5.0 model built in the previous Sect. 8.8.2 on the credit rating data.

Model tab—Tree structure rules and predictor importance


The model tab is split into two panels. See Fig. 8.287. If the predictor importance
calculation is selected in the model settings, the usual graph that visualizes these
statistics is displayed in the right panel. In this graph, we can also get a quick
overview of all the variables that are used to build the tree and define the splitting
criteria into at least one node. In our case with the C5.0 tree and the credit rating
data, node splitting involves the three variables “Income level”, “Number of credit
cards”, and “Age”. The variables “Education” and “Car loans”, which were also
selected as input variables (see Fig. 8.283), are not considered in any node
partitioning and, thus, not included in the final model.
The left panel in Fig. 8.287 shows the rules that define the decision tree. These
are displayed in a tree structure, where each rule represents one element of the tree

Fig. 8.287 Model view in the model nugget. The tree structure is shown on the left and the
predictor importance in the right panel

Fig. 8.288 Part of the rule tree of the C5.0 tree, which classifies customers into credit score
groups

that the properties of a data record have to fulfill, in order to belong in this branch.
Figure 8.288 shows part of the tree structure in the left panel. Behind each rule, the
mode of the branch is displayed, that is, the majority target category of the branch
belonging to this element. If the element ends in a leaf, the final classification is
further added, symbolized by an arrow. In Fig. 8.288, for example, the last two rules
define the splitting of a node using the age of a customer. Both elements end in a
leaf, where one assigns the customers a "Bad" credit rating (Age ≤ 29.28) and the
other a “Good” credit rating (Age > 29.28).

Detailed view of the decision tree and its nodes


In the View tab, the decision tree can be viewed in more detail with the occupation
statistics of each node. Figure 8.289 shows the C5.0 tree of the credit rating, which
was built in Sect. 8.8.2. We see that the tree consists of eight nodes, with the root
node on the left, five leaves, four splits, and five levels. The view of the tree can be
easily modified in the top panel. There, the orientation of the tree can be specified,
as well as the visualization mode of the tree nodes.
Figure 8.290 shows the three visualization node types, which can be chosen in
the top options (see left arrow in Fig. 8.289). The default node type is the first one in
the figure, which displays the occupation statistics from the training data of this

Fig. 8.289 Visualization of the tree in the Viewer tab of a tree model nugget

Fig. 8.290 Node types in a tree view

node. First, the absolute and relative frequencies of the whole training data belong-
ing to this node are shown at the bottom. In our example, 49.94 % of the training
data are in node 3, which is 839 data records in total. Furthermore, the distribution
of the target variable of the training data in this node is displayed. More precisely,
for each category in the target, the absolute and relative frequencies of the data in
the node are shown. In this example, 70.56 % of the training data from node 3 are
from customers with a “Bad” credit rating. That is 592 in total. See first graph in
Fig. 8.290. Besides statistics visualization, the node can also be chosen to show only
a bar-graph, visualizing the distribution of the data’s target variable in the node. See
the second graph in Fig. 8.290. As a third option, the nodes in the tree can combine
both the statistics and the bar-graph visualization and present them in the tree view.
See last graph in Fig. 8.290.
The choice of visualization just depends on the analyst’s preference, the situa-
tion, and the audience for the results.

8.8.4 Building a Decision Tree with the CHAID Node

Here, we present how to build a classifier with the CHAID node. To do this, we reuse the credit rating data from Sect. 8.8.2, where a C5.0 decision tree was built.
The options and structure of the CHAID node are similar to the C&R Tree and
QUEST node and are introduced as representative of these three nodes. The C&R
Tree and QUEST node are described in more detail in the exercises later.

Description of the model


Stream name CHAID_credit_rating
Based on dataset Tree_credit.sav (see Sect. 10.1.33)
Stream structure

Important additional remarks:


The CHAID node can also be used for regression. In order to build a classification model, the
target variable has to be categorical. The nodes C&R Tree and QUEST comprise very similar
options to the CHAID node. See exercises.
Related exercises: 1, 2, 3

1. As in Sect. 8.8.2, we first open the stream “000 Template-Stream tree_credit”


and save it under a different name. See Fig. 8.282. The stream imports the
tree_credit data and defines the roles and measurements of the variables with a
Type node.
2. We insert a Partition node between the Source and the Type node, to set up a
validation process in the tree, and define 70 % of the data as training and 30 % as
test. See Sect. 2.7.7 for a description of the Partition node.
3. Now we add a CHAID node to the stream and connect it to the Type node.
4. In the Fields tab of the CHAID node, the target and input fields can be selected.
As in the other model nodes, we can choose between a manual setting and
automatic identification. The latter is only applicable if the roles of the variables
are already defined in a previous Type node. Here, we select the “Credit rating”
variable as our target. The partitioning defined in the Partition node is identified
automatically, and all other variables are chosen as the input. See Fig. 8.291.
The tree building options are specified in the Building Options tab. In the
Objective view, the general tree building parameters can be set. Here, we can
choose between building a new model and continuing to train an existing one.
The latter is useful if new data are available and a model has to be updated with

Fig. 8.291 Variables are set in the CHAID node

the new data; it will save us from building a completely new one. Furthermore,
we can select to build a single decision tree or to use an ensemble model to train
several trees and combine them into a final prediction. The CHAID node
provides a boosting and bagging procedure for creating a tree. For a description
of ensemble models, boosting, and bagging, we refer to Sects. 5.3.6 and 8.8.1.
Here, we select to build a single tree. See Fig. 8.292.
If an ensemble algorithm is selected as the training method, the finer options
of these models can be specified in the Ensemble view. There, the number of
models in the ensemble, as well as the aggregation methods, can be selected. As
this is the same for other classification nodes that provide ensemble techniques,
we refer to Fig. 8.185 and Table 8.7 in the chapter on neural networks (Sect. 8.6)
for further details.
In the Basics view, the tree growing algorithm is defined. See Fig. 8.293. By
default, this is the CHAID algorithm, but the CHAID node provides another
algorithm, the “Exhaustive CHAID” algorithm, which is a modification of the
basic CHAID. We refer to Biggs et al. (1991) and IBM (2015a). Furthermore, in
this panel the maximum tree depth is defined. See bottom arrow in Fig. 8.293.
The default tree depth is five, which means the final tree has at most five levels beneath the root node. The maximum height of the decision tree can be changed by clicking the Custom option and inserting the favored height.

Fig. 8.292 Selection of the general model process, i.e., a single or ensemble tree

Fig. 8.293 The tree growing algorithm is selected in the CHAID node

Fig. 8.294 Stopping rules in the CHAID node

In the Stopping Rules panel, we define the criteria for when a node stops
splitting and is defined as a leaf. See Fig. 8.294. These criteria are based on the
number of records in the current node or the child nodes, respectively. If these are too low, either in absolute numbers or relative to the total data size, the tree will stop branching at that particular node. Here, we stay with the default settings, which pertain to the share of the whole dataset located in the current node (2 %) and in a child node (1 %). The latter stopping rule comes into effect if one of the child nodes would contain less than 1 % of the whole dataset.
In the Cost panel, the misclassification cost can be adjusted. This is useful if
some classification errors are more costly than others. See Fig. 8.295 for the Cost
panel and Sect. 8.8.2 for a description of the misclassification problem and how
to change the default misclassification cost.
In the Advanced view (see Fig. 8.296), the tree building process can be fine-
tuned, mainly the parameters of the algorithm that selects the best splitting
criterion. As these options should only be applied by experienced users, and the explanation of each option would be far beyond the scope of this book, we omit a detailed description here and refer the interested reader to IBM (2015b).
5. We are now finished with the parameter settings of the model building process and can run the stream. The model nugget appears.
6. Before inspecting the model nugget, we add an Analysis node to the nugget and
run it to view the accuracy and Gini of the model. See Sect. 8.2.6 for a
description of the Analysis node. The output of the Analysis node is displayed
in Fig. 8.297. We notice that the accuracy of the training set and testing set are
both slightly over 80 %, and the Gini is 0.787 and 0.77, respectively. That
indicates quite precise classification with no overfitting.

Fig. 8.295 Misclassification cost option in the CHAID node

Fig. 8.296 Advanced option in the tree building process of the CHAID node

Fig. 8.297 Accuracy and Gini of the CHAID decision tree on the “tree_credit” data

Fig. 8.298 Model view of the CHAID model nugget on the credit rating data

The model nugget and trained decision tree


When inspecting the model nugget, we see in the Model tab on the right side that the
variables included in the tree are “Income level”, “Number of credit cards”, and “Age”.
On the left side, the rule set is displayed, and we notice that several nodes are split into
more than two branches, which is a property of the CHAID tree (Fig. 8.298).

The complete, large tree structure can be viewed in the View tab of the CHAID
model nugget but it is too large to show here.

8.8.5 Exercises

Exercise 1 The C&R Tree node and variable generation


The dataset “DRUG1n.sav” contains data of a drug treatment study (see Sect. 10.1.9).
The patients in this study all suffer from the same illness but respond differently
to medications A, B, C, X, and Y. The task in this exercise is to train a CART
with the C&R Tree node in the SPSS Modeler that automatically detects the most
effective drug for a patient. To do so, follow the steps listed below.

1. Import the “DRUG1n.sav” dataset and divide it into a training (70 %) and testing
(30 %) set.
2. Add a C&R Tree node to the stream. Inspect the node and compare the options
with the CHAID node as described in Sect. 8.8.4. What settings are different?
Try to find out what their purpose is, e.g., by looking them up in IBM (2015b).
3. Build a single CART with the Gini impurity measure as the splitting selection
method. What are the error rates for the training and testing set?
4. Try to figure out why the accuracy of the above tree differs so much between the
training and testing set. To do so, inspect the K and Na variables, e.g., with a
scatterplot. What can be done to improve the model's precision?
5. Create a new variable that describes the ratio of the variables Na and K and
discard Na and K from the new stream. Why does this new variable improve the
prediction properties of the tree?
6. Add another C&R Tree node to the stream and build a model that includes the
ratio variable of Na and K as input variable instead of Na and K separately. Has
the accuracy changed?

Exercise 2: The QUEST node—Boosting & Imbalanced data


The dataset “chess_endgame_data.txt” contains 28056 chess endgame positions
with a white king, white rook, and black king left on the board (see Sect. 10.1.6). In
this exercise, the QUEST node is introduced while training a decision tree on the
chess data which predicts the outcome of such endgames, i.e., whether white wins
or they draw. The variable “Result for White” thereby describes the number of
moves white needs to win or to achieve a draw.

1. Import the chess endgame data with a proper Source node and reclassify the "Result for White" variable into a binary field that indicates whether white wins the game or not. What is the proportion of "draws" in the dataset? Split the data
into 70 % training and 30 % test data. See Exercise 1 in Sect. 8.6.4.
2. Add a QUEST node to the stream and inspect its options. Compare them to the
CHAID node, as described in Sect. 8.8.4, and the C&R Tree node introduced in
Exercise 1. What settings are different? Try to find out what their purpose is, e.g.,
by looking them up in IBM (2015b).

3. Train a decision tree with the QUEST node. Determine the accuracy and Gini
values for the training and test set.
4. Build a second decision tree with another QUEST node. Use the boosting
method to train the model. What are accuracy and Gini values for these models?
Compare them with the first QUEST model.

Exercise 3: Detection of diabetes—comparison of decision tree nodes


The “diabetes_data_reduced” dataset contains blood measurements and bodily
characteristics of Indian females (see Sect. 10.1.8). Build three decision trees
with the C&R Tree, CHAID, and C5.0 node to detect diabetes (variable
“class_variable”) of a patient based on its blood measure values and body
constitution.

1. Import the diabetes data and set up a cross-validation stream with training,
testing, and validation set.
2. Build three decision trees with the C&R Tree, CHAID, and C5.0 node.
3. Compare the structures of the three trees to each other. Are they similar to each
other or do they branch completely differently?
4. Calculate appropriate evaluation measures and graphs to measure their predic-
tion performance.
5. Combine these three decision trees into an ensemble model by using the Ensemble node. What are the accuracy and Gini of this ensemble model?

Exercise 4: Rule set and cross-validation with C5.0


The dataset “adult_income_data” contains the census data of 32,561 citizens (see
Sect. 10.1.1). The variable “income” describes whether a census participant has an
income of more or less than 50,000 US dollars. Your task is to build a C5.0 rule set
and decision tree to predict the income of a citizen.

1. Import the data and divide it into 70 % training and 30 % test data.
2. Use two C5.0 nodes to build, respectively, a decision tree and a rule set that
predict the income of a citizen based on the variables collected in the census
study.
3. Compare both approaches to each other by calculating the accuracy and Gini of
both models. Then draw the ROC for each model.

8.8.6 Solutions

Exercise 1: The C&R Tree node and variable generation


Name of the solution streams druglearn modified C&R Tree
Theory discussed in section Sect. 8.8.1
Sect. 8.8.4

The final stream for this exercise should look like the stream in Fig. 8.299.

Fig. 8.299 Drug study classification exercise stream

Fig. 8.300 Enabling and pruning setting in the C&R Tree node

1. First we import the dataset with the Statistics File node and connect it with a
Partition node, followed by a Type node. We open the Partition node and define
70 % of the data as training and 30 % as test set. See Sect. 2.7.7 for a detailed
description of the Partition node.
2. Next, we add a C&R Tree node to the stream and connect it to the Type node.
We open it to inspect the properties and options. We see that compared to the
CHAID node, there are three main differences in the node Build Options.
In the Basics panel, the splitting method selection is missing, but the pruning
process with its parameter can be set. See Fig. 8.300. Remember that pruning

Fig. 8.301 Prior setting in the C&R Tree node

cuts back the fully grown tree in order to counter the problem of overfitting. To manipulate the pruning algorithm, the maximum risk change can be defined.
Furthermore, the maximum number of surrogates can be changed. Surrogates are
used for handling missing values. For each split, the input field that is most
similar to the splitting is set as its surrogate. If a data record with missing values
has to be classified, the surrogate’s value is used as input for the missing value.
See IBM (2015b) for more details. Increasing the number of surrogates will make the model more flexible, but will increase memory usage.
In the Cost & Priors panel, priors can be set for each target category, apart
from the misclassification costs. See Fig. 8.301. The prior probability describes
the relative frequency of the target categories of the total population from which
the training data is drawn. It gives prior information about the target variable and
can be changed if, e.g., the distribution of the target variable in the training data
does not equal the distribution of the population. There are three possible
settings:

– Base on training: This is the default setting and is the distribution of the target
variable in the training data.

Fig. 8.302 Setting of the impurity measure in the C&R Tree node

– Equal for all classes: All target categories appear equally in the population
and therefore have the same prior probability.
– Custom: Specification of customized probabilities for each target category.
The probabilities have to add up to 1. Otherwise, an error will occur.

See IBM (2015b) for additional information on the prior settings.


The impurity measure can be defined in the Advanced panel, see Fig. 8.302.
By default, this is the Gini coefficient. Further measures are Twoing (see Sect. 8.8.1) and Ordered, which adds to the Twoing method the constraint that only adjacent classes of an ordinal target can be grouped together. If the target variable is nominal, the standard Twoing method is used. See IBM (2015b) for
further details. Moreover, the minimum change in impurity which has to be
fulfilled in order for a node to split can be set.
Additionally, the proportion of data that is used to prevent overfitting can be
defined. 30 % is the default. See Fig. 8.302.
We define Gini as the impurity measure and run the stream. The model nugget
appears. On inspecting the model nugget, we notice in the Model tab that the K
variable is the most important one in the tree, with K being the field of the first
split. See Fig. 8.303. The tree itself consists of 4 levels.
3. We add an Analysis node to the model nugget to calculate the accuracy of the
model. See Sect. 8.2.6 for a description of the Analysis node. The output of the
Analysis node is shown in Fig. 8.304. We see that the accuracy in the training
data is pretty high at about 94 %, while in the testing set the accuracy is only
78 %. This indicates overfitting of the model on the training data.

Fig. 8.303 Tree structure and variable importance of the CART for the drug exercise

Fig. 8.304 Accuracy of the first CART for the drug exercise

Fig. 8.305 Scatterplot of the variables Na and K

4. To figure out the reason for this discrepancy in the accuracy of the training and
testing set, we add a Plot node to the Type node to draw a scatterplot of the
variables Na and K. We further want to color the data of each drug (target class)
differently. See Sect. 4.2 for how to create a scatterplot with the Plot node. The
scatterplot is shown in Fig. 8.305. We see that the drug Y can be separated from
all other drugs by a line. However, this line is not parallel to the axis. As we
learned in Sect. 8.8.1, a decision tree is only able to divide the data space parallel
to the axes. This can be a reason for the model to overfit, as it separates the training data perfectly, but the horizontal and vertical decision boundaries it finds are not sufficient to classify the testing data. A ratio of the Na and K values might therefore be more appropriate and can lead to more precise predictions.
5. We add a Derive node to the stream and connect it to the Type node to calculate
the ratio variable of Na and K. See Fig. 8.306 for the formula entered in the
Derive node that calculates the new ratio variable.
To descriptively validate the separation power of the new variable
“Na_K_ratio”, we add a Graphboard node to the stream and connect it with
the Derive node. In this node, we select the new variable and choose the Dot plot.
Furthermore, we select the data of the drugs to be plotted in different colors. See
Sect. 4.2 for a description of the Graphboard node. In the Dot Plot in Fig. 8.307,

Fig. 8.306 Derive node to calculate the ratio of Na and K

Fig. 8.307 Dot Plot of the ratio variable of Na and K



Fig. 8.308 Filter node to discard Na and K variables from the following stream

we can see that the drug Y can be perfectly separated from all other drugs by this
new ratio variable.
We now add a Filter node to the stream and discard the Na and K variable
from the following stream. See Fig. 8.308. Then we add another Type node to
the stream.
6. Then we add another C&R Tree node to the stream and connect it with the last Type node. We choose the same parameter settings as in the first C&R Tree node, except for the input variables. Here, the new variable "Na_K_Ratio" is included instead of the Na and K variables. See Fig. 8.309.
We run the stream, and the model nugget appears. In the nugget, we can
immediately see that the new variable “Na_K_Ratio” is the most important
predictor, and it is chosen as the field of the root spilt. See Fig. 8.310. In addition,
we notice that the tree is slimmer compared to the first build tree (see Fig. 8.303),
meaning that the tree has fewer branches. See the Viewer tab for a visualization
of the build tree.
Next we add the standard Analysis node to the model nugget (see Sect. 8.2.6
for details on the Analysis node) and run it. In Fig. 8.311, the accuracy of the
decision tree with the “Na_K_Ratio” variable included is shown. In this model,
the accuracy of the testing data has noticeably improved from about 79 %
(see Fig. 8.304) to more than 98 %. The new variable thus has a higher separation ability, which improves the prediction power and robustness of the decision tree.
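The effect observed in steps 4-6 can be reproduced on synthetic data outside the Modeler. The following Python sketch (our own construction with scikit-learn, not the drug dataset) compares a tree trained on two raw features, whose class boundary is a diagonal line through the origin, with a tree trained on their ratio.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.uniform(1, 10, size=(500, 2))       # two positive features, like Na and K
y = (X[:, 0] / X[:, 1] > 1.5).astype(int)   # diagonal class boundary

tree = DecisionTreeClassifier(max_depth=3, random_state=0)
print(round(cross_val_score(tree, X, y, cv=5).mean(), 3))              # raw features
print(round(cross_val_score(tree, (X[:, 0] / X[:, 1]).reshape(-1, 1),  # ratio feature
                            y, cv=5).mean(), 3))
# The axis-parallel splits can only approximate the diagonal boundary, whereas a
# single split on the ratio separates the classes almost perfectly.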

Fig. 8.309 Variable settings in the second C&R Tree node, with the ratio variable Na_K_Ratio included as input

Fig. 8.310 Variable importance of the CART for the drug exercise with the new ratio variable
Na_K_Ratio

Fig. 8.311 Accuracy of the CART for the drug data with the new ratio variable Na_K_Ratio

Exercise 2: The QUEST node—Boosting & Imbalanced data


Name of the solution streams QUEST_chess_endgame_prediction
Theory discussed in section Sect. 8.8.1
Sect. 8.8.4

The final stream for this exercise looks like Fig. 8.312.

1. The first part of the exercise is analogous to the first two parts of Exercise 1 in Sect. 8.6.4, so a detailed description is omitted here, and we refer to that solution for the import, reclassification, and partitioning of the data. See Figs. 8.198,
8.199, and 8.200. We recall that a chess game ends in a tie about 10 % of the
time, while in 90 % of the games, white wins. See Fig. 8.199.
2. We now add a QUEST node to the stream and connect it with the Partition node.
Comparing the model node and its options with the CHAID and C&R Tree node,
we notice that the QUEST node is very similar to both of them, especially to the
C&R Tree node. The only difference to the C&R Tree node options appears in
the Advanced panel in the Build Options tab. See Fig. 8.313. In addition to the

Fig. 8.312 Stream of the chess endgame prediction with the QUEST node exercise

Fig. 8.313 Settings of the splitting parameter in the QUEST node

overfit prevention setting, the significance level for the splitting selection
method can be set. The default here is 0.05. See the solution of Exercise 1 in
Sects. 8.8.5 and 8.8.4 for a description of the remaining options and the differ-
ence to the other tree nodes. We further refer to the manual IBM (2015b) for additional information on the QUEST node.
3. We run the QUEST node and the model nugget appears. In the model nugget, we
see no splits are executed, and the decision tree consists of only one node, the
root node. Thus, all chess games are classified as having the same ending, “white
wins”. This is also verified by the output of an Analysis node (see Sect. 8.2.6),
which we add to the model nugget. See Fig. 8.314 for the performance statistics
of the QUEST model. We see in the coincidence matrix that all data records
(chess games) are classified as white winning games. With this simple

Fig. 8.314 Performance statistics of the QUEST model for the chess endgame data

classification, the accuracy is high, at about 90 %, exactly the proportion of white winning games in the dataset. However, the Gini value in the training and test set is 0, which indicates that the prediction is no better than guessing.
4. We add another QUEST node to the stream, connect it with the Partition node,
and open it. We choose the same settings as in the node before, but change from a standard model to a boosting model, which can be set in the Building Options
tab. See Fig. 8.315.
We then run the stream and a second model nugget appears. We rearrange the
two model nuggets so that they are aligned consecutively. See Fig. 8.107 for an
illustration of this procedure. We add another Analysis node to the stream,
connect it with the last model nugget, and set the options to calculate the
coincidence matrix and Gini. After running the Analysis node, the calculated
model evaluation statistics are shown in the pop-up window which then appears.
Figure 8.316 shows the accuracy and coincidence matrix of the boosting model
and the Gini of the decision tree with and without boosting. We immediately see
that the accuracy has improved to over 99 %. The coincidence matrix reveals
that both draws and white winning games are detected by the boosted tree. The

Fig. 8.315 Definition of a boosted decision tree modeling in the QUEST node

model’s high quality of prediction is also demonstrated by the extremely high


Gini values (over 0.99).
In conclusion, boosting massively improves the prediction power of the decision tree
built by the QUEST algorithm. Compared to a tree without boosting, the
minority class, a "draw" in this example, receives more attention during
model building. This results in a much better model fit and thus better
prediction ability.
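To make the connection between a high accuracy and a Gini of 0 more tangible, the following minimal R sketch (R is used here as in Chap. 9; the data are simulated and are not the chess dataset) evaluates a degenerate classifier that always predicts the majority class "white wins". Its accuracy equals the majority share of about 90 %, whereas its Gini, computed as 2*AUC - 1 with the AUC obtained from the rank-sum formula, is exactly 0.

# Simulated labels: about 90 % "white wins" (1), 10 % "draw" (0).
set.seed(42)
actual <- rbinom(1000, 1, 0.9)
score  <- rep(1, 1000)             # degenerate model: identical score for every game

auc <- function(score, label) {    # AUC via the rank-sum (Mann-Whitney) formula
  r <- rank(score)                 # ties receive average ranks
  n1 <- sum(label == 1); n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

accuracy <- mean((score >= 0.5) == actual)   # about 0.9
gini     <- 2 * auc(score, actual) - 1       # exactly 0
c(accuracy = accuracy, gini = gini)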

Exercise 3: Detection of diabetes—comparison of decision tree nodes


Name of the solution stream: diabetes_prediction_comparison_of_tree_nodes
Theory discussed in: Sect. 8.8.1, Sect. 8.8.2, Sect. 8.8.4

The final stream for this exercise looks like the stream in Fig. 8.317.

1. We start by opening the template stream "015 Template-Stream_Diabetes" (see
Fig. 8.318), which imports the diabetes data and has a Type node already
attached to the Source node in which the roles of the variables are defined.
The variable “class_variable” is set as target variable. In the Type node, we
change the measurement class of the variable "class_variable" to Flag in order to
calculate the Gini evaluation measure later.

Fig. 8.316 Evaluation statistics of the boosted QUEST model

Fig. 8.317 Stream of the diabetes detection with decision trees exercise

Next, we add a Partition node to the stream and place it between the Source
and Type node. In the Partition node, we specify 60 % of the data as training and
20 % each as test and validation set. See Sect. 2.7.7 for the description of
the Partition node.

Fig. 8.318 Template stream of the diabetes dataset

Fig. 8.319 Fitted CART for diabetes prediction
2. To build the three decision tree models, we add a C&R Tree, a CHAID, and a
C5.0 node to the stream and connect each of them with the Type node. The input
and target variables are automatically detected by the three tree nodes, since the
variable roles were already set in the Type node.
Here, we use the default settings provided by the SPSS Modeler to build the
models, and simply run the stream so that the decision trees are constructed. We
align the three model nuggets so the data is passed through the three trees
successively to predict whether a patient suffers from diabetes. See Fig. 8.107
for an example of the rearrangement of the model nuggets in a line.
3. Inspecting the three model nuggets and the structures of the constructed decision
trees, we first see that the CART, built by the C&R Tree, consists of only a single
split and is therefore a very simple tree. See Fig. 8.319. This division is done
based on the variable "glucose_concentration". So the Gini partition
criterion is not able to find additional splits of the data that would improve the
accuracy of the model.
The variable “glucose_concentration” is also the first variable according to
which the data are split in the CHAID and C5.0. The complete structures of these
trees (CHAID and C5.0) are shown in Figs. 8.320 and 8.321. These two trees are
more complex with more branches and sub-trees. When comparing the CHAID
and C5.0 trees to each other, we see that the structure and node splits of the
CHAID can also be found in the C5.0. The C5.0, however, contains further splits
and thus divides the data into finer partitions. So where the CHAID has 3 levels
beneath the root, the C5.0 has 5 tree levels.
4. At the last model nugget, we add an Analysis and an Evaluation node to
calculate and visualize prediction performance of the three models. See
Sect. 8.2.6 and Fig. 8.28 for a description of these two nodes. We run these
two nodes to view the evaluation statistics. The accuracy and Gini values are
shown in Fig. 8.322.

Fig. 8.320 Fitted CHAID for diabetes prediction

Fig. 8.321 Fitted C5.0 for diabetes prediction

Fig. 8.322 Accuracy and Gini values of all three decision trees for diabetes prediction

We see that, apart from the training set of the C&R Tree, the accuracy is nearly
the same across all models within each dataset. The number of misclassified
patients differs by at most 2 data records per dataset between the decision trees.
The Gini values differentiate the models more clearly: the Gini coefficient of the
CART model is considerably lower in all datasets than that of the other two
models. The CHAID and C5.0, however, have nearly the same Gini coefficients,
with the C5.0 model slightly ahead in the training and validation set. This
indicates that a more complex tree might increase prediction ability.
In Fig. 8.323, the Gini values are visualized by the ROC curves for the three models
and datasets. This graph confirms our conjecture that CHAID and C5.0 are quite
similar in prediction performance (the ROCs of these models have nearly the
same shape), while the CART curve is located far beneath the other two curves,
indicating a poorer fit to the data.
However, none of the three curves is consistently located above the other two.
So, the three models perform differently in some regions of the data space. Even
the CART outperforms the other two models in some cases. See the curves of the
test set in Fig. 8.323. Thus an ensemble mode of the three trees might improve
the quality of the prediction.

Fig. 8.323 ROC of all three decision trees for diabetes prediction

Fig. 8.324 Settings of the Ensemble node

5. To combine the three trees into an ensemble model, we add an Ensemble node
to the stream and connect it with the last model nugget. In the Ensemble node,
we set “class_variable” as target for the ensemble and choose “Voting” as the
aggregation method. See Fig. 8.324 for the setting of these options, and Table 8.7
for a list of the aggregation methods available in the Ensemble node.
We add another Analysis node to the stream and connect it with the Ensemble
node. Then, we run the Analysis node. The output statistics are presented in

Fig. 8.325 Analysis node output statistics for the Ensemble node

Fig. 8.325. We note that the accuracy has not changed much compared to the
individual models, but the Gini has increased for the test and validation sets. This
indicates that the ensemble model balances out the errors of the individual models
and is thus more precise in the prediction of unknown data. A minimal R sketch of
the voting aggregation used here is given below.
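To illustrate what the "Voting" aggregation chosen above does with categorical predictions, here is a minimal R sketch (R as in Chap. 9); the three prediction vectors are invented and do not come from the diabetes stream. Each record is assigned the class predicted by the majority of the models.

# Majority voting over the outputs of three hypothetical classifiers.
pred_cart  <- c("yes", "no",  "no", "yes")
pred_chaid <- c("yes", "yes", "no", "no")
pred_c50   <- c("no",  "yes", "no", "yes")

majority_vote <- function(...) {
  apply(cbind(...), 1, function(row) names(which.max(table(row))))
}
majority_vote(pred_cart, pred_chaid, pred_c50)   # returns "yes" "yes" "no" "yes"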

Exercise 4: Rule set and cross-validation with C5.0


Name of the solution stream: Income_C5_RuleSet
Theory discussed in: Sect. 8.8.1, Sect. 8.8.2

The final stream for this exercise looks like the stream in Fig. 8.326.

1. First we import the dataset with the Var. File node and connect it with a Type
node. In the Type node, we set the measurement of the variable “income” to Flag
and the role to target. Then we add a Partition node, open it, and define 70 % of
the data as training and 30 % as test set. See Sect. 2.7.7 for a detailed description
of the Partition node.
2. We now add two C5.0 nodes to the stream and connect them with the Partition
node. Since the roles of the variables are already defined in the Type node, the
C5.0 nodes automatically identify the target and input variables. We use the
default model settings provided by the SPSS Modeler and so just have to change
the output type of one node to “Rule Set”. See Fig. 8.327.
Now, we run the stream and the two model nuggets appear. We then rearrange
them in a line, so the models are applied successively to the data. As a result, the
models can be more easily compared to each other. See Fig. 8.107 for an
example of the rearrangement of model nuggets.

Fig. 8.326 Solution stream of the income prediction exercise

Fig. 8.327 Definition of the rule set output type in the C5.0 node

The final constructed decision tree is very complex and large, with a depth of
23, meaning the rule set contains a large number of individual rules. The rule set as
well as the decision tree are too large and complex to describe here and for that
reason have been omitted.
3. To compare the two models to each other, we first add a Filter node to the stream
and connect it with the last model nugget. This node is added simply to rename

Fig. 8.328 Analysis node output statistics of the two C5.0 models

the prediction fields, which are then more easily distinguishable in the Analysis
node. See Sect. 2.7.5 for the description of the Filter node.
Afterwards, we add an Analysis and Evaluation node to the stream and
connect it with the Filter node. See Sect. 8.2.6 and Fig. 8.28 for a description
of the Analysis and Evaluation node options. We then run these two nodes. The
Analysis output with the accuracy and Gini of the two models is shown in
Fig. 8.328. We see that the accuracies of the C5.0 decision tree and the C5.0 rule
set model are similar, with about 12 % error rate in the training and 14 % error
rate in the test set. However, the decision tree model has a slightly better
performance, as additionally confirmed by the Gini values, which are a bit
higher for both datasets in this case. This indicates that the decision processes
encoded by the rule set and the decision tree are closely related but differ only in
detail, which is reflected in the evaluation statistics.
The ROCs of the two models are displayed in Fig. 8.329. As can be seen, the
curve of the decision tree model lies slightly above the curve of the C5.0 rule set
model. Hence, the C5.0 decision tree provides better prediction power than the
rule set model. For readers who want to experiment outside the Modeler, a small R
sketch of fitting both a C5.0 tree and a C5.0 rule set follows below.

Fig. 8.329 ROC of the two C5.0 models (decision tree and rule set)
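As a complement to the Modeler's C5.0 node, the following hedged sketch shows how the same distinction between a tree and a rule set can be produced in R with the C50 package (R as in Chap. 9). The data frame and the variables age, hours, and income are simulated for illustration only and are not the income dataset.

# Fit C5.0 once as a decision tree and once as a rule set on simulated data.
library(C50)
set.seed(1)
n <- 500
d <- data.frame(age   = sample(20:65, n, replace = TRUE),
                hours = sample(10:60, n, replace = TRUE))
d$income <- factor(ifelse(800 * d$age + 300 * d$hours + rnorm(n, sd = 5000) > 50000,
                          ">50K", "<=50K"))

tree_model <- C5.0(income ~ age + hours, data = d)               # decision tree output
rule_model <- C5.0(income ~ age + hours, data = d, rules = TRUE) # rule set output
summary(rule_model)   # prints the extracted rules with their coverage and accuracy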

8.9 The Auto Classifier Node

As for the regression and clustering methods (Chaps. 5 and 7), the SPSS Modeler
also provides a node, the Auto Classifier node, which comprises several different
classification methods and can thus build various classifiers in a single step. This
Auto Classifier node provides us with the option of trying out and comparing a
variety of different classification approaches without adding the particular node and
setting the parameters of the algorithm for each model individually, which can
result in a very complex stream with many different nodes. Finding the optimal
parameters of a method, e.g., the best kernel and its parameters for an SVM or the
number of neurons in a Neural Network, can be extremely cumber-
some and is thus reduced to a very clear process within a single node. Furthermore, the
utilization of the Auto Classifier node is an easy way to consolidate several different
classifiers into an Ensemble model. See Sects. 5.3.6 and 8.8.1 and the references
given there for a description of Ensemble techniques and modeling. All models
built with the Auto Classifier node are automatically evaluated and ranked
according to a predefined measure. So the best performing models can be easily
identified and added to the ensemble.
Besides the classification methods and nodes introduced in this chapter, the Auto
Classifier node also comprises the Bayesian Network and Decision List techniques.
We refer to Ben-Gal (2008) and Rivest (1987), respectively, for a description of these
classification algorithms, and to IBM (2015b) for a detailed introduction of their nodes
and options in the SPSS Modeler. See Fig. 8.330 for a list of all nodes included in
the Auto Classifier node.

Fig. 8.330 Nodes included within the Auto Classifier node. The darker circles are the nodes for
classification models which are described in this chapter. The lighter circles are additional
classification nodes of other models within the Auto Classifier node

Before turning to the description of the Auto Classifier node and how to apply it
to a dataset, we would like to point out that building a huge number of models is
very time consuming. That’s why one must pay attention to the number of different
parameter settings chosen in the Auto Classifier node since a large number of
different parameter values leads to a huge number of models. The building process
may take a very long time to calculate in this case, sometimes hours.

8.9.1 Building a Stream with the Auto Classifier Node

Below, you will learn how to use the Auto Classifier node effectively to build
different classifiers of the same problem in a single step and identify the optimal
models for our data and mining task. A further advantage of this node is its ability to
unite the best classifiers into an ensemble model to combine the strengths of different
classification approaches and counteract their weaknesses. Furthermore, cross-
validation to find the optimal parameters of a model can be easily carried out within
the same stream. We introduce the Auto Classifier node by applying it to the

Wisconsin breast cancer data, to build classifiers that are able to distinguish benign
from malignant cancer samples.

Description of the model


Stream name: Auto_classifier_node
Based on dataset: WinconsinBreastCancerData (see Sect. 10.1.35)
Stream structure
Important additional remarks: The target variable must be nominal or binary in order to use the Auto Classifier node.
Related exercises: 1, 2

1. First, we open the template stream "016 Template-Stream_Wisconsin-
BreastCancer" (see Fig. 8.331) and save it under a different name. The target
variable is called “class” and takes values 2 for “benign” and 4 for “malignant”
samples.
The template stream imports the data and already has a Type node attached
to it. In the Type node, the variable “class” is already defined as target variable
with measurement Flag. Except for the “Sample Code” variable, which is set to
none since it just labels the cancer samples, all other variables are set as input
variables.
2. To set up validation of the models, we add a Partition node to the stream
and insert it between the Source and Type node. We split the data into 70 %
training and 30 % testing data. The Partition node is described in more detail in
Sect. 2.7.7.
3. Now, we add the Auto Classifier node to the canvas, connect it with the Type
node, and open it with a double click. The target and input variables can be
specified in the Fields tab. See Fig. 8.332. If the variable roles are already set in
a previous Type node, these are automatically recognized by the Auto Classi-
fier node if the "Use type node setting" option is chosen in the Fields tab. For
demonstration purposes, we set the variable roles here manually. So we select the
"Use custom settings" option and define "class" as target, "Partition" as the
partitioning identification variable, and all remaining variables, except for
"Sample Code", as input variables. This is shown in Fig. 8.332.

Fig. 8.331 Template stream of the Wisconsin breast cancer data

Fig. 8.332 Definition of target and input variables and the partition field
4. In the Model tab, we enable the "Use partitioned data" option. See the top
arrow in Fig. 8.333. This option causes the models to be built based on the
training data alone.
In the “Rank models by” selection field, we can choose the score that
validates the models and compares them to each other. Possible measures are
listed in Table 8.10. Some of the rank measures are only available for a binary
(Flag) target variable. Here, we chose the “Area under the curve” (AUC) rank
measure.

With the “rank” selection, we can choose whether the models should be
ranked by the training or the test partition, and how many models should be
included in the final ensemble. Here, we specify that the ensemble should comprise
3 models, ranked by the calculations on the test set. See the bottom arrow
in Fig. 8.333. At the bottom of the Model tab, we can further set the revenue
and cost values used to calculate the profit. Furthermore, a weight can be
specified to adjust the results. In addition, the percentile considered for the
Lift measure calculations can be set (see Table 8.10 and IBM (2015b)). The
default here is 30.
In the Model tab, we can also choose to calculate the predictor importance,
and we recommend enabling this option each time.
5. The next tab is the "Expert" tab. Here, the classification models that should
be calculated and compared with each other can be specified. See Fig. 8.334.
We can include a classification method by checking its box on the left. All
models marked in this way are built on the training set, compared to each other,
and the best ranked ones are selected and added to the final ensemble model.

Fig. 8.333 Model tab with the criteria for including models in the ensemble

Table 8.10 Rank criteria provided by the Auto Classifier node (possible target types in square brackets)

– Overall accuracy [Nominal, Flag]: Percentage of correctly predicted data records.
– Area under the curve (AUC) [Flag]: Area under the ROC. A higher value indicates a better fitted model. See Sect. 8.2.5 for details on the ROC and the area under the curve measure.
– Profit [Flag]: Sum of profits across cumulative percentiles. The profit of a data record is the difference between its revenue and its cost; the revenue is the value associated with a hit, and the cost is the value associated with a misclassification. These values can be set at the bottom of the Model tab.
– Lift [Flag]: The hit ratio in cumulative quantiles relative to the overall sample. The percentiles used to calculate the Lift can be defined at the bottom of the Model tab.
– Number of fields [Nominal, Flag]: Number of input fields used in the model.

Fig. 8.334 Selection of the considered classification methods

Fig. 8.335 Parameter setting of the Neural Net node in the Auto Classifier node
We can further specify multiple settings for one model type, in order to
include more model variations and to find the best model of one type. Here, we
also want to consider a boosted Neural Network in addition to the standard
approach. How to set the parameter to include this additional Neural Network
in the Auto Classifier node building process is described below.
To include more models of the same type in the building process, we click
on the “Model parameters” field next to the particular model, Neural Net in
this example, and choose the option “Specify” in the opening selection bar
(Fig. 8.334). A window pops up which comprises all options of the particular
node, the Neural Net node in this example. See Fig. 8.335.
In this window, we can specify the parameter combinations which should be
considered in separate models. Each parameter or option can thereby be
assigned multiple values, and the Auto Classifier node then builds a model
for every possible combination of these parameters.
For our case, we also want to consider a boosted neural network. So we click
on the “Options” field next to the “Objectives” parameter and select the
“Specify” option in the drop-down menu. In the pop-up window, we select
the boosting and standard model options. This is shown in Fig. 8.336. Then we
click the OK button. This will enable the Auto Classifier node to build a neural
network with and without boosting. The selected options are shown in the
“Option” field next to the particular “Parameter” field in the main settings
window of the Neural Net node. See Fig. 8.335.
6. Then we go to the "Expert" tab to specify the aggregation methods for the
boosting procedure, in the same manner as for the modeling objective, i.e., the
boosting and standard modeling procedure. We choose two different methods here, the
"Voting" and "Highest mean probability" techniques. So, a neural network is
constructed for each of these two aggregation methods (Fig. 8.337).

Fig. 8.336 Specification of the modeling objective type for a neural network

7. In summary, we have specified two modeling objectives and two aggregation
methods for ensemble models. The Auto Classifier node now takes all of these
options and builds a model for each of the combinations. So four neural
networks are created in this case, although the aggregation method has no
influence on the standard modeling process. Imprecise parameter setting can
result in countless irrelevant model builds and will massively increase
processing time. Furthermore, identical models can be included in the ensem-
ble and so exclude models with different aspects that might improve the
prediction power of the ensemble. The number of considered models in the
Auto Classifier node is displayed right next to the model type field in the Expert
tab of the Auto Classifier node, see Fig. 8.334.

" The Auto Classifier node takes all specified options and parameters of
a particular node and builds a model for each of the combinations.
For example, in the Neural Net node the modeling objective is chosen
as “standard” and “boosting”, and the aggregation methods “Voting”
and “Highest mean probability” are selected. Although the aggrega-
tion methods are only relevant for a boosting model, 4 different
models are created by the Auto Classifier node:

– Standard neural network with voting aggregation
– Standard neural network with highest mean probability aggregation
– Boosting neural network with voting aggregation
– Boosting neural network with highest mean probability aggregation

Fig. 8.337 Specification of the aggregation methods for the boosting model in the Neural Net algorithm settings

" Imprudent parameter setting can result in countless irrelevant model


builds and so massively increase the processing time and memory.
Furthermore, identical models (in this case: standard neural net with
voting and highest mean probability aggregation) can be included in
the ensemble if they outperform the other models. In this case,
models with different approaches and aspects that might improve
the prediction or balance overfitting might be excluded from the
ensemble.
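A tiny R sketch (R as in Chap. 9; the option names merely mirror the example above) makes this combinatorial growth explicit: the number of candidate models is the product of the numbers of values specified per parameter.

# Every combination of the specified options becomes one candidate model.
options_grid <- expand.grid(
  objective   = c("Standard", "Boosting"),
  aggregation = c("Voting", "Highest mean probability")
)
options_grid        # the four neural networks listed above
nrow(options_grid)  # 4; specifying a third value anywhere multiplies this number again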

8. Rules that a model has to fulfill to be considered as a candidate for the ensemble
can be specified in the Discard tab of the Auto Classifier node. If a model fails
to satisfy one of these criteria, it is automatically discarded from the subsequent
process of ranking and comparison.

Fig. 8.338 Definition of the discard criteria in the Auto Classifier node

The Discard tab and its options are shown in Fig. 8.338. The discarding
criteria comprise the ranking criteria, i.e., Overall accuracy, Number of fields,
Area under the curve, Lift, and Profit. In our example of the Wisconsin
breast cancer data, we discard all models with an accuracy lower than 80 %,
so that the final model has a minimum hit rate. See Fig. 8.338.
9. In the Settings tab, the aggregation method can be selected: this combines the
predictions of all component models in the ensemble generated by the Auto Classifier
node into a final prediction. See Fig. 8.339. The most important aggregation
methods are listed in Table 8.7. Besides these methods, the Auto Classifier
node provides weighted voting methods like “Confidence-weighted voting”
and “Raw propensity-weighted voting”. See IBM (2015b) for details on these
methods. We select the “Confidence-weighted voting” here. The ensemble
method can also be later changed in the model nugget. See Fig. 8.343.
10. When we have set all the model parameters and Auto Classifier options of our
choice, we run the model, and the model nugget appears in the stream. For each
possible combination of selected model parameter options, the Modeler now
generates a classifier, all of which are compared to each other and then ranked
according to the specified criteria. If a model ranks high enough, here among the
top three models, it is included in the ensemble. The description of the
model nugget can be found in Sect. 8.9.2.
11. We add an Analysis node to the model nugget to calculate the evaluation
statistics, i.e., accuracy and Gini. See Sect. 8.2.6 for the description of the
Analysis node and its options.
Figure 8.340 shows the output of the Analysis node. We see that the accuracy
in both the training and the test set is quite high, at about 97 %. Furthermore, the Gini

Fig. 8.339 Definition of the aggregation method for the ensemble model

Fig. 8.340 Analysis output with evaluation statistics from both the training and the test data

values are nearly 1, which indicates an excellent prediction ability with the
inserted variables.

8.9.2 The Auto Classifier Model Nugget

In this short section, we will take a closer look at the model nugget generated by the
Auto Classifier node and the graphs and options it provides.

Model tab and the selection of models included in the ensemble


The top-ranked models built by the Auto Classifier node are listed in the Model tab.
See Fig. 8.341. In this case, the ensemble model consists of the top three models to
predict breast cancer, as suggested by the Auto Classifier node. These models are a
Logistic Regression, a Discriminant, and a Neural Net classifier. The models are
ordered by their AUC on the test set, as this is the rank criterion chosen in the node
options (see previous section). The order can be manually changed in the drop-
down menu on the left, labeled “Sort by”, in Fig. 8.341.
In addition to the AUC measurement, all other ranking methods as well as the
build time are shown. We can change the basis of the ranking measure calculations
to be the training data on which all ranking and fitting statistics will then be based.
See right arrow in Fig. 8.341. The test set, however, has an advantage in that the
performance of the models is verified on unknown data.
To determine whether each model is a good fit for the data, we recommend
looking at the individual model nuggets manually to inspect the parameter values.
Double-clicking on the model itself will open a new window of the particular model
nugget, which provides all the graphs, quality statistics, decision boundary
equations, and other model specific information. This is highlighted by the left
arrow in Fig. 8.341. Each of the model nuggets is introduced and described
separately in the associated chapter.

Fig. 8.341 Model tab of the Auto Classifier model nugget. Specification of the models in the
ensemble used to predict target class

Fig. 8.342 Graph tab of the Auto Classifier model nugget. Predictor importance and bar plot that
shows the accuracy of the ensemble model prediction

In the furthest left column labeled “Use?”, we can choose which of the models
should contribute to the ensemble model. More precisely, each of the enabled
models takes the input data and estimates the target value individually. Then, all
outputs are aggregated, according to the method specified in the Auto Classifier
node, into one single output. This process of aggregation can prevent overfitting and
minimize the impact of outliers, which will lead to more reliable predictions.
Left of the Models, the distribution of the target variable and the predicted
outcome is shown for each model individually. Each graph can be viewed in a
larger, separate window by double-clicking on it.

Predictor importance and visualization of the prediction accuracy


In the “Graph” tab, the accuracy of the ensemble model prediction is visualized by a
bar plot on the left. Each bar thereby represents a category of the target variable,
and its height the occurrence frequency in the data. So the bar plot is a visualization
of the distribution of the target variable. The bars are also colored, with each color
representing a category predicted by the ensemble model. This allows you to easily
see the overall accuracy, as well as identify classes with numerous misclassi-
fications, which are harder to detect. See Fig. 8.342.
In the graph on the right, the importance of the predictors is visualized in the
standard way. See Sect. 5.3.3 for predictor importance and the shown plot. The
predictor importance of the ensemble model is calculated on the averaged
output data.

Fig. 8.343 Settings tab of the Auto Classifier model nugget. Specification of the ensemble
method

Setting of the ensemble aggregation method


The aggregation method, i.e., the way in which the individual predictions made
by the models in the Auto Classifier node are combined into a single, final classifi-
cation, can be specified in the "Settings" tab. See Fig. 8.343. These are the
same options as in the Settings tab of the Auto Classifier node, and we therefore
refer to Sect. 8.9.1, and Fig. 8.339 in particular, for a more detailed description of
this tab. The small sketch below illustrates how a confidence-weighted vote can
differ from a simple majority vote.
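The following hedged R sketch (R as in Chap. 9; the class labels and confidences are invented, and this is not the Modeler's internal implementation) contrasts simple voting with confidence-weighted voting for a single record of a hypothetical three-class target:

# Three models classify one record; conf holds each model's confidence in its prediction.
pred <- c(model1 = "A", model2 = "B", model3 = "A")
conf <- c(model1 = 0.40, model2 = 0.90, model3 = 0.45)

names(which.max(table(pred)))              # simple voting: "A" (two votes to one)
names(which.max(tapply(conf, pred, sum)))  # confidence-weighted: "B" (0.90 > 0.85)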

8.9.3 Exercises

Exercise 1: Finding the best models for credit rating with the Auto
Classifier node
The “tree_credit” dataset (see Sect. 10.1.33) comprises demographic and loan data
history of bank customers as well as a prognosis for giving a credit (“good” or
“bad”). Determine the best classifiers to predict the credit rating of a bank customer
with the Auto Classifier node. Use the AUC measure to rank the models. What is the
best model node and its AUC value, as suggested by the Auto Classifier procedure?
Combine the top five models to create an ensemble model. What is its accuracy and
AUC?

Exercise 2: Detection of leukemia in gene expression data with the SVM—
Determination of the best kernel function
The dataset “gene_expression_leukemia_short.csv” contains gene expression
measurements of 39 human genome positions of various leukemia patients (see
Sect. 10.1.15). Your task is to build an SVM classifier that will determine whether
a patient will be diagnosed with blood cancer. To do this, combine all leukemia
types into a single new variable value which only indicates that the patient has cancer. Use
the Auto Classifier node to determine the best kernel function to be considered in
the SVM. What are the AUC values and which kernel function should be used in the
final SVM classifier?

8.9.4 Solutions

Exercise 1: Finding the best models for credit rating with the Auto
Classifier node
Name of the solution stream: tree_credit_auto_classfier_node
Theory discussed in: Sect. 8.9.1

The final stream for this exercise looks like the stream in Fig. 8.344.

1. We start by opening the stream "000 Template-Stream tree_credit", which
imports the tree_credit data and already has a Type node attached to it, and
save it under a different name. See Fig. 8.345.

Fig. 8.344 Stream of the credit rating prediction with the Auto Classifier node exercise

Fig. 8.345 Template stream for the tree_credit data



2. We add a Partition node to the stream and place it between the Source and Type
node. In the Partition node, we declare 70 % of the data as training and the
remaining 30 % as test data. We then open the Type node and define the
measurement type of the variable “Credit rating” as Flag and its role as target.
3. Now we add an Auto Classifier node to the stream and connect it with the Type
node. The variable roles are automatically identified. This means that nothing
has to be changed in the Fields tab settings.
4. In the Model tab, we select the AUC as the rank criterion and set the number of
models to use to 5, since the final ensemble model should comprise 5 different
classifiers. See Fig. 8.346.
5. In the Expert tab, we add the SVM to the models that should be considered in the
building and ranking process by checking the box next to the SVM model type.
See Fig. 8.347.

Fig. 8.346 Definition of the ranking criteria and number of used models

Fig. 8.347 Selection of the models to be considered in the building and ranking process. Adding
of SVM to this list

6. Since we only want to consider models with high prediction performance, we
define two discard criteria. For a model to be a candidate for the ranking and,
eventually, the ensemble model, it needs a minimum accuracy of 80 % and an AUC
above 0.8. See Fig. 8.348.
7. Now we run the stream and the model nugget appears.
8. To view the top five ranked classifiers built by the Auto Classifier node, we open
the nugget. The best five models, as suggested by the Auto Classifier node, are, in
the order of ranking, a Logistic Regression, CHAID, Bayesian Network, SVM,
and C5.0 tree. The AUC values range from 0.888 for the logistic regression to
0.843 for the C5.0 tree. The accuracy of these models is also quite high at just
over 80 % for all models. See Fig. 8.349.

Fig. 8.348 Definition of the discard criteria

Fig. 8.349 Top five classifiers to predict the credit rating of a bank customer built by the Auto
Classifier node

Fig. 8.350 Analysis node with performance statistics of the ensemble model that classifies
customers according to their credit rating

9. To evaluate the performance of the ensemble model that comprises these five
models, we add an Analysis node to the stream and connect it with the model
nugget. We refer to Sect. 8.2.6 for information on the Analysis node options.
Figure 8.350 presents the accuracy and AUC of the ensemble model. As for all the
individual component models, the accuracy of the ensemble model is a little above
80 % for the training and the test set. The AUC for the test set, at 0.887, is in the
same range as that of the best ranked model, i.e., the logistic regression.

Exercise 2: Detection of leukemia in gene expression data with the SVM—
Determination of the best kernel function
Name of the solution stream: gene_expression_leukemia_short_svm_kernel_finding_auto_classifier_node
Theory discussed in: Sect. 8.9.1, Sect. 8.5.1, Sect. 8.5.4 (Exercise 1)

The stream displayed in Fig. 8.351 is the complete solution of this exercise.

Fig. 8.351 Complete stream of the best kernel finding procedure for the SVM leukemia detection
classifier

Fig. 8.352 Sub stream of data preparation of the solution stream of this exercise

1. The first part of this exercise is the same as in Exercise 1 in Sect. 8.5.4. We
therefore omit a detailed description of the importation, partitioning, and reclas-
sification of the data into healthy and leukemia patients, and refer to the
first steps of the solution of the above-mentioned exercise. After following the
steps of this solution, the stream should look like that in Fig. 8.352. This is our
new starting point.
2. We add an Auto Classifier node to the stream and connect it with the last Type
node. In the Auto Classifier node, we select the “Area under the curve” rank
criterion in the Model tab and set the number of used models to 4, as four kernel
functions are provided by the SPSS Modeler. See Fig. 8.353.
3. In the Expert tab, we check the box next to the SVM model type and uncheck the
boxes of all other model types. See Fig. 8.354. We then click on the model
parameter field of the SVM and select “Specify”. See arrow in Fig. 8.354.
4. The parameter option window of the SVM node opens, and we go to the Expert
tab. There, we change the Mode parameter to "Expert" so that all other options
become changeable. Afterwards, we click on the "Options" field of the Kernel type
parameter and click on “Specify”. In the pop-up window which appears, we
can select the kernel methods that should be considered in the building process

Fig. 8.353 Selection of the rank criteria and number of considered models

of the Auto Classifier node. As we want to identify the best among all kernel
functions, we check all the boxes: the RBF, Polynomial, Sigmoid, and Linear
kernel. See Fig. 8.355. We now click the OK buttons until we are back at the
Auto Classifier node. For each of these kernels, an SVM is constructed, which
means four in total. This is displayed in the Expert tab, see Fig. 8.354.
5. As the target variable and input variables are already specified in a Type node,
the Auto Classifier node identifies them and we can run the stream without
additional specifications.
6. We open the appeared model nugget to inspect the evaluation and the ranking of
the four SVMs with different kernels in the Model tab (Fig. 8.356). We see that
the SVM named “SVM 2” has the highest AUC value, which is 0.953. This
model is the SVM with a polynomial kernel. The values of the models "SVM 1"
(RBF kernel) and "SVM 4" (linear kernel), at 0.942 and 0.925, respectively, are
not far from that of the polynomial kernel SVM. The AUC value of the
remaining SVM (sigmoid kernel), however, is considerably lower at 0.66. Thus, the
prediction quality of this model is not as good as that of the other three. By looking
at the bar plot of each model, we see that the sigmoid kernel model classifies all
patients as leukemia patients, whereas the other three models are also able to
recognize healthy patients. This explains its much lower AUC.

Fig. 8.354 Selection of the SVM model in the Auto Classifier node
7. To recap, the SVM with a polynomial kernel function has the best performance
in detecting leukemia from gene expression data, and the Modeler suggests
using this kernel in an SVM model for this problem. However, the RBF
and Linear kernel models are nearly as good and are thus also appropriate
choices. For readers who want to reproduce the idea of a kernel comparison in
plain R, a small sketch follows below.
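The following hedged sketch uses the e1071 package (R, as in Chap. 9) on simulated two-class data; the gene expression dataset itself is not used, and all names and values are illustrative. Each kernel is fitted once and its training accuracy is reported; a proper comparison would use a test partition, as the Auto Classifier node does.

# Compare the four kernel functions of an SVM on simulated data (e1071 package).
library(e1071)
set.seed(3)
d <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
d$y <- factor(ifelse(d$x1^2 + d$x2^2 + rnorm(200, sd = 0.3) > 1.5, "leukemia", "healthy"))

for (k in c("radial", "polynomial", "sigmoid", "linear")) {
  fit <- svm(y ~ x1 + x2, data = d, kernel = k)
  cat(sprintf("%-10s training accuracy: %.2f\n", k, mean(fitted(fit) == d$y)))
}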

Fig. 8.355 Specification of the kernel functions considered during the model building process of
the Auto Classifier node

Fig. 8.356 Evaluation and ranking of the four SVMs with different kernels in the Model tab of the model nugget

Literature
Allison, P. D. (2014). Measures of fit for logistic regression. Accessed 19/09/2015, from http://
support.sas.com/resources/papers/proceedings14/1485-2014.pdf
Azzalini, A., & Scarpa, B. (2012). Data analysis and data mining: An introduction. Oxford:
Oxford University Press.
Ben-Gal, I. (2008). Bayesian Networks. In F. Ruggeri, R. S. Kenett, & F. W. Faltin (Eds.),
Encyclopedia of statistics in quality and reliability. Chichester, UK: Wiley.
Beyer, K., Goldstein, J., Ramakrishnan, R., & Shaft, U. (1999). When is “Nearest Neighbor”
meaningful? In G. Goos, J. Hartmanis, J. van Leeuwen, C. Beeri, & P. Buneman (Eds.),
Database Theory—ICDT’99, Lecture notes in computer science (Vol. 1540, pp. 217–235).
Berlin: Springer.
Biggs, D., de Ville, B., & Suen, E. (1991). A method of choosing multiway partitions for
classification and decision trees. Journal of Applied Statistics, 18(1), 49–62.
Breiman, L., Friedman, J., Stone, C. J., & Olshen, R. A. (1984). Classification and regression
trees. Boca Raton, FL: CRC Press.
Cheng, B., & Titterington, D. M. (1994). Neural Networks: A review from a statistical perspective.
Statistical Science, 9(1), 2–30.
Cormen, T. H. (2009). Introduction to algorithms. Cambridge: MIT Press.
Esposito, F., Malerba, D., Semeraro, G., & Kay, J. (1997). A comparative analysis of methods for
pruning decision trees. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19
(5), 476–493.
Fahrmeir, L. (2013). Regression: Models, methods and applications. Berlin: Springer.
Fisher, R. A. (1936). The use of multiple measurement in taxonomic problems. Annals of
Eugenics, 7(2), 179–188.
He, H., & Garcia, E. A. (2009). Learning from Imbalanced Data. IEEE Transactions on Knowl-
edge and Data Engineering, 21(9), 1263–1284.
Han, J., Kamber, M., & Pei, J. (2012). Data mining: Concepts and techniques, The Morgan
Kaufmann series in data management systems (3rd ed.). Waltham, MA: Morgan Kaufmann.
IBM. (2015a). SPSS Modeler 17 Algorithms Guide. Accessed 18/09/2015, from ftp://public.dhe.
ibm.com/software/analytics/spss/documentation/modeler/17.0/en/AlgorithmsGuide.pdf
IBM. (2015b). SPSS Modeler 17 Modeling Nodes. Accessed 18/09/2015, from ftp://public.dhe.ibm.
com/software/analytics/spss/documentation/modeler/17.0/en/ModelerModelingNodes.pdf
IBM. (2015c). SPSS Modeler 17 Source, Process, and Output Nodes. Accessed 19/03/2015, from
ftp://public.dhe.ibm.com/software/analytics/spss/documentation/modeler/17.0/en/ModelerSPOn
odes.pdf
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning
(Vol. 103). New York: Springer.
Kanji, G. K. (2009). 100 statistical tests (3rd ed.). London: Sage (reprinted).
Kass, G. V. (1980). An exploratory technique for investigating large quantities of categorical data.
Applied Statistics, 29(2), 119.
Kuhn, M., & Johnson, K. (2013). Applied predictive modeling. New York: Springer.
Lantz, B. (2013). Machine learning with R: Learn how to use R to apply powerful machine
learning methods and gain an insight into real-world applications, Open source: Community
experience distilled. Birmingham: Packt Publishing.
Loh, W.-Y., & Shih, Y.-S. (1997). Split selection methods for classification trees. Statistica Sinica,
7(4), 815–840.
Machine Learning Repository. (1998). Optical recognition of handwritten digits. Accessed 2015,
from https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits

Niedermeyer, E., Schomer, D. L., & Lopes da Silva, F. H. (2011). Niedermeyer’s electroencepha-
lography: Basic principles, clinical applications, and related fields (6th ed.). Philadelphia:
Wolters Kluwer/Lippincott Williams & Wilkins Health.
Oh, S.-H., Lee, Y.-R., & Kim, H.-N. (2014). A novel EEG feature extraction method using Hjorth
parameter. International Journal of Electronics and Electrical Engineering, 2(2), 106–110.
Peterson, L. E. (2009). K-nearest neighbor. Scholarpedia, 4, 1883.
Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1(1), 81–106.
Quinlan, J. R. (1993). C4.5: Programs for machine learning, The Morgan Kaufmann series in
machine learning. San Mateo, CA: Morgan Kaufmann.
R Core Team. (2014). R: A Language and Environment for Statistical Computing. http://www.R-
project.org/
Rivest, R. (1987). Learning decision lists. Machine Learning, 2(3), 229–246.
RStudio Team. (2015). RStudio: Integrated Development Environment for R. http://www.rstudio.
com/
Runkler, T. A. (2012). Data analytics: Models and algorithms for intelligent data analysis.
Wiesbaden: Springer Vieweg.
Schölkopf, B., & Smola, A. J. (2002). Learning with kernels: Support vector machines, regulari-
zation, optimization, and beyond, Adaptive computation and machine learning. Cambridge,
MA: MIT Press.
Tuffery, S. (2011). Data mining and statistics for decision making, Wiley series in computational
statistics. Chichester: Wiley.
Welch, B. L. (1939). Note on discriminant functions. Biometrika, 31, 218–220.
Wolberg, W. H., & Mangasarian, O. L. (1990). Multisurface method of pattern separation for
medical diagnosis applied to breast cytology. Proceedings of the National Academy of
Sciences, 87(23), 9193–9196.
Wu, X., Kumar, V., Ross Quinlan, J., Ghosh, J., Yang, Q., Motoda, H., McLachlan, G. J., Ng, A.,
Liu, B., Yu, P. S., Zhou, Z.-H., Steinbach, M., Hand, D. J., & Steinberg, D. (2008). Top
10 algorithms in data mining. Knowledge and Information Systems, 14(1), 1–37.
Zhou, Z.-H. (2012). Ensemble methods: Foundations and algorithms (Chapman & Hall/CRC
machine learning & pattern recognition series). Boca Raton, FL: Taylor & Francis.
9 Using R with the Modeler

After finishing this chapter the reader is able to . . .

1. Explain how to connect the IBM SPSS Modeler with R and why this can be
helpful,
2. Describe the advantages of implementing R features in a Modeler stream,
3. Extend a stream by using the correct node to incorporate R procedures, as
well as
4. Use R features to determine the best transformation of a variable towards
normality.

9.1 Advantages of R with the Modeler

The SPSS Modeler provides a wide range of algorithms, procedures, and options to
build statistical models. In most instances, its options for creating models and
preparing data are appropriate, easy to understand, and intuitive. So why
does IBM offer the user the option of implementing R functionalities
in the SPSS Modeler graphical environment? There are several answers to this
question:

1. It allows users with R knowledge to switch to the SPSS Modeler and to present
the analysis process in a more structured way.
2. The user can sometimes modify data more easily by using the R language.
3. The variety of advanced and flexible graphical functions provided by R can be
used to visualize the data in more meaningful plots.


Fig. 9.1 SPSS Modeler and R interaction: the R nodes offered by the Modeler and the R engine (with its packages and libraries) exchange the user's dataset (here "test_scores.sav") via the objects "modelerData" and "modelerDataModel", linked by the IBM SPSS Modeler Essentials for R

4. As with any other software, the SPSS Modeler and R offer different options to
analyze data and to create models. Sometimes it may also be helpful to assess the
fit of a model by using R.
5. R can be extended by using a wide range of libraries which researchers all over
the world have implemented. In this way, R is constantly updated, easier to
modify, and better at coping with specific modeling challenges.
6. Embedding R in the SPSS Modeler has the overall benefit of combining two
powerful analytics packages, so the strengths of each can be used in an analysis.

Each statistical software package has its advantages, and the option to use R
functionalities within the IBM SPSS Modeler as well gives the user the chance to
look at the same data from different angles and to use the best method provided by
both packages.
The aim of this chapter is to explain the most important steps in how to use R
functionalities and to implement the correct code in the SPSS Modeler. We will
have a look at how to install R and the IBM SPSS R Essentials. Furthermore, we
will discuss the R nodes of the Modeler that use the R functionalities and present
the results to the user. Figure 9.1 depicts the interaction of both programs by accessing
the same dataset “test_scores”.
The authors do not intend to explain the details of the R language here because
there are an overwhelming number of different options and functionalities that are
beyond the scope of this book.

9.2 Connecting with R

In order to use R with the Modeler, we have to install the IBM SPSS Modeler
Essentials for R. This is the Modeler toolbox to connect with R as shown in Fig. 9.1.
It links the data of the Modeler and of R so that both applications have
access to the data and can exchange it.

Here, we will present the steps to set up the IBM SPSS R Essentials. Addition-
ally we want to use a stream to test the connection with the R engine. A detailed
description of the installation procedure can also be found in IBM (2014a).

Assumptions

1. The R Essentials and therefore R can only be used with the Professional or
Premium Version of the Modeler.
2. The R Version 3.1.0 must be installed on the computer, and the folder of this
installation must be known, e.g., “C:\Program Files\R\R-3.1.0”.
DOWNLOAD: http://cran.r-project.org/bin/windows/base/old/3.1.0/
3. The folder “C:\Program Files\IBM\SPSS\Modeler\17\ext\bin\pasw.rstats” must
exist.

" In order to use R with the Modeler the “IBM SPSS Modeler—Essentials
for R” must be installed. The reader should distinguish between “IBM
SPSS Statistics—Essentials for R” and “IBM SPSS Modeler—Essentials
for R”. The tool last mentioned must be used. Furthermore, it is
essential to start the setup program as administrator! Details can be
found in the following detailed description and in IBM (2014a).

Set-up Procedure for the R Essentials


If all of these requirements are fulfilled, we can start the setup procedure

1. Download the “IBM SPSS Modeler—Essentials for R”. See IBM Website
(2015). The version of the Modeler and the version of the Essentials must
correspond. So if the Modeler Version 17 is installed, then Essentials
version 17 should be used too.
Depending on the operating system and the Modeler version, we must make
sure to use the correct 32 or 64 bit version. The correct name for the 64 bit
Microsoft Windows version is “SPSS_Modeler_REssentials_17.0_Win64”.
2. We must make sure not to start the install program after using the IBM download
program. Instead we strongly recommend making a note of the folder where the
download is saved and terminating the original download procedure after the file
has been saved.
Then we navigate to the folder with the setup program “SPSS_Modeler_REs-
sentials_17.0_Win64.exe”. We have to start the setup as administrator. To do so,
we click the file with the right mouse button and then choose the option “Run as
Administrator”.
3. After unzipping the files, the setup program comes up and requests you to choose
the correct language (Fig. 9.2).
4. We read the introduction and accept the license agreement.
5. Then we make sure to define the correct R folder (Fig. 9.3).

Fig. 9.2 Setup process “IBM SPSS Modeler—Essentials for R” initial step

Fig. 9.3 Setup process “IBM SPSS Modeler—Essentials for R”—define the R folder

6. As suggested by the setup program, we have to determine the path to the “pasw.
rstats” extension. In the previous steps, we verified that this folder exists.
Figure 9.4 shows an example. The user may find that the offered standard folder
in this dialog window is not correct and must be modified.
7. We carefully check the summary of the settings as shown in Fig. 9.5.
8. At the end, the setup program tells us that the procedure was successfully
completed (Fig. 9.6).

Fig. 9.4 Setup process “IBM SPSS Modeler—Essentials for R”—define the pasw.rstats folder

Fig. 9.5 Setup process “IBM SPSS Modeler—Essentials for R”—Pre-Installation summary

Fig. 9.6 Setup process “IBM SPSS Modeler—Essentials for R”—Installation summary

9.3 Test the SPSS Modeler Connection to R

Description of the model


Stream name: R_Connect_Test.str
Based on dataset: none
Stream structure

To test that the R essentials were installed successfully and the R Engine can also
be used from the Modeler, we suggest taking the following steps.

Fig. 9.7 Stream to test the R essentials

Fig. 9.8 Table node to show the defined variables

1. We open the stream “R_Connect_Test.str”. In Fig. 9.7, we can see that there is a
User input node as well as an R Transform node.
2. In the User Input node, a variable called “Test_Variable” is defined and the value
of the variable is just 1. We use this very simple node because we do not have to
connect the stream to a complicated data source, whose link to the data might be
broken and which would make the stream harder to use.
3. If we click on the left Table node with a right (!) click of the mouse, we can use
"Run" to see which variables are defined so far (Fig. 9.8).
4. As expected, there is one variable and one record. The value of the variable
“Test_Variable” is 1.
5. We can close the window with “OK”.
6. To finish the procedure, we click on the right Table node (see Fig. 9.9) once
more with a right (!) click of the mouse button and we use “Run” to start the
calculation procedure.

Fig. 9.9 Stream to test the R essentials

Fig. 9.10 Results calculated by the R engine

7. The table has been modified: the value shown is simply "Test_Variable + 1". If we
can see the new value, as shown in Fig. 9.10, then the Modeler is successfully
connected with R.

" The R Transform node enables R to grab the SPSS Modeler data and to
modify the values stored in an object called “modelerData” using a
script.

" Not all operations are possible to use in a Modeler node for R, e.g.,
sorting or aggregation. For further details, see also IBM (2014b).

We have not yet covered the usage of the R language in the Transform node
itself. If we double click the R Transform node in Fig. 9.9, we can find the first short
R script as shown in Fig. 9.11.

Fig. 9.11 R code in the R Transform node

1. The table, or data frame, "modelerData" is the R object automatically linked to
the SPSS Modeler dataset.
2. By using “modelerData$Test_Variable” we address the column
“Test_Variable”.
3. We increase the values in this column by 1. A one-line reconstruction of this
script is shown below.
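Based on this description, the script in the R Transform node boils down to a single assignment (a reconstruction from the text above, not a verbatim copy of Fig. 9.11):

# Increase the value of "Test_Variable" by 1 directly in the linked data frame.
modelerData$Test_Variable <- modelerData$Test_Variable + 1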

The SPSS Modeler Essentials link the two objects “modelerData” and
“modelerDataModel” from the Modeler to R and back. As shown in Fig. 9.1, the
Modeler copies the information to "modelerData". The R script modifies the
value(s), and in the Modeler, we can see the new values in the stream by looking at the
Table node. All other variables that will be defined in R will not be recognized in
the Modeler. See IBM (2015), p. 4.
The object “modelerDataModel” contains the structure of the object
“modelerData”. See also IBM (2015), p. 10–11. Because the structure of
“modelerData” is not being modified here, the object “modelerDataModel” does
not have to be addressed in the script.
As long as we only modify the variable previously defined, and we do not add any
column or change the name of a column, we do not have to add more commands
to the script. In the next example, we will show how to deal with data frames
modified by R and how to make sure we get the correct values when showing the results in
the Modeler.

" The object “modelerData” links the data to be exchanged between


the Modeler and R. So in the R script, “modelerData” must be
addressed to get the input values. Furthermore, values that are
modified or variables that are created in R must be part of
“modelerData”.

" The object “modelerDataModel” contains the structure of the object


“modelerData”. See also IBM (2015), p. 10–11. In particular, for each
column of “modelerData”, the column name and its scale type are
defined. As long as the structure of “modelerData” is not being
changed, the object “modelerDataModel” must not be modified.

9.4 Calculating New Variables in R

Description of the model


Stream name: R_salary_and_bonus.str
Based on dataset: salary_simple.sav
Stream structure

Related exercises: 1

By using a new stream, we will now look at the data transport mechanism from
the Modeler to R and back. We will describe the analysis of the stream step-by-step.

1. We show the predefined variables and their values in the dataset “salary_simple.
sav” by double-clicking on the Table node on the left. Then we click “Run”. We
find 5000 values in the column “salary”. See Fig. 9.12. We can close the window
of the Table node now.

Fig. 9.12 Table node with the original values of “salary_simple.sav”

Fig. 9.13 Type node settings

2. In the Type node, we can see that the variable "salary" is defined as continuous
and metrical, and its role as "Input". This is also shown in Fig. 9.13.
3. Now we double-click on the R Transform node. In the dialog window shown in
Fig. 9.14, we can see the R script that transforms the data. The commands used
here are explained in Table 9.1.
The tab “Console Output” in the R Transform node shows the internal
calculation results of R. After running the Transform node, the user can find

Fig. 9.14 R Script in the R Transform node

a possible error message below the last line shown in Fig. 9.15, which helps to identify
the R command that must be modified.
4. We can see the values of the three variables “salary”, “bonus”, and “new_salary”
in the Table node at the end of the stream (see Fig. 9.16).

" The R Transform node can be used to calculate new variables.

" The objects “modelerData” and “modelerDataModel” are used to


exchange values or information. All other variables defined in R will
not be recognized in the Modeler.

" The object “modelerDataModel” is needed for the SPSS Modeler to


convert the data from an R object back to a SPSS Modeler data table.
If new columns are defined, their name, type, and role must be
defined in “modelerDataModel” too.

" In the R Transform node, the dialog window in the tab “Console
Output” helps the user to identify R commands that are not correct
and must be modified.
9.4

Table 9.1 R Script commands in R Transform node of stream “R_salary_and_bonus”


Command Explanation
1 # get the old values in the data frame “modelerData”
temp_data<-modelerData
The data in the dataset “salary_simple.sav” are linked to the variable “modelerData” automatically. In R, we can now copy the data
or better data frame to a new variable called “temp_data”.
2 # define a new column “bonus” and calculate the values
temp_data$bonus<-temp_data$salary*0.1
In the data frame “temp_data” we address the column “salary” by using the notation “temp_data$salary”. We consider the bonus
Calculating New Variables in R

payments to be 10 % of the salary.


We create a new column “bonus” in the data frame “temp_data” and calculate the values.
3 # define a new column “new_salary” and calculate the values
temp_data$new_salary<-temp_data$salary+temp_data$bonus
The new payments are calculated by using the original payments and the calculated bonus payments. The values in the columns “salary”
and “bonus” of the data frame “temp_data” are added. A new column “new_salary” is defined in the data frame “temp_data”.
4 modelerData<-data.frame(temp_data)
The data frame “temp_data” is copied back to “modelerData”.
5 modelerDataModel <- data.frame(c(fieldName="salary",fieldLabel="",fieldStorage="real",fieldMeasure="",fieldFormat="",
fieldRole=""), . . .
The object “modelerDataModel” contains the structure of the object “modelerData”. See also IBM (2015), p. 10–11. Because we
modified the object “modelerData” by adding two new columns, we must add this information to “modelerDataModel”.
For the “salary” column and the new columns “bonus” and “new_salary”, the column name and its scale type are defined. The names of
the variables do not have to match the original column names. If the command were fieldName="old_salary", then the name
would be “old_salary” even if the original variable that contained the data were named “salary”. The step of defining the structure of
“modelerData” in the object “modelerDataModel” is needed for the SPSS Modeler to correctly convert the data from an R object back to
a SPSS Modeler data table.
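For convenience, the commands from Table 9.1 are repeated below as one consolidated script. The “modelerDataModel” definition is written out here along the pattern described in row 5 of the table; this is a sketch, and the exact field options in the shipped stream “R_salary_and_bonus.str” may differ slightly.

# get the old values in the data frame "modelerData"
temp_data <- modelerData
# define a new column "bonus" and calculate the values (10 % of the salary)
temp_data$bonus <- temp_data$salary * 0.1
# define a new column "new_salary" and calculate the values
temp_data$new_salary <- temp_data$salary + temp_data$bonus
# copy the data frame back to "modelerData"
modelerData <- data.frame(temp_data)
# describe all three columns so the Modeler can convert the R object back
modelerDataModel <- data.frame(
  c(fieldName="salary", fieldLabel="", fieldStorage="real", fieldMeasure="", fieldFormat="", fieldRole=""),
  c(fieldName="bonus", fieldLabel="", fieldStorage="real", fieldMeasure="", fieldFormat="", fieldRole=""),
  c(fieldName="new_salary", fieldLabel="", fieldStorage="real", fieldMeasure="", fieldFormat="", fieldRole=""))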

Fig. 9.15 Console output of the R Transform node

Fig. 9.16 Table node with the calculated values



9.5 Model Building in R

Description of the model


Stream name R_simple_linear_regression
Based on dataset test_scores.sav
Stream structure

Related exercises: 2, 3, 4

We discussed how to exchange data between the SPSS Modeler and R. Additionally, in the previous two subchapters, we analyzed examples of how to modify data by using R scripts. We did this by implementing the scripts in an R Transform node. Now we will demonstrate how to create models in R based on data coming from the Modeler. Furthermore, new values should be calculated by using the new model, and we want to assess the model results in the Modeler. To do so, we will use a linear regression model learned in Sect. 5 of this book, giving us the chance to compare the results of the Modeler and R.
Before we do so, we want to summarize the R functionalities available in the
Modeler. Table 9.2 shows an overview of the three different nodes. The R Trans-
form node was presented in the previous subchapters. The R Modeling node allows
for the creation of models and calculating forecasts by using this model. The R
Output node executes the R script and presents the results or diagrams in the
Modeler.
As outlined above, the aim of this subchapter is to present the functionalities
provided by the R Modeling and R Output node. We will therefore use an example
previously presented in Sect. 5.2.2. Students in different schools are taught using
different teaching methods. Based on pretest results, the scores in the posttest
should be predicted. Figure 9.17 depicts the stream “simple_linear_regression”
where a Linear node is being used to fit the linear regression model.

Table 9.2 R nodes implemented by the SPSS Modeler Essentials

Node Type Functionalities/Description

R Transform node (Modeler Tab “Record Ops”)
• Variables represented in the data frame “modelerData” can be modified. New variables can be calculated. They are columns in the data frame “modelerData”. “modelerDataModel” contains the structure of “modelerData” and is needed to convert data from R back to the Modeler.
• The name, measurement, and role must be defined by using the R command “data.frame” in “modelerDataModel”.
• The example streams “R_Connect_Test” in Sect. 9.3 as well as “R_salary_and_bonus” in Sect. 9.4 show this functionality.

R Building node (Modeler Tab “Modeling”)
• A model based on the data linked by “modelerData” can be created. When the code is successfully executed, a model nugget will be created in the SPSS Modeler.
• The scoring part of the node can be used to calculate new values.

R Output node (Modeler Tab “Output”)
• A model can be defined and new variables can be calculated. But in comparison to the R Building node, this node allows only for the presentation of the output of the R console or the diagrams produced. It does not provide the scoring part of the R Building node and does not link the data back to the Modeler.

Fig. 9.17 Stream


“simple_linear_regression”

As shown in Fig. 9.18, the estimated regression equation is

predicted posttest = 13.212 + 0.981 · pretest

With this equation, we are now able to predict the outcome of the final exam if we
know the student’s pretest result.
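For example, for a hypothetical student with a pretest score of 60 points, the equation yields

13.212 + 0.981 · 60 ≈ 72.1

so we would expect a posttest result of about 72 points.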
We want to demonstrate here how to use R to get the same result so that we can
learn how to deal with R in the SPSS Modeler and can verify at the same time
whether the determined parameters are equal. We do not want to build the stream
step by step here because to define the necessary R scripts manually would be just a

Fig. 9.18 Determined parameter of the linear regression function using the Linear node

Fig. 9.19 Stream “R_simple_linear_regression”

copy and paste job. It is more convenient to analyze the nodes used in the final stream and to understand how they work. Based on this knowledge, we can define other streams and R scripts correctly too.

1. We open the stream “R_simple_linear_regression” which is depicted in Fig. 9.19.


2. To show the original dataset, we double-click on the Table node on the left and
run it. Figure 9.20 shows some records of the dataset “test_scores.sav” with the
option “display field and value labels” enabled. We can close the window by
clicking OK.
3. In the Type node, we can see that both variables are defined as continuous.
Additionally, “pretest” is assigned as input and “posttest” as target variable.
4. Now we want to analyze the R Building node shown on top of the stream in
Fig. 9.19. We double-click on this node. Figure 9.21 shows the syntax tab of the
R Building node with two sections for R scripts. The R script in the R model
building syntax field calculates the Pearson’s correlation coefficient between the
“pretest” and the “posttest” values, fits the model, and prints a summary as well
as several plots. Table 9.3 shows the detailed description of the commands.
A second section is to be found in the syntax tab of the R Building node: the so-called R model scoring syntax field.

Fig. 9.20 Values of the


dataset “test_scores.sav”

Fig. 9.21 R Building node with R scripts



Table 9.3 R Script in R model building syntax field of the R Building node in the stream
“R_simple_linear_regression.str”
Command Explanation
1 # correlation of the model variables
cor(modelerData$pretest,modelerData$posttest)
The correlation between the pretest and the posttest score of the students will be
calculated. The result will be shown in the Text Output tab of the R model nugget.
See Fig. 9.23.
2 # fit the model
modelerModel<-lm(posttest~pretest,data=modelerData)
The parameters of the linear regression model are determined. The model is
assigned to the variable “modelerModel”.
3 # show the results
summary(modelerModel)
A detailed statistic will be shown in the Text Output tab of the R model nugget. See
Fig. 9.23.
4 # diagnostic plots
plot(modelerModel)
Several plots such as the residual plot will be shown in the Graph Output tab of the
R model nugget. See Fig. 9.24.

Table 9.4 R Script in R model scoring syntax field of the R Building node in the stream
“R_simple_linear_regression.str”
Command Explanation
1 # calculating forecast
result<-predict(modelerModel,newdata=modelerData)
The final scores of the students are determined based on the pretest results by using
the fitted linear regression model. The values are saved using the variable “result”.
2 # attach results
modelerData<-cbind(modelerData,result)
The data frame “modelerData” links the original values of the “test_scores.sav”
dataset from the SPSS Modeler to R. See Sect. 9.4. Here, this data frame is being
extended by the addition of a new column “result” with the predicted values of the
final test results.
3 # define characteristic of new variable
var1<-c(fieldName="posttest_prediction",fieldLabel="",fieldStorage="real",
fieldMeasure="",fieldFormat="",fieldRole="")
The details of the new column “result” are defined. Here, the name is
“posttest_prediction”.
4 # return to the Modeler and import the new data frame
modelerDataModel<-data.frame(modelerDataModel,var1)
The object “modelerDataModel” now describes the original columns and the new column “posttest_prediction”. See also IBM (2015), p. 10. The column name and its scale type are defined by using the information in the previously defined variable “var1”.

In this field, the linear model is used to predict the final test scores of the students based on the parameters saved in the variable “modelerModel”. The original dataset is extended by the predicted values in the column “posttest_prediction”. Table 9.4 shows the detailed description of the commands; a consolidated version of both script fields follows after this list.

Fig. 9.22 Model options tab of the R Building node

5. In the Model Options tab shown in Fig. 9.22, several options for the variable and output handling can be set. The options are self-explanatory. A detailed description can be found in IBM (2015), p. 3.
We can run the scripts of the R Building node and an R model nugget will be
shown in the middle of the stream. See Fig. 9.19. We open the R model nugget
by double-clicking it. Figure 9.23 shows the Text Output tab of the nugget. Here,
all the results that the statistics package R would print out to its console are
shown.
We can see the correlation coefficient of the pretest vs. the final test result. Its value of +0.9508843 indicates a very strong positive linear relationship between both variables.
Furthermore, we can find the details of the determined linear regression
model printed by the R command “summary(modelerModel)” explained in
row 3 of Table 9.3. The determined model parameters are equal to the results of

Fig. 9.23 Text Output tab of


the R model nugget

the SPSS Modeler previously determined by using a Linear node. See Fig. 9.18.
6. By activating the Graph Output tab of the R model nugget, we can find several
residual plots as shown in Fig. 9.24. We can close the R model nugget node by
clicking OK.
7. The predicted test scores are shown in the Table node at the end of the stream.
See Fig. 9.25.
8. Besides all the nodes mentioned so far, we can find the R Output node at the bottom of the stream in Fig. 9.19. Here, the same commands as explained in Table 9.3 are used to determine the parameters of the linear regression model and to print the results (see Fig. 9.26). The residual plots are also created. Running this node, we get the same results in the Text Output tab and in the Graph Output tab as in the R Building node. See Fig. 9.27.
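As announced in step 4, here are the contents of the two syntax fields of the R Building node once more, consolidated from Tables 9.3 and 9.4 for easier reading:

# --- R model building syntax ---
# correlation of the model variables
cor(modelerData$pretest, modelerData$posttest)
# fit the model
modelerModel <- lm(posttest ~ pretest, data=modelerData)
# show the results
summary(modelerModel)
# diagnostic plots
plot(modelerModel)

# --- R model scoring syntax ---
# calculating forecast
result <- predict(modelerModel, newdata=modelerData)
# attach results
modelerData <- cbind(modelerData, result)
# define characteristic of new variable
var1 <- c(fieldName="posttest_prediction", fieldLabel="", fieldStorage="real",
          fieldMeasure="", fieldFormat="", fieldRole="")
# return to the Modeler and import the new data frame
modelerDataModel <- data.frame(modelerDataModel, var1)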

By fitting the linear regression model to predict the final test scores, we analyzed the R Building node, the R model nugget, as well as the R Output node. We found that the statistical parameters determined by R were equal to those determined by the Modeler itself. The statistics produced by R are detailed, and the residual plots are also useful. So R can help the user to assess a model in more detail.

Fig. 9.24 Graph Output tab of the R model nugget

" The R Building node allows to calculate new variables as well as to


determine model parameters. In the scoring dialog field, the fitted
model can then be used to predict values. These values can be
populated back into the Modeler.

" The R Output node can be used to calculate several statistics or to fit
models. The output is similar to the text and graphics output of the R
Building node, but calculated results are not linked back to the
modeler.

Fig. 9.25 Predicted values in


the Table node

Fig. 9.26 Syntax tab of the R


Output node

Fig. 9.27 Text Output tab of


the R Output node

9.6 Exercises

Exercise 1: “Calculating New Variables with R”


Name of the solution stream R_simple_calculations
Theory discussed in section Section 9.4

In Sect. 2.7.2, we introduced the Derive node of the SPSS Modeler to do more or
less simple calculations. We created the stream “simple_calculations.str”. It is
based on the dataset “IT_user_satisfaction.sav”. Here, respondents should assess
the quality of an IT-system. They stated the number of training days they had last
year (variable “training_days_actual”) and the number of days they would like to
add (variable “training_days_to_add”) to improve their skills in using the IT
resources. By adding up both variables, a variable “training_expected_total_1” was derived as the total number of training days the user expects.
The same is done using another method for the variable “training_expec-
ted_total_2”. In this exercise, the stream should be extended by using an R node
to get the same results but calculated in R.

1. Open the stream “simple_calculations.str”.



2. Save the stream using a different name. The final solution stream is named
“R_simple_calculations.str”.
3. Add an appropriate R node mentioned in Table 9.2 and connect it with the
Type node.
4. Calculate values of a new variable “training_expected_total_3”.
5. Show the results in a Table node.

Exercise 2: “Correlation Matrix with R”


Name of the solution stream R_correlation_nutrition_habits.str
Theory discussed in section Section 4.4
Section 4.5
Section 9.5

In Sect. 4.4, we explained how to calculate the bivariate correlations between variables. We also used a Sim Fit node to produce a correlation matrix in Sect.
4.5. As mentioned there, this node approximates the correlations. In this exercise,
the R function “cor” should be used to calculate the correlation matrix.
In Sect. 6.3.2, we used the dataset “nutrition_habites.sav” to determine hidden
factors that help us to understand the dietary habits of respondents in a survey.
Factor analysis methods are used based on the correlation matrix. This matrix
should be calculated once more with R.

1. Open the stream “pca_nutrition_habits.str”. The stream shown in Fig. 9.28 was
created and explained to perform a PCA in Sect. 6.3.2. We want to use this stream
as a starting point to determine the correlation matrix of the input variables.
2. Save the stream using another name. The final stream here is “R_correlation_
nutrition_habits.str”.
3. Now remove all nodes in the rectangle in Fig. 9.28. They are used to perform the
PCA which we do not need here.

Fig. 9.28 Stream “pca_nutrition_habits.str” with highlighted nodes to be removed



4. The Pearson’s correlations can be calculated if the input variables are defined as
metrical/continuous. So this scale type must be assigned to all variables even if
they are ordinal in this case. Add a Type node on the right and make sure the
variables are defined as metrical/continuous.
5. Now add an appropriate R node as mentioned in Table 9.2 to calculate the
correlation matrix in R.
Remarks:
Use the command
round(cor(modelerData,modelerData),3)
to determine the values of the correlation matrix.
The function “cor(modelerData,modelerData)” calculates the correlations between all variables defined in the dataset modelerData. These are the variables of the original dataset “nutrition_habites.sav”, with the “ID” removed by the Filter node.
The function “round(xxx,3)” rounds all the correlations to three digits after the decimal point. A short stand-alone example of both functions follows after step 6.
6. Execute the different sub-streams and compare the correlations determined by
the Sim Fit and the R node.
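To see what “cor” and “round” return outside the Modeler, the following small stand-alone example can be run in any R console. The data and variable names are made up purely for illustration:

# two made-up, positively related variables
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)
demo_data <- data.frame(x, y)
# correlation matrix of all columns, rounded to three decimal places
round(cor(demo_data, demo_data), 3)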

Exercise 3: “Extending R by using libraries”


Name of the solution stream R_correlation_nutrition_habits_extended
Theory discussed in section Section 9.5

In the previous exercise, we used R to calculate the correlation matrix of all


variables included in the dataset “nutrition_habites.sav” except the respondents’
ID. Figure 9.29 shows this result. This matrix is not optimal because the significant
correlations are not marked and all correlations are mentioned twice. We want to
show here how to extend R to get a more appropriate result.

Fig. 9.29 Correlation matrix


determined with the R
Output node

Researchers can create new libraries and offer them as new packages of
functions and procedures to all other R users. They can download them and use
them to perform similar calculations without having to program the same
procedures again. The library “Hmisc” contains useful functions for data analysis.
See CRAN—Package Hmisc (2015) for more details.
We will show you how to add the library to R and additionally how to define our
own functions and how to use them in an R script.

1. Open the stream “R_correlation_nutrition_habits.str” created in the previous


exercise.
2. Save this stream using another name. The solution stream has the name
“R_correlation_nutrition_habits_extended.str”.
3. Now add a library to R by performing the following steps (a console-based alternative is sketched at the end of this exercise):
(a) Using Windows Explorer, open the folder “C:\Program Files\R\R-3.1.0\bin
\x64”. Now start the R graphical user interface (abbr. R GUI). See Fig. 9.30.
(b) In the R toolbar click on “Packages \ Install Packages”
(c) Choose a file mirror to download new packages (Fig. 9.31).
(d) To install the library, choose “Hmisc” from the list and install it. R tells us with the message “package ‘Hmisc’ successfully unpacked and MD5 sums checked” that the installation procedure has been completed successfully.

Fig. 9.30 R GUI



Fig. 9.31 Determining an R


file mirror

(e) Close R without saving the workspace.


4. Now go back to the SPSS Modeler. Copy the commands shown in Table 9.5. A user-defined function “corstarsl” is defined in the source code found on myowelt.blogspot.de (2015). This function is then used in the command “corstarsl(modelerData)” to create a more appropriate correlation matrix in R. In the script, correlations with a significance smaller than 0.001 are marked with “***”, correlations with a significance smaller than 0.01 with “**”, and correlations with a significance smaller than 0.05 are marked with “*”.
The reader should note that the user-defined function uses functionalities provided by the R library Hmisc. This is why it was necessary to install this library in the R GUI.

Table 9.5 R script from myowelt.blogspot.de (2015) to determine a correlation matrix and mark significant correlations
require(Hmisc)
## define a function corstarsl
corstarsl <- function(x){
require(Hmisc)
x <- as.matrix(x)
R <- rcorr(x)$r
p <- rcorr(x)$P
## define notions for significance levels; spacing is important.
mystars <- ifelse(p < .001, "***", ifelse(p < .01, "** ", ifelse(p < .05, "* ", " ")))
## truncate the matrix that holds the correlations to two digits
R <- format(round(cbind(rep(-1.11, ncol(x)), R), 2))[,-1]
## build a new matrix that includes the correlations with their appropriate stars
Rnew <- matrix(paste(R, mystars, sep=""), ncol=ncol(x))
diag(Rnew) <- paste(diag(R), " ", sep="")
rownames(Rnew) <- colnames(x)
colnames(Rnew) <- paste(colnames(x), "", sep="")
## remove upper triangle
Rnew <- as.matrix(Rnew)
Rnew[upper.tri(Rnew, diag = TRUE)] <- ""
Rnew <- as.data.frame(Rnew)
## remove last column and return the matrix (which is now a data frame)
Rnew <- cbind(Rnew[1:length(Rnew)-1])
return(Rnew)
}
## end of function definition
corstarsl(modelerData)

Now paste the R script into the “R Output syntax” field of the predefined R Output node as shown in Fig. 9.32. The old command “round(cor(modelerData, modelerData),3)” is also included in the first row, so the difference between both calculations is easier to see.
5. Run the R Output node and compare the results of both correlation matrices.
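As a console-based alternative to the R GUI menus in step 3, the package can also be installed directly from the R console. This is a minimal sketch; the CRAN address given here is only one possible mirror:

# install the package once (this contacts a CRAN mirror and may take a moment)
install.packages("Hmisc", repos="https://cran.r-project.org")
# load it to check that the installation worked
library(Hmisc)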

Fig. 9.32 R Output node with the extended R script to determine correlation matrix

Exercise 4: “Multiple linear regression with R”


Name of the solution stream R_multiple_linear_regression.str
Theory discussed in section Section 9.5

In Sect. 5.3, we built a linear regression model with the Linear node. Based on data from
the real estate market in Boston, the model predicts the median value of owner-occupied
homes based on multiple input variables. For a description of the variables see Sect.
10.1.17. Here, we want to use R to create the same model and to compare the results.

1. Open the stream “multiple_linear_regression.str”.


2. Save the stream using a different name. The solution stream is named
“R_multiple_linear_regression.str”. Mark the Linear node and the model nug-
get and move them upwards as depicted in Fig. 9.33.
3. Add a Table node to show the predicted values. Inspect the predicted values.
4. Add a comment to the last two nodes so it is clear that this is the MLR Model
based on a Modeler procedure implemented in the Linear node.
5. To compare the determined model parameters later, the automatic data preparation procedure in the Linear node must be disabled. To do so, double-click on

Fig. 9.33 Original stream “multiple_linear_regression.str” to fit a MLR model

Fig. 9.34 Build Options in the Linear node

the Linear node and activate the tab “Build Options”. In the section “Basics”
disable the option “Automatically prepare data”. See Fig. 9.34.
6. Run the Linear node to update the model nugget and the parameters of the model, now without data preparation.
7. Open the model nugget and activate the coefficients view on the left. You
should find the parameters as shown in Fig. 9.35. Close this window.

Fig. 9.35 MLR Model parameter without data preparation



8. Now add an R Building node and connect it with the existing Type node.
9. Determine the appropriate R commands to . . .
(a) determine the model parameters by using the command
lm(MEDV~CRIM+ xxx,data=modelerData)
The dummy “xxx” in this command must be substituted by the correct variable names, equal to the variables used in the Linear node.
(b) Save the parameters by using the variable “modelerModel”.
(c) Print a model summary.
(d) Create residual plots.
10. Add R commands so the values of MEDV are predicted by using the MLR
model.
11. Add a Table node to show the predicted values.
12. Compare the determined parameter of the model with those determined by
using the Linear node.
13. Add an appropriate comment to the R nodes.

Exercise 5 “Implementing Box Cox Transformation with R”


Name of the solution stream R_transform_diabetes
Theory discussed in section Section 9.5

In Sect. 3.2.5, we discussed the Box–Cox transformations to move the distribution


of a variable towards normality. In this exercise, an R script should be modified and
implemented in an SPSS Modeler stream.

1. Open the stream “transform_diabetes.str”. This stream should be the basis for
implementation of the Box–Cox transformation and the normality tests.
2. Save the stream using a different name. The final solution stream is named
“R_transform_diabetes.str”.
3. Review the R script “R_transform_diabetes_data.R” in the R scripts folder that
should be the basis for the implementation here.
4. Find and add an R node mentioned in Table 9.2 that is appropriate to implement
the transformation function as well as the normality tests as shown in the R script
mentioned above. Keep in mind that the R script produces the test results in the form of text output to the console as well as in the form of diagrams.
Connect the new node with the existing Type node.
5. Using the R script “R_transform_diabetes_data.R” implement the Box–Cox
transformations and normality tests (Kolmogorov–Smirnov with Lilliefors mod-
ification and Shapiro–Wilk normality test) in the SPSS Modeler stream.
6. Interpret the results.

9.7 Solutions

Exercise 1: “Calculating new variables with R”


The solution can be found in stream “R_simple_calculations”. We start the expla-
nation here with Step 3.

3. Following the characterization of the different R nodes in Table 9.2, here we use
an R Transform node from the Modeler Record Ops tab. Figure 9.36 shows the
node at the bottom.
4. The R script in Table 9.6 can be used to calculate the new variable “training_expected_total_3”. The structure of the script equals that of the scripts in the R Building node implemented in the stream “R_simple_linear_regression.str”. An explanation of the commands can be found in Sect. 9.5, e.g., Table 9.4. Figure 9.37 shows the R Transform node with the R script.
5. To show the results in a Table node, we add a node at the end of the stream.
Figure 9.38 shows the final stream and Fig. 9.39 shows the calculated new values
of the variable “training_expected_total_3”.

Fig. 9.36 Added R Transform node in the stream “R_simple_calculations.str”



Fig. 9.37 R Transform node with the R script to calculate the new variable

Fig. 9.38 Final stream “R_simple_calculations.str”

Fig. 9.39 Table node with the new variable “training_expected_total_3”



Exercise 2: “Correlation Matrix with R”


The solution can be found in stream “R_correlation_nutrition_habits.str”. We start
the explanation here with Step 4.

4. We add another Type node and define all variables as metrical/continuous.


Figure 9.40 shows the stream and Fig. 9.41 the Type node settings.

Fig. 9.40 Added new Type node to redefine the scale of measurement of the variables

Fig. 9.41 Variables defined as continuous in the Type node



Fig. 9.42 R Output node to determine the correlation matrix

5. We could use an R Building node and define the command in the dialog field “R model building syntax” (see, e.g., Fig. 9.21). But here we do not need to modify the original dataset or link a modified data frame back to the Modeler. That is why we can also use an R Output node from the Modeler’s Output tab. We add it to the stream and paste the command mentioned above into the “R Output syntax” dialog field of the node. See Fig. 9.42.
6. As shown in Figs. 9.43 and 9.44: apart from their order, the determined
correlations are the same.

Fig. 9.43 Correlation matrix determined with the R Output node

Fig. 9.44 Correlation Matrix determined with the Sim Fit node

Exercise 3: “Extending R by using libraries”


The solution can be found in stream “R_correlation_nutrition_habits_extended”. In this exercise, we explain how to download and install a library in R. Then we use this library in a script originally found on myowelt.blogspot.de (2015). This shows how to extend the functionalities of R in two ways: the user can use R libraries or define his or her own functions.

Fig. 9.45 Correlation matrices determined in R

Running the R Output node, we can compare the results of both correlation matrices determined in R, as shown in Fig. 9.45. The second matrix is obviously easier to interpret, since significant correlations are marked.

Exercise 4: “Multiple linear regression with R”


The solution can be found in stream “R_multiple_linear_regression.str”. The first
two steps are described in the exercise. We start here with the third step.

3. We add a Table node and connect it to the model nugget. Figure 9.46 shows the
stream with the new Table node and Fig. 9.47 shows the predicted values.
4. We explained in Sect. 2.4 how to add a comment to a stream. Here, we can’t
assign the comment to a specific node. So we do not mark any node in advance.
We add a comment by using the toolbar. The result is shown below (Fig. 9.48).
5. In the exercise, we described how to disable the automatic data preparation in
the Linear node. See Fig. 9.34.
6. Running the Linear node, we update the model nugget.
7. We get the parameter as shown in Fig. 9.35.
8. We added an R Building node to the stream and connected it with the existing
Type node (Fig. 9.49).
9. Figure 9.50 shows the R script in the “R model building syntax” dialog field (the upper dialog field). The correct lm-command to fit the model is:
modelerModel<-lm(MEDV~CRIM+ZN+INDUS+CHAS+NOX+RM+AGE
+DIS+RAD+TAX+PTRATIO+B+LSTAT,data=modelerData)

Fig. 9.46 Stream with added Table node to show the predicted values

Fig. 9.47 Predicted values by using the SPSS Modeler model

10. To predict the MEDV values by using the determined R model, we can use the same script as explained in detail in Table 9.4. We only have to modify the name of the variable from “posttest_prediction” to “MEDV_prediction”, as highlighted with an arrow in Fig. 9.50. A consolidated sketch of both syntax fields follows after this list.
11. The predicted values are saved in the data frame “modelerData”. We add a Table node to show the predicted values. See Fig. 9.51. The predicted values shown in Fig. 9.52 are the same as the results in Fig. 9.47.
12. The determined parameters of the R model in Fig. 9.53 equal the model parameters of the Linear node shown in Fig. 9.35.
13. Figures 9.54 and 9.55 show the final streams with the comment related to the R
Building, R model nugget, and the Table node.
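For reference, the two syntax fields of the R Building node in this solution contain essentially the following code. This is a consolidated sketch that combines the lm-command from step 9 with the scoring pattern of Table 9.4; the field options follow the same pattern as before:

# --- R model building syntax ---
modelerModel <- lm(MEDV~CRIM+ZN+INDUS+CHAS+NOX+RM+AGE+DIS+RAD+TAX+PTRATIO+B+LSTAT,
                   data=modelerData)
# show the results
summary(modelerModel)
# diagnostic plots
plot(modelerModel)

# --- R model scoring syntax ---
# calculating forecast
result <- predict(modelerModel, newdata=modelerData)
# attach results
modelerData <- cbind(modelerData, result)
# define characteristic of new variable
var1 <- c(fieldName="MEDV_prediction", fieldLabel="", fieldStorage="real",
          fieldMeasure="", fieldFormat="", fieldRole="")
# return to the Modeler and import the new data frame
modelerDataModel <- data.frame(modelerDataModel, var1)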

Fig. 9.48 Stream with added comment

Fig. 9.49 Added R Building node in the original stream



Fig. 9.50 R scripts in the R Building node



Fig. 9.51 Added Table node to show the predicted values

Fig. 9.52 Predicted values using the R model



Fig. 9.53 Text Output tab of the R model nugget



Fig. 9.54 Graph Output tab of the R model nugget



Fig. 9.55 Final stream with added comments

Exercise 5 “Implementing Box Cox Transformation with R”


The solution can be found in stream “R_transform_diabetes”. We start the expla-
nation here with step 4:

4. Following Table 9.2, we use here an R Building node. Figure 9.56 shows the
final stream “R_transform_diabetes.str”.
5. To determine the optimal transformation, the R scripts in Tables 9.7 and 9.8 are used. These scripts equal those in the R script “R_transform_diabetes_data.R”.
Some modifications for the handling of the data are necessary. The comments in
the script explain the commands. First all required libraries are installed in
R. Then these libraries are loaded.
In Table 9.7, we define a function “my.bc.transform”. The Box–Cox transformation is used to determine the optimal exponent to transform the data towards normality (the transformation formula itself is recalled after step 6). Then the original as well as the transformed variable are tested with the Kolmogorov–Smirnov test with Lilliefors modification and the Shapiro–Wilk normality test.

Fig. 9.56 Final stream “R_transform_diabetes.str”

Table 9.6 R Script to calculate the variable “training_expected_total_3”


# get the old values in the data frame "modelerData"
temp_data<-modelerData
# calculating forecast
result<-temp_data$training_days_actual+temp_data$training_days_to_add
# attach results
modelerData<-cbind(modelerData,result)
# define characteristic of new variable
var1<-c(fieldName="training_expected_total_3",fieldLabel="",fieldStorage="real",
fieldMeasure="",fieldFormat="",fieldRole="")
# return to the Modeler and import new dataframe
modelerDataModel<-data.frame(modelerDataModel,var1)

In the main part of the script in Table 9.8, the data from the Modeler are copied to an object “my.data” with the command “my.data<-modelerData”. In the script, we do not have to address the object “modelerDataModel”, because we did not modify the object “modelerData”. So the R Building node only performs the calculations as well as the tests and writes the output to the console. There are no variables that should be returned to the SPSS Modeler.
Finally, in Table 9.8 the function “my.bc.transform” is applied to the variables “glucose_concentration”, “blood_pressure”, “serum_insulin”, “BMI”, and “diabetes_pedigree”. In comparison to the given R script “R_transform_diabetes_data”, the variable names here are slightly different. That is because the function “spss.system.file” used in the R script to import the SPSS file truncates the longer variable names. In the SPSS Modeler, the variable names must be used as they can be found in the Table node.
6. A detailed description of the functionalities can be found in Sect. 3.2.5. The
results mentioned there for the variable “Serum Insulin” are also shown in

Table 9.7 Function “my.bc.transform” in the R Script to perform the Box–Cox transforma-
tion and the normality tests in the stream “R_transform_diabetes”
# user-defined function for transformation and tests
my.bc.transform <- function(org.var, var.name="")
{
print(var.name)
par(mfrow=c(1, 1))
# show Log-Likelihood profile
boxcox(org.var~1)
# determine best lambda for box cox
bc.best.power<-powerTransform(org.var)
cat("Estimated transformation parameter: ",bc.best.power$lambda,"\n\n")
# transform original variable with box–cox
bc.best.data<-bcPower(org.var,bc.best.power$lambda)
par(mfrow=c(1, 2))
# create histograms
hist(org.var, main=var.name)
hist(bc.best.data, main=paste(var.name," transformed", sep=""))
# normal probability plot for original variable
qqnorm(org.var,main=var.name)
qqnorm(bc.best.data, main=paste(var.name," transformed", sep=""))
# === test normality
# Lilliefors test
# H0: normally distributed
print("original variable: ")
print(lillieTest(org.var))
print("transformed variable: ")
print(lillieTest(bc.best.data))
# === perform Shapiro Wilk tests
# H0: normally distributed
print("original variable: ")
print(shapiro.test(org.var))
print("transformed variable: ")
print(shapiro.test(bc.best.data))
# restore old parameter
par(mfrow=c(1, 2))
}

Figs. 9.57 and 9.58. Additionally, the Kolmogorov–Smirnov test result can be found here. The null hypothesis is that the values are normally distributed. We can reject this hypothesis based on the result “p<2.2E-16” for the original variable. The p-value for the transformed values of “Serum Insulin” is much better (“p = 0.4443”, not shown in Fig. 9.57), so the transformed variable can be regarded as normally distributed. The Shapiro–Wilk normality test shows generally the same results for this variable.
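As a reminder of what the function “bcPower” from the car package computes in Table 9.7, the Box–Cox power transformation of a positive variable y with parameter λ is

y(λ) = (y^λ − 1)/λ   for λ ≠ 0,   and   y(λ) = log(y)   for λ = 0.

The function “powerTransform” estimates the λ that moves the distribution of y as close as possible towards normality.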

Table 9.8 Main part of R Script to perform the Box–Cox transformation and the normality
tests in the stream “R_transform_diabetes”
# Automatically install all necessary packages in R
# Source: https://gist.github.com/benmarwick/5054846
list.of.packages <- c("Matrix", "stats","car","MASS","fBasics") # replace xx and yy with package names
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages)
lapply(new.packages, require, character.only=T) # loading the required libraries
require(Matrix)
require(stats)
require(car)
require(MASS)
require(fBasics)
# === determine variable to transform
my.data<-modelerData
my.bc.transform(my.data$glucose_concentration, var.name="Glucose")
my.bc.transform(my.data$blood_pressure, var.name="Blood Pressure")
my.bc.transform(my.data$serum_insulin,var.name="Serum Insulin")
my.bc.transform(my.data$BMI, var.name="BMI")
my.bc.transform(my.data$diabetes_pedigree, var.name="Diabetes Pedigree")

Fig. 9.57 Text Output of the R Building node for the variable “Serum Insulin”

Fig. 9.58 Graph Output of the R Building node for the variable “Serum Insulin”

Literature
CRAN – Package Hmisc. (2015). Package Hmisc. Accessed 13/08/2015, from https://cran.r-
project.org/web/packages/Hmisc/index.html
IBM. (2014a). SPSS Modeler 16 Essentials for R: Installation Instructions. Accessed 18/09/2015,
from ftp://public.dhe.ibm.com/software/analytics/spss/documentation/modeler/16.0/en/mod
eler_r_plugin_install_project_book.pdf
IBM. (2014b). SPSS Modeler 16 R Nodes. Accessed 18/09/2015, from ftp://public.dhe.ibm.com/
software/analytics/spss/documentation/modeler/16.0/en/modeler_r_nodes_book.pdf
IBM. (2015). SPSS Modeler 17 R Nodes. Accessed 18/09/2015, from ftp://public.dhe.ibm.com/
software/analytics/spss/documentation/modeler/17.0/en/ModelerRnodes.pdf
IBM Website. (2015). Downloads for IBM® SPSS® Modeler. Accessed 06/08/2015, from https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/We70df3195ec8_4f95_9773_42e448fa9029/page/Downloads%20for%20IBM%C2%AE%20%20SPSS%C2%AE%20%20Modeler
myowelt.blogspot.de. (2015). R script correlation matrix improved. Accessed 13/08/2015, from
http://myowelt.blogspot.de/2008/04/beautiful-correlation-tables-in-r.html.
10 Appendix

10.1 Data Sets Used in This Book

10.1.1 adult_income_data.txt

The dataset was downloaded from the UCI Machine Learning Repository (1996) and contains census data of 32,651 people. The data were originally extracted from a census bureau database. The variables in the dataset are listed in Table 10.1.

10.1.2 beer.sav

Based on data found on the webpages of Hoffmann-Beverages (2014) and Beer-


Shop-Hamburg (2014), in this dataset the characteristics of different beers are
listed. Table 10.2 shows the details.

10.1.3 benchmark.xlsx

This dataset includes the performance test results of personal computer processors
published in c’t Magazine for IT Technology (2008). The names of the processors
can be found alongside the names of the manufacturers Intel and AMD. The
processor speed was determined using the “Cinebench” benchmark test (see
Table 10.3).


Table 10.1 Variables defined in “adult_income_data.txt”


Name of variable Description
Age Age of the participant
Sex Gender of the participant (Female, Male)
Workclass Workclass of the participant (Private, Self-emp-not-inc, Self-emp-inc,
Federal-gov, Local-gov, State-gov, Without-pay, Never-worked)
Fnlwgt Final sampling weight
Education Education status (Bachelors, Some-college, 11th, HS-grad, Prof-school,
Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th,
Doctorate, 5th-6th, Preschool)
Education-num Education status numerical represented (1,. . .,16)
Marital-status Marital-status of the participant (Married-civ-spouse, Divorced, Never-
married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse)
Occupation Job area of the participant (Tech-support, Craft-repair, Other-service, Sales,
Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct,
Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv,
Protective-serv, Armed-Forces)
Relationship Relationship status (Wife, Own-child, Husband, Not-in-family, Other-
relative, Unmarried)
Race Race (White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black)
Capital-gain Profit that results from a sale of a capital asset.
Capital-loss The loss incurred when a capital asset decreases in value
Hours-per-week Working hours per week
Native-country Native country (United States, Cambodia, England, Puerto-Rico, Canada,
Germany, Outlying-US (Guam-USVI-etc.), India, Japan, Greece, South,
China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam,
Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador,
Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland,
Thailand, Yugoslavia, El-Salvador, Trinidad & Tobago, Peru, Hong,
Holland-Netherlands)
Income Income of the participant (>50 K, ≤50 K)

Table 10.2 Variables in Name of variable Description


dataset “beer.sav”
Name Name of the beer
Price Price per liter
Calories Calories per 100 ml
Alcohol Alcohol in percent

Table 10.3 Variables Firm Name of the processor company


defined in “benchmark.
Processor type Name of the processor
xlsx”
EUR Price of the processor
CB Score in the Cinebench 10 test

10.1.4 car_simple.sav

The data represent the prices of six cars in different size categories. The dataset
includes the name of the manufacturer, the type of car, and the price. We formally
declare that the prices are not representative of the models and types mentioned.
Table 10.4 shows the values.
This dataset was created based on an idea presented in Handl (2010), p. 364–383.

10.1.5 car_sales_modified.sav

The file “car_sales_modified.sav” is a modified version of “car_sales_knn_mod.


sav”, originally provided by IBM with the SPSS Modeler. Table 10.5 shows the
variables and a short description.

10.1.6 chess_endgame_data.txt

The dataset was downloaded from the UCI Machine Learning Repository (1994) and contains data of 28,056 chess endgame black-to-move positions with a white king and rook against a black king. For those not familiar with chess: in this position, only two outcomes are possible, white winning or a

Table 10.4 Prices of different cars


ID Manufacturer Model Dealer (Possible) Price in 1000 USD
1 Nissan Versa ABC motors 13
2 Kia Soul California motors 19
3 Ford F-150 Johns test garage 27.5
4 Chevrolet Silverado 1500 Welcome cars 28
5 BMW 3 series Four wheels fine 39
6 Mercedes-Benz C-Class Best cars ever 44

Table 10.5 Variables defined in “car_sales_modified.sav”
Field name Description
manufact Manufacturer
type Vehicle type: 0 = Automobile, 1 = Truck
fuel_cap Fuel capacity
sales Sales in thousands
model Model
resale 4-year resale value
price Price in thousands
horsepow Horsepower
width Width of the car
length Length of the car

draw. The dataset comprises seven variables, six of which describe the positions of the three pieces on the board, while the last one states the number of moves white needs to win or whether the game ends with a draw. When white has not won the game within 16 moves, the game automatically ends with a draw (Table 10.6).

10.1.7 customer_bank_data.csv

This dataset was created by merging data found on the IBM Website (2014). The
records describe several characteristics of the customers of a bank. Additionally,
the customers are marked as having defaulted, or not. See Table 10.7 for details.

10.1.8 diabetes_data_reduced.sav

This dataset comes from the Machine Learning Repository (1990) and originally
from the National Institute of Diabetes and Digestive and Kidney Diseases. It
represents the data from a study of the Pima Indian population. The Pima Indians

Table 10.6 Variables defined in “chess_endgame_data.txt”


Name of variable Description
White King file Column of the white’s king position (A,. . ., H)
White King rank Row of the white’s king position (1,. . .,8)
White Rook file Column of the white’s rook position (A,. . .,H)
White Rook rank Row of the white’s rook position (1,. . .,8)
Black King file Column of the black’s king position (A,. . .,H)
Black King rank Row of the black’s king position (1,. . .,8)
Result for White Outcome of the game for white (win, draw)

Table 10.7 Variables in dataset “customer_bank_data.csv”


Name of variable Description
CUSTOMERID ID of the customer
AGE Age
EDUCATION “school”, “undergraduate”, “postgraduate”, “doctorate”, “postdoctoral research” . . . stored as a non-numerical field
YEARSEMPLOYED Years employed
INCOME Annual income in 1,000 USD
CARDDEBT Debt accessed through credit cards, in 1,000 USD
OTHERDEBT Customer’s debt other than credit cards, in 1,000 USD
DEFAULTED Did customers fail to make a payment? 0 = no, 1 = yes, $null$ = no information available
ADDRESS Address code

are affected by higher rates of diabetes and obesity than the average. See Schulz
et al. (2006).
The included variables, as well as their meaning, can be found in Table 10.8. We
removed all records with missing values in any of the variables and converted the
data into an SPSS-data file. 392 records are included. All patients are female and at
least 21 years old.

10.1.9 DRUG1n.sav

The dataset contains data from a drug treatment study. The patients in this study all suffer from the same illness but respond to different medications. The data is provided by the SPSS Modeler as the basis of a demo, see IBM (2015), p. 73. The variables in the dataset are listed in Table 10.9.

Table 10.8 Variables in dataset “diabetes_data_reduced.sav”


Name of variable Description
glucose_concentration Plasma glucose concentration in an oral glucose tolerance test
blood_pressure Diastolic blood pressure (mm Hg)
serum_insulin 2-Hour serum insulin (mu U/ml)
BMI Body mass index (weight in kg/(height in m)^2)
diabetes_pedigree Diabetes pedigree function (DBF)
Developed by Smith et al. (1988). The Diabetes Pedigree Function
(DBF) “provides a synthesis of the diabetes mellitus history in
relatives and the genetic relationship of those relatives to the subject.
The DPF uses information from parents, grandparents, full and half
siblings, full and half aunts and uncles, and first cousins. It provides a
measure of the expected genetic influence of affected and unaffected
relatives on the subject’s eventual diabetes risk.” Smith et al. (1988),
p. 262
times_pregnant Number of pregnancies
skin_thickness Tricep skin fold thickness (mm)
age Age in years
class_variable Diabetes test result

Table 10.9 Variables defined in “DRUG1n.sav”


Name of variable Description
Age Age of the patient
Sex Gender of the patient (M, F)
BP Blood pressure (HIGH, NORMAL, LOW)
Cholesterol Blood cholesterol (NORMAL, HIGH)
Na Blood sodium concentration
K Blood potassium concentration
Drug Prescription drug to which a patient responded

10.1.10 EEG_Sleep_Signals.csv

The dataset contains EEG signal data of a single person in a drowsy and in an awake state. The electrical impulses of the brain are measured every 10 ms, and the data is split into segments of 30 s. For information on EEG, we refer to Niedermeyer et al. (2011) (Table 10.10).

10.1.11 employee_dataset_001 and employee_dataset_002

These Microsoft Excel datasets were generated for demonstration purposes only.
The sample sizes are three records each. Tables 10.11 and 10.12 show the structure
of the sets.

10.1.12 England Payment Datasets

The source of these data is the UK Office for National Statistics and its website
NOMIS UK (2014). These data are based on an annual survey of hours and
earnings—workplace analysis coming from the Annual Survey of Hours and
Earnings (ASHE).
The median of the weekly or hourly payments has been downloaded. Note that the medians of different variables cannot simply be added up to obtain another variable included in the dataset; e.g., the median of the weekly payment excluding overtime plus the median of the overtime payment does not equal the median of the weekly gross payment.

Table 10.10 Variables defined in “EEG_Sleep_Signal.csv”


Name of variable Description
Ms_x EEG signal at time x in a 30-second signal segment. X is the time in milliseconds and lies between 0 and 29,990.
Sleepiness Indicator variable stating whether the segment was measured during sleepiness (Yes/No)

Table 10.11 Dataset “employee_dataset_001”


customer_ID Automatically generated unique ID of a customer
monthly_salary Monthly salary of the employee
employer Firm name of the employer

Table 10.12 Dataset “employee_dataset_002”


customer_ID Automatically generated unique ID of a customer
family_status Family status of the employee. No variable labels or codes are used here
car Manufacturer name of the staff car. No variable labels or codes are used here

Table 10.13 shows the variables from the CSV file for 2013. Table 10.14 shows the
variables for 2014.
The payments for female and male employees in 2014 are included in the
Microsoft Excel files “england_payment_fulltime_female_2014” and “england_
payment_fulltime_male_2014”. The variables are the same as those described in
Tables 10.13, 10.14, and 10.15.
The coefficient of variation is described on the website NOMIS UK (2014) as
follows:

Table 10.13 Variables defined in “england_payment_fulltime_2013.csv”


admin_description Represents the type of the region and the name, separated by “:”. For the type of the region, see Table 10.15
area_code A unique identifier for the area/region
weekly_payment_gross Median of sum of weekly payments
weekly_payment_gross_CV Coefficient of variation from the value above
weekly_payment_excluding Median of weekly payment excluding overtime
weekly_payment_excluding_CV Coefficient of variation from the value above
weekly_payment_basic Median of basic weekly payment
weekly_payment_basic_CV Coefficient of variation from the value above
overtime_payment Median of weekly payment for overtime
overtime_payment_CV Coefficient of variation from the value above
hourly_payment Median of the sum of hourly payments
hourly_payment_CV Coefficient of variation from the value above
hourly_payment_excluding Median of the hourly payments excluding overtime
hourly_payment_excluding_CV Coefficient of variation from the value above
annual_payment_gross Median of the sum of annual payments
annual_payment_gross_CV Coefficient of variation from the value above
annual_payment_incentive Median of the incentives per year
annual_payment_incentive_CV Coefficient of variation from the value above
hours_worked_total Median of the number of working hours per week
hours_worked_total_CV Coefficient of variation from the value above
hours_worked_basic Median of the basic hours worked per week
hours_worked_basic_CV Coefficient of variation from the value above
hours_worked_overtime Median of the overtime per week
hours_worked_overtime_CV Coefficient of variation from the value above

Table 10.14 Variables defined in “england_payment_fulltime_2014_reduced”


admin_description Represents the type of the region and the name, separated by “:”. For the type of the region, see Table 10.15
area_code A unique identifier for the area/region.
weekly_payment_gross Median of sum of weekly payments
weekly_payment_gross_CV Coefficient of variation from the value above

Table 10.15 Codes of variable “District code” and their meaning
ualad09 District
pca10 Parliamentary constituencies
gor Region

Table 10.16 Codes of variable “CV Value” and their meaning in relation to the quality of an estimation
5 % or lower Precise
over 5 %, up to 10 % Reasonably precise
over 10 %, up to 20 % Acceptable, but use with caution
over 20 % Unreliable, figures suppressed

Table 10.17 Variables defined in “Features_eeg_signals.csv”


Name of variable Description
Activity Variation of the signal
Mobility Represents mean frequency
Complexity Describes the change in frequency
Range Difference of maximum and minimum value of the signal
Crossings0 Number of x-axis crossings of the standardized signal
Sleepiness Indicator variable if the segment is measured while sleepiness (Yes/No)

“The quality of an estimate can be assessed by referring to its coefficient of


variation (CV), which is shown next to the earnings estimate. The CV is the ratio of
the standard error of an estimate to the estimate. Estimates with larger CVs will be
less reliable than those with smaller CVs.
In their published spreadsheets, ONS use the following CV values to give an
indication of the quality of an estimate . . .”. For details, see Table 10.16.

10.1.13 Features_eeg_signals.csv

The dataset contains aggregated data obtained from the EEG dataset “EEG_Sleep_Signal.csv” (Sect. 10.1.10). The features were calculated for each EEG signal segment of the mentioned dataset. The first three features are called Hjorth parameters, see Niedermeyer et al. (2011) and Oh et al. (2014) (Table 10.17).

10.1.14 gene_expression_leukemia.csv

The dataset contains gene expression of various leukemia patients. The data were
measured on 851 positions on the human genome which refer to the genes from a
list of cancer related genes, which were consolidated in Futreal et al. (2004). The
dataset here is a subset of the huge leukemia data that were the basis of the study
Haferlach et al. (2010). In this study, however, gene expression measurements of
more genes were included. The subset here was downloaded from the open source

Table 10.18 Variables defined in “gene_expression_leukemia.csv”


Name of
variable Description
Patient_ID Identification number of the patient
number_at Gene expression on the particular location on the human genome. These position identifiers consist of a cryptic number and “at” at the end. For example “1563591_at”. There are 851 different positions in the dataset
Leukemia The type of leukemia of the patient. One of the following: AML, ALL, CML,
CLL, Non-leukemia and healthy bone marrow

Table 10.19 Variables defined in “gene_expression_leukemia_short.csv”


Name of variable Description
Patient_ID Identification number of the patient
number_at Gene expression on the particular location on the human genome. These
position identifier consist of a cryptic number and “at” at the end. For
example “1563591_at”. There are 39 different positions in the dataset
Leukemia The type of leukemia of the patient. One of the following: AML, ALL,
CML, CLL, Non-leukemia and healthy bone marrow

project Leukemia Gene Atlas Hebestreit et al. (2012), which is an open repository
of leukemia datasets and studies (Table 10.18).
The term gene expression and its relation to gene regulation is explained in
O’Connor and Adams (2010). For basic background information on leukemia, we
refer to National cancer Institute (2013).

10.1.15 gene_expression_leukemia_short.csv

The dataset is a subset of the “gene_expression_leukemia.csv” data in Sect. 10.1.14


and contains gene expression of various leukemia patients on 39 selected locations
of the human genome. These genome positions refer to the genes NPM1, RUNX1,
HOXA1, . . ., HOXA11, HOXA13. These genes are commonly known to be
relevant for leukemia. For references, we point to Sect. 10.1.14. For details see
also Table 10.19.

10.1.16 gravity_constant_data.csv

This dataset has been generated for demonstration purposes only. The sample size is three records. Table 10.20 shows the structure of the dataset.

Table 10.20 Variables defined in “gravity_constant_data.csv”


Name of variable Description
Height (m) Height in meters
Seconds squared Squared time in seconds until the falling object hits the ground

Table 10.21 Variables defined in “Housing.data.txt”


Name of variable Description
CRIM Per capita crime rate by town
ZN Proportion of residential land zoned for lots over 25,000 sq.ft.
INDUS Proportion of non-retail business acres per town
CHAS Charles River dummy variable (tract bounds river = 1; otherwise = 0)
NOX Nitric oxides concentration (parts per 10 million)
RM average number of rooms per dwelling
AGE Proportion of owner-occupied units built prior to 1940
DIS Weighted distances to five Boston employment centers
RAD Index of accessibility to radial highways
TAX Full-value property-tax rate per $10,000
PTRATIO Pupil–teacher ratio by town
B 1000(Bk – 0.63)^2 where Bk is the proportion of African/Americans by town
LSTAT percentage of population with low status
MEDV Median value of owner-occupied homes in $1000’s

10.1.17 Housing.data.txt

The dataset was downloaded from the Machine Learning Repository (1993). It
contains housing values for certain Boston suburbs. Details can be found in
Harrison and Rubinfeld (1978) and Gilley and Pace (1996). The variables included
are described in Table 10.21.

10.1.18 Iris.csv

This dataset contains the sepal and petal width and length of 50 flowers for each of
the three iris species. It is included in R, version 3.1.0, and can be loaded with the R command “data(iris)”. For details, see Fisher (1936) and Longley
(1967). Table 10.22 explains the variables and their meaning.

Table 10.22 Variables defined in “Iris.csv”
Name of variable Description
Sepal Length Sepal length in centimeters
Sepal Width Sepal width in centimeters
Petal Length Petal length in centimeters
Petal Width Petal width in centimeters
Species Species name of the iris flower: setosa, versicolor, virginica

Table 10.23 Variables defined in “IT-projects.txt”


Project number ID of the project
Time for development of the The time that is necessary to write the source instructions and
software (TDEV) to deliver the application to the customer. The unit is months
Person months (PM) The sum of the number of months the development process
needs per person
1000 (K) delivered source Number of source instructions written
instructions (KDSI)

10.1.19 IT-projects.txt

This dataset includes several variables related to IT Projects. For details, see
Table 10.23.

10.1.20 IT user satisfaction.sav

This dataset represents the opinions of IT users in a particular firm. 180 users were
asked to assess the quality of the IT system. The questionnaire used to collect the
data is based on Heinrich (2002b). A detailed description can be found in Heinrich
(2002a). The dataset used here is a modified version, first presented by Wendler
(2004). For details, see Table 10.24.

10.1.21 longley.csv

This dataset is included in R, version 3.1.0, and can be loaded with the R
command "data(longley)". For details, see Longley (1967).
The data are macroeconomic data with several yearly observed economic
variables from 1947 to 1962. See Table 10.25.

Table 10.24 Variables defined in “IT user satisfaction.sav”


Degree of satisfaction with technically related success factors
starttime How satisfied are you with the length of time the IT system needs from
booting the system until it is ready to start daily needed applications?
1 = poor, 3 = fair, 5 = good, 7 = excellent
time_to_start Estimate the time you spend every day getting ready to work with the
system, instead of actually doing work (turning it on, logging in,
starting Windows)?
Options:
3 = less than 3 min,
5 = 3 to 5 min,
10 = 6 to 10 min,
11 = more than 10 min
system_availability Make an assessment of how much downtime you have with your
present resources, (i.e., software crashes, disruption during data
transfer, or access to resources) in comparison to your actual working
time. Rate fulfillment of the aim of minimizing downtime.
1 = poor, 3 = fair, 5 = good, 7 = excellent
time_loss How much working time do you lose approximately PER WEEK
because of program or system crashes and the tasks you then have to
repeat?
Answer options:
0 = insignificant,
10 = up to 10 min,
20 = 10 to 20 min,
30 = 20 to 30 min, 31 = more than 30 min
performance How do you evaluate the timeliness of the system’s available features,
the programs and data, the relationship between waiting time and
overall duration of the process (i.e., program boot time, time needed to
access a database, printing time, etc.)?
1 = poor, 3 = fair, 5 = good, 7 = excellent
increase_performance Estimate how much of an improvement is needed in the execution
performance of programs, for them to work more effectively?
Answer options:
0 = insignificant,
2 = double,
3 = more than double
Degree of satisfaction with success factors related to organizational processes
user_qualification How do you evaluate your knowledge, abilities, and accomplishments
in dealing with the IT system and the provided applications?
1 = poor, 3 = fair, 5 = good, 7 = excellent
training_quality How do you evaluate the extent and the quality of the provided training
and materials (i.e., the manuals, availability of online help, etc.)?
1 = poor, 3 = fair, 5 = good, 7 = excellent
training_days_actual How many working days per year do you actually spend on IT
training?
Answer options:
0 = none, 1 = 1 day, 2 = 2 days, 3 = more than 2 days
training_days_to_add How many working days should be available per year for training and
further education, in addition to the abovementioned existing time
budget for training during working time?
Answer options:
0 = none, 1 = 1 day, 2 = 2 days, 3 = more than 2 days
support_quality How do you evaluate the quality of the support that is provided with
the system, particularly when there are problems/failures/defects?
1 = poor, 3 = fair, 5 = good, 7 = excellent
support_reachability How do you evaluate the availability of user support staff when there
are problems with the IT system?
1 = poor, 3 = fair, 5 = good, 7 = excellent
Degree of satisfaction of other success factors
user_orientation How do you evaluate the ability of the IT system to support you in
accomplishing your tasks (i.e., availability of the necessary programs
and complete/appropriate functions)?
1 = poor, 3 = fair, 5 = good, 7 = excellent
usability How do you evaluate the usability of the IT system (i.e., simplicity of
interaction, transferability of data between different programs)?
1 = poor, 3 = fair, 5 = good, 7 = excellent
data_quality How do you evaluate the possibilities of access to electronic data (i.e.,
completeness, correctness, and actuality of data, user knowledge about
the data)?
1 = poor, 3 = fair, 5 = good, 7 = excellent
slimness Is the applied hardware and software optimally utilized? Rate the
minimization of unused capacities (i.e., no redundant workplace
printers, no needlessly complex or unused software)
1 = poor, 3 = fair, 5 = good, 7 = excellent

Table 10.25 Variables defined in “longley.csv”


Name of variable Description
GNP.deflator GNP implicit price deflator (1954 = 100)
GNP Gross National Product
Unemployed Number of unemployed people
Armed.Forces Number of people in the armed forces
Population "Noninstitutionalized" population ≥ 14 years of age
Year Year (time)
Employed Number of people employed

10.1.22 LPGA2009.csv

This data from the Journal of Statistical Education Data Archive (2009) represents
performance and success statistics for golfers on the LPGA tour in 2009. See
Table 10.26 for more details.

Table 10.26 Variables defined in “LPGA2009.csv”


Name of variable Description
Golfer id Identification number of the golfer
Average drive (yards) Average number of yards of long distance shots
Percent of fairways hit Percentage of hits with the ball touching the fairway
afterwards
Percent of greens reached in regulation Percentage of holes where the green is reached with at least two strokes left to par
Average putts per round Average number of shots on the green
Percent of sand saves (2 shots to hole) Percentage of sand saves (player achieves par with two shots from a greenside bunker)
Ln(prize) Prize money on a logarithmic scale
Tournaments played in Number of tournaments the golfer played
Green in regulation putts per hole Putts per hole on greens reached in regulation (at least two strokes left to par)
Completed tournaments Number of finished tournaments of a golfer
Average percentile in Average percentile in tournaments (high is good)
tournaments (high is good)
Rounds completed Number of completed rounds
Average strokes per round Average number of strokes per round

Table 10.27 Variables defined in “Mtcars.csv”


Name of variable Description
mpg Miles per gallon
cyl Number of cylinders
disp Displacement (cu.in.)
hp Gross horsepower
drat Rear axle ratio
wt Weight (lb/1000)
qsec 1/4 mile time
vs Engine shape (0 = V-shaped, 1 = straight)
am Transmission (0 = automatic, 1 = manual)
gear Number of forward gears
carb Number of carburetors

10.1.23 Mtcars.csv

This dataset is included in the R package datasets, version 3.1.0, and can be loaded
with the R command "data(mtcars)". For details, see Henderson and
Velleman (1981).
Data represent ten performance and design parameters, as well as the fuel
consumption of 32 automobiles from 1974 (see Table 10.27).

10.1.24 nutrition_habites.sav

The key idea of this dataset and some of the steps and interpretations that can be
found in this book are based on explanations from Bühl (2012). The authors created
a completely new dataset of their own, however.
In relation to the “diet types”:

– Vegetarian
– Low meat
– Fast food
– Filling
– Hearty

the consumers were asked “Please indicate which of the following dietary
characteristics describe your preferences. How often do you eat . . .”.
The respondents had the chance to rate their preferences on a scale “(very)
often”, “sometimes”, and “never”.
The ID is an ordinal variable because the values can be ordered, but because of
the way the role “none” has been defined, the scale type in fact does not matter. All
the other variables are coded as follows: "1 = never", "2 = sometimes", and "3 =
(very) often”. The variables are ordinally scaled.

10.1.25 optdigits_training.txt, optdigits_test.txt

The datasets contain information on the handwritten digits of 43 people; 30 of them
contributed to the training set and the other 13 people to the test set. The
handwriting in the two datasets therefore comes from different people. Each digit
was written on a 32 × 32 bitmap, which was divided into disjoint segments of 4 × 4
pixels. In each of these segments, the number of black pixels was counted. Hence,
each handwritten digit in the dataset consists of 8 × 8 fields, each holding an
integer value between 0 and 16.
The datasets were derived from the UCI Machine Learning Repository, Lichman
(2013), and are available at Machine Learning Repository (1998). The training data
consist of 3823 instances and the test set of 1797 handwritten digits (Table 10.28).
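To clarify how the 64 block counts arise, the following R sketch applies the described
reduction to a randomly generated 32 × 32 bitmap; it is for illustration only and is
not the original preprocessing code.

# Illustrative only: reduce a 32 x 32 binary bitmap to 8 x 8 block counts.
set.seed(1)
bitmap <- matrix(rbinom(32 * 32, size = 1, prob = 0.3), nrow = 32)  # 1 = black pixel
block_counts <- matrix(0L, nrow = 8, ncol = 8)
for (i in 1:8) {
  for (j in 1:8) {
    rows <- ((i - 1) * 4 + 1):(i * 4)
    cols <- ((j - 1) * 4 + 1):(j * 4)
    block_counts[i, j] <- sum(bitmap[rows, cols])  # 0 to 16 black pixels per 4 x 4 block
  }
}
block_counts  # corresponds to Field1, ..., Field64 after flattening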

Table 10.28 Variables defined in “optdigits_training.txt” and “optdigits_test.txt”


Name of variable Description
FieldX Number of black pixels in block X. X is between 1 and 64
Field65 Handwritten digit (0,. . ., 9)

10.1.26 Orthodont.csv

The dataset contains measures of the orthodontic change (distance) over time of
27 teenagers. It is included in the package "nlme" of R, version 3.1.0, and can be
loaded with the R command "data(Orthodont)". See Potthoff and Roy (1964)
for details. Table 10.29 explains the variables and their meaning.

10.1.27 Ozone.csv

This dataset contains meteorology data and ozone concentration from the Los
Angeles Basin in 1976. It is included in the package "faraway" of R, version
3.1.0, and can be loaded with the R command data(ozone, package = "faraway"). See
Breiman and Friedman (1985) for details. Table 10.30 explains the variables and
their meaning.

10.1.28 pisa2012_math_q45.sav

This dataset includes the answers to a questionnaire of the Organisation for


Economic Cooperation and Development (OECD), related to the Programme for
International Student Assessment, and can be freely downloaded from the website

Table 10.29 Variables defined in “Orthodont.csv”


Name of variable Description
Distance Distance from the pituitary to the pterygomaxillary fissure (mm)
Age Age of the patient
Subject Indicator of the patient
Sex Gender of the patient (levels: Male, Female)

Table 10.30 Variables defined in “Ozone.csv”


Name of variable Description
O3 Daily maximum of the hourly average ozone concentration
Vh 500 millibar pressure height
Wind Wind speed in mph
Humidity Humidity in percent
Temp Temperature in degrees Fahrenheit
Ibh Temperature inversion base height in feet
Dpg Pressure gradient in mm Hg
Ibt Inversion base temperature in degrees Fahrenheit
Vis Visibility
Day Day of year

OECD (2012b). To reduce the number of records and variables, the authors of this
book preprocessed the data as follows (a sketch of these steps is shown after the list):

– Responses to the student questionnaire type A are used.


– Only the answers given by German students are selected.
– Question 45 “Thinking about mathematical concepts: how familiar are you with
the following terms?” will be addressed here.
– Records with missing values have been completely removed.
– Variables unrelated to question number 45 stated above have been removed.
– As outlined in OECD (2012a), p. 17 and Stacey and Turner (2015), p. 213,
additional questions were asked to make sure the students gave correct answers.
If the answers coded with “ST62Q04”, “ST62Q11”, and “ST62Q13” indicate
incorrect responses, the records were removed from the dataset.
– The question with the code “ST62Q17” was removed because it relates to the
statistical measure of a central tendency “mean”. Therefore, it is not related to all
the other concepts/mathematical terms addressed here.
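
The following R lines sketch these filtering steps. The variable holding the country
code (CNT with the value "DEU"), the name of the imported file, and the rule used for
the check items (dropping respondents who claim familiarity, i.e., an answer of 4 or 5,
with any of them) are assumptions made for illustration; the authors' original
preprocessing may differ.

# Hedged sketch of the preprocessing; file name, variable names, and the exact
# check-item rule are assumptions, not the authors' original code.
pisa <- read.csv("pisa2012_student_questionnaire_A.csv")   # hypothetical raw data export
q45_items <- c("ST62Q01", "ST62Q02", "ST62Q03", "ST62Q06", "ST62Q07", "ST62Q08",
               "ST62Q09", "ST62Q10", "ST62Q12", "ST62Q15", "ST62Q16", "ST62Q19")
check_items <- c("ST62Q04", "ST62Q11", "ST62Q13")          # control questions

pisa <- pisa[pisa$CNT == "DEU", ]                          # keep German students (assumed code)
overclaims <- rowSums(pisa[, check_items] >= 4, na.rm = TRUE) > 0
pisa <- pisa[!overclaims, ]                                # drop implausible responses (assumed rule)
pisa <- pisa[, q45_items]                                  # drop ST62Q17 and unrelated variables
pisa <- pisa[complete.cases(pisa), ]                       # remove records with missing values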

The sample size is 551. The responses are related to the question “Thinking
about mathematical concepts: how familiar are you with the following terms?”
Details can be found in Table 10.31. Table 10.32 shows the coding of the answers.

Table 10.31 Variable name and topic of the question

Variable name Mathematical term/topic addressed
ST62Q01 Exponential Function
ST62Q02 Divisor
ST62Q03 Quadratic Function
ST62Q06 Linear Equation
ST62Q07 Vectors
ST62Q08 Complex Number
ST62Q09 Rational Number
ST62Q10 Radicals (equals roots, e.g., square root)
ST62Q12 Polygon
ST62Q15 Congruent Figure
ST62Q16 Cosine
ST62Q19 Probability

Table 10.32 Coding of the answers

Answer code Meaning
1 Never heard of it
2 Heard of it once or twice
3 Heard of it a few times
4 Heard of it often
5 Know it well, understand the concept

10.1.29 sales_list.sav

This dataset was created by the authors of this book, based on an idea presented in
IBM (2014b), p. 57. For details, see Table 10.33.

10.1.30 ships.csv

This dataset is included in the R package MASS, version 3.1.0, and can be loaded
with the R command "data(ships)". For details, see McCullagh and
Nelder (1983).
Data give the number of incidents, the year of construction, the aggregated
months of service, and the period of operation for 40 ships. For details, see
Table 10.34.
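As a sketch, the CSV file can be reproduced from the MASS package without attaching it,
using the package argument of data():

data(ships, package = "MASS")                   # type, year, period, service, incidents
head(ships)                                     # inspect the first rows
write.csv(ships, "ships.csv", row.names = FALSE)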

10.1.31 test_scores.sav

This dataset “test_scores.sav” comes with the IBM SPSS Modeler Version
16 (Table 10.35).

Table 10.33 Variables defined in "sales_list.sav"

consumer_ID Internal ID of the consumer
transaction_date Date of the purchase
gender Gender of the consumer
1 . . . female
2 . . . male
article Product purchased
1 . . . butter
2 . . . bread
3 . . . salt
4 . . . honey
5 . . . beer
payment_method Payment method
1 . . . cash
2 . . . credit card

Table 10.34 Variables defined in “ships.csv”


Name of variable Description
Type Ship type (“A” to “E”)
Year Year of construction: 1960–64, 65–69, 70–74, 75–79 (coded as "60",
"65", "70", "75")
Period Period of operation: 1960–74, 75–79
Service Aggregate months of service
Incidents Number of damage incidents

Table 10.35 Variables defined in “test_scores.sav”


Field name Description
school Name of the school
school_setting School setting:
1 = Urban, 2 = Suburban, 3 = Rural
school_type School type
1 = Public, 2 = Nonpublic
classroom Classroom number
teaching_method Teaching method
0 = Standard, 1 = Experimental
n_student Number of students in the classroom
student_id Student ID
gender Gender of the student
0 = Male, 1 = Female
lunch Reduced/Free lunch
1 = Qualifies for reduced/free lunch, 2 = Does not qualify
pretest Result of Pretest
posttest Result of Posttest

It can be used to answer several questions such as:

– Is a good pre-exam result an indicator for a good final-exam score?


– Is there a relationship between the final-exam score and other variables?

Table 10.35 shows the fields and a short description. See also IBM (2014c).

10.1.32 Titanic.xlsx

The dataset contains information about the Titanic passengers, including whether
they survived the sinking. The data does not contain crew information. The data
were collected by Thomas Cason. See Vanderbilt University School of Medicine
(2004). Table 10.36 lists the variables of the dataset with their meaning.

10.1.33 tree_credit.sav

General link: C:\Program Files\IBM\SPSS\Modeler\16\Demos\


Source: The dataset comes with the IBM SPSS Modeler and will be installed on
the hard disc.
Table 10.37 from IBM (2014a), p. 19 shows the fields and a short description.

Table 10.36 Variables defined in “Titanic.xlsx”


Name of variable Description
Name Name of the passenger
Pclass Indicator of the socioeconomic status of the passenger (1 = Upper,
2 = Middle, 3 = Low)
Survived Flag if the passenger survived the sinking (0 = No, 1 = Yes)
Sex Gender of the passenger (female, male)
Age Age in years of the passenger
Sibsp Number of siblings/spouses aboard
Parch Number of parents/children aboard
Ticket Number of the ticket
Fare Passenger fare paid
Cabin Cabin number of the passenger
Embarked Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

Table 10.37 Variables defined in "tree_credit.sav"


Field name Description
Credit_rating Credit rating: 0 = Bad, 1 = Good, 9 = missing values
Age Age in years
Income Income level: 1 = Low, 2 = Medium, 3 = High
Credit_cards Number of credit cards held: 1 = less than five, 2 = five or more
Education Level of education: 1 = High school, 2 = College
Car_loans Number of car loans taken out: 1 = None or one, 2 = More than two

10.1.34 wine_data.txt

The dataset was downloaded from the UCI Machine Learning Repository, see Machine
Learning Repository (1991), and contains chemical analysis data of three Italian
wines from different cultivators. In the analysis of the wines, 13 indicators were
determined for each of the three wines (Table 10.38).

10.1.35 WisconsinBreastCancerData.csv

The dataset was downloaded from the UCI Machine Learning Repository, see Machine
Learning Repository (1992). The dataset, originally from William H. Wolberg
(2003), represents medically related values determined to diagnose breast cancer.
See also Wolberg and Mangasarian (1990) for details. The variables are described
in Table 10.39.

Table 10.38 Variables defined in “wine_data.txt”


Name of variable Description
Wine Indicator of the wine (1, 2, 3)
Alcohol Per mil of alcohol in the wine
Malic_acid Malic acid value of the wine in pH
Ash Amount of ash content in the wine
Alcalinity_of_ash Amount of organic acid salt in the wine
Magnesium Amount of magnesium in the wine
Total_phenols Measure for the total number of phenols in the wine
Flavanoids Proportion of flavonoids in the wine
Nonflavanoid_phenols Proportion of nonflavonoids in the wine
Proanthocyanins Proportion of proanthocyanins in the wine
Color_intensity Measure for the intensity of the wine’s color
Hue Another color measure that describes wine aging
OD280_OD315_of_diluted_wines Proportion of OD280/OD315 in diluted wine
Proline Amount of proline in the wine

Table 10.39 Variables defined in “WisconsinBreastCancerData.csv”


Name of variable Description
Sample code number Number of the sample
Class 2 for benign, 4 for malignant
Clump thickness Clump thickness on a scale from 1 to 10
Uniformity of cell size Uniformity of cell size on a scale from 1 to 10
Uniformity of cell shape Uniformity of cell shape on a scale from 1 to 10
Marginal adhesion Marginal adhesion on a scale from 1 to 10
Single epithelial cell size Single epithelial cell size on a scale from 1 to 10
Bare nuclei Bare nuclei (0–10)
Bland chromatin Bland chromatin (1–10)
Normal nucleoli Normal nucleoli (1–10)
Mitoses Mitosis (1–10)

10.1.36 z_pm_customer1.sav

The dataset contains historical data of several offers made to customers in different
campaigns. The data is originally provided as the file pm_customer1.sav with the SPSS
Modeler as the basis of a demo; see IBM (2015), p. 35.

Literature
Beer-Shop-Hamburg. (2014). Beer from all over the world. Accessed 26/08/2014, from http://
www.biershop-hamburg.de/Biere-aus-aller-Welt-17

Breiman, L., & Friedman, J. H. (1985). Estimating optimal transformations for multiple regression
and correlation. Journal of the American Statistical Association, 80(391), 580–598.
Bühl, A. (2012). SPSS 20: Einführung in die moderne Datenanalyse, Scientific tools (13th ed.).
München: Pearson.
c’t Magazine for IT Technology. (2008). CPU-Wegweiser: x86-Prozessoren im Überblick, Vol.
2008 No. 7, pp. 178–182.
Fisher, R. A. (1936). The use of multiple measurement in taxonomic problems. Annals of
Eugenics, 7(2), 179–188.
Futreal, P. A., Coin, L., Marshall, M., Down, T., Hubbard, T., Wooster, R., Rahman, N., &
Stratton, M. R. (2004). A census of human cancer genes. Nature Reviews Cancer, 4(3),
177–183.
Gilley, O. W., & Pace, R. (1996). On the Harrison and Rubinfeld Data. Journal of Environmental
Economics and Management, 31(3), 403–405.
Haferlach, T., Kohlmann, A., Wieczorek, L., Basso, G., Kronnie, G. T., Béné, M.-C., de Vos, J.,
Hernández, J. M., Hofmann, W.-K., Mills, K. I., Gilkes, A., Chiaretti, S., Shurtleff, S. A.,
Kipps, T. J., Rassenti, L. Z., Yeoh, A. E., Papenhausen, P. R., Liu, W.-M., Williams, P. M., &
Foà, R. (2010). Clinical utility of microarray-based gene expression profiling in the diagnosis
and subclassification of leukemia: report from the International Microarray Innovations in
Leukemia Study Group. Journal of Clinical Oncology Official Journal of the American Society
of Clinical Oncology, 28(15), 2529–2537.
Handl, A. (2010). Multivariate Analysemethoden: Theorie und Praxis multivariater Verfahren
unter besonderer Berücksichtigung von S-PLUS, Statistik und ihre Anwendungen (2nd ed.).
Heidelberg: Springer.
Harrison, D., & Rubinfeld, D. L. (1978). Hedonic housing prices and the demand for clean air.
Journal of Environmental Economics and Management, 5(1), 81–102.
Hebestreit, K., Gröttrup, S., Emden, D., Veerkamp, J., Ruckert, C., Klein, H.-U., Müller-Tidow,
C., Dugas, M., & Speletas, M. (2012). Leukemia Gene Atlas – A Public Platform for Integra-
tive Exploration of Genome-Wide Molecular Data. PLoS One, 7(6), e39148.

Heinrich, L. J. (2002a). Informationsmanagement: Planung, Überwachung und Steuerung der
Informationsinfrastruktur, Wirtschaftsinformatik (7th ed.). München: Oldenbourg.
Heinrich, L. J. (2002b). Questionnaire for a success factor analysis in SME.
Henderson, H. V., & Velleman, P. F. (1981). Building multiple regression models interactively.
Biometrics, 37, 391–411.
Hoffmann-Beverages. (2014). Beverage-details. Accessed 27/08/2014, from http://www.
getraenke-hoffmann.de/download/durstexpress/DurstExpress_Katalog.pdf
IBM. (2014a). SPSS Modeler 16 Applications Guide. Accessed 18/09/2015, from ftp://public.dhe.
ibm.com/software/analytics/spss/documentation/modeler/16.0/en/modeler_applications_guide_book.pdf
IBM. (2014b). SPSS Modeler 16 Source, Process, and Output Nodes. Accessed 18/09/2015, from
ftp://public.dhe.ibm.com/software/analytics/spss/documentation/modeler/16.0/en/
modeler_nodes_general.pdf
IBM. (2014c). Test scores dataset. Accessed 18/09/2015, from http://www-01.ibm.com/support/
knowledgecenter/SSLVMB_22.0.0/com.ibm.spss.statistics.cs/components/glmm/glmm_testscores_intro.htm
IBM. (2015). SPSS Modeler 17 Applications guide. ftp://public.dhe.ibm.com/software/analytics/
spss/documentation/modeler/17.0/en/ModelerApplications.pdf
IBM Website. (2014). Customer segmentation analytics with IBM SPSS. Accessed 08/05/2015,
from http://www.ibm.com/developerworks/library/ba-spss-pds-db2luw/index.html
Journal of Statistical Education Data Archive. (2009). LPGA Performance Statistics for 2009.
Accessed 12/06/2015, from http://www.stat.ufl.edu/~winner/data/lpga2009.dat
Longley, J. W. (1967). An appraisal of least squares programs for the electronic computer from the
point of view of the user. Journal of the American Statistical Association, 62(319), 819–841.
Lichman, M. (2013). UCI Machine learning repository. http://archive.ics.uci.edu/ml

Machine Learning Repository. (1990). Pima Indians Diabetes Data Set. Accessed 18/09/2015,
from http://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes
Machine Learning Repository. (1991). Wine Data Set. Accessed 2015, from http://archive.ics.uci.
edu/ml/datasets/Wine
Machine Learning Repository. (1992). Breast Cancer Wisconsin (Original) Data Set. Accessed
29/10/2015, from https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%
28Original%29
Machine Learning Repository. (1993). Boston Housing Data Set. Accessed 12/06/2015, from
https://archive.ics.uci.edu/ml/datasets/Housing
Machine Learning Repository. (1994). Chess Endgame Database for White King and Rook against
Black King (KRK). Accessed 2015, from https://archive.ics.uci.edu/ml/datasets/Chess+(King-
Rook+vs.+King)
Machine Learning Repository. (1998). Optical Recognition of Handwritten Digits. Accessed 2015,
from https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits
McCullagh, P., & Nelder, J. A. (1983). Generalized linear models, Monographs on statistics and
applied probability. London: Chapman and Hall.
National Cancer Institute. (2013). What you need to know about leukemia, NIH publication,
no. 13-3775. Revised September 2013, digital edition.
Niedermeyer, E., Schomer, D. L., & Lopes da Silva, F. H. (2011). Niedermeyer’s electroencepha-
lography: Basic principles, clinical applications, and related fields (6th ed.). Philadelphia:
Wolters Kluwer/Lippincott Williams & Wilkins Health.
NOMIS UK. (2014). Official Labour Market Statistics – Annual Survey of Hours and Earnings –
Workplace Analysis. Accessed 18/09/2015, from http://nmtest.dur.ac.uk/
O’Connor, C. M., & Adams, J. U. (2010). Essentials of cell biology. Cambridge, MA: NPG
Education.
OECD. (2012a). PISA 2012 Technical Report.
OECD. (2012b). Programme for International Student Assessment (PISA) 2012. Accessed 02/03/
2015, from http://pisa2012.acer.edu.au/downloads.php
Oh, S.-H., Lee, Y.-R., & Kim, H.-N. (2014). A novel EEG feature extraction method using Hjorth
parameter. International Journal of Electronics and Electrical Engineering, 2(2), 106–110.
Potthoff, R. F., & Roy, S. N. (1964). A generalized multivariate analysis of variance model useful
especially for growth curve problems. Biometrika, 51, 313–326.
Schulz, L. O., Bennett, P. H., Ravussin, E., Kidd, J. R., Kidd, K. K., Esparza, J., & Valencia, M. E.
(2006). Effects of traditional and western environments on prevalence of type 2 diabetes in
Pima Indians in Mexico and the U.S. Diabetes Care, 29(8), 1866–1871.
Smith, J. W., Everhart, J. E., Dickson, W. C., Knowler, W. C., & Johannes, R. S. (1988). Using the
ADAP learning algorithm to forecast the onset of diabetes mellitus. Proceedings of the Annual
Symposium on Computer Application in Medical Care, 261–265.
Stacey, K., & Turner, R. (2015). Assessing mathematical literacy: The PISA experience.
UCI Machine Learning Repository. (1996). UCI Machine Learning Repository – Adult Data Set.
Accessed 12/09/2015, from https://archive.ics.uci.edu/ml/datasets/Adult
Vanderbilt University School of Medicine. (2004). Department of Biostatistics – Titanic Data.
Accessed 12/09/2015, from http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic.
html
Wendler, T. (2004). Modellierung und Bewertung von IT-Kosten: Empirische Analyse mit Hilfe
multivariater mathematischer Methoden, Wirtschaftsinformatik. Wiesbaden: Deutscher
Universitäts-Verlag.
Wolberg, W. H. (2003). Wisconsin breast cancer data. Accessed 12/06/2015, from http://www.
stat.yale.edu/~pollard/Courses/230.spring03/WBC/
Wolberg, W. H., & Mangasarian, O. L. (1990). Multisurface method of pattern separation for
medical diagnosis applied to breast cytology. Proc Natl Acad Sci USA, 87(23), 9193–9196.
