
MCT USE ONLY. STUDENT USE PROHIBITED

OFFICIAL MICROSOFT LEARNING PRODUCT

20773A
Analyzing Big Data with Microsoft R

Information in this document, including URL and other Internet Web site references, is subject to change
without notice. Unless otherwise noted, the example companies, organizations, products, domain names,
e-mail addresses, logos, people, places, and events depicted herein are fictitious, and no association with
any real company, organization, product, domain name, e-mail address, logo, person, place or event is
intended or should be inferred. Complying with all applicable copyright laws is the responsibility of the
user. Without limiting the rights under copyright, no part of this document may be reproduced, stored in
or introduced into a retrieval system, or transmitted in any form or by any means (electronic, mechanical,
photocopying, recording, or otherwise), or for any purpose, without the express written permission of
Microsoft Corporation.

Microsoft may have patents, patent applications, trademarks, copyrights, or other intellectual property
rights covering subject matter in this document. Except as expressly provided in any written license
agreement from Microsoft, the furnishing of this document does not give you any license to these
patents, trademarks, copyrights, or other intellectual property.

The names of manufacturers, products, or URLs are provided for informational purposes only and
Microsoft makes no representations and warranties, either expressed, implied, or statutory, regarding
these manufacturers or the use of the products with any Microsoft technologies. The inclusion of a
manufacturer or product does not imply endorsement of Microsoft of the manufacturer or product. Links
may be provided to third party sites. Such sites are not under the control of Microsoft and Microsoft is not
responsible for the contents of any linked site or any link contained in a linked site, or any changes or
updates to such sites. Microsoft is not responsible for webcasting or any other form of transmission
received from any linked site. Microsoft is providing these links to you only as a convenience, and the
inclusion of any link does not imply endorsement of Microsoft of the site or the products contained
therein.
© 2017 Microsoft Corporation. All rights reserved.

Microsoft and the trademarks listed at http://www.microsoft.com/about/legal/en/us/IntellectualProperty/Trademarks/EN-US.aspx are trademarks of the Microsoft group of companies. All other trademarks are property of their respective owners.

Product Number: 20773A

Part Number (if applicable): X21-42248

Released: 05/2017
MICROSOFT LICENSE TERMS
MICROSOFT INSTRUCTOR-LED COURSEWARE

These license terms are an agreement between Microsoft Corporation (or based on where you live, one of its
affiliates) and you. Please read them. They apply to your use of the content accompanying this agreement which
includes the media on which you received it, if any. These license terms also apply to Trainer Content and any
updates and supplements for the Licensed Content unless other terms accompany those items. If so, those terms
apply.

BY ACCESSING, DOWNLOADING OR USING THE LICENSED CONTENT, YOU ACCEPT THESE TERMS.
IF YOU DO NOT ACCEPT THEM, DO NOT ACCESS, DOWNLOAD OR USE THE LICENSED CONTENT.

If you comply with these license terms, you have the rights below for each license you acquire.

1. DEFINITIONS.

a. “Authorized Learning Center” means a Microsoft IT Academy Program Member, Microsoft Learning
Competency Member, or such other entity as Microsoft may designate from time to time.

b. “Authorized Training Session” means the instructor-led training class using Microsoft Instructor-Led
Courseware conducted by a Trainer at or through an Authorized Learning Center.

c. “Classroom Device” means one (1) dedicated, secure computer that an Authorized Learning Center owns
or controls that is located at an Authorized Learning Center’s training facilities that meets or exceeds the
hardware level specified for the particular Microsoft Instructor-Led Courseware.

d. “End User” means an individual who is (i) duly enrolled in and attending an Authorized Training Session
or Private Training Session, (ii) an employee of a MPN Member, or (iii) a Microsoft full-time employee.

e. “Licensed Content” means the content accompanying this agreement which may include the Microsoft
Instructor-Led Courseware or Trainer Content.

f. “Microsoft Certified Trainer” or “MCT” means an individual who is (i) engaged to teach a training session
to End Users on behalf of an Authorized Learning Center or MPN Member, and (ii) currently certified as a
Microsoft Certified Trainer under the Microsoft Certification Program.

g. “Microsoft Instructor-Led Courseware” means the Microsoft-branded instructor-led training course that
educates IT professionals and developers on Microsoft technologies. A Microsoft Instructor-Led
Courseware title may be branded as MOC, Microsoft Dynamics or Microsoft Business Group courseware.

h. “Microsoft IT Academy Program Member” means an active member of the Microsoft IT Academy
Program.

i. “Microsoft Learning Competency Member” means an active member of the Microsoft Partner Network
program in good standing that currently holds the Learning Competency status.

j. “MOC” means the “Official Microsoft Learning Product” instructor-led courseware known as Microsoft
Official Course that educates IT professionals and developers on Microsoft technologies.

k. “MPN Member” means an active Microsoft Partner Network program member in good standing.
l. “Personal Device” means one (1) personal computer, device, workstation or other digital electronic device
that you personally own or control that meets or exceeds the hardware level specified for the particular
Microsoft Instructor-Led Courseware.

m. “Private Training Session” means the instructor-led training classes provided by MPN Members for
corporate customers to teach a predefined learning objective using Microsoft Instructor-Led Courseware.
These classes are not advertised or promoted to the general public and class attendance is restricted to
individuals employed by or contracted by the corporate customer.

n. “Trainer” means (i) an academically accredited educator engaged by a Microsoft IT Academy Program
Member to teach an Authorized Training Session, and/or (ii) a MCT.

o. “Trainer Content” means the trainer version of the Microsoft Instructor-Led Courseware and additional
supplemental content designated solely for Trainers’ use to teach a training session using the Microsoft
Instructor-Led Courseware. Trainer Content may include Microsoft PowerPoint presentations, trainer
preparation guide, train the trainer materials, Microsoft One Note packs, classroom setup guide and Pre-
release course feedback form. To clarify, Trainer Content does not include any software, virtual hard
disks or virtual machines.

2. USE RIGHTS. The Licensed Content is licensed not sold. The Licensed Content is licensed on a one copy
per user basis, such that you must acquire a license for each individual that accesses or uses the Licensed
Content.

2.1 Below are five separate sets of use rights. Only one set of rights apply to you.

a. If you are a Microsoft IT Academy Program Member:


i. Each license acquired on behalf of yourself may only be used to review one (1) copy of the Microsoft
Instructor-Led Courseware in the form provided to you. If the Microsoft Instructor-Led Courseware is
in digital format, you may install one (1) copy on up to three (3) Personal Devices. You may not
install the Microsoft Instructor-Led Courseware on a device you do not own or control.
ii. For each license you acquire on behalf of an End User or Trainer, you may either:
1. distribute one (1) hard copy version of the Microsoft Instructor-Led Courseware to one (1) End
User who is enrolled in the Authorized Training Session, and only immediately prior to the
commencement of the Authorized Training Session that is the subject matter of the Microsoft
Instructor-Led Courseware being provided, or
2. provide one (1) End User with the unique redemption code and instructions on how they can
access one (1) digital version of the Microsoft Instructor-Led Courseware, or
3. provide one (1) Trainer with the unique redemption code and instructions on how they can
access one (1) Trainer Content,
provided you comply with the following:
iii. you will only provide access to the Licensed Content to those individuals who have acquired a valid
license to the Licensed Content,
iv. you will ensure each End User attending an Authorized Training Session has their own valid licensed
copy of the Microsoft Instructor-Led Courseware that is the subject of the Authorized Training
Session,
v. you will ensure that each End User provided with the hard-copy version of the Microsoft Instructor-
Led Courseware will be presented with a copy of this agreement and each End User will agree that
their use of the Microsoft Instructor-Led Courseware will be subject to the terms in this agreement
prior to providing them with the Microsoft Instructor-Led Courseware. Each individual will be required
to denote their acceptance of this agreement in a manner that is enforceable under local law prior to
their accessing the Microsoft Instructor-Led Courseware,
vi. you will ensure that each Trainer teaching an Authorized Training Session has their own valid
licensed copy of the Trainer Content that is the subject of the Authorized Training Session,
vii. you will only use qualified Trainers who have in-depth knowledge of and experience with the
Microsoft technology that is the subject of the Microsoft Instructor-Led Courseware being taught for
all your Authorized Training Sessions,
viii. you will only deliver a maximum of 15 hours of training per week for each Authorized Training
Session that uses a MOC title, and
ix. you acknowledge that Trainers that are not MCTs will not have access to all of the trainer resources
for the Microsoft Instructor-Led Courseware.

b. If you are a Microsoft Learning Competency Member:


i. Each license acquired on behalf of yourself may only be used to review one (1) copy of the Microsoft
Instructor-Led Courseware in the form provided to you. If the Microsoft Instructor-Led Courseware is
in digital format, you may install one (1) copy on up to three (3) Personal Devices. You may not
install the Microsoft Instructor-Led Courseware on a device you do not own or control.
ii. For each license you acquire on behalf of an End User or Trainer, you may either:
1. distribute one (1) hard copy version of the Microsoft Instructor-Led Courseware to one (1) End
User attending the Authorized Training Session and only immediately prior to the
commencement of the Authorized Training Session that is the subject matter of the Microsoft
Instructor-Led Courseware provided, or
2. provide one (1) End User attending the Authorized Training Session with the unique redemption
code and instructions on how they can access one (1) digital version of the Microsoft Instructor-
Led Courseware, or
3. you will provide one (1) Trainer with the unique redemption code and instructions on how they
can access one (1) Trainer Content,
provided you comply with the following:
iii. you will only provide access to the Licensed Content to those individuals who have acquired a valid
license to the Licensed Content,
iv. you will ensure that each End User attending an Authorized Training Session has their own valid
licensed copy of the Microsoft Instructor-Led Courseware that is the subject of the Authorized
Training Session,
v. you will ensure that each End User provided with a hard-copy version of the Microsoft Instructor-Led
Courseware will be presented with a copy of this agreement and each End User will agree that their
use of the Microsoft Instructor-Led Courseware will be subject to the terms in this agreement prior to
providing them with the Microsoft Instructor-Led Courseware. Each individual will be required to
denote their acceptance of this agreement in a manner that is enforceable under local law prior to
their accessing the Microsoft Instructor-Led Courseware,
vi. you will ensure that each Trainer teaching an Authorized Training Session has their own valid
licensed copy of the Trainer Content that is the subject of the Authorized Training Session,
vii. you will only use qualified Trainers who hold the applicable Microsoft Certification credential that is
the subject of the Microsoft Instructor-Led Courseware being taught for your Authorized Training
Sessions,
viii. you will only use qualified MCTs who also hold the applicable Microsoft Certification credential that is
the subject of the MOC title being taught for all your Authorized Training Sessions using MOC,
ix. you will only provide access to the Microsoft Instructor-Led Courseware to End Users, and
x. you will only provide access to the Trainer Content to Trainers.
c. If you are a MPN Member:
i. Each license acquired on behalf of yourself may only be used to review one (1) copy of the Microsoft
Instructor-Led Courseware in the form provided to you. If the Microsoft Instructor-Led Courseware is
in digital format, you may install one (1) copy on up to three (3) Personal Devices. You may not
install the Microsoft Instructor-Led Courseware on a device you do not own or control.
ii. For each license you acquire on behalf of an End User or Trainer, you may either:
1. distribute one (1) hard copy version of the Microsoft Instructor-Led Courseware to one (1) End
User attending the Private Training Session, and only immediately prior to the commencement
of the Private Training Session that is the subject matter of the Microsoft Instructor-Led
Courseware being provided, or
2. provide one (1) End User who is attending the Private Training Session with the unique
redemption code and instructions on how they can access one (1) digital version of the
Microsoft Instructor-Led Courseware, or
3. you will provide one (1) Trainer who is teaching the Private Training Session with the unique
redemption code and instructions on how they can access one (1) Trainer Content,
provided you comply with the following:
iii. you will only provide access to the Licensed Content to those individuals who have acquired a valid
license to the Licensed Content,
iv. you will ensure that each End User attending a Private Training Session has their own valid licensed
copy of the Microsoft Instructor-Led Courseware that is the subject of the Private Training Session,
v. you will ensure that each End User provided with a hard copy version of the Microsoft Instructor-Led
Courseware will be presented with a copy of this agreement and each End User will agree that their
use of the Microsoft Instructor-Led Courseware will be subject to the terms in this agreement prior to
providing them with the Microsoft Instructor-Led Courseware. Each individual will be required to
denote their acceptance of this agreement in a manner that is enforceable under local law prior to
their accessing the Microsoft Instructor-Led Courseware,
vi. you will ensure that each Trainer teaching a Private Training Session has their own valid licensed
copy of the Trainer Content that is the subject of the Private Training Session,
vii. you will only use qualified Trainers who hold the applicable Microsoft Certification credential that is
the subject of the Microsoft Instructor-Led Courseware being taught for all your Private Training
Sessions,
viii. you will only use qualified MCTs who hold the applicable Microsoft Certification credential that is the
subject of the MOC title being taught for all your Private Training Sessions using MOC,
ix. you will only provide access to the Microsoft Instructor-Led Courseware to End Users, and
x. you will only provide access to the Trainer Content to Trainers.

d. If you are an End User:


For each license you acquire, you may use the Microsoft Instructor-Led Courseware solely for your
personal training use. If the Microsoft Instructor-Led Courseware is in digital format, you may access the
Microsoft Instructor-Led Courseware online using the unique redemption code provided to you by the
training provider and install and use one (1) copy of the Microsoft Instructor-Led Courseware on up to
three (3) Personal Devices. You may also print one (1) copy of the Microsoft Instructor-Led Courseware.
You may not install the Microsoft Instructor-Led Courseware on a device you do not own or control.

e. If you are a Trainer:


i. For each license you acquire, you may install and use one (1) copy of the Trainer Content in the
form provided to you on one (1) Personal Device solely to prepare and deliver an Authorized
Training Session or Private Training Session, and install one (1) additional copy on another Personal
Device as a backup copy, which may be used only to reinstall the Trainer Content. You may not
install or use a copy of the Trainer Content on a device you do not own or control. You may also
print one (1) copy of the Trainer Content solely to prepare for and deliver an Authorized Training
Session or Private Training Session.
ii. You may customize the written portions of the Trainer Content that are logically associated with
instruction of a training session in accordance with the most recent version of the MCT agreement.
If you elect to exercise the foregoing rights, you agree to comply with the following: (i)
customizations may only be used for teaching Authorized Training Sessions and Private Training
Sessions, and (ii) all customizations will comply with this agreement. For clarity, any use of
“customize” refers only to changing the order of slides and content, and/or not using all the slides or
content, it does not mean changing or modifying any slide or content.

2.2 Separation of Components. The Licensed Content is licensed as a single unit and you may not
separate their components and install them on different devices.

2.3 Redistribution of Licensed Content. Except as expressly provided in the use rights above, you may
not distribute any Licensed Content or any portion thereof (including any permitted modifications) to any
third parties without the express written permission of Microsoft.

2.4 Third Party Notices. The Licensed Content may include third party code content that Microsoft, not the
third party, licenses to you under this agreement. Notices, if any, for the third party code content are included
for your information only.

2.5 Additional Terms. Some Licensed Content may contain components with additional terms,
conditions, and licenses regarding its use. Any non-conflicting terms in those conditions and licenses also
apply to your use of that respective component and supplements the terms described in this agreement.

3. LICENSED CONTENT BASED ON PRE-RELEASE TECHNOLOGY. If the Licensed Content’s subject


matter is based on a pre-release version of Microsoft technology (“Pre-release”), then in addition to the
other provisions in this agreement, these terms also apply:

a. Pre-Release Licensed Content. This Licensed Content subject matter is on the Pre-release version of
the Microsoft technology. The technology may not work the way a final version of the technology will
and we may change the technology for the final version. We also may not release a final version.
Licensed Content based on the final version of the technology may not contain the same information as
the Licensed Content based on the Pre-release version. Microsoft is under no obligation to provide you
with any further content, including any Licensed Content based on the final version of the technology.

b. Feedback. If you agree to give feedback about the Licensed Content to Microsoft, either directly or
through its third party designee, you give to Microsoft without charge, the right to use, share and
commercialize your feedback in any way and for any purpose. You also give to third parties, without
charge, any patent rights needed for their products, technologies and services to use or interface with
any specific parts of a Microsoft technology, Microsoft product, or service that includes the feedback.
You will not give feedback that is subject to a license that requires Microsoft to license its technology,
technologies, or products to third parties because we include your feedback in them. These rights
survive this agreement.

c. Pre-release Term. If you are a Microsoft IT Academy Program Member, Microsoft Learning
Competency Member, MPN Member or Trainer, you will cease using all copies of the Licensed Content on
the Pre-release technology upon (i) the date which Microsoft informs you is the end date for using the
Licensed Content on the Pre-release technology, or (ii) sixty (60) days after the commercial release of the
technology that is the subject of the Licensed Content, whichever is earliest (“Pre-release term”).
Upon expiration or termination of the Pre-release term, you will irretrievably delete and destroy all copies
of the Licensed Content in your possession or under your control.
4. SCOPE OF LICENSE. The Licensed Content is licensed, not sold. This agreement only gives you some
rights to use the Licensed Content. Microsoft reserves all other rights. Unless applicable law gives you more
rights despite this limitation, you may use the Licensed Content only as expressly permitted in this
agreement. In doing so, you must comply with any technical limitations in the Licensed Content that only
allows you to use it in certain ways. Except as expressly permitted in this agreement, you may not:
• access or allow any individual to access the Licensed Content if they have not acquired a valid license
for the Licensed Content,
• alter, remove or obscure any copyright or other protective notices (including watermarks), branding
or identifications contained in the Licensed Content,
• modify or create a derivative work of any Licensed Content,
• publicly display, or make the Licensed Content available for others to access or use,
• copy, print, install, sell, publish, transmit, lend, adapt, reuse, link to or post, make available or
distribute the Licensed Content to any third party,
• work around any technical limitations in the Licensed Content, or
• reverse engineer, decompile, remove or otherwise thwart any protections or disassemble the
Licensed Content except and only to the extent that applicable law expressly permits, despite this
limitation.

5. RESERVATION OF RIGHTS AND OWNERSHIP. Microsoft reserves all rights not expressly granted to
you in this agreement. The Licensed Content is protected by copyright and other intellectual property laws
and treaties. Microsoft or its suppliers own the title, copyright, and other intellectual property rights in the
Licensed Content.

6. EXPORT RESTRICTIONS. The Licensed Content is subject to United States export laws and regulations.
You must comply with all domestic and international export laws and regulations that apply to the Licensed
Content. These laws include restrictions on destinations, end users and end use. For additional information,
see www.microsoft.com/exporting.

7. SUPPORT SERVICES. Because the Licensed Content is “as is”, we may not provide support services for it.

8. TERMINATION. Without prejudice to any other rights, Microsoft may terminate this agreement if you fail
to comply with the terms and conditions of this agreement. Upon termination of this agreement for any
reason, you will immediately stop all use of and delete and destroy all copies of the Licensed Content in
your possession or under your control.

9. LINKS TO THIRD PARTY SITES. You may link to third party sites through the use of the Licensed
Content. The third party sites are not under the control of Microsoft, and Microsoft is not responsible for
the contents of any third party sites, any links contained in third party sites, or any changes or updates to
third party sites. Microsoft is not responsible for webcasting or any other form of transmission received
from any third party sites. Microsoft is providing these links to third party sites to you only as a
convenience, and the inclusion of any link does not imply an endorsement by Microsoft of the third party
site.

10. ENTIRE AGREEMENT. This agreement, and any additional terms for the Trainer Content, updates and
supplements are the entire agreement for the Licensed Content, updates and supplements.

11. APPLICABLE LAW.


a. United States. If you acquired the Licensed Content in the United States, Washington state law governs
the interpretation of this agreement and applies to claims for breach of it, regardless of conflict of laws
principles. The laws of the state where you live govern all other claims, including claims under state
consumer protection laws, unfair competition laws, and in tort.
b. Outside the United States. If you acquired the Licensed Content in any other country, the laws of that
country apply.

12. LEGAL EFFECT. This agreement describes certain legal rights. You may have other rights under the laws
of your country. You may also have rights with respect to the party from whom you acquired the Licensed
Content. This agreement does not change your rights under the laws of your country if the laws of your
country do not permit it to do so.

13. DISCLAIMER OF WARRANTY. THE LICENSED CONTENT IS LICENSED "AS-IS" AND "AS
AVAILABLE." YOU BEAR THE RISK OF USING IT. MICROSOFT AND ITS RESPECTIVE
AFFILIATES GIVES NO EXPRESS WARRANTIES, GUARANTEES, OR CONDITIONS. YOU MAY
HAVE ADDITIONAL CONSUMER RIGHTS UNDER YOUR LOCAL LAWS WHICH THIS AGREEMENT
CANNOT CHANGE. TO THE EXTENT PERMITTED UNDER YOUR LOCAL LAWS, MICROSOFT AND
ITS RESPECTIVE AFFILIATES EXCLUDES ANY IMPLIED WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NON-INFRINGEMENT.

14. LIMITATION ON AND EXCLUSION OF REMEDIES AND DAMAGES. YOU CAN RECOVER FROM
MICROSOFT, ITS RESPECTIVE AFFILIATES AND ITS SUPPLIERS ONLY DIRECT DAMAGES UP
TO US$5.00. YOU CANNOT RECOVER ANY OTHER DAMAGES, INCLUDING CONSEQUENTIAL,
LOST PROFITS, SPECIAL, INDIRECT OR INCIDENTAL DAMAGES.

This limitation applies to


o anything related to the Licensed Content, services, content (including code) on third party Internet
sites or third-party programs; and
o claims for breach of contract, breach of warranty, guarantee or condition, strict liability, negligence,
or other tort to the extent permitted by applicable law.

It also applies even if Microsoft knew or should have known about the possibility of the damages. The
above limitation or exclusion may not apply to you because your country may not allow the exclusion or
limitation of incidental, consequential or other damages.

Please note: As this Licensed Content is distributed in Quebec, Canada, some of the clauses in this
agreement are provided below in French.

Remarque : Ce contenu sous licence étant distribué au Québec, Canada, certaines des clauses
dans ce contrat sont fournies ci-dessous en français.

EXONÉRATION DE GARANTIE. Le contenu sous licence visé par une licence est offert « tel quel ». Toute
utilisation de ce contenu sous licence est à votre seule risque et péril. Microsoft n’accorde aucune autre garantie
expresse. Vous pouvez bénéficier de droits additionnels en vertu du droit local sur la protection des
consommateurs, que ce contrat ne peut modifier. Là où elles sont permises par le droit local, les garanties
implicites de qualité marchande, d’adéquation à un usage particulier et d’absence de contrefaçon sont exclues.

LIMITATION DES DOMMAGES-INTÉRÊTS ET EXCLUSION DE RESPONSABILITÉ POUR LES


DOMMAGES. Vous pouvez obtenir de Microsoft et de ses fournisseurs une indemnisation en cas de dommages
directs uniquement à hauteur de 5,00 $ US. Vous ne pouvez prétendre à aucune indemnisation pour les autres
dommages, y compris les dommages spéciaux, indirects ou accessoires et pertes de bénéfices.
Cette limitation concerne:
• tout ce qui est relié au contenu sous licence, aux services ou au contenu (y compris le code)
figurant sur des sites Internet tiers ou dans des programmes tiers; et.
• les réclamations au titre de violation de contrat ou de garantie, ou au titre de responsabilité
stricte, de négligence ou d’une autre faute dans la limite autorisée par la loi en vigueur.
Elle s’applique également, même si Microsoft connaissait ou devrait connaître l’éventualité d’un tel dommage. Si
votre pays n’autorise pas l’exclusion ou la limitation de responsabilité pour les dommages indirects, accessoires
ou de quelque nature que ce soit, il se peut que la limitation ou l’exclusion ci-dessus ne s’appliquera pas à votre
égard.

EFFET JURIDIQUE. Le présent contrat décrit certains droits juridiques. Vous pourriez avoir d’autres droits
prévus par les lois de votre pays. Le présent contrat ne modifie pas les droits que vous confèrent les lois de votre
pays si celles-ci ne le permettent pas.

Revised July 2013



Acknowledgements
Microsoft Learning would like to acknowledge and thank the following for their contribution towards
developing this title. Their effort at various stages in the development has ensured that you have a good
classroom experience.

David Springate – Content Developer


David Springate has a PhD in computational and evolutionary biology from the University of Manchester.
He has worked as a postdoctoral researcher, medical statistician and head of data science at RealityMine,
a market research and big data analytics company.

He now runs Data Jujitsu Ltd., a data science consultancy company based in Manchester, UK.

Ian Creese – Content Developer


Ian Creese graduated with a degree in Physics before moving into consumer analytics and data science.
Ian's favorite technologies are Hadoop and R. He has worked as a subject matter expert for various companies before becoming a contractor.

John Sharp – Content Developer


John Sharp gained an honors degree in Computing from Imperial College, London. He has been
developing software and writing training courses, guides, and books for over 25 years. John has
experience in a wide range of technologies, from database systems and UNIX through to C, C++ and C#
applications for the .NET Framework, together with Java and JavaScript development. He has authored
several books for Microsoft Press, including eight editions of C# Step By Step, Windows Communication
Foundation Step By Step, and the J# Core Reference.

Contents
Module 1: Microsoft R Server and Microsoft R Client
Module Overview 1-1 

Lesson 1: Introduction to Microsoft R Server 1-2 

Lesson 2: Using Microsoft R Client 1-7 

Lesson 3: The ScaleR functions 1-15 

Lab: Exploring Microsoft R Server and Microsoft R Client 1-20 

Module Review and Takeaways 1-25 

Module 2: Exploring Big Data


Module Overview 2-1 
Lesson 1: Understanding ScaleR data sources 2-2 

Lesson 2: Reading and writing XDF data 2-9 

Lesson 3: Summarizing data in an XDF object 2-21 


Lab: Exploring big data 2-32 

Module Review and Takeaways 2-39 

Module 3: Visualizing Big Data


Module Overview 3-1 
Lesson 1: Visualizing in-memory data 3-2 

Lesson 2: Visualizing big data 3-22 

Lab: Visualizing data 3-33 


Module Review and Takeaways 3-39 

Module 4: Processing Big Data


Module Overview 4-1 

Lesson 1: Transforming big data 4-2 

Lesson 2: Managing big datasets 4-13 

Lab: Processing big data 4-19 

Module Review and Takeaways 4-25 

Module 5: Parallelizing Analysis Operations


Module Overview 5-1 

Lesson 1: Using the RxLocalParallel compute context with rxExec 5-2 

Lesson 2: Using the RevoPemaR package 5-11 

Lab: Parallelizing analysis operations 5-20 

Module Review and Takeaways 5-24 



Module 6: Creating and Evaluating Regression Models


Module Overview 6-1 

Lesson 1: Clustering big data 6-2 

Lesson 2: Generating regression models and making predictions 6-10 

Lab: Creating and using a regression model 6-23 

Module Review and Takeaways 6-28 

Module 7: Creating and Evaluating Partitioning Models


Module Overview 7-1 

Lesson 1: Creating partitioning models based on decision trees 7-3 

Lesson 2: Evaluating models 7-14 

Lesson 3: Using the MicrosoftML package 7-24 

Lab: Creating a partitioning model to make predictions 7-28 

Module Review and Takeaways 7-34 

Module 8: Processing Big Data in SQL Server and Hadoop


Module Overview 8-1 

Lesson 1: Integrating R with SQL Server 8-2 

Lab A: Deploying a predictive model to SQL Server 8-13 


Lesson 2: Using ScaleR functions with Hadoop on a Map/Reduce cluster 8-21 

Lesson 3: Using ScaleR functions with Spark 8-30 

Lab B: Incorporating Hadoop Map/Reduce and Spark functionality into the ScaleR workflow 8-35 

Module Review and Takeaways 8-41 

Lab Answer Keys


Module 1 Lab: Exploring Microsoft R Server and Microsoft R Client L01-1
Module 2 Lab: Exploring big data L02-1

Module 3 Lab: Visualizing data L03-1

Module 4 Lab: Processing big data L04-1

Module 5 Lab: Parallelizing analysis operations L05-1

Module 6 Lab: Creating and using a regression model L06-1

Module 7 Lab: Creating a partitioning model to make predictions L07-1

Module 8 Lab A: Deploying a predictive model to SQL Server L08-1

Module 8 Lab B: Incorporating Hadoop Map/Reduce and Spark functionality into the ScaleR workflow L08-8

About This Course


This section provides a brief description of the course, audience, suggested prerequisites, and course
objectives.

Course Description
This three-day instructor-led course describes how to use Microsoft R Server to create and run an analysis
on a large dataset, and utilize it in Big Data environments, such as a Hadoop or Spark cluster, or a SQL
Server database.

Audience
The primary audience for this course is data scientists who are already familiar with R and who have a
high-level understanding of data platforms such as the Hadoop ecosystem, SQL Server, and core T-SQL
capabilities.

The course will likely be attended by developers who need to integrate R analyses into their solutions.

Student Prerequisites
In addition to their professional experience, students who attend this course should have:

 Programming experience using R, and familiarity with common R packages.

 Knowledge of common statistical methods and data analysis best practices.

 Basic knowledge of the Microsoft Windows operating system and its core functionality.

 Working knowledge of relational databases.

Course Objectives
After completing this course, students will be able to:

 Explain how Microsoft R Server and Microsoft R Client work.


 Use R Client with R Server to explore big data held in different data stores.

 Visualize big data by using graphs and plots.

 Transform and clean big data sets.


 Implement options for splitting analysis jobs into parallel tasks.

 Build and evaluate regression models generated from big data.

 Create, score, and deploy partitioning models generated from big data.
 Use R in SQL Server and Hadoop environments.

Course Outline
The course outline is as follows:

 Module 1: ‘Microsoft R Server and R Client’ provides an introduction to Microsoft R Server and R
Client, and an overview of the ScaleR functions.

 Module 2: ‘Exploring Big Data’ describes how to use ScaleR data sources to read and summarize big
data in different compute contexts.

 Module 3: ‘Visualizing Big Data’ shows how to use the ggplot2 package, and the ScaleR functions, to
generate plots and graphs of big data.

 Module 4: ‘Processing Big Data’ describes how to transform and clean big data sets.

 Module 5: ‘Parallelizing Analysis Operations’ shows how to use ScaleR and PEMA classes to split
analysis jobs into parallel tasks.

 Module 6: ‘Creating and Evaluating Regression Models’ describes how to build and evaluate
regression models generated from big data sets.

 Module 7: 'Creating and Evaluating Partitioning Models' shows how to create and score partitioning
models generated from big data sets.

 Module 8: 'Processing Big Data in SQL Server and Hadoop' describes how to use R in SQL Server and
Hadoop environments.

Course Materials
The following materials are included with your kit:

 Course Handbook: a succinct classroom learning guide that provides the critical technical
information in a crisp, tightly-focused format, which is essential for an effective in-class learning
experience.

o Lessons: guide you through the learning objectives and provide the key points that are critical to
the success of the in-class learning experience.

o Labs: provide a real-world, hands-on platform for you to apply the knowledge and skills learned
in the module.

o Module Reviews and Takeaways: provide on-the-job reference material to boost knowledge
and skills retention.

o Lab Answer Keys: provide step-by-step lab solution guidance.

Additional Reading: Course Companion Content on the http://www.microsoft.com/learning/en/us/companion-moc.aspx Site: searchable, easy-to-browse digital content with integrated premium online resources that supplement the Course Handbook.

 Modules: include companion content, such as questions and answers, detailed demo steps and
additional reading links, for each lesson. Additionally, they include Lab Review questions and answers
and Module Reviews and Takeaways sections, which contain the review questions and answers, best
practices, common issues and troubleshooting tips with answers, and real-world issues and scenarios
with answers.
 Resources: include well-categorized additional resources that give you immediate access to the most
current premium content on TechNet, MSDN®, or Microsoft® Press®.

Additional Reading: Student Course files on the http://www.microsoft.com/learning/en/us/companion-moc.aspx Site: includes the Allfiles.exe, a self-extracting executable file that contains all required files for the labs and demonstrations.

 Course evaluation: at the end of the course, you will have the opportunity to complete an online
evaluation to provide feedback on the course, training facility, and instructor.

 To provide additional comments or feedback on the course, send email to mcspprt@microsoft.com.


To inquire about the Microsoft Certification Program, send an email to mcphelp@microsoft.com.

Virtual Machine Environment


This section provides the information for setting up the classroom environment to support the business
scenario of the course.

Virtual Machine Configuration


In this course, you will use Microsoft® Hyper-V™ to perform the labs.

The following table shows the role of each virtual machine that is used in this course:

Virtual machine       Role
20773A-LON-DC         LON-DC1 is a domain controller.
20773A-LON-DEV        LON-DEV is the main development machine. It runs R Client and has R tools installed.
20773A-LON-RSVR       LON-RSVR runs R Server and acts as the web node for the R Server cluster.
20773A-LON-SQLR       LON-SQLR runs SQL Server 2016 and R Server. It also acts as a compute node in the R Server cluster.
MT17B-WS2016-NAT      WS2016-NAT provides access to the Internet.

This course also uses a separate VM, LON-HADOOP, running in Azure. This VM is a Hadoop server that is
shared by all students.

Software Configuration
The following software is installed on the virtual machines:
 Microsoft R Server - 9.0.1, 64 bit

 Visual Studio 2015 Enterprise Edition

 Microsoft R Open 3.3.2


 Microsoft R Client

 Microsoft SQL Server 2016 (64-bit)

 Microsoft SQL Server Management Studio - 16.5.3

Course Files
The files associated with the labs in this course are located in the E:\Labfiles folder on the 20773A-LON-DEV virtual machine.

Classroom Setup
Each classroom computer will have the same virtual machines configured in the same
way. Additionally, the instructor must perform the following tasks before starting the
course:

Create the Hadoop VM in Azure


The document Creating the Hadoop VMs on Azure provides full details for creating this VM. Note that
you should allow an hour to complete this task before the course starts.

You should have been provided with the details of the Azure account.

Start the Hadoop VM


1. On the desktop computer, in the Start menu, type Internet Explorer, and then click Internet
Explorer.

2. In the address bar, type portal.azure.com, and then press Enter.

3. Enter your Microsoft® account credentials to log in.

4. In the navigation blade on the left side of the portal, click Resource groups.

5. Click the LON-HADOOP-SERVER resource group.

6. In the LON-HADOOP-SERVER blade, click the LON-HADOOP virtual machine.

7. In the LON-HADOOP blade, click Start, and then wait for the VM to start running.

8. Navigate to http://fqdn:8080, where fqdn is the fully qualified domain name of the LON-HADOOP
VM. For example, http://lon-hadoop-01.ukwest.cloudapp.azure.com:8080 (Note: you can find the
fqdn of the VM in Azure, on the same page that you used to start the VM—it is reported in the Public
IP address/DNS name label field.)

9. On the Sign in page, type admin for the Username and Password, and then click Sign in. Note that
none of the Hadoop services are currently running.

10. In the left pane, click the Actions drop-down, and then click Start All.

11. In the Confirmation dialog box, click Confirm Start.


12. When all the services have started, click OK, and then close Internet Explorer.

All students and the instructor must perform the following tasks prior to commencing
module 1:
Start the VMs
1. In Hyper-V Manager, under Virtual Machines, right-click MT17B-WS2016-NAT, and then click Start.

2. In Hyper-V Manager, under Virtual Machines, right-click 20773A-LON-DC, and then click Start.
3. In Hyper-V Manager, under Virtual Machines, right-click 20773A-LON-DEV, and then click Start.

4. In Hyper-V Manager, under Virtual Machines, right-click 20773A-LON-RSVR, and then click Start.

5. In Hyper-V Manager, under Virtual Machines, right-click 20773A-LON-SQLR, and then click Start.

Note: If a Hyper-V Manager dialog box appears saying there is not enough memory in
the system to start the virtual machines, restart the PC, and then start the VMs again.

6. Right-click 20773A-LON-DEV, and then click Connect.

7. Log in as Adatum\AdatumAdmin with the password Pa55w.rd.

Install RStudio on the LON-DEV VM


1. In Internet Explorer, browse to https://download1.rstudio.org/RStudio-1.0.136.exe.

2. In the Internet Explorer message box, click Run.

3. In the User Account Control dialog box, click Yes.

4. In the RStudio Setup wizard, on the Welcome to the RStudio Setup Wizard page, click Next.
5. On the Choose Install Location page, click Next.

6. On the Choose Start Menu Folder page, click Install.



7. On the Completing the RStudio Setup Wizard page, click Finish.

8. Click the Windows Start button, and in the Recently added list click RStudio.

9. In the RStudio Console, type the following command and then press Enter.

remoteLogin("http://LON-RSVR:12800", session=TRUE, commandline=TRUE, diff=TRUE)

10. In the Remote Server dialog box, in the User name box type admin, in the Password box type
Pa55w.rd, and then click OK.

11. In the RStudio Console, verify that a remote R session starts successfully.

12. Type exit to close the remote session.

13. Close RStudio without saving the workspace image.

Install PuTTY on the LON-DEV VM


1. In Internet Explorer, browse to https://the.earth.li/~sgtatham/putty/0.68/w64/putty-64bit-0.68-installer.msi.

2. In the Internet Explorer message box, click Run.

3. In the PuTTY release 0.68 (64-bit) Setup wizard, on the Welcome to the PuTTY release 0.68 (64-bit)
Setup Wizard page, click Next.
4. On the Destination Folder page, click Next.

5. On the Product Features page, click Install.

6. In the User Account Control dialog box, click Yes.


7. When the wizard has completed, clear the View README file check box, and then click Finish.

8. Right-click the Windows Start button, and then click System.

9. In the System dialog box, click Advanced system settings.


10. In the System Properties dialog box, click Environment Variables.

11. In the Environment Variables dialog box, click Path, and then click Edit.

12. In the Edit User Variable dialog box, append the path C:\Program Files\PuTTY to the Variable
Value, and then click OK.

13. In the Environment Variables dialog box, click OK.

14. In the System Properties dialog box, click OK.


15. Close the System dialog box.

16. Open a command prompt window.

17. In the command prompt window, type the following command and then press Enter:

putty

18. Verify that the PuTTY Configuration window appears, and then click Cancel.

19. Close the command prompt window.



Configure PuTTY on the LON-DEV VM

Note: For the instructor. In the following procedure, use the instructor account for
connecting to Hadoop, where specified by the text <your user name>. There are separate
accounts on the Hadoop VM for each student, named student01, student02, and so on, as
described in the document Creating the Hadoop VMs on Azure. Allocate one of these accounts
to each student, who should use it as <your user name> in this procedure. You should also
provide students with the fully qualified domain name (fqdn), such as lon-hadoop-01.ukwest.cloudapp.azure.com, of the Hadoop VM.

1. On the LON-DEV VM, open a command prompt.

2. In the command prompt window, run the putty command. The putty utility should start and the
PuTTY Configuration window should appear.

3. In the PuTTY Configuration window, in the Host Name box, enter the fully qualified domain name
of the LON-HADOOP VM. For example, lon-hadoop-01.ukwest.cloudapp.azure.com.

4. In the Saved Sessions box, type LON-HADOOP, click Save, and then click Open.

5. If a PuTTY Security Alert dialog box appears, click Yes.


6. In the PuTTY terminal window that appears, at the login as prompt, log in as <your user name>
with the password Pa55w.rd.

7. Run the following command to create SSH keys for performing password-less authentication:

ssh-keygen

8. At the prompt Enter file in which to save the key (/home/<your user name>/.ssh/id_rsa), press
Enter.

9. At the prompt Enter passphrase (empty for no passphrase), press Enter.

10. At the prompt Enter same passphrase again, press Enter.

11. In the PuTTY terminal window, run the following command:

cat .ssh/id_rsa.pub >> .ssh/authorized_keys

12. In the PuTTY terminal window, run the following commands:

chmod 700 .ssh


chmod 600 .ssh/authorized_keys

13. Close the PuTTY terminal window.

14. In the PuTTY Exit Confirmation dialog box, click OK.


15. On the LON-DEV VM, in the command prompt window, move to the E:\ folder.

16. Run the following command to copy the key file for your account on the Hadoop VM to the LON-DEV VM. Replace fqdn with the fully qualified domain name of the LON-HADOOP VM, for example,
lon-hadoop-01.ukwest.cloudapp.azure.com:

pscp <your user name>@fqdn:.ssh/id_rsa id_rsa

17. At the Password prompt, type Pa55w.rd, and then press Enter.

18. In the command prompt window, run the puttygen command. The PuTTY Key Generator window
should appear.

19. In the PuTTY Key Generator window, click Load.

20. In the Load private key dialog box, move to the E:\ folder, in the file selector drop-down list box,
click All Files(*.*), click id_rsa, and then click Open.

21. In the PuTTYgen Notice dialog box, verify that the key was imported successfully, and then click OK.

22. In the PuTTY Key Generator window, click Save private key.

23. In the PuTTYgen Warning dialog box, click Yes.

24. In the Save private key as dialog box, in the File name box, type HadoopVM, and then click Save.

25. Close the PuTTY Key Generator window.

26. Run the putty command again.


27. In the PuTTY Configuration window, in the Saved Sessions box, click LON-HADOOP, and then click
Load.

28. In the Category pane of the PuTTY Configuration window, under Connection, expand SSH, and
then click Auth.

29. In the Options controlling SSH authentication pane, next to the Private key file for
authentication box, click Browse.

30. In the Select private key file dialog box, move to the E:\ folder, click HadoopVM, and then click
Open.

31. In the Category pane of the PuTTY Configuration window, under Connection, click Data.

32. In the Data to send to the server pane, in the Auto-login username box, type <your user name>.

33. In the Category pane of the PuTTY Configuration window, click Session.

34. Click Save, and then close the PuTTY Configuration window.
35. Close the Command Prompt window.

Create a network share on the R Server


1. On the desktop computer, in Hyper-V Manager, under Virtual Machines, right-click 20773A-LON-RSVR, and then click Connect.

2. Log in as Adatum\Administrator with the password Pa55w.rd.

3. Using File Explorer, create a new folder named C:\Data.

4. In File Explorer, right-click the C:\Data folder, point to Share with, and then click Specific people.

5. In the File Sharing dialog box, click the drop-down list, click Everyone, and then click Add.

6. In the lower pane, click the Everyone row, and set the Permission Level to Read/Write, and then
click Share.

7. In the File Sharing dialog box, verify that the file share is named \\LON-RSVR\Data, and then click
Done.

At the end of the course, the instructor must perform the following tasks to shut down
the Hadoop VM:

Shut down the Hadoop VM


1. Using Internet Explorer, navigate to portal.azure.com.

2. Enter your Microsoft account credentials to log in.

3. In the navigation blade on the left side of the portal, click Resource groups, and then click the LON-
HADOOP-SERVER resource group.

4. In the LON-HADOOP-SERVER blade, click the LON-HADOOP virtual machine.


5. In the LON-HADOOP blade, click Stop.

6. Close the LON-HADOOP blade.

7. Close Internet Explorer.

Course Hardware Level


To ensure a satisfactory student experience, Microsoft Learning requires a minimum equipment
configuration for trainer and student computers in all Microsoft Learning Partner classrooms in which
Official Microsoft Learning Product courseware is taught.

 Processor:

o 2.8 GHz 64-bit processor (multi-core) or better

 AMD:

o AMD Virtualization (AMD-V)

o Second Level Address Translation (SLAT) - nested page tables (NPT)

o Hardware-enforced Data Execution Prevention (DEP) must be available and enabled (NX Bit)

o Supports TPM 2.0 or greater

 Intel:

o Intel Virtualization Technology (Intel VT)

o Supports Second Level Address Translation (SLAT) – Extended Page Table (EPT)
o Hardware-enforced Data Execution Prevention (DEP) must be available and enabled (XD bit)

o Supports TPM 2.0 or greater

 Hard Disk: 500GB SSD System Drive


 RAM: 32 GB minimum

 Network adapter

 Monitor: Dual monitors supporting 1440X900 minimum resolution

 Mouse or compatible pointing device

 Sound card with headsets

In addition, the instructor computer must:

 Be connected to a projection display device that supports SVGA 1024 x 768 pixels, 16 bit colors.

 Have a sound card with amplified speakers.



Module 1
Microsoft R Server and Microsoft R Client
Contents:
Module Overview 1-1 

Lesson 1: Introduction to Microsoft R Server 1-2 

Lesson 2: Using Microsoft R Client 1-7 

Lesson 3: The ScaleR functions 1-15 

Lab: Exploring Microsoft R Server and Microsoft R Client 1-20 

Module Review and Takeaways 1-25 

Module Overview
In this module, you will learn the basics of how to use the Microsoft® R Server and the Microsoft R Client.
This will be the principal environment you will use to interact with R when dealing with data at scale. The
module will give an overview of what Microsoft R Server actually is, in addition to key concepts (such
as compute contexts) that you will return to throughout this course. You'll then learn how to connect to
R Server from R Client, explore the various environments for interacting with R, connect to remote
servers, and transfer data between local and remote sessions. Finally, you will learn how to use the
ScaleR™ functions to handle distributed computations transparently. You will find that many of these
functions are analogous to well-known nonparallel functions in base R.

Objectives
In this module, you will:

 Learn the purpose of Microsoft R Server.

 Learn how to use Microsoft R Client.

 Learn about the ScaleR functions.



Lesson 1
Introduction to Microsoft R Server
Open Source R is an excellent tool for the modeling and prototyping of data science algorithms. However,
it often lacks the speed and stability to be effective in either a production environment or when working
with datasets too large to fit in memory. Microsoft R Server is a server for hosting and managing parallel
and distributed R processes on servers (both Linux and Windows®), clusters running either Hadoop or
Apache Spark, or database systems such as Teradata and SQL Server®. It gives you an infrastructure for
distributing a workload across multiple nodes (chunking) so that you can run R jobs in parallel, and then
reassemble the results for downstream analysis and/or visualization.

Lesson Objectives
In this lesson, you will learn about:

 The features and components of Microsoft R Server.


 The environments in which you can run R Server.

 R Server compute contexts for running R in different environments.

 Operationalization, for running R in a cluster.

What is Microsoft R Server?


R Server is the Microsoft solution for performing
enterprise-scale R. You can use R Server to analyze
big datasets that can span multiple computers. R
Server is also designed to provide a distributed
processing environment, enabling you to split
large jobs into pieces that can be run in parallel.

R Server consists of the following major elements for analyzing big data:

 Microsoft R Open. This is Microsoft’s distribution of Open Source R. It is 100 percent compatible with the standard R language.

 ScaleR. This is the engine that can partition massively large datasets into chunks and distribute them
across multiple compute nodes or database platforms such as Teradata or SQL Server. ScaleR
functions are provided by the RevoScaleR package. You use these functions to remove a lot of the
complexity of dealing with parallel and distributed computation—they are often analogous to many
well-known data manipulation functions in base R. You also use ScaleR functions to, for example, test
out models on small local datasets, and then run them on huge datasets on a cluster with very few
changes in your code.

RevoScaleR Functions
https://aka.ms/uihq8b

 Machine learning algorithms. Functions that employ machine learning algorithms such as
regression, clustering and partitioning methods on ScaleR distributed data, are available in the
MicrosoftML package. As with ScaleR functions, the complexity of dealing with chunked data is
abstracted so you can concentrate on tuning your models.
Microsoft ML: State-of-the-Art Machine Learning R Algorithms from Microsoft Corporation
https://aka.ms/nbexpl
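To make these components concrete, the following snippet is a minimal sketch rather than code from the course labs: the file names, data sources, and column names are illustrative. It imports a CSV file into the chunked XDF format that ScaleR works with, summarizes and models the data with ScaleR functions that parallel base R's summary() and lm(), and then trains a classifier with a MicrosoftML algorithm:

library(RevoScaleR)    # usually loaded automatically by R Client and R Server
library(MicrosoftML)

# Import a CSV file into a chunked .xdf file (illustrative file names)
airData <- rxImport(inData = "flights.csv", outFile = "flights.xdf", overwrite = TRUE)

# Summary statistics and a linear model over the chunked data;
# analogous to summary() and lm() in base R
rxSummary(~ Delay + Distance, data = airData)
linModel <- rxLinMod(Delay ~ Distance, data = airData)

# Train a binary classifier with a MicrosoftML algorithm over the same chunked data
# (the Late column is hypothetical)
classifier <- rxLogisticRegression(Late ~ Delay + Distance, data = airData)
scores <- rxPredict(classifier, data = airData)

You can develop and test calls such as these against a small local file, and then rerun them largely unchanged against a much larger dataset by switching the data source and compute context.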

In addition to these analysis components, R Server includes:

 Platform-specific components that enable you to write essentially the same code, irrespective of
whether you are running on a Linux server, a Hadoop or Spark cluster, or a Teradata or SQL Server
database.

 The operationalization engine for connecting to a remote R server and for deploying R code to
production as a RESTful web service.
Introducing Microsoft R Server
https://aka.ms/vwuke0

R Server environments
You can run R Server on a variety of different
platforms depending on your requirements:

Windows/Linux
You can install R Server on a standalone Windows
or Linux machine. In this case, R Server runs as a
background process that can be accessed from a
variety of front ends including the standard R
Client, RStudio and Visual Studio® (connection via
these tools is covered in Lesson 2). You might also
have a large remote server on which you can
install R Server, and to which you can connect
from your laptop or desktop (see the topic Using a
remote R server later in the module). On large servers, you can use the ScaleR functions to parallelize your
workload over multiple processor cores. You can also install R Server on a network of locally connected
servers.

Hadoop MapReduce/Spark
For very large datasets, having multiple cores on a large server might still not be enough for your needs.
In this case, you might need to farm out your computations over a cluster of multiple compute nodes. R
Server is available on both Hadoop and Spark clusters in the cloud. Here you can make use of the HDFS
distributed file system and MapReduce or Spark frameworks for distributed computation.

Teradata/SQL Server
Both of these platforms provide high performance, high capacity data storage capabilities that are well
suited to working with the high performance analytics available in R Server. You use ScaleR functions to
get data in and out of the database quickly, and undertake high performance in-database analytics.

Using a remote R server


Although you can use R Server on a local machine, the power of the platform is only realized when you run analytics on a remote server or cluster that has many more resources available.

R Server supports remote execution. You can issue commands to a remote R server instance to offload heavy processing onto the remote server. You perform interactive analyses just like you can on a local machine—you can also run large batch jobs.

You can execute R code remotely through the command line in interactive console applications, in R scripts that call functions from the mrsdeploy package, or from code that calls the operationalization APIs. You can enter R code just as you would in a local R console.

mrsdeploy functions
https://aka.ms/xhll16

With remote execution, you can:

 Log in to and out of an R server remotely—you don’t need to remain logged in to a remote process while it is running.

 Execute R code remotely, either interactively or as scripts.

 Reconcile any differences between the local and remote environments and generate a report of those differences. This can help you to ensure that a remote environment includes the necessary packages to run your scripts.

 Work with R objects and files remotely.

 Create and manage snapshots of the remote environment for reuse.

You use the mrsdeploy functions for remote execution. However, you must first set up operationalization
—this is described in the topic Deploying Your Code later in this lesson.

Compute contexts
To execute parallel or distributed computations on
an R server or cluster, you must first define a
compute context. This is the entry point to the
distributed computation engine and the heart of
any ScaleR application. The compute context sets
up internal processes such as logging, and
provides the information that the server needs to
execute code remotely, including the back-end
engine (for example, Hadoop or Spark), database
connection information, and specifications for how
to handle output.

After the compute context is established, you can use it to create XDF data files, broadcast variables
(relatively small variables or data to be shared across all nodes) and run jobs.
Note that each compute context has its own environment and limitations. This means that you will often
need to configure the environment to load specific packages in different compute contexts. Not all
compute contexts support all ScaleR functionality. For example, the Teradata compute context can only
interact with the Teradata database, not with CSV or XDF data.

For more detail on compute contexts, see Module 5: Parallelizing Analysis Operations and Module 8:
Processing Big Data in SQL Server and Hadoop.

Deploying your code to R Server


After you have designed and built your analysis
models, you will want to make them available in
production, quickly and easily. Traditionally, this
has been a difficult problem for R, often requiring
full rewrites in faster, lower-level languages to
make the algorithms production ready.

R Server provides the operationalization engine to solve this problem. You use this engine to deploy parallel and distributed R code in production as a web service. You can also use it to connect to remote R servers—you use the functions in the mrsdeploy package to do this.

After installation, you need to configure operationalization to deploy R code as web services or to connect
to a remote R server.

The architecture of R Server supports a clustered configuration. All configurations have at least a single
web node and a single compute node:

 A web node acts as an HTTP REST endpoint with which you can interact directly using API calls. The
web node accesses data in the database, and sends jobs to the compute node.

 A compute node executes R code as a session or service. Each compute node has its own pool of R
shells.

When you work with remote R servers or web services, your local R Client or app sends instructions to the
web node. The web node then passes instructions to a database service, and/or one or more compute
nodes, to perform the analytic heavy lifting.

Note that, if your code requires access to resources such as data files, you must ensure that these items
are accessible to all nodes. For example, you should place any text files on a fast shared device that can be
accessed by using the same path name from all nodes.

You can set up a single web node and compute node on a single machine, as a one-box configuration.
You can also install multiple components on multiple machines.

Configuring R Server for Operationalization (One-Box Configuration)


https://aka.ms/bt29ax
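
Once operationalization is configured, you can publish an R function as a web service by using the mrsdeploy package. The following sketch is illustrative only; the endpoint URL, service name, and scoring function are hypothetical, and credentials are requested interactively:

Publishing a simple web service (sketch)

library(mrsdeploy)

# Authenticate against the operationalized server; you are prompted for credentials
remoteLogin("http://my-rserver:12800", session = FALSE)

# A simple scoring function to expose as a service (it trains a small model on each call,
# which is acceptable for a sketch but not for production)
manualTransmission <- function(hp, wt) {
  model <- glm(am ~ hp + wt, data = mtcars, family = binomial)
  predict(model, newdata = data.frame(hp = hp, wt = wt), type = "response")
}

# Publish the function; the web node exposes it as a REST endpoint
api <- publishService(
  name = "mtcars_service",
  code = manualTransmission,
  inputs = list(hp = "numeric", wt = "numeric"),
  outputs = list(answer = "numeric"),
  v = "v1.0.0"
)

# Consume the service from R and inspect the result
result <- api$manualTransmission(120, 2.8)
print(result$output("answer"))

remoteLogout()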

Check Your Knowledge


Question

Which of the following is not an advantage of using R Server over open source R?

Select the correct answer.

The ScaleR functions in R Server improve distribution and parallelization of analysis operations across nodes in a cluster.

R Server can run advanced machine learning algorithms implemented by the MicrosoftML package.

R Server runs exclusively on Windows.

R Server supports the remote execution of R code.

R Server enables you to package and deploy code as a web service.



Lesson 2
Using Microsoft R Client
Microsoft R Client is a freely available data science tool for advanced analytics. It is built on the Microsoft
Open R distribution, so you have access to the full power of R and can install any R packages—for
example, through CRAN (https://cran.r-project.org/). Like R Server, R Client also has access to the powerful
ScaleR functionality for parallel and distributed computation.

Lesson Objectives
In this lesson, you will learn:

 How R Client differs from R Server.

 How to configure the environments in which you can run R Client.

 How to connect to a remote R server from R Client.

 How to transfer objects between an R Client session and a remote R server session.

How R Client compares to R Server


Despite the many similarities with R Server, R Client has the following important limitations you need to be aware of:

 It can only process data in-memory on the machine on which it is installed. You are therefore limited to the resources available locally.

 The ScaleR functions are limited to two threads, even if you have more cores than this on your local machine.

 Chunking is not supported. R Client has to fit all the data it uses into memory.

With these features and limitations, R Client is an effective replacement for standard open-source R. However, to benefit from disk scalability, performance, and speed, you can push the computation to R Server instances running on remote servers, clusters, or database services. In this case, R Server is the back-end process that performs the heavy lifting: executing code, running models, fetching data from database services, and so on. You don’t interact with R Server directly; instead, you work in R Client, which sends the code to the remote R servers.

Get Started with Microsoft R Client


https://aka.ms/m89d0i

Using R Client
After you have installed R Client, you can interact
with it using the built-in R GUI (Rgui.exe). On
Linux, you can use the R terminal-based
application. You can also run standalone scripts
using Rscript. You can find these applications
under the installation folder—this is typically
C:\Program Files\Microsoft\R client\R_SERVER\bin
on Windows, or /usr/bin on Linux.

These environments can become quite limiting, and you will probably choose to interact with R Client using an integrated development environment (IDE) that provides a wider range of editing, visualization, and development options. The two main IDEs for R Client are R Tools for Visual Studio (RTVS) and RStudio.

RTVS is an add-in for Microsoft Visual Studio (including the free Visual Studio Community edition). It is
the IDE recommended by Microsoft to use with R Client. To use it, you will first need to install Visual
Studio and then add the R Tools package.

Visual Studio Community


https://aka.ms/eu881i

Setup or Configure R Tools


https://aka.ms/ld1mf8

If you already have another version of R installed, you will need to reconfigure RTVS to interact with R
Client. You do this as follows:

1. Launch RTVS.

2. From the R Tools menu, choose Change R to Microsoft R Client.


When you run RTVS, R Client is now the default R engine.

RStudio is a free IDE for R by the developers of popular third-party R packages like dplyr, Shiny and
sparklyr. It includes an R console, a syntax-highlighting editor that supports direct code execution, and
tools for plotting, viewing your command history, debugging, and managing your workspace. If you
already have another version of R installed you will need to reconfigure RStudio:

1. Launch RStudio.
2. From the Tools menu, choose Global Options.

3. On the General tab, update the path to R to point to C:\Program Files\Microsoft\R Client\R_SERVER\bin\x64.

When you launch RStudio, R Client is now the default R engine.



Connecting to a remote R server


Before you can connect to a remote R server, it
must first be operationalized. This means that the
server is configured to execute remotely and that
it has at least one web node and at least one
compute node available. If you have installed R
Server on a Windows machine, setting up a single
box configuration is a matter of opening the R
Server admin tool as an administrator and
selecting the option “Configure R Server for
Operationalization”, and then “Configure for one
box”.

Configuring R Server for Operationalization (One-Box Configuration)


https://aka.ms/bt29ax

When the remote R server is operationalized, and you have successfully logged in to the remote server
using the remote execution functions from the mrsdeploy package, you can start a remote R session.

Connecting to R Server with mrsdeploy


https://aka.ms/x52zod

Logging in
The mrsdeploy package provides the following functions for authenticating against a remote R server
and starting a new remote R session, for switching back and forth between remote and local sessions, and
for closing a connection to a remote server:

 remoteLogin(). Use this function to connect with a server on your own local network. By default, it
prompts for a username and password (you can also pass these details as parameters to the function),
creates a remote R session, and gives access to the R command line. You can also configure Active
Directory® authentication.

 remoteLoginAAD(). Use this function to connect to a remote R server in the cloud. It authenticates
the user through the Azure® Active Directory, then creates a new session in the same way as
remoteLogin.

 pause(). This command temporarily drops you out of the remote session, so that you can work
locally, but keeps the connection to the remote server open.

 resume(). This command reconnects you to a remote session that you have paused.

 remoteExecute(). Use this function to execute a block of code or an R script on a remote server.

 remoteLogout(). This function logs you out of a remote session.

 exit. This is a command rather than a function. It closes a remote session.

After you authenticate on either a local network or the cloud, the web node returns an access token. The
R console passes this token in the request header of every subsequent mrsdeploy request.

Under the default parameters noted above, you will see the default prompt REMOTE> in your R console
window. The code you enter here will run on the remote server. You can terminate the remote R session
by either typing exit at the REMOTE> prompt or using the remoteLogout() function.
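
The following sketch shows one common pattern: logging in with session = TRUE but commandline = FALSE, so that you stay at the local prompt and submit code with remoteExecute(). The endpoint URL is a placeholder, and credentials are requested interactively:

Running code remotely with remoteExecute (sketch)

library(mrsdeploy)

# Log in and create a remote session, but keep the local command line active
remoteLogin("http://my-rserver:12800", session = TRUE, commandline = FALSE)

# Run a block of R code on the remote server and capture the result
result <- remoteExecute("mean(rnorm(1000))")
print(result)

# Log out when you have finished
remoteLogout()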

Execute on a remote Microsoft R Server


https://aka.ms/lxv3fx

Transferring objects between sessions


After you have run an analysis in a remote session, you might want to retrieve the output and load it into your local R session. For example, you might want to work on the model fit object returned from running a linear model. You might also want to transfer variables and small datasets between a local and remote session, or copy files. The mrsdeploy package provides the following functions to perform these tasks:

 getRemoteObject(). Transfers an R object from the remote server workspace into the local client workspace.

 putLocalObject(). Transfers an R object from your local workspace to the server workspace.

 getRemoteWorkspace(). Transfers all objects in the remote session into the local R session.

 putLocalWorkspace(). Transfers all objects in the local R session to the remote session.

 getRemoteFile(). Downloads a binary or text file from the working directory of the remote R session
into the working directory of the local R session.
 putLocalFile(). Uploads a file from the local machine and writes it to the working directory of the
remote R session. This function is often used if a data or parameter file needs to be accessed by a
script running on the remote R session. This function copies the file to the working directory of the
remote server. If you need to transfer large data files, create a manual out-of-band copy to a shared
storage location that is accessible to all servers in the cluster using the same path.

 listRemoteFiles(). Returns a list of all the files that are in the working directory of the remote session.

 deleteRemoteFile(). Deletes a file from the working directory of the remote R session.

Note: You should never try to transfer large datasets or files from remote to local. Your
local server is unlikely to be able to cope with the size of the data. The file transfer functions in
the mrsdeploy package are principally designed for handling small configuration and parameter
files that can be efficiently transmitted across nodes.
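
The following sketch shows how you might move a small parameter object to the remote workspace and retrieve a result. It assumes you are already logged in with remoteLogin(..., session = TRUE) and have paused back to the local prompt; the object names are hypothetical:

Transferring small objects between sessions (sketch)

# A small, hypothetical set of parameters created in the local workspace
localParams <- list(alpha = 0.05, maxIterations = 100)

# Copy the local object into the remote workspace so that remote code can use it
putLocalObject(c("localParams"))

# ... run the analysis remotely, producing a (hypothetical) object called modelFit ...

# Bring the remote result back into the local workspace for inspection
getRemoteObject(c("modelFit"))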

Demonstration: Using R Client with Visual Studio and RStudio


In this demonstration, you will see how to configure RTVS and RStudio to work with R Client. This
demonstration shows the Visual Studio Community edition, and the open source version of RStudio
desktop.

Demonstration Steps

Configure and use RTVS


1. Click the Windows Start button, type Visual Studio 2015, and then click Visual Studio 2015.

2. In Visual Studio 2015, on the R Tools menu, click Data Science Settings.

3. In the Microsoft Visual Studio message box, click Yes. Visual Studio arranges the IDE with four
panes in a layout that resembles that of RStudio.

4. In the R Interactive pane, type mtcars, and then press Enter. The contents of the mtcars sample data
frame should appear.

5. Type mpg <- mtcars[["mpg"]], and then press Enter. Notice that the variable mpg appears in the
Variable Explorer window.

6. Click the R Tools menu. This menu contains most of the commands that you use for working in the
IDE. These include:
 The Session menu, which contains commands that enable you to stop a session and interrupt a
long-running R function.

 The Plots menu, which enables you to open new plot windows, and save plots as image files and
PDF documents.

 The Data menu, which enables you to import datasets, and manage data sources and variables.

 The Working Directory menu, which enables you to set the current working directory.

 The Install Microsoft R Client command, which downloads and installs R Client if it has not
already been configured.

7. On the File menu, point to New, and then click File.

8. In the New File dialog box, in the Installed pane, click R. The center pane displays the different types
of R file you can create, including documentation and markdown.

9. In the center pane, click R Script, and then click Open. The source editor window appears in the top
left pane.

10. In the source editor window, type print(mpg), and then press Ctrl+Enter. The statement is executed,
and the results appear in the R Interactive window. You use the source editor window to create
scripts of R commands. You can execute a single statement by using Ctrl+Enter, or you can highlight
a batch of statements using the mouse and press Ctrl+Enter to run them.

11. Click the File menu. This menu contains commands that you can use to save the R script, and load an
R script that you had saved previously.

Configure RStudio to work with RClient


1. Click the Windows Start button, click the RStudio program group, and then click RStudio.

2. In RStudio, on the Tools menu, click Global Options.

3. In the Options dialog box, verify that the R version is set to C:\Program Files\Microsoft\R
Client\R_SERVER. This is the location of Microsoft R Client. Note that you must download and install
R Client separately, before running RStudio.

4. Click Change. The Choose R Installation dialog box enables you to switch between different
versions of R. If you have just installed R Client, use the Choose a specific version of R to select it.
Click Cancel.

5. In the Options dialog box, click Cancel.

6. On the File menu, point to New File, and then click R Script. The script editor pane appears in the
top right window. Note the following points:

 As with RTVS, you can use this window to create scripts, and run commands by using Ctrl+Enter.

 You can also run commands directly from the Console window.
 The Environment window displays the variables for the current session.

 The Session menu provides commands that you can use to interrupt R, terminate a session,
create a new session (which starts a new instance of RStudio), and change the working directory.

Perform operations on a remote server

Note: You can perform these tasks either in RTVS or RStudio, according to your preference
of IDE.

1. In the script editor window, enter and run (Ctrl+Enter) the following command:

remoteLogin("http://LON-RSVR.ADATUM.COM:12800", session = TRUE, diff = TRUE,


commandline = TRUE)

2. In the Remote Server dialog box, enter admin for the user name, Pa55w.rd for the password, and
then click OK.

3. Verify that the login is successful; the interactive window should display a list of packages that are
installed on the client machine but not on the server (you might need to install these if your script
uses them), followed by the REMOTE> prompt.

4. In the script editor window, enter and run the following command:

mtcars

This command displays the mtcars data frame again, but this time it is being run remotely.

5. In the script editor window, enter and run the following command:

firstCar <- mtcars[1, ]

This command creates the firstCar variable in the remote session. Note that it doesn't appear in the
Variable Explorer/Environment window.

6. In the script editor window, enter and run the following command:

pause()

Notice that the interactive window no longer displays the REMOTE> prompt; you are now running in
the local session.

7. In the script editor window, enter and run the following command:

print(firstCar)

This command should fail with the error message object 'firstCar' not found. This occurs because
the firstCar variable is part of the remote session, not the local one.

8. In the script editor window, enter and run the following command:

getRemoteObject(c("firstCar"))

9. Verify that the interactive window displays the response TRUE to indicate that the command
succeeded. Also note that the firstCar variable now appears in the Variable Explorer/Environment
window.

10. In the script editor window, enter and run the following command:

print(firstCar)

11. This command should now succeed, and display the data for the first observation from the mtcars
data frame.

12. In the script editor window, enter and run the following command:

resume()

The REMOTE> prompt should reappear in the interactive window. You are now connected to the
remote session again.

13. In the script editor window, enter and run the following command:

exit

This command closes the remote session and returns you to the local session.

Check Your Knowledge

Question

How can you run interactive code remotely in R Server from R Client?

Select the correct answer.

Use the remoteExecute function and specify the code to run on the remote R Server.

Specify the name of the server on which to run the code as the remoteServer parameter to
the ScaleR functions

Deploy the code to the remote R Server.

Use the remoteLogin function to connect to the remote R Server and start an interactive
session on that server.

You can't. You must log in to the remote server manually and start an interactive R session
there.

Lesson 3
The ScaleR functions
You use the ScaleR functions to manipulate and analyze big data in Microsoft R in a consistent, efficient
and scalable way. It’s consistent in terms of a common API to the functions, efficient because it makes
good use of whatever hardware configuration you have, and scalable because you can run algorithms on anything from a local laptop right up to a massively distributed cluster.

Lesson Objectives
After completing this lesson, you will be able to:

 Explain the purpose of the ScaleR functions.

 Describe some of the common ScaleR functions.

 Explain how the ScaleR functions use compute contexts.

 Describe best practices for using the ScaleR functions when working with big data.

Purpose of the ScaleR functions


The ScaleR functions are central to the data science workflow in Microsoft R. Use them to:

 Develop analyses locally on smaller datasets, and then deploy to servers, clusters, or database services at scale, with minimal code changes.

 Process data that is too big to fit in memory.

 Distribute analyses over multiple cores, processors or nodes in a cluster.

ScaleR works on chunked data, which is converted to the efficient XDF file format so that it can be distributed across processors or nodes. This process is an important part of the data science workflow, although you can still work directly with data stored in a text, SPSS or SAS file, or in a database. XDF data is optimized for very fast reads and writes to arbitrary numbers of rows and columns. XDF files can also be compressed, if disk space is more valuable to you than read speed.

The ScaleR functions are implemented in the RevoScaleR package that is installed with R Client and R
Server. All ScaleR function names start with either rx, for data manipulation or analysis functions that
operate independently of data source, or Rx, for functions that are class constructors for specific data
sources or compute contexts.
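
The following sketch illustrates the naming convention: Rx constructors describe where data lives, and rx functions act on that data. The file names are placeholders, and the ArrDelay column is assumed to exist in the data:

Rx constructors and rx functions (sketch)

library(RevoScaleR)

# "Rx" functions are class constructors that describe data sources (and compute contexts)
airlineCsv <- RxTextData(file = "airline.csv")
airlineXdf <- RxXdfData(file = "airline.xdf")

# "rx" functions operate on data, largely independently of the data source type
rxImport(inData = airlineCsv, outFile = airlineXdf, overwrite = TRUE)
rxSummary(~ ArrDelay, data = airlineXdf)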

Summary of the ScaleR functions


The RevoScaleR package contains more than 100 functions. They are split into the following groups:

 Input/output. Functions that read in data from different data sources, and retrieve variable information from data frames or XDF files.

 Data manipulation and chunking. Functions that transfer data into chunked XDF format, and that sort, merge and split chunked data.

 Descriptive statistics and cross tabulation. Functions for summarizing data, applying summaries to groups, cross tabulations, marginal summaries, calculating tests of association and rank correlation.

 Statistical modeling. Functions for performing statistical analyses efficiently on large, chunked data.
You can build, and predict from, linear, logistic and generalized linear; tree-based partitioning; and K-
means clustering models.

 Basic graphing. Functions to efficiently produce on-the-fly histograms and line graphs over large
datasets.

 Compute contexts. Class constructors for the different compute context objects (Rx functions).

 Data Sources. Class constructors for data source objects (Rx functions).

 High performance and distributed computing. Lower level functions for high performance and
distributed computing, such as enabling you to run arbitrary code on a cluster.

 Utilities. Miscellaneous functions. For example, functions that check the state of a cluster.

RevoScaleR Functions
https://aka.ms/qyh9pj

Many ScaleR functions are analogous to common data manipulation, summarizing or modeling functions in base R. For example, rxSort() is similar to sort() in base R, and rxLinMod() is like lm(). The difference is that the rx functions are designed and optimized to operate in a distributed environment and process XDF data. Other functions provide a degree of interoperability with base R functions. For example, rxResultsDF() extracts summary results from an rxCrossTabs(), rxCube(), or rxSummary() call as a data frame, which you can then pass to many base R functions for processing.
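
The following sketch shows this interoperability, assuming a hypothetical XDF file that contains ArrDelay and a DayOfWeek factor:

Combining ScaleR and base R functions (sketch)

# Summarize arrival delay by day of week directly over the XDF file
cube <- rxCube(ArrDelay ~ DayOfWeek, data = "flights.xdf")

# Convert the ScaleR result into an ordinary data frame ...
delaysByDay <- rxResultsDF(cube)

# ... and continue with base R functions
barplot(delaysByDay$ArrDelay, names.arg = delaysByDay$DayOfWeek)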

To see how ScaleR functions map to their base R equivalents, see:

Comparison of Base R and ScaleR Functions


https://aka.ms/km9god

You can use this information to understand where and when you might consider using ScaleR functions in
place of base R. Most analyses you produce are likely to involve a combination of base R and ScaleR
functions.

The ScaleR functions and compute contexts


You provide the following information when you perform analyses using the ScaleR functions:

 Where the computations should take place (the compute context)—for example, locally, on a server, cluster, or database.

 What data to use (the data source)—for example, a text file, XDF object, or database connection.

By default, most ScaleR functions attempt to run in your current compute context. However, many ScaleR functions can take a compute context as an argument. Note that the local compute context supports the most data source inputs, including text, XDF, SPSS, SAS and databases. However, there are some restrictions on the functions and features available in other compute contexts. For example, the Teradata context (RxInTeradata) only operates on the Teradata data source (RxTeradata). This makes sense because computations that you run in this context happen within the Teradata database.

Data Sources
https://aka.ms/xpxb3c

You use the compute context arguments to ScaleR functions to rapidly prototype your algorithms, and
then deploy them in an efficient way. First, you test your algorithm locally on a sample of your data, using
the default local session compute context. When you are happy that the algorithm is doing what you
expect of it, running at scale is often simply a question of changing the compute context to a server,
cluster or database.
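
The following sketch illustrates this pattern using a SQL Server compute context. The file name, server name, connection string, and table name are placeholders:

Prototyping locally and scaling out (sketch)

# 1. Prototype locally on a sample of the data, using the default local compute context
localModel <- rxLinMod(ArrDelay ~ DayOfWeek, data = "flightSample.xdf")

# 2. To run at scale, switch the compute context; the modeling code itself does not change
sqlConnString <- "Driver=SQL Server;Server=MYSERVER;Database=AirlineData;Trusted_Connection=True"
rxSetComputeContext(RxInSqlServer(connectionString = sqlConnString))

bigData <- RxSqlServerData(connectionString = sqlConnString, table = "FlightDelays")
fullModel <- rxLinMod(ArrDelay ~ DayOfWeek, data = bigData)

# Return to the local compute context when you have finished
rxSetComputeContext(RxLocalSeq())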

Depending on the file system, there might also be differences in availability within a single data source type. For example, if you create XDF files in your local session in R Client, these files are not chunked, unlike XDF files created on the Hadoop distributed file system (HDFS). Consequently, you can’t use chunking locally. You might also need to split and distribute your data across the available nodes of your cluster. For a more detailed description of this process, see Module 5: Parallelizing Analysis Operations.

Best practices for computing with big data


Working with big data analytics presents
challenges you might not have faced while
working with either smaller scale analytics usually
done with R, or high performance computing.
Small scale analytics is generally performed on in-
memory datasets over one, or a few, local
processors. High performance computing is
traditionally CPU intensive using small data—for
example, running vast numbers of mathematical
simulations.

With big data analytics, data is too large to fit in memory, so you need to deal with transferring data to
the cores, processors or nodes.
This relies on efficient disk I/O, data locality, threading, and data management in RAM.

The RevoScaleR package provides you with tools to address speed and capacity issues involved in big data
analytics:

 Data management and analysis functionality that scales from small, in-memory datasets to huge
datasets stored on disk.

 Analysis functions that are threaded to use multiple cores and computations that can be distributed
across multiple computers (nodes) on a cluster or in the cloud.

While much of the complexity underlying big data analytics has been abstracted in ScaleR analytic
functions, there are still best practices you need to be aware of. If you are not, you will likely incur costs, in
terms of time, money, capacity, and frustration. The following list summarizes best practices for working
with big data:

1. Upgrade your hardware. This is often the easiest way to deal with bigger datasets. Getting a more
powerful laptop or server might mean you can perform your analytics in-memory without having to
worry about distributed computing. If you can get away with it, try and work in memory.

2. Minimize copies of data. Making multiple copies of your data can severely slow down computation.
Bear in mind, too, that many base R analytic functions like lm() and glm() make several copies of the
data as they run. The ScaleR statistical analysis functions, such as rxLinMod(), minimize this as much as possible.
3. Transfer data to XDF format. This enables the data analysis and manipulation functions to operate
on chunks that can be farmed out across whatever processors or nodes you have available. Also, the
computation time of many analysis functions increases faster than linearly, so total compute time per
node is also reduced with chunking. Another advantage is that, when you run a model on an XDF file,
only the variables identified in the model formula are read.

4. Consider data types. Numerical computations with integers can be many times faster than
computations with floats. Try to convert as many of your numeric data types to integer as you can
without loss of information—for example, you can multiply 17.4 by 10 to give 174, and then store it as an integer.
Note: RevoScaleR provides several tools for handling integers. For instance, within a formula
statement (for example, in rxLinMod(), rxGlm(), rxCube() or rxCrossTabs()), you can wrap a variable in
the F() function to coerce numeric variables into factors, with the levels represented by integers (note
that you cannot use this outside of a formula). Also, you can use rxCube() to quickly tabulate factors
and their interactions for arbitrarily large datasets.

5. Use vectorized functions. This is a general tip for efficiency in R. Loops are slow in R, so try to make
use of vectorized code as much as possible. If there is not a vectorized version of the function you
want to use, you could consider rewriting the function in C++ using the Rcpp package. This package
makes it simple to integrate C and C++ code into R (https://cran.r-project.org/package=Rcpp). All of
the ScaleR functions are written in optimized C++ for maximum speed and efficiency.

6. Be careful when sorting big data. Sorting big data is a time-intensive operation. When you have to
sort huge datasets, use rxSort() on XDF—and then only when you really need to. Use rxQuantile() to
efficiently produce quantiles or medians. If you need to summarize your data by group, use
rxSummary(), rxCube(), or rxCrossTabs(); they make a single pass through the original data and
accumulate the desired statistics by group on the fly.

7. Use row-oriented data transformations where possible. Try to ensure that any data
transformations on a row are not dependent on values in other rows. Your transformation expression
should give the same result, even if only some of the rows of data are in memory at one time. You
can perform data manipulations with lags or leads but these require special handling.

8. Process data transformations in batches. If you need to run multiple transformations on your
dataset, use rxDataStep() to do this in a single pass through the data, processing the data a chunk at
a time. Making multiple passes through your data can be very time consuming.

9. Be careful with categorical variables. If you have a factor with many levels, you might find there
are chunks that don’t contain all the levels. This means that you must explicitly specify the levels or
you might end up with incompatible factor levels from chunk to chunk. rxImport() and rxFactors()
provide functionality for creating factor variables in big datasets.

10. Get more nodes. If the above tips still don’t help, you might need to invest in more nodes to
distribute across. Data analysis algorithms tend to be I/O bound when data cannot fit into memory,
so multiple hard drives can be even more important than multiple cores.

Tips on Computing with Big Data in R


https://aka.ms/iw8eyq
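
The following sketch illustrates two of the practices above: coercing a numeric variable to a factor with F() inside a formula (tip 4), and applying several transformations in a single pass with rxDataStep() (tip 8). The file and column names are hypothetical:

Applying big data best practices (sketch)

# Assumes an XDF file with numeric columns Year, Distance and ArrDelay

# Tip 4: coerce a numeric variable to a factor inside a formula with F()
rxCube(ArrDelay ~ F(Year), data = "flights.xdf")

# Tip 8: apply several transformations in a single pass through the data
rxDataStep(inData = "flights.xdf", outFile = "flightsClean.xdf",
           transforms = list(
               DistanceKm = Distance * 1.609,    # unit conversion
               LongDelay = ArrDelay > 60         # derived logical flag
           ),
           overwrite = TRUE)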

Verify the correctness of the statement by placing a mark in the column to the right.

Statement Answer

You run the rxSummary function over a very large dataset in a session running in R Client. The rxSummary function will attempt to read all of the data in the dataset into memory first. True or False?

Lab: Exploring Microsoft R Server and Microsoft R Client


Scenario
You work as a data scientist for the Department of Transportation. You are investigating how and when
domestic airline flights are delayed. The department wishes to provide information that could help to
minimize such delays and the consequent passenger frustration. You have a dataset containing flight
delay information for a number of years, starting in 2000. Your task is to determine whether there are any
specific patterns that occur in this data. Can these patterns be used to help predict when flight delays are
likely to occur?

You have been given a small subset of the flight delay data for the year 2000 as a local CSV file. You wish
to quickly examine its contents and schema to gain familiarity with the structure of the data and the tools
that you are going to use.

Objectives
In this lab, you will:
 Use R Client in Visual Studio (RTVS), or RStudio.

 Use ScaleR functions to perform some basic analyses.

 Perform analysis operations on a remote R server.

Lab Setup
Estimated Time: 45 minutes

 Username: Adatum\AdatumAdmin
 Password: Pa55w.rd

Before starting this lab, ensure that the following VMs are all running:

 MT17B-WS2016-NAT
 20773A-LON-DC

 20773A-LON-DEV

 20773A-LON-RSVR
 20773A-LON-SQLR

Exercise 1: Using R Client in RTVS and RStudio


Scenario
You wish to gain some familiarity with your development environment, and also take some time to look at
a small part of the flight delay data—just so that you can assess the structure and data that it contains.
You decide to use base R functions while you are getting used to the environment.

The main tasks for this exercise are as follows:

1. Start the development environment and create a new R script

2. Use the development environment to examine data

 Task 1: Start the development environment and create a new R script


1. Log in to the LON-DEV VM as Adatum\AdatumAdmin with the password Pa55w.rd.

2. Start your R development environment of choice (Visual Studio, or RStudio), and create a new R file.

 Task 2: Use the development environment to examine data


1. The data file you will be using, 2000.csv, is located in the E:\Labfiles\Lab01 folder. Set your working
directory to this folder.

2. Read the 2000.csv file into a data frame and look at the first 10 rows. Use the base R read.csv
function and the head function, rather than the ScaleR functions.

3. The data frame has a column named Month that contains the month number. Add a factor column
to the data frame called MonthName that contains the corresponding month name. You can find
month names in the month.name list. You should use the base R factor and lapply functions.

4. Generate a summary of the data frame using the summary function. Time how long it takes to
perform this operation using the system.time function.

5. Examine the data frame as follows:

 Display the name of each column.


 Display the number of rows in the data frame.

 Find the minimum value in the ArrDelay column. This column records the flight arrival delay
time in minutes.
 Find the maximum flight arrival delay.

6. Use the xtabs function to generate a cross tabulation showing the number of flights cancelled and
not cancelled for each month. The Cancelled column contains a value indicating whether a flight
was cancelled (1 – cancelled, 0 – not cancelled).

Note: Strictly speaking, xtabs is not a base R function, but it is commonly used by data scientists
performing cross tabulations.

Note: Record the console output, as this will be referenced in a later exercise.

Results: At the end of this exercise, you will have used either RTVS or RStudio to examine a subset of the
flight delay data for the year 2000.

Question: How many columns are there in the data frame, including the new MonthName
column?

Question: How many rows are there in the data frame?



Question: What are the minimum and maximum arrival delay times recorded in the data
frame?

Question: How many flights were cancelled in June?

Exercise 2: Exploring ScaleR functions


Scenario
Now that you have an understanding of the shape of the data, you want to analyze it using the ScaleR
functions. This gives you the opportunity to gain some familiarity with these functions. You can also verify
that you are using them correctly because the values they return should be the same as those generated
by using the base R functions.

The main tasks for this exercise are as follows:

1. Examine the data by using ScaleR functions

 Task 1: Examine the data by using ScaleR functions


1. Generate a summary of the data frame using the rxSummary function with the formula ~. . This
formula generates a summary for all numeric columns and factors in the data frame. Time how long it
takes to perform this operation using the system.time function, and compare the time taken to that
consumed by running the summary function over the same data frame in the previous exercise.

2. Run the rxGetInfo function over the data frame. This function retrieves the number of variables and
observations in the data frame. Verify that these values are the same as those you obtained in
exercise 1.

3. Run the rxGetVarInfo function over the data frame. This function retrieves more detailed information
about the variables in the data frame, such as their names and type. Verify that the MonthName
variable is a factor with 12 levels, each named after a month.

4. Run the rxQuantile function over the data frame. Specify the ArrDelay column as the first argument
and the data frame as the second. This function reports the quantiles for the data. The 0% quantile
should be the same as the minimum value reported earlier, and the 100% quantile should be the
maximum value.

5. Use the rxCrossTabs function to generate a cross tabulation of the number of cancelled/not
cancelled flights by month name. Use the formula ~MonthName:as.factor(Cancelled == 1).

Note that rxCrossTabs uses the ":" character to separate independent variables, rather than the "+"
of xtabs.

6. Use the rxCube function to generate a data cube over the same data. Verify that the values displayed
are the same as the cross tabulation.

7. Remove the data frame from session memory.



Note: Please record the console output, as it will be compared with results in a later exercise.

Results: At the end of this exercise, you will have used the ScaleR functions to examine the flight delay
data for the year 2000, and compared the results against those generated by using the base R functions.

Exercise 3: Performing operations on a remote server


Scenario
The data you have used so far is small enough to be analyzed locally, but you know that the entire dataset
will be much too big to do this, and will require R Server resources. You decide to test that you can
successfully connect to a remote server, perform operations, and retrieve the results back to a local
session.

The main tasks for this exercise are as follows:

1. Copy the data to the remote server

2. Read the data into a data frame

3. Examine the data remotely


4. Transfer the results from the remote session

 Task 1: Copy the data to the remote server


1. Create a remote session on the LON-RSVR server. This is another VM running R Server. Use the
following parameters to the remoteLogin function:

 deployr_endpoint: http://LON-RSVR.ADATUM.COM:12800

 session: TRUE

 diff: TRUE

 commandLine: TRUE

 username: admin

 password: Pa55w.rd

2. Pause the remote session and return to the local session.

3. Copy the file 2000.csv to the remote session. Use the putLocalFile function.

Note: For large files, you should perform this step as an out-of-band task and manually place the data on a network share that all nodes in the R Server cluster can access.

4. Return to the remote session.

 Task 2: Read the data into a data frame


1. In the remote session, read the data into a data frame. Use the same code that you utilized in exercise
1.

2. Add the MonthName column to the data frame as you did in exercise 1.

3. Display the first 10 rows of the data using the head function.

 Task 3: Examine the data remotely


 Repeat the operations that you performed in exercise 2 (rxSummary, rxGetInfo, rxGetVarInfo,
rxQuantile, rxCrossTabs, and rxCube), but save the results for each function in variables in the
session.

 Task 4: Transfer the results from the remote session


1. Pause the remote session.

2. Copy the variables that you created in the remote session to the local session using the
getRemoteObject function.

3. Print the variables you have just retrieved. Their contents should match the results you obtained in
exercise 2.

4. Log out of the remote session.

5. Save the script as Lab1Script.R in the E:\Labfiles\Lab01 folder, and close your R development
environment.

Results: At the end of this exercise, you will have performed analysis operations on a remote R server, and compared the results against those generated in the local session.

Module Review and Takeaways


In this module, you:

 Learned about Microsoft R Server.

 Learned how to use Microsoft R Client.


 Learned about the ScaleR functions.

Module 2
Exploring Big Data
Contents:
Module Overview 2-1 

Lesson 1: Understanding ScaleR data sources 2-2 

Lesson 2: Reading and writing XDF data 2-9 

Lesson 3: Summarizing data in an XDF object 2-21 

Lab: Exploring big data 2-32 

Module Review and Takeaways 2-39 

Module Overview
ScaleR™ uses the Extensible Data Format (XDF) to store data. This file format was developed by NASA for
managing very large datasets. It is extremely efficient and compact, and is ideally suited to performing big
data analysis. The ScaleR functions use this format for managing data, and it enables these functions to
access data in a piecemeal manner. This means that you can use the ScaleR functions to process datasets
that are arbitrarily large, and that are much too big to fit into the available memory of a computer.

Objectives
In this module, you will learn how to:

 Create and use ScaleR data sources.


 Import and transform data into the XDF format used by ScaleR.

 Summarize data held in an XDF file.



Lesson 1
Understanding ScaleR data sources
In this lesson, you will learn about the data sources that you can use in ScaleR. This lesson explains the
limitations on which sources you can employ in a distributed environment. This lesson also describes how
to access data held in SQL Server and Hadoop.

Lesson Objectives
After completing this lesson, you will be able to:

 Describe the common data sources provided by ScaleR.

 Read and write data by using a ScaleR data source.


 Work with SQL Server data.

 Explain how to use the ScaleR file system.

 Access the HDFS file system in ScaleR functions.

ScaleR data sources


The ScaleR functions can use data held in various
formats coming from a variety of sources (text,
SAS, ODBC, SQL Server, and others). To access this
data, you create a data source. Each type of data
source has a data source constructor that you use
to create it. The following table contains a list of
these constructors:

Location/Type of Data Data Source Constructor

Text (fixed-format or delimited) RxTextData

SAS RxSasData

SPSS RxSpssData

Database (must have appropriate ODBC driver installed) RxOdbcData

Teradata database RxTeradata

.xdf data files RxXdfData

SQL Server database RxSqlServerData

Hive database RxHiveData

Parquet data RxParquetData



It is important to understand that, although each data source is available when you use the R client on a
local computer, some distributed contexts, specifically Hadoop and Teradata, limit which data sources you
can use. This is due to the fundamental architectures of these contexts. The following table summarizes
which data sources you can use in which compute contexts:

Data Source                        Compute contexts in which it can be used

Delimited Text (RxTextData)        RxLocalSeq, RxHadoopMR, RxSpark

Fixed-Format Text (RxTextData)     RxLocalSeq

.xdf data files (RxXdfData)        RxLocalSeq, RxHadoopMR, RxSpark

SAS data files (RxSasData)         RxLocalSeq

SPSS data files (RxSpssData)       RxLocalSeq

ODBC data (RxOdbcData)             RxLocalSeq

Teradata database (RxTeradata)     RxLocalSeq, RxInTeradata

SQL Server (RxSqlServerData)       RxLocalSeq, RxInSqlServer

Hive (RxHiveData)                  RxSpark

Parquet (RxParquetData)            RxSpark

The following example shows how to create an RxTextData data source referencing a comma-delimited
text (CSV) file:

Connecting to a CSV file


claimsCSVFileName <- file.path(rxGetOption("sampleDataDir"), "claims.txt")
claimsCSVDataSource <- RxTextData(claimsCSVFileName)
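
After you create the data source, you can pass it to ScaleR functions in place of a file name. The following sketch assumes that the claims.txt sample file contains a numeric column named cost:

Using the data source (sketch)

# Inspect the schema and the first few rows, then summarize a column
rxGetInfo(claimsCSVDataSource, getVarInfo = TRUE, numRows = 5)
rxSummary(~ cost, data = claimsCSVDataSource)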

Reading and writing data using ScaleR data sources


After you have created a data source, you can use the functions rxOpen, rxClose, rxIsOpen, rxReadNext and rxWriteNext to retrieve and manipulate data. The following list summarizes these functions:

 rxOpen(src, mode = "r"). Use this function to open an existing data source so that you can use it to read data from the source. Note that ScaleR currently only supports read-only data sources by using this function.

 rxClose(src). Use this function to close an existing data source when you have finished reading from it.

 rxIsOpen(src, mode= ‘r’). Use this function to check whether a data source can be accessed.

 rxReadNext(src). When a data source is open, use this function to extract a chunk of data from it. A
chunk is a section or block of data rather than the full content of the file. The data can be returned as
a data frame or a list, depending on the value of the returnDataFrame property of the underlying
data source.
 rxWriteNext(from, to, …). Use this function to write a chunk of data to a data source.

The following code shows how to use these functions to read data from an RxXdfData data source. The
example reads the data a chunk at a time:

Reading from a data source


xdfSource <- …                              # name of XDF file (not specified here)
dataSource <- RxXdfData(xdfSource)          # create the XDF data source
rxOpen(src = dataSource, mode = "r")        # open the data source for reading
currentData <- rxReadNext(src = dataSource)
chunkNumber <- 1
while (nrow(currentData) > 0) {
    print(paste("Chunk ", chunkNumber, ", Number of rows: ", nrow(currentData), sep = ""))
    currentData <- rxReadNext(src = dataSource)   # read the next chunk
    chunkNumber <- chunkNumber + 1
}
rxClose(src = dataSource)                   # close the data source when finished

Working with SQL Server data sources


You use ScaleR to work with SQL Server by using the
RxSqlServerData data source. You can also access
SQL Server (and other relational databases) through
the RxOdbcData data source, but the SQL Server data
source is optimized for working with SQL Server and is
therefore more efficient.

ScaleR provides the following functions that you can use with the RxSqlServerData data source:

 RxSqlServerData. Use this function to construct a SQL Server data source. The parameters to this function specify the server to which to connect, the database to use, login information, and other connection parameters. You can use this function to connect directly to a table, or you can reference a SQL query that retrieves data.

 rxSqlServerDropTable. This function removes a table (and its contents) from a SQL Server database.

 rxSqlServerTableExists. Use this function to test whether a specified table exists in a SQL Server
database.

 rxExecuteSQLDDL. This function performs Data Definition Language (DDL) operations, enabling you
to perform operations such as creating new tables.

An important parameter to the RxSqlServerData function is rowsPerRead. This parameter controls how many rows the data source reads as a chunk. If you make this parameter too large, you might encounter poor performance because you don’t have sufficient memory to hold this volume of data. However, setting rowsPerRead to too small a value can also result in poor performance due to excessive I/O and inefficient use of the network. You should experiment with this setting to find the optimal value for your own configuration and dataset.

The rxSqlServerDropTable, rxSqlServerTableExists, and rxExecuteSQLDDL functions can specify a connection string that indicates on which instance of SQL Server they should be run. Alternatively, if you are running in a SQL Server compute context, these functions will operate in that context if you omit the connection string.

The following example shows how to connect directly to a SQL Server table from ScaleR. In this example,
the Airport table contains the details of US airports:

Connecting to a SQL Server data source


sqlConnString <- "Driver=SQL Server;Server=LON-
SQLR;Database=AirlineData;Trusted_Connection=True"
connection <- RxSqlServerData(connectionString = sqlConnString,
table = "dbo.Airports", rowsPerRead = 1000)

# Display the first few rows from the Airports table


head(connection)

# Results:
iata airport city state country lat long
1 00M Thigpen Bay Springs MS USA 31.95376 -89.23450
2 01G Perry-Warsaw Perry NY USA 42.74135 -78.05208
3 06D Rolla Municipal Rolla ND USA 48.88434 -99.62088
4 06M Eupora Municipal Eupora MS USA 33.53457 -89.31257
5 06N Randall Middletown NY USA 41.43157 -74.39192
...

The following example shows how to check whether a table exists, and then remove it:

Performing a DDL operation against SQL Server


sqlCompute <- RxInSqlServer(…)       # Create a SQL Server compute context (connection details omitted)
rxSetComputeContext(sqlCompute)      # Use the SQL Server compute context

tempTable <- …                       # Specify the name of the table to drop

if (rxSqlServerTableExists(tempTable)) {
    rxSqlServerDropTable(tempTable)
}

ScaleR file systems


Some of the ScaleR data sources specify files that
are located in a file system. ScaleR supports the
native file system of whichever operating system
you are using, but it also gives access to files held
in the Hadoop Distributed File System (HDFS). This
is a file system aimed at providing efficient data
storage and access for very large files. It uses a
clustered architecture. HDFS has its own
distributed structure—what appears to be a single
file to a consumer application is likely to be
implemented as many pieces stored on different
computers. To access a file held in HDFS, you
should use the RxHdfsFileSystem. For regular files, use the RxNativeFileSystem (the default). You can
specify the file system as a parameter to many data source constructors, or you can set the file system
globally for a session by using the rxSetFileSystem function. The rxGetFileSystem function reports
which file system is currently active.

Note that you can only use the RxHdfsFileSystem file system in a supported compute context, such as
RxHadoopMR. You can also use the RxHdfsFileSystem file system from a local session running on a
Hadoop cluster. The RxNativeFileSystem file system is available in all compute contexts.
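
The following sketch shows how you might switch the session-wide file system and check which one is active. The HDFS connection details are placeholders:

Setting the default file system (sketch)

hdfsFS <- RxHdfsFileSystem(hostName = "default", port = 0)
rxSetFileSystem(hdfsFS)                  # subsequent data sources default to HDFS
rxGetFileSystem()                        # check which file system is currently active
rxSetFileSystem(RxNativeFileSystem())    # revert to the native file system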

The following example shows how to connect a data source to a text file stored in HDFS. The parameters
to the RxHdfsFileSystem constructor provide the information for connecting to the HDFS cluster:

Using the RxHdfsFileSystem file system


hdfsFS1 <- RxHdfsFileSystem(hostName = "default", port = 0)
textDS <- RxTextData(file = "myfile.txt", fileSystem = hdfsFS1)

Working with Hadoop data sources


Hadoop stores much of its data using HDFS. If you are running in a Hadoop compute context (RxHadoopMR), you can utilize the HDFS helper functions to manage HDFS files. The following list summarizes these functions:

 rxHadoopCopyFromClient. Use this function to copy a file from a remote client to the HDFS file system in the Hadoop cluster.

 rxHadoopCopyFromLocal. Use this function to copy a file from the native file system to the HDFS file system in the Hadoop cluster.

 rxHadoopCopy. Use this function to copy an HDFS file.


 rxHadoopMove. Use this function to move a file around the HDFS file system.

 rxHadoopRemove. This function removes a file from the HDFS file system.

 rxHadoopRemoveDir. Use this function to remove a directory from HDFS.


 rxHadoopListFiles. This function generates a list of files in a specified HDFS directory.

 rxHadoopFileExists. This function tests whether a specified file exists in an HDFS directory.

Behind the scenes, these functions perform the corresponding Hadoop fs commands in an SSH shell
created by the RxHadoopMR compute context.

You can also use the rxHadoopCommand function to run an arbitrary Hadoop command. You can
perform any Hadoop operation, including managing the HDFS file system, and submitting Hadoop jobs.
This command uses an explicit SSH connection to communicate with the Hadoop cluster.

RevoScaleR functions for Hadoop


https://aka.ms/dfrwnk

Demonstration: Reading data from SQL Server and HDFS


Demonstration Steps

Reading data stored in SQL Server


1. Open your R development environment of choice (RStudio or Visual Studio).
2. Open the R script Demo1 - SQL Server.R in the E:\Demofiles\Mod02 folder.

3. Highlight and run the code under the comment # Connect to SQL Server. These statements create
an RxSqlServerData data source that references the Airports data in the AirlineData database on
the LON-SQLR server.

4. Highlight and run the code under the comment # Use R functions to examine the data in the
Airports table. These statements display the first few rows of the table, and then use the
rxGetVarInfo and rxSummary functions to display the columns in the table and calculate summary
information.

Reading data stored in HDFS


1. Open the R script Demo1 - HDFS.R in the E:\Demofiles\Mod02 folder.

2. Highlight and run the code under the comment # Create a Hadoop compute context. These
statements open a connection to the Hadoop cluster and set this connection as the compute context.

3. Highlight and run the code under the comment # List the contents of the /user/instructor folder
in HDFS. This command uses the Hadoop fs -ls command to display the file names from HDFS.

4. Highlight and run the code under the comment # Connect directly to HDFS on the Hadoop VM.
These statements create an RxHdfsFileSystem object and make it the default file system for the
session.

5. Highlight and run the code under the comment # Create a data source for the CensusWorkers.xdf
file. This statement creates an XDF data source that references the CensusWorkers.xdf file in HDFS.
6. Highlight and run the code under the comment # Perform functions that read from the
CensusWorkers.xdf file. These statements display the first few lines from the CensusWorkers.xdf file,
and generate a summary of the file contents.

Note that the rxSummary command might take a couple of minutes. This is because Hadoop is
optimized to handle very large data files. With a small file such as the one used in this demonstration,
the overhead of using Hadoop exceeds the time spent performing any processing.

7. Close your R development environment of choice (RStudio or Visual Studio) without saving any
changes.

Verify the correctness of the statement by placing a mark in the column to the right.

Statement Answer

You want to use the rxImport function to transfer data from a local CSV
file into a SQL Server table. You use an RxTextData data source to read
the data from the CSV file and an RxSqlServerData data source to write
to the SQL Server database. You should perform this operation in the
RxInSqlServer compute context. True or False?

Lesson 2
Reading and writing XDF data
In this lesson, you will learn how to read data into an XDF object from a variety of sources. You will also
learn how to control the formats in which they are created, transform any of the variables you need to,
and perform union and one-to-one merges.

Lesson Objectives
After completing this lesson, you will be able to:

 Explain how using XDF overcomes many limitations of traditional R data formats.

 Import data into an XDF object.


 Describe how to control the format of data in an XDF file by using a schema.

 Perform transformations on data as it is imported.

 Refactor variables in an XDF file.

 Import composite files into XDF.

 Split large XDF files into smaller pieces.

 Merge XDF files together.


 Use the rxXdfData data source to read data and extract variables.

 Examine the structure of an XDF object.

The XDF format


R traditionally uses data frames that cache data in
memory. While this approach can improve
processing speed by reducing the volume of I/O
being performed, it limits the size of datasets to
the amount of available memory. Schemes that
store data frames on disk improve capacity at the
expense of increased I/O and processing
complexity. The XDF file format, and the ScaleR
functions, have been designed to help overcome
these issues. The XDF format is very efficient for
reading arbitrary rows and columns, and the
ScaleR functions can adapt to make the best use
of available memory and processing resources when reading and writing an XDF file. Many of the ScaleR
functions can break down their operations to perform elements in parallel in a clustered environment. You
can take advantage of this mechanism to create your own parallelized analysis functions. Module 5:
Parallelizing Analysis Operations, discusses this process in more detail.

XDF files are subdivided into “chunks”. The ScaleR functions only need to read a single chunk into
memory at a time, process it, and then write it back to disk. This leads to improved scalability because the
size of a dataset that you can process is not limited by the memory available on the computer. The
parallel algorithms used by the ScaleR functions enable different processors to work on these data chunks,
and then combine the chunked results together.

Occasionally, it might be necessary for the ScaleR functions to make multiple passes over one or more
chunks, but this processing is transparent and is controlled by the functions themselves.
However, you should consider some tradeoffs when determining whether to import data into the XDF
format or leave it in a more traditional format. There might be times when using an in-memory data
frame is more appropriate for performing a specific task. Additionally, many existing R packages use the
data frame interface, and are not designed to work with XDF files. For more information on these
tradeoffs, see:

Trade-offs to consider when reading a large dataset into R using the RevoScaleR package
https://aka.ms/let4so

Importing data into an XDF object


You use the rxImport function to retrieve data from a
file and import it into an XDF object. You can use this
function to convert data held in other formats, such as a
text file or a SQL Server database.

The following example shows how to use the rxImport function to read from a text file:

Using rxImport to read a text file


readPath <- rxGetOption("sampleDataDir")
infile <- file.path(readPath, "claims.txt")
claimsDF <- rxImport(infile)

By default, the result returned by rxImport is an in-memory data frame. If the data file is large, the data
frame can exhaust the available memory, but you can also use rxImport to create an XDF file by
specifying the outFile parameter to rxImport.

The following example shows how to create an XDF file from the claims.txt file used by the previous
example:

Using rxImport to create an XDF file


readPath <- rxGetOption("sampleDataDir")
infile <- file.path(readPath, "claims.txt")
outfile <- file.path(readPath, "claims.xdf")
claimsXDF <- rxImport(infile, outFile = outfile)

If you specify the outFile parameter, the value returned by rxImport is an RxXdfData data source object
that references the new XDF file, rather than a data frame. The data is not read into memory, but is
instead written to the XDF file. You can use the RxXdfData data source to read and process the data in
this file, chunk by chunk.

The following example shows how to use the RxXdfData data source object created in the previous
example to generate a summary of the XDF data:

Generating a summary from an XDF file


rxSummary(~., claimsXDF)

Filtering data
You can control the rxImport process by using the numRows and rowSelection arguments. The
numRows argument causes the import to stop after reading the specified number of rows. You use the
rowSelection argument to filter the data as it is imported, so that only rows that match specified criteria
make it to the XDF file.

The following example shows how to limit the number of rows being imported, and how to filter the data
as it is being imported. In this example, cost is a field containing numeric data in the file being imported:

Limiting and filtering data


claimsXDF <- rxImport(infile, outFile = outfile, numRows = 10000,
rowSelection = (cost >= 0))

You can use the varsToKeep and varsToDrop arguments to specify which columns from the input you
wish to retain in, or discard from, the XDF file. These are character vectors containing the names of the columns.
You can specify either argument, but not both. You cannot use these arguments if you are importing data
from an ODBC data source or a fixed format text file.
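
As a minimal sketch, the following code keeps only the cost and number columns when importing the
claims.txt file used in the earlier examples; the output file name is an assumption:

claimsSubsetXDF <- rxImport(infile, outFile = "claimsSubset.xdf",
                            varsToKeep = c("cost", "number"), overwrite = TRUE)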

Controlling chunk size


You can control the size of the chunks in the XDF file by using the rowsPerRead argument to rxImport.
This parameter specifies how many rows each chunk should contain. You might need to experiment with
this argument to optimize your files. If you make chunks too big, then processing this file might consume
more resources than are available. If you make chunks too small, then you can incur excessive I/O when
processing the file. Note that you can reblock a file by using the rxDataStep function, as described in
Module 4: Processing Big Data.
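
The following sketch imports the claims.txt file used earlier into chunks of 5,000 rows each; the chunk
size and output file name are illustrative values only:

claimsChunkedXDF <- rxImport(infile, outFile = "claimsChunked.xdf",
                             rowsPerRead = 5000, overwrite = TRUE)
rxGetInfo(claimsChunkedXDF)    # Reports the number of blocks in the resulting file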

Overwriting or extending an XDF file


You use the overwrite argument to rxImport to specify whether the function should overwrite an
existing file with the same name specified by the outFile parameter. This is a logical (TRUE/FALSE) value.
You can use the append argument in conjunction with overwrite. If you set append to "rows" and
overwrite to FALSE, the rxImport operation will append the data to the end of an existing file. If you set
append to "none", the results depend on the value of overwrite and whether the file already exists. In this
case, if you set overwrite to TRUE, rxImport will overwrite the existing file contents, but if overwrite is
false, rxImport will fail.
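
The following sketch builds an XDF file in two passes by appending rows. The second input file is a
hypothetical file with the same column layout as claims.txt:

moreClaims <- "moreClaims.txt"    # Hypothetical file with the same columns as claims.txt
rxImport(infile, outFile = outfile, overwrite = TRUE)
rxImport(moreClaims, outFile = outfile, append = "rows", overwrite = FALSE)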

Importing data using ScaleR in Microsoft R


https://aka.ms/q8jsi9

Controlling data schemas


The rxImport function attempts to determine the
type and name of each column by examining the
source data file. Depending on the type of the file,
this information could be encoded in the
metadata, or it might be inferred from information
included at the start of the file (such as the first
line in a CSV file that can contain the names of
columns). You can override the metadata (if it
exists) by providing your own schema information
as the colClasses argument to rxImport.

The colClasses argument should contain a vector of named column name/column type pairs. The
rxImport function uses this information to set the type of each named column.

The following example shows how to use the colClasses argument to set the data type of columns:

Using the colClasses argument to rxImport


flightDataColumns <- c("Year" = "factor",
"Month" = "factor",
"DayofMonth" = "factor",
"DayOfWeek" = "factor",
"DepTime" = "character",
"ArrTime" = "character",
"Delay" = "numeric",
"Cancelled" = "logical")

flightDataSampleDF <- rxImport(inData = flightDataCsv, colClasses = flightDataColumns)

If you omit any columns in colClasses, the rxImport function will infer the type of those columns from
the source metadata.

You can also specify the stringsAsFactors argument to rxImport if you want the import process to
convert all character strings to factors. In this case, you can override specific columns by using
colClasses and indicating that those columns should be retained as character data.
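
The following sketch converts all string columns to factors except DepTime, which is overridden in
colClasses. It assumes that flightDataCsv references the CSV file from the previous example; the
output file name is illustrative:

flightDataXDF <- rxImport(inData = flightDataCsv, outFile = "FlightDataFactors.xdf",
                          stringsAsFactors = TRUE,
                          colClasses = c("DepTime" = "character"),
                          overwrite = TRUE)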

Handling fixed format text data


If you are importing text data from a fixed format file, you can use the colInfo argument to specify how
to split each line into individual fields. You provide a list of data for each column, indicating where it starts
and ends in the line, and its type.

The following example reads information about car insurance claims from a fixed format text file:

Using the colInfo argument to rxImport


claimsFF <- file.path(rxGetOption("sampleDataDir"), "claims.dat")
claimsXdf <- file.path(tempdir(), "importedClaims.xdf")

claimsFFColInfo <- list(rownames = list(start = 1, width = 3),
                        age = list(type = "factor", start = 4, width = 5),
                        car.age = list(type = "factor", start = 9, width = 3),
                        type = list(type = "factor", start = 12, width = 1),
                        cost = list(type = "numeric", start = 13, width = 6),
                        number = list(type = "integer", start = 19, width = 3))

rxImport(inData = claimsFF, outFile = claimsXdf, colInfo = claimsFFColInfo)



Transforming data on import


You might have instances when you want to
transform data rather than just reformatting it. For
example, you might wish to group responses,
replace missing values or round numeric values.
ScaleR provides the following arguments to
rxImport to help perform these tasks:

 transforms. This is a list of transformations that the rxImport function will apply to each row. You
can use this argument to add new columns to the XDF file, or change the values of existing fields in a
row. You can reference any fields in the current row, and a number of internal variables managed by
the rxImport process. These variables include:

o .rxStartRow. The row number of the first row in the current chunk being processed.

o .rxChunkNum. The current chunk number.

o .rxNumRows. The number of rows in the current chunk.

 transformObjects. Transformations in rxImport operate inside their own closed scope, and variables
external to the rxImport operation are inaccessible. If a transformation requires access to an external
variable, you must specify it using this argument. This argument is a list. Note that if you reference
external variables in the rowSelection argument, you must also include them in this list.

 transformFunc. You can implement complex transformations as a function rather than specifying
them in the transforms list. You provide the name of the function in this argument.

 transformVars. If the transformFunc takes parameters, you must provide their values in this vector.

 transformPackages. If a transformation references external packages, you must specify them in this
vector.

The following example uses a transform to add a variable named logcost, containing the log of the cost
column in the current row, to an XDF file:

Using a transform to add a variable


inFile <- file.path(rxGetOption("sampleDataDir"), "claims.txt")
outfile <- "claimsXform.xdf"
claimsDS <- rxImport(inFile, outFile = outfile, transforms = list(logcost = log(cost)))
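
The following sketch illustrates the transformObjects argument. The external variable costThreshold
is a hypothetical value; it must be passed in through transformObjects because the rowSelection and
transforms expressions run in their own scope:

costThreshold <- 100
claimsDS <- rxImport(inFile, outFile = "claimsFiltered.xdf",
                     rowSelection = cost >= threshold,
                     transforms = list(highCost = cost >= threshold * 5),
                     transformObjects = list(threshold = costThreshold),
                     overwrite = TRUE)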

Best Practice: Test transformations over a small subset of the data first. This gives you a
way to quickly test that the transformation is correct. When you are satisfied that the
transformation is working correctly, you can perform it over the entire data.

Best Practice: The transforms argument contains a list of transformations. Use this list to
batch together transformations over different fields, rather than performing separate runs of
rxImport over the same data, each with its own single transformation.

Module 4: Processing Big Data, describes how to use the transformFunc, transformVars, and
transformPackages arguments in detail.

Refactoring variables
You can convert nonfactor variables in an XDF file
into factors by using the rxFactors function. This
function takes an XDF file and a specification of
how to categorize data in the factorInfo
argument. Like the rxImport function, rxFactors
can write the results to a new XDF file or return it
as a data frame. You can also specify whether to
retain or drop columns by using the varsToKeep
and varsToDrop arguments. However, it doesn't
support filtering by row selection or
transformations.

The following example shows how to recode nonfactor variables as factors. In this example, the sex
variable can be either "M" or "F", and the state field is a two-letter abbreviation of a US state. The
example categorizes both fields:

Categorizing a nonfactor variable in an XDF file


infile <- "dataFile.xdf"
outfile <- "factoredDataFile.xdf"

myNewData <- rxFactors(inData = infile, outFile = outfile, factorInfo = c("sex", "state"))

You can also use rxFactors to refactor a variable that is already a factor.

In the next example, the DayOfWeek variable is a factor that contains the values "1" through "7" to
represent the days of the week. The following code changes the variable to use the levels "Mon" through
"Sun" instead:

Recoding an existing factor with rxFactors


NewData <- rxFactors(inData = infile, outFile = outfile,
factorInfo = list(DayOfWeek = list(newLevels = c(Mon = "1", Tue =
"2", Wed = "3", Thu = "4", Fri = "5", Sat = "6", Sun = "7"), varName = "DayOfWeek")))

For more information about using the rxFactors function, see:

Transforming and Subsetting data


https://aka.ms/eoaui5

Importing composite data


You might have situations where you wish to use a
set of data files as inputs rather than just one—for
example, you might have one data file per day.
You could use an interim process to combine
them, but you can use rxImport to import them
collectively.

In the following example, the E:\Data folder contains a collection of CSV files (named 2000.csv
through 2004.csv). The code creates an RxTextData data source that references this folder. You can
then use this data source as the input to the rxImport function:

Importing data from many CSV files


csvDataDir <- "E:\\Data"
csvData <- RxTextData(csvDataDir)
rxImport(inData = csvData, outFile = "E:\\CompositeData.xdf")

# Messages displayed as the data is imported:


# Importing file E:\Data\2000.csv
# Rows Read: 1135221, Total Rows Processed: 1135221, Total Chunk Time: 95.300 seconds
# Importing and appending file E:\Data\2001.csv
# Rows Read: 1192542, Total Rows Processed: 1192542, Total Chunk Time: 80.304 seconds
# Importing and appending file E:\Data\2002.csv
# Rows Read: 1055090, Total Rows Processed: 1055090, Total Chunk Time: 47.549 seconds
# Importing and appending file E:\Data\2003.csv
# Rows Read: 1296798, Total Rows Processed: 1296798, Total Chunk Time: 60.263 seconds
# Importing and appending file E:\Data\2004.csv
# Rows Read: 1452351, Total Rows Processed: 1452351, Total Chunk Time: 68.753 seconds

You can also output the data as a composite set of files rather than a single file. To do this, create an
RxXdfData data source that references the destination folder, and specify the createCompositeSet
argument to the rxImport function, as follows:

Creating a composite XDF object


csvDataDir <- "E:\\Data"
csvData <- RxTextData(csvDataDir)
xdfDataDir <- "E:\\XDFData"
xdfData <- RxXdfData(xdfDataDir)
rxImport(inData = csvData, outFile = xdfData, createCompositeSet = TRUE)

In this case, the resulting folder contains two subfolders; a metadata folder holding an .xdfm file that
contains the metadata describing the contents of the XDF files, and a data folder that contains the data in
the form of a set of .xdfd files. This is the structure that HDFS uses for storing XDF data; the various .xdfd
files might be physically stored on separate machines in the HDFS cluster.

When you need to process the data, create an RxXdfData data source using the folder holding the data
and metadata folders.

Splitting large XDF files


You might be faced by a situation where different
subsets of a dataset are treated, kept or modeled
differently. For example, you might want to work
with a small subset of your data, or you might
want to divide the data into pieces, so that you
can distribute it across different nodes in a cluster.
In these cases, rxSplit is a useful function.

You use the rxSplit function to quickly divide an XDF file into pieces based on the value of a factor
variable that you select. The data for different values of the factor variable is stored in different files.
The splitByFactor argument specifies the factor to use to split the data.

The following example splits census data held in an XDF file by using the year variable. The year is a
factor variable:

Splitting an XDF file by factor


infile <- "CensusData.xdf"
outfiles <- "YearData"
rxSplit(inData = infile, outFilesBase = outfiles, splitByFactor = "year")

The previous example creates files named YearData1.xdf, YearData2.xdf, and so on.

You can also split an XDF file into a number of approximately uniform-sized pieces rather than splitting by
factor. To do this, specify the numOut argument, as follows:

Splitting an XDF file into a specified number of pieces


infile <- "CensusData.xdf"
outfiles <- "YearData"
rxSplit(inData = infile, outFilesBase = outfiles, numOut = 5)

Note that, in common with many of the ScaleR functions concerned with importing or modifying data,
you can also use arguments such as varsToKeep, varsToDrop, rowSelection, and transforms when
splitting the data.

Combining XDF files


You use the rxMerge function to combine
multiple XDF files together into a single dataset.
This function supports several types of merge
operations:

 Union. This operation appends the data in one XDF file to the end of another. Both input files must
have the same number of columns, and the corresponding columns in each file must have the same
type. If necessary, you can use the varsToDrop1 and varsToDrop2 arguments to rxMerge to filter
out any variables that might cause a mismatch in either file. For more information about performing
a union merge, see:
Union Merge
https://aka.ms/lvdsku

 OneToOne. This operation performs a column-wise append, concatenating a row in the second file
onto the end of the corresponding row in the first. For more information about one-to-one merging,
see:
One to one Merge
https://aka.ms/ihwfxm

 Inner Merge and Outer Merge. These are analogous to the inner join and outer join operations
between tables in a relational database. These options are covered in more detail in Module 4:
Processing Big Data.
The following example illustrates how to perform a simple union merge between two XDF files. Notice
that the type argument specifies the type of merge to perform:

Using rxMerge to perform a union merge


firstFile <- "SplitData1.xdf"
secondFile <- "SplitData2.xdf"
mergedFile <- "MergedData.xdf"
rxMerge(inData1 = firstFile, inData2 = secondFile, outFile = mergedFile, type = "union")
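
A one-to-one merge follows the same pattern. The following sketch assumes that LeftData.xdf and
RightData.xdf are hypothetical files containing the same number of rows in the same order:

rxMerge(inData1 = "LeftData.xdf", inData2 = "RightData.xdf",
        outFile = "CombinedData.xdf", type = "oneToOne")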

Reading data in blocks


You can use the rxOpen, rxReadNext, and
rxClose functions of an RxXdfData data source to
retrieve data from an XDF file chunk by chunk. The
value returned by rxReadNext in this case is a
data frame containing the chunk. You can access
the individual variables inside this chunk by using
the $ and [[]] operators, as with any data frame.

The following code snippets show examples of how to use these operators with an XDF data source.
The example uses airline delay data. The ArrDelay variable contains information about flight arrival
delays:

Accessing variables in an XDF data source


delayDataFile <- "AirlineDelayData.xdf"
xdfSource <- RxXdfData(delayDataFile)

rxOpen(xdfSource)
data <- rxReadNext(xdfSource)

flightArrivalDelays = data$ArrDelay   # can also use flightArrivalDelays = data[["ArrDelay"]]

rxClose(xdfSource)

Note that this approach is not too common. If you need to process data in chunks, a better method is to
use the rxDataStep function described in Module 4: Processing Big Data.

Examining and modifying the structure of an XDF object


ScaleR provides the rxGetInfo and rxGetVarInfo
functions to enable you to view the structure of an XDF
object. These functions also work on other types, such as
data frames.

The rxGetInfo function returns information about the contents of the object as a whole, such as the
number of variables and observations, and the number of blocks in the file.

The following code shows some sample output returned by rxGetInfo:

Using rxGetInfo with an XDF object


delayDataFile <- "AirlineDelayData.xdf"
rxGetInfo(delayDataFile)

# Typical output
# File name: E:\AirlineDelayData.xdf
# Number of observations: 227044
# Number of variables: 30
# Number of blocks: 1
# Compression type: zlib

You can also specify the getVarInfo argument to obtain detailed information about the individual
variables in the dataset, as shown in the following example:

Getting information about variables in an XDF object


delayDataFile <- "AirlineDelayData.xdf"
rxGetInfo(delayDataFile, getVarInfo = TRUE)

# Typical output
# File name: E:\AirlineDelayData.xdf
# Number of observations: 227044
# Number of variables: 30
# Number of blocks: 1
# Compression type: zlib
# Var 1: .rxRowNames, Type: character
# Var 2: Year
#        1 factor levels: 2000
# Var 3: Month
#        12 factor levels: 1 2 3 4 5 ... 8 9 10 11 12
# Var 4: DayofMonth
#        31 factor levels: 1 3 9 17 19 ... 18 12 8 21 5
# Var 5: DayOfWeek
#        7 factor levels: 6 1 7 3 4 5 2
# Var 6: DepTime, Type: character

You can limit the variables displayed by using the varsToKeep and varsToDrop arguments.
The rxGetVarInfo function is similar, except that it only returns variable information from a data source.

You use the rxSetVarInfo function to modify variable metadata in a data source.

The following example uses rxGetVarInfo and rxSetVarInfo to change the levels for a factor in an XDF
file. In this case, the Cancelled factor originally used the levels "0" and "1" to indicate whether the flight
had been cancelled. This code changes the levels to "No" and "Yes":

Changing the labels for a factor


delayDataFile <- "AirlineDelayData.xdf"
xdfSource <- RxXdfData(delayDataFile)

varInfo <- rxGetVarInfo(xdfSource)


varInfo$Cancelled$levels <- c("No", "Yes")
rxSetVarInfo(varInfo, xdfSource)

There is also a corresponding rxSetInfo function that you can use to set the metadata of the XDF file
rather than the individual variables in the file.
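
As a minimal sketch, the following code attaches a description to the file-level metadata; the description
text is illustrative:

rxSetInfo(data = xdfSource, description = "Airline delay data for analysis")
rxGetInfo(xdfSource)    # Examine the file-level information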

Note: The rxGetVarInfo function only accesses the variable metadata at the start of an
XDF file. The rxSetVarInfo function does not modify the actual data for each row, only the
metadata. Consequently, both functions can run very quickly, even over large XDF objects.

Check Your Knowledge


Question

Which argument to the rxImport function does not enable you to filter data?

Select the correct answer.

rowSelection

varsToDrop

varsToKeep

colClasses

numRows

Lesson 3
Summarizing data in an XDF object
In this lesson, you will learn how to generate statistical summaries of the data in an XDF object. You can
use many regular R functions to perform these tasks, but ScaleR also provides several functions of its own
specifically designed to efficiently process big data held in the XDF format.

Lesson Objectives
After completing this lesson, you will be able to:

 Use some common base R functions over XDF data.

 Generate summaries using the rxSummary function.


 Examine an rxSummary object.

 Temporarily transform data as it is being summarized.

 Perform dplyr style transformations over XDF data.

 Cross-tabulate XDF data.

 Test cross-tabulated data for variable independence.

 Compute quantiles over XDF data.

Using base R functions over an XDF object


You can use many of the base R functions to
summarize the data in an XDF object. Common
examples include summary, head, tail, names,
dim, nrow, and ncol. This means that you can
continue to use many existing R scripts that you
might have developed to operate on data frames
without recoding them to use the ScaleR
functions.

Internally, the summary, head, and tail functions are generic; they are designed to examine the
class of the object on which they are called, and then invoke a function implemented by that class
to perform the operation. For example, when you invoke the head function over an XDF object, it actually
calls RxXdfData.head (an S3 method for the class RxXdfData), which in turn invokes rxSummary to
extract the first few rows of data. The data is converted back into the usual format reported by head. All of
this happens behind the scenes, but the key point to note is that calling these functions over an XDF
object does not require casting the data to a data frame—so they will work even on very large data
objects.

The names function operates in a slightly different manner. It is also generic, but the technique it uses
internally differs. This detail is not actually important, but the key point is that names eventually invokes
the rxGetVarNames function. Similarly, dim invokes the rxGetInfo function, as do the nrow and ncol
functions.

The key fact to take from this discussion is that you can continue to use these common Base R functions
over XDF data. However, if you are writing new scripts from scratch, you might find that using the
equivalent ScaleR functions directly is more efficient.
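
The following sketch shows some of these base R functions running directly against an XDF data source,
using the airline delay file referenced in the examples later in this lesson:

delayDataFile <- "AirlineDelayData.xdf"
xdfSource <- RxXdfData(delayDataFile)

names(xdfSource)    # Variable names, via rxGetVarNames
dim(xdfSource)      # Numbers of rows and columns, via rxGetInfo
nrow(xdfSource)     # Number of observations
head(xdfSource)     # First few rows, without loading the entire file into memory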

Summarizing XDF data with rxSummary


The rxSummary function performs a similar task to the Base R summary function. However, it has
some features that provide improved flexibility.

The basic form of rxSummary specifies the data to summarize, and the object containing the data.
You use the formula notation to specify which variables to summarize.

The following example summarizes the ArrDelay (arrival delay) variable in the airline flight delay
dataset:

Using rxSummary to obtain arrival delay statistics

delayDataFile <- "AirlineDelayData.xdf"
xdfSource <- RxXdfData(delayDataFile)

rxSummary(~ArrDelay, xdfSource)

# Typical output
# Rows Read: 227044, Total Rows Processed: 227044, Total Chunk Time: 0.026 seconds
# Computation time: 0.041 seconds.
# Call:
# rxSummary(formula = ~ArrDelay, data = xdfSource)

# Summary Statistics Results for: ~ArrDelay


# Data: xdfSource (RxXdfData Data Source)
# File name: SplitData1.xdf
# Number of valid observations: 227044

# Name Mean StdDev Min Max ValidObs MissingObs


# ArrDelay 8.105258 32.32713 -1298 1084 217503 9541

If you need to summarize more than one variable, specify them in the formula separated by the +
operator, like this:

Generating a summary for arrival and departure delays


delayDataFile <- "AirlineDelayData.xdf"
xdfSource <- RxXdfData(delayDataFile)

rxSummary(~ArrDelay + DepDelay, xdfSource)



You can use the ":" notation to include dependencies in a formula. The following example summarizes
departure delay times as a function of the airport of origin for a flight:

Referencing dependent and independent variables in a formula


delayDataFile <- "AirlineDelayData.xdf"
xdfSource <- RxXdfData(delayDataFile)

rxSummary(~DepDelay:Origin, xdfSource)

# Typical output
# Rows Read: 227044, Total Rows Processed: 227044, Total Chunk Time: 0.031 seconds
# Computation time: 0.039 seconds.
# Call:
# rxSummary(formula = ~DepDelay:Origin, data = xdfSource)
#
# Summary Statistics Results for: ~DepDelay:Origin
# Data: xdfSource (RxXdfData Data Source)
# File name: SplitData1.xdf
# Number of valid observations: 227044
#
# Name Mean StdDev Min Max ValidObs MissingObs
# DepDelay:Origin 9.374303 31.25425 -45 1435 218064 8980
#
# Statistics by category (205 categories):
#
# Category Origin Means StdDev Min Max ValidObs
# DepDelay for Origin=ATL ATL 10.04044364 29.627925 -17 466 10459
# DepDelay for Origin=AUS AUS 6.69516971 25.208972 -17 259 1532
# DepDelay for Origin=BHM BHM 5.51197982 24.143043 -26 282 793
# DepDelay for Origin=BNA BNA 7.59791004 27.797410 -11 480 2201
# DepDelay for Origin=BOS BOS 10.45093105 33.601146 -15 521 3974
# DepDelay for Origin=BUR BUR 13.16031196 30.553457 -8 316 1154
# DepDelay for Origin=BWI BWI 10.23133191 30.173772 -10 348 3281
# DepDelay for Origin=CLE CLE 6.65045455 26.232894 -10 348 2200

You can also use the special formula "~." to summarize all variables. For more information about using
formulas with rxSummary, see:

Data Summaries
https://aka.ms/obvzrd

You can refine the summaries generated by using the following arguments to the rxSummary function:

 byGroupOutFile. If the formula uses factor variables, you can save the summary output to a set of
files, each containing the data for a different factor value. This helps you to process the results
separately for each factor value later. Use the byGroupOutFile argument to specify a base file name
for the data to be saved. The rxSummary function will generate a set of files using this name with a
numeric suffix.
 summaryStats. By default, rxSummary generates statistics for the Mean, StdDev, Min, Max, ValidObs
(number of valid observations), MissingObs (number of missing observations), and Sum. If you only
require a subset of these statistics, use the summaryStats argument to specify which ones as a
character vector (see the sketch after this list).
 byTerm. This is a logical argument. If TRUE, missing values will not be included when computing the
summary statistics.
 removeZeroCounts. This is a logical argument. If TRUE, rows with no observations will not be
included in the output for counts of categorical data.
 fweights and pweights. These are character strings that specify the name of a variable to use as the
frequency weight or probability weight for the observations being summarized.
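
The following sketch restricts the output to a subset of statistics and removes empty categorical
combinations, reusing the airline delay data source from the earlier examples:

rxSummary(~ArrDelay + DepDelay, xdfSource,
          summaryStats = c("Mean", "StdDev", "ValidObs"),
          removeZeroCounts = TRUE)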

Examining an rxSummary object


The rxSummary function returns an rxSummary
object. This object contains the following items:

 sDataFrame. A data frame that contains the summaries for the continuous variables.

 categorical. A list of summaries for categorical variables.

 categorical.type. A list of the types of categorical summaries. A type can have the value "counts",
"cube", or "none" if there are no categorical summaries.

 formula. The formula used to generate the summary.

 nobs.valid. The number of valid observations found in the dataset.

 nobs.missing. The number of observations missing from the dataset.

The following example shows how to retrieve the data frame that contains the summaries for the
continuous variables from an rxSummary object:

Extracting the results data frame for a summary


delayDataFile <- "AirlineDelayData.xdf"
xdfSource <- RxXdfData(delayDataFile)

summaryData <- rxSummary(~DepDelay:Origin, xdfSource)
summaryData$sDataFrame

# Typical results
# Name Mean StdDev Min Max ValidObs MissingObs
# 1 Delay 17.40734 60.90534 -1273 2121 217503 9541

Transforming data as it is summarized


You use the rxSummary function to transform
data as it is being summarized. This process uses
the same technique as rxImport; you specify a list
of transformations using the transforms
argument. This technique is useful if you need to
generate data to summarize dynamically.

The following example generates a column called Delay, which is the sum of the ArrDelay and
DepDelay (arrival delay and departure delay) variables in each row. The rxSummary function
generates a summary using this cumulative delay:

Generating a column dynamically for rxSummary


delayDataFile <- "AirlineDelayData.xdf"
xdfSource <- RxXdfData(delayDataFile)

rxSummary(~Delay, xdfSource, transforms = list(Delay = ArrDelay + DepDelay))

You can also discount rows from the summary by using the rowSelection argument.

Note that these transformations are transient; they are used while the rxSummary function is running,
and the changes are not stored permanently. If you expect the same transformation to be required
elsewhere, consider using the rxDataStep function to save the transformed data. To find out how to do
this, see Module 4: Processing Big Data.

Using the dplyrXDF package


The dplyr package is a popular toolkit for
performing data transformation and manipulation.
Since its launch, dplyr has become very popular
for the way in which it streamlines and simplifies
many common data manipulation tasks.

The dplyr package works on data frames, so it has limited use with XDF objects that are too large to
fit into memory. However, you can use the dplyrXdf package, which provides dplyr-like functions
over XDF objects.

The following example uses a dplyrXdf pipeline to filter airline delay data to limit the information to
the year 2002. It then creates a pair of calculated variables (DistKm and AvgDelay), groups the data
by the airport of origin, and summarizes the AvgDelay and DistKm variables before sorting the results:

Using a dplyrXdf pipeline


delayDataFile <- "AirlineDelayData.xdf"
xdfSource <- RxXdfData(delayDataFile)

results <- xdfSource %>%


filter(Year == "2002") %>%
mutate(DistKm = Distance * 1.6093, AvgDelay = (ArrDelay + DepDelay)/2) %>%
group_by(Origin) %>%
summarise(mean_delay=mean(AvgDelay), sum_dist=sum(DistKm)) %>%
arrange(desc(mean_delay))

head(results)

# Origin mean_delay sum_dist


# 1 HDN 18.19366 176559.52
# 2 CAK 14.67857 59479.73
# 3 OME 13.85606 155786.68
# 4 OTZ 13.03382 134881.87
# 5 EGE 12.13023 371608.29
# 6 DUT 11.85321 184812.01

The functions in this package use ScaleR functions behind the scenes to chunk the data, and they create
temporary XDF files while they are running. These temporary XDF files are automatically removed when a
function finishes. If you need to retain the data generated at any stage in a dplyrXdf pipeline, use the
persist function to write the data to an XDF file.

Many of the dplyrXdf functions enable you to pass additional arguments to the underlying ScaleR
functions by using the .rxArgs argument. For more information, see:

Introducing the dplyrXdf package


https://aka.ms/wdgbkc

Note: Currently, the dplyrXdf package is only available through Git, and you must
download and build it locally. You can perform this task from within R by using the devtools
package. You must also install the dplyr package because some of the functionality of dplyrXdf
references functions in this package. Use the following code to do this:
# Install dplyrXdf
install.packages("dplyr")
install.packages("devtools")
devtools::install_github("RevolutionAnalytics/dplyrXdf")
library(dplyr)
library(dplyrXdf)

Cross-tabulating XDF data


ScaleR provides two functions that you can use to
cross-tabulate data: rxCrossTabs and rxCube.
These functions generate contingency tables that
you can use to analyze the relationship between
variables and test for independence.

Both functions require a formula to specify which dependent variable to cross-tabulate against which
independent variables.

Generating CrossTabs
The following example calculates the mean
departure delay for airline flights as a function of
origin airport and month. Usually, the rxCrossTabs function generates counts, but this example calculates
means. You do this by setting the means argument to TRUE:

Cross-tabulating mean departure delays by origin airport and month


delayDataFile <- "AirlineDelayData.xdf"
xdfSource <- RxXdfData(delayDataFile)

results <- rxCrossTabs(formula = DepDelay~Origin:Month, data = xdfSource, means = TRUE)


results

# Typical results
# Rows Read: 227044, Total Rows Processed: 227044, Total Chunk Time: 0.050 seconds
# Computation time: 0.063 seconds.
# Call:
# rxCrossTabs(formula = DepDelay ~ Origin:Month, data = xdfSource,
# means = TRUE)

# Cross Tabulation Results for: DepDelay ~ Origin:Month


# Data: xdfSource (RxXdfData Data Source)
# File name: AirlineDelayData.xdf
# Dependent variable(s): DepDelay
# Number of valid observations: 218064
# Number of missing observations: 8980
# Statistic: means

# DepDelay (means):
# Month
# Origin 1 2 3 …
# ATL 10.6929763 9.49295775 9.78934741
# AUS 4.7921875 7.38102410 10.03947368
# BHM 5.2704403 4.54227405 8.61363636
# BNA 7.2029478 7.53543307 8.53720930
# BOS 13.1485643 8.68876611 8.49798116
# BUR 8.7056075 17.95685279 13.21084337
# BWI 10.8261905 11.43887623 6.90767045
# CLE 7.0021906 6.55760870 6.00817439
# CLT 7.9783641 6.42066806 5.24920128
# CMH 7.3936348 8.33013436 5.49802372
# COS 2.0535714 7.51351351 3.27118644
# CVG 11.9991823 12.60124334 11.07074830
# DEN 7.2984729 8.61186114 11.36315789

Note: If you are already familiar with xtabs, notice that the formula syntax is different with
rxCrossTabs. Use the ":" character to separate cross-classifying variables rather than the "+"
character that xtabs uses.
You can convert the rxCrossTabs object returned by the rxCrossTabs function into an xtabs
object by using the as.xtabs function.

Ideally, independent variables should be categorical, but you can cause the rxCrossTabs function to treat
a character variable as a categorical variable by using the F function. If you have numeric data, use the
as.factor function instead, although you should be cautious of using this approach because it can
generate significant quantities of information if the data has a large set of possible values.
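
As a minimal sketch, the following call treats the Month variable as categorical for the duration of the
cross-tabulation only; it assumes that Month is stored as numeric data in the source file:

rxCrossTabs(formula = DepDelay ~ Origin:as.factor(Month), data = xdfSource, means = TRUE)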

Another important argument is na.rm, which causes NA values to be excluded from the calculations. Also,
as with many of the other ScaleR functions, you can perform transformations and row selection on the fly
by using the transforms and rowSelection arguments.
The rxCrossTabs function returns an rxCrossTabs object, which contains lists of the cross-tabulation
sums, counts, and means, together with a list of chi-squared test results.

The following example extracts counts from the results object generated by the previous example:

Extracting counts from an rxCrossTab object


results$counts

# Typical results
# $DepDelay
# Month
#Origin 1 2 3 …
# ATL 4257 4118 2084
# AUS 640 664 228
# BHM 318 343 132
# BNA 882 889 430
# BOS 1602 1629 743
# BUR 428 394 332
# BWI 1260 1317 704
# CLE 913 920 367
# CLT 1895 1916 939
# CMH 597 521 253

Generating cubes
The rxCube function generates a data cube. The calculations are the same as those performed by
rxCrossTabs, but the data is presented in a more matrix-like manner, showing the individual statistics for
each combination of independent variables.

As an example, the following code creates a cube of departure delay statistics grouped by origin airport
and month:

Generating a cube of departure delay statistics


results <- rxCube(formula = DepDelay~Origin:Month, data = xdfSource)
results

# Typical results
# Call:
# rxCube(formula = DepDelay ~ Origin:Month, data = xdfSource)
#
# Cube Results for: DepDelay ~ Origin:Month
# File name: SplitData1.xdf
# Dependent variable(s): DepDelay
# Number of valid observations: 218064
# Number of missing observations: 8980
# Statistic: DepDelay means
#
# Origin Month DepDelay Counts
# 1 ATL 1 10.69297627 4257
# 2 AUS 1 4.79218750 640
# 3 BHM 1 5.27044025 318
# 4 BNA 1 7.20294785 882
# 5 BOS 1 13.14856429 1602
# 6 BUR 1 8.70560748 428
# 7 BWI 1 10.82619048 1260
# 8 CLE 1 7.00219058 913
# 9 CLT 1 7.97836412 1895

The resulting rxCube object contains a variable for each column of output—in this case, Origin, Month,
DepDelay, and Counts. You can access these variables by using the $ operator.

For more information about cross-tabulating data, see:


Crosstabs
https://aka.ms/rsxn3f

Testing for independence in cross-tabulated data


You can perform a number of independence tests
against the variables in an rxCrossTabs or rxCube
object. ScaleR provides the following methods:

 rxChiSquaredTest

 rxFisherTest

 rxKendallCore

 Note: You can also pass an xtabs object to these methods.

The following example tests for independence between the origin airport and month for flight departure
delays. Notice that the data for the cube is prefiltered to remove any negative values of the DepDelay
variable; this is required by the rxChiSquaredTest function:

Testing for independence by performing a chi-squared test


results <- rxCube(formula = DepDelay~Origin:Month, data = xdfSource, rowSelection =
DepDelay >= 0)
rxChiSquaredTest(results)

# Typical results
# Chi-squared test of independence between Origin and Month
# df
# 2244

Note: The rxFisherTest function can exhaust workspace resources if you use it on a large
set of results.

For more information about these functions, see:

Cross-tabulated data
https://aka.ms/qgqtbq

Computing quantiles over XDF data


The rxQuantile function sorts data into bins
according to the value of a specified numeric
variable, and then computes the distribution of
the data in each bin. The result is a list containing
the cumulative total showing the probability of a
given value lying in the range, up to the maximum
value defined by each bin. By default, the rxQuantile function uses four bins marking the 0–25%,
25–50%, 50–75%, and 75–100% probabilities.

The following example calculates the quantiles for departure delay in the airline delay data:

Calculate departure delays by quantile


rxQuantile("DepDelay", xdfSource)

# Results
# Rows Read: 227044, Total Rows Processed: 227044, Total Chunk Time: 0.016 seconds
# Computation time: 0.022 seconds.
# 0% 25% 50% 75% 100%
# -45 -3 0 7 1435

The results in this example show that 25 percent of aircraft depart between three and 45 minutes early,
half of all aircraft leave early or on time (the delay is 0), and 75 percent of aircraft have a departure delay
of no more than seven minutes. However, there are some extreme cases with long delays—the 75–100%
range represents a very long tail of values.

To analyze this tail in more detail, you can customize the number of bins and probabilities to use for the
calculations. The following example creates additional bins for the 75–100% range, using the probs
argument to rxQuantile. This argument takes a numeric vector containing the probability values (in the
range 0 through 1) for each bin:

Customizing the probability ranges


bins = c(0, 0.25, 0.5, 0.7, 0.8, 0.9, 1.0)
rxQuantile("DepDelay", xdfSource, probs = bins)

# Results
# Rows Read: 227044, Total Rows Processed: 227044, Total Chunk Time: 0.014 seconds
# Computation time: 0.018 seconds.
# 0% 25% 50% 70% 80% 90% 100%
# -45 -3 0 5 12 32 1435

From these results, you can see that 90 percent of all departures are no more than 32 minutes late.

Demonstration: Transforming, summarizing, and cross-tabulating XDF data


This demonstration shows how to use ScaleR functions to perform common analytical operations over
XDF data.

Demonstration Steps

Transforming XDF data


1. Open your R development environment of choice (RStudio or Visual Studio).
2. Open the R script Demo2 - Transforming Data.R in the E:\Demofiles\Mod02 folder.
3. Highlight and run the code under the comment # Flight delay data for the year 2000.
4. Highlight and run the statements under the comment # Examine the raw data. These statements
perform a quick import of the first 1,000 rows of the airport delay data and display the structure of
this data. Note the variables names and the number of variables (30).
5. Highlight and run the code under the comment # Create an XDF file that combines the ArrDelay
and DepDelay variables. These statements import the CSV data into an XDF file, and add a new field
called Delay. The rowSelection argument contains an expression that randomly selects 10 percent of
the data in the source file.
6. Highlight and run the code under the comment # Examine the structure of the XDF data. Point out
that there are now 31 variables, including Delay.

Summarizing XDF data


1. Highlight and run the code under the comment # Generate a quick summary of the numeric data
in the XDF file. This statement uses the ~. formula to summarize all numeric fields in the XDF file.

2. Highlight and run the code under the comment # Summarize the delay fields. This statement
generates summaries for the ArrDelay, DepDelay, and Delay fields only. Notice the syntax in the
formula.

3. Highlight and run the code under the comment # Examine Delay broken down by origin airport.
These statements factorize the Origin and Dest columns using the rxFactors function, and generate
the summary.

Cross-tabulating XDF data


1. Highlight and run the code under the comment # Generate a crosstab showing the average delay.
This statement creates a sparse tabular output showing the average delay that has occurred for flights
between each airport.

2. Highlight and run the code under the comment # Generate a cube of the same data. This
statement outputs the delay information, but also displays the counts. Note that the cube includes
routes that don't exist; they have a delay of NaN, and a count of 0.

3. Highlight and run the code under the comment # Omit the routes that don't exist. This statement
sets the removeZeroCounts argument so that routes that have a count of zero are no longer
included.

4. Close your R development environment of choice (RStudio or Visual Studio) without saving any
changes.

Verify the correctness of the statement by placing a mark in the column to the right.

Statement Answer

When you perform a base R function, such as summary or head, over XDF data, the XDF data is cast
into a data frame first. True or False?

Lab: Exploring big data


Scenario
Now that you are familiar with the raw structure of the data, your next task is to convert the totality of the
data into a format that you can use for subsequent analysis. The data comprises a number of CSV files,
and you must convert them into a single XDF object. You also need to tidy up the data a little—filter
some observations and variables to eliminate data that you might not be interested in, combine some
columns, and perform some calculations over the results to convert the data into a more useful set of data
that you can use later in your investigations.

Objectives
In this lab, you will:

 Import data held in a CSV file into an XDF object and compare the performance of operations using
these formats.

 Combine multiple CSV files into a single XDF object and transform data as it is imported.
 Combine data retrieved from SQL Server into an XDF file.

 Refactor XDF data and generate summaries.

Lab Setup
Estimated Time: 60 minutes

 Username: Adatum\AdatumAdmin

 Password: Pa55w.rd

Before starting this lab, ensure that the following VMs are all running, and then complete the steps below:

 MT17B-WS2016-NAT

 20773A-LON-DC

 20773A-LON-DEV

 20773A-LON-RSVR

 20773A-LON-SQLR

1. Log in to the LON-DEV VM as AdatumAdmin with the password Pa55w.rd.

2. Click Start, type Microsoft SQL Server Management Studio, and then press Enter.

3. In the Connect to Server dialog box, log in to LON-SQLR using Windows authentication.

4. In Object Explorer, right-click Databases, and then click New Database.

5. In the New Database dialog box, in the Database name box, type AirlineData, and then click
OK.
6. Close SQL Server Management Studio.

7. Start your R development environment of choice (Visual Studio® or RStudio).

8. Open the R script E:\Demofiles\Mod02\importAirports.R.


9. Run the script.

10. Close your R development environment, without saving any changes.



Exercise 1: Importing and transforming CSV data


Scenario
You have decided to import a CSV file containing airline delay data for the year 2000 into an XDF file. You
can then compare the size and performance of using this XDF file to the original CSV file.

The main tasks for this exercise are as follows:

1. Import the data for the year 2000

2. Compare the performance of the data files

 Task 1: Import the data for the year 2000


1. Log in to the LON-DEV VM as Adatum\AdatumAdmin with the password Pa55w.rd.

2. Start your R development environment of choice (Visual Studio Tools, or RStudio), and create a new R
file.

3. The data is located in the file 2000.csv, in the folder E:\Labfiles\Lab02. Set your current working
directory to this folder.

4. Import the first 10 rows of the data file into a data frame and examine its structure. Note the type of
each column.

5. Create a vector named flightDataColumns that you can use as the colClasses argument to
rxImport. This vector should specify that the following columns are factors:

 Year

 DayofMonth

 DayOfWeek

 UniqueCarrier

 Origin

 Dest
 Cancelled

 Diverted

6. Import the data into an XDF file named 2000.xdf. If this file already exists, overwrite it. Use the
flightDataColumns vector to ensure that the specified columns are imported as factors.

7. When the data has been imported, examine the first 10 rows of the new file and check the structure
of this new file.
8. Use File Explorer to compare the relative sizes of the CSV and the XDF files.

 Task 2: Compare the performance of the data files


1. Return to your R environment.

2. Use the rxSummary function to generate a summary of all the numeric fields in the 2000.csv file,
and then repeat using the 2000.xdf file. Compare the timings by using the system.time function.

3. Use the rxCrossTabs function to generate a cross-tabulation of the data in the CSV file showing the
number of flights that were cancelled and not cancelled each month. Note that the Month and the
Cancelled columns are both numerics, but dependent variables referenced by a formula in
rxCrossTabs must be factors. You use the as.factor function to cast these variables to factors. Display
the cancellation values as TRUE/FALSE values. Note how long the process takes.

4. Generate the same cross-tabulation for the XDF file, and compare the timing with that for the CSV
file.
5. Repeat the previous two steps, but generate cubes over the CSV and XDF data using the rxCube
function rather than crosstabs. Compare the timings.

6. Tidy up the workspace and remove the flightDataSample, flightDataSampleXDF,
csvDelaySummary, xdfDelaySummary, csvCrossTabInfo, xdfCrossTabInfo, csvCubeInfo, and
xdfCubeInfo variables.

Results: At the end of this exercise, you will have created a new XDF file containing the airline delay
data for the year 2000, and you will have performed some operations to test its performance.

Exercise 2: Combining and transforming data


Scenario
You have been provided with a set of CSV files containing flight delay data for the years 2000 through
2008. You need to create a single XDF file from this data. You also decide to create a new column named
Delay that contains the sum of all the delay reasons from each observation. You want to add a factor that
contains the month name rather than using the numeric month number. This process will require more
resources than are available on your desktop computer using R client, so you decide to perform these
tasks on a more powerful R server cluster.

The main tasks for this exercise are as follows:

1. Copy the data files to R Server


2. Create a remote session

3. Import the flight delay data to an XDF file

 Task 1: Copy the data files to R Server


 On the LON-DEV VM, copy the following files from the E:\Labfiles\Lab02 folder to the \\LON-
RSVR\Data share:

 2000.csv

 2001.csv

 2002.csv

 2003.csv

 2004.csv

 2005.csv

 2006.csv

 2007.csv

 2008.csv

 Task 2: Create a remote session


1. Return to your R environment.

2. Start a remote session on the LON-RSVR VM. When prompted, specify the username admin, and the
password Pa55w.rd.

3. At the REMOTE> prompt, temporarily pause the remote session and return to the local session
running on the LON-DEV VM.

4. Use the putLocalObject function to copy the local object, flightDataColumns, to the remote
session.

5. Resume the remote session.


6. Verify that the flightDataColumns variable is now available in the remote session.

 Task 3: Import the flight delay data to an XDF file


1. Use the rxImport to import a test sample of 1,000 rows from the 2000.csv file into a file named
2000.xdf. Save the 2000.xdf file to the \\LON-RSVR\Data share. Perform the following
transformations as you import the data:

 Create a Delay column that sums the values in the ArrDelay, DepDelay, CarrierDelay,
WeatherDelay, NASDelay, SecurityDelay, and LateAircraftDelay columns. Note that, apart
from the ArrDelay and DepDelay columns, this data can contain NA values that you should
convert to 0 first.

 Add a column named MonthName that holds the month name derived from the month
number. This column must be a factor.

 Filter out all cancelled flights (flights where the Cancelled column contains 1).

 Remove the variables FlightNum, TailNum, and CancellationCode from the dataset.

2. Examine the first few rows of the XDF file to verify the results.

3. Delete the XDF file from the \\LON-RSVR\Data share.

4. Import all the CSV files in the \\LON-RSVR\Data share into an XDF file called FlightDelayData.xdf.
Perform the same transformations from step 1, and save the file to the \\LON-RSVR\Data share.
Note that this process can take some time—you should enable progress reports for the rxImport
operation so you have confirmation that it is working. You might also find that setting the number of
rows per read can impact performance. Try setting this value to 500,000, which will create chunk sizes
of approximately half a million rows.

5. Close the remote session.
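The following sketch shows one possible shape for the test import in step 1. The argument values are
illustrative, and it assumes that flightDataColumns describes the column types:

testFlightData <- rxImport(
  inData = "\\\\LON-RSVR\\Data\\2000.csv",
  outFile = "\\\\LON-RSVR\\Data\\2000.xdf",
  colClasses = flightDataColumns,
  numRows = 1000,
  rowSelection = (Cancelled == 0),
  varsToDrop = c("FlightNum", "TailNum", "CancellationCode"),
  transforms = list(
    Delay = ArrDelay + DepDelay +
      ifelse(is.na(CarrierDelay), 0, CarrierDelay) +
      ifelse(is.na(WeatherDelay), 0, WeatherDelay) +
      ifelse(is.na(NASDelay), 0, NASDelay) +
      ifelse(is.na(SecurityDelay), 0, SecurityDelay) +
      ifelse(is.na(LateAircraftDelay), 0, LateAircraftDelay),
    MonthName = factor(month.name[Month], levels = month.name)),
  overwrite = TRUE)
rxGetInfo(testFlightData, getVarInfo = TRUE, numRows = 6)

For the full import in step 4, one approach is to loop over the CSV files, calling rxImport for each one
with append = "rows" after the first, and adding the reportProgress and rowsPerRead arguments
described in the step.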

Results: At the end of this exercise, you will have created a new XDF file containing the cumulative
airline delay data for the years 2000 through 2008, and you will have performed some transformations
on this data.

Exercise 3: Incorporating data from SQL Server into an XDF file


Scenario
The flight delay data includes codes for the origin and destination of each flight (these are international
IATA codes), but does not specify in which state each airport is located. You have access to a SQL Server
database that contains a separate list of airport IATA codes and their locations, including the state. You
need to retrieve this information from SQL Server and incorporate it into the flight delay XDF file.

The main tasks for this exercise are as follows:

1. Import the SQL Server data

2. Add state information to the flight delay XDF data

 Task 1: Import the SQL Server data


1. Use a trusted connection to create an RxSqlServerData data source that connects to the Airports
table in the AirlineData database on the LON-SQLR server.

2. View the first few rows of data to establish the columns that it contains.

3. Import the airport data into a small data frame, and then convert all string data to factors.
4. View the first few rows of the data frame to ensure that it contains the same data as the original
SQL Server table.
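A sketch of this task follows; the exact connection string (including the ODBC driver name) might differ
in your environment:

sqlConnString <- "Driver=SQL Server;Server=LON-SQLR;Database=AirlineData;Trusted_Connection=TRUE"
airportSqlData <- RxSqlServerData(connectionString = sqlConnString, table = "Airports")
rxGetInfo(airportSqlData, getVarInfo = TRUE, numRows = 5)
airportInfo <- rxImport(airportSqlData, stringsAsFactors = TRUE)
head(airportInfo)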

 Task 2: Add state information to the flight delay XDF data


1. Start another remote session on the LON-RSVR VM. When prompted, specify the username admin,
and the password Pa55w.rd.

2. Temporarily pause the remote session and copy the data frame containing the airport information to
the remote session. Use the putLocalObject function.

3. Resume the remote session.

4. Use the rxImport function to read the FlightDelayData.xdf file on the \\LON-RSVR\Data share and
add the following variables (see the sketch after this task list):
 OriginState. This should contain the state from the data frame you created in the previous task
where the IATA code in that data frame matches the Origin variable in the XDF file.

 DestState. This is very similar, except it should match the IATA code in the data frame against
the Dest variable in the XDF file.

Save the result as EnhancedFlightDelayData.xdf in the \\LON-RSVR\Data share.

Note that you must use the transformObjects argument of rxImport to make the data frame
accessible to the transformation.

5. View the first few rows of the XDF file to ensure that they contain the OriginState and DestState
variables.

6. Stay connected to the remote session.
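The following sketch illustrates the lookup transformation in step 4. It assumes that the airport data frame
was copied to the remote session as airportInfo, and that it contains columns named iata and state;
substitute the actual column names from the Airports table:

enhancedFlightData <- rxImport(
  inData = "\\\\LON-RSVR\\Data\\FlightDelayData.xdf",
  outFile = "\\\\LON-RSVR\\Data\\EnhancedFlightDelayData.xdf",
  transforms = list(
    OriginState = stateLookup$state[match(Origin, stateLookup$iata)],
    DestState = stateLookup$state[match(Dest, stateLookup$iata)]),
  transformObjects = list(stateLookup = airportInfo),
  overwrite = TRUE)
rxGetInfo(enhancedFlightData, getVarInfo = TRUE, numRows = 5)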

Results: At the end of this exercise, you will have augmented the flight delay data with the state in which
the origin and destination airports are located.

Exercise 4: Refactoring data and generating summaries


Scenario
Now that you have imported the flight delay data and included the information that you require, you
want to start analyzing the data and determine which factors might be influential in causing flight delays.
You theorize that delays might be a function of the origin and/or destination airport and state for a flight,
so you decide to examine the delay statistics broken down by these factors.

The main tasks for this exercise are as follows:

1. Examine delay intervals by airport and state

2. Summarize delay intervals by airport and state

 Task 1: Examine delay intervals by airport and state


1. The Delay variable is a continuous value. You determine that it would be more useful to group the
delay times into the following series of intervals:

 No delay

 Up to 30 minutes
 30 minutes to 1 hour

 1 to 2 hours

 2 to 3 hours

 More than 3 hours

In the remote session, generate four cross-tabulations (using rxCrossTabs) that report the delay
intervals by origin airport, destination airport, origin state, and destination state. Use a transformation
with each cross-tabulation to factorize the delays as described.

2. Display the results.

3. Close the remote session, and return to the local session on the LON-DEV VM.
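The following sketch shows one of the four cross-tabulations described in step 1 (the origin state version);
the other three differ only in the variable on the right of the colon. The break points and labels are
illustrative:

rxCrossTabs(~ DelayInterval:OriginState,
            data = "\\\\LON-RSVR\\Data\\EnhancedFlightDelayData.xdf",
            transforms = list(
              DelayInterval = cut(Delay,
                breaks = c(-0.5, 0, 30, 60, 120, 180, Inf),
                labels = c("No delay", "Up to 30 minutes", "30 minutes to 1 hour",
                           "1 to 2 hours", "2 to 3 hours", "More than 3 hours"))))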

 Task 2: Summarize delay intervals by airport and state


1. The cross-tabulations are probably still too detailed to enable you to draw any conclusions without
plotting the data graphically. Instead, you decide to quickly calculate the mean delay by airport and
state to see whether there are any notable variations. This requires using dplyrXdf, so you should
build and install the dplyrXdf package.

2. Create an RxXdfData data source that references the following variables in the XDF file. These are the only
variables that the analysis will require:

 Delay

 Origin

 Dest
 OriginState

 DestState

3. Use the data accessible through the data source to calculate the mean delay for each origin airport,
sort the results, and then display them. Use a dplyrXdf pipeline. You will need to persist the final
results, otherwise they will be deleted automatically by the pipeline.

4. Repeat the previous step to calculate and display the mean delay for each destination airport.

5. Repeat step 3 to calculate and display the mean delay by origin state.

6. Repeat step 3 to calculate and display the mean delay by destination state.

7. Save the script as Lab2Script.R in the E:\Labfiles\Lab02 folder, and close your R development
environment.
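The following sketch shows a dplyrXdf pipeline for step 3 (mean delay by origin airport). It assumes that
essentialFlightData is the data source you created in step 2, and that persist is used to keep the pipeline
output in a permanent XDF file:

library(dplyrXdf)
originDelays <- essentialFlightData %>%
  group_by(Origin) %>%
  summarise(meanDelay = mean(Delay)) %>%
  arrange(desc(meanDelay)) %>%
  persist("meanDelayByOrigin.xdf")
rxGetInfo(originDelays, numRows = 10)

The summaries in steps 4 to 6 follow the same pattern, grouping by Dest, OriginState, and DestState
respectively.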

Results: At the end of this exercise, you will have examined flight delays by origin and destination
airport and state.

Question: From the summaries that you developed, were you able to perceive any
relationship between airport or state and flight delay times?

Question: Given your answer to the first question, is the effort performing these tasks
justified so far?

Module Review and Takeaways


In this module, you learned how to:

 Create and use ScaleR data sources.

 Import and transform data into the XDF format used by ScaleR.
 Summarize data held in an XDF file.

Module 3
Visualizing Big Data
Contents:
Module Overview 3-1 

Lesson 1: Visualizing in-memory data 3-2 

Lesson 2: Visualizing big data 3-22 

Lab: Visualizing data 3-33 

Module Review and Takeaways 3-39 

Module Overview
Visualization is an essential part of the data manipulation and modeling process. In addition to presenting
results in a report, you should visualize your data or results often and in a variety of ways before, during
and after your analysis. Building plots is an important tool for experimental design and can help you to
identify issues in your data that need to be addressed. Plotting your model coefficients can also help you
interpret them much more easily.

The common first step in working with big data is to build a subset of your big dataset that you can
process locally, in-memory. You typically develop models and algorithms iteratively on this subset before
deploying to a server, cluster or database service to run on the full big dataset. After the analysis has run,
you will commonly bring model objects and results back onto your local machine to produce plots for
presentation. You need to be able to visualize data both locally and at scale—there are two different tools
to use.

With in-memory data, you have access to a broader and more flexible range of visualization tools.
However, these tools would soon break when applied to large, chunked data too big to fit in memory.
ScaleR™ provides functions for the quick and efficient building of simple plots to visualize very large
datasets.

Objectives
In this module, you will learn how to:

 Use the ggplot2 package to visualize in-memory data.


 Use the ScaleR rxLinePlot and rxHistogram functions to generate graphs based on big data.

Lesson 1
Visualizing in-memory data
This lesson describes how to use ggplot2. This is a very popular and flexible third-party R package that
provides an incredibly rich suite of functions for visualizing in-memory data—it can be used for
generating a wide range of graphs. Although it has the limitations associated with using in-memory data,
the ggplot2 package is useful when working with big data. You might break down big data into smaller
subsets, or use clustering algorithms to coalesce similar data together, and then use ggplot2 functions to
help spot trends in this data. You can then use the big data to develop predictive models that substantiate
or disprove these trends.

Lesson Objectives
After completing this lesson, you will be able to:

 Describe the elements of a ggplot2 graph and how to compose them.
 Use ggplot2 geometries to present information in
different ways.
 Combine plots to create overlays.
 Use aesthetic mappings to style the elements of a
graph.
 Create side-by-side plots to help you to visualize
grouped data.
 Customize bar plots.
 Explain the different coordinate systems available in the ggplot2 package.

Overview of the ggplot2 package


R provides several plotting functions as standard, but some people find that many of these functions are
quite limiting. Base R plotting functions typically follow the “pen and paper” model—each part of the plot
is “drawn” on top of the previous ones. For example, you might have a scatter plot and draw a regression
line on top, and then a legend and further annotation on top of that. This method makes it difficult to
both reuse plotting code between different datasets and to combine different types of plots. It also means
that you must be careful with the order in which you add the layers to achieve the desired effect. The
ggplot2 package was designed to get around these problems by following a concept known as the
“grammar of graphics”. In the grammar of graphics, a statistical plot is a mapping from your raw data to
aesthetic attributes (such as color, position, shape, and so on) of geometric objects in a coordinate space.
Plots in ggplot2 are made up of several independent components that can be put together in a variety of
ways. Although this might sound confusing, this method has several advantages over base R plotting:

 Because the data is independent from the other components, you can reuse the same plot with
different datasets.

 You can add statistical transformations and summaries to your plots.

 You can chain plot objects (such as annotations, legends, trend lines, and so on) together without
having to worry about the order in which they are combined.

 Sensible defaults mean that you can produce professional looking plots very quickly.

Elements of a graph
A ggplot2 graph consists of the following components:

 Data. This is the raw data that is passed to the plot object—this must be a data frame.

 Aesthetic mappings. These translate the raw data to data that is readable on a graph. For example, a
continuous variable might map to points along an x axis, and a factor might map to a point size or a
color.

 Geometric objects (geoms). The actual plot type you will be using—for example, line, point, or
histogram. Each plot must have at least one geom.

 Coordinate system. This defines how the data maps onto space in the plot. A single coordinate
system applies to a single plot object. Options include Cartesian coordinates, log and polar
coordinates.

 Statistical transformations (stats). Use these to transform or summarize your data—for example, to
generate regression lines, smoothers, bins for histograms, and boxplots. Each geom has its own
default stat.

 Facets. These describe how to split data into different panels, according to the values of the supplied
variables.

Note: The functions in the ggplot2 package are accessible through the ggplot2 library. You
should bring this library into scope:

library(ggplot2)

Alternatively, you can install the tidyverse package and bring the tidyverse library into scope.
This package and library includes ggplot2 and other useful packages often used with ggplot2,
such as dplyr:

install.packages("tidyverse")
library(tidyverse)

The following example shows a simple scatter plot based on the sample mtcars dataset provided with R.
The graph shows the relationship between engine capacity (displacement) and the miles per gallon
achieved:

Using the ggplot function to generate a scatter plot


ggplot(mtcars) +
geom_point(mapping = aes(x = disp, y = mpg))

The first line defines the basic plot object by calling ggplot with the dataset as the first argument. You
can then chain geoms, stats, facets, coordinate types, and so on, to this using the “+” operator. The
second line adds the point geom to make an XY plot. All geom_* functions accept a “mapping” argument
to define the mapping from the data to plot space. This is an aes() function call, here mapping “disp”
(engine displacement in cc) to the x axis and “mpg” (miles per gallon) to the y axis. The default is always
Cartesian coordinates and no facets, so you don’t need to define these. Also, the default stat for
geom_point is “identity”, which means no transformation takes place—so you don’t need to set this either.

The graph generated by this example looks like this:

FIGURE 3.1: SCATTER PLOT OF ENGINE DISPLACEMENT VERSUS MILES PER GALLON

Note that any of the objects in the ggplot chain are just standard R objects, so they can be assigned to
variables. The following example gives the same result as the code:

Using an intermediate variable rather than chaining


p <- ggplot(mtcars)
p + geom_point(mapping = aes(x = disp, y = mpg))

Understanding ggplot geometries


You use geometries (geoms) to determine the
type of plot to be produced. When you are
deciding how to present statistical graphics, it is
useful to think about the type of data and number
of dimensions you want to represent. The example
shown in the previous topic used the geom_point
geometry. You typically use this geometry to
illustrate any relationship between two variables.

The following examples show how to use other geometries.

This example produces a bar plot of the counts of cars with each number of gears in the mtcars data. The
default statistic is count, which counts the number of cases at each x position. Use this geometry to
describe a single discrete variable:

Using a geom_bar with a single discrete variable


ggplot(mtcars) +
geom_bar(mapping = aes(x = gear))

The result looks like this:

FIGURE 3.2: BAR PLOT SHOWING THE NUMBER OF CARS WITH DIFFERENT GEARS

The next example creates a histogram using the geom_histogram geometry. Use this geometry to
categorize a single continuous variable with binned values—in this case, the frequencies of times taken by
cars in the mtcars dataset to travel a quarter of a mile. You can adjust the width of the bins with the
binwidth argument, and the number of bins with the bins argument.

Using geom_histogram with a single continuous variable


ggplot(mtcars) +
geom_histogram(mapping = aes(x = qsec), binwidth = 2)

This is the result:

FIGURE 3.3: HISTOGRAM SHOWING THE TIMES TAKEN FOR CARS TO TRAVEL A QUARTER OF A
MILE

Use the geom_line geometry to show two variables with values joined by lines. The following example
presents the same data shown earlier as a line plot, but this time with data points connected by lines. This
is particularly useful for presenting time series or repeated measures data where there is a relationship
from point to point. You can replace geom_line with geom_step to build a stairstep plot to highlight
exactly where change happens—or replace with geom_path to join points in the order they appear in the
data.

Using geom_line with two variables


ggplot(mtcars) +
geom_line(mapping = aes(x = disp, y = mpg))

The resulting graph looks like this:

FIGURE 3.4: LINE PLOT SHOWING THE RELATIONSHIP BETWEEN ENGINE DISPLACEMENT AND
MILES PER GALLON

You can refine a line plot by smoothing the line to create a regression curve, using the geom_smooth
geometry.
The default smoothing method is local regression (LOESS).

Using geom_smooth with default smoothing


ggplot(mtcars) +
geom_smooth(mapping = aes(x = disp, y = mpg))

The graph that this generates also includes an indication of the confidence intervals, shown shaded as a
gray area around the smoothed line:

FIGURE 3.5: SMOOTHED LINE REGRESSION PLOT INCLUDING CONFIDENCE INTERVALS



Rather than using LOESS, you can perform a standard linear regression by specifying the value "lm" to the
method argument of the geometry. You can also use nonlinear regression terms such as polynomials
using the formula argument.

Using geom_smooth with nonlinear regression terms


ggplot(mtcars) +
geom_smooth(mapping = aes(x = disp, y = mpg), method = "lm",
formula = y ~ poly(x,2))

This code generates the following results over the mtcars dataset:

FIGURE 3.6: NONLINEAR REGRESSION PLOT USING POLYNOMIAL TERMS



Combining plots
You can combine multiple geometries together using the
+ operator to create overlay plots.

The following example shows the geom_smooth regression plot in combination with geom_point to show
a scatter plot with an overlaid regression line:

Overlaying geometries to create a combined plot


ggplot(mtcars, mapping = aes(x = disp, y = mpg)) +
geom_smooth(method = "lm", formula = y ~ poly(x,2)) +
geom_point()

Notice this example has moved the mapping argument to the initial call to ggplot. This means that the
mappings are then available to all the chained geoms.

This is the resulting graph:

FIGURE 3.7: OVERLAYING GEOMETRIES TO CREATE COMBINED PLOTS



Using aesthetic mappings


You use aesthetic mappings to modify the way in
which the data is presented. Using the aes
function, you can set many options, including:

 color. The color of points and lines.

 size. The size of points.

 shape. The shape of points.

 linetype. The style of line.

 fill. The color of bars, histograms, and polygons.

For more information on specifying aesthetic mappings, see:
https://cran.r-project.org/web/packages/ggplot2/vignettes/ggplot2-specs.html.

The following example uses the same scatter plot from earlier in this module, but it groups the cars by
their transmission type (manual or automatic). The mapping uses the aes function to set the color of the
points based on the type of transmission. In this example, the scale_color_manual function specifies the
colors to use for each category. Note that, rather than feed the raw variable which has the values 0 (for
automatic) and 1 (for manual), the code converts it to a factor with more intuitive labels. This code also
uses the labs function to set the labels displayed for each variable, and color them in the same way as the
points on the graph.

Setting the colors of points and labels by category


ggplot(mtcars, mapping = aes(x = disp, y = mpg,
color = factor(am, levels = c(0,1),
labels = c("Automatic", "Manual")))) +
scale_color_manual(values = c("blue", "yellow")) +
geom_point() +
labs(color = "Transmission")

The graph looks like this:

FIGURE 3.8: SCATTER PLOT SHOWING POINTS AND LABELS COLORED BY CATEGORY

The following example shows another use of the aes function with a tile plot. This example illustrates the
relationship between three variables in mtcars—the number of gears (“gear”), the number of cylinders
(“cyl”), and the brake horsepower (“hp”). Notice that “hp” has been assigned to fill in the mapping
function. The geom fits an appropriate scale to the color variation by default.

Using the aes function to create a colored tile plot


ggplot(mtcars, mapping = aes(x = cyl, y = gear, fill = hp)) +
geom_tile()

This is the result:

FIGURE 3.9: TILE PLOT USING COLOR TO SHOW THE RELATIONSHIP BETWEEN THREE VARIABLES

Creating facets
You use facets to split your data into different plot panels according to the categorical variables you
choose to facet by. These panels are then laid out in a grid. Any statistical transformations occur within
each panel—for example, if you use geom_smooth, a new regression line will be fitted to the data in
each panel. There are two faceting functions you can chain into your ggplot2 plot objects—facet_grid
and facet_wrap.

Using facet_grid
You use the facet_grid function to split your data into
rows and/or columns of plotting panels. The first
argument takes a formula expression, with the right-hand
side representing row facets and the left-hand side
representing column facets. A dot on either side of the
“~” means that you don’t want to facet in this dimension.

The following example takes the scatter plot from earlier and splits it into columns based on the car
transmission type. Note the "." on the left-hand side of the formula expression:

Creating a faceted plot using facet_grid


ggplot(mtcars) +
geom_point(mapping = aes(x = disp, y = mpg)) +
facet_grid(. ~ am)

The grid contains two columns (0 for automatic, 1 for manual). Note that for simplicity, this example has
used the original factor values rather than creating labels for each column.

FIGURE 3.10: DISPLAYING COLUMNS USING THE FACET_GRID FUNCTION



You can also facet by rows and columns. The following example splits the scatter plot into columns by the
“am” transmission type variable and rows by the “gear” variable, producing a 2 x 3 grid of plot panels.

Creating rows and columns in a faceted plot


ggplot(mtcars) +
geom_point(mapping = aes(x = disp, y = mpg)) +
facet_grid(gear ~ am)

While the graph is interesting, it does demonstrate that sometimes "less is more". In this case, breaking
the graphs down to this level of detail starts to obscure the previously clear relationship between
displacement and miles per gallon.

FIGURE 3.11: DISPLAYING ROWS AND COLUMNS USING THE FACET_GRID FUNCTION

Using facet_wrap
You use the facet_wrap function when you need to facet by a single variable, but that variable has more
levels than can fit in a single row or column. The panels will “wrap” around to the next line. You can set
the number of columns using the ncol argument, as shown in the next example. In this case, the data is
grouped by the number of carburetors, which has a value between 1 and 8 in the sample data:

facet_wrap function
ggplot(mtcars) +
geom_point(mapping = aes(x = disp, y = mpg)) +
facet_wrap( ~ carb, ncol = 2)

The result is the following set of panels. Note that because no cars have 5 or 7 carburetors, there are no
panels for these numbers:

FIGURE 3.12: WRAPPING PANELS USING THE FACET_WRAP FUNCTION

Customizing bar plots


You use the geom_bar geometry to create bar plots, as
shown earlier in this lesson. This geometry gives you
several options to customize the plot. The most
commonly used options are position and stat.

Using the position argument


The position option is useful when you are dealing with
grouped bar plots (grouping using the fill argument).
The default is position = "stack" which stacks the grouped
bars on top of each other. This example shows the
number of cars with each number of cylinders. The bars are organized and stacked according to the
transmission type:

The position argument


ggplot(mtcars) +
geom_bar(mapping = aes(x = factor(cyl), fill = factor(am)),
show.legend = FALSE,
position = "stack")

The results look like this:

FIGURE 3.13: CREATING A STACKED BAR PLOT

If you want to have the bars for the different groups next to each other, set position = "dodge". You can
set position = "fill" to have the bars stacked and stretched to a constant height to show relative
proportions.
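For example, a dodged version of the same plot places the bars for each transmission type side by side
(an illustrative variation):

ggplot(mtcars) +
  geom_bar(mapping = aes(x = factor(cyl), fill = factor(am)),
           position = "dodge")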

Using the stat argument


By default, geom_bar geometry counts data items in each group. If you want the bars to represent actual
data points, set stat = “identity” and map a variable to the y axis.

The following example shows the mpg for each car in the mtcars dataset. The code creates a data frame
containing the row names and the mpg variable. The graph calls the geom_bar function with stat =
”identity”.

Using the stat argument to graph data points


mpg_df <- data.frame(name = rownames(mtcars), mpg = mtcars$mpg)
ggplot(mpg_df) +
geom_bar(mapping = aes(x = name, y = mpg),
stat = "identity") +
coord_flip()

Because of their length, the names would have been unreadable on a standard x axis, so you chain in a
call to coord_flip() to flip the x and y axes, making the names readable. This is an example of a
coordinate system function and will be discussed in a later topic.

FIGURE 3.14: BAR PLOT DISPLAYING VALUES FOR INDIVIDUAL INSTANCES OF VARIABLES

Understanding coordinate systems


You have already seen how to switch the x and y
axes in a plot by using coord_flip(). Coordinate
systems map the position of objects onto the
plane of the plot. You will probably use Cartesian
coordinates (coord_cartesian) for most of your
statistical plots—this is the default. However, there
are alternative coordinate systems that are
sometimes useful, too. Note that a coordinate
system applies to the entire plot; you can't have
different coordinate systems for different geoms in
the same plot. The following list summarizes the
coordinate systems available with ggplot2:

 coord_fixed. These are Cartesian coordinates that have a fixed aspect ratio between the x and y axes.

 coord_flip. Flipped Cartesian coordinates.

 coord_polar. Polar coordinates can be used to create pie, donut and radar charts.

 coord_trans. Provides transformed Cartesian coordinates. This is different from statistical
transformations; the coordinate transformation is applied after any statistical transformations. It affects the
appearance of the geoms so, for example, grid lines might no longer be straight. You can apply a square
root transformation to an axis by setting x = "sqrt" or y = "sqrt".

 coord_map. Use this to create map projections using ggplot2. For more details, see:
http://docs.ggplot2.org/current/coord_map.html.
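For example, chaining coord_polar onto a simple bar plot turns it into a pie chart (an illustrative sketch
using the mtcars data):

ggplot(mtcars, mapping = aes(x = "", fill = factor(cyl))) +
  geom_bar(width = 1) +
  coord_polar(theta = "y")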

Arranging plots
You might want to place two plots next to each other
to compare the results. Base R has the par() function
to do this for you, but the ggplot2 system is
different—you can’t use this function. Instead, you
can use the gridExtra package to arrange multiple
plots in a variety of ways.

The following example compares two smoothing transformations to the mtcars data. The first is the
default LOESS smoother and the second is a linear model with a second-degree polynomial
transformation. You first need to assign the individual plot objects to variables, then you use the
grid.arrange() function to arrange the plots together. The nrow and ncol arguments control how the
individual plots are arranged; here, ncol = 2 places them side by side.

Arranging plots side by side


library(gridExtra)
p1 <- ggplot(mtcars, mapping = aes(x = disp, y = mpg)) +
geom_point() +
geom_smooth() +
labs(title = "LOESS smoother")
p2 <- ggplot(mtcars, mapping = aes(x = disp, y = mpg)) +
geom_point() +
geom_smooth(method = "lm",
formula = y ~ poly(x,2)) +
labs(title = "2nd degree polynomial lm")
grid.arrange(p1, p2, ncol = 2)

FIGURE 3.15: SIDE-BY-SIDE PLOTS ARRANGED BY USING GRIDEXTRA


The gridExtra package is very powerful and can arrange ggplot objects with plots from base R, lattice
plots and descriptive text boxes.

For more information, see: https://cran.r-project.org/web/packages/gridExtra/vignettes/arrangeGrob.html.

Demonstration: Creating a faceted plot with overlays using ggplot


This demonstration shows how to use the ggplot2 package to create scatter plots.

Demonstration Steps

Creating a scatter plot


1. Open your R development environment of choice (RStudio or Visual Studio®).

2. Open the R script Demo1 - ggplot.R in the E:\Demofiles\Mod03 folder.

3. Highlight and run the code under the comment # Install packages. This code loads the tidyverse
package (including ggplot2 and dplyr), and brings the tidyverse library into scope.

4. Highlight and run the code under the comment # Create a data frame containing 2% of the flight
delay data. This code populates a data frame with a small subset of the flight delay data. The
rowSelection argument uses the rbinom function to select a random sample of observations.

5. Highlight and run the code under the comment # Generate a plot of Departure Delay time versus
Arrival Delay time. These statements use ggplot to create a scatter plot. Note that the geom_point
method sets an alpha level of 1/50. This helps to highlight the density of the data (the more dense
the data points, the darker the plot area). Even with a small subset of the data, it still takes a couple of
minutes to create this plot.

Add an overlay
 Highlight and run the code under the comment # Fit a regression line to this data. This code uses
the geom_smooth function to fit a line to the data. Note that the mapping argument has moved to
the ggplot function so that it is available to geom_point and geom_smooth.

Facet the plot


1. Highlight and run the code under the comment # Facet by month. This code uses the facet_wrap
function to organize the data by month and display the data for each month in a separate panel.
Note how the regression line varies between months.

2. Close the R development environment, without saving any changes.

Verify the correctness of the statement by placing a mark in the column to the right.

Statement Answer

The data source for a ggplot2 graph must be a data frame. True or False.

Lesson 2
Visualizing big data
The ggplot2 package provides a wide range of powerful data visualization tools. However, when you
work with big datasets, you will often not have the luxury of being able to read all your data into
memory—so ggplot2 might not always be appropriate. The big data plotting functions in the RevoScaleR
package, while involving a little more complexity, have some very important advantages:

 They can operate on data frames or XDF files.

 XDF files can be accessed on a server or cluster.

 When working with XDF files, only the selected variables are read in, making them highly efficient.

 XDF files are processed in chunks, the algorithms summarizing as they go.
You use these features to visualize arbitrarily large datasets on the fly whereas, with traditional plotting
tools, you would have to first take subsets or samples of the data small enough to fit into memory.

Note that, if you are using these functions locally, R Client does not support chunking. This means that, in
this case, all the data needs to be read into memory—in a large dataset, this can exhaust your memory
quickly. To get around this, you push the compute context to a Microsoft R server instance.

For more information about visualizing big data with Microsoft R, see:
Visualizing Huge Data Sets: An Example from the US Census
https://aka.ms/qd95hg

Ostensibly, the ScaleR big data plotting functions appear to have quite limited functionality. However,
these functions act as wrappers for the more comprehensive features of the lattice graphics package—
you can use many lattice features to customize the layout and appearance of graphs.
For a comprehensive introduction to lattice graphics in R, see: http://lattice.r-forge.r-
project.org/Vignettes/src/lattice-intro/lattice-intro.pdf.

Lesson Objectives
After completing this lesson, you will be able to:

 Create and customize scatter and line plots using the rxLinePlot function.

 Create histograms using the rxHistogram function.

 Save the graphical results of a plot.

 Transform data as it is being plotted.



Creating scatter and line plots


Use the rxLinePlot() function to quickly construct
line plots or XY scatter plots from data frames or
XDF files. To use this function, you provide:

 A formula that describes the data and relationship to plot.

 The data source—this can be a data frame, an RxXdfData data source, or the name of an XDF file.

 The type of plot.

 Optional arguments, such as titles and labels for axes.

The following example uses rxLinePlot to create a scatter plot showing the relationship between engine
displacement and miles per gallon using the mtcars data frame:

rxLinePlot(mpg ~ disp, data = mtcars, type = "p",
           xlab = "Engine displacement (cu.in.)",
           ylab = "Miles per gallon")

This is the result:

FIGURE 3.16: A SCATTER PLOT GENERATED BY USING RXLINEPLOT



In the example, the first argument is a formula expression, with the dependent variable (the “y” variable)
on the left-hand side and the independent variable (the “x” variable) on the right-hand side. Note that the
example also sets the labels for the x and y axes using the xlab and ylab arguments.

The rxLinePlot function is a wrapper around the xyplot() function in the lattice package that comes with
base R. This means that, alongside using the plot customization arguments in rxLinePlot, you use the
ellipsis (“…”) arguments to customize your plots further by passing arguments to xyplot. For example, you
can customize the scales for the x and y axes by using the scales argument.
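For example, the following sketch passes a lattice scales argument through rxLinePlot to plot the x axis
on a logarithmic scale (an illustrative variation on the plot above):

rxLinePlot(mpg ~ disp, data = mtcars, type = "p",
           scales = list(x = list(log = 10)))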

You determine the type of plot by passing a character vector to the type argument. The options are:

 l: this gives a line plot, where the lines are connected in the order they appear in the data. This is
analogous to geom_path in ggplot2.

 p: an XY scatter plot like geom_point.

 b: both points and lines.

 s: a stairstep plot, like geom_step.

 smooth: a LOESS smoother fit like geom_smooth.

 r: a regression line like geom_smooth(method = "lm").

You can combine multiple types. For example, specifying type = c("p", "r") generates a scatter plot with a
regression line overlay.
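For example, using the mtcars data from the earlier examples, a combined scatter and regression plot
could be produced like this:

rxLinePlot(mpg ~ disp, data = mtcars, type = c("p", "r"),
           xlab = "Engine displacement (cu.in.)",
           ylab = "Miles per gallon")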

FIGURE 3.17: A SCATTER PLOT WITH A REGRESSION LINE OVERLAY



Creating a trellis plot


In addition to creating simple scatter and line plots, you can build trellis plots, with different plots
grouped by categorical variables—in the same way as you can with the facet functions in ggplot2.

There is no separate function to do this—you must specify the facets in the formula input to the first
argument. You use a “|” (vertical bar) on the right-hand side of the formula to indicate that you are
conditioning on a variable. The next example shows the miles per gallon versus displacement plot
conditioned by gear (the number of gears in the transmission of the different cars):

Creating a trellis plot


rxLinePlot(mpg ~ disp | gear, data = mtcars, type = "p")

FIGURE 3.18: A TRELLIS PLOT ORGANIZED BY NUMBER OF GEARS

If required, you can further refine this approach to drill deeper into the data. For example, you can specify
multiple conditioning variables by using the + operator. This example includes the car transmission (the
am variable).

Conditioning by number of gears and transmission


rxLinePlot(mpg ~ disp | gear + am, data = mtcars, type = "p")

This code generates a lattice containing a pane for each combination of conditioning variable:

FIGURE 3.19: A TRELLIS PLOT WITH TWO CONDITIONING VARIABLES

Working with chunked data


If you are using a large XDF data source, generating a graph can take some time. You can specify the
following options to rxLinePlot to help optimize the process, and report on how the data is being
managed:

 blocksPerRead. This is the number of blocks to read for each chunk of data read from the data
source. You can vary this to create plots more efficiently from very large datasets. This argument is
ignored when you run locally in R Client.

 reportProgress. You can use this argument to generate feedback while generating a plot over a big
dataset. You can specify the following values:

o 0: no progress is reported.

o 1: the number of processed rows is reported.

o 2: rows processed and timings are reported.

o 3: rows processed and all timings are reported.

Another common technique is to summarize a large dataset into a more manageable size by using the
rxCube function. You can then convert the rxCube object into a data frame by using the rxResultsDF
function and pass this object to rxLinePlot.
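A minimal sketch of this pattern, using the AirlineDemoSmall sample data that is referenced later in this
lesson (the statistic that rxCube produces for ArrDelay here is the mean delay for each day):

airlineData <- file.path(rxGetOption("sampleDataDir"), "AirlineDemoSmall.xdf")
delayCube <- rxCube(ArrDelay ~ DayOfWeek, data = airlineData)
delayByDay <- rxResultsDF(delayCube)
rxLinePlot(ArrDelay ~ DayOfWeek, data = delayByDay, type = "b")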

Creating histograms
You can create histogram plots from both data
frames and XDF files using the rxHistogram
function. This function uses the rxCube function to process
the data at scale and to create the bins. These bins
are then passed to graphics functions from the
lattice package. Many of the arguments are the
same as for rxLinePlot—you can pass arguments
to lattice using the ellipsis (“…”) arguments.
Behind the scenes, rxHistogram uses the lattice
barchart function.

The simplest form of histogram is just looking at


the frequencies of the bins for a single variable.
The first argument is a one-sided formula, with the variable to be plotted on the right-hand side. This
example uses the small airline demonstration flight delay data file that comes with Microsoft R Client. This
graph shows the distribution of 600,000 airline scheduled departure times.

Generating a histogram of airline departure times


airlineData <- file.path(rxGetOption("sampleDataDir"),
"AirlineDemoSmall.xdf")
rxHistogram(~CRSDepTime, data = airlineData)

FIGURE 3.20: A HISTOGRAM SHOWING THE DISTRIBUTION OF AIRLINE DEPARTURE TIMES



In the previous example, the CRSDepTime variable was treated as a continuous numeric variable by the
graph, which makes the bins rather an odd size. You can use the F function to factorize data on the fly,
and split it into bins according to the integer range in which it lies. The following example uses this
approach to show the data hour by hour:

Using the F function to factorize data on the fly


rxHistogram(~F(CRSDepTime), data = airlineData)

The histogram produced by this code is a little easier to interpret. The busiest time of the day for
departures is between 6:00 AM and 8:00 AM. Unsurprisingly, very few aircraft depart in the small hours of
the morning.

FIGURE 3.21: A HISTOGRAM SHOWING THE REFACTORED DEPARTURE TIMES



You can modify the number of bins for the histogram by using the numBreaks argument.

Specifying the number of bins


rxHistogram(~CRSDepTime, data = airlineData, numBreaks = 8)

You can also specify lower and upper limits for numeric data using the startVal and endVal arguments.
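For example, the following sketch restricts the histogram to morning departures by setting explicit limits:

rxHistogram(~CRSDepTime, data = airlineData, numBreaks = 12,
            startVal = 0, endVal = 12)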

Like rxLinePlot, you can split up your histogram plots, conditioned on other variables. Again, you use the
“|” (bar) notation to select the conditioning variables. In this example, the scheduled airline departure
times are grouped by the day of the week:

Creating a trellised histogram


rxHistogram(~F(CRSDepTime)| DayOfWeek, data = airlineData,
xlab = "Departure time")

FIGURE 3.22: A HISTOGRAM SHOWING DEPARTURE TIMES GROUPED BY DAY

Note: The default statistic that rxHistogram uses is a count of the number of observations
that fall in each bin. You can change this to report the percentage of observations in each bin by
setting the histType argument to "Percent".

Saving plots
You can save your rxLinePlot and rxHistogram plots in the
same way you can with base R plots. First, open a file handle
for a graphical file, then produce your plot, and then close
the file handle:

Saving a histogram as a jpeg file


jpeg("testplot.jpg")
rxHistogram(~F(CRSDepTime)| DayOfWeek, data = airlineData,
xlab = "Departure time")
dev.off()

You can save to various file types, including jpeg, tiff, pdf, and postscript.

Transforming data as it is plotted


You use the rxLinePlot and rxHistogram functions to
perform transformations using many of the same features
available with the other ScaleR functions. You use the
transforms argument to modify or create variables, and you
can filter data using the rowSelection argument.
The following example generates a line plot based on the log
of the miles per gallon and displacement data.

Transforming data in rxLinePlot


rxLinePlot(logMpg ~ logDisp, data = mtcars, type = "p",
transforms = list(logMpg = log(mpg),
logDisp = log(disp)),
xlab = "log Engine displacement (cu.in.)",
ylab = "log Miles per gallon")

Note that, when performing transformations on big datasets, you should carry out multiple
transformations in a single pass rather than performing several passes of single transformations. Also note
that, if you expect to use the transformed variables more than once, it might be beneficial to transform
your data before you plot it, using rxDataStep. The rxDataStep function is discussed in more detail in
Module 4: Processing Big Data.

Demonstration: Generating a histogram with rxHistogram


This demonstration shows how to visualize big data by using the rxHistogram function.

Demonstration Steps

Creating a histogram
1. Open your R development environment of choice (RStudio or Visual Studio).

2. Open the R script Demo2 - rxHistogram.R in the E:\Demofiles\Mod03 folder.

3. Highlight and run the code under the comment # Use the flight delay data. This code creates an
RxXdfData data source for the FlightDelayData.xdf file. This file contains 11.6 million records.

4. Highlight and run the code under the comment # Create a histogram showing the number of
flights departing from each state. This code uses the rxHistogram function to display a count of
flights for each value of the OriginState variable. The scales argument is a part of the lattice
functionality that underpins rxHistogram; your code uses this argument to change the orientation
and size of the labels on the x axis.

5. Highlight and run the code under the comment # Filter the data to only count late flights. This
code uses the rowSelection argument to only include observations where the ArrDelay variable is
greater than zero.

Creating side-by-side histograms and overlays


1. Highlight and run the code under the comment # Flights by Carrier. This chart shows a histogram of
the number of flights for each airline. The code uses the yAxisMinMax argument to set the limits of
the vertical scale.

2. Highlight and run the code under the comment # Late flights by Carrier. This chart shows a
histogram of the number of delayed flights for each airline. The code uses the yAxisMinMax
argument to set the limits of the vertical scale to the same as that of the previous chart. Additionally,
the plotAreaColor argument makes the background transparent; this chart will be used as an
overlay.
3. Highlight and run the code under the comment # Display both histograms in adjacent panels. This
code installs the latticeExtra package. This package provides functionality that you can use to
customize the layout of the panels that display graphs and chart. The code then displays both charts;
they appear in adjacent panels. Both charts use the same vertical scale, enabling you to compare the
number of flights against the number of delayed flights for each airline.

4. Highlight and run the code under the comment # Overlay the histograms. This statement uses the
+ operator to overlay the second chart (with the transparent background) on top of the first, making
it even easier to see the proportion of late flights for each airline.

5. Close the R development environment, without saving any changes.



Check Your Knowledge


Question

How can you control the number of bins used by the rxHistogram function if the
data being plotted is continuous rather than categorical?

Select the correct answer.

Convert the data into a factor.

Specify the numBreaks argument of the rxHistogram function.

Filter the data to remove any non-factor items.

Use the transforms argument of the rxHistogram function to round the data
up or down a set of discrete values.

Set the xNumTicks argument of the rxHistogram function to the number


required.

Lab: Visualizing data


Scenario
You have decided that it might be useful to perform some visualizations of the flight delay data to give
some ideas as to how and why delays might occur. You decide to generate plots of the delay times
against flight distance. You consider that generating plots of delay as a proportion of total travel time
might also be informative. Additionally, you want to ascertain whether delays might also be a function of
the departure state (are more states inherently prone to departure delays than others?), and the day of
the week. Finally, you decide to create histograms to see how bad weather leads to delays.

Objectives
In this lab, you will:

 Use the ggplot2 package to generate plots of flight delay data, to visualize any relationship between
delay and distance.

 Use the rxLinePlot function to examine data by departure state and day of the week.
 Use the rxHistogram function to examine the relative rates of the different causes of delay.

Lab Setup
Estimated Time: 60 minutes
 Username: Adatum\AdatumAdmin

 Password: Pa55w.rd

Before starting this lab, ensure that the following VMs are all running:
 MT17B-WS2016-NAT

 20773A-LON-DC

 20773A-LON-DEV

Exercise 1: Visualizing data using the ggplot2 package


Scenario
You want to determine whether there is any relationship between flight delay times and distance traveled.
You also want to see whether the state in which the departure airport is located is a factor in flight delays.

The main tasks for this exercise are as follows:


1. Import the flight delay data into a data frame

2. Generate line plots to visualize the delay data

 Task 1: Import the flight delay data into a data frame


1. Log in to the LON-DEV VM as Adatum\AdatumAdmin with the password Pa55w.rd.

2. Start your R development environment of choice (Visual Studio or RStudio), and create a new R file.

3. You will find the data for this exercise in the file FlightDelayData.xdf, located in the folder
E:\Labfiles\Lab03. Set your current working directory to this folder.

4. The ggplot2 package requires a data frame containing the data. The sample XDF file is too big to fit
into memory (it contains more than 11.6 million rows), so you need to generate a random sample
containing approximately 2% of this data. To reduce the size of the data frame further, you are only
interested in the Distance, Delay, Origin, and OriginState variables. Finally, you also want to remove
any anomalous observations; there are some rows that have a negative or zero distance which you
should discard.

Use the rxImport function to create a data frame that matches this specification from the
FlightDelayData.xdf file. You can use the rbinom base R function with a probability of 0.02 to
generate a random sample of 2% of the data as part of the rowSelection filter.
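A sketch of one possible import call follows; the exact argument values in your solution might differ:

flightDataSubset <- rxImport(
  inData = "FlightDelayData.xdf",
  varsToKeep = c("Distance", "Delay", "Origin", "OriginState"),
  rowSelection = (rbinom(.rxNumRows, size = 1, prob = 0.02) == 1) & (Distance > 0))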

 Task 2: Generate line plots to visualize the delay data


1. Install the tidyverse package and specify the tidyverse library. This package and library includes the
ggplot2 package and dplyr.

2. Create a scatter plot that shows the flight distance on the x axis versus the delay time on the y axis.
Give the axes appropriate labels.

3. Overlay the data with a line plot to help establish whether there is any pattern to the data presented
in the graph. Reduce the intensity of the points using the alpha argument. This will help to show the
frequency of delay times (more common delay times will appear darker). Also, filter outliers, such as
negative delays and delays greater than 1,000 minutes (you can use the dplyr filter function to
perform this task).

4. Facet the graph by OriginState to determine whether all states show the same trend. Use the
facet_wrap function (there are more than 50 states and US territories included in the data).
Note: You might receive some warning messages due to insufficient data for regressing the data for
some states. You can ignore these warnings, but you will see some states that don't include the
overlay.
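A sketch of the plotting steps, assuming the data frame from the previous task is named flightDataSubset
(as in the earlier sketch):

library(tidyverse)
flightDataSubset %>%
  filter(Delay >= 0, Delay <= 1000) %>%
  ggplot(mapping = aes(x = Distance, y = Delay)) +
  geom_point(alpha = 1/50) +
  geom_smooth(color = "red") +
  facet_wrap(~ OriginState) +
  labs(x = "Distance (miles)", y = "Delay (minutes)")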

Results: At the end of this exercise, you will have used the ggplot2 package to generate line plots that
depict flight delay times as a function of distance traveled and departure state.

Question: Does the regression indicate that there is any relationship between flight distance
and flight delay times?

Exercise 2: Examining relationships in big data using the rxLinePlot function

Scenario
Having established that distance traveled does not seem to be a major factor in predicting flight delay
times, you want to examine how flight delays vary with distance, as a function of the proportion of the
overall travel time for a flight. You surmise that the overall flight time should be related to the distance
traveled and hence show little relationship to the delay times, but you want to double-check this theory.

You also want to see how flight delays vary according to the day of the week, to establish whether this
could be a factor.
The main tasks for this exercise are as follows:

1. Plot delay as a percentage of flight time

2. Plot regression lines for the delay data

3. Visualize delays by the day of the week

 Task 1: Plot delay as a percentage of flight time


1. Import the data into a new XDF file with the following specification:

 Only include the Distance, ActualElapsedTime, Delay, Origin, Dest, OriginState, DestState,
ArrDelay, DepDelay, CarrierDelay, WeatherDelay, NASDelay, SecurityDelay, and
LateAircraftDelay variables.

 Remove all observations that have a negative distance.


 Remove all observations where the delay is negative or greater than 1,000 minutes.

 Remove all observations that have a negative value for ActualElapsedTime.

 Add a transformation that creates a variable called DelayPercent that contains the flight delay as
a percentage of the actual elapsed time of the flight.

 Save the data in an XDF file named FlightDelayDataWithProportions.xdf. Overwrite this file if
it already exists.
2. Create a cube that summarizes the data of interest (DelayPercent as a function of Distance and
OriginState). Use the following formula:

DelayPercent ~ F(Distance):OriginState

Note that you must factorize the Distance variable to use it on the right-hand side of a formula.

Filter out all observations where the DelayPercent variable is more than 100%.

Summarizing the data in this way makes it much quicker to run rxLinePlot as it now only has to
process a focused subset of the data.

3. Change the name of the first column in the cube from F_Distance (the name generated by the
F(Distance) expression in the formula) to Distance. You can use the base R names function to do
this.

4. Create a data frame from the data in the cube. This is necessary because the rxLinePlot function can
only process XDF format data or data frames, not rxCube data. Use the rxResultsDF function to
perform this conversion.

5. Generate a scatter plot of DelayPercent versus Distance. Experiment with the symbolStyle,
symbolSize, and symbolColor arguments to the rxLinePlot to see their effects.

Note: The legend on the x axis can become unreadable. You can remove this legend by setting the
scales argument to (list(x = list(draw = FALSE))). The scales argument is passed to the underlying
xyplot function.
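A sketch of steps 2 through 5 in this task; the exact file and variable names should match your own script:

delayPlotCube <- rxCube(DelayPercent ~ F(Distance):OriginState,
                        data = "FlightDelayDataWithProportions.xdf",
                        rowSelection = (DelayPercent <= 100))
names(delayPlotCube)[1] <- "Distance"
delayPlotData <- rxResultsDF(delayPlotCube)
rxLinePlot(DelayPercent ~ Distance, data = delayPlotData, type = "p",
           scales = (list(x = list(draw = FALSE))))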

 Task 2: Plot regression lines for the delay data


1. Add a smoothed line to the plot to help identify any pattern in the data (use the smooth type
together with the p type).

2. Break the plot down into facets, organized by OriginState.

Results: At the end of this exercise, you will have used the rxLinePlot function to generate line plots that
depict flight delay times as a function of flight time and day of the week.

Question: What do you observe about the graphs showing flight delay as a proportion of
travel time against distance? Does this bear out your theory that there is little relationship
between these two variables? If not, how do you account for any discrepancy between your
theory and the observed data?

 Task 3: Visualize delays by the day of the week


1. Import the data in the original FlightDelayData.xdf file into a new XDF file with the following
specification:

o Only include the Delay, CRSDepTime, and DayOfWeek variables.

o Remove all observations where the delay is negative or greater than 180 minutes (three hours).

o Remove all observations that have a negative value for ActualElapsedTime.

o Add a transformation that converts the CRSDepTime variable into a factor with 48 half-hour
intervals. Use the base R cut function to do this.

o Save the data in an XDF file named FlightDelayWithDay.xdf. Overwrite this file if it already
exists.

2. The DayOfWeek variable comprises numeric codes (1 = Monday, 2 = Tuesday, and so on). Recode
the DayOfWeek variable in the XDF file to a meaningful set of text abbreviations suitable for display
as values on a graph. Use the rxFactors function to do this.

3. Create a cube that summarizes Delay as a function of CRSDepTime and DayOfWeek. Use the
following formula:

Delay ~ CRSDepTime:DayOfWeek

4. Create a data frame from the cube.



5. Generate a scatter plot overlaid with a smooth line plot of delay as a function of departure time. Use
the following scales argument to display and orient the labels for the x and y axes:

scales = (list(y = list(labels = c("0", "20", "40", "60", "80", "100", "120", "140",
"160", "180")),
x = list(rot = 90),
labels = c("Midnight", "", "", "", "02:00", "",
"", "", "04:00", "", "", "", "06:00", "", "", "", "08:00", "", "", "", "10:00", "",
"", "", "Midday", "", "", "", "14:00", "", "", "", "16:00", "", "", "", "18:00", "",
"", "", "20:00", "", "", "", "22:00", "", "", "")))

Question: Using the graph showing delay times against departure time for each day of the
week, which time of day generally suffers the worst flight delays, and which day of the week
has the longest delays in this period?

Exercise 3: Creating histograms over big data


Scenario
You feel it might be useful to drill into the different causes of delay to see whether there are any patterns
here. In particular, you want to examine whether the arrival state might be a contributing factor to arrival
delays. You also want to determine whether delays caused by adverse weather conditions are more
prevalent at particular times of the year.
The main tasks for this exercise are as follows:

1. Create histograms to visualize the frequencies of different causes of delay

 Task 1: Create histograms to visualize the frequencies of different causes of delay


1. Import the data in the original FlightDelayData.xdf file into a new XDF file with the following
specification:

o Only include the OriginState, Delay, ArrDelay, WeatherDelay and MonthName variables.

o Remove all observations where the arrival or departure delay is negative, or the total delay is
negative or greater than 1,000 minutes.

o Save the data in an XDF file named FlightDelayReasonData.xdf. Overwrite this file if it already
exists.

2. Create a histogram showing the frequency of the different arrival delays. Set the histType argument
to "Counts" to show the number of items in each bin.

3. Modify the histogram to show the percentage of items in each bin (set histType to "Percent").

4. Modify the histogram again to show the percentage delay by state.

5. Create a new histogram that shows the frequency of weather delays.

6. Modify the histogram to display the frequency of weather delays by month.

7. Save the script as Lab3Script.R in the E:\Labfiles\Lab03 folder, and close your R development
environment.

Results: At the end of this exercise, you will have used the rxHistogram function to create histograms
that show the relative rates of arrival delay by state, and weather delay by month.

Question: What is the most common arrival delay (in minutes), and how frequently does this
delay occur?

Question: Which month has the most delays caused by poor weather? Which months have
the least delays caused by poor weather?

Module Review and Takeaways


In this module, you learned how to:

 Use the ggplot2 package to visualize in-memory data.

 Use the ScaleR rxLinePlot and rxHistogram functions to generate graphs based on big data.

Tools
The ggplot2 package is huge and has a bewildering array of different options, plot types and
transformations. See these resources for further information:

 The official ggplot2 documentation covers every option in fine detail—see:
http://docs.ggplot2.org/current/index.html.

 RStudio produces a very useful cheat sheet—see:
https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf.

 Winston Chang’s Cookbook for R has an excellent chapter on ggplot2—see:
http://www.cookbook-r.com/Graphs/.

 Hadley Wickham, the author of ggplot2, wrote a paper on the Layered Grammar of Graphics used in
the package—see: http://vita.had.co.nz/papers/layered-grammar.pdf.

Whatever ggplot2 problem you have, it is very likely someone else has experienced something similar and
has found a fix on stackoverflow.com.

Module 4
Processing Big Data
Contents:
Module Overview 4-1

Lesson 1: Transforming big data 4-2

Lesson 2: Managing big datasets 4-13

Lab: Processing big data 4-19

Module Review and Takeaways 4-25

Module Overview
Many data scientists are familiar with the notion of data wrangling—the way in which you need to
manipulate and arrange data to get it into a shape where you can perform your various analyses. Data
wrangling can include operations such as generating information derived from the raw data, filtering the
data, performing additional computations such as normalizing the data, sorting the data, and possibly
splitting or merging datasets to form a coherent source of data.

It’s relatively easy to perform these tasks when you are using small datasets that will easily fit into memory
and do not take long to process. However, when you are working with much larger datasets, you need to
plan these operations more carefully; you need to ensure that you don't run out of resources partway
through a lengthy task—and you will also want to optimize jobs to minimize their duration. The ScaleR™
functions are designed to help you.

Objectives
In this module, you will learn how to:
 Perform transformations over big data in an efficient manner.

 Perform sort and merge operations over big data.



Lesson 1
Transforming big data
This lesson focuses on the transformation framework implemented by ScaleR functions. Many ScaleR
functions provide the transforms argument that you can use to manipulate data as you read it in,
summarize it, or perform operations such as generating plots. The key aspect of the transformation
framework is that it is designed to work at scale; it uses the chunking capabilities of the XDF format to
process data in blocks. This lesson describes how to use this framework in detail to perform scalable
transformations.

Lesson Objectives
After completing this lesson, you will be able to:

 Explain when to transform data on the fly, and when to make a transformation permanent.

 Use the rxDataStep function to implement transformations.


 Use transformations to dynamically generate new variables in a dataset.

 Subset variables in a large dataset.

 Utilize third-party packages in transformations.

 Perform complex transformations using custom transformation functions.

 Reblock a transformed XDF file to balance chunk sizes.

When should a transformation be permanent?


When you import data using the rxImport
function, you have the option of creating an in-
memory data frame or an XDF file. Data frames
are transient, and unless you take explicit steps to
save their contents, they disappear when your R
session finishes. In contrast, XDF files are
permanent; they are stored on disk and transcend
the life of the R session. If you perform a
transformation using rxImport, the transformed
data will either be temporary (if you create a data
frame), or permanent (if you create an XDF file).
Functions such as rxSummary, rxCube,
rxLinePlot, and many others that read XDF objects, can also perform transformations. However, these
transformations are only used by the function being performed, and are then discarded. If you repeat the
same operation, your R session has to perform the same transformation again. This can take time and
might consume considerable resources if you regularly duplicate the same analysis (as part of a script,
perhaps). With this in mind, you might be tempted to always perform permanent transformations, just in
case you need the same data again later.

However, remember that in the world of big data, a dataset might be many thousands of gigabytes in
size. Writing the transformed data to disk can generate a lot of additional I/O, and might incur excessive
storage charges, depending on where you are saving the data.

Therefore, saving the results each time you transform data—just in case you might need it again later—is
not always feasible. You need to strike a balance. In particular, you should consider:
 How likely are you to require the transformed data again?

 Does the cost of performing the transformation when you need the transformed data exceed the
costs associated with saving transformed data to storage?

 How big is the transformed data compared to the effort required to generate it? If it is small and used
occasionally, but takes considerable effort to construct, then you should save it; the associated I/O
and storage costs will be minimal.
 How volatile is the underlying data? If you are performing analyses on live data (rather than historic
records), it might not be appropriate to save the transformed information because it could quickly
become outdated.

Using the rxDataStep function


The rxDataStep function is specifically designed
to implement transformations on XDF data. This
function works like many of the ScaleR functions in
that it reads data from an XDF file a block at a
time. Once in memory, a block is referred to as a
chunk. The rxDataStep function can then apply
any transformations to this chunk and write it out
to the destination before proceeding to the next
block. When you use rxDataStep, you should
ensure that sufficient memory is available to hold
a chunk of data.

Understanding chunks and blocks


It is important to understand the difference between a block and a chunk. A chunk is a set of rows in
memory whereas a block is the unit in which the rows are stored on disk. If you create an XDF file using
rxImport and specify a value of 50000 for the rowsPerRead argument, the file will be created with a
block size of 50000 rows.

You can view the block size and the number of blocks in an XDF file using the rxGetInfo function; set the
getBlockSizes argument to TRUE.

rxGetInfo function
rxGetInfo("FlightDelayData.xdf", getBlockSizes = TRUE)

# Typical output
File name: FlightDelayData.xdf
Number of observations: 1135221
Number of variables: 29
Number of blocks: 23
Rows per block (first 10): 50000 50000 50000 50000 50000 50000 50000 50000 50000 50000
Compression type: zlib

When you process the file using a function such as rxDataStep, you can either specify the number of
blocks to read as a chunk using the blocksPerRead argument, or the number of rows using the
rowsPerRead argument. So, if the XDF file has a block size of 50000 rows and you specify a value of 2 for
the blocksPerRead argument, each chunk will contain 100000 rows (2 * 50000).

When the data is written back, it will use the new chunk size of 100000 rows for each block. Alternatively,
if you set rowsPerRead to 25000, rxDataStep will only read half a block at a time as a chunk into
memory, and when the data is written out, it will have a block size of 25000 rows.
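
For example, the following sketch (file names are assumptions) copies an XDF file while reading two
50,000-row blocks per chunk, so the copy is written out with 100,000 rows per block:

rxDataStep(inData = "FlightDelayData.xdf", outFile = "FlightDelayDataRechunked.xdf",
           blocksPerRead = 2, overwrite = TRUE)

rxGetInfo("FlightDelayDataRechunked.xdf", getBlockSizes = TRUE)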

Note: Chunking only operates in an R Server environment. If you are using R Client, the
ScaleR functions do not perform chunking, and the entire dataset must fit into memory.

Using the transforms argument


The rxDataStep function has many features in common with the rxImport function, and operates in a
similar manner. However, rxImport is intended primarily as a tool for importing data held in different
formats and converting it to XDF—rxDataStep is optimized to operate on XDF data (although you can
use it to transform in-memory data frames). If you are transforming data rather than migrating it,
rxDataStep is likely to run more quickly.
The primary focus of the rxDataStep function lies in the transforms argument. As with the other ScaleR
functions, this argument contains a list of transformations to be performed. You can use this feature to
add new variables, in addition to modifying the values of existing variables.
The most common form of transformations apply an expression to a variable, as shown in the following
example. This transformation converts the values in the Distance column in the flight delay dataset from
miles into kilometers:

rxDataStep function
rxDataStep(inData = "FlightDelayData.xdf", outFile = "FlightDelayDataMetric.xdf",
rowsPerRead = 50000,
transforms = list(Distance = Distance * 1.6))

Best Practice: Remember that the transforms argument is a list that can contain any
number of transformations. Add all the transformations that you require to this list to process the
data in a single pass. Do not perform multiple runs of rxDataStep, each implementing a single
transformation.

Be aware of the chunk-oriented nature of transformations. You have immediate access to all the data in
the current chunk. You can read data held in other blocks (this is discussed in the topic Using Custom
Transformation Functions), but this will incur additional I/O—try to avoid repeatedly reading the same
blocks over and over.

Best Practice: Avoid defining transformations that require access to all observations in the
dataset simultaneously, such as the poly and solve matrix operations. These operations can be
expensive because they can involve repeatedly reading the dataset, and they will be performed
for every row in the dataset.

Also, when sampling data in a transform, remember that the sampling algorithm only has access
to the current chunk unless you reread the entire dataset.

Transformations and closures


The rxDataStep function executes the code associated with every transformation for each row in the
dataset, and you have direct access to any of the variables in that row. However, all transformations
operate within the confines of a closure that limits access to the session environment (R variables and
functions). The rationale behind this approach is to reduce, or at least document, the dependencies on
which a transformation is based. This helps to ensure that a transformation does not accidentally rely on
some global variable that might not be present the next time the transformation is used, causing the
transformation to fail. If you need to reference variables in the session, you must add them to the closure
by using the transformObjects argument. This argument takes a named list of variables and the names
by which you reference them in your transformation code.

milesToKm <- 1.6

rxDataStep(inData = "FlightDelayData.xdf", outFile = "FlightDelayDataMetric.xdf",
           rowsPerRead = 50000,
           transforms = list(Distance = Distance * conversionFactor),
           transformObjects = list(conversionFactor = milesToKm))

Adding new variables to a dataset


The transforms list contains items of the form var
= expression. If var is an existing variable in the
dataset, then the transformation modifies the
value of that variable for the current row. If var
does not exist, then the transformation creates it
and adds it to the dataset as part of every row.

The following example adds two new variables to


the flight delay dataset. The DistanceKm variable
records the distance travelled in kilometers, but
leaves the existing Distance variable intact (the
previous example overwrote the data in the
Distance variable):

Adding a variable to the flight delay dataset


milesToKm <- 1.6

rxDataStep(inData = "FlightDelayData.xdf", outFile = "FlightDelayDataMetric.xdf",
           rowsPerRead = 50000,
           transforms = list(DistanceKm = Distance * conversionFactor),
           transformObjects = list(conversionFactor = milesToKm))

Creating factors
You can add categorical variables to a dataset, but you must remember that only one chunk of data is
accessible at a time. This means that you should not write code that attempts to define factor levels and
labels automatically. Doing this could cause inconsistencies in the variable across the dataset (different
chunks might omit some factor levels and labels if there is no matching data in that chunk). Instead, you
should specify levels and labels explicitly. In the following example, IsCancelled is a categorical variable
that maps the values of the Cancelled variable (0, 1) to TRUE and FALSE.

Adding a categorical variable to the flight delay dataset


rxDataStep(inData = "FlightDelayData.xdf", outFile = "FlightDelayDataMetric.xdf",
           rowsPerRead = 50000,
           transforms = list(IsCancelled = factor(Cancelled, levels = c(0, 1),
                                                  labels = c("FALSE", "TRUE")))
)

Subsetting variables in a dataset


If you want to remove variables from the dataset, use the
varsToDrop and varsToKeep arguments of rxDataStep.
These two arguments are mutually exclusive—if you want
to keep most of the variables, list the ones you want to
remove with varsToDrop. However, if you want to lose
most of the variables, list the ones you want to retain with
varsToKeep.
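
For example, a minimal sketch (file names are assumptions) that keeps only the variables needed for a
simple delay analysis and drops everything else:

rxDataStep(inData = "FlightDelayData.xdf", outFile = "FlightDelayDataTrimmed.xdf",
           varsToKeep = c("Origin", "Dest", "Distance", "Delay"),
           overwrite = TRUE)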

If you need to delete entire rows, use the rowSelection


argument. This is a logical expression that is evaluated for
each row in the dataset. It can reference any of the
variables in the dataset, in addition to external variables made available through the transformObjects
list. If the expression returns true, the row is included in the result—otherwise it is discarded. Note that the
rowSelection expression must reference at least one variable in the dataset, otherwise it will cause an
error with the message “The sample dataset for the analysis has no variables”. This is important if you are
using rxDataStep (or rxImport) to generate a random sample of the data.

The following example will not actually filter any data, and the result will contain the entire dataset:

Attempting to sample data without referencing any dataset variables


rxDataStep(inData = "FlightDelayData.xdf", outFile = "FlightDelayDataSample.xdf",
rowSelection = as.logical(rbinom(n = 1000, size = 1, prob = 0.01))
)

To get around this issue, you amend the rowSelection expression to involve at least one of the variables
in the dataset, as shown in the following example:

Amending the rowSelection expression


rxDataStep(inData = "FlightDelayData.xdf", outFile = "FlightDelayDataSample.xdf",
rowSelection = as.logical(rbinom(n = .rxNumRows, size = 1, prob = 0.01))
)

The .rxNumRows variable used in this example is one of the special variables created by the
transformation process; it contains the number of rows in the current chunk. Later in this lesson, the topic
Using Custom Transformation Functions describes the special variables in more detail.

You can also use the startRow and numRows arguments to limit the size of the transformed dataset. The
startRow argument specifies the starting offset at which to begin the process, and numRows indicates
the number of rows to transform. The resulting dataset will only contain rows that fall into this range. The
startBlock and numBlocks arguments are similar, except that you specify blocks rather than rows.
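
For example, a minimal sketch (file names are assumptions) that copies only rows 100,001 to 150,000 of the
source file into a smaller XDF file:

rxDataStep(inData = "FlightDelayData.xdf", outFile = "FlightDelaySlice.xdf",
           startRow = 100001, numRows = 50000, overwrite = TRUE)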

Incorporating third-party packages into a transformation


The code in a transformation can be arbitrarily complex and
involve functions defined in packages other than in ScaleR.
However, by default these functions are not part of the
closure in which the transformation code runs. You can
reference these functions using the package::function
notation of R, or you can bring the functions in a package
into scope by using the transformPackages argument to
rxDataStep.

Note: If you repeatedly reference the same packages in all transformations in a session, you
can use the rxOptions function to set the transformPackages option globally.
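
For example, a minimal sketch that makes lubridate available to every transformation in the session, in
addition to the packages already configured:

rxOptions(transformPackages = c(rxGetOption("transformPackages"), "lubridate"))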

The following example uses functions in the lubridate package to add the flight departure date to the
flight delay dataset. The departure data is created as a POSIXct date/time variable, based on the values in
the Year, Month, DayofMonth, and CRSDepTime variables. Note that the CRSDepTime variable is a
character string containing up to four digits that represent the departure time in the format "hhmm". For
times before 10:00 AM, the format is "hmm".

Referencing an external package


install.packages("lubridate")

rxDataStep(inData = "FlightDelayData.xdf", outFile = "FlightDelayData2.xdf",


transforms = list(FlightDepartureDateTime = make_datetime(year = Year, month =
Month, day = DayofMonth, hour = trunc(as.numeric(CRSDepTime) / 100), min =
as.numeric(CRSDepTime) %% 100)),
transformPackages = c("lubridate")
)

Using custom transformation functions


You can provide inline code that performs
transformations directly in the transforms list.
However, for more complex transformations, that
technique can become very unwieldy, and difficult
to document and maintain. In this situation, you
should code the transformation as a function. You
can then specify the function by name in the
transformFunc argument of rxDataStep.

rxDataStep invokes a transformation function


once for each chunk (not per row). It passes the
function an argument containing the data from
the current chunk in the form of a list of vectors,
each vector containing the values for a single variable in the original dataset. You specify the variables
from the original dataset to include in the vector list by using the transformVars argument to
rxDataStep.

You can modify the data in these vectors, or construct a new list of vectors containing the transformed
data. You must return a list containing these vectors at the end of the function. All vectors in this list must
have the same length. This data is merged back into the chunk overwriting the appropriate variables with
the transformed values, before being written back to disk. Note that, if you remove a vector from the list,
the corresponding variable will be removed from the resulting dataset.

Inside the transformation function, you have access to a set of special variables managed by rxDataStep.
These include:

 .rxStartRow. The row number of the first row in the chunk.

 .rxChunkNum. The number of the current chunk.

 .rxNumRows. The number of rows in the current chunk.

 .rxReadFileName. The name of the XDF file from which the data was read. If you need to access data
in other blocks in the file, you can open an RxXdfData data source using this variable and navigate
to the appropriate location.

 .rxIsTestChunk. TRUE if this is a test pass of the transformation function; otherwise it is FALSE.

 .rxTransformEnvir. The transformation environment, including any data specified in the


transformObjects argument to rxDataStep.

The .rxIsTestChunk variable is important. Some transformations will perform an initial test pass over the
first block of data. If this test pass is successful, the same block is repeated, followed by the remaining
blocks. You should always check the .rxIsTestChunk variable to avoid generating duplicated results for
the first block of data.
You can manipulate the objects in the transformation environment by using the .rxGet(objName) and
.rxSet(objName, objValue) functions. These functions give you a way to pass information from one
chunk to the next. The following example uses these functions to add a running total of the total flight
distance for all flights recorded in the flight delay dataset.

Adding a running total to the flight delay data


addRunningTotal <- function (dataList) {

# Check to see whether this is a test chunk


if (.rxIsTestChunk) {
return(dataList)
}

# Retrieve the current running total for the distance from the environment
runningTotal <- as.double(.rxGet("runningTotal"))

# Add a new vector for holding the running totals


# and add it to the list of variable values
runningTotalVarIndex <- length(dataList) + 1
dataList[[runningTotalVarIndex]] <- rep(as.numeric(NA), times = .rxNumRows)
names(dataList)[runningTotalVarIndex] <- "RunningTotal"

# Iterate through the values for the Distance variable and accumulate them
idx <- 1
for (distance in dataList[[1]]) {
runningTotal <- runningTotal + distance
dataList[[runningTotalVarIndex]][idx] <- runningTotal
idx <- idx + 1
}

# Save the running total back to the environment, ready for the next chunk
.rxSet("runningTotal", as.double(runningTotal))
return(dataList)
}

rxDataStep(inData = "FlightDelayData.xdf", outFile = "EnhancedFlightDelayData.xdf",


overwrite = TRUE, append = "none",
transformFunc = addRunningTotal,
transformVars = c("Distance"),
transformObjects = list(runningTotal = 0)
)

The addRunningTotal function uses the variable runningTotal as an accumulator. The value of this
variable is initialized to 0 in the transformObjects argument of the rxDataStep object. At this point, the
runningTotal variable becomes part of the environment used by the addRunningTotal function. The
function retrieves the current value of runningTotal by using .rxGet, and saves it at the end of the
function by using .rxSet.

The logic in the body of the function adds a new vector to the dataList list and names it RunningTotal.
This vector is populated with the accumulated total of the Distance variable read from each row in the
first vector of dataList. This first vector is filled in by the rxDataStep function as specified by the
transformVars argument.

Note: If your transformation function uses functions in external packages, you must
reference these packages in the transformPackages argument of rxDataStep.

Best Practice: Milliseconds matter


Make sure that your transformations are as efficient as possible. It is worth spending time tuning
and analyzing their performance on small subsets of your data. A large dataset might consist of
100 million rows. If each iteration in a transformation function takes 1 millisecond, then it will
require 100,000 seconds, or nearly 28 hours, to process your data.

Needless to say, if your code is only slightly less efficient and takes 1.5 milliseconds per iteration,
this will add another 14 hours to the processing time.

This is also a situation where you should consider the size of the platform on which you are
running your code. Add as much memory and processing power to your computing environment
as possible. It might even be worth creating a temporary cluster of large VMs in Azure®
especially to perform the task. You can remove these VMs once you have finished.
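
For example, a minimal sketch that times the addRunningTotal transformation shown earlier in this lesson
against a small subset of the data (file names are assumptions):

system.time(
  rxDataStep(inData = "FlightDelayData.xdf", outFile = "TimingTest.xdf",
             overwrite = TRUE, append = "none",
             transformFunc = addRunningTotal,
             transformVars = c("Distance"),
             transformObjects = list(runningTotal = 0),
             numRows = 100000)  # test against the first 100,000 rows only
)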

Reblocking an XDF file


When you first create an XDF file, each of the
blocks—apart from the final one—will contain the
same number of rows. As you transform and
subset data, you find that the blocks become
uneven, some blocks become sparsely populated,
and some contain many fewer rows than others.
This fragmentation can impact the performance of
the ScaleR functions.

The following example shows what happens if you


use the code shown in the earlier topic to sample
data; the original file contained 23 blocks of up to 50,000
rows each. The resulting file contains fewer blocks,
each of which is much more sparsely populated.

# The original XDF file


rxGetInfo("FlightDelayData.xdf", getBlockSizes = TRUE)

# Results:
# File name: FlightDelayData.xdf
# Number of observations: 1135221
# Number of variables: 29
# Number of blocks: 23
# Rows per block (first 10): 50000 50000 50000 50000 50000 50000 50000 50000 50000 50000
# Compression type: zlib

# Sample the data


rxDataStep(inData = "FlightDelayData.xdf", outFile = "FlightDelayDataSample.xdf",
rowSelection = as.logical(rbinom(n = .rxNumRows, size = 1, prob = 0.01))
)

# Look at the metadata


rxGetInfo("FlightDelayDataSample.xdf", getBlockSizes = TRUE)

# Results:
# File name: FlightDelayDataSample.xdf
# Number of observations: 11433
# Number of variables: 31
# Number of blocks: 12
# Rows per block (first 10): 1008 969 942 952 1088 1021 1034 1009 1016 1022
# Compression type: zlib

You can reblock a file by using rxDataStep. Read the file and write it out again, and specify the number
of rows to include in each block using the rowsPerRead argument. The result should be a defragmented
file consisting of fewer blocks:

Reblock a file by using rxDataStep


# Reblock the file
rxDataStep(inData = " FlightDelayDataSample.xdf ", outFile = "DefragmentedSample.xdf",
rowsPerRead = 50000
)

# Look at the metadata


rxGetInfo("DefragmentedSample.xdf", getBlockSizes = TRUE)

# Results:
# File name: DefragmentedSample.xdf
# Number of observations: 11433
# Number of variables: 31
# Number of blocks: 1
# Rows per block: 11433
# Compression type: zlib

Best Practice: If you change the name of a variable in an XDF file, you might need to
reblock the file afterwards to ensure that the metadata recording the new variable name is
updated in every block.

Best Practice: To reduce the chances of fragmentation, avoid transformations that change
the length of a variable.

Demonstration: Using a transformation function to calculate a running


total
This demonstration walks through the example shown in the topic Using a Custom Transformation
Function. In this example, you will see how to add a new column to the flight delay data that contains a
running total of all the flight distances.

Demonstration Steps

Creating a transformation function


1. Open your R development environment of choice (RStudio or Visual Studio®).

2. Open the R script Demo1 - tranformations.R in the E:\Demofiles\Mod04 folder.

3. Highlight and run the code under the comment # Connect to R Server. This code connects to R
Server running on the LON-RSVR VM.

4. Highlight and run the code under the comment # Examine the dataset. This code shows the
structure of the sample data and displays the first 10 rows. Note that the data contains 29 variables.

5. Highlight and run the entire block of code that creates the addRunningTotal function, under the
comment # Create the transformation function. This function performs the following tasks:

a. It checks to see whether this is a test pass over a chunk, and if so it returns immediately.

b. It retrieves the value of the runningTotal variable from the environment (this variable will be
initialized to 0 by the rxDataStep function).

c. It adds a new vector to the dataList list. This vector will hold the data for the new column
containing the running total. The column is named RunningTotal.
d. It iterates through the values in the vector for the Distance variable, generates the running total,
and adds the total for each row to the RunningTotal vector.

e. After completing its work, the function saves the current value of the runningTotal variable to
the environment.

f. It returns the updated dataList list that now includes the RunningTotal vector. This vector will
be added as a variable to the dataset.

Testing the transformation function


1. Highlight and run the code under the comment # Run the transformation. This code uses
rxDataStep to run the transformation function. Note the following points:

 The transformFunc argument is set to addRunningTotal. This is the name of the transformation
function.

 The transformVars argument specifies the Distance variable.

 The transformObjects argument creates and initializes the runningTotal environment variable.

 The numRows argument limits the operation to the first 2 million rows. It is always best to test
transformations on a subset of your data first.
 The transforms list also adds a variable named ObservationNum to the data. This variable holds
the row number. It is generated by using a range based on the .rxStartRow and .rxNumRows
special variables.
2. As the function runs, note the progress messages that are reported, listing the time taken to process
each chunk.

Viewing the results of the transformation


1. Highlight and run the code under the comment # View the results. This code retrieves the metadata
for the transformed XDF file and displays the first few rows and the last few rows. Notice that the
RunningTotal variable has been added.
2. Highlight and run the code under the comment # Plot a line to visualize the results using a
random sample of the data. This statement creates a line plot over a sample of the data, showing
how the RunningTotal variable increases with the number of observations.

3. Close your R development environment of choice (RStudio or Visual Studio).

Verify the correctness of the statement by placing a mark in the column to the right.

Statement Answer

You create a transformation function for use with rxDataStep. The


transformation function runs once for every row retrieved by rxDataStep.
True or False?

Lesson 2
Managing big datasets
Sorting and merging are common tasks when wrangling data—you frequently require data to be in a
specific sequence, or include information retrieved from disparate datasets. When operating with datasets
that will easily fit into memory, these tasks are trivial. However, when you are handling big data, they
become much more significant. Because of the resources required, sorting and merging many hundreds
of gigabytes of data is not a task that you should undertake lightly, due to the resources required—
specifically, memory, processor power, disk space, and time. This lesson examines these issues in more
detail, and describes how you can use ScaleR functions to address them.

Lesson Objectives
In this lesson, you will learn:

1. Considerations for when to sort big data.


2. How to sort data using the rxSort function.

3. How to combine data from different datasets using the rxMerge function.

Considerations for sorting big data


Before starting a lengthy sort operation, ask
yourself why the data needs to be sorted. If you
are likely to reuse the sorted data frequently, then
the overhead of performing this task might be
worthwhile. However, if you are sorting to
calculate specific quantiles, medians, or other
similar statistics, there are some alternatives
available:

 Generate cross-tabulations of the data using


rxCrossTabs or rxCube. This is frequently
much faster than sorting the data explicitly
because these operations only require a single
pass through the data. The results are generated in order of the variables specified in the formula. If
you include a running total for a variable, you can use this technique to calculate exact values for
quantiles of this variable. You can process the summary information returned by these functions as a
data frame to wrangle the data further (see the sketch after this list).

 If you need to sort by a noninteger numeric independent variable, consider scaling the data as you
cross-tabulate it. For example, if you have numeric values falling in the range between 0 and 1, scale
them by 1,000. The integer conversion performed by using the F function will sort the data to within
1/500th of the original values. If you require more or less accuracy, you can iterate to refine this process.

 If you are sorting to calculate aggregates by groups of data, consider using a transformation function
that creates a running total for each group, as described in the previous lesson. This calculation can
be performed by taking a single pass through the data a block at a time and can be very efficient.
Remember that you can use custom transformation functions with ScaleR functions such as
rxSummary, rxCrossTabs, and rxCube.

 Be aware that many ScaleR functions, such as rxQuantile, rxLorenz, and rxRoc, are specifically
intended for operating on big data and do not require data to be presorted. If possible, use these
functions in preference to other, more traditional R packages.
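
The following sketch illustrates the first alternative in this list (the file name is an assumption): instead of
sorting the flight delay data by route, rxCube summarizes the delay for every Origin and Dest combination
in a single pass and returns the results in factor-level order:

routeSummary <- rxCube(Delay ~ Origin:Dest, data = "FlightDelayData.xdf")
routeSummaryDF <- rxResultsDF(routeSummary)  # continue wrangling as an ordinary data frame
head(routeSummaryDF)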

Sorting and removing duplicates from big data


Use the rxSort function to sort data held in a data
frame or XDF object. If possible, data will be sorted
in memory, but if the dataset is too large, rxSort
will perform a merge sort. A merge sort can
require multiple passes through the data; the data
in each chunk is sorted on the first pass, and then
further passes merge the sorted data from each
chunk into the final result.

You specify the variables by which to sort the data


using the sortByVars argument, and the order in
which to present the data (increasing or
decreasing) by using the decreasing argument. Both of these arguments are vectors. The values in the
decreasing argument are logicals—TRUE means sort in decreasing order, FALSE means sort in increasing
order.

The following example sorts the flight delay data by descending order of Origin airport, and then by
ascending order of Dest airport within each Origin group:

Sorting flight delay data by origin and destination


rxSort(inData = flightDelayData, outFile = sortedFlightDelayData,
sortByVars = c("Origin", "Dest"), decreasing = c(TRUE, FALSE)
)

Note: If you sort on a factor variable, the data is sorted by factor level and not by name.
This can cause confusion if the levels are not in any specific order.

You can reduce the resources required to sort data by only selecting the variables that you really need in
the result—by using the varsToKeep and varsToDrop arguments. If you can cut the number of variables to a
minimum, rxSort might be able to sort data in memory rather than by performing a merge sort, and
consequently run much more quickly.

Note: The rxSort function does not support filtering through rowSelection, or
transformations. Additionally, you cannot use numRows to limit the number of rows in the
source dataset.

Use the removeDupKeys argument to rxSort to remove rows from the result that have duplicated sort
key values. You can track the number of rows removed by using the dupFreqVar argument. This
argument specifies the name of a column to add to the sorted result containing this number. The
following example uses this technique to show the popularity of each airline route, based on the origin
and destination airports. (Remember that Origin and Dest are both factors in this dataset, so the data is
sorted by level rather than alphabetically).

Remove duplicates and tracking frequency


rxSort(inData = flightDelayData, outFile = sortedFlightDelayData, overwrite = TRUE,
sortByVars = c("Origin", "Dest"), decreasing = c(TRUE, FALSE),
varsToKeep = c("Origin", "Dest"),
removeDupKeys = TRUE, dupFreqVar = "RoutesFrequency"
)

# Sample results
head(sortedFlightDelayData, 10)

# Origin Dest RoutesFrequency


# 1 OTH SFO 62
# 2 OTH PDX 27
# 3 LMT SFO 70
# 4 LMT PDX 38
# 5 MKG MKE 50
# 6 MKG FNT 26
# 7 MKG GRR 6
# 8 RKS SLC 63
# 9 RKS DEN 102
# 10 RKS GCC 8

Joining big datasets


You can use the rxMerge function to combine
XDF data in a variety of ways. The simplest forms
of merge are union and oneToOne.

 union appends one dataset to the end of


another, vertically. Both datasets must have
the same number of columns.

 oneToOne joins datasets horizontally,


appending the data for each row in the
second dataset to the end of the data for the
corresponding row in the first. Both datasets
must have the same number of rows.

These operations are described in more detail in Module 2: Exploring Big Data.
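
For example, a minimal sketch of a union merge (the file names are assumptions; both files must contain
the same variables):

rxMerge(inData1 = "FlightDelayData2007.xdf", inData2 = "FlightDelayData2008.xdf",
        outFile = "FlightDelayDataCombined.xdf", type = "union", overwrite = TRUE)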

The rxMerge function also supports relational-style joins, meaning you can perform inner and outer
joins across datasets.

Inner joins
An inner join merges datasets across columns that share a common key value. For example, the flight
delay dataset specifies airport information by using codes:

Origin Dest Distance CRSDepTime …


1 ATL PHX 1587 846
2 ATL PHX 1587 846
3 ATL PHX 1587 1932
4 ATL PHX 1587 1932
5 ATL PHX 1587 1932
6 ATL PHX 1587 1932
7 ATL PHX 1587 826
8 ATL PHX 1587 826
9 ATL PHX 1587 826
10 ATL PHX 1587 826
11 ATL PHX 1587 826
12 ATL PHX 1587 826
13 AUS PHX 872 1731
14 AUS PHX 872 1453
15 AUS PHX 872 1731

The details of each airport, such as its name, city, state, and location, could be held in a separate dataset:

iata airport city state country


1 ATL William B Hartsfield-Atlanta Intl Atlanta GA USA
2 AUS Austin-Bergstrom International Austin TX USA
3 PHX Phoenix Sky Harbor International Phoenix AZ USA

You can merge these two datasets using the airport codes, but you should first change the name of the
join column to be the same in both datasets. In this example, the name of the "iata" variable in the airport
data (sortedAirportData) is changed to "Origin" to match the flight delay data.

names(sortedAirportData)[1] <- "Origin"

You specify the type of join to perform as "inner", and you use the matchVars argument to specify the
variables to use for performing the join. Note that before merging, both datasets must be sorted by these
variables, in the same order. You can specify the autoSort argument to rxMerge to do this, or you can
presort the data manually:

Merging the flight delay data and airport datasets using airport codes
rxMerge(inData1 = sortedFlightDelayData, inData2 = sortedAirportData, outFile =
mergedData,
matchVars = c("Origin"), type = "inner")

Joining datasets over factors


If you are joining datasets across factor variables, you must ensure that the variables in both datasets have
the same factor levels. This might require that you refactor using the rxFactors function for one or both
datasets.
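
For example, a minimal sketch of one possible approach (the data source names are assumptions, and the
exact factorInfo specification might need adjusting for your data): build a common set of levels, and then
refactor the join variable in both datasets before merging:

commonLevels <- union(rxGetVarInfo(flightDelayData)$Origin$levels,
                      rxGetVarInfo(airportData)$Origin$levels)

rxFactors(inData = flightDelayData, outFile = refactoredFlightDelayData, overwrite = TRUE,
          factorInfo = list(Origin = list(newLevels = commonLevels)))
rxFactors(inData = airportData, outFile = refactoredAirportData, overwrite = TRUE,
          factorInfo = list(Origin = list(newLevels = commonLevels)))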

Outer joins
If a row in the first dataset has no corresponding row in the second, then the first row will not appear in
the result. If you need to retain all rows in the first dataset, you can set the type of join to "left". In this
case, the rxMerge function performs a left outer join operation, and uses NAs for the values of all
variables from the second dataset.

You can also perform a right outer join by setting the type to "right"; all rows from the second dataset
will appear in the result, joined with rows from the first containing NA values if necessary. Finally, you can
carry out a combination of left and right outer join operations by setting type to "full".
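
For example, a minimal sketch of a left outer join, reusing the dataset names from the inner join example
shown earlier:

rxMerge(inData1 = sortedFlightDelayData, inData2 = sortedAirportData, outFile = mergedData,
        matchVars = c("Origin"), type = "left")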

Demonstration: Sorting data with rxSort


This demonstration shows how to sort big data and remove duplicates from the result, using the rxSort
function

Demonstration Steps

Tuning a sort
1. Open your R development environment of choice (RStudio or Visual Studio).

2. Open the R script Demo2 - sorting.R in the E:\Demofiles\Mod04 folder.


3. Highlight and run the code under the comment # Connect to R Server. This code connects to R
Server running on the LON-RSVR VM.

4. Highlight and run the code under the comment # Examine the data. This code shows the first few
rows from the flight delay data.

5. Highlight and run the code under the comment # Sort it by Origin. This code sorts the data by the
Origin variable, in decreasing order. This process can take up to 90 seconds.
6. Highlight and run the code under the comment # Note the factor levels for Origin. This code
displays the factor levels for the Origin variable. This is the sequence in which the data should be
sorted. Note that the final three levels are MKG, LMT, and OTH.
7. Highlight and run the code under the comment # View the data. It should be sorted in
descending order of Origin. This statement displays the first 200 rows of the data. Note that the
data for OTH appears first, followed by LMT, and then MKG.

8. Highlight and run the code under the comment # Sort the data again. This statement sorts a dataset
containing a much smaller number of variables. The sort should be much quicker. This shows the
importance of being selective when sorting a dataset.
9. Highlight and run the code under the comment # View the data. The data should still be sorted by
Origin.

Removing duplicates from sorted data


1. Highlight and run the code under the comment # De-dup routes. This code sorts the data by Origin,
but only selects the Origin, Dest, and Distance variables. The removeDupKeys argument removes
any duplicates. The number of occurrences of each set of these variables is recorded in the
RoutesFrequency variable which is added to the results.

2. Highlight and run the code under the comment # View the data. The data is sorted by Origin, but will
only contain the Origin, Dest, and Distance variables, together with RoutesFrequency, indicating how
many times each route was found in the original data.

3. Close your R development environment of choice (RStudio or Visual Studio).



Check Your Knowledge


Question

Which option for the rxMerge function enables you to combine data horizontally from two
different datasets that have a different number of rows?

Select the correct answer.

oneToOne

union

combine

lookup

inner

Lab: Processing big data


Scenario
Before continuing with your analysis of flight delays, you decide to reshape the data and add some
additional useful information to the dataset. Specifically, you want to modify the data to standardize the
flight departure and arrival times to use Universal Coordinated Time (UTC) rather than the local times that
are currently recorded in the dataset. This will involve merging some data from another dataset then
using the lubridate package to construct the dates from the information contained in the dataset. Finally,
you want to add cumulative averages for the flight delay times for each route.

Objectives
In this lab, you will:

 Merge data from a second dataset into the flight delay data.

 Write a transformation function to add variables that record the departure and arrival times as UTC
times.
 Create a transformation function to generate the cumulative departure and arrival delays for each
route.

Lab Setup
Estimated Time: 60 minutes

 Username: Adatum\AdatumAdmin

 Password: Pa55w.rd

Before starting this lab, ensure that the following VMs are all running:

 MT17B-WS2016-NAT

 20773A-LON-DC

 20773A-LON-DEV

 20773A-LON-RSVR

 20773A-LON-SQLR

Exercise 1: Merging the airport and flight delay datasets


Scenario
The flight departure and arrival times in the flight delay dataset are recorded using local times for each
airport. You want to standardize all of these times to UTC. The time zone information that you need to do
this is recorded in the separate airport information dataset. You need to join the flight delay dataset with
the airport information dataset using the departure airport code that is recorded in the Origin variable in
the flight delay dataset and in the iata field in the airport information dataset. You notice that the fields in
both datasets are factors, but with different factor levels. The airport information dataset contains codes
for 3,376 airports, whereas the Origin variable in the flight delay dataset only references a subset of these
airports. Therefore, you will need to perform some refactoring before you can merge these datasets.

The main tasks for this exercise are as follows:


1. Copy the data to the shared folder

2. Refactor the data

3. Merge the datasets



 Task 1: Copy the data to the shared folder


1. Log in to the LON-DEV VM as Adatum\AdatumAdmin with the password Pa55w.rd.

2. Copy the following files from the E:\Labfiles\Lab04 folder to the \\LON-RSVR\Data share:

 airportData.xdf

 FlightDelayData.xdf

 Task 2: Refactor the data


1. Start your R development environment of choice (Visual Studio, or RStudio), and create a new R file.

2. Create a remote session on the LON-RSVR server. This is another VM running R Server. Use the
following parameters to the remoteLogin function:

 deployr_endpoint: http://LON-RSVR.ADATUM.COM:12800

 session: TRUE

 diff: TRUE

 commandLine: TRUE

 username: admin

 password: Pa55w.rd

3. Examine the factor levels for the iata field in the airportData XDF file, and the factor levels for the
Origin and Dest variables in the FlightDelayData XDF file. Notice that the two files use different
factor levels for this data. There is even some minor variation between the Origin and Dest variables
in the flight delay data, as shown in the following output:

Var 1: iata
3376 factor levels: 00M 00R 00V 01G 01J ... ZEF ZER ZPH ZUN ZZV
Var 1: Origin
329 factor levels: ATL AUS BHM BNA BOS ... GCC RKS MKG LMT OTH
Var 1: Dest
331 factor levels: PHX PIA PIT PNS PSC ... GCC RKS MKG OTH LMT

4. Combine the levels in the iata, Origin, and Dest variables into a new set of factor levels. Remove any
duplicates.

5. Use the rxFactors function to refactor the iata field in the airportData XDF file with this new set of
factor levels.

6. Use the rxFactors function to refactor the Origin and Dest variables in the FlightDelayData XDF file
with this set of factor levels.

7. Verify that the factor levels in both XDF files are now the same, with 3377 levels in the same order.

 Task 3: Merge the datasets


1. Rename the iata variable in the refactored airport data XDF file to Origin. This step is necessary
because the rxMerge command expects variables used to join files to have the same name.

2. Reblock the airport data file using the rxDataStep function. This task ensures that the metadata for
the renamed field is updated in every block.

3. Use the rxMerge to combine the data in the two refactored files, as follows:

 Perform an inner join.

 Use the Origin variable to match the rows in each file.

 Automatically sort the data.

 Keep all fields from the flight delay data file, but only retain the timezone and Origin fields from
the airport data file.

 Rename the timezone field to OriginTimeZone in the merged file.

 Set the output block size to 50000 rows.

4. Verify that the flight delay data now contains the OriginTimeZone variable, containing time zone
information as shown in the following example:

rxGetVarInfo(mergedFlightDelayData)
Var 1: Year
9 factor levels: 2000 2001 2002 2003 2004 2005 2006 2007 2008
...
Var 29: OriginTimeZone, Type: character

Results: At the end of this exercise, you will have created a new dataset that combines information from
the flight delay data and airport information datasets.

Exercise 2: Transforming departure and arrival dates to UTC


Scenario
You can now use the timezone information in the flight delay data XDF file to standardize the departure
and arrival times to UTC. To do this, you will create a transformation function that operates on the data
block by block. The processing for each block will iterate through the rows in that block and add two new
variables named StandardizedDepartureTime and StandardizedArrivalTime. You will construct the
departure time using the Year, Month, DayofMonth, DepTime, and OriginTimeZone variables in each
row. You will use the functions in the lubridate package to convert the local time represented by these
variables into UTC. You will then add the ActualElapsedTime variable to this value to work out the arrival time.

Performing timezone conversions is a complex task that lubridate makes appear very simple. However,
this involves a considerable amount of processing. Therefore, you decide to test the transformation on a
small subset of the flight delay data, comprising approximately 20,000 rows.

The main tasks for this exercise are as follows:

1. Generate a sample of the flight delay data

2. Transform the data

 Task 1: Generate a sample of the flight delay data


1. Use the rxDataStep function to create a new XDF file containing 0.5% (approximately 20,000 rows) of
the data in the flight delay data file. Use the following expression for the rowSelection argument:

rowSelection = rbinom(.rxNumRows, size = 1, prob = 0.005)



2. Verify the number of rows in the sample by using the rxGetInfo function. The result should look like
this (the number of rows and block sizes in your output might vary slightly):

File name: \\LON-RSVR\Data\flightDelayDataSubset.xdf


Number of observations: 21794
Number of variables: 29
Number of blocks: 9
Rows per block: 2500 2534 2474 2428 2529 2598 2484 2461 1786
Compression type: zlib

 Task 2: Transform the data


1. Install the lubridate package.

2. Create a transformation function named standardizeTimes. In this function, perform the following
tasks:

 If the current chunk is a test chunk, then return immediately.


 Create a new vector for holding the standardized departure time and add it to the list of variable
values. Name the new variable StandardizedDepartureTime.

 Create another vector for the arrival time. Name the variable StandardizedArrivalTime.
 For each row in the chunk:

i. Retrieve the departure year, month, day of month, time, and timezone.

ii. Construct a string containing the date and time in POSIXct format: "yyyy-mm-dd hh:mi".

iii. Use the as.POSIXct function (from base R) to convert this string into a date.
Include the local timezone.

iv. Use the format function to generate a string representation of the date converted to UTC
format.

v. Save the string in the StandardizedDepartureTime field of the dataset.

vi. Retrieve the elapsed flight time. This is an integer value representing a number of minutes.
vii. Add the elapsed time to the standardized departure time. You can use the minutes function
to convert an integer into a number of minutes, and then use the + operator.

viii. Save the arrival time as a string in the StandardizedArrivalTime field of the dataset.

3. Use the rxDataStep function to perform the transformation over the sample subset of the flight
delay data. You will need to include the following arguments:

 transformFunc = standardizeTimes
 transformVars = c("Year", "Month", "DayofMonth", "DepTime", "ActualElapsedTime",
"OriginTimeZone")

 transformPackages = c("lubridate")

4. Examine the data in the transformed file and verify that the StandardizedDepartureTime and
StandardizedArrivalTime variables have been added successfully.

Results: At the end of this exercise, you will have implemented a transformation function that adds
variables containing the standardized departure and arrival times to the flight delay dataset.

Exercise 3: Calculating cumulative average delays for each route


Scenario
You want to record the cumulative average flight delays over time for each route. This will help you to
determine whether delays are getting better, or worse, or staying the same for each route. A route is
defined as “all flights that start at one selected airport and end at another”. For example, all flights that
have an Origin value of ATL, and a Dest value of PHX (Atlanta to Phoenix). You decide to use a
transformation function to perform this task. Before you transform the data, you must first sort it, to
ensure that the flights are recorded in date order.

The main tasks for this exercise are as follows:

1. Sort the data


2. Calculate the cumulative average delays

3. Verify the results

 Task 1: Sort the data


1. Use the rxSort function to sort the flight delay data by the standardized departure date and time.
2. Use the head and tail functions to examine the data and verify that it has been sorted correctly. The
first few flights in the data should be dated sometime on January 1, 2000; the last few flights should
be dated December 31, 2008.

 Task 2: Calculate the cumulative average delays


1. Create a transformation function named calculateCumulativeAverageDelays. In this function,
perform the following tasks:

 If the current chunk is a test chunk, then return it immediately.


 Create a new vector for holding the cumulative average delay and add it to the dataset. Name
this new variable CumulativeAverageDelayForRoute.

 You will calculate the cumulative average delay, based on the cumulative number of flights for
the route and the cumulative total delay for the route. You need to record and save this
information so it can be accessed as the function processes each row. To do this, you will use two
lists named cumulativeDelays and cumulativeRouteOccurrences when you run the
rxDataStep function. Use the .rxGet function to retrieve these two lists.

 Iterate through the rows in the block and perform the following actions:

i. Retrieve the Origin and Dest variables, and concatenate their values together. You will use
this string as the key for the cumulativeDelays and cumulativeRouteOccurrences lists.

ii. Retrieve the value of the Delay variable.

iii. Find the current cumulative delay for the route in the cumulativeDelays list, add the value
of the Delay variable, and store the result back in the cumulativeDelays list.

iv. Find the current cumulative count for occurrences of the route in the
cumulativeRouteOccurrences list, increment this value, and store it back in the list.

v. Calculate the cumulative average delay for the route by dividing the cumulative delay by the
cumulative number of occurrences for the route, and write the result to the
CumulativeAverageDelayForRoute variable.

 Use the .rxSet function to save the cumulativeDelays and cumulativeRouteOccurrences lists
so that they can be accessed when processing the next block.

2. Use the rxDataStep function to run the transformation. You will need to include the following
arguments:
 transformFunc = calculateCumulativeAverageDelays

 transformVars = c("Origin", "Dest", "Delay")

 transformObjects = list(cumulativeDelays = list(), cumulativeRouteOccurrences = list())

 Task 3: Verify the results


1. Examine the structure of the transformed data, and look at the first and last few rows. Verify that the
data now includes the CumulativeAverageDelayForRoute variable.

2. Use the rxLinePlot function to generate a scatter plot and regression line of average flight delays for
the following routes:

 ATL to PHX

 SFO to LAX
 LAX to SFO

 DEN to SLC

 LGA to ORD

Use the following formula:

CumulativeAverageDelayForRoute ~ as.POSIXct(StandardizedDepartureTime)

Use the rowSelection argument to specify the origin and destination airport for each route.

3. Save the script as Lab4Script.R in the E:\Labfiles\Lab04 folder, and close your R development
environment.

Results: At the end of this exercise, you will have sorted data, and created and tested another
transformation function.

Module Review and Takeaways


In this module, you learned how to:

 Perform transformations over big data in an efficient manner.

 Perform sort and merge operations over big data.



Module 5
Parallelizing Analysis Operations
Contents:
Module Overview 5-1

Lesson 1: Using the RxLocalParallel compute context with rxExec 5-2

Lesson 2: Using the RevoPemaR package 5-11

Lab: Parallelizing analysis operations 5-20

Module Review and Takeaways 5-24

Module Overview
The ScaleR™ functions in the revoScaleR package you have seen to this point are excellent tools for High
Performance Analytics (HPA), where jobs are typically data-limited and the problem for the software is
effectively distributing that data to the different nodes in the cluster or server. In the HPA model, you run
a task using the various rx* functions on an R Server cluster. One of the computing resources in the
cluster takes on the role of managing the task and becomes the master node for that task. The master
node splits the computation out in subtasks, which it distributes across all nodes in the cluster. All nodes
in the cluster have access to the data, and the master node determines which parts of the data each node
should process. When the nodes have completed their work, the master node collects the results, and
then accumulates an overall result, which it returns.
Another set of problems are more accurately classed as High Performance Computing (HPC). These jobs
are typically CPU-limited; you have less data but you require a lot of CPU power. These problems are
often known as “embarrassingly parallel”—little or no effort is required to split up the processing into
tasks that can be run in parallel. In other words, the tasks are not dependent on each other, so they can be
easily separated to run on different nodes. Base R has a number of packages and functions to assist with
this, such as the parallel package that implements parallel versions of the lapply function, and the
foreach package that you can use for parallelizing loops.

The revoScaleR package includes functions that enable you to do HPC and embarrassingly parallel
computations, in addition to HPA operations. Just like the rx* functions, these functions make use of the
“write once, deploy anywhere” model, where you can write your code and check that it works locally
before deploying it to a more powerful remote server by simply changing the compute context in which it
runs.

Objectives
In this module, you will learn how to:

 Use the rxExec function with the RxLocalParallel compute context to run arbitrary code and
embarrassingly parallel jobs on specified nodes or cores, or in your compute context.

 Use the RevoPemaR package to write customized scalable and distributable analytics.

Lesson 1
Using the RxLocalParallel compute context with rxExec
Use the rxExec function to perform traditional HPC tasks by executing a function in parallel across the
nodes of a cluster or the cores of a remote server. It offers great flexibility regarding how arguments are
passed—you can specify that all nodes receive the same arguments, or provide different arguments to
each node. However, unlike the HPA ScaleR functions, you need to control how the computational tasks
are distributed and you are responsible for any aggregation and final processing of results.

Lesson Objectives
After completing this lesson, you will be able to:

 Describe when to use the RxLocalParallel compute context to perform parallel jobs.

 Use the rxExec function to run code in parallel.

 Use the doRSR package to parallelize foreach jobs.


 Create and run non-waiting jobs.

 Invoke HPA functions from rxExec.

Using the RxLocalParallel compute context
The rxExec function exposes its raw power in a parallel
environment, such as a cluster. If you don’t have a
clustered environment available, you can still exploit
parallelism by using the RxLocalParallel compute
context to distribute computations.

The RxLocalParallel compute context utilizes the doParallel back end for HPC computations. For more
information on doParallel, see:
https://cran.r-project.org/web/packages/doParallel/vignettes/gettingstartedParallel.pdf.
Despite its name, you can also use this compute context on clusters with more compute power than a
local machine. However, you can only use the RxLocalParallel compute context with rxExec; it is ignored
by the other ScaleR HPA functions that handle parallel and distributed computing in their own way.

You should use the combination of rxExec and the RxLocalParallel compute context only in situations
when tasks are well suited to parallel execution, such as:

 Embarrassingly parallel tasks where individual subtasks are not dependent upon each other. These
include mathematical simulations, bootstrap replicates of relatively small models, image processing,
brute force searches, growing trees in a random forest algorithm, and almost any situation where you
would use the lapply function on a single core computer.

 Tasks that reference local resources that should not be relocated.



The default compute context is RxLocalSeq, which only supports sequential processing when using rxExec.
To change to RxLocalParallel, you must first create an RxLocalParallel object to use with rxExec,
and then set this as the main compute context:

Creating an RxLocalParallel object


parallelContext <- RxLocalParallel()
rxSetComputeContext(parallelContext)

Subsequent calls to rxExec will now make use of the parallel compute context.

Using the rxExec function to perform tasks in parallel


Base R functions run sequentially. You use the
rxExec function to take an arbitrary base R
function and run it in parallel on your distributed
computing resources. You can then tackle
traditional HPC tasks as described earlier.

The only arguments needed by rxExec are the function to be run, and any required arguments of
that function. You use additional arguments to control the computation.
For more information, see:

Distributed and parallel computing with ScaleR in Microsoft R
https://aka.ms/vpa68r

You should also check the documentation in R Client for rxExec and RxLocalParallel.
There are two primary use cases for running rxExec:

1. To run a function multiple times, collecting the results in a list.

2. As a parallel lapply type function that operates on each object of a list, vector or similarly iterable
object.

The following examples illustrate instances of these use cases. Note that these examples do not necessarily
demonstrate efficient uses of parallel computing.

This example shows a simulation function that takes no arguments and returns a numeric in the FUN
argument to rxExec. The number of times the simulation function g is executed is determined by the
number passed to the timesToRun argument. The value returned by rxExec is a list containing the results
for each iteration:

Running an independent function multiple times


# Example function
g <- function() 3 * runif(1) + rnorm(1)

# Run the function and collect the results


y2 <- rxExec(FUN = g, timesToRun = 10)

y2
# Typical results
# [[1]]
# [1] 1.59938
#
# [[2]]
# [1] 2.322001
# …
# [[10]]
# [1] 0.3149986

Note that, if you are running on a cluster, you can influence the distribution of tasks to nodes in the
cluster by using the taskChunkSize argument to rxExec. This argument specifies the number of tasks that
should be allocated to each node. For example, if you set timesToRun to 5000 and you have a five-node
cluster, you can set the taskChunkSize to 1000 to force each node to perform 1,000 iterations of the task
rather than letting the master node decide.
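
For example, the following call (reusing the g function defined above) groups the 5,000 iterations into chunks of 1,000 tasks; the variable name is illustrative:

y3 <- rxExec(FUN = g, timesToRun = 5000, taskChunkSize = 1000)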

In the next example, you have a list of numbers, xs, which you want to use as inputs to a simulation
function, f. The rxExec function applies the function f (in the FUN argument) to each element of xs (in
the elemArgs argument) to produce the list ys. Here, you determine the number of times the function is
called, not by the timesToRun argument, but by the length of the object passed to elemArgs. Because
the function operation on each element is independent of the others, the different operations can be
farmed out across as many nodes or cores as appropriate for the size of your problem and your available
compute resource.

Using rxExec as a parallel lapply function


# Create test data
xs <- runif(10)

# Define the function to run in parallel


f <- function(x) 3 * x + rnorm(1)

# Invoke the function 10 times using rxExec and gather the results
ys <- rxExec(FUN = f, elemArgs = xs)

ys
# Typical results
# [[1]]
# [1] 1.681905
#
# [[2]]
# [1] 0.5026906 # …
# [[10]]
# [1] 1.145017

This final example shows how to supply multiple arguments to the FUN function by providing a nested list
to elemArgs. The simulation function h takes two arguments, x1 and x2. The code in the second line
builds a list of lists, xx, each element of which contains two numeric values, also named x1 and x2. In the
call to rxExec, the length of the list xx determines the number of times h is called, and the values in each
element are passed on to h for that iteration. The results are returned by the rxExec function as a list.

Supplying multiple arguments using nested lists


# Define a function that takes multiple arguments
h <- function(x1, x2) 3 * x1 + 2 * x2 + rnorm(1)

# Build a list of 10 lists, each containing two random values


xx <- lapply(1:10, function(x) list(x1 = runif(1), x2 = runif(1)))

# Invoke the h function n times, where n is the length of list xx


y2 <- rxExec(FUN = h, elemArgs = xx)

Note that, in all three examples, vectorized versions of base R functions could be more efficient than the
code shown. However, the power of parallelized code is apparent with more complex simulation functions
running for vastly more iterations.

Note: The rxExec function operates in the same closed environment as other ScaleR
functions. If you invoke a function that references another package, you must specify the
package name using the packagesToLoad argument. Similarly, if you reference other R objects
from your environment inside the function, you must provide a list of the object names using the
execObjects argument.
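
The following sketch illustrates both arguments. The stringr package, the prefix object, and the function itself are illustrative assumptions used only to show the pattern:

# The function uses str_c from the stringr package and the local object named prefix
prefix <- "Result: "
labelValue <- function(x) str_c(prefix, x)

labels <- rxExec(FUN = labelValue,
                 elemArgs = as.list(1:5),
                 packagesToLoad = "stringr",
                 execObjects = "prefix")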

Using rxExec as a back end for foreach


The foreach package provides a popular way to
perform parallel processing in base R. You can use
this package to parallelize code that uses a
standard for loop syntax. Note that you might
have to share parallel code with people not using
the Microsoft R stack, so you need to ensure that
your code will work on other systems.

The doRSR package provides a parallel back end


for the %dopar% function in foreach, built on top
of rxExec. This is available in all RevoScaleR builds.
When you have loaded the package and
registered the doRSR back end, the %dopar%
function is connected to the current compute context and the code is run through rxExec. This gives you
the flexibility to use familiar syntax, and means you can use the same code with a different back end on
non-Microsoft R builds.

To use doRSR, you first need to load the package into the namespace, and then register the back end.
After this, any code using the %dopar% function will run using rxExec and your current compute context.
The following code shows a simple example:

Using the doRSR package


# Load and register doRSR
library(doRSR)
registerDoRSR()

# Run a foreach loop containing the %dopar% function to calculate square roots in
parallel
foreach(i=1:3) %dopar% sqrt(i)

# Results:
# [[1]]
# [1] 1
#
# [[2]]
# [1] 1.414214
#
# [[3]]
# [1] 1.732051

Note: The ScaleR package defines a compute context, RxForeachDoPar, that is specifically
optimized to handle parallel foreach operations. This compute context creates a parallel
environment, and registers the back end automatically. You can use this compute context in
place of RxLocalParallel if you are only using rxExec to run loops.
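
For example, assuming the doRSR and foreach packages are already loaded as in the previous example, the following sketch switches to this compute context and runs the same loop:

rxSetComputeContext(RxForeachDoPar())
foreach(i = 1:3) %dopar% sqrt(i)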

Demonstration: Using rxExec to perform tasks in parallel


The demonstration runs a dice game simulation. The rules of the game are:
 If you roll a 7 or 11 on your initial roll, you win.

 If you roll 2, 3, or 12, you lose.

 If you roll a 4, 5, 6, 8, 9, or 10, then that number becomes your point. You continue rolling until you
either roll your point again (you win), or roll a 7 (you lose).

Demonstration Steps
Running a simulation sequentially

1. Open your R development environment of choice (RStudio or Visual Studio®).

2. Open the R script Demo1 - dice.R in the E:\Demofiles\Mod05 folder.

3. Highlight and run the code under the comment # Connect to R Server. This code connects to R
Server running on the LON-RSVR VM.

4. Highlight and run the code that creates the playDice function, under the comment # Create dice
game simulation function.

5. Highlight and run the statement that runs the playDice function, under the comment # Test the
function. This code should display the message Win or Loss, depending on the output of the
simulation.

6. Highlight and run the code under the comment # Play the game 100000 times sequentially. This
code uses the replicate function to run the playDice function. The results are captured and
tabulated, showing the percentage of wins and losses in the 100,000 runs of the games. Note the
user and system statistics reported by the system.time function.

7. Repeat step 6 several times, to get an average of the user and system timings.
Running a simulation using parallel tasks

1. Highlight and run the code under the comment # Play the game 100000 times using rxExec. Note
that the code is currently running using the RxLocalSeq compute context, so tasks are still being
performed sequentially. However, the user and system timings should be much quicker than before.

2. Highlight and run the code under the comment # Switch to RxLocalParallel. The time spent running
in user mode should be lower still, although the overall elapsed time is likely to be higher in this
example.

The reason is that this is a simple simulation using a very small amount of data on a modest
server. The overhead of splitting up the job into tasks and running them in parallel actually exceeds
any performance benefit gained. However, if the job were much more compute intensive and
involved vast amounts of data, this overhead would become a much less significant part of the
processing.
Note: the remote session might be interrupted with the message:

Canceling execution...
Error in remoteExecute(line, script = FALSE, displayPlots = displayPlots, :
object 'r_outputs' not found

Type resume() in the console to return to the remote session.

3. Close your R development environment of choice (RStudio or Visual Studio).

Creating and running non-waiting jobs


By default, compute contexts are waiting (or
blocking)—that is, your R session waits for results
from the job before returning control to you.
Often, jobs only take a few seconds and this is not
a problem. However, some very large HPC jobs
might run for several minutes or hours, even if
they are running on a large cluster. In such cases,
you might prefer to send the job out to the cluster
and continue working on your local R session. To
do this, you can specify a compute context to be
non-waiting (or non-blocking), which will return
control of the local session after the remote
session has been started. You can then check back on the progress of the job and retrieve the results
when it is complete.

Another use for non-waiting compute contexts is for massively parallel jobs involving multiple clusters.
You can define a non-waiting compute context on each cluster, launch all your jobs, and then aggregate
the results. The job scheduler on the cluster can control the timing of these jobs.

You can set a compute context object to be non-waiting by setting the wait argument to FALSE when
you call the context constructor function. Calls to rxExec will then return control back to the local session.
Note that you cannot define a local compute context (RxLocalSeq or RxLocalParallel) as non-waiting.

To find the status of a running non-waiting job, you can call rxGetJobStatus with the object name of the
job (defined in the call to rxExec) as the argument. If you forget to assign a name to the job in the rxExec
call, you can use the function rxLastPendingJob to retrieve it and assign a name to it. To cancel a non-
waiting job, use the rxCancelJob function with the job name as the argument.

To retrieve the results of a finished non-waiting job, you can call rxGetJobResults with the name of the
job as the argument.
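
The following sketch outlines the complete non-waiting workflow, reusing the g simulation function defined earlier in this lesson. The Hadoop connection arguments are omitted; supply whatever settings your cluster requires:

# Create a non-waiting compute context and make it current
nonWaitingContext <- RxHadoopMR(wait = FALSE)
rxSetComputeContext(nonWaitingContext)

# Control returns as soon as the job has been submitted
simulationJob <- rxExec(FUN = g, timesToRun = 100000)

# Reports "running" while in progress, "finished" when done
rxGetJobStatus(simulationJob)

# When the job has finished, collect the results
results <- rxGetJobResults(simulationJob)

# Alternatively, abandon the job before it completes
# rxCancelJob(simulationJob)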

For more information about creating and managing non-waiting jobs, see:

Distributed and parallel computing with ScaleR in Microsoft R


https://aka.ms/qrkdba

Calling HPA functions from rxExec


You can use rxExec to farm out jobs containing HPA
functions to different nodes on a cluster. For example,
you might have a cluster where each node has several
cores. You could then run an independent analysis on
each node with the HPA functions making use of the
available cores on their assigned node. To do this, you set
the elemType argument in the rxExec function to
“nodes”. This will parallelize over the nodes in the
cluster. You can then include HPA functions (such as
rxGlm, rxLinMod, and so on) within the function you pass to
the FUN argument. This function will, in turn, attempt to make the best use of the resources available to
it, given the compute context associated with that node.
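
The following sketch shows the general pattern. The file names, formula, and variables are illustrative assumptions rather than part of the course data:

# Fit a separate linear model on each node, one dataset per node
fitOneModel <- function(dataFile) {
    rxLinMod(ArrDelay ~ Distance, data = dataFile)
}

models <- rxExec(FUN = fitOneModel,
                 elemArgs = list("data1.xdf", "data2.xdf", "data3.xdf"),
                 elemType = "nodes")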

Demonstration: Creating waiting and non-waiting jobs in Hadoop


This demonstration shows how to create waiting and non-waiting jobs using the Hadoop compute
context, and how to cancel a non-waiting job.

Demonstration Steps
Uploading data to Hadoop
1. Open your R development environment of choice (RStudio or Visual Studio).
2. Open the R script Demo2 - nonwaiting.R in the E:\Demofiles\Mod05 folder.

3. Highlight and run the code under the comment # Create a Hadoop compute context. These
statements establish a waiting compute context running on the Hadoop VM.

4. Highlight and run the code under the comment # Upload the Flight Delay Data. This statement
removes any existing version of the flight delay data, and then copies the latest flight delay data XDF
file to HDFS on the Hadoop VM. This will take two or three minutes to run.

Note: the rxHadoopRemove function will return FALSE if the file doesn't already exist. You can
ignore this status.

The rxHadoopCopyFromClient function should display the message TRUE if the file is uploaded
successfully.

Running an analysis as a waiting job


1. Highlight and run the code under the comment # Perform an analysis on the flight data. These
statements use the rxSummary function to generate a list of routes (Origin, Dest pairs), and the
number of flights made for each route. This summary will be performed as a Hadoop Map/Reduce
job. This task will take a couple of minutes to complete.

2. Highlight and run the code under the comment # Try and sort the data in the Hadoop context.
This code uses the rxSort function to sort the results of the summary. However, the way in which
rxSort works renders it unsuitable for running in a distributed environment such as Hadoop, so the
code fails with an error message.

3. Highlight and run the code under the comment # Use rxExec to run rxSort in a distrubuted
context. This statement uses the rxExec function to invoke rxSort. You should note that this is more
of a workaround to enable you to run this function (and others like it that are not inherently
distributable), because, in this case, rxExec runs the function as a single task.

When the sort has completed, the head function displays the first few hundred routes, with the most
popular ones at the top of the list.

Running an analysis as a non-waiting job


1. Highlight and run the code under the comment # Create a non-waiting Hadoop compute context.
This code creates another Hadoop compute context but with the wait flag set to FALSE.

2. Highlight and run the code under the comment # Perform the analysis again. This statement runs
the same rxSummary task as before, but this time it executes as a non-waiting job. The value
returned is a job object.

3. Highlight and run the code under the comment # Check the status of the job. This statement
checks the status of the job. If it is still in progress, it returns the message running. If the job has
completed, it returns the message finished.

4. Keep running the code under the comment # Check the status of the job until it reports the status
message finished.

5. Highlight and run the code under the comment # When the job has finished, get the results. This
code uses the rxGetJobResults function to retrieve the results of the rxSummary function, and then
prints the result (it has not been sorted).
6. Highlight and run the code under the comment # Run the job again.

7. Highlight and run the code under the comment # Check the status of the job. Verify that the job is
running.

8. Highlight and run the code under the comment # Cancel the job. This statement uses the rxCancel
function to stop the job and tidy up any resources it was using.

Note: you can speed up cancellation by setting the autoCleanup flag to FALSE when you create the
Hadoop compute context. However, this will leave any temporary artifacts and partial results in place
on the Hadoop server. You will eventually need to remove these items to avoid filling up server
storage.
9. Highlight and run the code under the comment # Check the status of the job. The job should now
be reported as missing.

10. Highlight and run the code under the comment # Return to the local compute context. This
statement resets your compute context back to the local client VM.

11. Close your R development environment of choice (RStudio or Visual Studio).



Verify the correctness of the statement by placing a mark in the column to the right.

Statement                                                                          Answer

The RxLocalParallel compute context enables you to parallelize all ScaleR
functions running in the local compute context when running on R Client.
True or False.

Lesson 2
Using the RevoPemaR package
In this lesson, you will learn about the RevoPemaR package—a framework to build your own HPA
functions that are scalable and distributable across your servers, clusters or database services. These
functions are known as Parallel External Memory Algorithms (PEMAs). They can work with chunked data
too big to fit in memory, and these chunks can be processed in parallel. The results are then combined
and processed, either at the end of each iteration or at the very end of computation.
As with the RevoScaleR HPA functions you have learned about in previous modules, PEMA functions can
be tested on a local session with a sample dataset, and then deployed to your high performance compute
resources.

Lesson Objectives
After completing this lesson, you will be able to:
 Describe the structure of a PEMA class.

 Create a PEMA class using the RevoPemaR framework.

 Perform an analysis using a PEMA object.


 Debug a PEMA class.

The RevoPemaR framework


The RevoPemaR framework provides constructs
that enable you to process big data in a manner
that is similar to the traditional map-reduce
methodology—for a full description, see:

Use mapreduce in Hadoop with HDInsight


https://aka.ms/pvsf96

In the PEMA model, a master node splits the data into chunks that can be processed in parallel and
distributes the processing to multiple client nodes.
Each client node can operate on its data in
isolation, and sends its results back to the master node which is responsible for combining the results from
individual nodes together before returning the result.

Note that the PEMA framework relies upon Reference classes, an Object-Oriented Programming (OOP)
paradigm that was introduced into the R language in R version 2.12. Reference (Ref) classes:

 Have methods that belong to objects rather than functions, unlike the standard S3 and S4 classes in R.

 Deal with mutable state better than traditional R S3 or S4 classes.

These features make them useful for computations that need to be updated repeatedly from different
chunks running on different servers or nodes on a cluster.

The structure of a PEMA class


You create a new PEMA class using the class
constructor function setPemaClass(). This is a
wrapper around the Ref class constructor function
setRefClass() and provides the basic class
infrastructure needed for the PEMA object to be
deployed in parallel. You provide it with a class
generator that builds your PEMA object. When
you create a new PEMA class generator, you write
code using the following methods:

 initialize. This method sets the initial field values. You also pass up the initialized
variables to the superclass and set up the
critical methods for parallel computation here.

 processData. This method controls what the algorithm does to process the data in each chunk. It
produces intermediate results that are stored as fields.
 updateResults. This is the method used by the master node in the cluster or server to collect the
results from the other nodes together in one place. This method makes use of the mutable state of
Ref classes to update the same fields in multiple nodes.
 processResults. This method takes the combined intermediate results from the updateResults
method and performs whichever computations are needed to calculate the final result.

 getVarsToUse. This method specifies the names of the variables in the dataset to use.

Creating a PEMA class


When you write a new PEMA class generator, you
must provide the following information:

 The class name.

 The superclass from which your class will inherit methods and fields. In this case, it will be
PemaBaseClass, or a child class of this class.

 The fields, or class variables. Since Ref classes are mutable, the contents of these items are likely
to change throughout the analysis run.

 The methods, or class functions, that perform the analysis.

The following code snippet shows the essential structure of a PEMA class generator. In this example,
PemaMean is a generator for a PEMA object that calculates the mean values for a specified variable in a
dataset. Note that you must load and reference the RevoPemaR package:

The essential structure of a PEMA class generator


library(RevoPemaR)
PemaMean <- setPemaClass(
Class = "PemaMean",
contains = "PemaBaseClass",
fields = list(
… # See next example
),
methods = list(
… # See examples below
)
)

Specifying fields
You define the fields required by the PEMA object in the fields list. Each field has a name and a type, as
shown in the next example. The fields required by PemaMean are:

 The name of the variable for which the mean is being calculated (varName).

 The sum of all values in that variable in the dataset (sum).

 The total number of observations of this variable in the dataset (totalObs).

 The number of valid observations of this variable (totalValidObs).

 The final result of the algorithm; the mean value being calculated (mean).

The code below shows the list of fields for the PemaMean class:

Defining the fields for the PemaMean class


fields = list(
sum = "numeric",
totalObs = "numeric",
totalValidObs = "numeric",
mean = "numeric",
varName = "character"
),

Implementing the initialize method


You use the initialize method to configure the PEMA object as it starts up. Specifically, you should
perform the following tasks in this method:

 Provide a doc string that documents the method. This is general good practice.

 Initialize the parent class from which the PEMA class inherits. Use the callSuper method to do this.

 Set up the various functions and infrastructure that the RevoPemaR framework uses to parallelize
instances of this class. You do this by calling usingMethods(.pemaMethods).

 Initialize the fields.



The following code shows the initialize method of the PemaMean class. The varName field is populated
with the parameter passed to initialize; the remaining fields are set to 0:

Initializing the PemaMean class


initialize = function(varName = "", ...) {
'sum, totalValidObs, and mean are all initialized to 0'
callSuper(...)
usingMethods(.pemaMethods)
varName <<- varName
sum <<- 0
totalObs <<- 0
totalValidObs <<- 0
mean <<- 0
}

Note: Important: if you fail to initialize a field, it can retain its value from a previous use of
a PEMA object constructed using this class generator. This can cause confusing results so you
should make sure that you initialize everything.

Processing data in a chunk


You use the processData method to perform the actual calculations and operations for a chunk of
data. This method is analogous to a transformation function used by HPA methods such as rxDataStep. It
takes a list of vectors, each providing the values for a variable in this chunk from the underlying dataset.
You can read this information and use it to perform any processing necessary, and update the fields with
the results. Note that, unlike a transformation function, you should not modify the vectors in the list that’s
passed in, and you do not return a value. All of the results are stored in the fields.
In the PemaMean class, the following code retrieves the data from the variable specified by the varName
field from the list of vectors, sums them, and then adds the result to the value in the sum field. The
method also updates the totalObs and totalValidObs fields with the total number of observations and
valid observations in the chunk.

Processing a chunk of data in the PemaMean class


processData = function(dataList) {
'Updates the sum and total observations from the current chunk of data.'
sum <<- sum + sum(as.numeric(dataList[[varName]]),
na.rm = TRUE)
totalObs <<- totalObs + length(dataList[[varName]])
totalValidObs <<- totalValidObs +
sum(!is.na(dataList[[varName]]))
invisible(NULL)
}

Note: Important: the processData method is called once for each chunk of data it is
assigned by the master node. It is therefore likely that the same node will be used many times.
Make sure that you accumulate (add to) the results for each run of the processData method in
the fields of the object, as shown in the previous example.

Aggregating results on the master node


A client node processes one or more chunks and accumulates its own set of results. The master node has
to gather the results from all client nodes and combine them. This is the purpose of the updateResults
method.

This method takes a reference to a PEMA object. You use the updateResults method to retrieve the data
in the fields of this PEMA object and add this data to the values represented by the local set of fields in
this node (the master). The PemaMean class uses the updateResults function to accumulate the values in
the sum, totalObs, and totalValidObs fields of each PemaMean object into the corresponding fields on
the master node. Like the processData method, you should not return a value from the updateResults
method.

The following code shows the updateResults method for the PemaMean class.

Updating results in the PemaMean class


updateResults = function(pemaMeanObj) {
'Updates sum and total observations from another PemaMean object.'
sum <<- sum + pemaMeanObj$sum
totalObs <<- totalObs + pemaMeanObj$totalObs
totalValidObs <<- totalValidObs + pemaMeanObj$totalValidObs

invisible(NULL)
}

Note: Important: Don't assume that the updateResults method will always run. In a
single-node environment, or if the master node does not distribute the work, then the
aggregated results should be available in the only running instance of the PEMA object—so there
is no need for the master node to call updateResults.

Generating the final result


The processResults method contains the logic that calculates the overall result from the values in the
fields on the master node. The PemaMean class uses this method to calculate the mean using the
accumulated values in the sum and totalValidObs fields. The method returns this value as the result of
the processing.

Generating the final results in the PemaMean class


processResults = function() {
'Returns the sum divided by the totalValidObs.'
if (totalValidObs > 0)
{
mean <<- sum/totalValidObs
}
else
{
mean <<- as.numeric(NA)
}
return( mean )
}

Specifying which variables to use


You can specify the names of the variables in the underlying dataset to use in the analysis when you
instantiate the PEMA class using the pemaCompute function (described later in this lesson). These
variables can be passed as parameters to the initialize method of the PEMA class. The PemaMean class
uses this mechanism to enable a user to indicate the variable for which the mean should be calculated
using the varName parameter. You can also specify the names of variables to use in the getVarsToUse
method. This is optional, but doing so can help to improve performance if you are processing a large
dataset held on disk rather than in memory.

The PemaMean class implements the getVarsToUse method very simply, as follows:

Specifying the variables to use in the PemaMean class


getVarsToUse = function()
{
'Returns the varName.'
varName
}

Using a PEMA object to perform an analysis
After you have run the code that creates the generator for
the PemaMean class, you can instantiate a new PemaMean
object as follows:

Instantiating a PemaMean object


meanPemaObj <- PemaMean()

You can then use the pemaMeanObj object to perform an analysis and calculate the mean of a variable
in a dataset. To do this, you use the pemaCompute function. This function takes a PEMA object, a
dataset, and any parameters that you defined for the initialize method of the PEMA class.

The following example creates a sample data frame containing 1,000 random numbers in a field named x,
and uses this data frame as the dataset for the PemaMean object, which calculates the mean of the
values in field x.

Using the PemaMean object to calculate the mean value of a variable in a dataset
set.seed(12345)
pemaCompute(pemaObj = meanPemaObj,
data = data.frame(x = rnorm(1000)), varName = "x")

The value returned by pemaCompute is the result of the computation, in this case the mean of x.

The fields in the meanPemaObj object remain populated after the pemaCompute function has finished,
so you can examine the contents of any field by using the $ accessor. This code fragment retrieves the
values of the mean field, which should be the same as that returned by the pemaCompute function, and
the totalValidObs field, which contains the number of valid observations in the dataset; this should be
1000:

Displaying the results held in the PemaMean object


meanPemaObj$mean
meanPemaObj$totalValidObs

You can run the pemaCompute function over the same object again. By default, the initialize function
will be used to reset the fields in the object. However, if you specify the argument initPema = FALSE, the
initialize function will not be invoked, enabling the analysis to continue using the previously stored values
for the fields.

The following example uses this feature to run a further analysis over a new dataset, but generates a result
based on the aggregation of the data in this dataset and the previous results. If you display the value of
meanPemaObj$totalValidObs after running this statement, it should contain the value 2000.

Reusing the same PemaMean object


pemaCompute(pemaObj = meanPemaObj,
data = data.frame(x = rnorm(1000)), varName = "x",
initPema = FALSE)

Running a PEMA analysis in a different compute context


Like the ScaleR HPA functions, PEMA classes are designed so you can “write once and deploy anywhere”.
It is simple to run PEMA analyses on a remote compute resource without changing the underlying code in
the PEMA class. All you need to do is pass your cluster compute context to the computeContext
argument in the call to pemaCompute.
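
For example, the following sketch reruns the PemaMean analysis against a dataset held on a cluster. The clusterContext object and the XDF file path are illustrative; substitute your own compute context and data source:

pemaCompute(pemaObj = meanPemaObj,
            data = RxXdfData("/share/data/bigData.xdf"),
            varName = "x",
            computeContext = clusterContext)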

For more information about PEMA classes, see:

Getting started with PemaR functions in Microsoft R


https://aka.ms/cbqg7c

Debugging a PEMA class


Debugging distributed processing can be tricky.
You can use the standard R debugging tools and
the base Ref class also provides trace and untrace
methods. However, debugging with message flags
and print statements often does not work as
desired when the object is running on remote
nodes or within methods. To handle this situation,
the PemaBaseClass provides the outputTrace
method. When this function is called with a text
argument from within a method in the class, the
text is printed out to the console if the traceLevel
field is set to a value equal to or above that
specified by the outTraceLevel argument of outputTrace. This means you can leave debug code in
your functions but hide their output by setting the traceLevel. You can also have multiple levels of
debugging (for example, an “info” and “warning” level of debugging).

You can set the trace level in the initialize method, as highlighted in the following code sample. Note
that the traceLevel field is part of the PemaBaseClass parent class:

Setting the debug trace level in the PemaMean class


initialize = function(varName = "", ...) {
'sum, totalValidObs, and mean are all initialized to 0'
callSuper(...)

traceLevel <<- as.integer(1)
}

In the following code, the processResults method uses outputTrace to display the message
"outputting mean" if the traceLevel field is set to a value that is greater than or equal to 1.

Displaying a trace message


processResults = function()
{
'Returns the sum divided by the totalValidObs.'
if (totalValidObs > 0)
{
mean <<- sum/totalValidObs
}
else
{
mean <<- as.numeric(NA)
}
.self$outputTrace("outputting mean:\n", outTraceLevel = 1)
return( mean )
}

Demonstration: Creating and running a PEMA object


This demonstration walks through the example of the PEMA class described in the notes.

Demonstration Steps
Creating a PEMA class generator
1. Open your R development environment of choice (RStudio or Visual Studio).

2. Open the R script Demo3 - PemaMean.R in the E:\Demofiles\Mod05 folder.


3. Highlight and run the code under the comment # Create the PemaMean class. This code creates
the PemaMean class, which you can use to calculate the mean value of a specified variable in a
dataset. The class uses the RevoPemaR framework to perform the calculation in a distributed manner.
The class, fields, and methods are the same as those described in the notes in this lesson.

Testing the PemaMean class


1. Highlight and run the code under the comment # Instantiate a PemaMean object. This statement
creates a variable named meanPemaObj using the PemaMean class.

2. Highlight and run the code under the comment # Connect to R Server. This creates a remote
connection and sets up the environment for RevoPemaR.

3. Highlight and run the code under the comment # Copy the PemaMean object to the R server
environment for testing. This copies the PemaMean object from the local session to the remote
session running on R server.

4. Highlight and run the code under the comment # Create some test data. This code creates a data
frame with a single variable named x. The variable contains 1,000 random values. Note that the
random number generator is seeded with a specific value to enable the test to be repeatable.

5. Highlight and run the code under the comment # Run the analysis. This code uses the
pemaCompute function to deploy the meanPemaObj object to the distributed environment and
start it running. The parameters specify the dataset, and the variable for which the mean should be
calculated. The value returned is displayed. It should be 0.04619816.

6. Highlight and run the code under the comment # Examine the internal fields of the PemaMean
object. This code retrieves the data stored in the sum, mean, and totalValidObs fields of the object.
Note that the number of valid observations is 1000.

7. Highlight and run the code under the comment # Create some more test data. This code creates
another dataset of 1000 seeded random numbers.
8. Highlight and run the code under the comment # Run the analysis again, but include the previous
results. This repeats the analysis, but sets the initPema argument of the pemaCompute function to
FALSE. This prevents the PEMA framework from invoking the initialize function in the PEMA object,
so the fields are not reset. The result should be -0.006803199.

9. Highlight and run the code under the comment # Examine the internal fields of the PemaMean
object again. This time, note that the number of valid observations is 2000.

Using the PemaMean class to analyze flight delay data


1. Highlight and run the code under the comment # Perform an analysis against the flight delay
data. This code copies a test subset of the flight delay data to the remote session, and then uses the
PemaMean object to analyze the Delay variable in this data. The result is the mean delay for all
flights.

2. Close your R development environment of choice (RStudio or Visual Studio).

Check Your Knowledge


Question

You have written a PEMA class that performs a complex analysis in parallel. You
decide to test the class on a cluster with a single compute node. The data is divided
into 50 chunks. How many times does the updateResults method of the PEMA
object run?

Select the correct answer.

50

It varies, depending on how the master node decides to distribute the work,
but it could be anywhere between 1 and 50.

2 (once at the start of the operation and once at the end)



Lab: Parallelizing analysis operations


Scenario
After pondering and examining the flight delay data, you decide that it is time to focus on a more specific
problem:

If I fly from A to B by airline C in the morning/afternoon/evening of day X in month Y, what is the


probability that the flight will be delayed, and for how long?

This is a complex question that depends on a number of variables, so you decide to break it down into
smaller elements. Initially, you focus on the phrase "If I fly from A to B by airline C … ?" To help solve this
part of the equation, you want to find the number of times these flights are delayed, and by how long.
You realize that you can perform the processing for this problem using parallel nodes, so you decide to
write a PEMA class to help you.

Objectives
In this lab, you will create a PEMA object that you can use to examine flight delay data.

Lab Setup
Estimated Time: 90 minutes

 Username: Adatum\AdatumAdmin
 Password: Pa55w.rd

Before starting this lab, ensure that the following VMs are all running:

 MT17B-WS2016-NAT
 20773A-LON-DC

 20773A-LON-DEV

 20773A-LON-RSVR
 20773A-LON-SQLR

Exercise 1: Capturing flight delay times and frequencies


Scenario
The PEMA class that you will develop analyzes flights made between two selected airports by a specified
airline. The airports and airline will be passed to the PEMA object as parameters. The result generated by
the PEMA object will comprise a list containing:

 The total number of flights made by the selected airline from the first airport to the second.

 The number of these flights that were delayed.

 The delay times for the delayed flights.

The main tasks for this exercise are as follows:

1. Create a shared folder


2. Copy the data

3. Create the class generator for the PemaFlightDelays class

4. Test the PemaFlightDelays class using a data frame


5. Use the PemaFlightDelays class to analyze XDF data

 Task 1: Create a shared folder


If you have not already created the \\LON-RSVR\Data share in an earlier lab, perform the following
steps:

1. Log in to the LON-RSVR VM as Adatum\AdatumAdmin with the password Pa55w.rd.

2. Using File Explorer, create a new folder named C:\Data.

3. Create a network share for the C:\Data folder. This share should provide read and write access for
Everyone. Name the share \\LON-RSVR\Data.

 Task 2: Copy the data


1. Log in to the LON-DEV VM as Adatum\AdatumAdmin with the password Pa55w.rd.

2. Copy the file FlightDelayData.xdf from the E:\Labfiles\Lab05 folder to the \\LON-RSVR\Data
share.

3. Close the command prompt window.

 Task 3: Create the class generator for the PemaFlightDelays class


1. Start your R development environment of choice (Visual Studio Tools, or RStudio), and create a new R
file.

2. Install the dplyr package.

3. Bring the dplyr and RevoPemaR libraries into scope.


4. Create a new class generator named PemaFlightDelays using the setPemaClass function. This class
should inherit from the PemaBaseClass class.

5. Define the following fields:


 totalFlights: numeric
 totalDelays: numeric
 origin: character
 dest: character
 airline: character
 delayTimes: vector
 results: list
6. Add the initialize method to the methods list for the class. This method should take three
parameters: originCode, destinationCode, and airlineCode. These parameters will be used to
specify the origin, destination, and airline to use for retrieving flight delay information. They should all
default to the empty string.

In the body of the initialize method, perform the following tasks:


 Activate the PEMA framework:

callSuper(...)
usingMethods(.pemaMethods)

 Initialize totalFlights and totalDelays to zero.


 Set delayTimes to be a zero-length numeric vector.
 Initialize origin using the originCode parameter.
 Initialize dest using the destinationCode parameter.
 Initialize airline using the airlineCode parameter.

7. Define the processData method.

This method should take a single parameter named dataList. This parameter will either contain a
data frame or a list of vectors, depending on whether the PEMA object is run against a data frame or
an XDF object.

In this method, perform the following tasks:

 Coerce the dataList parameter into a data frame if it is not one already.

 If the origin field contains an empty string, the user didn't provide this information as a
parameter when running the object. In this case, populate the origin with the first value found in
the Origin variable in the data frame.

 Likewise, populate the dest and airline fields using the first values found in the Dest and
UniqueCarrier variables of the data frame.

 Filter the data frame to find all flights that match the combination specified by the values in the
origin, dest, and airline fields.

 Count the number of flights matched and add this figure to the totalFlights field.
 From this dataset, find all flights with a delay time of more than zero minutes and append these
delay times to the delayTimes vector.

 Count the number of delayed flights and add this figure to the totalDelays field.
8. Define the updateResults method.

This method should take a single parameter named pemaFlightDelaysObj. This parameter contains
a reference to another instance of the PEMA object that has been running on another node. In this
method, perform the following tasks:

 Add the value found in pemaFlightDelaysObj$totalFlights to the totalFlights field.

 Add the value found in pemaFlightDelaysObj$totalDelays to the totalDelays field.

 Append the data found in the pemaFlightDelaysObj$delayTimes vector to the delayTimes


field.

9. Define the processResults method.

This method should not take any parameters. In this method, perform the following tasks:

 Construct the results list. This list should have three elements named NumberOfFlights,
NumberOfDelays, and DelayTimes. Store the value from the totalFlights field in the
NumberOfFlights element. Store the value from the totalDelays field in the NumberOfDelays
element. Copy the delayTimes vector to the DelayTimes element.

 Return the results list.

10. Run the code to create the PemaFlightDelays class.

 Task 4: Test the PemaFlightDelays class using a data frame


1. Create an instance of the PemaFlightDelays class in a variable named pemaFlightDelaysObj using
the PemaFlightDelays class generator.

2. Start a remote session on the LON-RSVR VM. When prompted, specify the username admin, and the
password Pa55w.rd.

3. At the REMOTE> prompt, temporarily pause the remote session and return to the local session
running on the LON-DEV VM.

4. Copy the local object, pemaFlightDelaysObj, to the remote session. Use the putLocalObject
function to do this.
5. Resume the remote session.

6. In the remote session, install the dplyr package, and bring the dplyr and RevoPemaR libraries into
scope.

7. Create a data frame containing the first 50,000 observations from the FlightDelayData.xdf file in the
\\LON-RSVR\Data share.

8. Use the pemaCompute function to run an analysis of flight delays using the pemaFlightDelaysObj
object. Specify an origin of "ABE", a destination of "PIT", and the airline "US".

9. Display the results. They should indicate that there were 755 matching flights, of which 188 were
delayed. The delay times for each delayed flight should also appear.

10. Display the contents of the internal fields of the pemaFlightDelaysObj object. They should contain
the following values:

 delayTimes: a list of 188 integers showing the delay times for each delayed flight.
 totalDelays: 188
 totalFlights: 755
 origin: ABE
 dest: PIT
 airline: US

 Task 5: Use the PemaFlightDelays class to analyze XDF data


1. Create an XDF file containing the first 50,000 observations of the flight delay data. Specify a small
block size of 5000 rows. This will help to ensure that the data is presented to the PEMA object as
several chunks, for testing purposes. Note that in the previous case, when using a data frame,
chunking does not occur and the PEMA object receives all of the data in one go.
2. Repeat the analysis performed in the previous exercise using this XDF file. Verify that the results are
the same as those generated using the data frame.

3. Perform an analysis of flights from LAX to JFK made by airline DL using the entire dataset in the
FlightDelayData.xdf file.

4. Save the script as Lab5Script.R in the E:\Labfiles\Lab05 folder, and close your R development
environment.

Results: At the end of this exercise, you will have created and run a PEMA class that finds the number of
times flights that match a specified origin, destination, and airline are delayed—and how long each delay
was.

Question: How many flights were made from LAX to JFK by DL, and how many were
delayed? What was the longest delay?

Question: How could you verify that the results produced by the PemaFlightDelays object
are correct?

Module Review and Takeaways


In this module, you learned how to:

 Use the rxExec function with the RxLocalParallel compute context to run arbitrary code and
embarrassingly parallel jobs on specified nodes or cores, or in your compute context.

 Use the RevoPemaR package to write customized scalable and distributable analytics.

Module 6
Creating and Evaluating Regression Models
Contents:
Module Overview 6-1

Lesson 1: Clustering big data 6-2

Lesson 2: Generating regression models and making predictions 6-10

Lab: Creating and using a regression model 6-23

Module Review and Takeaways 6-28

Module Overview
When you have refined your data and can process it effectively, you probably want to start extracting
insights from it. RevoScaleR has an extensive range of modeling tools and algorithms that allow you to
investigate almost any kind of data. The purpose of these models is to help you generate predictions for
future observations given the information held in the current dataset. The algorithms for building these
models can be broadly grouped into two categories:

1. Supervised learning: This kind of algorithm requires that every case in the data has a valid case label
(or value, in the case of regression analysis) for the response variable. The model then iteratively
changes the parameter values to improve the fit of the model to the data. Supervised learning can be
used for predictive modeling and for estimating the effects of the different predictors on the
response variable.

2. Unsupervised learning: Here, there are no class labels and no response variable. The algorithm
attempts to split the data into “natural” groups (dependent on the algorithm used to determine these
groups). This is the only method available to you if you don’t have labeled data. It is also useful as an
exploratory step in your data analysis.

In this module, you will learn about the most common supervised and unsupervised analysis types:
clustering and linear regression. The RevoScaleR package has highly optimized versions of these
algorithms that can be very efficiently deployed on a cluster or server to analyze huge datasets.

Objectives
In this module, you will learn how to:

 Use clustering to reduce the size of a big dataset and perform further exploratory analysis.

 Fit data to linear and logit regression models, and use these models to make predictions.

Lesson 1
Clustering big data
Clustering is an unsupervised learning algorithm that finds structure in a dataset by placing cases into
groups or “clusters” according to a distance metric based on the set of variables you choose to use.

Lesson Objectives
After completing this lesson, you will be able to:

 Describe the purpose of clustering and when to use it.

 Perform k-means clustering over big data.

 Evaluate k-means models.

 Standardize data by transforming it as it is clustered.

 Optimize the process of creating clusters.

Why perform clustering?


Clustering is often used as exploratory data
analysis or as a preprocessing stage to further
analysis.

The most widely used clustering algorithm is k-means based (also known as “partitioning clustering”, not
to be confused with partitioning models, covered
in Module 7: Creating and Evaluating Partitioning
Models). Here you are trying to divide “n” cases,
described by “p” explanatory variables into a small
number, “k”, of classes. In k-means clustering, you
must specify the number of classes into which you
want to partition your data.

In clustering, it’s important to understand that there is no concept of accuracy because there is no
baseline of labeled data to compare against to decide this. Instead, you must judge the usefulness of the
outcome in the context of what you are trying to achieve with a cluster analysis.

You should consider using clustering in the following situations:

 To find natural groups in your data. For example, in market segmentation analysis where you might
want to find out the rough groups of people in a population according to demographic, attitudinal
and purchasing data. Summary statistics are then calculated for each individual cluster.

 To reduce datasets into subsets of similar data. Here you are effectively converting a big data
problem into a set of smaller data problems. You might want to perform clustering to determine the
cluster that you are most interested in, and then run regression models on the data in that cluster.

 To reduce the dimensionality of your data. Here you are using the supplied clusters to summarize the
predictor variables. An example of this is color quantization, where you want to reduce the colors in
an image to a fixed number of colors.

 For nonrandom sampling. You can use k-means to select k groups and then sample randomly from
each group. Cluster sampling is often used in large-scale survey designs.

Performing k-means clustering


The RevoScaleR package provides the rxKmeans
function for performing k-means clustering. It can
scale to very large datasets arranged in chunks on
a cluster and implements the Lloyd algorithm for
heuristic clustering. This is an iterative algorithm
that starts with an initial random classification of
the data, and then moves items from one cluster
to another. This minimizes the sum of squares in
each cluster (the sum of squares in a dataset is a
measure of how each item in the dataset varies
from the mean).

The operation of the rxKmeans function should be familiar if you have used the kmeans function in base R.

For more information about the rxKmeans function, see

Clustering
https://aka.ms/ce58yc

The simplest way to explain clustering is to show an example.

Note: The examples in this lesson use the diamonds dataset from the ggplot2 package.
This is the same package that contains the plot functions used in Module 3: Visualizing Big Data.
The diamonds dataset contains data on nearly 54,000 diamonds, including price, cut, color,
clarity, size, and other attributes.

The following code clusters the diamonds in the dataset into five groups, based on the values of the carat
(weight), depth, and price properties. The intent is that all diamonds that have similar values aggregated
across these properties will be in the same cluster:

Clustering the diamonds dataset


clust <- rxKmeans(~ carat + depth + price,
data = ggplot2::diamonds,
numClusters = 5, seed = 1979)

You specify the model with a one-sided formula because there is no response variable in unsupervised
learning tasks. You specify a value for k with the numClusters argument, and you can set a seed value
that is used for the initial random classification of the data (this aids reproducibility; reusing the same
seed repeats the same initial classification).

Note that the analysis in this example is performed on an in-memory data frame, but you can also use the
rxKmeans function over an XDF file if you have a big dataset.

You can view the cluster to examine its properties:

Examining the cluster


clust

# Typical results
Call:
rxKmeans(formula = ~carat + depth + price, data = ggplot2::diamonds,
numClusters = 5, seed = 1979)

Data: ggplot2::diamonds
Number of valid observations: 53940
Number of missing observations: 0
Clustering algorithm:

K-means clustering with 5 clusters of sizes 7686, 2731, 26506, 4358, 12659

Cluster means:
carat depth price
1 1.1683737 61.78579 6377.410
2 1.9030172 61.65529 15675.241
3 0.4239082 61.71354 1103.074
4 1.4816843 61.65860 10444.441
5 0.8824015 61.85397 3598.578

Clustering vector:

Within cluster sum of squares by cluster:


1 2 3 4 5
7102511970 7469055720 7361342778 7719370506 7882354583

Available components:
[1] "centers" "size" "withinss" "valid.obs" "missing.obs"
[6] "numIterations" "tot.withinss" "totss" "betweenss" "cluster"
[11] "params" "formula" "call"

This gives you information on the number of cases and the means of the variables in each cluster. You can
see it makes intuitive sense, grouping the cases at different levels of carat and price. Note that depth
looks to be far less influential.

The “available components” section shows the different elements you can extract from the model object
using the $ operator. For example, clust$numIterations will tell you how many iterations the algorithm
performed.
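
For example, you can pull out individual components like this (the component names are taken from the “Available components” list shown above):

Extracting components from the cluster object

clust$numIterations   # the number of iterations the algorithm performed
clust$centers         # the matrix of cluster means
clust$size            # the number of observations in each cluster
head(clust$cluster)   # the cluster assignments for the first few observations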

Evaluating clusters
Although there is no real measure of accuracy in a cluster analysis, you can get an idea of the percentage of variation explained by the model by calculating the ratio of the “between cluster sum of squares” (betweenss—a measure of the variation between clusters) to the total sum of squares (totss—the total variation among the variables in the model):

Calculating the variation between clusters


clust$betweenss / clust$totss

# Typical results:
0.9562775

In this analysis, you can see that 95 percent of the variation in the dataset is explained by the differences
between the cluster means. It seems that the clusters here are very informative!

Standardizing data
Clustering uses a distance metric to determine
membership of clusters. If you have two variables,
one of which is on a scale of 0 to 1 and the other
is on a scale of 0 to 1 million, the variation in the
second variable could swamp that of the first
variable. This means that the clustering algorithm
will be biased to take the variable with the higher
variance more into account—this often occurs
when variables with different units are used.

This is clearly the case with the clusters generated over the diamonds data, as you can see if you look at the standard deviations of the three variables in the original dataset:

The effects of using variables with different scales to cluster data


apply(ggplot2::diamonds[,c("carat", "depth", "price")], 2, sd)

# Typical results
carat depth price
0.4740112 1.4326213 3989.4397381

The standard deviation of price measured in dollars is more than 8,000 times that of the weight in carats,
although this difference is not necessarily informative because the variables have different units.

One way to handle this problem is to standardize your variables before running the clustering algorithm.
The usual transformation is known as the z-transformation where you subtract the mean and divide by
the standard deviation. This gives a unitless measure with a mean of 0 and a standard deviation of 1, also
known as a z-score. Clustering on these transformed variables will give a truer reflection of the variation in
the data.
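
As a quick sanity check on the in-memory diamonds data, you can verify what the z-transformation does by using the base R scale function (this snippet is purely illustrative and is not part of the clustering code that follows):

Checking the z-transformation with the scale function

carat_z <- as.numeric(scale(ggplot2::diamonds$carat))  # equivalent to (carat - mean(carat)) / sd(carat)
round(mean(carat_z), 10)   # effectively 0
round(sd(carat_z), 10)     # exactly 1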

The following example uses a transform with the rxKmeans function to implement this technique:

Standardizing data as it is clustered


clust_z <- rxKmeans(~ carat_z + depth_z + price_z,
data = ggplot2::diamonds,
numClusters = 5, seed = 1979,
transforms = list(
carat_z = (carat - mean(carat)) / sd(carat),
depth_z = (depth - mean(depth)) / sd(depth),
price_z = (price - mean(price)) / sd(price))
)

If you examine the variance, you should see a difference:

Examining the variation in the clusters built using standardized values


clust_z$betweenss / clust_z$totss

# Typical results:
0.7356599

You can see that the percentage variance explained by the model has decreased to 74 percent. This is not
surprising because it reflects the increased relative contribution of the variables on a smaller scale. The
percentage was artificially inflated in the untransformed analysis because the variables were in different
units.

You can also examine how the influence of the different variables has changed in determining the
clusters:

Examining the influence of variables when clustering data


clust_z$centers

# Typical results
carat_z depth_z price_z
1 0.69267532 0.07945899 0.5541154
2 0.05006483 1.54316431 -0.1950409
3 2.07916635 -0.08380051 2.3937402
4 -0.20652536 -1.59163979 -0.2969734
5 -0.74473941 0.07436982 -0.6617408

Notice how there is obvious variation in the transformed depth variable, unlike with the clusters
generated by using the untransformed variables.

Optimizing k-means clustering


The main limitation with k-means clustering is that
you need to provide k, the number of clusters to
split your data into. The problem is that you might
not know how many clusters you want without
exploring the data first. Ideally, you want to have
the smallest number of clusters possible that
explain as much of the variation in the data in a
useful way. There are few hard and fast rules to
determine the value of k: it relies on your
judgement and what you are trying to explain
with your model.

Theoretically, the proportion of the total variation explained by “between cluster sum of squares” (betweenss) increases with each cluster added—the extreme case is that, if you have one cluster for each data point, 100 percent of the variation is between clusters. Clearly though, this would not be a very useful model.

At a certain point, adding more clusters will only add a marginal amount of explanatory power to your
model. With standard k-means, it is common to run a series of models over a range of values of k and
look at, or plot, the “between cluster sum of squares” versus “total sums of squares” ratios for each value.
At some point, there will be an elbow on the plot where the increase in explanatory power tails off as you
add more clusters. This is a good indicator of the number of clusters to choose for k. With large data on a
cluster, however, this can be a very costly and time-consuming way to find the optimum value of k.

With very large data, it is best to take a representative sample of the full dataset and run multiple models
for different values of k on this. If the sample is representative, you should see the same patterns as for
the full data.
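
A minimal sketch of this approach might look like the following. Here, sampleDF is assumed to be a representative sample of the data (for example, a data frame with the carat, depth, and price columns), and the range of k values is arbitrary:

Comparing the sums of squares ratio for a range of k values

kValues <- 2:15
ssRatios <- sapply(kValues, function(k) {
  fit <- rxKmeans(~ carat + depth + price, data = sampleDF,
                  numClusters = k, seed = 1979)
  fit$betweenss / fit$totss
})

# Plot the ratio for each value of k and look for the elbow where the curve flattens
plot(kValues, ssRatios, type = "b",
     xlab = "Number of clusters (k)", ylab = "Between SS / Total SS")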

After you have decided on a value of k from your sample, you can also take the centroids from this
analysis and pass them to the centers argument when you run the analysis on your full dataset. This will
greatly speed up your analysis because the algorithm will have a starting point that is likely to be already
very close to the optimum. Without setting the centers argument, the algorithm will choose random
starting points and will waste time getting to a point close to the optimum.
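
A sketch of this pattern might look like the following, where clustSample is assumed to be the model you chose from the sample analysis, and fullData is the full dataset (for example, an XDF data source):

Reusing sample centroids as starting points for the full dataset

clustFull <- rxKmeans(~ carat + depth + price, data = fullData,
                      centers = clustSample$centers)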

Demonstration: Creating and examining a cluster


This demonstration walks through the example described in the notes. The purpose of the demonstration
is to create clusters of data in the diamonds dataset and analyze them to determine an optimal value to
use for k.

Demonstration Steps

Cluster the data


1. Open your R development environment of choice (RStudio or Visual Studio®).

2. Open the R script Demo1 - clustering.R in the E:\Demofiles\Mod06 folder.

3. Highlight and run the code under the comment # Cluster the data in the diamonds dataset into 5
partitions by carat, depth, and price. This code runs the rxKmeans function to generate the
clusters.

4. Highlight and run the code under the comment # Examine the cluster. These statements display the
distribution of values in the clusters, and the number of iterations that were performed to generate
the clusters. Note that the values of the depth variables used in each cluster are all very similar
compared to the variation in the carat and price variables. It took 51 iterations of the algorithm to
create these clusters.

5. Highlight and run the code under the comment # Assess the variation between clusters. This code
calculates the ratio of the sums of squares between clusters and the total sums of squares for all
clusters. The result shows that 95.6% (0.9562775) of the differences are accounted for between
clusters, so each cluster appears to be relatively homogenous.

6. Highlight and run the code under the comment # Examine the standard deviations between the
cluster variables. These statements highlight the different scales of each variable, and why the price
and carat variables are more influential than depth: the deviation in these variables is relatively large
compared to the deviation of values in depth.

Create a new cluster with standardized data


1. Highlight and run the code under the comment # Create a cluster with standardized data. This
code runs the rxKmeans function with a transform to ensure that all variables have the same scale.

2. Highlight and run the code under the comment # Examine the cluster sums of squares. This time,
the ratio of the sums of squares between clusters and the total sums of squares is only 73.6%
(0.7356599). There is a bigger variation in each cluster, due to the increased influence of the depth
and carat variables compared to price.

3. Highlight and run the code under the comment # Examine the influence of each variable. The
code shows the centroids for each cluster. This time, depth has a much greater variation than before,
and all variables have the same order of magnitude.

Determine an optimal value for k


1. Highlight and run the code under the comment # Run a series of models over a range of values of
k. This code runs the rxKmeans function with different values of k from 1 to 20. The code captures the
sums of squares ratio for each iteration.

2. Highlight and run the code under the comment # Plot the results. This code creates a data frame of
values of k and the sums of squares ratios, and then plots this data on a line graph. You can see the
decreasing effects of increasing k. When k reaches 17, the line becomes more or less flat. This
suggests that it would not be worth specifying a k value of more than 17. In fact, values from 13
upwards add little variation to the clusters, so 13 might be an optimal value to use.
3. Close your R development environment of choice (RStudio or Visual Studio).

Check Your Knowledge


Question

You have used the rxKmeans function to cluster data. The ratio of the "between
cluster sum of squares" to the "total sum of squares" across the clusters is very
high (99.8%). What does this indicate?

Select the correct answer.

Clustering has been ineffective as each cluster contains vastly differing data.

Clustering has been effective as most of the clusters contain highly


homogenous data.

You cannot draw any conclusions about the effectiveness of the clustering
using this measure.

The data values are too disparate to be clustered effectively.

All the data values in the entire dataset are nearly identical.

Lesson 2
Generating regression models and making predictions
Linear regression is perhaps the most commonly used tool in the data scientist’s toolkit. In regression
analysis, you try to fit a straight line to the relationship between a continuous response variable and one
or more continuous predictors. It is part of the more general group of linear models that also includes
ANOVA (continuous response variable and categorical predictors), ANCOVA (continuous response
variable and a combination of categorical and continuous predictors) and MANOVA (multiple response
variables). There are also generalized linear models that apply a transformation to the linear model—for
example, logistic regression for modeling a binary response variable.

Linear regression is a supervised learning technique, in that you need to have labeled response variables
to train it on. After you have run the model on your training data, you can then make predictions from
the model, given a set of predictor variables. A big advantage over more complex machine learning
algorithms, such as decision forests, boosting and neural networks (covered in Module 7), is that the
model is easily interpretable: you get effect levels for the predictor variables that tell you the effect on the
response for each increase in that predictor. The more complicated algorithms often operate as black
boxes—these produce good predictions but are less easily interpretable.
The problem with linear regression is that it assumes that:

 The predictors are roughly normally distributed.

 The relationship between the predictors and response is linear.


The further your data deviates from these assumptions, the less valid the predictions or effect levels of the
model. This limits the usefulness of linear regression as a tool, because a great many problems you might
encounter will involve nonnormally distributed data and nonlinear relationships—although, in some cases,
you can use data transformations to mitigate these problems.

Often, you might begin your analysis with linear regression, and then move on to something more
complex if it does not fit your data well.

Lesson Objectives
After completing this lesson, you will be able to:
 Explain how linear regression works.

 Run a linear regression model.

 Examine a regression model.

 Make predictions using a regression model.

 Perform logistic regression.

 Create generalized linear models.

 Perform cube regressions.



Running a linear regression model


The RevoScaleR package provides the rxLinMod function
for performing linear regression. The function uses an
updating algorithm, is scalable and can be deployed on a
cluster or server for very large, chunked datasets. For
more information, see:

Fitting Linear Models


https://aka.ms/o9tx5d

The following example uses the rxLinMod function to determine the effect of diamond weight (carat) on
the price. This model helps to determine whether there is a linear relationship between the two variables:

Running the rxLinMod function to perform a linear regression


mod1 <- rxLinMod(price ~ carat, data = ggplot2::diamonds,
covCoef = TRUE)

Note that the response variable is on the left-hand side of the formula and the predictor variables are on
the right-hand side. The covCoef = TRUE argument ensures that the covariance matrix for the
coefficients is calculated as part of the regression. You can examine the results directly:

Displaying the results of the regression


mod1

# Typical results
Call:
rxLinMod(formula = price ~ carat, data = ggplot2::diamonds)

Linear Regression Results for: price ~ carat


Data: ggplot2::diamonds
Dependent variable(s): price
Total independent variables: 2
Number of valid observations: 53940
Number of missing observations: 0

Coefficients:
price
(Intercept) -2256.361
carat 7756.426

This output should be familiar if you have used the lm function in base R. You can see that the effect estimate
for the carat predictor variable is 7756.43. This is the effect on price of an increase of 1 carat, so a single
carat diamond would be predicted to cost (-2256.36 + 7756.43) = $5,500.07. The intercept reflects the
theoretical cost of a zero carat diamond.

Note: This example shows the possible dangers of blindly over-interpreting the results.
According to this model, diamonds become worthless at 0.29 carats, and you would have to pay
someone $2,256.36 to take a zero carat diamond away. Clearly the model is nonsense at this
extreme but, as the weight of diamonds increases, it can become more realistic.

The R-squared value shows that the carat variable accounts for approximately 85 percent of the variation
in price.

Using a categorical predictor variable


You can run other types of linear models by varying the type of the predictor variables.

You can use the F() function within the formula to convert a continuous variable to a factor with integer
levels as shown below:

Converting a continuous variable into a factor


mod2 <- rxLinMod(price ~ F(carat),
data = ggplot2::diamonds,
dropFirst = TRUE,
covCoef = TRUE)

It’s important to note that rxLinMod operates somewhat differently to the lm function in base R when
dealing with factors. By default, rxLinMod uses the last factor level as the baseline for comparisons while
lm uses the first. The example specifies the arguments dropFirst = TRUE and covCoef = TRUE. These
argument values will return a similar output to an lm fit.

The results generated by this model look like this:

The results of the linear regression based on a factor predictor variable


mod2
Call:
rxLinMod(formula = price ~ F(carat), data = ggplot2::diamonds,
dropFirst = TRUE, covCoef = TRUE)

Linear Regression Results for: price ~ F(carat)


Data: ggplot2::diamonds
Dependent variable(s): price
Total independent variables: 7 (Including number dropped: 1)
Number of valid observations: 53940
Number of missing observations: 0

Coefficients:
price
(Intercept) 1632.641
F_carat=0 Dropped
F_carat=1 5655.627
F_carat=2 13214.308
F_carat=3 12676.065
F_carat=4 14825.359
F_carat=5 16385.359

The coefficients table shows the estimated effect for each factorized value of the weight, relative to the
baseline level that was dropped (F_carat=0) and is represented by the intercept. For example, according to
this model, a diamond in the 1-carat band is predicted to cost 1632.64 + 5655.63 = $7,288.27, a diamond in
the 2-carat band 1632.64 + 13214.31 = $14,846.95, and so on. Again, you should beware of extrapolating
these results to their extremes.

Transforming variables
You can transform model variables within the rxLinMod function in the same way as in many of the other
rx* functions, by using the transforms argument.

The next example uses the log of the weight as the predictor variable:

Transforming data as part of the regression operation


mod3 <- rxLinMod(price ~ log_carat,
transforms = list(log_carat = log(carat)),
data = ggplot2::diamonds)

The advantage of using transforms rather than transforming the variable directly is that the
transformation is performed on chunks of data, so it is more efficient on large datasets.

Note: The rxLinMod function is not restricted to single predictor variables. In the same
way as with lm, you can include multiple variables, interactions between variables, and nested
terms.
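
For example, a model that includes both carat and depth, together with their interaction, might be specified as follows (this particular model is illustrative and is not used elsewhere in this module):

Including multiple predictors and an interaction term

mod4 <- rxLinMod(price ~ carat + depth + carat:depth,
                 data = ggplot2::diamonds)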

Examining a regression model


The earlier examples showed that you can see useful
information just by examining the model directly.
However, you can obtain more extended information by
using the summary function:

Summarizing the regression model


summary(mod1)

# Typical results
Call:
rxLinMod(formula = price ~ carat, data = ggplot2::diamonds)

Linear Regression Results for: price ~ carat


Data: ggplot2::diamonds
Dependent variable(s): price
Total independent variables: 2
Number of valid observations: 53940
Number of missing observations: 0

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2256.36 13.06 -172.8 2.22e-16 ***
carat 7756.43 14.07 551.4 2.22e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1549 on 53938 degrees of freedom


Multiple R-squared: 0.8493
Adjusted R-squared: 0.8493
F-statistic: 3.041e+05 on 1 and 53938 DF, p-value: < 2.2e-16
Condition number: 1

Now you can see not only the parameter estimates, but also the standard errors and p-values for the
coefficients, residual standard error, and R-squared statistics. Note that, with very large datasets, you are
likely to get significant p-values, so it is sensible to pay more attention to the effect levels and standard
error—and to run some visualizations to view the results of your model.

The RevoScaleR package provides several other functions that you might find useful for examining your
variables and your rxLinMod models. These functions can all be run on chunked data on a cluster or
server, and can transform data in the same way as the rxLinMod function:

 rxCov: calculates the covariance matrix for a set of variables.

 rxCor: calculates the correlation matrix for a set of variables.

 rxSSCP: calculates the sum of squares or cross-product matrix for a set of variables.

 rxCovCor: calculates either a correlation or covariance matrix.

 rxCovCoef: returns the covariance matrix for the regression coefficients in a model object.

 rxCorCoef: calculates the correlation matrix for the regression coefficients in a model object.

 rxCovData: calculates the covariance matrix for the predictor variables in a model object.
 rxCorData: calculates the correlation matrix for the predictor variables in a model object.

For more information on these functions and their uses, see:

Estimating correlation and variance/covariance matrices


https://aka.ms/cejiqb
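
For example, you might use two of these functions as follows (rxCovCoef requires that the model was fitted with covCoef = TRUE, as mod1 was earlier):

Examining correlations and coefficient covariances

# Correlation matrix for the variables used in the models
rxCor(~ carat + depth + price, data = ggplot2::diamonds)

# Covariance matrix for the regression coefficients of mod1
rxCovCoef(mod1)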

Making predictions using a linear regression model


You can use a regression model to make
predictions about data. You do this using the
rxPredict function. This function takes a model,
and a dataset containing the predictor variables
used to create the model. If you want to test the
accuracy of a linear model, you can generate a
dataset from the same data you used to create the
model and compare the results to the real values.
The following example adopts this approach.

First, generate a dataset for a subset of the original data containing only the predictor variable (carat). The dataset also retains the price variable so you can easily compare the results of the predictions against the real data:

Generating a dataset for making predictions


testData = rxDataStep(ggplot2::diamonds,
rowSelection = color == "H",
varsToKeep = c("carat", "price"))

Next, generate predictions using the rxPredict function. This function produces a result set containing
row-by-row predictions for the response variable. You can include the original variables from the dataset
by specifying writeModelVars = TRUE as an argument:

Using the regression model to make predictions


predictions <- rxPredict(mod1, testData, writeModelVars = TRUE)

The predictions dataset contains a price_Pred variable containing the predicted value for each price. You
can assess these predictions against the real prices.

Examining the accuracy of predictions


head(predictions, 50)

# Typical results
price_Pred price carat
1 -239.689919 337 0.26
2 -472.382688 338 0.23
3 -472.382688 353 0.23
4 148.131362 402 0.31
5 225.695618 403 0.32
6 225.695618 403 0.32
7 225.695618 403 0.32
8 -6.997151 404 0.29
9 70.567105 554 0.30
10 70.567105 554 0.30

45 3560.958633 2806 0.75
46 3948.779914 2808 0.80
47 70.567105 554 0.30
48 70.567105 554 0.30
49 70.567105 554 0.30
50 70.567105 554 0.30

In this example, there is clearly some discrepancy between the real and predicted prices. In such cases,
you might find it useful to calculate confidence intervals around your predictions:

Calculating the confidence interval for the predictions


predictions_se <- rxPredict(mod1, data=testData, computeStdErrors=TRUE,
interval="confidence", writeModelVars = TRUE)

head(predictions_se, 50)

# Typical results
price_Pred price_StdErr price_Lower price_Upper price carat
1 -239.689919 10.085469 -259.45752 -219.92232 337 0.26
2 -472.382688 10.405828 -492.77819 -451.98718 338 0.23
3 -472.382688 10.405828 -492.77819 -451.98718 353 0.23
4 148.131362 9.569076 129.37590 166.88683 402 0.31
5 225.695618 9.468687 207.13692 244.25432 403 0.32
6 225.695618 9.468687 207.13692 244.25432 403 0.32
7 225.695618 9.468687 207.13692 244.25432 403 0.32
8 -6.997151 9.772834 -26.15198 12.15768 404 0.29
9 70.567105 9.670468 51.61291 89.52130 554 0.30
10 70.567105 9.670468 51.61291 89.52130 554 0.30

45 3560.958633 6.701669 3547.82331 3574.09396 2806 0.75
46 3948.779914 6.667718 3935.71113 3961.84869 2808 0.80
47 70.567105 9.670468 51.61291 89.52130 554 0.30
48 70.567105 9.670468 51.61291 89.52130 554 0.30
49 70.567105 9.670468 51.61291 89.52130 554 0.30
50 70.567105 9.670468 51.61291 89.52130 554 0.30

It can also be instructive to generate a line plot to visualize your predictions:

Visualizing the predictions


rxLinePlot(price + price_Pred + price_Upper + price_Lower ~ carat,
data = predictions_se, type = "b",
lineStyle = c("blank", "solid", "dotted", "dotted"),
lineColor = c(NA, "red", "black", "black"),
symbolStyle = c("solid circle", "blank", "blank", "blank"),
title = "Data, Predictions, and Confidence Bounds",
xTitle = "Diamond weight (carats)",
yTitle = "Price", legend = FALSE)

In this case, the results look like this:



Demonstration: Fitting a linear model and making predictions


This demonstration continues through the example described in the notes. The purpose of the
demonstration is to build a linear model for predicting the price of diamonds.

Demonstration Steps
Fit a linear model
1. Open your R development environment of choice (RStudio or Visual Studio).

2. Open the R script Demo2 - modelling.R in the E:\Demofiles\Mod06 folder.

3. Highlight and run the code under the comment # Fit a linear model showing how price of a
diamond varies with weight. This code runs the rxLinMod function to generate a linear regression
model on the diamond data, using the weight of diamonds to derive their prices.

4. If the Updating Loaded Packages dialog box appears, click Yes.

5. Highlight and run the code under the comment # Examine the results. These statements display
information about the model created by the rxLinMod function. The coefficients suggest that each
carat in weight adds $7,756.43 to the price.

6. Highlight and run the code under the comment # Use a categorical predictor variable. This code
performs another regression using discrete values for the weight. The coefficients show how the
estimated price changes across the weight bands, relative to the dropped baseline level. It is clear from
these results that the relationship between price and weight is not particularly linear.

Make predictions using the model


1. Highlight and run the code under the comment # Generate a subset of the data for testing
predictions made by using the model. This code creates a dataset containing only diamonds with
the color "H". The dataset contains the carat and price variables (price is included so you can
compare the values predicted using this model against the real data).

2. Highlight and run the code under the comment # Make price predictions against this dataset. This
code uses the rxPredict function to run predictions using the linear model against the sample
dataset.

3. Highlight and run the code under the comment # View the predictions to compare the predicted
prices against the real prices. This code displays the first 50 rows of the results. Compare the
price_Pred values against the price variable. The discrepancy shows how the linear model is not
particularly accurate for this data. You should note that the lower the weight, the greater the
discrepancy.

4. Highlight and run the code under the comment # Calculate confidence intervals around each
prediction. This code calculates and displays the upper and lower range for each prediction, and the
degree of confidence in terms of the standard deviation.

Plot the predicted values


1. Highlight and run the code under the comment # Visualize the predicted prices as a line plot. This
code uses the rxLinePlot function to visualize the predictions. You should notice the following points:

o The data plot is very funnel shaped, indicating a wide variation in price as the weight increases.
This indicates that price is not solely dependent on weight, and that other factors might be
important.
o The plot tapers up away from the origin for small values of the weight. This makes sense, because
even very small diamonds have some value.
2. Close the script Demo2 - modelling.R, but leave your R development environment of choice open
for the next demonstration.

Using logistic regression


Logistic regression is a variant of linear regression where the
response variable is binary categorical. Examples might be
predicting whether a potential customer will respond to a
piece of direct mail, or whether someone will default on a
loan.

In logistic regression, the linear regression model is transformed to a logistic curve (an S-shaped curve) to
determine in which class to place each case. The rxLogit
function implements logistic regression optimized for large
datasets.

The following example creates a logit model to determine whether a person is likely to default on a
mortgage loan, based on a dataset that includes the credit score, number of years in employment, and
amount of credit card debt. You can see that the parameters are much the same as for linear models.

Performing a logit regression


mortXdf <- file.path(rxGetOption("sampleDataDir"), "mortDefaultSmall")
logitOut1 <- rxLogit(default ~ creditScore + yearsEmploy + ccDebt,
data = mortXdf, blocksPerRead = 5)

Note: The mortDefaultSmall dataset is provided with the ScaleR package.

Evaluating a logit model


You might want to investigate what effect removing some of the predictors has on the model
performance. One way to do this is to visualize the receiver operating characteristic (ROC) curves for
models.

A ROC curve plots the true positive rate (the number of correctly predicted TRUE responses divided by the
actual number of TRUE responses) against the false positive rate (the number of incorrectly predicted
TRUE responses divided by the actual number of FALSE responses), at various thresholds. The area under
the ROC curve indicates the predictive power of the model. The area under the ROC curve (AUC) is scaled
to a maximum of 1: an AUC of 1 represents perfect prediction, an AUC of 0.5 is what is expected with
random guessing (no predictive power), and an AUC below 0.5 indicates predictions that are systematically
worse than guessing.

The following example shows how to investigate what difference credit card debt makes to predictions of
mortgage defaults. First, you run the predictions for the first model (including credit card debt data):

Using the logit model to make predictions with an initial set of variables
predFile <- "mortPred.xdf"
predOutXdf <- rxPredict(modelObject = logitOut1, data = mortXdf,
                        writeModelVars = TRUE, predVarNames = "Model1",
                        outData = predFile)

You can then build a second model with the ccDebt variable removed, and add the predictions for this
model to the original predictions data file:

Using the logit model to make predictions with an amended set of variables
logitOut2 <- rxLogit(default ~ creditScore + yearsEmploy,
data = predOutXdf, blocksPerRead = 5)
predOutXdf <- rxPredict(modelObject = logitOut2, data = predOutXdf,
predVarNames = "Model2")

Finally, you can run the rxRoc function to generate the ROC curves for both models.

Generating the ROC curves for both models


rocOut <- rxRoc(actualVarName = "default",
predVarNames = c("Model1", "Model2"),
data = predOutXdf)

You can visualize the ROC curves with the plot function:

Displaying the ROC curves


plot(rocOut)

The results look like this:

You can see that the area under the ROC curve for the first model (including credit card debt) is far
greater than the second model (with credit card debt removed). In fact, the predictive power of the
second model is barely greater than that of random guessing (this is the faint white diagonal line). This
shows that credit card debt is an important predictor of mortgage defaults.
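
If you prefer a single summary statistic to a plot, the RevoScaleR package also provides the rxAuc function, which returns the area under the curve for each prediction variable in an rxRoc result object:

Calculating the area under the ROC curves

rxAuc(rocOut)   # one AUC value per model; values closer to 1 indicate better predictions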
For more information on logistic regressions in the RevoScaleR package, see:

Fitting Logistic Regression Models


https://aka.ms/ok5zg5

Demonstration: Modeling using logistic regression


This demonstration walks through the mortgage example described in the notes. The purpose of the
demonstration is to build a logit model for predicting whether a customer is likely to default on their
mortgage.

Demonstration Steps

Create a logit regression model to predict a binary response variable


1. In your R development environment of choice, open the R script Demo3 - logistic.R in the
E:\Demofiles\Mod06 folder.

2. Highlight and run the code under the comment # Create a logit regression model. This code runs
the rxLogit function to generate a logistic model on the mortgage data, using a customer's credit
score, number of years in employment, and amount of credit card debt, to assess whether the
customer is likely to default on their mortgage.

Evaluate the effects of predictor variables


1. Highlight and run the code under the comment # Generate predictions that include credit card
debt as a predictor variable. These statements use rxPredict to run predictions about mortgage
default using the logit model and the mortgage dataset.
2. Highlight and run the code under the comment # Display the results. This statement shows the first
50 predictions. The values in the default variable indicate whether the customer is likely to default on
their payments (1) or not (0).

3. Highlight and run the code under the comment # Generate a model and predictions that exclude
credit card debt. This statement creates another logit model that excludes credit card debt and
runs predictions using this model. The results are appended to the previous results.

4. Highlight and run the code under the comment # Display the results. This statement shows the first
50 predictions again. Examine the Model1 and Model2 columns. The values in these columns show
the actual likelihoods of default with and without taking credit card debt into account.

Generate a ROC curve


1. Highlight and run the code under the comment # Generate the ROC curves for both models. This
code uses rxRoc to create the ROC curve for both models created in the previous task.

2. Highlight and run the code under the comment # Visualize the ROC curve. The graph shows the
true positive rate versus the false positive rate for both models. Note that the area under the curve for
the first model is far greater than that of the second model, indicating that it has much more
predictive power. The faint white line shows how making random guesses would fare (a straight
diagonal line from the origin), and the second model is actually not much better than this. Therefore,
it would seem that taking credit card debt into account is an important factor in this model.
3. Close your R development environment of choice.

Creating generalized linear models


Generalized linear models (GLMs) relax the
assumptions for a standard linear model. This
allows for effective modeling of a wide range of
problems where either the variance is not constant
with respect to the mean, or where the errors are
not normally distributed.

GLMs have two main differences from a standard linear model:

1. You can specify a functional form for the conditional mean of the response. This is referred to as the
“link” function. It relates the mean value of the response variable to its linear predictor. Effectively, it
transforms the response variable to allow for modeling of non-normally distributed data.

2. You can specify a distribution for the response variable and the errors. This is referred to as the
“family”.

The rxGlm function in the RevoScaleR package provides the ability to estimate generalized linear models
on large datasets.
All the link/family combinations available to glm in base R are also available to rxGlm, but the following
combinations have been optimized for high performance on large datasets:

 binomial/logit: this is the logistic regression for binary classification and has been covered in the
previous topic. Use this combination when:

o Data is strictly bounded (zeros and ones).

o Variance is not constant.

o Errors are not normally distributed.

 Gamma/log: use this combination when:

o Data is non-negative.

o Data has a positive skew.

o Variance is near constant on the log scale.

 poisson/log: use this combination for modeling count data when:

o Data is bounded by zero.

o Variance increases with the mean.

o Errors are not normally distributed.

o All values must be integers.

 Tweedie: use this to produce a generalized linear model family object with any power variance function and any power link.

The rxGlm function works in the same way as rxLinMod and rxLogit. In fact, rxLinMod and rxLogit are
just convenience functions wrapped around rxGlm. You specify the glm type using the family argument.
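
For example, a Gamma/log model of diamond prices (which are non-negative and positively skewed) might be specified like this; the model itself is purely illustrative and is not part of the module's demonstrations:

Fitting a Gamma/log generalized linear model

glmGamma <- rxGlm(price ~ carat, data = ggplot2::diamonds,
                  family = Gamma(link = "log"))
summary(glmGamma)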
For more information on GLMs, see:

Generalized linear models


https://aka.ms/yx93rd

Performing cube regressions


A cube is a type of multidimensional array. You
can use the rxCube function to create efficiently
represented contingency tables from cross-
classifying factors. The linear model and GLM
functions in the RevoScaleR package can use
cubes to run large regression models very
efficiently. To take advantage of this property, you
need to set the cube = TRUE argument and ensure
that the first predictor variable is categorical. The
regression is fitted using the partitioned inverse
method, where the intercept term is dropped and
coefficients are computed for each level of the first
categorical predictor. Essentially, you are partitioning the analysis according to the first categorical
predictor. This can be faster and use less memory than the usual regression method.
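
As a sketch of the syntax, the following call fits a cube regression over the diamonds data; cut is a factor, so it satisfies the requirement that the first predictor is categorical (the model itself is illustrative):

Fitting a cube regression

cubeMod <- rxLinMod(price ~ cut + carat, data = ggplot2::diamonds,
                    cube = TRUE)
cubeMod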

For more information on the use of cubes for performing regression, see:

Cubes and Cube Regression

https://aka.ms/qaxttm

Verify the correctness of the statement by placing a mark in the column to the right.

Statement Answer

You have generated a linear model using the rxLinMod function over a
large dataset. You have tested the model by making predictions using this
model and comparing them to a set of known results. You have plotted a
ROC curve comparing the predicted values to the known results. The ROC
curve shows a straight diagonal line from the point (0,0) to the point (1,1).
This indicates that the model is making very accurate predictions. True or
False?

Lab: Creating and using a regression model


Scenario
You have a perception (based on earlier visualizations and observations) that flight delay time might be a
function of the time of day when the aircraft departed. You have decided to build a regression model that
you can use to test this hypothesis. Due to the large amount of data available, you have also decided to
cluster the data by departure time and delay. You will then use this cluster to make predictions about how
long a flight is likely to be delayed, given the departure time.

Objectives
In this lab, you will:

 Create and evaluate clusters by using k-means clustering.

 Fit a linear model and make predictions against clustered data.

 Fit another linear model against the entire dataset and compare predictions.

 Use logit analysis to assess whether a flight is likely to be delayed.

Lab Setup
Estimated Time: 90 minutes

 Username: Adatum\AdatumAdmin
 Password: Pa55w.rd

Before starting this lab, ensure that the following VMs are all running:

 MT17B-WS2016-NAT
 20773A-LON-DC

 20773A-LON-DEV

 20773A-LON-RSVR
 20773A-LON-SQLR

Exercise 1: Clustering flight delay data


Scenario
You want to cluster flight delay data by departure time, but you are not sure how many clusters you
should create. You need to establish an optimal number.

The main tasks for this exercise are as follows:

1. Copy the data to the shared folder

2. Examine the relationship between flight delays and departure times

3. Create clusters to model the flight delay data

 Task 1: Copy the data to the shared folder


1. Log on to the LON-DEV VM as Adatum\AdatumAdmin with the password Pa55w.rd.

2. Copy the FlightDelayData.xdf file from the E:\Labfiles\Lab06 folder to the \\LON-RSVR\Data shared
folder.

 Task 2: Examine the relationship between flight delays and departure times
1. Start your R development environment of choice (Visual Studio, or RStudio), and create a new R file.

2. Create a remote session on the LON-RSVR server. This is another VM running R Server. Use the
following parameters to the remoteLogin function:

 deployr_endpoint: http://LON-RSVR.ADATUM.COM:12800

 session: TRUE
 diff: TRUE

 commandLine: TRUE

 username: admin

 password: Pa55w.rd

3. Create a test file containing a random 10 percent sample of the flight delay data. Save this sample in
the file \\LON-RSVR\Data\flightDelaySample.xdf.
4. Create a scatter plot that shows the flight departure time on the X-axis and the delay time on the Y-
axis. Use the local departure time in the DepTime variable, and not the departure time recorded as
UTC. Add a regression line to the plot to help you spot any trends.

5. Create a histogram that shows the number of flights that depart during each hour of the day. Note
that you will have to factorize the departure time to do this; create a factor for each hour.

Results: At the end of this exercise, you will have determined the optimal number of clusters to create,
and built the appropriate cluster model.

Question: What do the graphs you created in this exercise tell you about flights made from
6:01 PM onwards?

 Task 3: Create clusters to model the flight delay data


1. As an initial starting point, cluster the sample data into 12 partitions based on the DepTime and
Delay variables.

2. Calculate the ratio of the between clusters sums of squares and the total sums of squares for this
model. How much of the difference between values is accounted for between the clusters?
Examine the cluster centers to see how the clusters have partitioned the data values.

3. You don't yet know whether this is the best cluster model to use. Generate models with 2, 4, 6, 8, 10,
12, 14, 16, 18, 20, 22, and 24 clusters.

 Maximize parallelism by creating a parallel compute context.

 Register the RevoScaleR parallel back end with the foreach package (run the registerDoRSR
function).

 Use the %dopar% operator with a foreach loop that creates different instances of the model in
parallel.

Note: At the time of writing, there was still some instability in R Server running on
Windows. Placing it under a high parallel load can cause it to close the remote session. If this
happens, resume the remote session and switch back to the RxLocalSeq compute context.

4. Calculate the ratio of the between clusters sums of squares and the total sums of squares for each
model.

5. Generate a scatter plot that shows the number of clusters on the X-axis and the sums of squares ratio
on the Y-axis. Which value for the number of clusters does this graph suggest you should use?

Exercise 2: Fitting a linear model to clustered data


Scenario
You have clustered the flight delay data, and now you want to fit a linear model that you can use to make
predictions about flight delays. To test the predictions, you will use another subset of data that includes
delay times and departure times, run predictions against the departure times, and compare the results to
the delay times.
The main tasks for this exercise are as follows:

1. Create a linear regression model

2. Generate test data and make predictions


3. Evaluate the predictions

 Task 1: Create a linear regression model


 Using your selected cluster model, fit a linear regression model that describes how delay varies with
departure time. Include the variance-covariance matrix of the regression coefficients in the results (set
the covCoef argument to TRUE).

 Task 2: Generate test data and make predictions


1. Create a test dataset that comprises 1% of the data from the original flight delay dataset. Discard all
variables except for DepTime and Delay.

2. Run the rxPredict function to predict the delays for the test dataset. Record the standard error and
confidence level for each prediction (set the computeStdErr argument to TRUE, and set the interval
argument to "confidence").
3. Examine the first few predictions made. Compare the Delay and Pred_Delay values. Pay attention to
the confidence level of each prediction.

 Task 3: Evaluate the predictions


1. Create a scatter plot of predicted delays against departure time, for comparison with the earlier graph
showing actual delays against departure time. What do you notice about the graph?

Note: The graph showing the actual departure times and delays is based on a much bigger
dataset. To get a fair comparison between the two graphs, regenerate the earlier graph showing
the data for the entire dataset and set the alpha level of the points to 1/50. Both graphs should
look very similar.

2. Create a scatter plot that shows the difference between the actual and predicted delays for each
observation. Again, what do you notice about this graph?

Results: At the end of this exercise, you will have created a linear regression model using the clustered
data, and tested predictions made by this model.

Question: What conclusions can you draw about the predictions made by the linear model
using the clustered data?

Exercise 3: Fitting a linear model to a large dataset


Scenario
You surmise that one reason for the underestimation of delay times could be that clustering a small set of
values might not give you enough information to make accurate predictions (a set of 18 data points is an
incredibly small amount of information on which to estimate values in a dataset containing millions of
rows). You decide to investigate this possibility by creating a regression model over a much larger dataset.
The main tasks for this exercise are as follows:

1. Fit a linear regression model to the entire dataset

2. Compare the results of the regression models

 Task 1: Fit a linear regression model to the entire dataset


1. Fit a linear regression model that describes how delay varies with departure time.

 Use the entire flight delay dataset.

 Include the variance-covariance matrix of the regression coefficients in the results.

2. Make predictions about the delay times. Use the same test dataset that you used to make predictions
for the cluster model.

 Task 2: Compare the results of the regression models


1. Create a scatter plot of the predicted delays for comparison with the earlier predictions and the actual
delays. Include a regression line. What do you notice about this graph?

2. Create a scatter plot that shows the difference between the actual and predicted delays for each
observation. How does this graph compare to that, based on the results of the previous regression
model?

3. Save the script as Lab6Script.R in the E:\Labfiles\Lab06 folder, and close your R development
environment.

Results: At the end of this exercise, you will have created a linear regression model using the entire flight
delay dataset, and tested predictions made by this model.

Question: This lab analyses the flight delay data to try and predict the answer to the
question, "How long will my flight be delayed if it leaves at ‘N’ o'clock?" The linear regression
analysis shows that, although it is nearly impossible to answer this question accurately for a
specific flight (the departure time is clearly not the only predictor variable involved in
determining delays), it is possible to generalize across all flights. What might be a better
question to ask about flight delays, and how could you model this to determine a possible
answer?

Module Review and Takeaways


In this module, you learned how to:

 Use clustering to reduce the size of a big dataset and perform further exploratory analysis.

 Fit data to linear and logit regression models, and use these models to make predictions.

Module 7
Creating and Evaluating Partitioning Models
Contents:
Module Overview 7-1

Lesson 1: Creating partitioning models based on decision trees 7-3

Lesson 2: Evaluating models 7-14

Lesson 3: Using the MicrosoftML package 7-24

Lab: Creating a partitioning model to make predictions 7-28

Module Review and Takeaways 7-34

Module Overview
Partitioning models, or tree-based models, are a type of supervised learning algorithm that can be used
for either regression or classification. Partitioning models with a response variable that is a factor are
known as classification trees; partitioning models with a continuous response variable are known as
regression trees.

The basic classification tree algorithm is relatively simple:


1. All observations in your data begin in one group or node (not to be confused with parallel compute
nodes).

2. All your predictor variables are sorted and examined to find a split (partition) that best separates out
the classes. In a binary classification tree, each split produces two nodes.

3. The nodes are examined to see if they should be partitioned again.

4. If the algorithm determines that a node should not be partitioned further, this becomes a terminal
node and all the observations in it are classified as belonging to the same group.

Despite their simplicity, partitioning models have some important advantages over linear model-based
approaches that make them preferable in some situations:

 A tree model is easily interpretable—you just need to look at the variable splits for each node.

 They don’t rely on a predefined scheme—instead, they use induction to generalize from the data to
build a knowledge model. This makes them more suited than linear models to predicting the results
from stochastic or semi-stochastic processes, such as stock market prices. They aim to learn what the
output of a process might be, given a set of inputs—rather than extrapolating results from a derived
set of mathematical equations.

 They are nonparametric and do not require assumptions of statistical normality or linearity of the
data. The variables do not require statistical transformations to be used effectively. Also, different data
types can be easily combined in the same model.

 Variables can be reused in different parts of the tree. This means that complex interactions between
predictor variables can be learned from the data—they do not need to be prespecified as they do in
linear models.

 Tree models are robust to outliers, which can be isolated in their own node so they do not affect the
remainder of the analysis.
However, partitioning models are not suited to every situation and have a number of disadvantages:

 If an individual tree is too small (if it has too few partitions), it can have a low prediction accuracy.

 If a tree is too large and complex, it might predict well on the training set but be prone to
overfitting—and so lose generality.

 Because they rely on sorting the data, which is a time-consuming process, they can be problematic
with very large datasets.

Objectives
In this module, you will learn how to:

 Use the three main partitioning models in the ScaleR™ package, and tune the models to reduce bias
and variance.

 Validate, visualize and make predictions from partitioning models.

 Use the MicrosoftML package for using advanced machine learning algorithms to create predictive
models.

Lesson 1
Creating partitioning models based on decision trees
The ScaleR package includes several functions for building and examining partitioning models. This lesson
describes how to use these functions, and how to tune your models to improve accuracy and speed.

Lesson Objectives
After completing this lesson, you will be able to:

 Describe the rxDTree algorithm.

 Explain the purpose of ensemble algorithms.

 Split data into training and test datasets.

 Create a partitioning model using decision trees, decision forests, and gradient boosted decision
trees.

 Tune a partitioning model to improve performance and reduce errors.

 Convert a decision tree model built using the ScaleR functions to an rpart model.

The rxDTree algorithm


The standard partitioning model in the ScaleR
package is rxDTree. This is a parallel
implementation of decision tree and regression
tree methods such as those found in the base R
rpart package. The type of model run is
determined by the type of the response variable;
you will get a classification tree with a factor
response variable and a regression tree with a
numeric response variable.
A decision tree algorithm generally needs to sort
all the continuous variables to determine where to
partition the data. However, sorting is a time-
intensive operation and can be prohibitive when dealing with large data. The rxDTree function solves this
problem by using histograms to get an approximate compact representation of the data. Because
histograms can be quickly constructed in parallel on a server or cluster, trees can be quickly built, even for
very large datasets. Essentially, each compute node or core gets only a subset of the observations of the
data, but has a view of the complete tree built so far. It builds a histogram from the observations it sees,
which compresses the data to a fixed amount of memory. This approximate description of the data is then
sent to a master that determines which terminal tree nodes to split, and how.

When you build a partitioning model, you need to strike a balance between minimizing complexity and
maximizing predictive power. You can do this by specifying the maximum number of bins in the
histograms:

 Using a larger number of bins enables a more accurate description of the data and reduces bias, at
the expense of the increased likelihood of overfitting.
 Using fewer bins reduces time complexity and memory usage, at the expense of predictive power.
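
For example, you can set the number of bins explicitly when you fit a tree. The following is a minimal
sketch; the dataset, response, and predictor names are placeholder assumptions, and the bin count of 200 is
purely illustrative:

Setting the number of histogram bins


# All names here are placeholders; maxNumBins controls the resolution of the histograms
treeModel <- rxDTree(outcome ~ x1 + x2 + x3,
                     data = trainingSet,
                     maxNumBins = 200)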

For more information on the rxDTree algorithm, see:

Estimating decision tree models


https://aka.ms/m8lvqs

Ensemble algorithms
The ScaleR package provides two more tree-based
algorithms that mean you can take advantage of
the benefits of tree models while reducing the
downsides associated with overfitting and
overgeneralizing. They are both ensemble methods,
which means that they effectively generate many
individual tree models and combine the results.
However, the two methods take very different
approaches to combining individual trees.

Because ensemble models combine many individual trees, they are far more difficult to interpret intuitively, but their predictive power is often much higher than single decision trees.

The rxDForest algorithm: decision forest models


This algorithm implements a decision forest, which produces many individual decision trees (or regression
trees, depending on the response variable type), each fitted using the rxDTree algorithm. Each tree is
grown on a bootstrap replicate of some proportion (typically around a third) of the full dataset. The
results from the trees are aggregated together and the final prediction is an average of the predictions of
the individual trees. This technique is known as bagging, or bootstrap aggregating.

The individual trees in a decision forest have many nodes, a low bias, a high chance of overfitting, and
there is a large variance between trees. The trees each overfit the data in a different way and the bagging
process reduces error by decreasing the variance, effectively averaging out these differences.

Because the trees in a decision forest are bootstrap replicates that are not dependent on each other, the
tree growing phase of the algorithm becomes an embarrassingly parallel problem. However, individual
trees take a long time to process because they are large, although the histogram-based sorting method
can run across nodes.

For more information on the rxDForest algorithm, see:

Estimating decision forest models


https://aka.ms/ikc0yy

The rxBTrees algorithm: stochastic gradient boosted trees


The rxBTrees algorithm also produces many rxDTree trees, but it operates in a very different way from
the rxDForest algorithm. Rather than growing many trees in parallel and averaging them, boosting builds
on shallow trees with few nodes, each having high bias but low variance between them. New trees are
added one at a time, so the next tree is trained to boost the already trained ensemble. At each iteration, a
regression tree is effectively fitted to the residuals of the previous tree. This reduces error by decreasing
the bias in the ensemble.

Individual trees are added sequentially, so boosting is not an embarrassingly parallel problem. However,
the time to run each individual tree model is generally less because the trees are shallower.

Gradient boosted trees have more hyperparameters (parameters controlling the fit of the model) that
need tuning than forest models—and are also more likely to overfit the training data. However, with
careful tuning, they can give better predictions.

For more information on the rxBTrees algorithm, see:

Estimating models using stochastic gradient boosting


https://aka.ms/a4w74q

Splitting data into training and test sets


All partitioning models are supervised learning
algorithms—you need to provide them with a
training set that contains cases with class labels.
You then make predictions on a different set of
data—the test set. The difference in the
predictions for the test and the training set
indicates the level of overfitting.

Note: The examples in this lesson use the diamonds dataset from the ggplot2 package that
was also used in Module 6: Clustering and Regression Models. In this example, diamonds are
categorized as high value (worth $4,000 or more) and low value (worth less than $4,000). The
example uses the partitioning models available in RevoScaleR to try and predict the category
where a diamond should belong, given the quality and size attributes (for example, cut, clarity,
carat, color).

To prepare your data for analysis, use the rxDataStep function to:

1. Randomly assign the data to a training and test set. In this example, approximately 5 percent of the
data will be assigned to the test set and the remainder will be used to train the models.

2. Create a binary factor named value, indicating whether a diamond is high value (>=$4,000) or low
value (<$4,000).

3. Retain the columns required by the analysis (cut, clarity, carat, color), and drop the others.

The following code shows an example.

Preparing the data for analysis


diamondData <- rxDataStep(inData = ggplot2::diamonds,
transforms = list(
set = factor(ifelse(runif(.rxNumRows) >= 0.05, "train",
"test")),
value = factor(ifelse(price >= 4000, "high", "low"))),
varsToKeep = c("cut", "clarity", "carat", "color"))

You can then split the data into the test and training subsets, based on the set variable.

This example creates two data frames, because the dataset is relatively small; for a large dataset, you
should generate XDF files.

Splitting the data into the test and training datasets


diamondDataList <- rxSplit(diamondData, splitByFactor = "set")

You can see exactly how many diamonds are in each set as follows:

Counting the number of rows in each dataset


lapply(diamondDataList, nrow)

# Typical output
# $diamondData.set.test
# [1] 2678
#
# $diamondData.set.train
# [1] 51262

Creating a partitioning model


The next step is to fit a partitioning model to the
training dataset to investigate the effect of cut,
carat (weight), color and clarity on whether a
diamond is classified as high or low value.
Depending on your requirements, you can choose
to fit a decision tree model, a decision forest
model, or a gradient boosted tree model.

Fitting a decision tree model


You fit a decision tree model with the rxDTree
function, as follows:

Fitting a DTree model


diamondDTree <- rxDTree(value ~ cut + carat + color + clarity,
data = diamondDataList$diamondData.set.train,
maxDepth = 4)

The formula input to the model is the same as for a standard linear model in R. The maxDepth argument
determines how deep you allow the tree to grow.

This example limits maxDepth to 4, which means that the tree has a maximum node depth of four
levels. Note that the rxDTree function automatically fits a decision tree because the dependent variable,
value, is a factor. If the dependent variable was continuous, then the rxDTree function would fit a
regression tree model instead.

You can view the results of the rxDTree function like this:

Viewing the DTree model


diamondDTree

# Typical results
# Call:
# rxDTree(formula = value ~ cut + carat + color + clarity, data =
diamondDataList$diamondData.set.train,
# maxDepth = 4)
# Data: diamondDataList$diamondData.set.train
# Number of valid observations: 51262
# Number of missing observations: 0

# Tree representation:
# n= 51262

# node), split, n, loss, yval, (yprob)


# * denotes terminal node

# 1) root 51262 18377 low (0.358491670 0.641508330)


# 2) carat>=0.97 18099 1168 high (0.935466048 0.064533952)
# 4) clarity=SI2,SI1,VS2,VS1,VVS2,VVS1,IF 17595 902 high (0.948735436 0.051264564)
*
# 5) clarity=I1 504 238 low (0.472222222 0.527777778)
# 10) carat>=1.3 238 20 high (0.915966387 0.084033613) *
# 11) carat< 1.3 266 20 low (0.075187970 0.924812030) *
# 3) carat< 0.97 33163 1446 low (0.043602810 0.956397190)
# 6) carat>=0.895 2589 1162 low (0.448821939 0.551178061)
# 12) clarity=SI1,VS2,VS1,VVS2,VVS1,IF 1753 703 high (0.598973189 0.401026811)
# 24) color=D,E,F,G,H 1444 422 high (0.707756233 0.292243767) *
# 25) color=I,J 309 28 low (0.090614887 0.909385113) *
# 13) clarity=I1,SI2 836 112 low (0.133971292 0.866028708) *
# 7) carat< 0.895 30574 284 low (0.009288938 0.990711062) *

The first few lines show the details of the model formula and data. You can then see the representation of
the tree itself and the way in which the branches correspond to the various decision points:

 Diamonds of more than 0.97 carats are high value, except for the lowest clarity diamonds—these
need to be greater than 1.32 carats to be high value.

 Diamonds between 0.895 and 0.97 carats are high value if they are not low clarity or low color
quality—otherwise they are low value.

 Diamonds less than 0.895 carats are low value.

Fitting a decision forest model


You fit a decision forest model in the same way as a decision tree model, but you use the rxDForest function instead:

Fitting a DForest model


diamondDForest <- rxDForest(value ~ cut + carat + color + clarity,
data = diamondDataList$diamondData.set.train,
maxDepth = 4,
nTree = 50, mTry = 2, importance = TRUE)

This example sets the maxDepth to the same value as for the decision tree model in the previous
example. The mTry argument is the most important hyperparameter in a forest model; it determines the
number of variables you want to consider for each split.

The default is the square root of the number of predictor variables. The nTree argument specifies the number of trees in the forest. The importance argument determines whether importance values for the predictor variables
should be calculated. These are useful for visualization—you will learn about this in the next lesson.

Note that building a decision forest model consumes significantly more resources than a decision tree
because it is effectively creating many trees behind the scenes.

You can view the results as follows:

Viewing a DForest model


diamondDForest

# Typical results
# Call:
# rxDForest(formula = value ~ cut + carat + color + clarity, data =
diamondDataList$diamondData.set.train,
# maxDepth = 4, nTree = 50, mTry = 2, importance = TRUE)

# Type of decision forest: class


# Number of trees: 50
# No. of variables tried at each split: 2

# OOB estimate of error rate: 3.62%


# Confusion matrix:
# Predicted
# value high low class.error
# high 17871 506 0.02753442
# low 1352 31533 0.04111297

You can see that the out-of-bag (OOB) error rate is 3.62 percent. This means that 3.62 percent of the training cases were misclassified when they were predicted using only the trees that did not include them in their bootstrap samples. The confusion matrix below this figure shows where the errors were made.
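
As a quick check, you can reproduce this figure from the counts in the confusion matrix shown above, by
dividing the misclassified counts by the total number of training observations:

Checking the OOB error rate against the confusion matrix


# Misclassified cases divided by the total number of observations
(506 + 1352) / (17871 + 506 + 1352 + 31533)
# approximately 0.0362, matching the reported OOB error rate of 3.62 percent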

Fitting a gradient boosted tree model


You use the rxBTrees function to fit a gradient boosted tree model:

Fitting a BTree model


diamondDBoost <- rxBTrees(value ~ cut + carat + color + clarity,
data = diamondDataList$diamondData.set.train,
lossFunction = "bernoulli",
maxDepth = 3,
learningRate = 0.4,
mTry = 2,
nTree = 50)

You specify the lossFunction according to the response variable. For a binary factor response variable,
you should use “bernoulli”. The learningRate (or “shrinkage”) is a variable that helps reduce overfitting.
Lower values are less likely to overfit but require more trees in the model to provide a good prediction.

The resulting model looks like this:

Viewing the BTree model


diamondDBoost

# Typical results:
# Call:
# rxBTrees(formula = value ~ cut + carat + color + clarity, data =
diamondDataList$diamondData.set.train,
# maxDepth = 3, nTree = 50, mTry = 2, lossFunction = "bernoulli",
# learningRate = 0.4)

# Loss function of boosted trees: bernoulli


# Number of boosting iterations: 50
# No. of variables tried at each split: 2

# OOB estimate of deviance: 0.1154285

For these parameters, the OOB estimate of deviance is approximately 0.115. Lower deviance values indicate a better fit, so you can use this figure to compare boosted tree models built with different tuning parameters.

Tuning a partitioning model


You can tune your model to achieve the following goals:

1. Improve predictive power

2. Reduce overfitting
3. Reduce computation time

Different models will require tuning in different ways. For example, decision trees are much
quicker to fit but are liable to overfitting, while
forest models are much less likely to overfit but
can run for a long time.

Different models provide different tuning parameters.

Tuning parameters for decision trees


Decision trees include the following tuning parameters:

 xVal: this controls the number of folds used to perform cross-validation. The default is two folds.

 maxDepth: this sets the maximum depth of any node of the tree. This is the most intuitive way to
control the size of your trees. Computation becomes more expensive very quickly as the depth
increases. A deeper tree will have a reduced bias but the variance between trees will be greater.
Deeper trees are also more likely to overfit the training data.

 maxNumBins: this controls the maximum number of bins used for sorting each variable. The default
is to use whatever is the larger—101 or the square root of the number of observations. For very large
datasets, you might need to set this higher.

 cp: this is a complexity parameter and sets a limit for how much a split must reduce the complexity
before being accepted. You can use this to control the tree size instead of maxDepth.

 minSplit: this determines how many observations must be in a node before a split is attempted.

 minBucket: this determines how many observations must remain in a terminal node.
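
The following sketch shows how several of these parameters might be set explicitly in a single call. It
assumes the diamond training data used earlier in this lesson, and the parameter values themselves are
purely illustrative:

Setting tuning parameters explicitly


# Illustrative tuning values only
tunedTree <- rxDTree(value ~ cut + carat + color + clarity,
                     data = diamondDataList$diamondData.set.train,
                     maxDepth = 6,    # limit the depth of the tree
                     minSplit = 50,   # require 50 observations in a node before splitting it
                     minBucket = 20,  # require at least 20 observations in each terminal node
                     cp = 0.001,      # reject splits that do not reduce complexity by this factor
                     xVal = 4)        # use four-fold cross-validation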

In practice, you can often leave a lot of these parameters at their default values. For relatively small
datasets, you could visualize the scree plot of the complexity change with tree size, and then prune the
tree to the point at the elbow of the plot.

The following example illustrates this technique using the plotcp function (this function plots the
complexity parameter table for a model fit). The plot shows the line flattening at a cp value of
approximately 0.0025, so the tree is pruned at this point. After this point, the splits have little importance
because they do not add much to the model, and retaining them just makes the model more complex for
little reward:

Generating the complexity plot for a DTree model


dt1 <- rxDTree(value ~ cut + carat + color + clarity,
data = diamondDataList$diamondData.set.train)
plotcp(rxAddInheritance(dt1))
dt2 <- prune.rxDTree(dt1, cp=0.0025)

A typical result for this model looks like this.



Tuning ensemble methods


rxDForest and rxBTrees have further hyperparameters that you can tune, in addition to the parameters
for individual trees:

 mTry: this is the main hyperparameter for forest models and controls how many variables contribute
to each split in a tree.

 nTree: this controls the number of trees in the forest. Theoretically, the higher the better, but the
improvements reduce after a certain number and it’s computationally costly to run more trees.

 learningRate: this is important for controlling overfitting in boosting models. A value around 0.1 will
do this but more trees will be required to predict successfully.

 Weighting: If your dataset is unbalanced (that is, one response level is much more common than
another) you can set higher weights to the less common level to “oversample” that level, relative to
the others. You can do this by either creating a weight variable and setting the pweights argument
to the name of the weighting variable, or passing a loss matrix to the parms argument. For more
information, see the documentation page in R.
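
For example, the following sketch refits the boosted tree model from earlier in this lesson with a lower
learning rate and a correspondingly larger number of trees; the specific values are illustrative only:

Adjusting the learning rate and number of trees


# A lower learning rate reduces overfitting; more trees compensate for the slower learning
tunedBoost <- rxBTrees(value ~ cut + carat + color + clarity,
                       data = diamondDataList$diamondData.set.train,
                       lossFunction = "bernoulli",
                       learningRate = 0.1,
                       nTree = 150,
                       maxDepth = 3)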

Interoperability with base R partitioning functions


If you have used the base R rpart partitioning
model functions before, you will have noticed the
similarity between this and the rxDTree function.
One major difference is that the rx* modelling
functions do not store the data in the objects
themselves, because they are designed to work
with very large datasets.

You might want to share your models with coworkers using computers not running the
Microsoft® R stack. To assist with this, you can use
the as.rpart function to convert an rxDTree object
to an rpart object:

Converting an rxDTree object to an rpart object


library(pmml)
rpTree <- as.rpart(diamondDTree)

You can also use the pmml function in the pmml package to convert models into a sharable, XML-based
PMML (Predictive Model Markup Language) format.
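
The following is a minimal sketch of exporting a model as PMML. It assumes the diamondDTree model
created earlier in this lesson, that the pmml package supports rxDTree objects as described above, and that
the XML package is available to write the result; the output file name is illustrative:

Exporting a model as PMML


library(pmml)
library(XML)
# Generate the PMML representation and write it to a file (the file name is illustrative)
diamondPmml <- pmml(diamondDTree)
saveXML(diamondPmml, file = "diamondDTree.pmml")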

For more information, see:

Converting RevoScaleR model objects to use PMML


https://aka.ms/fg4nat

Demonstration: Building partitioning models


This demonstration walks through the example described in the notes. The purpose of the demonstration
is to build a partitioning model that can be used to predict the value of a diamond given its cut, clarity,
carats, and color. The dataset is initially categorized into diamonds of high value and diamonds of low
value. High value diamonds are those with a price of $4,000 or more. The dataset is then trimmed so that
only the cut, clarity, carat, and color variables remain. This demonstration fits three models to the data: a
DTree, a DForest, and a BTree.

Demonstration Steps

Categorize the data


1. Open your R development environment of choice (RStudio or Visual Studio®).

2. Open the R script Demo1 - partitioning.R in the E:\Demofiles\Mod07 folder.

3. Highlight and run the code under the comment # Examine the diamond data. This code displays
the first 20 rows in the diamonds dataset. Notice that each diamond has variables that include the
number of carats, the cut, the color, and the clarity, together with other attributes that concern the
geometry of a diamond.

4. Highlight and run the code under the comment # Generate a dataset containing just the columns
required. This code uses the rxDataStep function to:

 Add a factor variable named value with the values high and low. Diamonds are categorized
according to their price.

 Add a factor variable named set with the values train and test. Note that 95 percent of the data
is selected at random and placed into the train category; the remainder is placed in the test
category.
 Remove variables not required by the model, leaving only cut, clarity, carat, and color.

5. Highlight and run the code under the comment # Divide the dataset into training and test data.
This code uses the rxSplit function to separate out the observations in the dataset according to the
value of the set category. The result is two datasets named Diamonds.set.test.xdf and
Diamonds.set.train.xdf.

Fit models over the training data


1. Highlight and run the code under the comment # Fit a DTree model. This code uses the rxDTree
function to fit the model to the training data. The formula specifies that the value depends on the cut,
carat, color, and clarity of a diamond. The tree has a maximum node depth of four levels.

2. Highlight and run the code under the comment # Show the results. This code displays the model fit.
The top level node summarizes the split into high and low value diamonds. The subsequent nodes
show how the decision to classify diamonds was made, based on the other variables.

3. Highlight and run the code under the comment # For comparision, fit a DTreeForest model. This
code uses the rxDForest function to fit the model to the training data. The forest generates 50
decision trees and takes significantly longer to run in consequence. The results display how the data
in the value field of each diamond compares to that generated by the model.

4. Highlight and run the code under the comment # … and a BTree model. This code uses the
rxBTrees function to fit the model to the training data. Again, this model generates 50 trees and
takes a while to run. The results don't include specific details of the categorization, but rather display
the error rate for the recorded value of a diamond compared to its assessed value, in terms of the
deviance.

Prune the decision tree to reduce complexity


1. Highlight and run the code under the comment # Assess the complexity of the DTree model. This
generates a complexity plot of the model that is displayed as a scree plot. The point at which the plot
flattens out shows where increasing the complexity adds little value, so you should remove these
elements from the decision tree to retain efficiency.

2. Highlight and run the code under the comment # Prune the tree to remove unnecessary
complexity from the model. This code removes the least important splits from the tree, based on
the value selected from the scree plot.
3. Leave your R development environment open.

Check Your Knowledge


Question

When might you consider constructing a partitioning model rather than a linear model, to make
predictions?

Select the correct answer.

If the dataset is too small to build a linear model.

If the dataset is too large to build a linear model.

If the relationship between the predictor variables and the dependent variable is non-linear.

To avoid the overhead of sorting a large dataset first.

If the data in the dataset is not uniformly distributed.



Lesson 2
Evaluating models
Once you have trained your model and tuned the parameters to optimize the fit, you will want to run it
on your test dataset. This will give a truer picture of the predictive power of the model and whether it is
overfitting. You want to have a model that is general enough to perform well on your test set, but has
enough predictive power to be useful.

Lesson Objectives
In this lesson, you will learn how to:

 Use a model to run predictions.

 Test the predictions generated by a partitioning model.

 Visualize the trees behind your decision models.

Running predictions
Running predictions from partitioning models
works in an almost identical way to running
predictions from linear models. For more details,
see Module 6: Creating and Evaluating Regression
Models. You:
1. Construct a data frame containing the
different combinations of predictor variables
you want to predict the response for. You will
often get this from the test dataset. Note that
this data must include all the predictor
variables in the original model. Use the
rxDataStep function to select the variables in
a large dataset.

2. Use the function rxPredict to generate your predictions. This function takes the dataset you want to
predict for and the original model object. Note that predictions on rxDForest models can take
considerably longer than predictions on rxDTree models for large datasets.

Running predictions using an rxDTree model


This example uses the model and data from the previous lesson.

First, construct the data frame from the test data, then run the predictions. The following code shows both
steps:

Running predictions against the DTree model


predictData <- rxDataStep(diamondDataList$diamondData.set.test,
varsToKeep = c("cut", "carat", "color", "clarity"))
pDTree <- rxPredict(diamondDTree, predictData, type = "class")
pDForest <- rxPredict(diamondDForest, predictData, type = "class")

Note that you can supply “class” to the type argument to specify that you want the actual class
predictions, rather than the probability of class membership.
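
If you want those probabilities instead, you can request them from the decision tree model. The following
minimal sketch reuses the objects created above; pDTreeProb is just an illustrative variable name:

Returning class probabilities rather than class labels


# Return the probability of membership of each class, rather than the predicted class itself
pDTreeProb <- rxPredict(diamondDTree, predictData, type = "prob")
head(pDTreeProb)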

For the full list of arguments to these functions, see the R documentation for rxPredict.rxDTree and
rxPredict.rxDForest.
You use the results generated by the rxPredict function to test the accuracy of the predictions made by
the model against the training data.

Testing predictions
After you have constructed your predictions, you
can investigate how well your model has
performed on the test data.

Your first test is to summarize the value variable in the test dataset (this is the quantity that the model
is predicting) against the actual predicted value
from the model:

Summarizing the response variable in the predictions
rxSummary(~ value, data = diamondDataList$diamondData.set.test)

# Typical results
# Call:
# rxSummary(formula = ~value, data = diamondDataList$diamondData.set.test)

# Category Counts for value
# Number of categories: 2
# Number of valid observations: 2678
# Number of missing observations: 0

# value Counts
# high 1003
# low 1675

rxSummary(~ value_Pred, data = pDTree)

# Typical results:
# Call:
# rxSummary(formula = ~value_Pred, data = pDTree)

# Category Counts for value_Pred
# Number of categories: 2
# Number of valid observations: 2678
# Number of missing observations: 0

# value_Pred Counts
# high 1059
# low 1619

In this example, the data in the test dataset revealed that 1,003 diamonds were high value and 1,675 were
low value. The decision tree predicted that 1,059 of the diamonds in this dataset would be high value and
1,619 would be low value.

You can also compare the totals grouped by the levels in the predictor variables:

Grouping totals by combinations of the predictor variables


predictData1 <- rxMerge(predictData, pDTree, type = "oneToOne")

rxSummary(~ value : (color + clarity + F(carat)),
          data = diamondDataList$diamondData.set.test)

# Typical results:
rxSummary(formula = ~value:(color + clarity + F(carat)), data =
diamondDataList$diamondData.set.test)

Summary Statistics Results for: ~value:(color + clarity + F(carat))


Data: diamondDataList$diamondData.set.test
Number of valid observations: 2678

Category Counts for value


Number of categories: 14
Number of valid observations:
Number of missing observations:

value color Counts


high D 87
low D 230
high E 124
low E 359
high F 159
low F 311
high G 201
low G 357
high H 225
low H 212
high I 133
low I 137
high J 74
low J 69

Category Counts for value


Number of categories: 16
Number of valid observations:
Number of missing observations:

value clarity Counts


high I1 13
low I1 17
high SI2 234
low SI2 222
high SI1 294
low SI1 369
high VS2 219
low VS2 379
high VS1 142
low VS1 287
high VVS2 60
low VVS2 192
high VVS1 31
low VVS1 154
high IF 10
low IF 55

Category Counts for value


Number of categories: 8
Number of valid observations:
Number of missing observations:

value F_carat Counts


high 0 84
low 0 1611
high 1 803
low 1 64
high 2 113
low 2 0
high 3 3
low 3 0

rxSummary(~ value_Pred : (color + clarity + F(carat)), data = predictData1)

# Typical results:
Rows Read: 2678, Total Rows Processed: 2678, Total Chunk Time: 0.009 seconds
Computation time: 0.014 seconds.
Call:
rxSummary(formula = ~value_Pred:(color + clarity + F(carat)),
data = predictData1)

Summary Statistics Results for: ~value_Pred:(color + clarity +


F(carat))
Data: predictData1
Number of valid observations: 2678

Category Counts for value_Pred


Number of categories: 14
Number of valid observations:
Number of missing observations:

value_Pred color Counts


high D 83
low D 234
high E 124
low E 359
high F 162
low F 308
high G 215
low G 343
high H 247
low H 190
high I 144
low I 126
high J 84
low J 59

Category Counts for value_Pred


Number of categories: 16
Number of valid observations:
Number of missing observations:

value_Pred clarity Counts


high I1 13
low I1 17
high SI2 268
low SI2 188
high SI1 322
low SI1 341
high VS2 223
low VS2 375
high VS1 142
low VS1 287
high VVS2 57
low VVS2 195

high VVS1 27
low VVS1 158
high IF 7
low IF 58

Category Counts for value_Pred


Number of categories: 8
Number of valid observations:
Number of missing observations:

value_Pred F_carat Counts


high 0 88
low 0 1607
high 1 855
low 1 12
high 2 113
low 2 0
high 3 3
low 3 0

You should calculate the percentage prediction accuracy. This is the percentage of test cases that were
accurately predicted by the model:

Calculating the percentage accuracy of predictions


sum(pDTree$value_Pred == diamondDataList$diamondData.set.test$value)/
length(diamondDataList$diamondData.set.test$value)

# Typical results
[1] 0.961165

The result from the last line shows that the model has a 96.1 percent prediction accuracy on the test data.
You might also want to calculate the error rate, based on the training set, and compare this to the error
rate on the test set. A large difference between the two would indicate overfitting against the training
data.
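
The following sketch shows one way to make that comparison, reusing the objects created earlier in this
module; the variable names introduced here are illustrative:

Comparing prediction accuracy on the training and test sets


# Accuracy on the test set, as calculated above
testAccuracy <- sum(pDTree$value_Pred == diamondDataList$diamondData.set.test$value) /
  length(diamondDataList$diamondData.set.test$value)

# Accuracy on the training set, for comparison
pTrain <- rxPredict(diamondDTree, diamondDataList$diamondData.set.train, type = "class")
trainAccuracy <- sum(pTrain$value_Pred == diamondDataList$diamondData.set.train$value) /
  length(diamondDataList$diamondData.set.train$value)

# A training accuracy that is much higher than the test accuracy suggests overfitting
c(train = trainAccuracy, test = testAccuracy)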

Visualizing trees
An intuitive way to understand a partitioning
model is to visualize the tree itself. The
RevoTreeView package can be used to plot
rxDTree decision or regression trees as an
interactive HTML page that you can view in a
browser. You can also share the HTML page with
other people or display it on different machines
using the zipTreeView function.

Viewing trees in a browser


You can pass the result of the createTreeView function to the plot function to display a decision tree:

Visualizing a decision tree in a web browser


library(RevoTreeView)
plot(createTreeView(diamondDTree))

This command will produce an HTML representation of the decision tree model from the last lesson,
and then open it in the browser. You can interact with the plot by clicking on the nodes to expand the
next nodes out.

If you hover your mouse pointer over a node, you will see the number of cases at that split and the rule
that defines the split. The terminal nodes are labeled with the predicted classes.

Producing tree plots


You might want to produce standard line plots of your tree models for reports and presentations. The
functions to do this are provided by the rpart package. You need to use the rxAddInheritance function
to provide rpart inheritance:

Generating line plots of a DTree model


library(rpart)
plot(rxAddInheritance(diamondDTree))
text(rxAddInheritance(diamondDTree))

In the example, the results look like this:

Visualizing importance plots for forest models


Unlike single trees, forest models cannot be visualized in a simple way because they are combinations of
many trees. However, a useful visualization for forest models is a dotchart of predictor variable
importance.
To do this, use the rxVarImpPlot function with an rxDForest model, as follows:

Generating the importance plot of variables in a DForest model


rxVarImpPlot(diamondDForest)

The results for the example look like this. You can see that the carat variable provides the most weight to
the decisions made by the trees in the model:

You might find these other utility functions useful for exploring rxDForest models:

 rxLeafSize: returns the size of the terminal nodes for the trees in the decision forest.

 rxTreeDepth: returns the depth of all trees in the decision forest.

 rxTreeSize: returns the number of nodes of all trees in the decision forest.

 rxVarUsed: returns how many times each predictor variable is used in the decision forest.

 rxGetTree: extracts a single decision tree from the forest. You can then examine this tree graphically
by using the createTreeView function in a plot, as described earlier in this topic.
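
For example, you could check how often each predictor is used across the forest, and then pull out a single
tree for closer inspection. This sketch assumes the diamondDForest model built earlier in this module:

Examining individual trees in a DForest model


library(RevoTreeView)
# How many times each predictor variable is used across the forest
rxVarUsed(diamondDForest)
# Extract the first tree from the forest and view it interactively in a browser
singleTree <- rxGetTree(diamondDForest, 1)
plot(createTreeView(singleTree))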

Demonstration: Running predictions against partitioning models


This demonstration continues the example started in the previous demonstration—the purpose is to use
the partitioning models to predict the value of diamonds.

Demonstration Steps

Run predictions and assess the results


1. Open the R script Demo2 - predicting.R in the E:\Demofiles\Mod07 folder.

2. Highlight and run the code under the comment # Create a copy of the test set without the value
and set variables. This code creates an in-memory data frame containing the data from the test set
but with the value and set variables removed.

3. Highlight and run the code under the comment # Predict the value of each diamond in the test
set using the DTree model. This code uses the DTree model to predict the value of each diamond in
the test set. The results are stored in an rxPredict object with a variable, named Value_Pred, for each
row in the test set. This variable will contain the value high or low. Note that, if you want to see the
probability of each decision, you can omit the type argument from the rxPredict function. You can
also compute residuals if necessary.

4. Highlight and run the code under the comment # Assess the results against the values recorded
in the test set. This code generates a summary of the split between high and low values in the test
data, and in the predicted results, for comparison. The predicted number of high and low values
should be within a few percent of the actuals.

5. Highlight and run the code under the comment # Repeat using the DForest model. This code uses
the DForest model to generate predictions using the test dataset, and assesses the accuracy of the
results. Ideally, these results should be closer to the actual values than those predicted by the DTree.
6. Highlight and run the code under the comment # Add the predicted value of each diamond to
the in-memory data frame. This code merges the test data used to generate the predictions with
the predicted values for each diamond.

7. Highlight and run the code under the comment # Compare the predicted results against the
actual values by variable. This code generates summaries of the original test data and the data
frame with the predicted results. You can browse this data to determine which factors lead to
discrepancies.

Visualize the models


1. Highlight and run the code under the comment # Visualize the DTree model using RevoTreeView.
This code uses the createTreeView function to display a representation of the tree in a web browser.
You can click the nodes in the tree to see how the classification of diamonds was made based on the
variables. Close the browser when you have finished.

2. Highlight and run the code under the comment # Generate a line plot of the DTree model. This
code uses the rpart library to display a line plot of the tree. The structure of the plot should mirror
that shown by using RevoTreeView.

3. Highlight and run the code under the comment # Show an importance plot from the DForest
model. This code generates a dotchart showing the importance of each variable in classifying
diamonds as used by the DForest model. The carat variable is clearly the most significant.

4. Close your R development environment.



Verify the correctness of the statement by placing a mark in the column to the right.

Statement Answer

You should always test the accuracy of predictions made by a model against the training dataset. True or False?

Lesson 3
Using the MicrosoftML package
The MicrosoftML package extends the machine learning capacity already provided in RevoScaleR. It adds
new algorithms and a set of data transform functions to increase the speed, performance and scalability
of your data science pipelines.

Lesson Objectives
After completing this lesson, you will be able to:

 Explain the uses of the MicrosoftML package.

 Describe the transformation functions provided by the MicrosoftML package.


 Describe the algorithms available in the MicrosoftML package.

Overview of the MicrosoftML package


The MicrosoftML package is aimed at machine
learning. This is a form of artificial intelligence
based on predictions, decision-making, analysis,
and adaptation. You can build predictive models
based on large datasets and use them to enable
systems to make decisions about how to behave
or react to particular situations and circumstances.
The system can analyze the results, compare the actions taken with other possible outcomes, and feed this information back into the model so that its behavior adapts and it learns from this experience.

Machine learning is an emerging technology. It often requires access to significant processing power and large datasets to improve the accuracy of the
knowledge that a system gains, and is frequently used with the Microsoft Azure ML cloud service.

For more information, see:

Azure Machine Learning


https://aka.ms/myrv3c

The MicrosoftML package is also available in both R Client and R Server, and the functions are designed
to complement the RevoScaleR package.

You can use MicrosoftML for:

 Handling large bodies of text data.

 Working with high-dimensional categorical data.

 Training deep neural nets on GPU processors.


 Training support vector machine models.

 Providing faster classification and regression algorithms for very large datasets.

 Providing a set of common and useful data transformations that can be run on chunked data.

For more information, see:

Introduction to MicrosoftML
https://aka.ms/r89gc8

Data transformation functions


The MicrosoftML package provides a set of
transform functions that are particularly suited to
working with either high-dimensional categorical
data or text processing. They can be used to build
features from your dataset before splitting into
test and training sets, and running analyses. These
functions are commonly used as an input to the
mlTransforms argument in the MicrosoftML
machine learning algorithm functions.

The main transformation functions are:

 concat: this function combines several columns into a single vector-valued column. When you concat variables, you must ensure they are of the same type. Combining features in this way can significantly reduce the time taken to run a model, particularly when you have hundreds or even thousands of columns.
 categoricalHash: this function is useful for categorical variables with a large number of different
possible values—for example, text data. It converts the variable into an indicator array using hashing.
Hashed values are much quicker to look up than values in a standard categorical variable.

 categorical: this function converts a categorical value into an indicator array using a dictionary. It is
useful when the number of categories is smaller or fixed, and hashing is not required.

 selectFeatures: this transformation function selects features from the specified variables using one of
two modes: count or mutual information.

o The count feature selection mode selects a feature only if at least the specified number of examples have nondefault values for that feature. This mode is useful when applied together with a categorical hash transform.

o The mutual information feature selection mode selects the features based on the mutual
information. It keeps the top numFeaturesToKeep features with the largest mutual information
with the label. Mutual information is similar to a correlation and specifies how much information
you can get about one variable from another.

 featurizeText: This is a feature selection function that provides a wide range of tools to process text
data for analysis. It provides:

o Counts of n-grams (sequences of consecutive words) from a given text.

o Language detection (English, French, German, Dutch, Italian, Spanish and Japanese available as
default).

o Tokenization (splitting into words).



o Stopword removal (removal of common words not useful for analysis).

o Punctuation removal.

o Feature generation.

o Term weighting by term frequency and other methods.
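
The following sketch shows how one of these transforms might be supplied to a MicrosoftML learner
through the mlTransforms argument. It assumes the diamond training data used earlier in this module, and
the choice of learner, variables, and hash size is purely illustrative:

Passing a transform to a MicrosoftML learner


library(MicrosoftML)
# Hash the categorical predictors before training; the learner, the variables,
# and the hashBits value are illustrative choices only
mlModel <- rxFastTrees(value ~ carat + clarity + color,
                       data = diamondDataList$diamondData.set.train,
                       type = "binary",
                       mlTransforms = list(
                         categoricalHash(vars = c("clarity", "color"), hashBits = 6)))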

MicrosoftML algorithms
The MicrosoftML package provides a range of
machine learning algorithms, each suited to a
particular use case.

rxFastLinear
This algorithm is based on the stochastic dual
coordinate ascent (SDCA) method, a state-of-the-
art optimization technique for convex objective
functions. It is designed for both binary
classification (two classes—for example, spam
filtering) and linear regression analysis (for
example, predicting mortgage defaults) and
combines the advantages of both logistic
regression and SVM algorithms. The algorithm scales well on large, out-of-memory datasets and supports
multithreading. You can also use the SDCA algorithm to analyze potentially billions of rows and columns.
Because the algorithm involves a stochastic (random) element, you will not always get consistent results
from one run to the next, although the error between runs should be very small.

The rxFastLinear algorithm is not suitable for nonlinear regression problems.
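
As a minimal sketch, the following fits a binary classifier to the diamond training data used earlier in this
module, relying on the default hyperparameters; rxPredict can then score the test data in the usual way:

Fitting a binary classifier with rxFastLinear


library(MicrosoftML)
# Fit a binary classifier using the default SDCA settings
fastLinearModel <- rxFastLinear(value ~ carat + cut + color + clarity,
                                data = diamondDataList$diamondData.set.train)

# Score the test data; the names of the output columns follow MicrosoftML conventions
fastLinearScores <- rxPredict(fastLinearModel, data = diamondDataList$diamondData.set.test)
head(fastLinearScores)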

rxOneClassSvm
This algorithm is most useful for anomaly detection—that is, to identify outliers that don’t belong to a
target class. It’s a support vector machine (SVM) algorithm that attempts to find the boundary that
separates classes by as wide a margin as possible. rxOneClassSvm is a “one-class” algorithm because the
training set contains only examples from the target class. It infers the “normal” properties for the objects
in the target class and, from these properties, predicts whether cases are like the normal examples. This
works because typically there are very few anomalies. Anomaly detection is often used for network
intrusion, fraud, or to identify problems in long-running system processes. The data in anomaly detection
problems is not usually very large, and this algorithm is designed to run single threaded on in-memory
data.

rxFastTrees
This is a boosting algorithm, like the rxBTrees algorithm in the RevoScaleR package. The algorithm uses
an advanced sorting method that makes it faster, but it is limited to working with in-memory data.
However, it can be multithreaded to make use of multiple processors. It is suitable for datasets of up to
around 50,000 columns. It can run both binary classification trees and regression trees.

rxFastForest
Like rxFastTrees, the rxFastForest function is also optimized for in-memory modeling. It has similar
limitations and advantages. It implements the Random Forest algorithm and Quantile Regression
Forests.

rxNeuralNet
This function implements feed-forward neural networks for regression modeling and for binary and
multinomial (multiple classes) classification. Neural nets are inspired by the neural network in the brain,
which is composed of many interconnected, but independent neurons. The neurons in a neural net model
are arranged in layers, where neurons in one layer are connected by a weighted edge to neurons in the
next layer. The values of the neurons are determined by calculating the weighted sum of the values of the
neurons in the previous layer and applying an activation function to that weighted sum. A model is
defined by the number of layers, the number of neurons in each layer, the choice of activation function,
and the weights on the graph edges. The algorithm tries to learn the optimal weights on the edges based
on the training data.

Neural nets perform well when data structures are not well understood and for problems where standard
regression based models fail—such as check signature recognition and optical character recognition
(OCR). The rxNeuralNet algorithm is capable of “deep learning” over potentially millions of columns
and an effectively infinite number of rows of data. It can also run on multiple cores or even GPUs. The
trade-off for this power is that neural nets have many control parameters and can take a long time to
train.

rxLogisticRegression
This is a highly optimized logistic regression algorithm for binary and multinomial classification over large
datasets. The algorithm can either run single-threaded, in which case it can make use of out-of-memory data, or multithreaded, in which case all the data must be loaded into memory at once. It performs well for datasets up to
approximately 100 million columns of data. It uses linear class boundaries, so your data should, at least
approximately, meet the criteria of statistical normality.

For more information on the MicrosoftML algorithms, see:


Overview of Microsoft ML functions

https://aka.ms/syosce

Verify the correctness of the statement by placing a mark in the column to the right.

Statement Answer

The functions in the MicrosoftML package are only available to R Server running on VMs using Azure, due to the amount of processing power
involved. True or False?

Lab: Creating a partitioning model to make predictions


Scenario
In addition to departure times, you suspect that flight delays might also be caused by factors that are
dependent on the interactions between variables, such as the day of the week, the month of the year, the
carrier, and the departure or arrival airport. This could be due to regular fluctuations in the business
pattern of the working week, seasonal weather conditions, or contractual or political reasons. Perhaps an
airport has a bias to prefer flights made by one carrier over another, and will allocate resources such as
gate slots, loading and unloading facilities, and refueling stations more readily. You decide to create
partitioning models to test this hypothesis.

Objectives
In this lab, you will:

 Create a DTree partitioning model using the departure time, arrival time, month, and day of the week
as predictor variables, and use this model to predict delay times.

 Create a DForest model to see how this affects the quality of the predictions made.

 Create another DTree model based on a different set of predictor variables.

 Create a further DTree model that combines the variables from the previous models to judge the
effects on the accuracy of the predictions.

Lab Setup
Estimated Time: 90 minutes
 Username: Adatum\AdatumAdmin

 Password: Pa55w.rd

Before you start this lab, ensure that the following VMs are all running:
 MT17B-WS2016-NAT

 20773A-LON-DC

 20773A-LON-DEV

 20773A-LON-RSVR

 20773A-LON-SQLR

Exercise 1: Fitting a DTree model and making predictions


Scenario
Initially, you decide to focus on the day of the week and month, in addition to the departure and arrival
times, to see what these factors taken together can tell you about flight delays. You use a DTree model to
generate a set of predictions that you can use as a comparison with other models, and with decision trees
constructed using additional variables.

The main tasks for this exercise are as follows:

1. Copy the data to the shared folder

2. Split the data into test and training datasets

3. Create the DTree model

4. Make predictions using the DTree model



 Task 1: Copy the data to the shared folder


1. Log on to the LON-DEV VM as Adatum\AdatumAdmin with the password Pa55w.rd.

2. Copy the FlightDelayData.xdf file from the E:\Labfiles\Lab07 folder to the \\LON-RSVR\Data shared
folder.

 Task 2: Split the data into test and training datasets


1. Start your R development environment of choice (Visual Studio, or RStudio), and create a new R file.

2. Create a remote session on the LON-RSVR server. This is another VM running R Server. Use the
following parameters to the remoteLogin function:

 deployr_endpoint: http://LON-RSVR.ADATUM.COM:12800

 session: TRUE
 diff: TRUE

 commandLine: TRUE

 username: admin

 password: Pa55w.rd

3. Transform the contents of the flight delay data file as follows:

 Convert the DepTime and ArrTime variables from character to numeric.


 Add a factor variable named DataSet that contains the value "train" or "test", selected at
random; 95 percent of the observations should be marked as "train", and the remaining 5 percent
marked as "test".
 Retain the variables Delay, Origin, Month, DayOfWeek, DepTime, ArrTime, Dest, Distance,
UniqueCarrier, OriginState, and DestState from the original dataset. Discard the other
variables.

 Save the new dataset in a file named \\LON-RSVR\Data\PartitionedFlightDelayData.xdf.

4. Split the data in the PartitionedFlightDelayData.xdf file into separate test and training files using
the DataSet field.
5. Verify the number of observations in each file. The training file should contain approximately 19 times as many observations as the test file.

 Task 3: Create the DTree model


1. Fit a DTree model to the training data. Do not limit the number of levels in the DTree. Use the
following formula:

Delay ~ DepTime + ArrTime + Month + DayOfWeek

2. Examine the model and note the complex interplay between the variables that drive the decisions
made. If time allows, copy the DTree object to the local session running on R Client and use the
RevoTreeView package to step through the model. Return to the R Server session when you have
finished.

3. Use the plotcp function to generate a scree plot of the tree. Notice how the large number of
branches adds complexity, but does not necessarily improve the decision making process; this is an
overfit model.

4. Examine the cptable field in the model to ascertain the point of overfit—it will probably be around
seven levels of branching.
5. Prune the tree back to seven levels.

 Task 4: Make predictions using the DTree model


1. Extract a data frame from the test dataset that contains only the predictor variables DepTime,
ArrTime, Month, and DayOfWeek.

2. Run predictions against the variables in the data frame using the DTree model.

3. Summarize the results of the Delay_Pred variable in the predictions. Compare the statistics of the
Delay variable in the test dataset with these values. How close are the mean values of both datasets?

4. Merge the predicted values for Delay_Pred into a copy of the test dataset (not the data frame).
Perform a oneToOne merge.

5. Analyze the results and answer the following questions.

Note: You might find it useful to write the code that performs this analysis as a function
because you will be repeating it several times over different sets of predictions in subsequent
exercises.

Results: At the end of this exercise, you will have constructed a DTree model, made predictions using this
model, and evaluated the accuracy of these predictions.

Question: How many predicted delays were within 10 minutes of the actual reported delays?
What proportion of the observations is this?

Question: How many predicted delays were within 5 percent of the actual delays?

Question: How many predicted delays were within 10 percent of the actual delays?

Question: How many predicted delays were within 50 percent of the actual delays?

Exercise 2: Fitting a DForest model and making predictions


Scenario
Having established a baseline set of predictions using a DTree model, you decide to repeat the process
using a DForest model to see how the predictions change. You will use the same training and test datasets
as before.

The main tasks for this exercise are as follows:

1. Create the DForest model

2. Make predictions using the DForest model

3. Test whether overfitting of the model is an issue

 Task 1: Create the DForest model


 Fit a DForest model to the training data. Use the same formula as before. Restrict the depth of the
trees in the forest to seven levels.

 Task 2: Make predictions using the DForest model


1. Generate predictions for the delay times against the DForest model. Use the same data frame
containing the test data that you constructed in the previous exercise.

2. Merge the predicted values into a copy of the test data frame. Perform a oneToOne merge.

3. Analyze the results and answer the following questions.

Results: At the end of this exercise, you will have constructed a DForest model, made predictions using this model, and evaluated the accuracy of these predictions.

Question: How many predicted delays were within 10 minutes of the actual reported delays?
What proportion of the observations is this? How does this compare to the predictions made
using the DTree model?

Question: How many predicted delays were within 5 percent of the actual delays? How does
this compare to the predictions made using the DTree model?

Question: How many predicted delays were within 10 percent of the actual delays? How
does this compare to the predictions made using the DTree model?

Question: How many predicted delays were within 50 percent of the actual delays? How
does this compare to the predictions made using the DTree model?

Question: Was the DForest model more accurate at predicting delays than the DTree
model? What conclusions can you draw?

 Task 3: Test whether overfitting of the model is an issue


 Repeat the previous two tasks but set the maximum depth of the trees in the DForest to 5,
analyze the results, and then answer the following questions.

Question: Was the DForest model with a reduced depth more or less accurate than the
previous model? What conclusion do you reach?

Exercise 3: Fitting a DTree model with different variables


Scenario
You now decide to see whether factors such as the origin, destination, airline, and flight distance have any
bearing on delay times—you elect to construct another model using these variables.

The main tasks for this exercise are as follows:

1. Create the new DTree model

2. Make predictions using the new DTree model

 Task 1: Create the new DTree model


 Fit a DTree model to the training data. Restrict the depth of the tree to seven levels. Use the following
formula:

Delay ~ Origin + Dest + Distance + UniqueCarrier + OriginState + DestState

 Task 2: Make predictions using the new DTree model


1. Extract a data frame from the test dataset that contains only the predictor variables Origin, Dest,
Distance, UniqueCarrier, OriginState, and DestState.

2. Generate predictions for the delay times against the new DTree model. Use the data frame containing
the test data that you just constructed.

3. Merge the predicted values into a copy of the test data frame. Perform a oneToOne merge.

4. Analyze the results and answer the following question.



Results: At the end of this exercise, you will have constructed a DTree model using a different set of
variables, made predictions using this model, and compared these predictions to those made using the
earlier DTree model.

Question: How do the predictions made using the new set of predictor variables compare to
the previous set? What are your conclusions?

Exercise 4: Fitting a DTree model with a combined set of variables


Scenario
You have decided to construct one final model that includes all of the variables used previously. You will
then see whether including all of these variables improves the accuracy of delay predictions.

The main tasks for this exercise are as follows:

1. Create the DTree model

2. Make predictions using the DTree model

 Task 1: Create the DTree model


 Fit a DTree model to the training data. Restrict the depth of the tree to seven levels. Use the following
formula:

Delay ~ DepTime + Month + DayOfWeek + ArrTime + Origin + Dest + Distance + UniqueCarrier + OriginState + DestState

 Task 2: Make predictions using the DTree model


1. Extract a data frame from the test dataset that contains all of the variables except Delay.

2. Generate predictions for the delay times against the DTree model. Use the data frame containing the
test data that you just constructed.
3. Merge the predicted values into a copy of the test data frame. Perform a oneToOne merge.

4. Analyze the results and answer the following question.

Results: At the end of this exercise, you will have constructed a DTree model combining the variables
used in the two earlier DTree models, and made predictions using this model.

Question: What do the results of this model show about the accuracy of the predictions?

Module Review and Takeaways


In this module, you learned how to:

 Use the three main partitioning models in the ScaleR package, and tune the models to reduce bias
and variance.

 Validate, visualize and make predictions from partitioning models.

 Use the MicrosoftML package to apply advanced machine learning algorithms to create predictive
models.

Module 8
Processing Big Data in SQL Server and Hadoop
Contents:
Module Overview 8-1

Lesson 1: Integrating R with SQL Server 8-2

Lab A: Deploying a predictive model to SQL Server 8-13

Lesson 2: Using ScaleR functions with Hadoop on a Map/Reduce cluster 8-21

Lesson 3: Using ScaleR functions with Spark 8-30

Lab B: Incorporating Hadoop Map/Reduce and Spark functionality into the ScaleR workflow 8-35

Module Review and Takeaways 8-41

Module Overview
In this module, you will learn how to process big data by using R Server with Microsoft® SQL Server®,
and Hadoop. You will see how R Server is incorporated into SQL Server to enable you to analyze data held
in a database efficiently, making use of SQL Server resources. You will also learn how R Server can be used
to handle big datasets stored in HDFS by using Hadoop Map/Reduce and Spark functionality.

Objectives
In this module, you will learn how to:
 Use R in conjunction with SQL Server to analyze data held in a database.

 Incorporate Hadoop Map/Reduce functionality, together with Pig and Hive, into the ScaleR workflow.

 Utilize Hadoop Spark features in a ScaleR workflow.



Lesson 1
Integrating R with SQL Server
A key principle of using R is to move the processing close to the data that is being processed. In this way,
you reduce the overhead and memory requirements of relocating large datasets across networks and
hardware. A SQL Server database can act as a source for R data, holding massive amounts of data.
Microsoft have integrated R into SQL Server to enable you to perform R processing directly from the
database server. Using SQL Server R Services, you can create stored procedures that run R functions,
including ScaleR operations. These functions can have access to the data held in your databases. SQL
Server R Services takes advantage of the parallelism available with SQL Server to help maximize
throughput.

Lesson Objectives
In this lesson, you will learn:
 The features of SQL Server R Services.

 How to create a SQL Server compute context.

 How to access SQL Server data.


 How SQL Server R Services maps SQL Server datatypes to R.

 How to run R code from within SQL Server.

 How to store R objects in a SQL Server database.

What is SQL Server R Services?


SQL Server R Services provides an environment for
the secure execution of R scripts on the SQL Server
computer. It includes a version of R server,
optimized for use with SQL Server.
SQL Server R Services consists of a number of
extensions that are installed in the database engine
to enable you to run R code. It is aimed at using
SQL Server as the primary source of data, and so
abides by the convention that you should keep
your R processing close to the data upon which it
depends. SQL Server R Services includes a complete
distribution of the base R packages. However, SQL
Server R Services also provides the RevoScaleR package and the RevoPemaR package (amongst other
proprietary packages). You use these packages to perform scalable analytics and implement
embarrassingly-parallel tasks within SQL Server.

The SQL Server Trusted Launchpad manages security and communications. R tasks execute outside the
SQL Server process, to provide security and greater manageability. An additional service named BxlServer
(Binary Exchange Language Server) enables SQL Server to communicate efficiently with external processes
running R, and provides access to data in a SQL Server database to these external processes. Each R
request from the database is handled as a separate Windows® job that runs in its own R session.

The RxInSqlServer compute context also uses the BxlServer to enable you to run R code that is not
stored inside a SQL Server database, but that executes within the context of the SQL Server engine (and
can access data in a SQL Server database) from environments such as R Client and remote R Server sessions.
For more information about the new components in SQL Server that support R services, see:

Components in SQL Server to Support R


https://aka.ms/gwxcj3

You can run R code in SQL Server either by setting the RxInSqlServer compute context from an R client
session, or directly from SQL Server by using stored procedures. In all cases, the SQL Server Trusted
Launchpad verifies that the process running the R code has the appropriate privileges and access rights to
the data that it uses. Depending on how SQL Server is configured, you can connect by using SQL Server
authentication (a SQL Server login and password), or you can utilize Windows authentication. Additionally,
the user or account must be granted the right to execute external stored procedures. For more
information about how SQL Server Trusted Launchpad manages security, see:

Security Overview (SQL Server R Services)


https://aka.ms/tcmtyi

Accessing SQL Server data


The topic Working with SQL Server data sources in
Module 2: Exploring Big Data, provides an overview
showing how to query and modify data held in a
SQL Server database using ScaleR functions. To
recap, you access data by creating a data source
object. You can use either of the following:

 RxSqlServerData. This is the preferred type of data source to use for SQL Server. The parameters to
this function specify the server to which you connect, the database to use, login information, and
other connection parameters. You can use this function to connect directly to a table, or you can
reference a SQL query that retrieves data.

 RxOdbcData. This is a more generic data source for accessing data through the ODBC interface by
using the RODBC package. It is less optimal than the rxSqlServerData data source for retrieving SQL
Server data, but is currently the only option available if you need to query data held in Azure SQL
Database.

In both cases, you must provide the details of the SQL Server connection by using a connection string that
specifies the address of the database server, the database to connect to, and logon or security information
that identifies the SQL Server account to use.

Having established a connection to SQL Server through a data source, you can then read and write data
using R functions. Note that not all R functions are supported; for example, you can use head to display
the first few rows from a table or query, but the tail function is not available.

The following example shows how to create a connection to the flightdelaydata table in a SQL Server
database named FlightDelays, and then use this connection to display the first few rows from the table.

Retrieving data from SQL Server


sqlConnString <- "Driver=SQL Server;Server=LON-
SQLR;Database=FlightDelays;Trusted_Connection=Yes"
connection <- RxSqlServerData(connectionString = sqlConnString,
table = "dbo.flightdelaydata", rowsPerRead = 1000)
head(connection)

# Results
Origin Year Month DayofMonth DayOfWeek DepTime CRSDepTime ArrTime CRSArrTime …
1 ORD 2000 2 10 4 557 600 843 854
2 ORD 2000 2 20 7 1718 1700 2009 1958
3 ORD 2000 2 15 2 1646 1650 1929 1934
4 ORD 2000 2 23 3 905 906 1153 1146
5 ORD 2000 2 28 1 817 820 1032 1043
6 ORD 2000 2 3 4 556 600 845 854

Remember that you can also use the following functions that operate directly on tables within a SQL
Server database:

 rxSqlServerDropTable. This function removes a table (and its contents) from a SQL Server database.

 rxSqlServerTableExists. Use this function to test whether a specified table exists in a SQL Server
database.

 rxExecuteSQLDDL. This function performs Data Definition Language (DDL) operations, enabling you
to perform operations such as creating new tables.

Creating a SQL Server compute context


You use the RxInSqlServer compute context to run
code directly within the SQL Server database
engine. Using this compute context, you can exploit
the computing power on which the database
engine is hosted. The SQL Server host environment
could be running locally as part of an on-premises
operation, but it could also be running in the cloud.
Note that SQL Server R Services are not currently
available for Azure SQL Database. If you need to
access data held in this environment, you should
use the RxLocalSeq compute context with an
RxOdbcData data source that connects to the
database, as described in the previous topic.

Note: The RxInSqlServer compute context only supports the RxSqlServerData data
source. You cannot access text files, XDF files, or files held in an HDFS file system in this compute
context.

The following example shows how to connect to SQL Server R Services using an RxInSqlServer compute
context object:

Connecting to SQL Server R Services


sqlConnString <- "Driver=SQL Server;Server=LON-
SQLR;Database=FlightDelays;Trusted_Connection=Yes"
sqlWait <- TRUE
sqlConsoleOutput <- FALSE

sqlCompute <- RxInSqlServer(


connectionString = sqlConnString,
wait = sqlWait,
consoleOutput = sqlConsoleOutput)

rxSetComputeContext(sqlCompute)

In this case, the code connects to a database named FlightDelays hosted by a SQL Server instance
running on a server named LON-SQLR. The connection utilizes Windows authentication. The
rxSetComputeContext function in this example switches your session to run within the context of the
SQL Server database. At this point, your R code runs using SQL Server R Services on the database server.
The code that you run is performed as a series of Windows jobs, managed by SQL Server. This approach
helps to prevent SQL Server from being overwhelmed by a sudden influx of work. You can monitor these
jobs using the sp_help_jobactivity stored procedure in SQL Server. The wait parameter indicates whether
your code blocks when the job is created, waiting until the job has finished and returned any results,
or whether each request is simply queued and your session is allowed to continue. These two modes are
known as waiting and nonwaiting.
If the wait parameter is false, each request is given a unique identifier, and it is your responsibility to
check the status of the job and retrieve the results. To find the status of a running nonwaiting job, you can
call rxGetJobStatus with the job identifier. You can obtain the identifier for the most recent job from the
rxgLastPendingJob variable. To retrieve the results of a finished nonwaiting job, you can call
rxGetJobResults with the job identifier as the argument.

To cancel a nonwaiting job, use the rxCancelJob function with the job identifier as the argument.
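
The following sketch illustrates the nonwaiting pattern. It reuses the sqlConnString connection string and
the connection data source shown earlier in this lesson; the polling loop and the other variable names are
illustrative:

Running a nonwaiting job in the SQL Server compute context

# Create a nonwaiting SQL Server compute context
sqlComputeNoWait <- RxInSqlServer(connectionString = sqlConnString, wait = FALSE)
rxSetComputeContext(sqlComputeNoWait)

# The call returns immediately with a job object rather than the results
job <- rxSummary(~ Delay, data = connection)

# Poll the job, then collect the results when it has finished
while (rxGetJobStatus(job) %in% c("queued", "running")) {
  Sys.sleep(5)
}
summaryResults <- rxGetJobResults(job)

# rxCancelJob(job) would abandon the job instead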

Mapping types between R and SQL Server


R and SQL Server have different sets of data types.
Many data types can be mapped directly between
the two environments, and SQL Server R Services
performs automatic conversions wherever it can.

The following table summarizes some of the common mappings:

SQL Server Type R Class

varchar(n) (n <= 8000) character

char(n) (n <= 8000) character

varbinary(n) (n <= 8000) raw

int integer

smallint integer

float numeric

real numeric

bigint numeric

money numeric

datetime POSIXct

date POSIXct

bit logical

Note: The mappings for binary and character data shown in the table apply for conversions
from SQL Server to R. R character data is converted to varchar(max) on output to SQL Server,
and R raw data is converted to varbinary(max).

You should note that SQL Server has some data types that are not available in R, including image, xml,
table, timestamp, and all spatial types. In these situations, you must write your own code to extract the
information held in this data and reformat it as types that are compatible with R. You can do this using
the CAST and CONVERT Transact-SQL functions. Additionally, some types might be converted in an
unexpected manner, so you might need to check the results carefully.
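
For example, the following sketch converts incompatible columns in the query used by an RxSqlServerData
data source before the data reaches R. It reuses the sqlConnString connection string shown earlier; the
table (dbo.flightinfo) and its columns are hypothetical and are not part of the sample database:

Converting incompatible columns in the query

# Hypothetical table and columns: pre-convert an xml column and a rowversion column
# to types that R can handle before the data source reads them
query <- "SELECT FlightId,
                 CONVERT(nvarchar(max), RouteDetails) AS RouteDetailsText,
                 CAST(RowVersionStamp AS bigint) AS RowVersionNumber
          FROM dbo.flightinfo"
flightInfo <- RxSqlServerData(sqlQuery = query, connectionString = sqlConnString)
rxGetVarInfo(flightInfo)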

Note: If you want to remove columns that have incompatible types from a dataset, note
that the RxSqlServerData data source does not support the varsToKeep and varsToDrop
options of the rxDataStep function.

For detailed information on the type mappings performed by SQL Server R Services, see:

R Libraries and R Data Types


https://aka.ms/jq4tvw

Running R code from SQL Server


You can invoke R code from within SQL Server by
using the sp_execute_external_script stored
procedure. This stored procedure expects you to
provide an R script as the @script parameter, the
input data, and a specification for how any results
should be returned. The input data can be provided
as one or more named arguments. However, if you
need to provide a set of input data (such as the
results of a Transact-SQL query), you can pass this
information as a data frame to the stored
procedure. Similarly, the output data can be
returned as output parameters or another data
frame.

The following code sample shows how to use the sp_execute_external_script stored procedure.

Using the sp_execute_external_script stored procedure to run R code


exec sp_execute_external_script
@language = N'R',
@script = N'
summaryResults <- rxSummary(~., InputDataSet,
rowSelection = (OriginState == originState),
transformObjects = list(originState = origin))
print(summaryResults)
OutputDataSet <- rxDataStep(inData = InputDataSet,
rowSelection = (DestState == "NY"))
',
@input_data_1 = N'SELECT Origin, OriginState, Dest, DestState, Delay, UniqueCarrier
FROM dbo.flightdelaydata',
@params = N'@origin nvarchar(max)',
@origin = 'MA'
WITH RESULT SETS((Origin nchar(3), OriginState nchar(2), Dest nchar(3), DestState
nchar(2), Delay int, UniqueCarrier nchar(2)))

In this example, the script uses the rxSummary function to summarize a dataset. The results are printed,
before a new dataset is generated containing a subset of the observations from the original dataset by
using the rxDataStep function. The new dataset is returned by the stored procedure. The key points to
notice are:

 The @language parameter. You should always set this to 'R' (this is currently the only external
language supported by the sp_execute_external_script stored procedure).

 The @script parameter. This contains the R code to run.

 The @input_data_1 parameter. This parameter specifies a dataset to be passed in to the stored
procedure. It appears in the stored procedure as the variable InputDataSet (although you can
change this name by specifying the @input_data_1_name parameter). In the example, the dataset is
the result of executing a SQL query.

 The @params parameter. This parameter specifies a comma-separated list of variables that are
passed in to the stored procedure and that can be referenced by the R code in the stored procedure.
You must specify the type as a recognized SQL Server type for each variable. SQL Server R Services
will convert the variables into the equivalent R data types, subject to the rules specified in the
previous topic. In the example, the origin variable is referenced by the rxSummary function.

 The @origin variable. Each input variable mentioned in the @params list should be specified with a
value to be assigned to the variable. In the example, the @origin variable is assigned the text "MA",
and this value is used by the rxSummary function in the R code to limit the observations being
summarized to those where the OriginState field matches "MA".

 The WITH RESULT SETS clause. This clause specifies the fields to include if the R code returns a
results set, such as a data frame (currently, a data frame is the only supported type of result set). In
the R code, you reference the result set using the OutputDataSet variable, and the contents of this
variable are returned from the stored procedure. You must ensure that the fields in the
OutputDataSet variable match those specified by the WITH RESULT SETS clause.

The sp_execute_external_script stored procedure also enables you to specify processing hints, such as
whether to parallelize the processing for the R code, and whether to use result-set streaming for handling
datasets too big to fit into memory. For more information, see:

sp_execute_external_script (Transact-SQL)
https://aka.ms/d0e0uz

Note: You must enable external scripts before using the sp_execute_external_script
stored procedure. An administrator can perform this task by using the following commands while connected
to SQL Server.

sp_configure 'external scripts enabled', 1;
RECONFIGURE;

For more information, see:


External scripts enabled Server Configuration Option
https://aka.ms/ewkbbo

Creating stored procedures from R functions


The sqlrutils package provides a set of utility functions that you can use to create a SQL Server stored
procedure from an R function. In the code below, the testProc function simply adds 99 to the integer
value passed in and returns the results. The StoredProcedure function creates a StoredProcedure object
that can be used to create a stored procedure that wraps the testProc function. You can optionally save
the Transact-SQL code for the stored procedure if you specify a directory in the filepath argument. If the
stored procedure takes any parameters, you must specify these as InputParameter or OutputParameter
objects to the StoredProcedure function. After you have created the stored procedure object, you can
save it to SQL Server using the registerStoredProcedure function.

The example below shows how to perform these tasks:

Creating a stored procedure from an R function


# A simple R function
testProc <- function(data) {
return(as.integer(data) + 99)
}

# Create a StoredProcedure object that wraps the function
param <- InputParameter(name = "data", type = "integer")
testProcSP <- StoredProcedure(testProc, "SPTestProc", param, filePath = "E:\\Data",
                              dbName = "FlightDelays")

# Save the stored procedure to SQL Server. The sqlConnString variable holds a
# connection string
registerStoredProcedure(testProcSP, sqlConnString)

The stored procedure created by the preceding code looks like this. You can see how the R code is
embedded in a call to the sp_execute_external_script stored procedure:

The stored procedure generated by the sqlrutils package


IF (OBJECT_ID('SPTestProc') IS NOT NULL)
DROP PROCEDURE SPTestProc
GO
CREATE PROCEDURE SPTestProc
@parallel_outer bit = 0,
@data_outer int
AS
BEGIN TRY
exec sp_execute_external_script
@language = N'R',
@script = N'
testProc <- function(data) {
return(as.integer(data) + 99)
}
result <- testProc(data = data)
if (is.data.frame(result)) {
OutputDataSet <- result
} else if (is.list(result) && length(result) == 1
&& is.data.frame(result[[1]])) {
OutputDataSet <- result[[1]]
} else if (!is.null(result)) {
stop(paste0("the R function must return either NULL,",
" a data frame, or a list that ",
"contains a single data frame"))
}
',
@parallel = @parallel_outer,
@params = N'@data int',
@data = @data_outer
END TRY
BEGIN CATCH
THROW;
END CATCH;
GO

Storing R objects in a SQL Server database


One purpose of R code is to create analytical
models and other objects that you can use to
examine data. Some of these objects can be large—
perhaps too large to save as part of an R session. It
might take a considerable amount of processing
power to construct them, so rebuilding them each
time you start a new R session could be very time
consuming and resource intensive. The ideal place
to store these objects is in a database.

The ScaleR package provides the rxWriteObject and rxReadObject functions that you can use to store
R objects in a database. These functions require an ODBC connection to the database, and a table in
which to store objects.

This table must have a structure similar to that shown below:

The structure of a table for persisting R objects


-- Create a table for storing charts:
CREATE TABLE [dbo].[charts]
(
id VARCHAR(200) NOT NULL PRIMARY KEY,
value VARBINARY(MAX) NOT NULL
)

The id column is a character-based primary key that you use to identify objects. The object itself is
serialized and stored in a binary format in the value column.

After you have created the table, you can save objects to it. The following example creates a histogram
which it stores in the database:

Saving an R object to a database


# Create and display a histogram of delay data by airline
chart <- rxHistogram(~UniqueCarrier, data = airlineData)

# Save the histogram to SQL Server


# Create an ODBC data source that connects to the charts table
rxSetComputeContext(RxLocalSeq())
chartsTable <- RxOdbcData(table = "charts", connectionString = sqlConnString)

# Save the chart to the charts table, and give it a unique name to identify it later
rxWriteObject(dest = chartsTable, key = "chart1", value = chart)

Later, you can retrieve the histogram using chart1 as the key. You can then display the histogram:

Retrieving an R object from a database


# Retrieve the persisted chart
chartObject = rxReadObject(src = chartsTable, key = "chart1" )

# Display the chart


print(chartObject)

Note that the rxWriteObject and rxReadObject functions expect an ODBC data source and do not work
in the SQL Server compute context. If you need to perform similar operations inside a SQL Server stored
procedure, you must:

1. Create a SQL Server table with a varbinary column.

2. Serialize the object manually. You can use the serialize and paste base R functions to perform these
tasks.

3. Execute a SQL Server INSERT operation to save the serialized object to the table. It is recommended
that you wrap this operation in a stored procedure that you call from your R code. In this way, you
can reduce dependencies between the structure of the table and your R code.

To retrieve an object, you can:

1. Execute a SQL Server SELECT statement that fetches the serialized object from the database.
2. Deserialize the object back into its original form. You can use the as.raw and unserialize functions to
perform this task.

Note: You will use this second approach in the lab at the end of this lesson.

Demonstration: Storing and retrieving R objects from a database


This demonstration shows how to save an R object to a SQL Server database, and then retrieve it later.

Demonstration Steps

Save an R object to SQL Server


1. Open your R development environment of choice (RStudio or Visual Studio).

2. Open the R script Demo1 - persisting R objects.R in the E:\Demofiles\Mod08 folder.

3. Highlight and run the code under the comment # Create a SQL Server compute context. This code
creates a compute context that connects to the AirlineData database.

4. Highlight and run the code under the comment # Create a data source that retrieves airport
information.
5. Highlight and run the code under the comment # Create and display a histogram of airports by
state. This code uses the rxHistogram function to generate the histogram. Notice that the histogram
object itself is stored in the chart variable.

Note: Make sure that you display the Plots window in the lower right pane.

6. In the toolbar above the Plots window, click Clear all Plots, and then click Yes to confirm.

7. On the Windows desktop, click Start, type Microsoft SQL Server Management Studio, and then
press Enter.
8. In the Connect to Server dialog box, log in to LON-SQLR using Windows authentication.

9. On the File menu, point to Open and then click File.

10. Move to the E:\Demofiles\Mod08 folder, click the Demo1 - persisting R objects SQL Server Query
File, and then click Open.

11. In the Query window, notice that this script creates a table named charts with two columns; id and
value. This table will be used to hold R objects. The id column is the primary key, and the value
column will hold a serialized version of the object.

12. In the toolbar, click Execute.

13. Return to your R development environment.

14. Highlight and run the code under the comment # Create an ODBC data source that connects to
the charts table. This code switches back to the local compute context and creates an ODBC data
source.

15. Highlight and run the code under the comment # Save the chart to the charts table, and give it a
unique name to identify it later. This code uses the rxWriteObject function to store the histogram
object with the key chart1.

16. Switch back to SQL Server Management Studio.

17. In Object Explorer, expand LON-SQLR, expand Databases, expand AirlineData, right-click Tables,
and then click Refresh.

18. Expand Tables, right-click dbo.charts, and then click Select Top 1000 Rows.

You should see a single row. The value is a hexadecimal string that is a binary representation of the
histogram object.

Retrieve an R object from SQL Server


1. Return to your R development environment.

2. Highlight and run the code under the comment # Retrieve the persisted chart. This code uses the
rxReadObject to read the data for the chart1 object from the database and reinstate it as an R
object in memory.

3. Highlight and run the code under the comment # Display the chart. This code prints the object. The
histogram should appear in the Plots window.

4. Close your R development environment without saving any changes.

Verify the correctness of the statement by placing a mark in the column to the right.

Statement Answer

When you use the sp_execute_external_script stored procedure to run R code from SQL Server, the R code is executed by SQL Server. True or False?

Lab A: Deploying a predictive model to SQL Server


Scenario
You are developing a number of predictive models against various datasets to ascertain the most likely
circumstances for flights being delayed. You want to store these models in a SQL Server database so that
you can reuse them without having to rebuild them. You have also decided to relocate the data used by
these models to SQL Server.

Objectives
In this lab, you will:

 Upload the flight delay data to SQL Server and examine it.

 Fit a DForest model to the data to help predict flight delays.


 Save the model to SQL Server and create a stored procedure that enables a user to make predictions
using this model.

Lab Setup
Estimated Time: 60 minutes

 Username: Adatum\AdatumAdmin

 Password: Pa55w.rd

Before starting this lab, ensure that the following VMs are all running:

 MT17B-WS2016-NAT

 20773A-LON-DC

 20773A-LON-DEV

 20773A-LON-RSVR

 20773A-LON-SQLR

Exercise 1: Upload the flight delay data


Scenario
You have a copy of several samples of the flight delay data as XDF files. You want to upload this data to
SQL Server where you can examine it more easily. In particular, you want to see whether delays caused by
bad weather are predictable, based on the time of year.

The main tasks for this exercise are as follows:

1. Configure SQL Server and create the FlightDelays database

2. Upload the flight delay data to SQL Server

3. Examine the data in the database

 Task 1: Configure SQL Server and create the FlightDelays database


1. Log on to the LON-DEV VM as Adatum\AdatumAdmin with the password Pa55w.rd.

2. Start SQL Server Management Studio. Log on to the LON-SQLR server using Windows
authentication.

3. Reconfigure SQL Server to enable external scripts.

4. Stop and restart SQL Server, and then create a new database named FlightDelays. Use the default
options for this database.

5. Leave SQL Server Management Studio open.

 Task 2: Upload the flight delay data to SQL Server


1. Copy the FlightDelayDataSample.xdf file from the E:\Labfiles\Lab08 folder to the \\LON-RSVR\Data
shared folder.

2. Start your R development environment of choice (Visual Studio, or RStudio), and create a new R file.

3. Ensure that you are running in the local compute context; this is because you cannot read or write
XDF data in the SQL Server compute context.

4. Create an RxSqlServerData connection string and data source that connects to a table named
flightdelaydata in the FlightDelays database. The database is located on the LON-SQLR server. You
should use a trusted connection.

5. Use the rxDataStep function to upload the data in the \\LON-RSVR\Data\FlightDelayDataSample.xdf
file to the flightdelaydata table in the SQL Server database. Add the following transformations
(a sketch of one possible approach follows this step's notes):

 Create an additional column named DelayedByWeather. This column should be a logical factor
that is true if the WeatherDelay value in an observation is non-zero. For this exercise, treat NA
values as zero.

 Create another column called Dataset. You will use this column to divide the data into training
and test datasets for the DForest model. The column should contain the text "train" or "test",
selected according to a random uniform distribution. Five percent of the data should be marked
as "test" with the remainder labelled as "train".

Note that the data file contains 1158143 (1.158 million) rows.
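
The following sketch shows one possible shape for this step. It is not the official lab solution, and the
variable names are illustrative:

# A hedged sketch, not the official lab solution
rxSetComputeContext(RxLocalSeq())
sqlConnString <- "Driver=SQL Server;Server=LON-SQLR;Database=FlightDelays;Trusted_Connection=Yes"
flightDelayTable <- RxSqlServerData(connectionString = sqlConnString,
                                    table = "flightdelaydata")

rxDataStep(inData = "\\\\LON-RSVR\\Data\\FlightDelayDataSample.xdf",
           outFile = flightDelayTable, overwrite = TRUE,
           transforms = list(
             DelayedByWeather = factor(ifelse(is.na(WeatherDelay), FALSE, WeatherDelay > 0)),
             Dataset = ifelse(runif(length(WeatherDelay)) < 0.05, "test", "train")
           ))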

 Task 3: Examine the data in the database


1. Create a SQL Server compute context and use it to connect to SQL Server. You can reuse the
connection string that you defined earlier, for creating the RxSqlServerData data source. Set the
wait parameter of the compute context to TRUE.
2. Use the following Transact-SQL query to create another RxSqlServerData data source. This query
retrieves the Month, MonthName, OriginState, DestState, Dataset, and DelayedByWeather
columns, and also generates a calculated column named WeatherDelayCategory that partitions the
data into categories according to the length of any weather delay:

SELECT Year, Month, MonthName, OriginState, DestState,
    DelayedByWeather, WeatherDelayCategory = CASE
        WHEN CEILING(WeatherDelay) <= 0 THEN 'No delay'
        WHEN CEILING(WeatherDelay) BETWEEN 1 AND 30 THEN '1-30 minutes'
        WHEN CEILING(WeatherDelay) BETWEEN 31 AND 60 THEN '31-60 minutes'
        WHEN CEILING(WeatherDelay) BETWEEN 61 AND 120 THEN '61-120 minutes'
        WHEN CEILING(WeatherDelay) BETWEEN 121 AND 180 THEN '121-180 minutes'
        WHEN CEILING(WeatherDelay) >= 181 THEN 'More than 180 minutes'
    END
FROM flightdelaydata

The RxSqlServerData data source should convert the following columns to factors:

 Month
 OriginState

 DestState

 DelayedByWeather

 WeatherDelayCategory

 MonthName

The Dataset column should be character.

3. Run the rxGetVarInfo function over the data source and verify that it contains the correct variables.
Note that the factors may be reported as having zero factor levels. This is fine as you have not yet
retrieved any data, so the data source does not know what the factor levels are.

4. Summarize the data by using the rxSummary function. This might take a while as this is the point at
which the data source reads the data from the database.

5. Create and display a histogram that shows the number of delays in each value of
WeatherDelayCategory, conditioned by MonthName (a sketch of one possible approach follows these steps).

6. Create and display another histogram that shows the number of delays in each value of
WeatherDelayCategory, conditioned by OriginState.
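
A sketch of one possible approach for steps 5 and 6, assuming the data source from step 2 is named
weatherDelayData:

# A hedged sketch; weatherDelayData is an assumed name for the data source created in step 2
rxHistogram(~ WeatherDelayCategory | MonthName, data = weatherDelayData)
rxHistogram(~ WeatherDelayCategory | OriginState, data = weatherDelayData)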

Results: At the end of this exercise, you will have imported the flight delay data to SQL Server and used
ScaleR functions to examine this data.

Question: According to the first histogram, is there a pattern to weather delays?

Question: Using the second histogram, which states appear to have the most delays as a
proportion of the flights that depart from airports in those states? Proportionally, which state
has the fewest delays?

Exercise 2: Fit a DForest model to the weather delay data


Scenario
You surmise that weather delays might be related to the month, origin state, and possibly the destination
state (by extension) for each flight. For accuracy, you decide to fit a decision tree forest to the data and
score it against a different sample of the flight delay data. You will save the scored results in the database.

The main tasks for this exercise are as follows:

1. Create a DForest model

2. Score the DForest model

 Task 1: Create a DForest model


1. Create a DForest model that uses the following formula to fit the weather delay data (a sketch of one possible call follows these steps):

DelayedByWeather ~ Month + OriginState + DestState

 Only use the observations where the Dataset variable contains the value "train" to fit the model.

 Specify a complexity parameter limit (cp) of 0.0001.



Note: Make sure you are still using the SQL Server compute context. It should take
approximately five minutes to construct the model. However, if you are running in the local
compute context, it can take more than 30 minutes to perform the same task. This is one
advantage of keeping the computation close to the data, and exploiting the parallelism available
with R Server rather than R Client.
Also note that you will receive a warning message stating that the "Number of observations
not available for this data source" when the process completes. You can ignore this warning.

2. Inspect the model and examine the trees that it contains. Notice the forecast accuracy of the model
based on the training data.

3. Use the rxVarImpUsed function to see the influence that each predictor variable has on the decisions
made by the model.
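
A sketch of one possible call for step 1; the data source name is illustrative:

# A hedged sketch of the model fit in step 1
weatherDelayForest <- rxDForest(DelayedByWeather ~ Month + OriginState + DestState,
                                data = weatherDelayData,
                                rowSelection = (Dataset == "train"),
                                cp = 0.0001)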

Results: At the end of this exercise, you will have created a decision tree forest using the weather data
held in the SQL Server database, scored it, and stored the results back in the database.

Question: What is the Out-Of-Box (OOB) error rate for the DForest model?

Question: Are there any discrepancies between flights being forecast as delayed versus
those being forecast as on-time? If so, how could you adjust for this?

Question: Which predictor variable had the most influence on the decisions made by the
model?

 Task 2: Score the DForest model


1. Modify the data source that you used to retrieve the training data for the model, and limit it to fetch
only test data. To do this, you will need to append a Transact-SQL WHERE clause to the @sqlQuery
property of the data source.

Note: The rxDForest function has a rowSelection argument that you can use to limit the
rows retrieved to those labeled as "train". The rxPredict function that you will use to score
does not have this capability, so you need to restrict the data by modifying the data source
instead.

2. Create another RxSqlServerData data source. This data source should connect to a table named
scoredresults in the FlightDelays database (this table doesn't exist yet).

3. Temporarily switch back to the local compute context and run the rxPredict function to make
predictions about weather delays using the new dataset in SQL Server. Save the results in the
scoredresults table. Include the model variables in the scored results. Specify a prediction type of
prob to generate the probabilities of a match/nomatch for each case, and use the predVarNames
argument to record these probabilities in columns named PredictedDelay and PredictedNoDelay in the
scoredresults table. The rxPredict function also generates a TRUE/FALSE value that indicates, based
on these probabilities, whether the flight will be delayed. Save this data in a column named
PredictedDelayedByWeather.

Note that the rxPredict function creates the scoredresults table.

When the rxPredict function has finished, return to the SQL Server compute context.

Note: The rxPredict function does not currently work as expected in the SQL Server
compute context, which is why you need to switch back to the local compute context.

4. Run the following code to test the accuracy of the weather delay predictions in the scoredresults
table against the real data:

install.packages('ROCR')
library(ROCR)
# Transform the prediction data into a standardized form
results <- rxImport(weatherDelayScoredResults)
weatherDelayPredictions <- prediction(results$PredictedDelay,
results$DelayedByWeather)
# Plot the ROC curve of the predictions
rocCurve <- performance(weatherDelayPredictions, measure = "tpr", x.measure = "fpr")
plot(rocCurve)

This code performs the following tasks:


 It creates a data frame containing the scored results.

 It uses the prediction function to compare the probability of a weather delay recorded in the
PredictedDelay column of the scored results with the flag (TRUE=1, FALSE = 0) that indicates
whether the flight was actually delayed by weather for each observation.

 It runs the performance function to measure the ratio of true positive results against false
positive results.
 It plots the results as a ROC curve.

Question: What does the ROC curve tell you about the possible accuracy of weather delay
predictions? Is this what you expected?

Exercise 3: Store the model in SQL Server


Scenario
You want to save the model so that you can reuse it to make predictions against other datasets, for
comparison purposes. You decide to store the model in SQL Server. You also decide to create a stored
procedure that others can use to run predictions using your model.

The main tasks for this exercise are as follows:

1. Save the model to the database

2. Create a stored procedure that runs the model to make predictions

 Task 1: Save the model to the database


1. Serialize the DForest model as a string representation of binary data. Use the serialize and paste
functions to do this (a sketch of one possible approach follows these steps).

2. Using SQL Server Management Studio, create the following table in the FlightDelays database.
This table will hold the serialized model:

CREATE TABLE [dbo].[delaymodels]
(
    modelId INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    model VARBINARY(MAX) NOT NULL
)

3. Add the following stored procedure. You can run this stored procedure from R to save the DForest
model to the database:

CREATE PROCEDURE [dbo].[PersistModel] @m NVARCHAR(MAX)
AS
BEGIN
    SET NOCOUNT ON;
    INSERT INTO delaymodels(model) VALUES (CONVERT(VARBINARY(MAX), @m, 2))
END

4. In your R development environment, create an ODBC connection to the database and use the
sqlQuery ODBC function to run the PersistModel stored procedure. Specify the serialized version of
the DForest object as the parameter to the stored procedure.

Note that the sqlQuery function is part of the RODBC library, and you must use the
odbcDriverConnect function to create the ODBC connection. You can reuse the same connection
string as before.
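
A sketch of one possible approach for steps 1 and 4; the model and connection variable names are
illustrative:

# A hedged sketch of steps 1 and 4
library(RODBC)

# Step 1: serialize the DForest model and encode the bytes as a hexadecimal string
serializedModel <- paste(serialize(weatherDelayForest, connection = NULL), collapse = "")

# Step 4: run the PersistModel stored procedure over an ODBC connection
dbConnection <- odbcDriverConnect(sqlConnString)
sqlQuery(dbConnection, paste0("EXEC PersistModel @m = '", serializedModel, "'"))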

 Task 2: Create a stored procedure that runs the model to make predictions
1. In SQL Server Management Studio, create the following stored procedure:

CREATE PROCEDURE [dbo].[PredictWeatherDelay]


@Month integer = 1,
@OriginState char(2),
@DestState char(2)
AS
BEGIN
DECLARE @weatherDelayModel varbinary(max) = (SELECT TOP 1 model FROM
dbo.delaymodels)
EXEC sp_execute_external_script @language = N'R',
@script = N'
delayParams <- data.frame(Month = month, OriginState = originState, DestState
= destState)
delayModel <- unserialize(as.raw(model))
OutputDataSet<-rxPredict(modelObject = delayModel,
data = delayParams,
outData = NULL,
predVarNames = c("PredictedDelay", "PredictedNoDelay",
"PredictedDelayedByWeather"),
type = "prob",
writeModelVars = TRUE)',
@params = N'@model varbinary(max),
@month integer,
@originState char(2),
@destState char(2)',
@model = @weatherDelayModel,
@month = @Month,
@originState = @OriginState,
@destState = @DestState
WITH RESULT SETS (([PredictedDelay] float, [PredictedNoDelay] float,
[PredictedDelayedByWeather] bit, [Month] integer, [OriginState] char(2), [DestState]
char(2)));
END

This stored procedure takes three input parameters: Month, OriginState, and DestState. These
parameters equate to the predictor variables used by the DForest model.

The body of the stored procedure creates a variable named @weatherDelayModel that will be
used to retrieve the model from the delaymodels table.

The remainder of the stored procedure uses the sp_execute_external_script stored procedure to
run a chunk of R code. This R code takes four parameters: @model which is used to reference the
model to run, and @month, @originState, and @destState which specify the predictor values
passed in. The assignments below the @params definition show how these parameters are
populated using the @weatherDelayModel, @Month, @OriginState, and @DestState variables
respectively.

The R code uses these variables to construct a data frame containing predictor values and also to
retrieve the model from the database. The rxPredict function in the R code generates a prediction
from this data indicating whether a flight from the specified origin to destination in the given month
is likely to be delayed by weather. The results are output as another data frame (containing a single
row). The WITH RESULT SETS clause specifies the fields in this data frame.

2. Return to your R development environment and run the following code to test the stored procedure:

cmd <- "EXEC [dbo].[PredictWeatherDelay] @Month = 11, @OriginState = 'GA', @DestState


= 'NY'"
sqlQuery(connection, cmd)

This code asks about the probability of a flight from Georgia to New York in November being delayed
due to weather.

3. Save the script as Lab8_1Script.R in the E:\Labfiles\Lab08 folder, and close your R development
environment.

4. Close SQL Server Management Studio, without saving any changes.

Results: At the end of this exercise, you will have saved the DForest model to SQL Server, and created a
stored procedure that you can use to make weather delay predictions using this model.

Question: According to the DForest model, what is the probability of a flight from Georgia
(GA) to New York (NY) in November being delayed by weather? What about a flight in June?

Lesson 2
Using ScaleR functions with Hadoop on a Map/Reduce
cluster
The RevoScaleR package provides the RxHadoopMR compute context to enable you to perform data
analysis operations using a Hadoop cluster. The operations performed by many of the ScaleR functions in
this compute context have been adapted to take advantage of the Hadoop Map/Reduce mechanism to
analyze and refine big datasets. The RevoScaleR package also includes a set of command line helper
functions that you can use to interact directly with Hadoop and HDFS data.

The RevoScaleR package provides the same ScaleR functions for the RxHadoopMR compute context as it
does for other nonclustered contexts. This approach enables you to develop and test your R code locally
on small datasets before deploying it on a Hadoop cluster against massive volumes of data.

Lesson Objectives
After completing this lesson, you will be able to:

 Describe how the ScaleR functions operate with Hadoop.


 Create and use the RxHadoopMR compute context.

 Describe the data sources available in the RxHadoopMR compute context.

 Describe when to perform local processing over HDFS data.


 Incorporate a Pig script into the analysis workflow.

 Describe configuration options for optimizing Hadoop for use with R.

How the ScaleR functions work with Hadoop Map/Reduce


Hadoop provides a distributed environment for
performing computations over big data. It exploits
the parallelism available when using processes
running on many machines by splitting the data up
into chunks, and allocating each process a chunk of
the data on which to work. Each process generates
its own local results which are then combined to
produce a final result. In some cases, Hadoop can
construct the final result in a single pass, but in
other cases it might need to perform further
iterations, handing back chunks of data containing
partial results to processes to perform additional
calculations and refinements.

The structure of XDF data, and the way in which the ScaleR functions process data in chunks, make them a
natural fit for working with Hadoop. The implementation of the ScaleR functions for Hadoop breaks operations down into
parallel pieces, each of which can be run independently by separate Hadoop processes; a ScaleR function
initiates a Map/Reduce job, and the separate processes run as tasks within that job. You can monitor and
trace the progress of these jobs using the standard Hadoop job utilities.

Additionally, the ScaleR functions are optimized to operate on chunked data. XDF files stored in HDFS are
actually structured as composite files, as described in Module 2: Exploring Big Data. Each element of the
composite file can be allocated to a process in isolation, and there are no issues with locking or
contention.

HDFS is a shared file system, and the RevoScaleR package depends on you to maintain the security of the
files in HDFS. You can do this by using the Hadoop HDFS utilities (such as the hadoop fs command in
Linux). Although handling security is out of the scope of the ScaleR functions, the RevoScaleR package
does depend on a particular directory structure in HDFS. Specifically, as part of the ScaleR installation
process, you must create the following folders in HDFS:

 /user/RevoShare

 /user/RevoShare/username (where username is the name of each account that can run ScaleR
functions)

These directories must be assigned the appropriate read/write permissions for each user. You can do this
using the hadoop fs -chmod command in Linux.

Additionally, you must also create the /var/RevoShare directory in the Linux file system, together with
/var/RevoShare/username subdirectories (again where username is the name of each account that can run
ScaleR functions).
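
If you prefer to perform the HDFS part of this setup from R rather than from a Linux shell, the following
sketch shows one possible approach by using the Hadoop helper functions described later in this lesson.
The user name and permissions are illustrative, and an RxHadoopMR compute context is assumed to be active:

Creating the RevoShare directories from R

# Illustrative user name and permissions; assumes an active RxHadoopMR compute context
rxHadoopMakeDir("/user/RevoShare/student01")
rxHadoopCommand("fs -chmod 770 /user/RevoShare/student01")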

For more information about using the ScaleR functions with Hadoop, see:

Get started with ScaleR on Hadoop MapReduce


https://aka.ms/qaiosp

Creating a HadoopMR compute context


You use the RxHadoopMR compute context to
enable the ScaleR functions to work with Hadoop
Map/Reduce. The RxHadoopMR compute context
actually creates a secure shell (SSH) session in the
cluster, so you must enable SSH first, both on the
cluster machines and your R client machine. On a
Windows computer this will involve installing and
configuring either PuTTY or Cygwin.

If you are already connected to R server running on a computer that is part of the Hadoop cluster, such
as an edge node (you might have opened a remote session on the edge node, for example), you can switch
to the RxHadoopMR compute context very easily, as follows:

Creating an RxHadoopMR compute context inside a Hadoop cluster


hadoopContext <- RxHadoopMR()
rxSetComputeContext(hadoopContext)

If you are currently located outside of the cluster, you must provide additional information that specifies:

 Your user name (Linux login name on the cluster).

 The cluster name node or host name of an edge node in the cluster.
 The port number for incoming connections on the Hadoop cluster (if a nonstandard port is used).

 Any additional SSH switches required. This includes the location of the private key file required to
authenticate your login.

The following code shows how to connect to a Hadoop cluster located at LON-HADOOP-
01.ukwest.cloudapp.azure.com from a non-Hadoop client computer. The user is logged in as
student01, and the hadoop.ppk file contains the private key that authenticates the users for the SSH
session. The default port is used to connect to the cluster.

Creating an RxHadoopMR compute context for a remote Hadoop cluster


hadoopContext <- RxHadoopMR(sshUsername = "student01",
sshHostname = "LON-HADOOP-01.ukwest.cloudapp.azure.com",
port = 0,
sshSwitches = "-i hadoop.ppk"
)
rxSetComputeContext(hadoopContext)

If you are using PuTTY, you can wrap this information up and save it in a PuTTY session configuration file
instead. In this case, you can omit the port and sshSwitches parameters, and replace the sshHostname
parameter with the name of the session configuration file.

Note: The configuration used by the labs in this module follow this approach. See the
About This Course document for information on how this is set up.

Hadoop jobs can be long running. You can configure the Hadoop compute context to be nonwaiting by
setting the wait parameter to FALSE. As with SQL Server, all requests will return immediately, but they will
report a job id that you can use to monitor the job from the Hadoop Jobtracker console. You can test the
status of a job with the rxGetJobStatus function, and when it has finished you can retrieve the results
with the rxGetJobResults function. You can halt a job by using the rxCancelJob function, or block while
a job completes with the rxWaitForJob function. For more information about waiting and nonwaiting
jobs, refer to Module 5: Parallelizing Analysis Operations.

One further parameter that can be instructive is consoleOutput. If you set this to TRUE, you can see how
the ScaleR operations are split into Map/Reduce tasks. The information displayed also includes the
Hadoop job and task ids, and you can use the Hadoop Jobtracker console to monitor these items.
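
The following sketch combines these options. The host, user, and data source names are assumptions
rather than values taken from the lab environment:

A nonwaiting RxHadoopMR compute context with console output

# Assumed host, user, and data source names
hadoopContextNoWait <- RxHadoopMR(sshUsername = "student01",
                                  sshHostname = "LON-HADOOP-01.ukwest.cloudapp.azure.com",
                                  wait = FALSE,
                                  consoleOutput = TRUE)
rxSetComputeContext(hadoopContextNoWait)

# hdfsDelayData is assumed to be an XDF data source stored in HDFS
job <- rxSummary(~ Delay, data = hdfsDelayData)
rxWaitForJob(job)            # or poll with rxGetJobStatus(job)
results <- rxGetJobResults(job)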

There’s a raft of other optional arguments that you can specify. These include switches that enable you to
control how Hadoop processes your operations. For more information, see:

RxHadoopMR: Generate Hadoop Map Reduce Compute Context


https://aka.ms/lmvx73

Code running in the RxHadoopMR compute context can utilize the full range of R packages for Hadoop, including rmr2 (R
Map/Reduce) which enables you to implement custom map/reduce functionality in R. Additionally,
remember that you can invoke arbitrary functions from ScaleR functions such as rxDataStep by using a
custom transformation with the transformFunc argument.

Data sources for Hadoop


The RxHadoopMR compute context defaults to
using the HDFS file system for the cluster. This file
system supports text data through the RxTextData
data source, in addition to XDF data through the
RxXdfData data source. XDF files stored in HDFS
are held as composite files, whereas by default text
files are held “as-is”. Operations that process text
files therefore offer fewer opportunities for
parallelization without causing file contention
problems. However, you can create composite CSV
files using the rxDataStep function with the HDFS
file system.
You cannot connect to SQL Server or ODBC data sources from within the RxHadoopMR compute
context. If you need to access data in one of these repositories, you can adopt an approach such as:

 Retrieving it using a different compute context, and then transfer the data to HDFS.
 Using a technology such as Apache Sqoop to import the data into HDFS on the cluster (Sqoop is
specifically designed to transfer data efficiently between Hadoop and structured stores such as
relational databases).
You can store and access data in the local Linux file system rather than HDFS. To do this, you can use the
rxSetFileSystem function and specify the RxNativeFileSystem file system. You can switch back to HDFS
by specifying the RxHdfsFileSystem file system.
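
The following sketch illustrates switching between the two file systems; the file paths are illustrative:

Switching between the native file system and HDFS

# Illustrative paths
rxSetFileSystem(RxNativeFileSystem())
localCsv <- RxTextData("/home/student01/sample.csv")

rxSetFileSystem(RxHdfsFileSystem())
hdfsCsv <- RxTextData("/user/RevoShare/student01/sample.csv")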

Note: Remember that you can access data held in HDFS from outside of the RxHadoopMR
compute context by connecting directly to the file system with the rxHdfsConnect function.

The RxHadoopMR compute context provides a collection of helper functions for interacting with the
HDFS file system. Module 2 describes these functions, which are also summarized below:

 rxHadoopCopyFromClient. Use this function to copy a file from a remote client to the HDFS file
system in the Hadoop cluster.

 rxHadoopCopyFromLocal. Use this function to copy a file from the native file system to the HDFS
file system in the Hadoop cluster.

 rxHadoopCopy. Use this function to copy an HDFS file.

 rxHadoopMove. Use this function to move a file around the HDFS file system.

 rxHadoopRemove. This function removes a file from the HDFS file system.

 rxHadoopRemoveDir. Use this function to remove a directory from HDFS.

 rxHadoopListFiles. This function generates a list of files in a specified HDFS directory.

 rxHadoopFileExists. This function tests whether a specified file exists in an HDFS directory.

Another important function is rxHadoopCommand. You can use this function to perform any Hadoop
operation from within R, including submitting Hadoop Map/Reduce jobs.
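
The following sketch shows a few of these helper functions in use; the file paths are hypothetical:

Using the Hadoop helper functions

# Hypothetical paths
rxHadoopCopyFromLocal("/home/student01/FlightDelayData.csv", "/user/RevoShare/student01")
rxHadoopListFiles("/user/RevoShare/student01")
rxHadoopFileExists("/user/RevoShare/student01/FlightDelayData.csv")

# rxHadoopCommand can issue any other Hadoop command, for example:
rxHadoopCommand("version")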

Accessing HDFS locally for performing smaller computations


Hadoop is intended for performing processor
intensive operations against large volumes of data.
If you are performing small-scale computations
against HDFS data, you might find that using the
local client or R server rather than the
rxHadoopMR compute context is faster. This is
because the overhead of using Hadoop will likely
exceed the time taken to actually perform the
operation.

You can access HDFS data from a non-Hadoop context by using the RxHdfsFileSystem function to connect to
HDFS, and then making HDFS the default file system for your session with the rxSetFileSystem function.
You can then read and write files in HDFS directly from your session, as shown in the following example:

Switching to the HDFS file system


rxSetComputeContext(RxLocalSeq())
hdfsFileSystem <- RxHdfsFileSystem(hostname = …)
rxSetFileSystem(filesystem = hdfsFileSystem)
localCensusData <- RxTextData(file = "censusdata.txt")

For more information, see the section Use a Local Compute Context at:

Get started with ScaleR on Hadoop MapReduce


https://aka.ms/qaiosp

Working with Hive


Hive is a data warehouse infrastructure built on top
of Hadoop. It provides a SQL-like interface to query
data stored on the Hadoop file system. Hive queries
are translated and run as Map/Reduce jobs behind
the scenes. Because of its resemblance to SQL, Hive
is often used for exploring and reporting large
tables of structured data for distributed data
warehouse tasks. It doesn't work directly on raw text files, but rather reads them as a table with a
provided schema, or reads from a distributed database such as HBase.

You might find that some queries can be expressed more easily in Hive (for example, a multitable join
across a very large data warehouse), but you then want to do further analysis or modelling on the returned
data in R. It is not possible to connect to Hive directly using the RxHadoopMR compute context, but there
are other ways to use R Server to access and use data from Hive, depending on the scale of data you will
be returning from your Hive job.

1. You can use the system function to issue a Hive job on your cluster, saving your data as a text file on
HDFS. You can run further analyses on this data using either the RxHadoopMR or RxSpark compute
contexts.

2. If the result of your Hive job is relatively small, it might be better to use the RxOdbcData data source
to connect a remote client to Hive through ODBC. You can then either stream the results to the
remote client or download them as XDF in the local file system.

Running Hive on a cluster


If the data returned by a Hive query is too large to load directly on to the client machine, you will need to
log in to R Server on the cluster and run the Hive query directly using the system function.

The following example creates a Hive table and loads data into it from a text file in HDFS. Note that you
need to supply the schema when working with data in Hive. When the table has been created, the
example returns the first 100 lines.

Retrieving data from Hive


hive_query <- 'hive -e "CREATE TABLE censusNames(name string, rank int); LOAD DATA INPATH \'censusNames.txt\' OVERWRITE INTO TABLE censusNames; SELECT * FROM censusNames LIMIT 100" > myCensusData.txt'

system(hive_query)

This code will dump the text file onto HDFS on the cluster. You can then switch back to your client and connect to the cluster using the RxHadoopMR compute context and an RxTextData data source.
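
The following sketch shows the kind of code you might then run on the client; the SSH connection details and the HDFS path of the file produced by the Hive job are placeholders that you would replace with your own values:

Reading the Hive output from a remote client

rxSetComputeContext(RxHadoopMR(sshUsername = "yourSshUsername",
                               sshHostname = "yourSshHostname"))
rxSetFileSystem(RxHdfsFileSystem())

# Reference the text file produced by the Hive query and summarize it.
censusResults <- RxTextData(file = "/user/RevoShare/yourSshUsername/myCensusData.txt")
rxSummary(~., censusResults)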

Accessing Hive data by using ODBC


You can use this approach if the data returned from Hive can fit in memory on your client machine. You
will first need to follow your Hadoop vendor’s recommendations for accessing Hive via ODBC from a
remote client, and install any software you need for this. When you are set up, accessing data in Hive from
R Server is just like accessing data from any other data source: the connection information and the Hive
query itself need to be supplied as character strings to the RxOdbcData constructor.

The next example shows how to read the census data from the previous example into a table, and then
return the top 100 rows of the table as a local XDF file.

Using ODBC to connect to Hive


connectStr <- "DSN=HiveODBC"
mySQL = "SELECT * FROM censusNames LIMIT 100"
myDS <- RxOdbcData(sqlQuery = mySQL, connectionString = connectStr)
xdfFile <- RxXdfData("dataFromHive.xdf")
rxImport(myDS, xdfFile, stringsAsFactors = TRUE, overwrite=TRUE)

For more information, see the section Using data from Hive for Your Analyses at:

Get started with ScaleR on Hadoop MapReduce


https://aka.ms/qaiosp

Incorporating Pig into the analysis workflow


Pig is like Hive in that it sits on top of Hadoop and
provides a more intuitive interface than
MapReduce, while compiling down to Map/Reduce
code in the background. It differs in that, rather
than looking like SQL, it provides a procedural
programming language interface more familiar to
programmers than analysts. Also, because Pig does
not require data to fit in a predefined schema, it is
better able to deal with semi-structured data than
Hive.

You might have Pig scripts that need to be incorporated in your ScaleR workflows. As with Hive, you cannot connect to Pig directly through a compute context, so you need to use the system function to run Pig scripts on your cluster.

The following Pig script loads a file and performs some filtering and transformations before saving the
results to HDFS:

An example Pig script


-- names.pig
A = load 'censusNames.txt' USING PigStorage() AS (f1:chararray, f2:chararray, f3:int);
X1 = FILTER A BY f3 >= 100;
X2 = FILTER X1 BY f3 < 200;
X3 = FOREACH X2 GENERATE f1, LOWER(f2);
STORE X3 INTO 'pig_data.txt';

In the R session, while logged into the cluster, you can run the Pig script like this:

Running the Pig script


system("pig -f names.pig")

The results file is then accessible to R through the RxTextData data source. You can either work with this on the cluster, using the RxHadoopMR compute context, or, if the data is not too large, pull the data onto a remote client (see "Accessing HDFS locally for performing smaller computations").
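
For example, while still logged in to the cluster, you might reference the Pig output like this (a minimal sketch; the exact HDFS location of pig_data.txt depends on where the script stored its results):

Reading the Pig results into a ScaleR data source

rxSetComputeContext(RxHadoopMR())
rxSetFileSystem(RxHdfsFileSystem())

# Reference the directory created by the Pig STORE statement and inspect its schema.
pigResults <- RxTextData(file = "pig_data.txt")
rxGetVarInfo(pigResults)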

Note: You can also run Hive scripts in the same way, using the “-f” flag.

Demonstration: Running an analysis as a Map/Reduce job


This demonstration shows how to use the RxHadoopMR compute context to analyze data using ScaleR
functions.

Demonstration Steps

Upload data to HDFS in the Hadoop compute context


1. Open your R development environment of choice (RStudio or Visual Studio).

2. Open the R script Demo2 - analysis on Hadoop.R in the E:\Demofiles\Mod08 folder.

3. Highlight and run the code under the comment # Create a Hadoop compute context. These
statements connect to the LON-HADOOP-01 server as instructor. Note that the host name is actually
the name of a PuTTY configuration file (LON-HADOOP) and not the name of the remote server. The
PuTTY configuration file contains the details and keys required to establish an SSH session on the
server.

4. Highlight and run the code under the comment # Copy FlightDelayData.xdf to HDFS
(/user/RevoShare/instructor). These statements remove the FlightDelayData.xdf file from HDFS (if
it exists), and then copy a new version of this file from the E:\Demofiles\Mod08 folder on the client
VM.

Note: If the rxHadoopRemove command reports No such file or directory then the file
didn't exist. You can ignore this message.

5. Highlight and run the code under the comment # Verify that the file has been uploaded. This
statement uses the rxHadoopCommand function to display the contents of the
/user/RevoShare/instructor folder in HDFS. You should see the FlightDelayData.xdf file included in
the output.

Analyze the data in Hadoop


1. Highlight and run the code under the comment # Connect to the R server running Hadoop,
replacing the example Hadoop URL with the URL of the Hadoop VM in Azure. This code creates a
remote session connecting to R Server on the Hadoop machine.

2. Highlight and run the code under the comment # Create a Hadoop compute context on this
server. Note that you don't have to specify the host name this time because Hadoop is running on
the same machine as your session.

3. Highlight and run the code under the comment # Examine the structure of the data. This code
runs the rxGetVarInfo function to view the variables in the data file. Notice that you have to switch
to the HDFS file system; the default file system for the Hadoop compute context is the native Linux
file system. Additionally, you will see a few messages reported by the compute context before the
results are displayed. This is because the consoleOutput flag in the compute context is set to TRUE.
In a production environment, you would typically disable this feature, but it is useful for debugging in
a test and development environment.

Note: The output will also include the messages WARN util.NativeCodeLoader: Unable
to load native-hadoop library for your platform... using builtin-java classes where
applicable, and WARN shortcircuit.DomainSocketFactory: The short-circuit local reads
feature cannot be used because libhadoop cannot be loaded.
You can ignore these messages.

4. Highlight and run the code under the comment # Read a subset of the data into a data frame.
This code runs the rxImport function to fetch a sample of 10 percent of the data into memory.

5. Highlight and run the code under the comment # Summarize the flight delay data sample. This
code runs the rxSummary function over the data frame. Notice that Hadoop displays the message
Warning: Computations on data sets in memory cannot be distributed. Computations are
being done on a single node. The task is performed as a single-threaded operation rather than a
Map/Reduce job.

6. Highlight and run the code under the comment # Break the data file down into composite pieces.
The ScaleR functions in the Hadoop Map/Reduce compute context are optimized to work with
composite files. This code stores the composite version of the data file in the
/user/RevoShare/instructor/DelayData directory in HDFS. This directory must exist before creating
the file, so this code creates it using the rxHadoopMakeDir function.

7. Highlight and run the code under the comment # Examine the composite XDF file. This block of
code uses the rxHadoopListFiles function to display the contents of the DelayData folder and data
subfolder.

Note: The rxHadoopListFiles function does not work properly with R Server 9.0.1 or earlier, which is why a previous step in this demonstration used the rxHadoopCommand function to run the Hadoop fs -ls command. However, the Hadoop server is running R Server 9.1.0, which has fixed this issue.

8. Highlight and run the code under the comment # Perform a more complex analysis - compute a
crosstab. This code generates a crosstab of airlines and the airports that they serve, counting the
number of flights that have departed from each airport for that airline.

Track Map/Reduce jobs


1. On the desktop, open Microsoft Edge.
2. Navigate to the URL http://fqdn:8188/, where fqdn is the fully qualified domain name of the Hadoop server (such as LON-HADOOP-01.ukwest.cloudapp.azure.com). The Hadoop Job Tracking page should appear, showing the details of each Map/Reduce job that you have performed during the demonstration.

3. In the ID column, click the link for the most recent job. This should be the job that ran when you
created the crosstab. The details for the job should appear on a new page.

4. In the Logs column, click the Logs link. This page shows the trace for the job. This information is
useful for debugging purposes, if a ScaleR function fails for some reason.

5. Click the Back button in the toolbar to return to the previous page, and then click the Back button
again to return to the Job Tracking page listing all the recent jobs.

6. Return to your R development environment.

7. Close your R development environment without saving changes.


Verify the correctness of the statement by placing a mark in the column to the right.

Statement: All ScaleR operations running in the RxHadoopMR compute context are performed as Hadoop Map/Reduce jobs. True or False?

Answer:

Lesson 3
Using ScaleR functions with Spark
Spark is an open source big data processing framework on Hadoop, similar to the Map/Reduce
framework. It differs from Map/Reduce in that most of the data is copied into RAM on individual nodes in
Spark, and operations are then carried out in memory. This contrasts with Map/Reduce that writes all the
data to disk on the nodes after every operation. This difference makes Spark considerably faster than
Map/Reduce. Spark can also access data held in diverse sources including HDFS, Cassandra, HBase, and
S3.

You can work with Spark interactively using different programming languages: Scala, Java, Python and R.
One of the main advantages of Spark for data analytics is that it has a comprehensive data frames API that
is modelled on R data frames. For the R user, working with Spark data frames should be immediately
intuitive. Spark also has an extensive machine learning library, MLlib, which can deal with batch or
streaming applications and has a very active community.

R has been tightly integrated with Spark development from early on, and the ScaleR functions improve on this still further. The Spark compute context enables you to conduct high performance analytics on Spark, making use of Hadoop HDFS, in code that is very close to the code you would write for in-memory data.

The ScaleR functions are also complemented by sparklyr, an open source R package developed by
RStudio. This package enables users to apply the popular dplyr methods of data manipulation directly to
Spark data frames. ScaleR and sparklyr can be used in tandem in the same R server session.

Finally, the Spark compute context enables you to access data directly in Hive and Parquet format using data sources that are not available in the RxHadoopMR compute context.

Lesson Objectives
After completing this lesson, you will be able to:
 Create and use the RxSpark compute context.

 List the data sources available in the RxSpark compute context

 Use sparklyr code with ScaleR functions.


 Integrate SparkR operations into the ScaleR workflow.

Creating the Spark compute context


The RevoScaleR package provides the RxSpark
compute context to enable you to interact with
Spark on a Hadoop cluster. You can also use the
rxSparkConnect function, which creates the
compute context object with RxSpark, and then
immediately starts a remote Spark application. The
arguments to the RxSpark compute context enable
you to define how to connect to the Spark process
and provide fine control over how the Spark
process works.

If you are logged into an R session running on an edge node of your Hadoop cluster, you can connect to
Spark by creating the compute context with the default values:

Connecting to a local Spark cluster


mySpark <- RxSpark()
rxSetComputeContext(mySpark)

If you are connecting from R client, you can set up a compute context that will run distributed Spark jobs
remotely on your cluster. In this case, the compute context creates a remote SSH session on the cluster.
You will need to supply additional arguments to RxSpark to create the compute context. Specifically, you
must specify your user name, the file-sharing directory where you have read and write access, the
publicly-facing host name or IP address of your Hadoop cluster’s name node or an edge node that will
run the master processes, and any additional switches to pass to the SSH session (such as the -i flag if you
are using a pem or ppk file for authentication).

For example:

Connecting to a remote Spark cluster


myHadoopCluster <- RxSpark(
    hdfsShareDir = "path/to/yourHdfsShareDir",
    shareDir = "path/to/yourShareDir",
    sshUsername = "yourSshUsername",
    sshHostname = "yourSshHostname",
    sshSwitches = "-i /home/yourName/user1.pem")

Note that, as with the RxHadoopMR compute context, you can save many of the security parameters for
the SSH session as a PuTTY or Cygwin session configuration file, and then reference this configuration in
the sshHostname argument of RxSpark. For examples showing various R client Spark configuration
setups, see:

Get started with ScaleR on Apache Spark


https://aka.ms/nhx9iy

All the startup parameters for a Spark job are available through the RxSpark compute context. Amongst
many others, the following tuning options are available as arguments to RxSpark. They enable you to
control how memory and processors in the cluster are allocated to your Spark job:

 numExecutors. The number of individual processes to be set up for the job. Typically this will be one
executor for each node in the cluster, although you might want to reduce this if you are working on a
shared cluster since the default behavior is to launch as many executors as possible, which might use
up all resources and prevent other users from sharing the cluster.

 executorCores. The number of processor cores to assign to each executor.

 executorMem. A character string specifying the amount of memory to assign to each executor (for
example, 1000M, 3G).

 driverMem. A character string specifying the amount of memory to assign to the driver, or edge,
node.

 executorOverheadMemory. Specifies the memory overhead to assign to each executor. Increasing this value will allocate more memory for the R process and the ScaleR engine process in the executors, so it might help resolve job failures.

See the R help file for RxSpark for a full list of the options available.
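
For example, the following sketch creates a compute context that caps the resources the job can claim on a shared cluster; the values shown are purely illustrative:

Limiting the resources used by a Spark job

myTunedSpark <- RxSpark(numExecutors = 4,
                        executorCores = 2,
                        executorMem = "2g",
                        driverMem = "1g")
rxSetComputeContext(myTunedSpark)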

Data sources available for the RxSpark compute context


The RxSpark compute context is compatible with
the following data sources. All of these data sources
can be used with data stored on HDFS:

 RxTextData. A comma-delimited text data source.

 RxXdfData. Data in XDF format.

 RxHiveData. Data in Hive tables.

 RxParquetData. Data in Parquet format. Parquet is an open source columnar data format, similar to XDF, that enables efficient querying of data with a large number of columns.

Note that the RxHiveData and RxParquetData data sources can only be used within an RxSpark
compute context, and not with RxHadoopMR.
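
For example, with an RxSpark compute context active, you might define Hive and Parquet data sources as follows (a sketch only; the table name and file path are placeholders):

Defining Hive and Parquet data sources

hiveSource <- RxHiveData(table = "censusNames")
parquetSource <- RxParquetData(file = "/share/claimsParquet")

# Inspect the variables exposed by the Parquet data source.
rxGetVarInfo(parquetSource)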

Incorporating ScaleR code into a SparkR session


SparkR ships with all versions of Spark from 1.4 onward and provides a lightweight front end for using Apache Spark from R. It provides a distributed DataFrame implementation that supports operations like selection, filtering, aggregation, and so on. Working with these DataFrames is very like working with R data frames, but on large, out-of-memory datasets. SparkR also supports distributed machine learning using the MLlib package.

Both SparkR and ScaleR can run on top of Hadoop's Spark execution engine, and you can use the SparkR data frame functions alongside ScaleR high performance analytics tools. However, at the time of writing, it is not possible to share data directly from one to the other. This is because SparkR and ScaleR are blocked from in-memory data sharing by each requiring their own Spark session.

Until this is addressed in an upcoming version of R Server, the workaround is to maintain different
compute contexts for SparkR and for ScaleR. You can then exchange data through intermediate files. For
example, you might start a Spark context in SparkR to select some data stored in a data lake as Parquet,
perform some filtering, transformation and merging to construct a useable dataset, and then save this to
a CSV file on HDFS. You could then open an RxSpark compute context, read the CSV file into an XDF
object, split into a training and a test set, and build a logistic regression or boosting model on the
extracted data using the ScaleR modeling functions.
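
A minimal sketch of the ScaleR side of such a workflow is shown below; the HDFS paths, variable names, and model formula are all placeholders, and the CSV file is assumed to have already been written by the SparkR job:

Reading the intermediate file and building a model in the RxSpark compute context

rxSetComputeContext(RxSpark())
rxSetFileSystem(RxHdfsFileSystem())

# Import the CSV file produced by SparkR into an XDF file in HDFS.
# (Additional arguments might be needed depending on your HDFS configuration.)
csvData <- RxTextData(file = "/user/RevoShare/yourName/preparedData")
xdfData <- RxXdfData(file = "/user/RevoShare/yourName/preparedDataXdf")
rxImport(inData = csvData, outFile = xdfData, overwrite = TRUE)

# Fit a logistic regression model over the prepared data.
logitModel <- rxLogit(outcome ~ feature1 + feature2, data = xdfData)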

For a comprehensive example of a workflow involving sharing data between SparkR and ScaleR, see:

Combining ScaleR and SparkR in HDInsight


https://aka.ms/du2qsp

Note: Future releases of R Server are expected to enable SparkR and ScaleR to share data
within the same compute context.

Using ScaleR functions with sparklyr code


Sparklyr is an open source R package, developed by RStudio, that enables users to manipulate Spark data frames directly using the dplyr API. The sparklyr package lets you treat Spark data frames in a very similar way to R data frames and make use of the dplyr data manipulation verbs like filter, mutate, select and inner_join.

Although R Server does not yet provide native access to the SparkR context from the ScaleR functions, it does provide interoperability with sparklyr. This means that you only need to keep one compute context open to make use of the powerful data manipulation capabilities of dplyr on Spark alongside the modeling and advanced high performance analytics functions in ScaleR.
Note that, although you can use the same compute context for your ScaleR and sparklyr code, the ScaleR
analysis functions can still only work with data in one of the four data sources listed previously (text files,
XDF, Hive and Parquet) and not directly with Spark data frames. However, it is simple to use the ScaleR
data source functions to convert to one of these data sources.

To integrate an RxSpark compute context with a sparklyr session, you specify the interop argument. You
then register a new sparklyr session within that compute context:

Creating a new sparklyr session


library(sparklyr)

con <- rxSparkConnect(reset = TRUE,
                      executorCores = 2,
                      executorMem = "1g",
                      driverMem = "1g",
                      interop = "sparklyr")

sc <- rxGetSparklyrConnection(con)

You can then run sparklyr functions in the compute context.

The following example uses the diamonds dataset from the ggplot2 package. The code uses dplyr
functions to copy the local diamonds dataset onto the cluster as a Spark data frame and then to perform
some data manipulation and partition into test and training sets. When the manipulation code is run, you
need to “register” the data frames. This forces the execution of the data manipulation code, which is
needed because Spark operates “lazily” and doesn’t perform any computations until it is explicitly
instructed to do so. Without doing this, ScaleR would not be able to read the data frames into a data
source.

Using dplyr to filter and mutate data in a sparklyr session


library(dplyr)

diamonds <- ggplot2::diamonds
diamonds_tbl <- copy_to(sc, ggplot2::diamonds)

trainTest <- diamonds_tbl %>%
    filter(price >= 2000) %>%
    mutate(ideal = cut == "Ideal") %>%
    select(carat, ideal, color, price) %>%
    sdf_partition(training = 0.9, test = 0.1, seed = 12345)

sdf_register(trainTest$training, "diamonds_train")
sdf_register(trainTest$test, "diamonds_test")

When the data is in the correct form for modeling, you can use the ScaleR functions to upload the data
into a persistent data source (such as Hive), and then use the rxLinMod function to run a linear regression
model over that data in Hive. Note that, when you are creating Hive tables, you need to specify the types
of the variables you are reading in:

Uploading data to Hive and running a regression model


d_train_hive <- RxHiveData(table = "diamonds_train",
                           colInfo = list(carat = list(type = "numeric"),
                                          ideal = list(type = "logical"),
                                          color = list(type = "factor"),
                                          price = list(type = "numeric")))

d_test_hive <- RxHiveData(table = "diamonds_test",
                          colInfo = list(carat = list(type = "numeric"),
                                         ideal = list(type = "logical"),
                                         color = list(type = "factor"),
                                         price = list(type = "numeric")))

lmMod <- rxLinMod(price ~ carat + ideal + color, data = d_train_hive)

For more information about using ScaleR and sparklyr together, see:

Learn how to use R Server with sparklyr

https://aka.ms/i9r9t7

Check Your Knowledge


Question

Which data source can you use to connect to a Hive database when using the
RxHadoopMR compute context?

Select the correct answer.

RxOdbcData

RxHiveData

RxHadoopData

You don't need to use a specific data source. You can start a sparklyr session to
read the Hive data.

RxSpark

Lab B: Incorporating Hadoop Map/Reduce and Spark functionality into the ScaleR workflow
Scenario
You want to analyze flight delays caused by airlines rather than by bad weather. Another Hadoop developer has created a Pig script that captures this information, and you want to examine and process the data using R. You want to save the results of the processing to a Hive database to enable other developers to perform their own ad-hoc queries. Finally, you want to extract specific information from this data to list the delays for a specific airline.

Objectives
In this lab, you will:

 Use R code in the Hadoop Map/Reduce compute context to run a Pig script that generates data and
use ScaleR functions to examine the results.

 Use ScaleR functions to store data in a Hive database.

 Run sparklyr code in the Spark compute context to retrieve data from the Hive database, and filter
this data.

Lab Setup
Estimated Time: 30 minutes

 Username: Adatum\AdatumAdmin
 Password: Pa55w.rd

Before starting this lab, ensure that the following VMs are all running:

 MT17B-WS2016-NAT
 20773A-LON-DC

 20773A-LON-DEV

 20773A-LON-RSVR
 20773A-LON-SQLR

Exercise 1: Using Pig with ScaleR functions


Scenario
The Pig script summarizes flight delay data held in CSV format, and generates a dataset that contains the
origin airport, destination, airline code, and the duration of any delays due to the airline (carrier delays
and late aircraft delays). This script also retrieves the airline name from another CSV file which it joins with
the flight delay data. You want to generate some charts that illustrate how the different airlines compare
for delays.

The main tasks for this exercise are as follows:

1. Examine the Pig script

2. Upload the Pig script and flight delay data to HDFS

3. Run the Pig script and examine the results

4. Convert the results to XDF format and add field names

5. Create graphs to visualize the airline delay data



6. Investigate the mean airline delay by route

 Task 1: Examine the Pig script


1. Log on to the LON-DEV VM as Adatum\AdatumAdmin with the password Pa55w.rd.

2. Using WordPad, open the file carrierDelays.pig in the E:\Labfiles\Lab08 folder.

This file is the Pig script that you will run. This script performs the following tasks:

 It reads a CSV file named carriers.csv held in the /user/RevoShare/loginName directory in HDFS
(where loginName is your login name). This file contains airline codes and their names.

 It reads another file called FlightDelayDataSample.csv which contains flight delay information.
This file contains a subset of the information about flight delays that you have been using
throughout the course.

 It filters the flight delay information to find all delays that have a positive value in the
LateAircraftDelay or CarrierDelay fields. These fields indicate delays that are caused by the
airline.

 It joins this data with the airline name in the carriers.csv file.

 It writes a dataset containing the origin airport, destination, airline code, airline name, carrier
delay, and late aircraft delay fields to a file in Pig storage (located in the
/user/RevoShare/loginName/results directory in HDFS).

3. Edit the first line of the script, and change the text {specify login name} to your login name, as
shown in the following example:

%declare loginName 'student01'

4. Save the file and close Wordpad.

 Task 2: Upload the Pig script and flight delay data to HDFS
1. Start your R development environment of choice (Visual Studio, or RStudio), and create a new R file.
2. Create an RxHadoopMR compute context. Specify a login name for the sshUsername argument, set the sshHostname to "LON-HADOOP", and set the consoleOutput argument to TRUE (this is so you can view the messages that Hadoop displays).

3. Use the rxHadoopCopyFromClient function to copy the


E:\Labfiles\Lab08\FlightDelayDataSample.csv file to the /user/RevoShare/studentnn directory in
HDFS (replace studentnn with your login name).

Note: You might also want to remove any existing file with the same name in the HDFS
folder first. You can use the rxHadoopRemove function to do this.

4. Copy the E:\Labfiles\Lab08\carriers.csv file to the /user/RevoShare/studentnn directory in HDFS.

5. Close the RxHadoopMR compute context, and use the remoteLogin function to create a remote
connection to R Server running on the Hadoop VM. The deployr_endpoint argument should be
http://fqdn:12800, where fqdn is the URL of the Hadoop VM in Azure (for example, LON-HADOOP-
01.ukwest.cloudapp.azure.com). The username is admin, and the password is Pa55w.rd.

6. Load the RevoScaleR library in the remote session.

7. Pause the remote session, use the putLocalFile function to copy the carrierDelays.pig file to the
remote session, and then resume the remote session.

 Task 3: Run the Pig script and examine the results


1. In the remote session, use the system function to run the Pig script as follows:

result <- system("pig carrierDelays.pig", intern = TRUE)

If the command fails, you can examine the reason code for the failure in the result variable.

2. The Pig script saves the output to a directory named results in HDFS. Use the following code to verify
that this directory has been created. Replace studentnn with your login name:

rxHadoopCommand("fs -ls -R /user/RevoShare/studentnn", intern = TRUE)

3. Verify that the results directory contains two files: _SUCCESS and part-r-00000. The data is actually in
the part-r-00000 file. The _SUCCESS file should be empty; it is simply a flag created by the Pig script
to indicate that the data was successfully saved.

4. Use the rxHadoopRemove function to delete the _SUCCESS file from the results folder.

5. Use the rxGetVarInfo function to examine the structure of the data in the results directory. To do
this, switch to the HDFS file system, and create an RxTextData data source that references the
results directory (not the part-r-00000 file).

Note that the data does not include any useful field names in the schema information.

 Task 4: Convert the results to XDF format and add field names
1. Create a colInfo list that can be used by the rxImport function to add schema information to a file as
it is imported. Map the fields in the existing results file as follows:

 V1: type = "factor", newName = "Origin"

 V2: type = "factor", newName = "Dest"

 V3: type = "factor", newName = "AirlineCode"

 V4: type = "character", newName = "AirlineName"

 V6: type = "numeric", newName = "CarrierDelay"

 V7: type = "numeric", newName = "LateAircraftDelay"

2. Use the rxImport function to create an XDF file from the results data using the column information
mapping you just defined. Save the XDF file as the CarrierData composite XDF file in your directory
under /user/RevoShare in HDFS.

The new file should contain 120,827 rows.

3. While in the remote session, create a new RxHadoopMR compute context. Note that you do not
need to specify a host name as the context will run on the same computer as the remote session.

4. Run the rxGetVarInfo and rxSummary functions to verify the new mappings in the CarrierData XDF
file. Note that the rxSummary function is run as a Map/Reduce job.

 Task 5: Create graphs to visualize the airline delay data


1. Generate a histogram that shows the duration of airline delays on the X axis and the number of times
these delays occur on the Y axis. The duration of an airline delay is the sum of the carrier delay and
late aircraft delay fields for each observation. Limit the delays reported to those of 300 minutes or
less.
Note that the RxHadoopMR compute context does not support transformations as part of the rxHistogram function, so you should create a temporary data frame that contains the transformed data first, and use this data frame to construct the histogram.

2. Generate another histogram that shows the number of delayed flights by airline code. Use the XDF
file rather than the data frame for this graph.

Note that you might need to increase the resolution of the plot window to display the codes for all
airlines. You can do this using the png function. However, you must execute the png function and
the rxHistogram function together and not as separate commands in a remote session.

3. Generate a bar chart showing the total delay time across all flights for each airline. Display the airline
name rather than the airline code.

Note that this graph requires you to use ggplot with the geom_bar function rather than
rxHistogram.

Results: At the end of this exercise, you will have run a Pig script from R, and analyzed the data that the
script produces.

Question: According to the histogram that displays delays against frequency, what is the
most common delay period across all airlines?

Question: Using the second histogram, which airline has had the most delayed flights?

 Task 6: Investigate the mean airline delay by route


1. Create a data cube that calculates the mean delay for each airline by route (a route is a combination
of the origin and destination):

delayData <- rxCube(AverageDelay ~ AirlineCode:Origin:Dest, carrierData,
                    transforms = list(AverageDelay = CarrierDelay + LateAircraftDelay),
                    means = TRUE,
                    na.rm = TRUE, removeZeroCounts = TRUE)

2. Use the rxSort function to sort the cube in descending order of the Counts and AverageDelay
variables in the cube.

Note that the rxSort function is not inherently distributable, so you cannot use it directly in the
RxHadoopMR environment. However, you can use the rxExec function to run it if you set the
timesToRun argument of rxExec to 1.

3. Display the top 50 (the worst routes for delays), and the bottom 50 (the best routes).

Note that the value returned by rxExec is a list of results named rxElem1, rxElem2, and so on; one
item for each task performed in parallel. There was only one task (timesToRun was set to 1), so you
can find all the data in the rxElem1 field.

4. Save the sorted data cube to the CSV file SortedDelayData.csv in your directory under the
/user/RevoShare directory in HDFS. Perform this operation using the local compute context rather
than Hadoop.

The file should contain 8,852 rows.

Question: Which route has the most frequent airline delays? How long is the average airline
delay on this route?

Exercise 2: Integrating ScaleR code with Spark and Hive


Scenario
You want to upload the sorted delay data to Hive to enable other users to examine it in a more ad-hoc manner. You then want to use a sparklyr session to perform your own simple analytics over this data.

The main tasks for this exercise are as follows:

1. Upload the sorted delay data to Hive

2. Use Hive to perform ad-hoc queries over the data


3. Use a sparklyr session to analyze data

 Task 1: Upload the sorted delay data to Hive


1. In your remote R session, create an RxSpark compute context with the following parameters:

 sshUsername: your login name


 consoleOutput: TRUE

 numExecutors: 10

 executorCores: 2
 executorMem : "1g"

 driverMem = "1g"

Note that it is important to set the resource parameters of the RxSpark session appropriately,
otherwise you risk grabbing all the resources available and starving other concurrent users.

2. You will upload the data to a table named studentnnRouteDelays in Hive, where studentnn is your
login name. Create an RxHiveData data source that you can use to reference this table.

3. Use the rxDataStep function to upload the data. The inData argument should reference the CSV file containing the sorted data cube, and the outFile argument should specify the Hive data source.

4. Use the rxSummary function over the Hive data source to verify that the data was uploaded
successfully.

 Task 2: Use Hive to perform ad-hoc queries over the data


1. Open a command prompt window, and run the putty command.

2. In the PuTTY Configuration window, select the LON-HADOOP session, click Load, and then click
Open. You should be logged in to the Hadoop VM.

3. In the PuTTY terminal window, run the following command:

hive

4. At the hive> prompt, run the following command:

select * from studentnnroutedelays;

Replace studentnn with your login name.

Verify that 8852 rows are retrieved.



5. At the hive> prompt, run the following command:

select count(*), airlinecode from studentnnroutedelays group by airlinecode;

Replace studentnn with your login name.

This command lists each airline together with the number of delayed flights for that airline.

6. At the hive> prompt, run the following command to exit hive:

exit;

7. Close the PuTTY terminal window, and return to your R development environment.

 Task 3: Use a sparklyr session to analyze data


1. In your R session, load the sparklyr and dplyr libraries.

2. Close the current RxSpark compute context and create a new one using the rxSparkConnect
function. Use the same parameters as before, but in addition set the interop argument to "sparklyr".

3. Use the rxGetSparklyrConnection function to create a new sparklyr session.

4. Use the sparklyr src_tbls function to list the tables available in Hive. Note that you should see not
only your table, but also the tables of the other students.
5. Cache your own routedelays table in the Spark session (using the tbl_cache function), retrieve the
contents of this table (using the tbl function), and display the first few rows from this table (using the
head function).
6. Construct a dplyr pipeline that filters the data in the table that you previously retrieved using the tbl function to find all rows for American Airlines (code AA) that departed from New York JFK airport (code JFK). Only include the Dest and AverageDelay columns in the results, which you should save in a tibble (using collect).

7. Display the tibble, and use the rxSummary function to summarize the data in the tibble.

8. Use the rxSparkDisconnect function to terminate the sparklyr session and close the RxSpark
compute context.

9. Save the script as Lab8_2Script.R in the E:\Labfiles\Lab08 folder, and close your R development
environment.

Results: At the end of this exercise, you will have used R code running in an RxSpark compute context to
upload data to Hive, and then analyzed the data by using a sparklyr session running in the RxSpark
compute context.

Module Review and Takeaways


In this module, you learned how to:

 Use R in conjunction with SQL Server to analyze data held in a database.

 Incorporate Hadoop Map/Reduce functionality, together with Pig and Hive, into the ScaleR workflow.
 Utilize Hadoop Spark features in a ScaleR workflow.

Course Evaluation

Your evaluation of this course will help Microsoft understand the quality of your learning experience.

Please work with your training provider to access the course evaluation form.

Microsoft will keep your answers to this survey private and confidential and will use your responses to
improve your future learning experience. Your open and honest feedback is valuable and appreciated.

Module 1: Microsoft R Server and Microsoft R Client


Lab: Exploring Microsoft R Server and
Microsoft R Client
Exercise 1: Using R Client in RTVS and RStudio
 Task 1: Start the development environment and create a new R script
1. Log in to the LON-DEV VM as Adatum\AdatumAdmin with the password Pa55w.rd.

2. If you are using RTVS, perform the following steps:

a. Click the Windows Start button, type Visual Studio 2015, and then click Visual Studio 2015.

b. In Visual Studio 2015, on the R Tools menu, click Data Science Settings.

c. In the Microsoft Visual Studio message box, click Yes.


d. On the File menu, point to New, and then click File.

e. In the New File dialog box, in the Installed pane, click R.

f. In the center pane, click R Script, and then click Open.


3. If you are using RStudio, perform the following steps:

a. Click the Windows Start button, click the RStudio program group, and then click RStudio.

b. On the File menu, point to New File, and then click R Script.

 Task 2: Use the development environment to examine data


1. In the script editor, add the following statement to the R file and run it (press Ctrl + Enter) to set the
working directory:

setwd("E:\\Labfiles\\Lab01")

2. Add the following code to the R file and run it. These statements create a data frame from the
2000.csv file and display the first 10 rows:

flightDataCsv <- "2000.csv"


flightDataSampleDF <- read.csv(flightDataCsv)
head(flightDataSampleDF, 10)

3. Add the following code to the R file and run it. The mName function returns the month name given
the month number. The code uses the lapply function to generate the month name for each row in
the data frame. The factor function converts this data into a factor. The result is added to the data
frame as the MonthName column:

mName <- function(mNum) {


month.name[mNum]
}
flightDataSampleDF$MonthName <- factor(lapply(flightDataSampleDF$Month, mName),
levels = month.name)

4. Add the following code to the R file and run it. These statements summarize the data frame and time
how long the operation takes before displaying the results:

system.time(delaySummary <- summary(flightDataSampleDF))


print(delaySummary)

5. Add the following code to the R file and run it. These statements display the name of each column in
the data frame and the number of rows. The code then finds the minimum and maximum flight
arrival delay times:

print(names(flightDataSampleDF))
print(nrow(flightDataSampleDF))
print(min(flightDataSampleDF$ArrDelay, na.rm = TRUE))
print(max(flightDataSampleDF$ArrDelay, na.rm = TRUE))

6. Add the following code to the R file and run it. This code cross tabulates the month name against the
number of flights cancelled and not cancelled:

print(xtabs(~MonthName + as.factor(Cancelled == 1), flightDataSampleDF))

Note: Record the console output, as this will be referenced in a later exercise.

Results: At the end of this exercise, you will have used either RTVS or RStudio to examine a subset of the
flight delay data for the year 2000.

Exercise 2: Exploring ScaleR functions


 Task 1: Examine the data by using ScaleR functions
1. Add the following code to the R file and run it. These statements summarize the data frame using the
rxSummary function and time how long the operation takes:

system.time(rxDelaySummary <- rxSummary(~., flightDataSampleDF))


print(rxDelaySummary)

2. Add the following statement to the R file and run it. This statement retrieves the number of variables
and observations from the data frame:

print(rxGetInfo(flightDataSampleDF))

3. Add the following statement to the R file and run it. This statement retrieves the details for each
variable in the data frame:

print(rxGetVarInfo(flightDataSampleDF))

4. Add the following code to the R file and run it. This statement calculates the quantiles for ArrDelay
variable in the data frame. The 0% quantile is the minimum value, and the 100% is the maximum
value:

print(rxQuantile("ArrDelay", flightDataSampleDF))

5. Add the following code to the R file and run it. This statement generates a cross tabulation on month
name against flight cancellations.

print(rxCrossTabs(~MonthName:as.factor(Cancelled == 1), flightDataSampleDF))

6. Add the following code to the R file and run it. This statement generates a cube of month name
against flight cancellations. The data should be the same as that for the cross tabulation. The
difference is the format in which it is returned:

print(rxCube(~MonthName:as.factor(Cancelled), flightDataSampleDF))

7. Add the following code to the R file and run it. This statement removes the data frame from session
memory:

rm(flightDataSampleDF)

Note: Please note the console output, as you will compare it with the results generated in a later exercise.

Results: At the end of this exercise, you will have used the ScaleR functions to examine the flight delay data for the year 2000, and compared the results against those generated by using the base R functions.

Exercise 3: Performing operations on a remote server


 Task 1: Copy the data to the remote server
1. Add the following code to the R file and run it. This statement creates a remote R session on the LON-
RSVR server:

remoteLogin(deployr_endpoint = "http://LON-RSVR.ADATUM.COM:12800", session = TRUE,


diff = TRUE, commandline = TRUE, username = "admin", password = "Pa55w.rd")

2. Add the following code to the R file and run it. This statement pauses the remote session and returns
you to the local session:

pause()

3. Add the following statement to the R file and run it. This statement copies the file 2000.csv to the
remote server:

putLocalFile(c("2000.csv"))

4. Add the following statement to the R file and run it. This statement returns you to the remote session:

resume()

 Task 2: Read the data into a data frame


1. Add the following code to the R file and run it:

flightDataCsv <- "2000.csv"


flightDataSampleDF <- read.csv(flightDataCsv)

2. Add the following code to the R file and run it:

mName <- function(mNum) {


month.name[mNum]
}
flightDataSampleDF$MonthName <- factor(lapply(flightDataSampleDF$Month, mName),
levels = month.name)

3. Add the following code to the R file and run it. Verify that the MonthName column appears in the output:

head(flightDataSampleDF, 10)

 Task 3: Examine the data remotely


1. Add the following code to the R file and run it. This code runs the rxSummary function, saves the
result in the rxRemoteDelaySummary variable, and then displays the data in this variable:

rxRemoteDelaySummary <- rxSummary(~., flightDataSampleDF)


print(rxRemoteDelaySummary)

2. Add the following code to the R file and run it:

rxRemoteInfo <- rxGetInfo(flightDataSampleDF)


print(rxRemoteInfo)

3. Add the following code to the R file and run it:

rxRemoteVarInfo <- rxGetVarInfo(flightDataSampleDF)


print(rxRemoteVarInfo)

4. Add the following code to the R file and run it:

rxRemoteQuantileInfo <- rxQuantile("ArrDelay", flightDataSampleDF)


print(rxRemoteQuantileInfo)

5. Add the following code to the R file and run it:

rxRemoteCrossTabInfo <- rxCrossTabs(~MonthName:as.factor(Cancelled == 1),


flightDataSampleDF)
print(rxRemoteCrossTabInfo)

6. Add the following code to the R file and run it:

rxRemoteCubeInfo <- rxCube(~MonthName:as.factor(Cancelled == 1), flightDataSampleDF)


print(rxRemoteCubeInfo)

 Task 4: Transfer the results from the remote session


1. Add the following statement that pauses the remote session to the R file and run it:

pause()

2. Add the following statement to the R file and run it. This statement copies the remote variables back
to the local session:

getRemoteObject(c("rxRemoteDelaySummary", "rxRemoteInfo", "rxRemoteVarInfo",


"rxRemoteQuantileInfo", "rxRemoteCrossTabInfo", "rxRemoteCubeInfo"))

3. Add the following statements to the R file and run them. This code displays the contents of the
variables you have just copied. They should match the values from exercise 2:

print(rxRemoteDelaySummary)
print(rxRemoteInfo)
print(rxRemoteVarInfo)
print(rxRemoteQuantileInfo)
print(rxRemoteCrossTabInfo)
print(rxRemoteCubeInfo)

4. Add the following statement to the R file and run it. This statement logs out of the remote session:

remoteLogout()

5. Save the script as Lab1Script.R in the E:\Labfiles\Lab01 folder, and close your R development
environment.

Results: At the end of this exercise, you will have used the ScaleR functions on a remote server to examine the flight delay data for the year 2000, and compared the results against those generated in the local session.

Module 2: Exploring Big Data


Lab: Exploring big data
Exercise 1: Importing and transforming CSV data
 Task 1: Import the data for the year 2000
1. Log in to the LON-DEV VM as Adatum\AdatumAdmin with the password Pa55w.rd.

2. Start your R development environment of choice (Visual Studio, or RStudio), and create a new R file.

3. Add the following statement to the R file and run it to set the working directory:

setwd("E:\\Labfiles\\Lab02")

4. Add and run the following statements to import the first 10 rows of the 2000.csv file into a data
frame:

flightDataSampleCsv <- "2000.csv"


flightDataSample <- rxImport(flightDataSampleCsv, numRows = 10)

5. Add and run the following statement to view the structure of the data frame:

rxGetVarInfo(flightDataSample)

6. Add and run the following statement to create the flightDataColumns vector:

flightDataColumns <- c("Year" = "factor",


"DayofMonth" = "factor",
"DayOfWeek" = "factor",
"UniqueCarrier" = "factor",
"Origin" = "factor",
"Dest" = "factor",
"CancellationCode" = "factor"
)

7. Add and run the following statements to import the CSV data into the 2000.xdf file:

flightDataXdf <- "2000.xdf"


rxOptions(reportProgress = 1)
flightDataSampleXDF <- rxImport(inData = flightDataSampleCsv, outFile =
flightDataXdf, overwrite = TRUE, append = "none", colClasses = flightDataColumns)

8. Add and run the following statement to view the structure of the data in the XDF file:

rxGetVarInfo(flightDataXdf)

9. Open File Explorer and move to the E:\Labfiles\Lab02 folder. Verify that the 2000.csv file is
approximately 155 MB in size, whereas the 2000.xdf file is just under 28 MB.

 Task 2: Compare the performance of the data files


1. Return to your R environment.

2. Add and run the following statement to generate a summary across all numeric fields in the CSV file.
Make a note of the timings reported by the system.time function:

system.time(csvDelaySummary <- rxSummary(~., flightDataSampleCsv))



3. Add and run the following statement to generate the same summary for the XDF file. Compare the
timings reported by the system.time function against those for the CSV file:

system.time(xdfDelaySummary <- rxSummary(~., flightDataSampleXDF))

4. The timings for the XDF file should be significantly quicker than those of the CSV file. If you need to
satisfy yourself that both statements are performing the same task, add print statements that display
the values of the csvDelaySummary and xdfDelaySummary variables, as follows:

print(csvDelaySummary)
print(xdfDelaySummary)

5. Add and run the following statement to generate a cross-tabulation that summarizes cancellations by
month in the CSV file. Make a note of the timings:

system.time(csvCrossTabInfo <- rxCrossTabs(~as.factor(Month):as.factor(Cancelled ==


1), flightDataSampleCsv))

6. Add and run the following statement to generate the same cross-tabulation for the XDF file. Compare
the timings against those for the CSV file:

system.time(xdfCrossTabInfo <- rxCrossTabs(~as.factor(Month):as.factor(Cancelled ==


1), flightDataSampleXDF))

7. Add and run the following statement to generate a cube that summarizes cancellations by month in
the CSV file. Make a note of the timings:

system.time(csvCubeInfo <- rxCube(~as.factor(Month):as.factor(Cancelled),


flightDataSampleCsv))

8. Add and run the following statement to generate the same cube for the XDF file. Compare the
timings against those for the CSV file:

system.time(xdfCubeInfo <- rxCube(~as.factor(Month):as.factor(Cancelled),


flightDataSampleXDF))

9. Add and run the following statement to tidy up the workspace:

rm(flightDataSample, flightDataSampleXDF, csvDelaySummary, xdfDelaySummary,


csvCrossTabInfo, xdfCrossTabInfo, csvCubeInfo, xdfCubeInfo)

Results: At the end of this exercise, you will have created a new XDF file containing the airline delay data
for the year 2000, and you will have performed some operations to test its performance.

Exercise 2: Combining and transforming data


 Task 1: Copy the data files to R Server
1. On the LON-DEV VM, open a command prompt window.

2. Run the following commands:

net use \\LON-RSVR\Data


copy E:\Labfiles\Lab02\200?.csv \\LON-RSVR\Data

3. Verify that all the files are copied successfully.

 Task 2: Create a remote session


1. Return to your R environment.

2. Add the following statement to your R script, and run it. This statement creates a remote connection
to the LON-RSVR VM. When prompted, specify the username admin with the password Pa55w.rd:

remoteLogin("http://LON-RSVR.ADATUM.COM:12800", session = TRUE, diff = TRUE,


commandline = TRUE)

3. At the REMOTE> prompt, add and run the following command to temporarily pause the remote
session:

pause()

4. Add and run the following statement. This statement copies the local variable flightDataColumns to
the remote session:

putLocalObject(c("flightDataColumns"))

5. Add and run the following statement to resume the remote session:

resume()

6. Add and run the following statement. This statement lists the variables in the remote session. Verify
that the flightDataColumns variable is listed:

ls()

 Task 3: Import the flight delay data to an XDF file


1. Add and run the following statements. This code imports and transforms a subset of the data
comprising the first 1,000 rows from the 2000.csv file:

flightDataSampleXDF <- rxImport(inData = "\\\\LON-RSVR\\Data\\2000.csv",
                                outFile = "\\\\LON-RSVR\\Data\\Sample.xdf",
                                overwrite = TRUE, append = "none",
                                colClasses = flightDataColumns,
                                transforms = list(
                                    Delay = ArrDelay + DepDelay +
                                        ifelse(is.na(CarrierDelay), 0, CarrierDelay) +
                                        ifelse(is.na(WeatherDelay), 0, WeatherDelay) +
                                        ifelse(is.na(NASDelay), 0, NASDelay) +
                                        ifelse(is.na(SecurityDelay), 0, SecurityDelay) +
                                        ifelse(is.na(LateAircraftDelay), 0, LateAircraftDelay),
                                    MonthName = factor(month.name[as.numeric(Month)],
                                                       levels = month.name)),
                                rowSelection = (Cancelled == 0),
                                varsToDrop = c("FlightNum", "TailNum", "CancellationCode"),
                                numRows = 1000
)

2. Add and run the following statement to examine the first few rows in the XDF file:

head(flightDataSampleXDF, 100)

3. Using File Explorer, delete the file Sample.xdf from the \\LON-RSVR\Data share.

4. In your R environment, add and run the following statements. This code imports and transforms all of
the files in the \\LON-RSVR\Data share:

rxOptions(reportProgress = 1)
delayXdf <- "\\\\LON-RSVR\\Data\\FlightDelayData.xdf"
flightDataCsvFolder <- "\\\\LON-RSVR\\Data\\"
flightDataXDF <- rxImport(inData = flightDataCsvFolder, outFile = delayXdf, overwrite
= TRUE, append = ifelse(file.exists(delayXdf), "rows", "none"), colClasses =
flightDataColumns,
transforms = list(
Delay = ArrDelay + DepDelay + ifelse(is.na(CarrierDelay),
0, CarrierDelay) + ifelse(is.na(WeatherDelay), 0, WeatherDelay) +
ifelse(is.na(NASDelay), 0, NASDelay) + ifelse(is.na(SecurityDelay), 0, SecurityDelay)
+ ifelse(is.na(LateAircraftDelay), 0, LateAircraftDelay),
MonthName = factor(month.name[as.numeric(Month)],
levels=month.name)),
rowSelection = ( Cancelled == 0 ),
varsToDrop = c("FlightNum", "TailNum", "CancellationCode"),
rowsPerRead = 500000
)

5. Add and run the following statement to close the remote session:

exit

Results: At the end of this exercise, you will have created a new XDF file containing the cumulative
airline delay data for the years 2000 through 2008, and you will have performed some transformations
on this data.

Exercise 3: Incorporating data from SQL Server into an XDF file


 Task 1: Import the SQL Server data
1. Add the following statement to your R script and run it. This code creates a data source that connects
to the Airports table in the AirlineData database on the LON-SQLR server:

conString <- "Server=LON-SQLR;Database=AirlineData;Trusted_Connection=TRUE"


airportData <- RxSqlServerData(connectionString = conString, table = "Airports")

2. Add and run the following statement. This statement displays the first six rows from the Airports
table:

head(airportData)

3. Add and run the following statements. This code imports the data from the SQL Server database into
a data frame, and converts all string data to factors:

airportInfo <- rxImport(inData = airportData, stringsAsFactors = TRUE)



4. Add and run the following statement that displays the first six rows of the data frame. Verify that they
are the same as the original SQL Server data:

head(airportInfo)

 Task 2: Add state information to the flight delay XDF data


1. Add the following statement to your R script and run it. This statement creates a remote connection
to the LON-RSVR VM. When prompted, specify the username admin with the password Pa55w.rd:

remoteLogin("http://LON-RSVR.ADATUM.COM:12800", session = TRUE, diff = TRUE,


commandline = TRUE)

2. At the REMOTE> prompt, add and run the following command to temporarily pause the remote
session:

pause()

3. Add and run the following statement that copies the local airportInfo data frame to the remote
session:

putLocalObject(c("airportInfo"))

4. Add and run the following statement to resume the remote session:

resume()

5. Add and run the following statements that import the flight delay data and combine the state
information from the airport data:

enhancedDelayDataXdf <- "\\\\LON-RSVR\\Data\\EnhancedFlightDelayData.xdf"


flightDelayDataXdf <- "\\\\LON-RSVR\\Data\\FlightDelayData.xdf"
enhancedXdf <- rxImport(inData = flightDelayDataXdf, outFile = enhancedDelayDataXdf,
overwrite = TRUE, append = "none", rowsPerRead = 500000,
transforms = list(OriginState = stateInfo$state[match(Origin,
stateInfo$iata)],
DestState = stateInfo$state[match(Dest,
stateInfo$iata)]),
transformObjects = list(stateInfo = airportInfo)
)

6. Add and run the following statement that displays the first six rows of the XDF file. Verify that they
include the OriginState and DestState variables:

head(enhancedXdf)

Results: At the end of this exercise, you will have augmented the flight delay data with the state in
which the origin and destination airports are located.

Exercise 4: Refactoring data and generating summaries


 Task 1: Examine delay intervals by airport and state
1. In the remote session, add and run the following statement. This code defines an expression that
factorizes the Delay variable into the required set of intervals:

delayFactor <- expression(list(Delay = cut(Delay, breaks = c(0, 1, 30, 60, 120, 180,
181), labels = c("No delay", "Up to 30 mins", "30 mins - 1 hour", "1 hour to 2
hours", "2 hours to 3 hours", "More than 3 hours"))))

2. Add and run the following statements. The first statement generates a cross-tabulation that
summarizes the delay by origin airport. It uses the delayFactor expression to transform the Delay
variable. The second statement displays the results:

originAirportDelays <- rxCrossTabs(formula = ~ Origin:Delay, data = enhancedXdf,


transforms = delayFactor
)
print(originAirportDelays)

3. Add and run the following statements to generate and display the cross-tabulation of delays by
destination airport:

destAirportDelays <- rxCrossTabs(formula = ~ Dest:Delay, data = enhancedXdf,


transforms = delayFactor
)
print(destAirportDelays)

4. Add and run the following statements to generate and display the cross-tabulation of delays by
origin state:

originStateDelays <- rxCrossTabs(formula = ~ OriginState:Delay, data = enhancedXdf,


transforms = delayFactor
)
print(originStateDelays)

5. Add and run the following statements to generate and display the cross-tabulation of delays by
destination state:

destStateDelays <- rxCrossTabs(formula = ~ DestState:Delay, data = enhancedXdf,


transforms = delayFactor
)
print(destStateDelays)

6. Add and run the following command to close the remote session:

exit

 Task 2: Summarize delay intervals by airport and state


1. Add and run the following statements to build and install dplyrXdf:

install.packages("dplyr")
install.packages("devtools")
devtools::install_github("RevolutionAnalytics/dplyrXdf")
library(dplyr)
library(dplyrXdf)

2. Add and run the following code to create a data source that retrieves the required columns from the
XDF data:

enhancedDelayDataXdf <- "\\\\LON-RSVR\\Data\\EnhancedFlightDelayData.xdf"


essentialData <-RxXdfData(enhancedDelayDataXdf, varsToKeep = c("Delay", "Origin",
"Dest", "OriginState", "DestState"))

3. Add and run the following code. This code is a dplyrXdf pipeline that calculates the mean delay by
origin airport and sorts them in descending order. The airport with the longest delays will be at the
top:

originAirportStats <- filter(essentialData, !is.na(Delay)) %>%


select(Origin, Delay) %>%
group_by(Origin) %>%
summarise(mean_delay = mean(Delay), .method = 1) %>% # Use methods 1 or 2 only
arrange(desc(mean_delay)) %>%
persist("\\\\LON-RSVR\\Data\\temp.xdf")
head(originAirportStats, 100)

4. Add and run the following code that calculates the mean delay by destination airport and sorts them
in descending order:

destAirportStats <- filter(essentialData, !is.na(Delay)) %>%


select(Dest, Delay) %>%
group_by(Dest) %>%
summarise(mean_delay = mean(Delay), .method = 1) %>%
arrange(desc(mean_delay)) %>%
persist("\\\\LON-RSVR\\Data\\temp.xdf")
head(destAirportStats, 100)

5. Add and run the following code that calculates the mean delay by origin state and sorts them in
descending order:

originStateStats <- filter(essentialData, !is.na(Delay)) %>%


select(OriginState, Delay) %>%
group_by(OriginState) %>%
summarise(mean_delay = mean(Delay), .method = 1) %>%
arrange(desc(mean_delay)) %>%
persist("\\\\LON-RSVR\\Data\\temp.xdf")
head(originStateStats, 100)

6. Add and run the following code that calculates the mean delay by destination state and sorts them in
descending order:

destStateStats <- filter(essentialData, !is.na(Delay)) %>%


select(DestState, Delay) %>%
group_by(DestState) %>%
summarise(mean_delay = mean(Delay), .method = 1) %>%
arrange(desc(mean_delay)) %>%
persist("\\\\LON-RSVR\\Data\\temp.xdf")
head(destStateStats, 100)

7. Save the script as Lab2Script.R in the E:\Labfiles\Lab02 folder, and close your R development
environment.

Results: At the end of this exercise, you will have examined flight delays by origin and destination
airport and state.

Module 3: Visualizing Big Data


Lab: Visualizing data
Exercise 1: Visualizing data using the ggplot2 package
 Task 1: Import the flight delay data into a data frame
1. Log in to the LON-DEV VM as Adatum\AdatumAdmin with the password Pa55w.rd.

2. Start your R development environment of choice (Visual Studio or RStudio), and create a new R file.

3. Add the following statement to the R file and run it to set the working directory:

setwd("E:\\Labfiles\\Lab03")

4. Add the following statements to the R file and run them to create an XDF data source that references
the data file:

flightDelayDataXdf <- "FlightDelayData.xdf"


flightDelayData <- RxXdfData(flightDelayDataXdf)

5. Add the following statements to the R file and run them to import the sample data into a data frame:

rxOptions(reportProgress = 1)
delayPlotData <- rxImport(flightDelayData, rowsPerRead = 1000000,
varsToKeep = c("Distance", "Delay", "Origin", "OriginState"),
rowSelection = (Distance > 0) & as.logical(rbinom(n =
.rxNumRows, size = 1, prob = 0.02))
)
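
Note: The rowSelection expression uses rbinom to keep each row with a probability of 0.02, which yields an approximate 2 percent random sample. A minimal sketch of the mechanism, using a hypothetical vector length and seed:

set.seed(123)                                      # hypothetical seed, for repeatability
keep <- as.logical(rbinom(n = 1000, size = 1, prob = 0.02))
sum(keep)                                          # roughly 20 of the 1,000 rows are kept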

 Task 2: Generate line plots to visualize the delay data


1. Add and run the following statements. This code installs the tidyverse package and brings the
tidyverse library into scope:

install.packages("tidyverse")
library(tidyverse)

2. Add and run the following statement. This code creates a scatter plot of flight distance on the x axis
against delay time on the y axis:

ggplot(data = delayPlotData) +
geom_point(mapping = aes(x = Distance, y = Delay)) +
xlab("Distance (miles)") +
ylab("Delay (minutes)")

3. Add and run the following statement. This code creates a line plot of flight distance on the x axis
against delay time on the y axis:

delayPlotData %>%
filter(!is.na(Delay) & (Delay >= 0) & (Delay <= 1000)) %>%
ggplot(mapping = aes(x = Distance, y = Delay)) +
xlab("Distance (miles)") +
ylab("Delay (minutes)") +
geom_point(alpha = 1/50) +
geom_smooth(color = "red")

4. Add and run the following statement. This code creates a faceted plot organized by departure state:

delayPlotData %>%
filter(!is.na(Delay) & (Delay >= 0) & (Delay <= 1000)) %>%
ggplot(mapping = aes(x = Distance, y = Delay)) +
xlab("Distance (miles)") +
ylab("Delay (minutes)") +
geom_point(alpha = 1/50) +
geom_smooth(color = "red") +
theme(axis.text = element_text(size = 6)) +
facet_wrap( ~ OriginState, nrow = 8)

Results: At the end of this exercise, you will have used the ggplot2 package to generate line plots that
depict flight delay times as a function of distance traveled and departure state.

Exercise 2: Examining relationships in big data using the rxLinePlot


function
 Task 1: Plot delay as a percentage of flight time
1. Add and run the following code. This code creates the XDF file to be used for the graphs:

delayDataWithProportionsXdf <- "FlightDelayDataWithProportions.xdf"


delayPlotDataXdf <- rxImport(flightDelayData, outFile = delayDataWithProportionsXdf,
overwrite = TRUE, append ="none", rowsPerRead = 1000000,
varsToKeep = c("Distance", "ActualElapsedTime", "Delay",
"Origin", "Dest", "OriginState", "DestState", "ArrDelay", "DepDelay", "CarrierDelay",
"WeatherDelay", "NASDelay", "SecurityDelay", "LateAircraftDelay"),
rowSelection = (Distance > 0) & (Delay >= 0) & (Delay <=
1000) & !is.na(ActualElapsedTime) & (ActualElapsedTime > 0),
transforms = list(DelayPercent = (Delay /
ActualElapsedTime) * 100)
)

2. Add and run the following code. This code creates a cube summarizing the data of interest from the
XDF file:

delayPlotCube <- rxCube(DelayPercent ~ F(Distance):OriginState, data =


delayPlotDataXdf,
rowSelection = (DelayPercent <= 100)
)

3. Add and run the following statement. This code changes the name of the first column in the cube to
Distance (it was F_Distance):

names(delayPlotCube)[1] <- "Distance"

4. Add and run the following statement. This code creates a data frame from the cube:

delayPlotDF <- rxResultsDF(delayPlotCube)



5. Add and run the following statement. This code uses the rxLinePlot function to generate a scatter
plot of DelayPercent versus Distance:

rxLinePlot(DelayPercent~Distance, data = delayPlotDF, type="p",


title = "Flight delay as a percentage of flight time against distance",
xTitle = "Distance (miles)",
yNumTicks = 10,
yTitle = "Delay %",
symbolStyle = ".",
symbolSize = 2,
symbolColor = "red",
scales = (list(x = list(draw = FALSE)))
)

 Task 2: Plot regression lines for the delay data


1. Add and run the following statement. This code repeats the previous plot but adds a smoothed line
overlay:

rxLinePlot(DelayPercent~Distance, data = delayPlotDF, type=c("p", "smooth"),


title = "Flight delay as a percentage of flight time against distance",
xTitle = "Distance (miles)",
yNumTicks = 10,
yTitle = "Delay %",
symbolStyle = ".",
symbolSize = 2,
symbolColor = "red",
scales = (list(x = list(draw = FALSE)))
)

2. Add and run the following statement. This code facets the plot by OriginState:

rxLinePlot(DelayPercent~Distance | OriginState, data = delayPlotDF, type="smooth",


title = "Flight delay as a percentage of flight time against distance, by
state",
xTitle = "Distance (miles)",
yTitle = "Delay %",
symbolStyle = ".",
symbolColor = "red",
scales = (list(x = list(draw = FALSE)))
)

 Task 3: Visualize delays by the day of the week


1. Add and run the following code. This code creates the XDF file to be used to plot delays by the day of
the week:

delayDataWithDayXdf <- "FlightDelayWithDay.xdf"


delayPlotDataWithDayXdf <- rxImport(flightDelayData, outFile = delayDataWithDayXdf,
overwrite = TRUE, append ="none", rowsPerRead =
1000000,
varsToKeep = c("Delay", "CRSDepTime",
"DayOfWeek"),
transforms = list(CRSDepTime =
cut(as.numeric(CRSDepTime), breaks = 48)),
rowSelection = (Delay >= 0) & (Delay <= 180)
)

2. Add and run the following code. This statement refactors the DayOfWeek variable in the XDF data:

delayPlotDataWithDayXdf <- rxFactors(delayPlotDataWithDayXdf, outFile =


delayDataWithDayXdf,
overwrite = TRUE, blocksPerRead = 1,
factorInfo = list(DayOfWeek = list(newLevels =
c(Mon = "1", Tue = "2", Wed = "3", Thu = "4", Fri = "5", Sat = "6", Sun = "7"),
varName =
"DayOfWeek"))
)

3. Add and run the following code. This statement creates a cube that summarizes the data:

delayDataWithDayCube <- rxCube(Delay ~ CRSDepTime:DayOfWeek, data =


delayPlotDataWithDayXdf)

4. Add and run the following code. This statement creates a data frame from the cube:

delayPlotDataWithDayDF <- rxResultsDF(delayDataWithDayCube)

5. Add and run the following statement that generates a line plot of delay against the day of the week.
Note that you may have to move the Plots window to view the graph clearly:

rxLinePlot(Delay~CRSDepTime|DayOfWeek, data = delayPlotDataWithDayDF, type=c("p",


"smooth"),
lineColor = "blue",
symbolStyle = ".",
symbolSize = 2,
symbolColor = "red",
title = "Flight delay, by departure day and time",
xTitle = "Departure time",
yTitle = "Delay (mins)",
xNumTicks = 24,
scales = (list(y = list(labels = c("0", "20", "40", "60", "80", "100",
"120", "140", "160", "180")),
x = list(rot = 90),
labels = c("Midnight", "", "", "", "02:00", "",
"", "", "04:00", "", "", "", "06:00", "", "", "", "08:00", "", "", "", "10:00", "",
"", "", "Midday", "", "", "", "14:00", "", "", "", "16:00", "", "", "", "18:00", "",
"", "", "20:00", "", "", "", "22:00", "", "", "")))
)

Results: At the end of this exercise, you will have used the rxLinePlot function to generate line plots that
depict flight delay times as a function of flight time and day of the week.

Exercise 3: Creating histograms over big data


 Task 1: Create histograms to visualize the frequencies of different causes of delay
1. Add and run the following code. This code creates the XDF file to be used to plot flight delays by
state and by month:

delayReasonDataXdf <- "FlightDelayReasonData.xdf"


delayReasonData <- rxImport(flightDelayData, outFile = delayReasonDataXdf,
overwrite = TRUE, append ="none", rowsPerRead = 1000000,
varsToKeep = c("OriginState", "Delay", "ArrDelay",
"WeatherDelay", "MonthName"),
rowSelection = (Delay >= 0) & (Delay <= 1000) &
(ArrDelay >= 0) & (DepDelay >= 0)
)

2. Add and run the following code. This code creates a histogram that counts the frequency of arrival
delays:

rxHistogram(formula = ~ ArrDelay, data = delayReasonData,


histType = "Counts", title = "Total Arrival Delays",
xTitle = "Arrival Delay (minutes)",
xNumTicks = 10)

3. Add and run the following code. This code creates a histogram that shows the frequency of arrival
delays as a percentage:

rxHistogram(formula = ~ ArrDelay, data = delayReasonData,


histType = "Percent", title = "Frequency of Arrival Delays",
xTitle = "Arrival Delay (minutes)",
xNumTicks = 10)

4. Add and run the following code. This code creates a histogram that shows the frequency of arrival
delays as a percentage, organized by state:

rxHistogram(formula = ~ ArrDelay | OriginState, data = delayReasonData,


histType = "Percent", title = "Frequency of Arrival Delays by State",
xTitle = "Arrival Delay (minutes)",
xNumTicks = 10)

5. Add and run the following code. This code creates a histogram that shows the frequency of weather
delays:

rxHistogram(formula = ~ WeatherDelay, data = delayReasonData,


histType = "Counts", title = "Frequency of Weather Delays",
xTitle = "Weather Delay (minutes)",
xNumTicks = 20, xAxisMinMax = c(0, 180), yAxisMinMax = c(0, 20000))

6. Add and run the following code. This code creates a histogram that shows the frequency of weather
delays organized by month:

rxHistogram(formula = ~ WeatherDelay | MonthName, data = delayReasonData,


histType = "Counts", title = "Frequency of Weather Delays by Month",
xTitle = "Weather Delay (minutes)",
xNumTicks = 10, xAxisMinMax = c(0, 180), yAxisMinMax = c(0, 3000))

7. Save the script as Lab3Script.R in the E:\Labfiles\Lab03 folder, and close your R development
environment.

Results: At the end of this exercise, you will have used the rxHistogram function to create histograms
that show the relative rates of arrival delay by state, and weather delay by month.

Module 4: Processing Big Data


Lab: Processing big data
Exercise 1: Merging the airport and flight delay datasets
 Task 1: Copy the data to the shared folder
1. Log in to the LON-DEV VM as Adatum\AdatumAdmin with the password Pa55w.rd.

2. Open a command prompt window.

3. Run the following commands:

net use \\LON-RSVR\Data


copy E:\Labfiles\Lab04\airportData.xdf \\LON-RSVR\Data
copy E:\Labfiles\Lab04\FlightDelayData.xdf \\LON-RSVR\Data

4. If the Overwrite message appears, type y, and then press Enter.

5. Verify that both files are copied successfully.

 Task 2: Refactor the data


1. Start your R development environment of choice (Visual Studio, or RStudio), and create a new R file.
2. Add the following code to the R file and run it. This statement creates a remote R session on the LON-
RSVR server:

remoteLogin(deployr_endpoint = "http://LON-RSVR.ADATUM.COM:12800", session = TRUE,


diff = TRUE, commandline = TRUE, username = "admin", password = "Pa55w.rd")

3. Add the following code to the R file and run it. This code retrieves the information for the iata
variable in the airportData XDF file and displays it. The code then performs this same operation for
the Origin and Dest variables in the FlightDelayData XDF file:

airportData = RxXdfData("\\\\LON-RSVR\\Data\\airportData.xdf")
flightDelayData = RxXdfData("\\\\LON-RSVR\\Data\\flightDelayData.xdf")
iataFactor <- rxGetVarInfo(airportData, varsToKeep = c("iata"))
print(iataFactor)
originFactor <- rxGetVarInfo(flightDelayData, varsToKeep = c("Origin"))
print(originFactor)
destFactor <- rxGetVarInfo(flightDelayData, varsToKeep = c("Dest"))
print(destFactor)

4. Add the following code to the R file and run it. This code creates a new set of factor levels using the
levels in the iata, Origin, and Dest variables:

refactorLevels <- unique(c(iataFactor$iata[["levels"]],


originFactor$Origin[["levels"]],
destFactor$Dest[["levels"]]))

5. Add the following code to the R file and run it. This code refactors the iata variable in the
airportData XDF file with the new factor levels:

rxOptions(reportProgress = 2)
refactoredAirportDataFile <- "\\\\LON-RSVR\\Data\\RefactoredAirportData.xdf"
refactoredAirportData <- rxFactors(inData = airportData, outFile =
refactoredAirportDataFile, overwrite = TRUE,
factorInfo = list(iata = list(newLevels =
refactorLevels))
)

6. Add the following code to the R file and run it. This code refactors the Origin and Dest variables in
the FlightDelayData XDF file with the new factor levels:

refactoredFlightDelayDataFile <- "\\\\LON-RSVR\\Data\\RefactoredFlightDelayData.xdf"


refactoredFlightDelayData <- rxFactors(inData = flightDelayData, outFile =
refactoredFlightDelayDataFile, overwrite = TRUE,
factorInfo = list(Origin = list(newLevels =
refactorLevels),
Dest = list(newLevels =
refactorLevels))
)

7. Add the following code to the R file and run it. This code displays the new factor levels for the iata,
Origin, and Dest variables. They should all be the same now:

iataFactor <- rxGetVarInfo(refactoredAirportData, varsToKeep = c("iata"))


print(iataFactor)
originFactor <- rxGetVarInfo(refactoredFlightDelayData, varsToKeep = c("Origin"))
print(originFactor)
destFactor <- rxGetVarInfo(refactoredFlightDelayData, varsToKeep = c("Dest"))
print(destFactor)

 Task 3: Merge the datasets


1. Add the following code to the R file and run it. This code renames the iata field in the refactored
airport data file to Origin:

names(refactoredAirportData)[[1]] <- "Origin"

2. Add the following code to the R file and run it. This code reblocks the airport data XDF file:

reblockedAirportDataFile <- "\\\\LON-RSVR\\Data\\reblockedAirportData.xdf"


reblockedAirportData <- rxDataStep(refactoredAirportData,
reblockedAirportDataFile, overwrite = TRUE
)

3. Add the following code to the R file and run it. This code uses the rxMerge function to merge the
two XDF files, performing an inner join over the Origin field:

mergedFlightDelayDataFile <- "\\\\LON-RSVR\\Data\\MergedFlightDelayData.xdf"


mergedFlightDelayData <- rxMerge(inData1 = refactoredFlightDelayData, inData2 =
reblockedAirportData,
outFile = mergedFlightDelayDataFile, overwrite =
TRUE,
type = "inner", matchVars = c("Origin"), autoSort =
TRUE,
varsToKeep2 = c("timezone", "Origin"),
newVarNames2 = c(timezone = "OriginTimeZone"),
rowsPerOutputBlock = 500000
)

4. Add the following code to the R file and run it. This code examines the structure of the new flight
delay data file that should now include the OriginTimeZone variable, and displays the first and last
few rows:

rxGetVarInfo(mergedFlightDelayData)
head(mergedFlightDelayData)
tail(mergedFlightDelayData)

Results: At the end of this exercise, you will have created a new dataset that combines information from
the flight delay data and airport information datasets.

Exercise 2: Transforming departure and arrival dates to UTC


 Task 1: Generate a sample of the flight delay data
1. Add the following code to the R file and run it. This code creates a new XDF file containing 0.5% of
the data in the flight delay data file (approximately 20,000 rows):

rxOptions(reportProgress = 1)
flightDelayDataSubsetFile <- "\\\\LON-RSVR\\Data\\flightDelayDataSubset.xdf"
flightDelayDataSubset <- rxDataStep(inData = mergedFlightDelayData,
outFile = flightDelayDataSubsetFile, overwrite =
TRUE,
rowSelection = rbinom(.rxNumRows, size = 1, prob
= 0.005)
)

2. Add the following code to the R file and run it. This code displays the metadata for the XDF file
containing the sample data:

rxGetInfo(flightDelayDataSubset, getBlockSizes = TRUE)

 Task 2: Transform the data


1. Add the following code to the R file and run it. This statement installs the lubridate package:

install.packages("lubridate")

2. Add the following code to the R file and run it. This code implements the standardizeTimes
transformation function:

standardizeTimes <- function (dataList) {


# Check to see whether this is a test chunk
if (.rxIsTestChunk) {
return(dataList)
}
# Create a new vector for holding the standardized departure time
# and add it to the list of variable values
departureTimeVarIndex <- length(dataList) + 1
dataList[[departureTimeVarIndex]] <- rep(as.numeric(NA), times = .rxNumRows)
names(dataList)[departureTimeVarIndex] <- "StandardizedDepartureTime"
# Do the same for standardized arrival time
arrivalTimeVarIndex <- length(dataList) + 1
dataList[[arrivalTimeVarIndex]] <- rep(as.numeric(NA), times = .rxNumRows)
names(dataList)[arrivalTimeVarIndex] <- "StandardizedArrivalTime"
departureYearVarIndex <- 1
departureMonthVarIndex <- 2
departureDayVarIndex <- 3
departureTimeStringVarIndex <- 4
elapsedTimeVarIndex <- 5
departureTimezoneVarIndex <- 6
# Iterate through the rows and add the standardized arrival and departure times
for (i in 1:.rxNumRows) {
# Get the local departure time details
departureYear <- dataList[[departureYearVarIndex]][i]
departureMonth <- dataList[[departureMonthVarIndex]][i]
departureDay <- dataList[[departureDayVarIndex]][i]
departureHour <- trunc(as.numeric(dataList[[departureTimeStringVarIndex]][i]) /
100)
departureMinute <- as.numeric(dataList[[departureTimeStringVarIndex]][i]) %% 100
departureTimeZone <- dataList[[departureTimezoneVarIndex]][i]
# Construct the departure date and time, including timezone
departureDateTimeString <- paste(departureYear, "-", departureMonth, "-",
departureDay, " ", departureHour, ":", departureMinute, sep="")
departureDateTime <- as.POSIXct(departureDateTimeString, tz = departureTimeZone)
# Convert to UTC and store it
standardizedDepartureDateTime <- format(departureDateTime, tz="UTC")
dataList[[departureTimeVarIndex]][i] <- standardizedDepartureDateTime
# Calculate the arrival date and time
# Do this by adding the elapsed time to the departure time
# The elapsed time is stored as the number of minutes (an integer)
elapsedTime = dataList[[5]][i]
standardizedArrivalDateTime <- format(as.POSIXct(standardizedDepartureDateTime) +
minutes(elapsedTime))
# Store it
dataList[[arrivalTimeVarIndex]][i] <- standardizedArrivalDateTime
}
# Return the data including the new variables
return(dataList)
}
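
Note: The heart of this function is the conversion of a local departure time to UTC, followed by adding the elapsed minutes to obtain the arrival time. A minimal sketch of that logic, using hypothetical date, time, and timezone values:

library(lubridate)
localDeparture <- as.POSIXct("2008-07-01 14:30", tz = "America/New_York")  # hypothetical flight
utcDeparture <- format(localDeparture, tz = "UTC")                         # "2008-07-01 18:30:00"
utcArrival <- format(as.POSIXct(utcDeparture, tz = "UTC") + minutes(75))   # plus a hypothetical 75-minute flight
utcDeparture
utcArrival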

3. Add the following code to the R file and run it. This code uses the rxDataStep function to perform
the transformation:

flightDelayDataTimeZonesFile <- "\\\\LON-RSVR\\Data\\flightDelayDataTimezones.xdf"


flightDelayDataTimeZones <- rxDataStep(inData = flightDelayDataSubset,
outFile = flightDelayDataTimeZonesFile,
overwrite = TRUE,
transformFunc = standardizeTimes,
transformVars = c("Year", "Month",
"DayofMonth", "DepTime", "ActualElapsedTime", "OriginTimeZone"),
transformPackages = c("lubridate")
)

4. Add the following code to the R file and run it. This code examines the transformed data file:

rxGetVarInfo(flightDelayDataTimeZones)
head(flightDelayDataTimeZones)
tail(flightDelayDataTimeZones)

Results: At the end of this exercise, you will have implemented a transformation function that adds
variables containing the standardized departure and arrival times to the flight delay dataset.

Exercise 3: Calculating cumulative average delays for each route


 Task 1: Sort the data
1. Add the following code to the R file and run it. This code sorts the flight delay data by departure date:

sortedFlightDelayDataFile <- "sortedFlightDelayData.xdf"


sortedFlightDelayData <- rxSort(inData = flightDelayDataTimeZones,
outFile = sortedFlightDelayDataFile, overwrite =
TRUE,
sortByVars = c("StandardizedDepartureTime")
)

2. Add the following code to the R file and run it. This code displays the first and last few lines of the
sorted file. Examine the data in the StandardizedDepartureTime variable:

head(sortedFlightDelayData)
tail(sortedFlightDelayData)

 Task 2: Calculate the cumulative average delays


1. Add the following code to the R file and run it. This code implements the
calculateCumulativeAverageDelays transformation function:

calculateCumulativeAverageDelays <- function (dataList) {


# Check to see whether this is a test chunk
if (.rxIsTestChunk) {
return(dataList)
}
# Add a new vector for holding the cumulative average delay
# and add it to the list of variable values
cumulativeAverageDelayVarIndex <- length(dataList) + 1
dataList[[cumulativeAverageDelayVarIndex]] <- rep(as.numeric(NA), times =
.rxNumRows)
names(dataList)[cumulativeAverageDelayVarIndex] <- "CumulativeAverageDelayForRoute"
originVarIndex <- 1
destVarIndex <- 2
delayVarIndex <- 3
# Retrieve the vector containing the cumulative delays recorded so far for each route
cumulativeDelays <- .rxGet("cumulativeDelays")
# Retrieve the vector containing the number of times each route has occurred so far
cumulativeRouteOccurrences <- .rxGet("cumulativeRouteOccurrences")
# Iterate through the rows and add the standardized arrival and departure times
for (i in 1:.rxNumRows) {
# Get the route and delay details
origin <- dataList[[originVarIndex]][i]
dest <- dataList[[destVarIndex]][i]
routeDelay <- dataList[[delayVarIndex]][i]
# Create a string that identifies the route
route <- paste(origin, dest, sep = "")
# Retrieve the current cumulative delay and number of occurrences for each route
delay <- cumulativeDelays[[route]]
occurrences <- cumulativeRouteOccurrences[[route]]
# Update the cumulative statistics
delay <- ifelse(is.null(delay), 0, delay) + routeDelay
occurrences <- ifelse(is.null(occurrences), 0, occurrences) + 1
# Work out the new running average delay for the route
cumulativeAverageDelay <- delay / occurrences
# Store the data and updated stats
dataList[[cumulativeAverageDelayVarIndex]][i] <- cumulativeAverageDelay
cumulativeDelays[[route]] <- delay
cumulativeRouteOccurrences[[route]] <- occurrences
}

# Save the lists containing the cumulative data so far


.rxSet("cumulativeDelays", cumulativeDelays)
.rxSet("cumulativeRouteOccurrences", cumulativeRouteOccurrences)
# Return the data including the new variable
return(dataList)
}

2. Add the following code to the R file and run it. This code uses rxDataStep to run the
calculateCumulativeAverageDelays transformation function:

flightDelayDataWithAveragesFile <- "\\\\LON-


RSVR\\Data\\flightDelayDataWithAverages.xdf"
flightDelayDataWithAverages <- rxDataStep(inData = sortedFlightDelayData,
outFile = flightDelayDataWithAveragesFile,
overwrite = TRUE,
transformFunc =
calculateCumulativeAverageDelays,
transformVars = c("Origin", "Dest",
"Delay"),
transformObjects = list(cumulativeDelays =
list(), cumulativeRouteOccurrences = list())
)

 Task 3: Verify the results


1. Add the following code to the R file and run it. This code examines the transformed data file:

rxGetVarInfo(flightDelayDataWithAverages)
head(flightDelayDataWithAverages)
tail(flightDelayDataWithAverages)

2. Add the following code to the R file and run it. This creates a scatter and regression plot showing the
cumulative average delay for flights from ATL to PHX (Atlanta to Phoenix):

rxLinePlot(CumulativeAverageDelayForRoute ~ as.POSIXct(StandardizedDepartureTime),
type = c("p", "r"),
flightDelayDataWithAverages,
rowSelection = (Origin == "ATL") & (Dest == "PHX"),
yTitle = "Cumulative Average Delay for Route",
xTitle = "Date"
)

3. Repeat step 2, replacing the rowSelection argument with each of the following values in turn (a
worked example for the first route appears after this list):

 rowSelection = (Origin == "SFO") & (Dest == "LAX")

 rowSelection = (Origin == "LAX") & (Dest == "SFO")

 rowSelection = (Origin == "DEN") & (Dest == "SLC")

 rowSelection = (Origin == "LGA") & (Dest == "ORD")

4. Save the script as Lab4Script.R in the E:\Labfiles\Lab04 folder, and close your R development
environment.

Results: At the end of this exercise, you will have sorted data, and created and tested another
transformation function.

Module 5: Parallelizing Analysis Operations


Lab: Parallelizing analysis operations
Exercise 1: Capturing flight delay times and frequencies
 Task 1: Create a shared folder
If you have not already created the \\LON-RSVR\Data share in an earlier lab, perform the following
steps:

1. Log in to the LON-RSVR VM as Adatum\AdatumAdmin with the password Pa55w.rd.

2. Using File Explorer, create a new folder named C:\Data.

3. In File Explorer, right-click the C:\Data folder, click Share with, and then click Specific people.
4. In the File Sharing dialog box, click the drop-down list, click Everyone, and then click Add.

5. In the lower pane, click the Everyone row, and set the Permission Level to Read/Write.

6. Click Share.

7. In the File Sharing dialog box, verify that the file share is named \\LON-RSVR\Data, and then click
Done.

 Task 2: Copy the data


1. Log in to the LON-DEV VM as Adatum\AdatumAdmin with the password Pa55w.rd.
2. Open a command prompt window.

3. At the command prompt, run the following commands:

net use \\LON-RSVR\Data


copy E:\Labfiles\Lab05\FlightDelayData.xdf \\LON-RSVR\Data

4. If the Overwrite message appears, type y, and then press Enter.

5. Verify that the files are copied successfully, and then close the command prompt window.

 Task 3: Create the class generator for the PemaFlightDelays class


1. Start your R development environment of choice (Visual Studio or RStudio), and create a new R
file.

2. Add the following statement to your R script, and run it. This statement installs the dplyr package:

install.packages("dplyr")

3. Add the following code to your R script, and run it. These statements bring the dplyr and
RevoPemaR libraries into scope:

library(dplyr)
library(RevoPemaR)

4. Add the following code to your R script, but do not run it yet. This code defines the
PemaFlightDelays class generator:

PemaFlightDelays <- setPemaClass(


Class = "PemaFlightDelays",
contains = "PemaBaseClass",
)

5. Add the following code to your R script, after the contains line, but before the closing brace. Do not
run it yet. This code adds the fields to the class:

fields = list(
totalFlights = "numeric",
totalDelays = "numeric",
origin = "character",
dest = "character",
airline = "character",
delayTimes = "vector",
results = "list"
),

6. Add the following code to your R script, after the fields list, but before the closing brace. Do not run
it yet. This code defines the initialize method:

methods = list(
initialize = function(originCode = "", destinationCode = "",
airlineCode = "", ...) {
'initialize fields'
callSuper(...)
usingMethods(.pemaMethods)
totalFlights <<- 0
totalDelays <<- 0
delayTimes <<- vector(mode="numeric", length=0)
origin <<- originCode
dest <<- destinationCode
airline <<- airlineCode
},

7. Add the following code to your R script, after the initialize method, but before the closing brace. Do
not run it yet. This code defines the processData method:

processData = function(dataList) {
'Generates a vector of delay times for specified variables in the current chunk
of data.'
data <- as.data.frame(dataList)
# If no origin was specified, default to the first value in the dataset
if (origin == "") {
origin <<- as.character(as.character(data$Origin[1]))
}
# If no destination was specified, default to the first value in the dataset
if (dest == "") {
dest <<- as.character(as.character(data$Dest[1]))
}
# If no airline was specified, default to the first value in the dataset
if (airline == "") {
airline <<- as.character(as.character(data$UniqueCarrier[1]))
}
# Use dplyr to filter by origin, dest, and airline
# update the number of flights
# select the Delay variable
# only include delayed flights in the results
data %>%
filter(Origin == origin, Dest == dest, UniqueCarrier == airline) %T>%
{totalFlights <<- totalFlights + length(.$Origin)} %>%
select(ifelse(is.na(Delay), 0, Delay)) %>%
filter(Delay > 0) ->
temp
# Store the result in the delayTimes vector
delayTimes <<- c(delayTimes, as.vector(temp[,1]))
totalDelays <<- length(delayTimes)
invisible(NULL)
},

8. Add the following code to your R script, after the processData method, but before the closing brace.
Do not run it yet. This code defines the updateResults method:

updateResults = function(pemaFlightDelaysObj) {
'Updates total observations and delayTimes vector from another PemaFlightDelays object.'
# Update the totalFlights and totalDelays fields
totalFlights <<- totalFlights + pemaFlightDelaysObj$totalFlights
totalDelays <<- totalDelays + pemaFlightDelaysObj$totalDelays
# Append the delay data to the delayTimes vector
delayTimes <<- c(delayTimes, pemaFlightDelaysObj$delayTimes)
invisible(NULL)
},

9. Add the following code to your R script, after the updateResults method, but before the closing
brace. Do not run it yet. This code defines the processResults method:

processResults = function() {
'Generates a list containing the results:'
' The first element is the number of flights made by the airline'
' The second element is the number of delayed flights'
' The third element is the list of delay times'
results <<- list("NumberOfFlights" = totalFlights,
"NumberOfDelays" = totalDelays,
"DelayTimes" = delayTimes)
return(results)
}
)

10. Highlight and run the code you have entered in the previous steps, starting at step 4, in this task.
Verify that no errors are reported.

 Task 4: Test the PemaFlightDelays class using a data frame


1. Add the following statement to your R script and run it. This statement creates an instance of the
PemaFlightDelays class in a variable named pemaFlightDelaysObj:

pemaFlightDelaysObj <- PemaFlightDelays()

2. Add the following statement to your R script, and run it. This statement creates a remote connection
to the LON-RSVR VM. Specify the username admin with the password Pa55w.rd when prompted:

remoteLogin("http://LON-RSVR.ADATUM.COM:12800", session = TRUE, diff = TRUE,


commandline = TRUE)

3. At the REMOTE> prompt, add and run the following command to temporarily pause the remote
session:

pause()

4. Add and run the following statement. This statement copies the local variable pemaFlightDelaysObj
to the remote session:

putLocalObject("pemaFlightDelaysObj")

5. Add and run the following statement to resume the remote session:

resume()

6. Add the following statements to your R script, and run them:

install.packages("dplyr")
library(dplyr)
library(RevoPemaR)

7. Add the following statements to your R script, and run them. These statements create a data frame
comprising the first 50,000 observations from the FlightDelayData.xdf file in the \\LON-RSVR\Data
share:

flightDelayDataFile <- ("\\\\LON-RSVR\\Data\\FlightDelayData.xdf")


flightDelayData <- RxXdfData(flightDelayDataFile)
testData <- rxDataStep(flightDelayData,
numRows = 50000)

8. Add the following statements to your R script, and run them. This code uses the pemaCompute
function to run the pemaFlightDelaysObj object to perform an analysis of flights from "ABE" to "PIT"
made by airline "US":

result <- pemaCompute(pemaObj = pemaFlightDelaysObj, data = testData, originCode =


"ABE", destinationCode = "PIT", airlineCode = "US")
print(result)

Verify that the results resemble the following data:

$NumberOfFlights
[1] 755
$NumberOfDelays
[1] 188
$DelayTimes
[1] 3 10 3 16 54 2 61 65 54 12 18 23 92 16 153 7 18 2
[19] 21 61 2 1 1 4 40 67 1 82 6 3 112 298 39 21 13 2
[37] 1 12 2 131 474 85 27 352 9 2 49 24 18 60 43 28 126 109
[55] 40 39 53 34 120 3 274 73 57 3 83 27 58 53 15 8 58 61
[73] 1 117 34 32 9 19 66 44 2 82 17 21 9 103 2 45 4 64
[91] 3 48 52 17 5 11 7 1 18 23 43 29 7 46 22 71 16 18
[109] 9 62 27 120 10 12 11 6 10 4 50 4 1 6 1 129 3 9
[127] 185 5 11 17 19 171 2 81 3 17 1 33 21 2 45 8 27 29
[145] 42 25 40 5 1 15 1 59 4 10 6 81 13 45 37 6 9 1
[163] 7 1 2 2 2 5 109 3 15 7 25 58 17 45 289 5 7 7
[181] 38 89 3 34 12 15 129 19

9. Add the following statements to your R script, and run them. This code displays the values of the
internal fields in the pemaFlightDelaysObj object:

print(pemaFlightDelaysObj$delayTimes)
print(pemaFlightDelaysObj$totalDelays)
print(pemaFlightDelaysObj$totalFlights)
print(pemaFlightDelaysObj$origin)
print(pemaFlightDelaysObj$dest)
print(pemaFlightDelaysObj$airline)

 Task 5: Use the PemaFlightDelays class to analyze XDF data


1. Add the following statements to your R script, and run them. This code creates a subset of the flight
delay data as an XDF file with a small block size:

flightDelayDataSubsetFile <- ("\\\\LON-RSVR\\Data\\FlightDelayDataSubset.xdf")


testData2 <- rxDataStep(flightDelayData, flightDelayDataSubsetFile,
overwrite = TRUE, numRows = 50000, rowsPerRead = 5000)

2. Add the following statements to your R script, and run them. This code performs the same analysis as
before, but using the XDF file:

result <- pemaCompute(pemaObj = pemaFlightDelaysObj, data = testData2, originCode =


"ABE", destinationCode = "PIT", airlineCode = "US")
print(result)

3. Verify that the results are the same as before (755 flights, with 188 delayed).

4. Add the following statements to your R script, and run them. This code analyzes the flights from
LAX to JFK operated by carrier DL, using the full flight delay XDF file:

result <- pemaCompute(pemaObj = pemaFlightDelaysObj, data = flightDelayData,


originCode = "LAX", destinationCode = "JFK", airlineCode = "DL")
print(result)

5. Examine the results. Note the number of flights made, the number of flights that were delayed, and
the length of the longest delay.
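
Note: These figures can be read straight out of the result list returned by processResults. A minimal sketch:

result$NumberOfFlights    # number of flights on the LAX-JFK route for carrier DL
result$NumberOfDelays     # number of those flights that were delayed
max(result$DelayTimes)    # length of the longest delay, in minutes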

6. Save the script as Lab5Script.R in the E:\Labfiles\Lab05 folder, and close your R development
environment.

Results: At the end of this exercise, you will have created and run a PEMA class that finds the number of
times flights that match a specified origin, destination, and airline are delayed—and how long each delay
was.

Module 6: Creating and Evaluating Regression Models


Lab: Creating and using a regression model
Exercise 1: Clustering flight delay data
 Task 1: Copy the data to the shared folder
1. Log on to the LON-DEV VM as Adatum\AdatumAdmin with the password Pa55w.rd.

2. Open a command prompt window.

3. Run the following commands:

net use \\LON-RSVR\Data


copy E:\Labfiles\Lab06\FlightDelayData.xdf \\LON-RSVR\Data

4. If the Overwrite message appears, type y, and then press Enter.


5. Verify that the file is copied successfully.

 Task 2: Examine the relationship between flight delays and departure times
1. Start your R development environment of choice (Visual Studio, or RStudio), and create a new R file.

2. Add the following code to the R file and run it. This statement creates a remote R session on the LON-
RSVR server:

remoteLogin(deployr_endpoint = "http://LON-RSVR.ADATUM.COM:12800", session = TRUE,


diff = TRUE, commandline = TRUE, username = "admin", password = "Pa55w.rd")

3. Add the following code to the R file and run it. This code creates a data file containing a random
sample of 10 percent of the flight delay data:

rxOptions(reportProgress = 1)
flightDelayData = RxXdfData("\\\\LON-RSVR\\Data\\flightDelayData.xdf")
sampleDataFile = "\\\\LON-RSVR\\Data\\flightDelayDatasample.xdf"
flightDelayDataSample <- rxDataStep(inData = flightDelayData,
outFile = sampleDataFile, overwrite = TRUE,
rowSelection = rbinom(.rxNumRows, size = 1, prob
= 0.10)
)

4. Add the following code to the R file and run it. This code displays a scatter plot with a regression line
showing how flight delays vary with departure time throughout the day:

rxLinePlot(formula = Delay~as.numeric(DepTime), data = flightDelayDataSample,


type = c("p", "r"), symbolStyle = c("."),
lineColor = "red",
xlab = "Departure Time",
ylab = "Delay (mins)",
xlim = c(0, 2400), ylim = c(-120, 1000)
)

5. Add the following code to the R file and run it. This code creates an expression that you can use to
factorize the departure times by hour:

depTimeFactor <- expression(list(DepTime = cut(as.numeric(DepTime),
                                 breaks = c(0, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 1100,
                                            1200, 1300, 1400, 1500, 1600, 1700, 1800, 1900, 2000, 2100,
                                            2200, 2300, 2400),
                                 labels = c("Midnight-1:00AM", "1:01-2:00AM", "2:01-3:00AM", "3:01-4:00AM",
                                            "4:01-5:00AM", "5:01-6:00 AM", "6:01-7:00AM", "7:01-8:00AM",
                                            "8:01-9:00AM", "9:01-10:00AM", "10:01-11:00AM", "11:01-Midday",
                                            "12:01-1:00PM", "1:01-2:00PM", "2:01-3:00PM", "3:01-4:00PM",
                                            "4:01-5:00PM", "5:01-6:00PM", "6:01-7:00PM", "7:01-8:00PM",
                                            "8:01-9:00PM", "9:01-10:00PM", "10:01-11:00PM", "11:01PM-Midnight"))))

6. Add the following code to the R file and run it. This code generates a histogram showing the number
of departures for each hour:

rxHistogram(formula = ~DepTime, data = flightDelayDataSample,


histType = "Counts",
xlab = "Departure Time",
ylab = "Number of Departures",
transforms = depTimeFactor)

 Task 3: Create clusters to model the flight delay data


1. Add the following code to the R file and run it. This code creates a cluster model of 12 clusters,
combining the DepTime and Delay variables in the sample data:

delayCluster <- rxKmeans(formula = ~DepTime + Delay,


data = flightDelayDataSample,
transforms = list(DepTime = as.numeric(DepTime)),
numClusters = 12
)

2. Add the following code to the R file and run it. This code calculates the ratio of the between clusters
sums of squares and the total sums of squares for this model. The value returned should be in the
high 80% to low 90% range:

delayCluster$betweenss / delayCluster$totss

3. Add the following code to the R file and run it. This code displays the cluster centers. There should be
12 rows, showing the value of DepTime and Delay used as the centroid values for each cluster:

delayCluster$centers

4. Add the following code to the R file and run it. These statements create a parallel compute context
and register the RevoScaleR parallel back end with the foreach package:

library(doRSR)
registerDoRSR()
# Maximize parallelism
rxSetComputeContext(RxLocalParallel())

5. Add the following code to the R file and run it. This block of code runs a foreach loop to generate
the cluster models and calculate the sums of squares ratio for each model:

numClusters <- 12
testClusters <- vector("list", numClusters)
# Create the cluster models
foreach (k = 1:numClusters) %dopar% {
testClusters[[k]] <<- rxKmeans(formula = ~DepTime + Delay,
data = flightDelayDataSample,
transforms = list(DepTime = as.numeric(DepTime)),
numClusters = k * 2
)
}

Note: At the time of writing, there was still some instability in R Server running on
Windows. Placing it under a high parallel load can cause it to close the remote session. If this
step fails and returns to the local session on R Client, run the following code, and then repeat
this step:

resume()
rxSetComputeContext(RxLocalSeq())

6. Add the following code to the R file and run it. This block of code calculates the sums of squares ratio
for each model:

ratio <- vector()


for (cluster in testClusters)
ratio <- c(ratio, cluster$betweenss / cluster$totss)

7. Add the following code to the R file and run it. This code generates a scatter plot that shows the
number of clusters on the X-axis and the sums of squares ratio on the Y-axis. The graph should
suggest that the optimal number of clusters is 18. This is the point at which additional clusters add
little value to the model:

plotData <- data.frame(num = c(1:numClusters), ratio)


rxLinePlot(ratio ~ num, plotData, type="p")
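
Note: If you prefer a numeric view of the same elbow, the following sketch tabulates the ratio against the actual cluster counts (each model uses k * 2 clusters) together with the marginal gain from each step; the gain should become very small once additional clusters stop adding value:

data.frame(clusters = c(1:numClusters) * 2,    # actual number of clusters in each model
           ratio = ratio,                      # between/total sums of squares ratio
           gain = c(NA, diff(ratio)))          # improvement over the previous model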

Results: At the end of this exercise, you will have determined the optimal number of clusters to create,
and built the appropriate cluster model.

Exercise 2: Fitting a linear model to clustered data


 Task 1: Create a linear regression model
1. Add the following code to the R file and run it. This code constructs a linear regression model,
describing how delay varies with departure time, using the selected cluster model:

k <- 9
clusterModel <- rxLinMod(Delay ~ DepTime, data =
as.data.frame(testClusters[[k]]$centers),
covCoef = TRUE)

 Task 2: Generate test data and make predictions


1. Add the following code to the R file and run it. This code creates the test dataset:

delayTestData <- rxDataStep(inData = flightDelayData,


varsToKeep = c("DepTime", "Delay"),
transforms = list(DepTime = as.numeric(DepTime)),
rowSelection = rbinom(.rxNumRows, size = 1, prob = 0.01)
)

2. Add the following code to the R file and run it. This code makes predictions about delays in the test
data using the linear model:

delayPredictions <- rxPredict(clusterModel, data = delayTestData,


computeStdErr = TRUE, interval = "confidence",
writeModelVars = TRUE
)

3. Add the following code to the R file and run it. This code displays the first 10 predictions from the
results. The Delay_Pred variables contain the predicted delay times, while the Delay variables show
the actual delays. Note that the Delay_Pred values are not close to the actual Delay values, but are
within the very broad confidence level for each prediction:

head(delayPredictions)

 Task 3: Evaluate the predictions


1. Add the following code to the R file and run it. This code creates a scatter plot of predicted delay
time against departure time. Note that, although the graph has a similar shape overall to the earlier
one, the predicted delays show a bias to the low end of values. The regression line is very similar to
that of the earlier graph:

rxLinePlot(formula = Delay~as.numeric(DepTime), data = delayPredictions,


type = c("p", "r"), symbolStyle = c("."),
lineColor = "red",
xlab = "Departure Time",
ylab = "Delay (mins)",
xlim = c(0, 2400), ylim = c(-120, 1000)
)

2. Add the following code to the R file and run it. This code creates a scatter plot that shows the
differences between the actual and predicted delays. This graph emphasizes the bias of values in the
predictions:

rxLinePlot(formula = Delay - Delay_Pred~as.numeric(DepTime), data = delayPredictions,


type = c("p"), symbolStyle = c("."),
xlab = "Departure Time",
ylab = "Difference between Actual and Predicted Delay",
xlim = c(0, 2400), ylim = c(-500, 1000)
)

Results: At the end of this exercise, you will have created a linear regression model using the clustered
data, and tested predictions made by this model.

Exercise 3: Fitting a linear model to a large dataset


 Task 1: Fit a linear regression model to the entire dataset
1. Add the following code to the R file and run it. This code fits a regression model over the entire flight
delay dataset:

regressionModel <- rxLinMod(Delay ~ as.numeric(DepTime), data = flightDelayData,


covCoef = TRUE)

2. Add the following code to the R file and run it. This statement uses the rxPredict function to make
predictions using the test dataset:

delayPredictionsFull <- rxPredict(regressionModel, data = delayTestData,


computeStdErr = TRUE, interval = "confidence",
writeModelVars = TRUE
)

3. Add the following code to the R file and run it. This code displays the first 10 predictions. Note that
the individual predictions are more accurate than before and that they have a tighter confidence
level. However, in some cases the confidence might be misplaced because the real delay frequently
falls outside this range:

head(delayPredictionsFull)

 Task 2: Compare the results of the regression models


1. Add the following code to the R file and run it. This code creates a scatter plot showing the predicted
delays against departure time. Note that this graph is very similar to that generated from the previous
model, and that the regression lines are almost identical:

rxLinePlot(formula = Delay~as.numeric(DepTime), data = delayPredictionsFull,


type = c("p", "r"), symbolStyle = c("."),
lineColor = "red",
xlab = "Departure Time",
ylab = "Delay (mins)",
xlim = c(0, 2400), ylim = c(-120, 1000)
)

2. Add the following code to the R file and run it. This code creates a scatter plot that shows the
differences between the actual and predicted delays. The graph still shows a bias, but it is much less
exaggerated than that created from the previous model:

rxLinePlot(formula = Delay - Delay_Pred~as.numeric(DepTime), data =


delayPredictionsFull,
type = c("p"), symbolStyle = c("."),
xlab = "Departure Time",
ylab = "Difference between Actual and Predicted Delay",
xlim = c(0, 2400), ylim = c(-500, 1000)
)

3. Save the script as Lab6Script.R in the E:\Labfiles\Lab06 folder, and close your R development
environment.

Results: At the end of this exercise, you will have created a linear regression model using the entire flight
delay dataset, and tested predictions made by this model.

Module 7: Creating and Evaluating Partitioning Models


Lab: Creating a partitioning model to make
predictions
Exercise 1: Fitting a DTree model and making predictions
 Task 1: Copy the data to the shared folder
1. Log on to the LON-DEV VM as Adatum\AdatumAdmin with the password Pa55w.rd.

2. Open a command prompt window.

3. Run the following commands:

net use \\LON-RSVR\Data


copy E:\Labfiles\Lab07\FlightDelayData.xdf \\LON-RSVR\Data

4. If the Overwrite message appears, type y, and then press Enter.

5. Verify that the file is copied successfully, and then close the command prompt window.

 Task 2: Split the data into test and training datasets


1. Start your R development environment of choice (Visual Studio, or RStudio), and create a new R file.

2. Add the following code to the R file and run it. This statement creates a remote R session on the LON-
RSVR server:

remoteLogin(deployr_endpoint = "http://LON-RSVR.ADATUM.COM:12800", session = TRUE,


diff = TRUE, commandline = TRUE, username = "admin", password = "Pa55w.rd")

3. Add the following code to the R file and run it. This code transforms the flight delay data with a new
variable named Dataset that indicates whether an observation should be used for testing or training
models. The code also removes the variables not required by the models:

flightDelayDataFile <- "\\\\LON-RSVR\\Data\\FlightDelayData.xdf"


flightDelayData <- RxXdfData(flightDelayDataFile)
partitionedFlightDelayDataFile <- "\\\\LON-
RSVR\\Data\\PartitionedFlightDelayData.xdf"
flightData <- rxDataStep(inData = flightDelayData,
outFile = partitionedFlightDelayDataFile, overwrite = TRUE,
transforms = list(
DepTime = as.numeric(DepTime),
ArrTime = as.numeric(ArrTime),
Dataset = factor(ifelse(runif(.rxNumRows) >= 0.05,
"train", "test"))),
varsToKeep = c("Delay", "Origin", "Month", "DayOfWeek",
"DepTime", "ArrTime",
"Dest", "Distance", "UniqueCarrier",
"OriginState", "DestState"))

4. Add the following code to the R file and run it. This code splits the data into two files based on the
value of the Dataset variable:

partitionFilesBaseName <- "\\\\LON-RSVR\\Data\\Partition"


flightDataSets <- rxSplit(flightData, splitByFactor = "Dataset",
outFilesBase = partitionFilesBaseName, overwrite = TRUE)

5. Add the following code to the R file and run it. This statement shows the number of observations in
each file. There should be approximately 19 times more rows in the train file than the test file:

lapply(flightDataSets, nrow)
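
Note: You can also confirm the split ratio directly; using max and min avoids assuming the order or names of the list that rxSplit returns:

counts <- unlist(lapply(flightDataSets, nrow))
max(counts) / min(counts)   # should be roughly 19 (0.95 / 0.05)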

 Task 3: Create the DTree model


1. Add the following code to the R file and run it. This statement creates a DTree for predicting flight
delays using the departure and arrival times, the month, and the day of the week. It fits the model
using the training data:

delayDTree <- rxDTree(formula = Delay ~ DepTime + ArrTime + Month + DayOfWeek,


data = names(flightDataSets[2]))

2. Add the following code to the R file and run it. This statement displays the structure of the decision
tree:

delayDTree

3. Add the following code to the R file and run it. This code switches back to the local R Client session
and copies the decision tree. The code then uses the createTreeView function of the RevoTreeView
package to visualize the DTree. The DTree should be displayed using Microsoft Edge:

pause()
getRemoteObject("delayDTree")
library(RevoTreeView)
plot(createTreeView(delayDTree))

Close Microsoft Edge, add the following code to the R file, and run it. This statement switches back to
the session on the R Server:

resume()

4. Add the following code to the R file and run it. This code generates a scree plot of the DTree, and
shows the complexity parameters table. You can see that the DTree has a lot of levels (more than
320), but only the first few make any significant decisions. The remaining levels are primarily
concerned with making the model fit the data at a detailed level. This is classic overfit:

plotcp(rxAddInheritance(delayDTree))
delayDTree$cptable

5. Add the following code to the R file and run it. This code prunes the DTree, and displays the
amended complexity parameters table, which should now be much reduced:

prunedDelayDTree <- prune.rxDTree(delayDTree, cp = 1.36e-03) # Replace cp with the value for 7 levels
prunedDelayDTree$cptable
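
Note: One way to find the cp value to substitute is to read it from the complexity parameters table. The sketch below assumes the table follows the usual rpart layout, with CP and nsplit columns, and uses 7 splits purely as an illustration:

cpTable <- as.data.frame(delayDTree$cptable)
# Pick the row whose number of splits is closest to the depth you want
cpTable[which.min(abs(cpTable$nsplit - 7)), "CP"]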

 Task 4: Make predictions using the DTree model


1. Add the following code to the R file and run it. This code creates a data frame containing the
variables from the test dataset required for making predictions:

testData <- rxDataStep(names(flightDataSets[1]),


varsToKeep = c("DepTime", "ArrTime", "Month", "DayOfWeek"))

2. Add the following code to the R file and run it. This statement runs predictions against the data frame
using the DTree:

predictData <- rxPredict(prunedDelayDTree, data = testData)

3. Add the following code to the R file and run it. This code summarizes the statistics for the predicted
delays and the actual delays for comparison purposes. The mean values should be close to each
other, although the other statistics are likely to vary more widely:

rxSummary(~Delay, data = names(flightDataSets[1]))


rxSummary(~Delay_Pred, data = predictData)

4. Add the following code to the R file and run it. This code merges the predicted delays into a copy of
the test dataset:

mergedDelayData <- rxMerge(inData1 = predictData,


inData2 = names(flightDataSets[1]),
type = "oneToOne")

5. Add the following code to the R file and run it. This code defines a function that you will use for
analyzing the results of the predictions against the real data value:

processResults <- function(inputData, minuteRange) {


# Calculate the accuracy between the actual and predicted delays. How many are within 'minuteRange' minutes of each other?
results <- rxDataStep(inputData,
transforms = list(Accuracy = factor(abs(Delay_Pred - Delay)
<= minuteRange)),
transformObjects = list(minuteRange = minuteRange))
print(head(results, 100))
# How accurate are the predictions? How many are within 'minuteRange' minutes?
matches <- table(results$Accuracy)[2]
percentage <- (matches / nrow(results)) * 100
print(sprintf("%d predictions were within %d minutes of the actual delay time.
This is %f percent of the total", matches, minuteRange, percentage))
# How many are within 5% of the minuteRange' minute range
matches <- table(abs(results$Delay_Pred - results$Delay) / results$Delay <=
0.05)[2]
percentage <- (matches / nrow(results)) * 100
print(sprintf("%d predictions were within 5 percent of the actual delay time.
This is %f percent of the total", matches, percentage))
# How many are within 10%
matches <- table(abs(results$Delay_Pred - results$Delay) / results$Delay <=
0.1)[2]
percentage <- (matches / nrow(results)) * 100
print(sprintf("%d predictions were within 10 percent of the actual delay time.
This is %f percent of the total", matches, percentage))
# How many are within 50%
matches <- table(abs(results$Delay_Pred - results$Delay) / results$Delay <=
0.5)[2]
percentage <- (matches / nrow(results)) * 100
print(sprintf("%d predictions were within 50 percent of the actual delay time.
This is %f percent of the total", matches, percentage))
invisible(NULL)
}

6. Add the following code to the R file and run it. This statement calls the processResults function to
analyze the predictions and display the results:

processResults(mergedDelayData, 10)

Results: At the end of this exercise, you will have constructed a DTree model, made predictions using this
model, and evaluated the accuracy of these predictions.

Exercise 2: Fitting a DForest model and making predictions


 Task 1: Create the DForest model
 Add the following code to the R file and run it. This code fits a DForest model to the training data:

delayDForest <- rxDForest(formula = Delay ~ DepTime + ArrTime + Month + DayOfWeek,
                          data = names(flightDataSets[2]),
                          maxDepth = 7)
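
Before making predictions, you can optionally inspect the fitted forest. This sketch is not a required lab step; it prints the model object (which reports forest-level fit statistics) and lists how often each predictor variable is used across the trees, using only functions that appear elsewhere in these labs:

# Optional sketch: print the forest's summary statistics and show how often
# each predictor variable is used by the trees.
print(delayDForest)
rxVarUsed(delayDForest)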

 Task 2: Make predictions using the DForest model


1. Add the following code to the R file and run it. This code generates predictions for the test data using
the DForest model:

predictData <- rxPredict(delayDForest, data = testData)

2. Add the following code to the R file and run it. This code merges the predictions into a copy of the
test data frame:

mergedDelayData <- rxMerge(inData1 = predictData,
                           inData2 = names(flightDataSets[1]),
                           type = "oneToOne")

3. Add the following code to the R file and run it. This code uses the processResults function to analyze
the predictions:

processResults(mergedDelayData, 10)

 Task 3: Test whether overfitting of the model is an issue


1. Add the following code to the R file and run it. This code fits a DForest model with a maximum depth
of five levels to the training data:

delayDForest <- rxDForest(formula = Delay ~ DepTime + ArrTime + Month + DayOfWeek,
                          data = names(flightDataSets[2]),
                          maxDepth = 5)

2. Add the following code to the R file and run it. This code generates predictions for the test data using
the new DForest model:

predictData <- rxPredict(delayDForest, data = testData)



3. Add the following code to the R file and run it. This code merges the predictions into a copy of the
test data frame:

mergedDelayData <- rxMerge(inData1 = predictData,
                           inData2 = names(flightDataSets[1]),
                           type = "oneToOne")

4. Add the following code to the R file and run it. This code uses the processResults function to analyze
the predictions:

processResults(mergedDelayData, 10)

Results: At the end of this exercise, you will have constructed a DForest model, made predictions using this
model, and evaluated the accuracy of these predictions.

Exercise 3: Fitting a DTree model with different variables


 Task 1: Create the new DTree model
 Add the following code to the R file and run it. This code fits a DTree model with a formula
referencing the origin, destination, distance, airline, and the origin and destination states, to the training data:

delayDTree2 <- rxDTree(formula = Delay ~ Origin + Dest + Distance + UniqueCarrier +
                                 OriginState + DestState,
                       data = names(flightDataSets[2]),
                       maxDepth = 7)

 Task 2: Make predictions using the new DTree model


1. Add the following code to the R file and run it. This code generates a data frame containing the test
data:

testData <- rxDataStep(names(flightDataSets[1]),
                       varsToKeep = c("Origin", "Dest", "Distance", "UniqueCarrier",
                                      "OriginState", "DestState"))

2. Add the following code to the R file and run it. This code generates predictions for the test data using
the new DTree model:

predictData <- rxPredict(delayDTree2, data = testData)

3. Add the following code to the R file and run it. This code merges the predictions into a copy of the
test data frame:

mergedDelayData <- rxMerge(inData1 = predictData,
                           inData2 = names(flightDataSets[1]),
                           type = "oneToOne")

4. Add the following code to the R file and run it. This code uses the processResults function to analyze
the predictions:

processResults(mergedDelayData, 10)

Results: At the end of this exercise, you will have constructed a DTree model using a different set of
variables, made predictions using this model, and compared these predictions to those made using the
earlier DTree model.

Exercise 4: Fitting a DTree model with a combined set of variables


 Task 1: Create the DTree model
 Add the following code to the R file and run it. This code fits a DTree model with a formula
that combines the departure and arrival times, month, and day of the week with the origin, destination, distance, airline, and state variables, to the training data:

delayDTree3 <- rxDTree(formula = Delay ~ DepTime + Month + DayOfWeek + ArrTime +
                                 Origin + Dest + Distance + UniqueCarrier +
                                 OriginState + DestState,
                       data = names(flightDataSets[2]),
                       maxDepth = 7)

 Task 2: Make predictions using the DTree model


1. Add the following code to the R file and run it. This code generates a data frame containing the test
data:

testData <- rxDataStep(names(flightDataSets[1]),
                       varsToDrop = c("Delay"))

2. Add the following code to the R file and run it. This code generates predictions for the test data using
the new DTree model:

predictData <- rxPredict(delayDTree3, data = testData)

3. Add the following code to the R file and run it. This code merges the predictions into a copy of the
test data frame:

mergedDelayData <- rxMerge(inData1 = predictData,
                           inData2 = names(flightDataSets[1]),
                           type = "oneToOne")

4. Add the following code to the R file and run it. This code uses the processResults function to analyze
the predictions:

processResults(mergedDelayData, 10)

Results: At the end of this exercise, you will have constructed a DTree model combining the variables
used in the two earlier DTree models, and made predictions using this model.

Module 8: Processing Big Data in SQL Server and Hadoop


Lab A: Deploying a predictive model to SQL
Server
Exercise 1: Upload the flight delay data
 Task 1: Configure SQL Server and create the FlightDelays database
1. Log on to the LON-DEV VM as Adatum\AdatumAdmin with the password Pa55w.rd.

2. On the Windows desktop, click Start, type Microsoft SQL Server Management Studio, and then
press Enter.

3. In the Connect to Server dialog box, log on to LON-SQLR using Windows authentication.

4. In the toolbar, click New Query.


5. In the Query window, type the following commands. These commands enable you to run the
sp_execute_external_script stored procedure:

sp_configure 'external scripts enabled', 1;
RECONFIGURE;

6. In the toolbar, click Execute.

7. In Object Explorer, right-click LON-SQLR, and then click Restart.

8. In the Microsoft SQL Server Management Studio message box, click Yes.

9. In the second Microsoft SQL Server Management Studio message box, click Yes.

10. Wait for SQL Server to restart.

11. In Object Explorer, expand LON-SQLR, right-click Databases, and then click New Database.

12. In the New Database dialog box, in the Database name text box, type FlightDelays, and then click
OK.

13. Leave SQL Server Management Studio open.

 Task 2: Upload the flight delay data to SQL Server


1. Open a command prompt window.

2. Run the following commands:

net use \\LON-RSVR\Data


copy E:\Labfiles\Lab08\FlightDelayDataSample.xdf \\LON-RSVR\Data

3. Verify that the file is copied successfully, and then close the command prompt.

4. Start your R development environment of choice (Visual Studio, or RStudio), and create a new R file.

5. In the script editor, add the following statement to the R file and run it. This code ensures that you
are running in the local compute context:

rxSetComputeContext(RxLocalSeq())

6. In the script editor, add the following statement to the R file and run it. This code creates a
connection string for the SQL Server FlightDelays database, and an RxSqlServerData data source for
the flightdelaydata table in the database:

connStr <- "Driver=SQL Server;Server=LON-


SQLR;Database=FlightDelays;Trusted_Connection=Yes"
flightDelayDataTable <- RxSqlServerData(connectionString = connStr,
table = "flightdelaydata")

7. Add the following code to the R file and run it. These statements import the data from the
FlightDelayDataSample.xdf file and add the DelayedByWeather logical factor and the Dataset
column to each observation:

rxOptions("reportProgress" = 2)
flightDelayDataFile <- "\\\\LON-RSVR\\Data\\FlightDelayDataSample.xdf"
flightDelayData <- rxDataStep(inData = flightDelayDataFile,
outFile = flightDelayDataTable, overwrite = TRUE,
transforms = list(DelayedByWeather =
factor(ifelse(is.na(WeatherDelay), 0, WeatherDelay) > 0, levels = c(FALSE, TRUE)),
Dataset =
factor(ifelse(runif(.rxNumRows) >= 0.05, "train", "test")))
)
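
As an optional sanity check (not one of the numbered lab steps), you can confirm that the upload populated the flightdelaydata table before moving on. The sketch below reuses the same RxSqlServerData data source; rxGetInfo is a standard ScaleR function, and numRows = 5 simply previews a few rows:

# Optional sketch: report the variable metadata for the uploaded table and
# preview the first five rows to confirm the import succeeded.
rxGetInfo(flightDelayDataTable, getVarInfo = TRUE, numRows = 5)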

 Task 3: Examine the data in the database


1. Add the following statements to the R file and run it. This code creates a new SQL Server compute
context:

sqlWait <- TRUE
sqlConsoleOutput <- TRUE
sqlContext <- RxInSqlServer(connectionString = connStr,
                            wait = sqlWait, consoleOutput = sqlConsoleOutput
)
rxSetComputeContext(sqlContext)

2. Add the following code to the R file and run it. These statements create an RxSqlServerData data
source that reads the flight delay data from the SQL Server database and refactors it:

weatherDelayQuery = "SELECT Month, MonthName, OriginState, DestState,


Dataset, DelayedByWeather, WeatherDelayCategory = CASE
WHEN CEILING(WeatherDelay) <= 0 THEN 'No delay'
WHEN CEILING(WeatherDelay) BETWEEN 1 AND 30 THEN '1-30
minutes'
WHEN CEILING(WeatherDelay) BETWEEN 31 AND 60 THEN '31-
60 minutes'
WHEN CEILING(WeatherDelay) BETWEEN 61 AND 120 THEN '61-
120 minutes'
WHEN CEILING(WeatherDelay) BETWEEN 121 AND 180 THEN
'121-180 minutes'
WHEN CEILING(WeatherDelay) >= 181 THEN 'More than 180
minutes'
END
FROM flightdelaydata"
delayDataSource <- RxSqlServerData(sqlQuery = weatherDelayQuery,
colClasses = c(
Month =
"factor",
OriginState = "factor",
DestState = "factor",
Dataset = "character",
DelayedByWeather = "factor",
WeatherDelayCategory = "factor"
),
colInfo = list(
MonthName = list(
type = "factor",
levels = month.name
)
),
connectionString = connStr
)

3. Add the following statement to the R file and run it. This statement retrieves the details for each
variable in the data source:

rxGetVarInfo(delayDataSource)

There should be seven variables, named Month, MonthName, OriginState, DestState, Dataset,
DelayedByWeather, and WeatherDelayCategory.

4. Add the following statement to the R file and run it. This statement retrieves the data from the data
source and summarizes it:

rxSummary(~., delayDataSource)

5. Add the following code to the R file and run it. This statement creates a histogram that shows the
categorized delays by month:

rxHistogram(~WeatherDelayCategory | MonthName,
            data = delayDataSource,
            xTitle = "Weather Delay",
            scales = (list(
                x = list(rot = 90)
            ))
)

6. Add the following code to the R file and run it. This statement creates a histogram that shows the
categorized delays by origin state:

rxHistogram(~WeatherDelayCategory | OriginState,
            data = delayDataSource,
            xTitle = "Weather Delay",
            scales = (list(
                x = list(rot = 90, cex = 0.5)
            ))
)

Results: At the end of this exercise, you will have imported the flight delay data to SQL Server and used
ScaleR functions to examine this data.

Exercise 2: Fit a DForest model to the weather delay data


 Task 1: Create a DForest model
1. Add the following code to the R file and run it. This statement fits a DForest model to the weather
delay data:

weatherDelayModel <- rxDForest(DelayedByWeather ~ Month + OriginState + DestState,
                               data = delayDataSource,
                               cp = 0.0001,
                               rowSelection = (Dataset == "train")
)

2. Add the following statement to the R file and run it. This statement summarizes the forecast accuracy
of the model:

print(weatherDelayModel)

3. Add the following statement to the R file and run it. This statement shows the structure of the
decision trees in the DForest model:

head(weatherDelayModel)

4. Add the following statement to the R file and run it. This statement shows how frequently each
predictor variable is used by the trees in the model to make decisions:

rxVarUsed(weatherDelayModel)

 Task 2: Score the DForest model


1. Add the following code to the R file and run it. This statement modifies the query used by the data
source and adds a WHERE clause to limit the rows retrieved to the test data:

delayDataSource@sqlQuery <- paste(delayDataSource@sqlQuery, "WHERE Dataset = 'test'",
                                  sep = " ")

2. Add the following code to the R file and run it. This statement creates a data source that will be used
to store scored results in the database:

weatherDelayScoredResults <- RxSqlServerData(connectionString = connStr,
                                             table = "scoredresults")

3. Add the following code to the R file and run it. These statements switch to the local compute context,
generate weather delay predictions using the new data set and save the scored results in the
scoredresults table in the database, and then return to the SQL Server compute context:

rxSetComputeContext(RxLocalSeq())
rxPredict(modelObj = weatherDelayModel,
          data = delayDataSource,
          outData = weatherDelayScoredResults, overwrite = TRUE,
          writeModelVars = TRUE,
          predVarNames = c("PredictedDelay", "PredictedNoDelay", "PredictedDelayedByWeather"),
          type = "prob")
rxSetComputeContext(sqlContext)

4. Add the following code to the R file and run it. This code tests the scored results against the real data
and plots the accuracy of the predictions:

install.packages('ROCR')
library(ROCR)
# Transform the prediction data into a standardized form
results <- rxImport(weatherDelayScoredResults)
weatherDelayPredictions <- prediction(results$PredictedDelay, results$DelayedByWeather)
# Plot the ROC curve of the predictions
rocCurve <- performance(weatherDelayPredictions, measure = "tpr", x.measure = "fpr")
plot(rocCurve)

5. If the Microsoft Visual Studio dialog boxes appear, click Yes.
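
If you also want a single-number summary of the ROC curve, the same ROCR prediction object can report the area under the curve (AUC). This is an optional sketch rather than part of the lab steps:

# Optional sketch: compute the area under the ROC curve from the same ROCR
# prediction object; values closer to 1 indicate better discrimination.
aucValue <- performance(weatherDelayPredictions, measure = "auc")
aucValue@y.values[[1]]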

Results: At the end of this exercise, you will have created a decision tree forest using the weather data
held in the SQL Server database, scored it, and stored the results back in the database.

Exercise 3: Store the model in SQL Server


 Task 1: Save the model to the database
1. Add the following code to the R file and run it. These statements create a serialized representation
of the model as a string of binary data:

serializedModel <- serialize(weatherDelayModel, NULL)
serializedModelString <- paste(serializedModel, collapse = "")
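
Optionally, you can check locally that the string can be turned back into a usable model before storing it in SQL Server. The following sketch simply reverses the hex encoding produced by paste(..., collapse = ""); the helper names hexPairs and restoredModel are illustrative only:

# Optional sketch: split the hex string into two-character byte codes, convert
# them back to raw bytes, and unserialize to confirm the model round-trips.
hexPairs <- substring(serializedModelString,
                      seq(1, nchar(serializedModelString), by = 2),
                      seq(2, nchar(serializedModelString), by = 2))
restoredModel <- unserialize(as.raw(strtoi(hexPairs, base = 16L)))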

2. Return to SQL Server Management Studio.

3. In the toolbar, click New Query. In the Query window, type the following code:

USE FlightDelays;
CREATE TABLE [dbo].[delaymodels]
(
modelId INT IDENTITY(1,1) NOT NULL Primary KEY,
model VARBINARY(MAX) NOT NULL
);

4. In the toolbar, click Execute. Verify that the code runs without any errors.

5. Overwrite the code in the Query window with the following block of Transact-SQL:

CREATE PROCEDURE [dbo].[PersistModel] @m NVARCHAR(MAX)
AS
BEGIN
    SET NOCOUNT ON;
    INSERT INTO delaymodels(model) VALUES (CONVERT(VARBINARY(MAX),@m,2))
END;

6. In the toolbar, click Execute. Verify that the code runs without any errors.

7. Return to your R development environment.

8. Add the following statements to the R file and run them. This code uses an ODBC connection to run the
PersistModel stored procedure and save your DForest model to the database:

install.packages('RODBC')
library(RODBC)
connection <- odbcDriverConnect(connStr)
cmd <- paste("EXEC PersistModel @m='", serializedModelString, "'", sep = "")
sqlQuery(connection, cmd)

 Task 2: Create a stored procedure that runs the model to make predictions
1. Switch back to SQL Server Management Studio.

2. Overwrite the code in the Query window with the following block of Transact-SQL:

CREATE PROCEDURE [dbo].[PredictWeatherDelay]
    @Month integer = 1,
    @OriginState char(2),
    @DestState char(2)
AS
BEGIN
    DECLARE @weatherDelayModel varbinary(max) = (SELECT TOP 1 model FROM dbo.delaymodels)
    EXEC sp_execute_external_script @language = N'R',
        @script = N'
            delayParams <- data.frame(Month = month, OriginState = originState, DestState = destState)
            delayModel <- unserialize(as.raw(model))
            OutputDataSet <- rxPredict(modelObject = delayModel,
                                       data = delayParams,
                                       outData = NULL,
                                       predVarNames = c("PredictedDelay", "PredictedNoDelay", "PredictedDelayedByWeather"),
                                       type = "prob",
                                       writeModelVars = TRUE)',
        @params = N'@model varbinary(max),
                    @month integer,
                    @originState char(2),
                    @destState char(2)',
        @model = @weatherDelayModel,
        @month = @Month,
        @originState = @OriginState,
        @destState = @DestState
    WITH RESULT SETS (([PredictedDelay] float, [PredictedNoDelay] float, [PredictedDelayedByWeather] bit,
                       [Month] integer, [OriginState] char(2), [DestState] char(2)));
END

3. In the toolbar, click Execute. Verify that the code runs without any errors.

4. Return to your R development environment.



5. Add the following statements to the R file and run it. This code tests the stored procedure:

cmd <- "EXEC [dbo].[PredictWeatherDelay] @Month = 10, @OriginState = 'MI', @DestState


= 'NY'"
sqlQuery(connection, cmd)

6. Save the script as Lab8_1Script.R in the E:\Labfiles\Lab08 folder, and close your R development
environment.

7. Close SQL Server Management Studio, without saving any changes.

Results: At the end of this exercise, you will have saved the DForest model to SQL Server, and created a
stored procedure that you can use to make weather delay predictions using this model.

Lab B: Incorporating Hadoop Map/Reduce


and Spark functionality into the ScaleR
workflow
Exercise 1: Using Pig with ScaleR functions
 Task 1: Examine the Pig script
1. Log on to the LON-DEV VM as Adatum\AdatumAdmin with the password Pa55w.rd.

2. Using WordPad, open the file carrierDelays.pig in the E:\Labfiles\Lab08 folder.

3. Review the script.


4. Edit the first line of the script, and change the text {specify login name} to your login name, as
shown in the following example:

%declare loginName 'student01'

5. Save the file, and then close WordPad.

 Task 2: Upload the Pig script and flight delay data to HDFS
1. Start your R development environment of choice (Visual Studio, or RStudio), and create a new R file.

2. In the script editor, add the following statements to the R file. Change studentnn to your login name,
and then run the code. These statements establish a new RxHadoopMR compute context:

loginName = "studentnn"
context <- RxHadoopMR(sshUsername = loginName,
sshHostname = "LON-HADOOP",
consoleOutput = TRUE)
rxSetComputeContext(context, wait = TRUE)
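
If you want to confirm that the compute context can reach the cluster before copying any files, one optional check (a sketch, assuming your /user/RevoShare directory already exists) is to list that directory through the new context:

# Optional sketch: list your HDFS directory through the RxHadoopMR compute
# context to verify connectivity to the cluster.
rxHadoopCommand(paste("fs -ls /user/RevoShare/", loginName, sep = ""), intern = TRUE)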

3. Add the following statements to the R file and run them. This code removes the
FlightDelayDataSample.csv file from your directory in HDFS (if it exists; you can ignore the error
message if it is not found), and copies the latest data from the E:\Labfiles\Lab08 folder:

rxHadoopRemove(path = paste("/user/RevoShare/", loginName,


"/FlightDelayDataSample.csv", sep="") )
rxHadoopCopyFromClient(source = "E:\\Labfiles\\Lab08\\FlightDelayDataSample.csv",
hdfsDest = paste("/user/RevoShare/", loginName, sep=""))

4. Add the following statements to the R file and run them. This code uploads the carriers.csv file to
your directory in HDFS:

rxHadoopRemove(path = paste("/user/RevoShare/", loginName, "/carriers.csv", sep=""))


rxHadoopCopyFromClient(source = "E:\\Labfiles\\Lab08\\carriers.csv",
hdfsDest = paste("/user/RevoShare/", loginName, sep=""))

5. Add the following statements to the R file and run them. Replace the text fqdn with the URL of the
Hadoop VM in Azure (for example, LON-HADOOP-01.ukwest.cloudapp.azure.com). This code closes
the RxHadoopMR compute context, creates a remote R session on the Hadoop VM, and loads the
RevoScaleR library in that session:

rxSetComputeContext(RxLocalSeq())
remoteLogin(deployr_endpoint = "http://fqdn:12800", session = TRUE, diff = TRUE,
commandline = TRUE, username = "admin", password = "Pa55w.rd")
library(RevoScaleR)

6. Add the following statements to the R file and run them. This code copies the Pig script (and the
loginName variable) to the remote session:

pause()
putLocalFile("E:\\Labfiles\\Lab08\\carrierDelays.pig")
putLocalObject(c("loginName"))
resume()

 Task 3: Run the Pig script and examine the results


1. Add the following statement to the R file and run it. This code runs the Pig script:

result <- system("pig carrierDelays.pig", intern = TRUE)

2. Add the following code to the R file and run it. This code lists the contents of the
/user/RevoShare/studentnn directory in HDFS. This directory should include a subdirectory named
results:

rxHadoopCommand(paste("fs -ls -R /user/RevoShare/", loginName, sep = ""), intern =


TRUE)

3. Verify that the results directory contains two files: _SUCCESS and part-r-00000. The data is actually in
the part-r-00000 file.

4. Add the following statement to the R file and run it. This code deletes the _SUCCESS file from the
results directory:

rxHadoopRemove(paste("fs -ls -R /user/RevoShare/", loginName, "/results/_SUCCESS",


sep = ""))

5. Add the following statements to the R file and run them. These statements display the structure of the
results generated by the Pig script:

rxOptions(reportProgress = 1)
rxSetFileSystem(RxHdfsFileSystem())
resultsFile <- paste("/user/RevoShare/", loginName, "/results", sep = "")
resultsData <- RxTextData(resultsFile)
rxGetVarInfo(resultsData)

 Task 4: Convert the results to XDF format and add field names
1. Add the following code to the R file and run it:

carrierDelayColInfo <- list(V1 = list(type = "factor", newName = "Origin"),
                            V2 = list(type = "factor", newName = "Dest"),
                            V3 = list(type = "factor", newName = "AirlineCode"),
                            V4 = list(type = "character", newName = "AirlineName"),
                            V5 = list(type = "numeric", newName = "CarrierDelay"),
                            V6 = list(type = "numeric", newName = "LateAircraftDelay")
)

2. Add the following code to the R file and run it. This code creates the CarrierData composite XDF file
in HDFS:

carrierFile <- RxXdfData(paste("/user/RevoShare", loginName, "CarrierData", sep = "/"))
carrierData <- rxImport(inData = resultsData,
                        outFile = carrierFile, overwrite = TRUE,
                        colInfo = carrierDelayColInfo,
                        createCompositeSet = TRUE)

3. Add the following code to the R file and run it. This code creates a new RxHadoopMR compute
context in the remote session:

hadoopContext = RxHadoopMR(sshUsername = loginName, consoleOutput = TRUE)
rxSetComputeContext(hadoopContext)

4. Add the following code to the R file and run it. This code displays the structure of the XDF file which
should now include the new mappings:

rxGetVarInfo(carrierData)

5. Add the following code to the R file and run it. This code summarizes the contents of the XDF file:

rxSummary(~., carrierData)

Note that this function runs as a Map/Reduce job.

 Task 5: Create graphs to visualize the airline delay data


1. Add the following code to the R file and run it. This code creates a data frame that contains the airline
delay information (carrier delay plus late aircraft delay), and then uses this data frame to generate a
histogram of delay times:

transformedCarrierData <- rxDataStep(carrierData,
                                     transforms = list(TotalDelay = CarrierDelay + LateAircraftDelay))
# The data source for this histogram is an in-memory data frame, so the processing is not distributed
rxHistogram(~TotalDelay, data = transformedCarrierData,
            xTitle = "Carrier + Late Aircraft Delay (minutes)",
            yTitle = "Occurrences",
            endVal = 300
)

2. Add the following code to the R file and run it. This code changes the resolution of the plot window
to 1024 by 768 pixels, and then generates a histogram showing the number of delayed flights for
each airline:

png(width=1024, height=768);rxHistogram(~AirlineCode, data = carrierData,
                                        xTitle = "Airline",
                                        yTitle = "Number of delayed flights"
                                        );dev.off()

3. Add the following code to the R file and run it. This code creates a bar chart showing the total delay
time for all flights made by each airline:

library(ggplot2)
ggplot(data = rxImport(carrierData, transforms = list(TotalDelay = CarrierDelay + LateAircraftDelay))) +
    geom_bar(mapping = aes(x = AirlineName, y = TotalDelay), stat = "identity") +
    labs(x = "Airline", y = "Total Carrier + Late Aircraft Delay (minutes)") +
    scale_x_discrete(labels = function(x) { lapply(strwrap(x, width = 25, simplify = FALSE), paste, collapse = "\n") }) +
    theme(axis.text.x = element_text(angle = 90, size = 8))

 Task 6: Investigate the mean airline delay by route


1. Add the following code to the R script and run it. This code creates a data cube that calculates the
mean delay for each airline by route:

delayData <- rxCube(AverageDelay ~ AirlineCode:Origin:Dest, carrierData,
                    transforms = list(AverageDelay = CarrierDelay + LateAircraftDelay),
                    means = TRUE,
                    na.rm = TRUE, removeZeroCounts = TRUE)

2. Add the following code to the R script and run it. This code uses the rxExec function to run rxSort to
sort the data in the cube:

sortedDelayData <- rxExec(FUN = rxSort, as.data.frame(delayData),
                          sortByVars = c("Counts", "AverageDelay"),
                          decreasing = c(TRUE, TRUE), timesToRun = 1)

3. Add the following code to the R script and run it. This code shows the top 50 rows in the sorted cube (the
routes with the most frequent and longest delays):

head(sortedDelayData$rxElem1, 50)

4. Add the following code to the R script and run it. This code shows the bottom 50 rows in the sorted
cube (the routes with the least frequent and shortest delays):

tail(sortedDelayData$rxElem1, 50)

5. Add the following code to the R script and run it. This code switches to the local compute context
and saves the data cube to the file SortedDelayData.csv in HDFS:

rxSetComputeContext(RxLocalSeq())
sortedDelayDataFile <- paste("/user/RevoShare/", loginName, "/SortedDelayData.csv",
sep = "")
sortedDelayDataCsv <- RxTextData(sortedDelayDataFile)
sortedDelayDataSet <- rxDataStep(inData = sortedDelayData$rxElem1,
outFile = sortedDelayDataCsv, overwrite = TRUE
)

Results: At the end of this exercise, you will have run a Pig script from R, and analyzed the data that the
script produces.

Exercise 2: Integrating ScaleR code with Spark and Hive


 Task 1: Upload the sorted delay data to Hive
1. Add the following code to the R script and run it. This code creates an RxSpark compute context:

sparkContext = RxSpark(sshUsername = loginName,
                       consoleOutput = TRUE,
                       numExecutors = 10,
                       executorCores = 2,
                       executorMem = "1g",
                       driverMem = "1g")
rxSetComputeContext(sparkContext)

2. Add the following code to the R script and run it. This code creates an RxHiveData data source:

dbTable <- paste(loginName, "RouteDelays", sep = "")
hiveDataSource <- RxHiveData(table = dbTable)

3. Add the following code to the R script and run it. This code uploads the data to Hive:

data <- rxDataStep(inData = sortedDelayDataSet,
                   outFile = hiveDataSource, overwrite = TRUE)

4. Add the following code to the R script and run it. This code runs the rxSummary function over the
Hive data:

rxSummary(~., hiveDataSource)

 Task 2: Use Hive to perform ad-hoc queries over the data


1. Open a command prompt window, and run the putty command.

2. In the PuTTY Configuration window, select the LON-HADOOP session, click Load, and then click
Open. You should be logged in to the Hadoop VM.
3. In the PuTTY terminal window, run the following command:

hive

4. At the hive> prompt, run the following command:

select * from studentnnroutedelays;

Replace studentnn with your login name.

Verify that 8852 rows are retrieved.

5. At the hive> prompt, run the following command:

select count(*), airlinecode from studentnnroutedelays group by airlinecode;

Replace studentnn with your login name.

This command lists each airline together with the number of delayed flights for that airline.

6. At the hive> prompt, run the following command to exit hive:

exit;

7. Close the PuTTY terminal window, and return to your R development environment.

 Task 3: Use a sparklyr session to analyze data


1. Add the following code to the R script and run it. This code loads the sparklyr and dplyr libraries.

library(sparklyr)
library(dplyr)

2. Add the following code to the R script and run it. This code closes the current RxSpark compute
context and creates a new one that supports sparklyr interop:

rxSparkDisconnect(sparkContext)
connection = rxSparkConnect(sshUsername = loginName,
consoleOutput = TRUE,
numExecutors = 10,
executorCores = 2,
executorMem = "1g",
driverMem = "1g",
interop = "sparklyr")

3. Add the following code to the R script and run it. This code creates a new sparklyr session:

sparklyrSession <- rxGetSparklyrConnection(connection)

4. Add the following code to the R script and run it. This code lists the tables available in Hive:

src_tbls(sparklyrSession)

5. Add the following code to the R script and run it. This code caches your routedelays table and
fetches the data in this table:

tbl_cache(sparklyrSession, dbTable)
routeDelaysTable <- tbl(sparklyrSession, dbTable)
head(routeDelaysTable)

6. Add the following code to the R script and run it. This code constructs a dplyr pipeline that finds the
delays for all flights for American Airlines that departed from New York JFK, and saves the results in a
tibble:

routeDelaysTable %>%
filter(AirlineCode == "AA" & Origin == "JFK") %>%
select(Dest, AverageDelay) %>%
collect ->
aajfkData

7. Add the following code to the R script and run it. This code displays the data in the tibble and
summarizes it:

print(aajfkData)
rxSummary(~., aajfkData)
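
As a further optional illustration of the same interop pattern (a sketch that assumes the column names shown above), you can also aggregate inside Spark before collecting, for example to rank airlines by their average delay on routes out of JFK:

# Optional sketch: aggregate with dplyr verbs inside Spark, then collect the
# small result into local memory to rank airlines departing from JFK.
routeDelaysTable %>%
  filter(Origin == "JFK") %>%
  group_by(AirlineCode) %>%
  summarise(MeanDelay = mean(AverageDelay)) %>%
  arrange(desc(MeanDelay)) %>%
  collect ->
  jfkAirlineDelays
print(jfkAirlineDelays)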

8. Add the following code to the R script and run it. This code closes the sparklyr session and
disconnects from the RxSpark compute context:

rxSparkDisconnect(connection)

9. Save the script as Lab8_2Script.R in the E:\Labfiles\Lab08 folder, and close your R development
environment.

Results: At the end of this exercise, you will have used R code running in an RxSpark compute context to
upload data to Hive, and then analyzed the data by using a sparklyr session running in the RxSpark
compute context.
