Data Science Boot-Camp Survival Manual

Table of Contents
1. Prologue
2. Chapter 0 - Data Scientist's Toolbox
3. Chapter 1 - R Programming
4. Chapter 2 - Getting and Cleaning Data
5. Chapter 3 - Exploratory Data Analysis
6. Chapter 4 - Reproducible Research
7. Chapter 5 - Statistical Inference
8. Chapter 6 - Regression Models
9. Chapter 7 - Practical Machine Learning
10. Chapter 8 - Developing Data Products
11. Capstone
12. Epilogue

Prologue
Welcome recruits!
During the next year you will learn the fundamentals of data science. The Data Science Specialization, offered by Johns
Hopkins University, is challenging. Success requires a strategy. This book aims to equip each of you with the knowledge
and skills to complete boot-camp. The "Data Science Boot-Camp Survival Manual" alone cannot guarantee success. Listen
to the instructor's lectures and apply yourself to the evaluations throughout your training.
According to Jeff Leek and the Data Science Specialization Team, the key word in data science is "science". To this end, the
focus of the ten-course series, including a capstone project, is to provide the learner with:
1. an introduction to the key ideas behind reproducible research,
2. an introduction to the tools and techniques to transform raw data into a presentable report,
3. an opportunity to gain hands-on practice so you can learn the techniques for yourself, and
4. an appreciation of the mathematics & statistics involved in data science.

Core Courses
The courses comprising the Data Science Specialization are:
Data Scientist's Toolbox
R Programming
Getting and Cleaning Data
Exploratory Data Analysis
Reproducible Research
Statistical Inference
Regression Models
Practical Machine Learning
Developing Data Products
These courses, taught by Brian Caffo, Jeff Leek, and Roger D. Peng, give the learner the foundational skills. While
the lectures and assignments build these foundational skills, learners often require further explanation. The course
forums allow learners to discuss the lecture topics and assignments. Yet each session of a course begins without the
shared knowledge of previous participants. Serving as a Community Teaching Assistant (CTA) made it clear that a companion
guide would be beneficial.
Are you up to the challenge of Johns Hopkins University's Data Science Specialization?

Structure of the Boot-Camp Survival Manual


Each chapter covers one of the core courses. A tutorial style that balances theory and practical application makes surviving
data science boot-camp possible. You learn the workflow typically involved in all phases of a data analysis project.
Chapter 0: The Data Scientist's Toolbox
URL: https://www.coursera.org/course/datascitoolbox
Synopsis: "Get an overview of the data, questions, and tools that data analysts and data scientists work with. This is the
first course in the Johns Hopkins Data Science Specialization."

Chapter 1: R Programming
URL: https://www.coursera.org/course/rprog
Synopsis: "Learn how to program in R and how to use R for effective data analysis. This is the second course in the Johns
Hopkins Data Science Specialization."
Chapter 2: Getting and Cleaning Data
URL: https://www.coursera.org/course/getdata
Synopsis: "Learn how to gather, clean, and manage data from a variety of sources. This is the third course in the Johns
Hopkins Data Science Specialization."
Chapter 3: Exploratory Data Analysis
URL: https://www.coursera.org/course/exdata
Synopsis: "Learn the essential exploratory techniques for summarizing data. This is the fourth course in the Johns Hopkins
Data Science Specialization."
Chapter 4: Reproducible Research
URL: https://www.coursera.org/course/repdata
Synopsis: "Learn the concepts and tools behind reporting modern data analyses in a reproducible manner. This is the fifth
course in the Johns Hopkins Data Science Specialization."
Chapter 5: Statistical Inference
URL: https://www.coursera.org/course/statinference
Synopsis: "Learn how to draw conclusions about populations or scientific truths from data. This is the sixth course in the
Johns Hopkins Data Science Course Track."
Chapter 6: Regression Models
URL: https://www.coursera.org/course/regmods
Synopsis: "Learn how to use regression models, the most important statistical analysis tool in the data scientist's toolkit.
This is the seventh course in the Johns Hopkins Data Science Specialization."
Chapter 7: Practical Machine Learning
URL: https://www.coursera.org/course/predmachlearn
Synopsis: "Learn the basic components of building and applying prediction functions with an emphasis on practical
applications. This is the eighth course in the Johns Hopkins Data Science Specialization."
Chapter 8: Developing Data Products
URL: https://www.coursera.org/course/devdataprod
Synopsis: "Learn the basics of creating data products using Shiny, R packages, and interactive graphics. This is the ninth
course in the Johns Hopkins Data Science Specialization."
Data Science Capstone
URL: https://www.coursera.org/course/dsscapstone
Synopsis: "The capstone project class will allow students to create a usable/public data product that can be used to show
your skills to potential employers. Projects will be drawn from real-world problems and will be conducted with industry,
government, and academic partners. "
Course synopses quoted from the course information pages at Coursera as of 1 April 2015.

Course Dependency and Recommended Sequence


Although the courses are standalone, the knowledge is cumulative. The pedagogical course dependencies are available
from Johns Hopkins University.

Figure 1 Course dependency diagram provided by Daniel M. Bontje (created 17 November 2014)
You need a language or system to perform the tasks (R Programming) and data to analyse (Getting and Cleaning Data). You then
get a sense of the data (Exploratory Data Analysis) before building models and drawing inferences (Statistical Inference,
Regression Models) or making predictions (Practical Machine Learning), and finally present your conclusions
and supporting evidence (Developing Data Products, Reproducible Research).
The recommended mathematics background is linear algebra and introductory statistics (descriptive and inferential).
Statistical Inference and Regression Models, both courses in this specialisation, cover the basic statistical concepts that form a
solid foundation for subsequent courses in the Data Science Specialization. These courses, along with Practical Machine
Learning, are the theoretical underpinnings, while the other six courses are applied in nature: obtaining data, scrubbing
data, exploring data, modeling data, and interpreting data, collectively known as the OSEMN (pronounced "awesome")
model.
Again welcome to the Data Science Boot-Camp. Review the "Data Science Boot-Camp Survival Manual" on a regular basis
throughout your training.

Chapter 0 - The Data Scientist's Toolbox


We shall neither fail nor falter; we shall not weaken or tire...give us the tools and we will finish the job. - Winston
Churchill, Prime Minister of Great Britain
Primary Instructor: Jeff Leek, MS, PhD (Biostatistics)
The foundational course Data Scientist's Toolbox is a high-level overview of the specialisation. This course lays the
groundwork for the nine-course series plus capstone project. It takes a comprehensive approach, teaching fundamental skills for data
science regardless of data set.
The keyword in "data science" is science, not data. The method is not dependent upon the dataset size; it scales from small
data to big data. The data science method equates to the scientific method used in the natural sciences. The Financial
Times article, "Big data: are we making a big mistake?", argues for a rigorous methodology. An article "The Data Science
Methodology", published on Data Science 101, argues for adoption of the scientific method familiar to scientists in the natural
sciences.
Data Science Methodology
1. problem formulation (hypothesis)
2. obtain data (experiment)
3. analysis (validate or refute hypothesis)
4. data product (report)
The courses in the Johns Hopkins University Data Science Specialization "emphasise a data science methodology rather
than focusing primarily on data science technique. [T]he instructors have taken care throughout to demonstrate a
responsible, scientifically-based approach to collecting, curating and analyzing data sources," says specialisation
participant John Frederick Thiels.

Learning Objectives
By the end of this chapter you will have learned the basic skills to successfully use the various tools required throughout
the book and the data science specialisation courses.

Tools of the Trade


To successfully complete the hands-on exercises in the book and course assignments (quizzes, programming, and course
projects) some software must be installed on your computer or in a hosted environment: Git, R and RStudio. A GitHub
account is mandatory because peer-assessed submissions must be accessible. Internet access is necessary to fully
participate in the courses, for example to watch or download lectures, take quizzes, submit programming assignments,
and participate in the peer assessment process. Because the software can be deployed on a variety of operating system
platforms, this book focuses solely on Ubuntu Linux running locally or remotely in a virtualised
environment.
Before delving into how to use the various tools in our toolbox it is important to consider the types of skills we need as
data-scientists-in-training. Firstly, linear algebra, probability and calculus at the introductory level are sufficient mathematics.
Secondly, introductory descriptive and inferential statistics, including hypothesis testing, is the recommended statistics
background. Thirdly, basic programming skills are recommended. None of the aforementioned skills is mandatory for the
Data Science Specialization. For those readers seeking to learn any of these skills there are courses available, including:
Pre-Calculus - Instructors: Sarah Eichhorn and Rachel Cohen Lehman, University of California, Irvine
Probability - Instructor: Santosh S. Venkatesh
Calculus: Single Variable - Instructor: Robert Ghrist, University of Pennsylvania
Descriptive Statistics - Instructor: Matthijs Rooduijn, University of Amsterdam
Inferential Statistics - Instructor: Annemarie Zand Scholten, University of Amsterdam
Data Analysis and Statistical Inference - Instructor: Mine Çetinkaya-Rundel, Duke University
Programming for Everybody (Python) - Instructor: Charles Severance, University of Michigan
Programming for Everybody (Python) deserves special mention because it is consistently highly-rated by course
participants for the teaching-style of "Dr. Chuck." You do not have to be a geek to enjoy this course.
Read the information page of each course especially if you prefer a self-teaching approach to learning. There are freely
available textbooks for some of these courses.

Virtualisation Software
While the various applications required for these courses can be installed on the host operating system of your computer,
we recommend using virtualisation software such as Oracle VirtualBox, VMware Workstation, Fusion or Player, or
Parallels Desktop, depending upon the operating system running on the computer. Another virtualisation option is the RStudio
Server Amazon Machine Image (AMI) or rolling your own local or hosted virtual machine instance.
This section will describe two scenarios:
importing a ready-made disk image (AMI) of Ubuntu Linux 14.04 LTS (64-bit) on the Amazon Web Services Elastic
Compute Cloud (AWS EC2) hosting platform, and
importing a ready-made disk image of Ubuntu Linux 14.04 LTS (32-bit or 64-bit) into Oracle VirtualBox on your
computer.
An advantage of virtualisation software, running on your computer or remotely hosted by a service provider, is that all the
required applications are kept separate from your computer's operating system and by default isolated from the host file
system.
Option A: Amazon Web Services Elastic Compute Cloud (EC2) with Amazon Machine Image
If you prefer installing Oracle VirtualBox and creating a virtual machine on your computer, you can skip this section.
Instructions are forthcoming.
Option B: Local Computer with Oracle VirtualBox
Please consult the instructions about downloading and installing Oracle VirtualBox onto your computer before proceeding.
Download the ready-made disk image of Ubuntu Linux (32-bit or 64-bit) based on the version supported by Oracle
VirtualBox and the architecture of the computer.
Note: Some computers are 64-bit but only allow 32-bit operating systems to run within virtualisation software.
Extract the compressed archive containing the disk image using p7zip.

$ 7za e Ubuntu_14.04.2-32bit.7z
7-Zip (A) [64] 9.20 Copyright (c) 1999-2010 Igor Pavlov 2010-11-18
p7zip Version 9.20 (locale=en_CA.UTF-8,Utf16=on,HugeFiles=on,2 CPUs)
Processing archive: Ubuntu_14.04.2-32bit.7z
Extracting 32bit/Ubuntu 14.04.2 (32bit).vdi
Extracting 32bit
Everything is Ok

Folders: 1
Files: 1
Size: 3807379456
Compressed: 776252068
$

After installing Oracle VirtualBox it is time to launch it so we can import the virtual machine disk image (.vdi).

Figure 0.1 Creating a new virtual machine instance


Click 'New' on the main menu. A dialogue box pop-up appears where you enter the name to assign to the virtual machine
and select the operating system and version. Click 'Next' to continue.

Figure 0.2 Allocating system memory to the new virtual machine instance
Select the amount of system memory (RAM) to allocate to the virtual machine. Allocate 2048 MB of system memory to this
virtual machine instance. This parameter can be modified later if necessary. Click 'Next' to continue.

Figure 0.3 Associating an existing virtual hard drive to the new virtual machine instance
Select 'Use an existing virtual hard drive file' and click on the file folder icon to navigate to the virtual hard drive file
previously downloaded and uncompressed. Click 'Create' to associate this disk image with the current virtual machine.

Figure 0.4 Mount the VirtualBox Guest Additions ISO


Make the VirtualBox Guest Additions ISO accessible to the virtual machine instance. At the main screen of Oracle
VirtualBox select the DataScientistsToolbox virtual machine. Click 'Settings', then 'Storage', followed by 'Empty'.

Figure 0.5 Mount the VirtualBox Guest Additions ISO


Click the CD/DVD icon and select VBoxGuestAdditions.iso from the dropdown list. Click 'OK' to return to the main screen.


Figure 0.6 Starting the new virtual machine instance
At the main screen of Oracle VirtualBox select the newly created virtual machine instance. Click 'Start' to launch the virtual
machine. At the login prompt type the password from the download webpage.
The final preparatory step is enabling the VirtualBox Guest Additions and updating any out-of-date packages installed on
the virtual machine. Open a terminal window (CTRL + ALT + T).
Activate the VirtualBox Guest Additions so the virtual machine instance integrates with the host system.

$ cd /media/osboxes/VBOXADDITIONS*
$ sudo sh VBoxLinuxAdditions.run

Upon successful installation shut down the virtual machine instance by clicking the Gear icon in the upper right corner of the
virtual machine, then unmount the VirtualBox Guest Additions ISO by reversing the steps shown in Figures 0.4 and 0.5. Alternatively,
you may choose to leave the VirtualBox Guest Additions ISO attached.
Note: Whenever an updated Linux kernel is installed as part of the normal update process the VirtualBox Guest Additions
will have to be reapplied to ensure the shared clipboard, for example, continues to work. Do NOT forget to restart the virtual
machine instance so the VirtualBox Guest Additions are activated.

Figure 0.7 Enable/Disable shared clipboard and drag-and-drop


Enabling a shared clipboard between your computer and the virtual machine instance is configurable via the 'Settings'
menu.

Figure 0.8 Pointing device and device boot-order configuration


The mouse device type should be configured as 'PS/2 Mouse' whether using a wired or wireless mouse. The device boot
order should be configured to ensure the virtual disk image is the default boot device.
Restart the virtual machine instance.
Switching between standard mode and full-screen mode is as easy as Host_Key + F (RIGHT_CTRL + F by default).
For convenience launch a terminal session (CTRL + ALT + T) and when its icon appears in the application bar right-click
the mouse and select 'Lock to Launcher'. From this point forward any time a terminal session is wanted simply click the
'Terminal' icon.

Figure 0.9 System settings configuration


Before proceeding with updating the currently installed system software and applications we should select an Ubuntu Linux
package repository in geographic proximity to your location. This can be accomplished by clicking the 'System Settings'
icon in the application bar along the left-edge of the screen. Click 'Software & Updates'.
Next, open a terminal session (CTRL + ALT + T). When the terminal displays the shell prompt type the following commands
to update and upgrade the currently installed system software and applications. If you see the 'Software Updater' icon in
the application bar, you can apply software updates by clicking the icon instead.

$ sudo apt-get -y update


$ sudo apt-get -y upgrade

Figure 0.10 Editing the user name, password, language preference and enabling automatic login
Automatic login can be enabled and the display name for the user account and password can be changed, if desired, via
'System Settings' by clicking 'User Accounts'.

Figure 0.11 Automatic login enabled


Click 'Unlock' to enable editing of the user account configuration. Type the current password when prompted. If you want to
change the account name, click 'osboxes.org' and type the desired account name. If you want to change the password,
click on the asterisks and type the desired password. If you want to enable automatic login, click 'OFF' so that 'ON' is
visible. Finally, click 'Lock' to relock the user account configuration.

Getting Familiar with the Command-Line Interface (CLI)


After a short detour to familiarise ourselves with the command-line interface (CLI) we will install Git, R, and RStudio. Rest
assured that interacting with the command line is not required beyond this chapter. RStudio provides seamless integration with
the file system to navigate and manipulate files, version control and repository synchronisation between your computer and
repository hosting services, and the statistical computation and software development environment.

15-minute Introduction to Navigating and Manipulating the File System from the Terminal
Let's start exploring the basic features of the environment from the comfort of a terminal session and the command-line.
Open a terminal window (CTRL+ALT+T) if you are running a graphical desktop environment. By learning a few basic
commands to navigate and manipulate the file system you will feel at ease and understand what is going on behind the
scenes within the Files pane of RStudio.
| Command | Description | Common Flags | Arguments |
|---|---|---|---|
| pwd | print working directory name | | |
| ls | list file and/or directory names | -l (long form), -a (hidden), -R (recursive) | [directory_path/][pattern] (optional) |
| mkdir | make directory | | [directory_path/]directory_name or [directory_path/]directory_name_list (mandatory) |
| cd | change directory | | [directory_path/][directory_name] (optional) |
| touch | create an empty file | | [directory_path/]file_name (mandatory) |
| echo | write a string (by default to stdout; redirect the output to create a file) | -e (interpret backslash escapes), -n (no trailing newline) | "a string of characters" |
| cp | copy file or directory | -r (recursive) | (source) [directory_path/][file_name] (target) [directory_path/][file_name] (mandatory) |
| mv | move file or directory | | (source) [directory_path/][file_name] (target) [directory_path/][file_name] (mandatory) |
| rm | remove/delete file or directory | -f (force), -r (recursive) | [directory_path/][file_name] (mandatory) |

Arguments in brackets are optional, but if the 'mandatory' designation is present, at least one of the arguments must be
supplied. Directory names and paths as well as file names may contain wildcard characters (* and ?) when used with some
of these commands.

Table 0.1 Basic File and Directory Commands


For each example type the commands to the right of the command prompt ($) to interactively follow along these examples.
Take your time working through the commands until you fully understand why each command produces the observed
results.

Example 1: Determine the current working directory

$ pwd
/home/osboxes

Example 2: List the file and subdirectory names in the current working directory

$ ls
Desktop Downloads Music Public Videos
Documents examples.desktop Pictures Templates

Example 3: Create a subdirectory named 'test' in the current directory

$ mkdir test
$ cd test
$ pwd
/home/osboxes/test

Example 4: Create subdirectories named '1', '2', '3', and '4' in the current directory

$ mkdir {1,2,3,4}
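
Example 4 relies on brace expansion: the shell rewrites {1,2,3,4} into four separate words before mkdir ever runs. The following is a minimal sketch for previewing that rewrite, assuming bash is available on the virtual machine; prefixing the command with echo is an addition for illustration and is not part of the original example.

```shell
# Preview what the shell would pass to mkdir. bash is invoked explicitly
# because brace expansion is not guaranteed in every /bin/sh.
expanded=$(bash -c 'echo mkdir {1,2,3,4}')
echo "$expanded"
```

The preview prints `mkdir 1 2 3 4`, which is the command that actually creates the four subdirectories.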

Example 5: List the files and subdirectory names in the current directory

$ ls
1 2 3 4

Example 6: Create some empty files and some files with content

$ touch 1/file01.txt 2/file02.txt
$ echo "Bonjour tout le monde"
Bonjour tout le monde
$ echo "Hello World!" > ./1/file0101.txt
$ echo "To be or not to be" > ./3/file03.txt
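
Example 6 uses > to send the output of echo into a file instead of the screen. As a hedged sketch of how that redirection behaves, the commands below contrast > (create or overwrite) with >> (append). The scratch directory and the file name notes.txt are hypothetical and chosen only for this illustration.

```shell
# Work in a throwaway directory so nothing in the home directory is touched.
workdir=$(mktemp -d)
cd "$workdir"

echo "first line" > notes.txt     # '>' creates the file, or overwrites it if it exists
echo "second line" >> notes.txt   # '>>' appends to the end instead of overwriting
line_count=$(wc -l < notes.txt)   # count the lines written so far

echo "replaced" > notes.txt       # '>' again: the two earlier lines are discarded
content=$(cat notes.txt)

echo "lines after append: $line_count"
echo "content after overwrite: $content"

cd / && rm -rf "$workdir"         # clean up the scratch directory
```

Because > silently replaces an existing file, it is worth checking with ls before redirecting into a name that may already be in use.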
Example 7: Change to the directory immediately above the current directory and list the files and subdirectory names in the
subdirectory named '1'

$ cd ..
$ ls -l test/1
total 4
-rw-rw-r-- 1 osboxes osboxes 13 Apr 3 09:28 file0101.txt
-rw-rw-r-- 1 osboxes osboxes 0 Apr 3 09:27 file01.txt

Example 8: List the files ending with '.txt' in the subdirectory named '3'

$ ls -l test/3/*.txt
-rw-rw-r-- 1 osboxes osboxes 19 Apr 3 09:29 test/3/file03.txt
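
Example 8 depends on the * wildcard. The sketch below, using hypothetical file names in a scratch directory, contrasts * (any run of characters) with ? (exactly one character) and shows that neither matches a name beginning with a dot unless the dot is typed explicitly.

```shell
workdir=$(mktemp -d)
cd "$workdir"
touch file1.txt file2.txt file10.txt notes.md .hidden.txt

set -- *.txt        # '*' matches any run of characters, but not a leading dot
star_count=$#       # number of names the pattern expanded to
set -- file?.txt    # '?' matches exactly one character, so file10.txt is excluded
question_count=$#

echo "*.txt matched $star_count names; file?.txt matched $question_count names"

cd / && rm -rf "$workdir"
```

Here `set --` is used only to count how many names each pattern expands to; in everyday use the pattern is simply handed to a command such as ls.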

Example 9: (a) Copy the file 'file02.txt' from directory named '${HOME}/test/2' to directory '${HOME}/test/4' and name the
file 'file04.txt'

$ cp ./test/2/file02.txt ./test/4/file04.txt

(b) Copy the file 'file02.txt' from directory named '${HOME}/test/2' to directory '${HOME}/test/4' and name the file 'file02.txt'

$ cp ~/test/2/file02.txt ./test/4/file02.txt

Example 10: Make subdirectory '${HOME}/test/3' the current working directory and create a hidden file and a hidden
subdirectory

$ cd test/3
$ touch .hidden01.txt
$ mkdir .hidden

Example 11: List the names of non-hidden files and subdirectories in the current directory, then repeat with the -a flag to include hidden names

$ ls
file03.txt
$ ls -a
. .. file03.txt .hidden .hidden01.txt

Example 12: Create a subdirectory named 'another' in the home directory of the user and copy the files and, recursively,
the subdirectories from '${HOME}/test' to '${HOME}/another'

$ mkdir ~/another
$ cp -r ../* ~/another
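
One subtlety in Example 12 is how hidden names are treated. The wildcard is expanded by the shell and skips hidden entries at the top level, while cp -r copies everything inside each directory it is given, hidden or not. A minimal sketch with hypothetical directory and file names:

```shell
workdir=$(mktemp -d)
cd "$workdir"
mkdir -p src/sub dest
touch src/visible.txt src/.top_hidden.txt src/sub/.nested_hidden.txt

cp -r src/* dest    # the shell expands src/* to the visible names only

# Record which hidden files arrived at the destination.
[ -e dest/.top_hidden.txt ] && top_copied=yes || top_copied=no
[ -e dest/sub/.nested_hidden.txt ] && nested_copied=yes || nested_copied=no

echo "top-level hidden file copied: $top_copied"
echo "hidden file inside sub copied: $nested_copied"

cd / && rm -rf "$workdir"
```

This is why the hidden file and hidden subdirectory created inside '3' in Example 10 travel along with the copy even though the * pattern itself never matches hidden names.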

Example 13: List the files and subdirectories in the home directory of the user

$ ls ~
another Documents examples.desktop Pictures Templates Videos
Desktop Downloads Music Public test

Example 14: List the file and subdirectory names in '${HOME}/another'

$ ls ~/another
1 2 3 4

Example 15: List the file names and recursively the subdirectories in '${HOME}/another'

$ ls -R ~/another
/home/osboxes/another:
1 2 3 4
/home/osboxes/another/1:
file0101.txt file01.txt
/home/osboxes/another/2:
file02.txt
/home/osboxes/another/3:
file03.txt
/home/osboxes/another/4:
file02.txt file04.txt

Example 16: Create a subdirectory named 'test/5' in the home directory of the user and move (copy and delete) the files
and/or subdirectories from '${HOME}/another'

$ mkdir ~/test/5
$ mv ~/another/* ../5

Example 17: List the file and subdirectory names in '${HOME}/another'

$ ls -a /home/osboxes/another
. ..

Example 18: List the file and subdirectory names in '${HOME}/test/5'

$ ls ../5
1 2 3 4

Example 19: List the file names and recursively the subdirectories in '${HOME}/test/5'

$ ls -R ~/test/5
/home/osboxes/test/5:
1 2 3 4
/home/osboxes/test/5/1:
file0101.txt file01.txt
/home/osboxes/test/5/2:
file02.txt
/home/osboxes/test/5/3:
file03.txt
/home/osboxes/test/5/4:
file02.txt file04.txt

Example 20: Make directory '/home/osboxes' the current working directory

$ cd
$ pwd
/home/osboxes

Example 21: Delete the subdirectories 'test' and 'another' from '${HOME}', and then list the file and subdirectory names in
the current directory

$ rm -rf test another


$ ls
Desktop Downloads Music Public Videos
Documents examples.desktop Pictures Templates
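
The rm -rf command in Example 21 deletes without prompting and there is no recycle bin to restore from. One cautious habit, offered here as a suggestion rather than as part of the original examples, is to list exactly what will be removed before removing it. The directory names below mirror the example, and 'keep' is a hypothetical bystander directory.

```shell
workdir=$(mktemp -d)
cd "$workdir"
mkdir test another keep

ls -d test another      # preview: -d lists the directories themselves, not their contents
rm -rf test another     # remove only what was just previewed
remaining=$(ls)

echo "$remaining"

cd / && rm -rf "$workdir"
```

Only 'keep' remains afterwards, confirming that the removal was limited to the names that were previewed.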

Example 22: Close the terminal session

$ exit

A cheatsheet for the Bourne Again SHell (BASH) has been prepared by the folks at Learn Code the Hard Way (LCodeTHW).
A complete manual for BASH is available from the GNU Project if you want to further explore the CLI and its capabilities.

Markdown - Writing Documentation the Easy Way


The markdown language, created by John Gruber, is relatively small and easy to learn, unlike markup languages such as
HTML and XML. Taking a portion of this book as an example, with some minor changes to demonstrate particular features,
we explore some of the more common markdown elements.

Prologue
===
# Introduction
During the next year you will learn the fundamentals of data science.
Surviving the nine courses which make up the [Data Science
Specialization][0001] offered by [Johns Hopkins University][jhu] requires a
**strategy**.
To this end, the focus of the ten-course series including a capstone project
is to provide the learner with:
1. an introduction to the key ideas behind reproducible research,
2. an introduction to the tools and techniques to transform raw
data into a presentable report,
4. an opportunity to gain hands-on practice so you can learn the
techniques for yourself, and
3. an appreciation of the mathematics & statistics involved in
data science.
## Core Courses
The courses comprising the Data Science Specialization are:
* Data Scientist's Toolbox
* R Programming
* Exploratory Data Analysis
* Getting and Cleaning Data
* Reproducible Research
* Statistical Inference
* Regression Models
* Practical Machine Learning
* Developing Data Products
![Course Dependency](dst_courses.png)
*Figure 1 Course dependency diagram*
[0001]: https://www.coursera.org/specialization/jhudatascience/1?utm_medium=courseDescripTop
[jhu]: http://www.jhu.edu

Listing 0.1 Sample markdown document


So you can immediately practise each of the markdown elements used in the sample document a concise description is
supplied with references to the sample document.
Font Modifiers
There are two styles of font modifier supported by standard markdown:
bold (text surrounded by **)
italics (text surrounded by *)
From the sample document we see that 'strategy' is modified during conversion to render bolded, whilst 'Figure 1 Course
dependency diagram' is modified during conversion to render italicised.
Headings
There are two styles of headers supported by standard markdown:
setext
First-level (text underlined by at least 3 equal-signs)
Secondary-level (text underlined by at least 3 dashes)


atx
First-level (# preceding text)
Secondary-level (## preceding text)
Third-level (### preceding text)
Fourth-level (#### preceding text)
From the sample document we see that 'Prologue' and 'Introduction' are first-level headers, and 'Core Courses' is a
second-level header.
Images
There are two styles of image links supported by standard markdown:
inline
filename: ![alternate text](directory_path/image "optional title")
reference
id: ![alternate text][string of digits | string of terms]
Links
There are two styles of links supported by standard markdown:
inline
URL: [random website](URL "optional title")
reference
id: [random website][string of digits | string of terms]
From the sample document we see that 'Data Science Specialization' is referenced by the id label (0001) whereas 'Johns
Hopkins University' is referenced by the id label (jhu). The actual URLs are collected at the end of the sample document,
although the label definitions could appear anywhere in the document.
Lists
There are two styles of lists supported by standard markdown:
ordered list
number (followed by a period and at least one space; physical ordering overrides the numeric label
during conversion)
unordered list
* (asterisk)
- (dash)
+ (plus)
From the sample document we see an ordered list containing the learner outcomes and an unordered list containing the
names of each of the nine core courses.
Install the markdown (MD) to hyper-text markup language (HTML) converter to practise modifying the sample markdown
document.

$ sudo apt-get install markdown

A text editor combined with the markdown-to-html converter is all that is needed.

$ nano sample.md
$ markdown sample.md # sends HTML output to the screen
$ markdown sample.md > sample.html # sends HTML output to a file named 'sample.html'
$ firefox sample.html # view the rendered HTML in a web browser
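
If you would rather not open an editor, a file can also be written straight from the terminal. The sketch below is one possible way to do so with a here-document: it creates a cut-down version of the sample document and converts it only when the markdown converter is installed. The scratch directory and the shortened content are assumptions made for this illustration.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Write a small markdown file; the quoted 'EOF' stops the shell from
# altering anything between the two markers.
cat > sample.md <<'EOF'
Prologue
===

## Core Courses

Success requires a **strategy**, according to [Johns Hopkins University][jhu].

1. first learner outcome
3. second learner outcome (rendered as item 2)

The courses include:

* R Programming

[jhu]: http://www.jhu.edu
EOF

# Convert only if the 'markdown' converter from the previous step is installed.
if command -v markdown >/dev/null 2>&1; then
    markdown sample.md > sample.html
    converted=yes
else
    converted=no
fi

heading_count=$(grep -c '^## ' sample.md)   # count the atx second-level headings
echo "converted: $converted, atx second-level headings in source: $heading_count"

cd / && rm -rf "$workdir"
```

The setext heading, the atx heading, the bold modifier, the reference link and the deliberately misnumbered ordered list each correspond to an element described above.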

Take your time working through the sample markdown document until you fully understand why each element produces the
observed results. This book is written in markdown. In another course you will learn how to produce a
document combining text and executable R code using R Markdown, and convert it to HTML and PDF using
RStudio.

Git - Version Control


Git is a distributed version control system allowing any number of people to collaboratively contribute to software
development or other projects. Some of the courses require learners to submit their programming assignments to GitHub
as part of a peer assessment grading process.
Installing Git
By installing the Git command-line client you can choose whether to manage your local and remote repositories from a
terminal session or within RStudio. Assuming you are running the Ubuntu Linux virtual machine or another Debian
GNU/Linux derived distribution, type the command shown to install the Git client.

$ sudo apt-get install -y git git-doc

If you have installed a different distribution refer to the system documentation to determine the package manager needed to
install software from the software repository.
15-minute Introduction to Version Control with Git from the Terminal
Let's start exploring the basic features of version control from the comfort of a terminal session. Open a terminal
window (CTRL+ALT+T) if you are running a graphical desktop environment.
Once RStudio is installed you will have integrated access to Git.
| Command | Description | Common Flags | Arguments |
|---|---|---|---|
| git init | initialise a local repository; default is current working directory | | [directory_path/][directory_name] (optional) |
| git branch | determine the current branch | | |
| git checkout | create a new branch in the current repository | -b (new branch) | branch_name (mandatory) |
| git status | reports the status of the local repository | | |
| git show | reports the historical differences of the files in the local repository | | |
| git add | add files to the local repository | -A (add), -u (track file name changes and deletions) | [directory_path/][file_name] (mandatory) |
| git commit | commit any changes to the local repository | -a (add), -m (message) | [directory_path/][file_name] (optional), "a string of characters" (mandatory with -m) |
| git pull | fetch changes from another repository and merge with current repository | | source target (mandatory) |
| git push | update remote repository with changes from the current repository | -u (add upstream (tracking) reference) | target source (mandatory unless -u flag present) |
| git merge | merge source branch with target branch | --squash (flatten commit history before merging) | branch_name (mandatory) |
| git revert | undo changes to the local repository | | reference_point (mandatory) |

Arguments in brackets are optional, but if the 'mandatory' designation is present, at least one of the arguments must be
supplied.
Table 0.2 Basic Git Commands


For each of the examples in this section type the commands to the right of the command prompt ($) to interactively follow
along these examples. Take your time working through the commands until you fully understand why each command
produces the observed results.
Preliminaries: Configure the email address and username to be used by Git. The flag --global applies the
configuration to all of your Git repositories on the computer. The flag --local applies the configuration to only the
current Git repository.

$ git config [--local | --global] user.email "userid@domain.tld"


$ git config [--local | --global] user.name "username"
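To confirm what Git recorded, the values can be read back. A minimal sketch, assuming Git is installed; it uses a scratch repository and `--local` so your real settings are left alone, and the name and address are the placeholders from the commands above.

```shell
# Scratch repository so the --local settings do not affect real projects.
repo=$(mktemp -d)
cd "$repo"
git init -q

# Apply the configuration to this repository only.
git config --local user.email "userid@domain.tld"
git config --local user.name "username"

# Read the values back; --get prints the value of a single key.
configured_email=$(git config --local --get user.email)
configured_name=$(git config --local --get user.name)
echo "$configured_name <$configured_email>"
```

Replacing `--local` with `--global` in both the write and the read would apply to every repository belonging to your user account.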

Note: The output of some Git commands in these examples has been reformatted for presentation within this book.
Example 1: Create a local repository.

$ mkdir Projects
$ mkdir Projects/DataScientistsToolbox
$ mkdir Projects/DataScientistsToolbox/sample
$ cd Projects/DataScientistsToolbox/sample
$ git init
Initialised empty Git repository in /home/osboxes/Projects/DataScientistsToolbox/sample/.git/
$ ls -la
drwxrwxr-x 3 osboxes osboxes 4096 Apr 5 19:15 .
drwxrwxr-x 3 osboxes osboxes 4096 Apr 5 19:07 ..
drwxrwxr-x 7 osboxes osboxes 4096 Apr 5 19:15 .git
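The three `mkdir` calls can be collapsed with the `-p` flag, which creates any missing parent directories in one step. A sketch, assuming Git is installed and using a scratch directory in place of your home directory:

```shell
base=$(mktemp -d)   # stands in for ${HOME} in this sketch
mkdir -p "$base/Projects/DataScientistsToolbox/sample"
cd "$base/Projects/DataScientistsToolbox/sample"
git init -q

# The hidden .git directory holds the repository itself.
ls -a
```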

Example 2: Create an empty README.md file in the local repository.

$ touch README.md
$ git add .
$ git commit -m "initial commit"
[master (root-commit) b7c48f3] initial commit
1 file changed, 0 insertions(+), 0 deletions(-)
create mode 100644 README.md
$ git status
On branch master
nothing to commit, working directory clean
$ git show
commit b7c48f3e5cdc772e6a198c3633acd853a69a5778
Author: jhudss
Date: Sun Apr 5 19:21:21 2015 -0300
initial commit
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..e69de29
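The line `index 0000000..e69de29` in the output above compares two object identifiers: all zeros means the file did not exist before the commit, and `e69de29` abbreviates the identifier Git assigns to empty content. One way to see this for yourself, assuming Git is installed:

```shell
# Hash empty input the way Git hashes file contents (nothing is stored).
empty_id=$(git hash-object --stdin < /dev/null)
echo "$empty_id"

# The first seven characters match the abbreviated id shown by 'git show'.
short_id=$(printf '%s' "$empty_id" | cut -c1-7)
echo "$short_id"
```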

Example 3: Edit the README.md file and paste the sample markdown document into the file.

$ nano README.md
$ git add -A .
$ git commit -m "added content"
[master 8fd8eb8] added content
1 file changed, 41 insertions(+)
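Table 0.2 lists two flags for `git add`. The sketch below, run in a scratch repository with made-up file names, suggests how they differ: `-u` stages changes to files Git already tracks (including deletions), while `-A` also stages files Git has not seen before.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
git config --local user.email "userid@domain.tld"
git config --local user.name "username"
echo "keep" > a.txt
echo "remove me later" > b.txt
git add .
git commit -q -m "initial commit"

rm b.txt                          # delete a tracked file
echo "brand new notes" > c.txt    # create an untracked file

git add -u                        # stages the deletion, ignores c.txt
after_u=$(git status --porcelain)
git add -A                        # stages the new file as well
after_A=$(git status --porcelain)

printf 'after -u:\n%s\nafter -A:\n%s\n' "$after_u" "$after_A"
```

In the `--porcelain` listing, `??` marks an untracked file, `D` a staged deletion, and `A` a staged addition.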

Example 4: Edit the README.md file swapping "Getting and Cleaning Data" and "Exploratory Data Analysis."

$ nano README.md
$ git commit -a -m "swapped order of two courses"
[master 87d0125] swapped order of two courses
1 file changed, 1 insertion(+), 1 deletion(-)
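A modified file is only included in a commit once it has been staged, either with `git add` or with the `-a` flag of `git commit`. The sketch below illustrates this in a scratch repository; the file contents and identity are placeholders.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
git config --local user.email "userid@domain.tld"
git config --local user.name "username"

echo "first draft" > README.md
git add README.md
git commit -q -m "initial commit"

# Change the tracked file but do not stage it.
echo "second draft" >> README.md
before=$(git status --porcelain)    # ' M README.md' = modified, not staged

# -a stages every modified tracked file before committing.
git commit -q -a -m "revised README"
after=$(git status --porcelain)     # empty = nothing left to commit
echo "before: '$before'  after: '$after'"
```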

Example 5: Determine whether there are any changes.

$ git status
On branch master
nothing to commit, working directory clean
$ git show
commit 87d012594aa5a8a39e99d4728dc8c853779587ab
Author: jhudss
Date: Sun Apr 5 19:34:34 2015 -0300
swapped order of two courses
diff --git a/README.md b/README.md
index 756292a..48587e6 100644
--- a/README.md
+++ b/README.md
@@ -25,8 +25,8 @@ The courses comprising the Data Science Specialization are:
* Data Scientist's Toolbox
* R Programming
-* Exploratory Data Analysis
* Getting and Cleaning Data
+* Exploratory Data Analysis
* Reproducible Research
* Statistical Inference
* Regression Models
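Run without arguments, `git show` describes only the most recent commit. For an overview of every commit, `git log` (a command not listed in Table 0.2) prints the history. A sketch in a scratch repository with two placeholder commits:

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
git config --local user.email "userid@domain.tld"
git config --local user.name "username"

touch README.md
git add README.md
git commit -q -m "initial commit"
echo "* R Programming" >> README.md
git commit -q -a -m "added content"

# One line per commit, newest first.
git log --oneline
commit_count=$(git rev-list --count HEAD)
echo "commits: $commit_count"
```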

Example 6: Create a branch named 'draft'.

$ git checkout -b draft
Switched to a new branch 'draft'
$ git status
On branch draft
nothing to commit, working directory clean
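Table 0.2 lists `git branch` for determining the current branch. A sketch in a scratch repository: the branch marked with an asterisk in the listing is the one currently checked out.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
git config --local user.email "userid@domain.tld"
git config --local user.name "username"
touch README.md
git add README.md
git commit -q -m "initial commit"

git checkout -q -b draft
git branch                                   # '*' marks the current branch
current=$(git rev-parse --abbrev-ref HEAD)   # the same answer, for scripts
echo "current branch: $current"
```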

Example 7: Edit the README.md file to add "Git is easy. Git is fun. Thanks Linus!" anywhere in the file.

$ nano README.md
$ git status
On branch draft
Changes not staged for commit:
(use "git add <file>..." to update what will be committed)
(use "git checkout -- <file>..." to discard changes in working directory)
modified: README.md
no changes added to commit (use "git add" and/or "git commit -a")
$ git commit -a -m "thanked the creator of Git"
[draft 34af00f] thanked the creator of Git
1 file changed, 2 insertions(+)
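Before committing, `git diff` (not listed in Table 0.2) shows what has changed in the working directory, in the same format `git show` uses for a finished commit. A sketch in a scratch repository with placeholder content:

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
git config --local user.email "userid@domain.tld"
git config --local user.name "username"
echo "Core Courses" > README.md
git add README.md
git commit -q -m "initial commit"

echo "Git is easy. Git is fun. Thanks Linus!" >> README.md
git diff                                      # unstaged changes as a unified diff
lines_added=$(git diff --numstat | cut -f1)   # first column = lines added
echo "lines added: $lines_added"
```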

Example 8: Switch to the 'master' branch and check the repository status.

$ git checkout master
Switched to branch 'master'
$ git status
On branch master
nothing to commit, working directory clean

Example 9: Merge the 'draft' branch with the 'master' branch and check the repository status.

$ git merge draft
Updating 87d0125..34af00f
Fast-forward
README.md | 2 ++
1 file changed, 2 insertions(+)
$ git status
On branch master
nothing to commit, working directory clean
$ git show
commit 34af00fc564fd28e485503715dd5a9a9a461329a
Author: jhudss
Date: Sun Apr 5 19:49:08 2015 -0300
thanked the creator of Git
diff --git a/README.md b/README.md
index 48587e6..aa53fee 100644
--- a/README.md
+++ b/README.md
@@ -19,6 +19,8 @@ is to provide the learner with:
3. an appreciation of the mathematics & statistics involved in
data science.
+Git is easy. Git is fun. Thanks Linus!
+
## Core Courses
The courses comprising the Data Science Specialization are:
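Two rows of Table 0.2 are not exercised by the examples: `git merge --squash` and `git revert`. The sketch below tries both in a scratch repository, with placeholder content and commit messages. It reads the name of the starting branch rather than assuming it is 'master'.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
git config --local user.email "userid@domain.tld"
git config --local user.name "username"
echo "base" > README.md
git add README.md
git commit -q -m "initial commit"
main_branch=$(git rev-parse --abbrev-ref HEAD)

# Two small commits on a side branch.
git checkout -q -b draft
echo "one" >> README.md; git commit -q -a -m "draft step one"
echo "two" >> README.md; git commit -q -a -m "draft step two"

# --squash stages the combined changes without committing them.
git checkout -q "$main_branch"
git merge -q --squash draft
git commit -q -m "merged draft as a single commit"
commits_after_squash=$(git rev-list --count HEAD)

# git revert adds a new commit that undoes the named one.
git revert --no-edit HEAD > /dev/null
restored=$(cat README.md)
echo "$commits_after_squash commits after squash; README.md is now: $restored"
```

The two draft commits arrive on the main branch as one commit, and the revert leaves the history intact while restoring the earlier file contents.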

A cheatsheet for Git and GitHub has been prepared by the folks at GitHub.

GitHub - Repository Hosting Service Supporting the Git Version Control System
GitHub is a repository hosting service allowing any number of people to collaboratively contribute to software development
or other projects. Some of the courses require learners to submit their programming assignments to GitHub as part of a
peer assessment grading process.
15-minute Introduction to Version Control with GitHub from the Terminal and Web Browser

Figure 0.12 Create an account with GitHub


Before creating a repository on GitHub you must create an account, preferably with the same username and email address used
when configuring Git. If you use an alternate email address and username for your GitHub account, you can associate Git's
username and email address with this account.

Figure 0.13 Choose a Personal Plan


Select the repository hosting plan for your account. The default free plan is sufficient for peer assessments during the
Johns Hopkins University Data Science Specialization.

Figure 0.14 New Account Orientation Dashboard


After your GitHub account is set up you are ready to explore the service. At the very least, update your profile
information before proceeding.
For each of the examples in this section type the commands to the right of the command prompt ($) to interactively follow
along these examples. Take your time working through the commands until you fully understand why each command
produces the observed results.
Example 1: Synchronise a local repository with an empty repository of the same name on GitHub. The commands below
create the empty repository on GitHub and push the content of the local repository to your GitHub account. Substitute your
GitHub account name for 'user_name' and type your account password when prompted.

$ curl -u user_name https://api.github.com/user/repos \
-d "{\"name\":\"sample\",\"description\":\"learning about Git and GitHub\"}"
$ git remote add origin https://github.com/user_name/sample.git
$ git push origin master
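The remote workflow can be rehearsed offline before touching GitHub. In this sketch a bare repository on the local disk stands in for the GitHub repository, an assumption made so the example runs without a network connection or an account. The `-u` flag from Table 0.2 records the upstream reference so that later pushes and pulls need no arguments.

```shell
work=$(mktemp -d)
git init -q --bare "$work/sample.git"    # stand-in for the hosted repository

mkdir "$work/sample"
cd "$work/sample"
git init -q
git config --local user.email "userid@domain.tld"
git config --local user.name "username"
touch README.md
git add README.md
git commit -q -m "initial commit"
branch=$(git rev-parse --abbrev-ref HEAD)

# Register the remote under the conventional name 'origin' and push.
git remote add origin "$work/sample.git"
git push -q -u origin "$branch"

git remote -v                             # shows the fetch and push URLs
upstream=$(git rev-parse --abbrev-ref "$branch@{upstream}")
echo "upstream: $upstream"
```

With a real GitHub repository only the URL given to `git remote add` changes.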

A cheatsheet for Git and GitHub has been prepared by the folks at GitHub.

R - Statistical Analysis and Computing Environment


R is a statistical analysis and computing environment providing "an integrated suite of software facilities for data
manipulation, calculation and graphical display."
Installing R
Add the line "deb http://cran.rstudio.com/bin/linux/ubuntu trusty/" to the end of the sources.list file.

$ sudo nano /etc/apt/sources.list
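As an alternative to editing the file by hand, the line can be appended from the shell. The sketch below shows the idea on a scratch file rather than on /etc/apt/sources.list; the helper name `append_once` is made up for this example. On the real file the same append would need root privileges, for example by piping the line through `sudo tee -a`.

```shell
# append_once FILE LINE: add LINE to FILE unless an identical line exists.
append_once() {
    # -x whole-line match, -F fixed string (no regex), -q quiet
    grep -qxF -- "$2" "$1" 2>/dev/null || printf '%s\n' "$2" >> "$1"
}

list=$(mktemp)      # scratch stand-in for /etc/apt/sources.list
entry="deb http://cran.rstudio.com/bin/linux/ubuntu trusty/"

append_once "$list" "$entry"
append_once "$list" "$entry"     # second call changes nothing
matches=$(grep -cxF -- "$entry" "$list")
echo "entries: $matches"
```

Checking for the line first means the command can be re-run without creating duplicate entries.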

Fetch the signing key for the CRAN repository.

$ sudo apt-key adv --keyserver keys.gnupg.net --recv-key 51716619E084DAB9

Install the latest version of R, which might be newer than shown in the figures.

$ sudo apt-get update
$ sudo apt-get upgrade
$ sudo apt-get install -y r-base r-doc-info r-mathlib libcurl4-gnutls-dev

15-minute Introduction to the R Statistical and Computational Environment


Let's start exploring the basic features of the R environment from the comfort of the R Console command-line interface.
Open a terminal window (CTRL + ALT + T) if you are running a graphical desktop environment and type 'R' followed by the
[ENTER] key. Once RStudio is installed you won't have to work at the command-line unless you choose to do so.
| Command | Description | Arguments |
|---|---|---|
| install.packages | install a package from CRAN | package_name (mandatory) |
| install_github | install a package from GitHub | package_name (mandatory) |
| library | load a package | package_name (mandatory) |
| ? | access the help system | [package_name] [function_name] (mandatory) |
| q() | exit R; prompts to save the environment before shutting down the R Statistical Analysis and Computing Environment | |

Arguments in brackets are optional but if the 'mandatory' designation is present, at least one of the arguments must be
supplied.

Table 0.3 Essential R Commands


For each example type the commands to the right of the command prompt (>) to interactively follow along these examples.
Take your time working through the commands until you fully understand why each command produces the observed
results.
Example: For simplicity we do not show the output of the commands used within R Console. We will install the devtools
package, as an exemplar, which will be needed to successfully compile other packages throughout these chapters and the
nine data science courses.


$ R
R version 3.1.1 (2014-07-10) -- "Sock it to Me"
Copyright (C) 2014 The R Foundation for Statistical Computing
Platform: i686-pc-linux-gnu (32-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
Natural language support but running in an English locale
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> install.packages("devtools")
> library(devtools)
> ?devtools
> q()
$
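R expressions can also be run without an interactive session. A sketch, assuming R is installed: `Rscript -e` evaluates an expression and exits, which is handy for checking an installation from the terminal. The variable `r_status` exists only for this example.

```shell
if command -v Rscript >/dev/null 2>&1; then
    # Print the installed R version string.
    Rscript -e 'cat(R.version.string, "\n")'
    # Report whether the devtools package is available to load.
    Rscript -e 'cat("devtools installed:", requireNamespace("devtools", quietly = TRUE), "\n")'
    r_status="found"
else
    echo "Rscript not found; install R first"
    r_status="missing"
fi
```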

RStudio - Integrated Development Environment


RStudio is an integrated development environment providing a platform "to tackle the toughest and most interesting
problems with R."
Installing RStudio

$ wget http://download1.rstudio.org/rstudio-0.98.1103-i386.deb -O ${HOME}/Downloads/rstudio.deb
$ sudo apt-get install libjpeg62
$ sudo dpkg -i ${HOME}/Downloads/rstudio.deb

15-minute Introduction to the RStudio Integrated Development Environment

Figure 0.15 RStudio Integrated Development Environment


Launch RStudio by clicking on the 'Button' icon near the upper left of the application bar, typing 'rstudio' into the search
field, and clicking on the RStudio icon. Once the application is visible as shown in Figure 0.15 right-click on the RStudio
icon in the application bar and select 'Lock to Launcher'.

Figure 0.16 Configure Global Options


Click 'Tools' on the main menu followed by 'Global Options' to configure RStudio.

Figure 0.17 Select the CRAN repository mirror to fetch packages


Select a geographically-nearby CRAN repository after clicking 'Packages'.

Figure 0.18 Configure code editing preferences


Click 'Code Editing' to configure the appearance and behaviour of the code editing pane.

Figure 0.19 Configure version control options


Click 'Git/SVN' to configure which version control system will be used. If Git has already been installed, the defaults
can be accepted. Click 'Apply'. Click 'OK'.

Figure 0.20 Create a directory


Click the 'Files' tab in the lower right pane and navigate to the Projects directory and click 'New Folder'. Type the name of
the course DataScientistsToolbox. If a directory named Projects does not exist, create it.

Figure 0.21 Create a new project - Step 1


In the upper right click 'Project (None)' and select 'New Project'.

Figure 0.22 Create a new project - Step 2


Click 'New Directory' to create a new repository.

Figure 0.23 Create a new project - Step 3


Click 'Empty Project' as the project type.

Figure 0.24 Create a new project - Step 4

Figure 0.25 Create a new project - Step 5


Navigate to ${HOME}/Projects/DataScientistsToolbox. Click 'Choose'.

Figure 0.26 Create a new project - Step 6


Type a directory name for the project, select 'Create a git repository', and click 'Create Project'.

Figure 0.27 Create a new text file


Select 'File' on the main menu followed by 'New File' and select 'Text File' as the file type.

Figure 0.28 Save the student_grades.csv data file


Type the contents shown in the code editing pane. Click on the diskette icon or select 'File, Save' from the menu. Type the
file name and click 'Save'.

Figure 0.29 Set the working directory


Click 'Session' on the main menu and select 'Set Working Directory' followed by 'To Files Pane Location'.

Figure 0.30 Save the student_grades.R script


Click 'File' on the main menu followed by 'New File' and select 'R Script'.
Type the R code shown below. Then click 'File' followed by 'Save' before typing the file name and clicking 'Save'.

Figure 0.31 Read student grades file and output the contents
Highlight the code in the 'student_grades.R' tab. Click 'Run'.

Figure 0.32 Commit changes to the local repository


Click the 'Git' tab in the upper right pane. Click 'Commit'.

Figure 0.33 Select changes to be committed to the local repository


Select each of the four files by marking them as staged. Type a commit message. Click 'Commit' to commit these changes
to the local repository.

Figure 0.34 Summary of changes to the local repository


Review the messages before clicking 'Close'. Afterwards close the 'Review Changes' pop-up window.

Figure 0.35 Tracking changes in an open project


Modify the R code as shown in the 'student_grades.R' tab. Did you notice the new entry under the 'Git' tab? Highlight the
last line of code and run it. Commit this change using the same procedure.

Figure 0.36 Push the contents of the local repository to GitHub


Log in to GitHub using a web browser and create an empty repository named 'demo'. In RStudio click the gear icon under
the 'Git' tab and select 'Shell'. For convenience we put the git commands in the code pane. Type these commands in the
shell, substituting your GitHub account. Type 'exit' to close the shell. Verify the repository on GitHub has been updated.
Log out of GitHub.

Figure 0.37 Close the currently active project


Click on 'demo' in the upper right corner of RStudio and click 'Close Project'.

Figure 0.38 GitHub repository named demo after the push from local repository
Congratulations! You successfully configured a virtual machine for use during the data science boot-camp.
Practise. Practise. Practise your newly acquired knowledge and skills in preparation for the course project.

Final Thoughts
Data Scientist's Toolbox introduced the statistical computing and graphing suite, the integrated development
environment, and the version / revision control system selected by the Data Science Specialization Lab Team in the
Biostatistics Department of Johns Hopkins University. The features and capabilities of these tools extend beyond the
basics presented in this chapter. While the graphical user interface is convenient we highly recommend and encourage you
to become comfortable with the command-line as well.
As a data science recruit outfitted with your kit (Git, R, RStudio, Ubuntu Linux, and GitHub account) the instructor for R
Programming awaits. Boot-camp has been easy up to this point. Read the "Data Science Boot-Camp Survival Manual"
regularly to avoid washing-out of boot-camp.
Recruits, dismissed.

Chapter 1 - R Programming

Chapter 2 - Getting and Cleaning Data

Chapter 3 - Exploratory Data Analysis

Chapter 4 - Reproducible Research

Chapter 5 - Statistical Inference

Chapter 6 - Regression Models

Chapter 7 - Practical Machine Learning

Chapter 8 - Developing Data Products

Capstone

Epilogue
