
Advanced R Data Analysis

Training

Trainer: Dr. Ghazaleh Babanejad


Website: www.tertiarycourses.com.my
Email: malaysiacourses@tertiaryinfotech.com
About the Trainer
Dr. Ghazaleh Babanejad received her PhD from the
Faculty of Computer Science and Information Technology,
Universiti Putra Malaysia. Her PhD research covers
recommender systems, in the area of skyline queries over
dynamic and incomplete databases. She also works in
data science as a trainer and data scientist, and has
worked on machine learning and process mining
projects. She holds several international
certificates, including Practical Machine Learning
(Johns Hopkins University), Mining Massive Datasets
(Stanford University), Process Mining
(Eindhoven University), Hadoop (University of San
Diego) and MongoDB for DBAs (MongoDB Inc), among
others. She has more than 5 years ...
Agenda
Module 1: R Data Analysis Packages
- Data Analysis Components
- Data Analysis Steps
- R Data Analysis Packages

Module 2: Obtaining Data


- Reading Data from CSV file
- Reading Data from JSON file
- Reading Data from XML file
- Reading Data from Web
- Reading Data from APIs
Agenda
Module 3: Data Exploration and Cleaning
- Exploring data
- Imputing missing data
- Dealing with Outliers

Module 4: Data Preprocessing


- Selecting columns and rows
- Calculated columns
- Arranging data
- Chain operations
- Joins
- Summarize and group by
Agenda
Module 5: Data Reshaping
- Splitting and merging columns
- Rearranging and reorienting columns

Module 6: Data Visualization


- ggplot2 syntax and analysis

Module 7: Advanced Analysis (optional)


- Map function
- User defined functions & logical testing
- pmap function
Prerequisite
Basic knowledge of R is assumed
Exercise Files
Download the exercise files from

https://github.com/rkrtiwari/rAdvanced
Module 1
Getting Started
Data Analysis Steps
• Data Collection
• Data Processing
• Data Cleaning
• Data Visualization
• Data Product
R Data Analysis Packages
Data Manipulation
dplyr: Data manipulation tasks
tidyr: Reshape data
mice: Missing data imputation

Data Analysis
DataExplorer: Visualize variables
R Data Analysis Packages
Data Visualization
ggplot2: Powerful visualization
shiny: Interactive data visualization
VIM: Missing data visualization
Install Packages
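install.packages("tidyverse")   # includes dplyr, tidyr, ggplot2, purrr
install.packages("DataExplorer")
install.packages("data.table")
install.packages("mice")
install.packages("ggplot2")
# also used later in these slides
install.packages(c("jsonlite", "XML", "VIM", "outliers"))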
Module 2
Obtaining Data
Read Data from CSV File
data1 <- read.csv("data.csv", header = TRUE)
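A minimal, self-contained sketch of the CSV read above. The file contents and column names are made up for illustration. The file is written to a temporary location first so that the example runs anywhere.

```r
# Write a small illustrative CSV to a temporary file (hypothetical data)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,name,score",
             "1,Ann,90.5",
             "2,Bob,78.0",
             "3,Cid,"), csv_path)

# header = TRUE uses the first line as column names;
# the empty score in the last row is read as NA
data1 <- read.csv(csv_path, header = TRUE, stringsAsFactors = FALSE)

str(data1)              # column types: integer, character, numeric
colSums(is.na(data1))   # number of missing values per column
```

Checking `str()` and the NA counts right after reading is a cheap way to catch columns that were parsed with an unexpected type.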
Read Data from json
library(jsonlite)  # assumed source of fromJSON(); the slide does not name the package
data <- fromJSON("data.json")
Read Data from Web
url <- "http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data"
read.csv(url, nrows = 5, header = FALSE)
Read Data from XML
library(XML)
data <- xmlTreeParse("data.xml")
Challenge
Read the housing data from the
following webpage
"https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data"
and store it in a data frame named
house

Time: 5 min
Module 3
Data Exploration
and Cleaning
Exploring our data
# load the libraries
library(DataExplorer)
library(data.table)

# explore the dataset
names(heart)
head(heart)
str(heart)
summary(heart)

# convert the data frame to a data.table
heartDT=data.table(heart)
Exploring our data
# grouping and frequency analysis
group_category(heartDT, "chest_pain", 0, "chol")

# view frequency based on another measure
group_category(heartDT, "chest_pain", 0, "age")
Plotting

# discrete features (categorical data)
plot_bar(heartDT)

# continuous features (numeric data)
plot_boxplot(heartDT, by="disease")
# disease is the categorical variable

# correlation plot
plot_correlation(heartDT)
Plotting

# density plot
plot_density(heartDT)
# only for numerical columns

# histogram
plot_histogram(heartDT)
# only for numeric columns

# scatterplot
plot_scatterplot(heartDT,"age")
# using age as y axis
Splitting data

# generates 2 data tables, one for the
# continuous and one for the discrete columns
output=split_columns(heartDT)

output$discrete

output$continuous
Imputing data

library(mice)
library(VIM)

# Visualization of the missing pattern
aggr(miss_mtcars, numbers=TRUE)

# Mean Substitution
mean_sub <- miss_mtcars
mean_sub$qsec[is.na(mean_sub$qsec)] <-
  mean(mean_sub$qsec, na.rm = TRUE)
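The mean substitution above can be tried end to end with the sketch below. It builds its own `miss_mtcars` by blanking five `qsec` values in the built-in `mtcars` data. This is an assumption for illustration, and the course's `miss_mtcars` from the exercise files may differ.

```r
# Build an illustrative data set with missing values
# (assumption: not the course's own miss_mtcars)
set.seed(1)
miss_mtcars <- mtcars
miss_mtcars$qsec[sample(nrow(miss_mtcars), 5)] <- NA

sum(is.na(miss_mtcars$qsec))       # 5 missing values before imputation

# Mean substitution: replace each NA with the mean of the observed values
mean_sub <- miss_mtcars
qsec_mean <- mean(mean_sub$qsec, na.rm = TRUE)
mean_sub$qsec[is.na(mean_sub$qsec)] <- qsec_mean

sum(is.na(mean_sub$qsec))          # 0 missing values afterwards
```

Mean substitution leaves the column mean unchanged but shrinks its variance, which is one reason model-based imputation (for example with mice) is often preferred.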
Dealing with Outliers
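# ESD method: flag values more than t standard
# deviations away from the mean

t=2          # threshold in standard deviations
m=mean(x)
s=sd(x)

b1=m - s*t   # lower bound
b2=m + s*t   # upper bound

y=ifelse(x >=b1 & x <=b2, 0, 1)   # 1 marks a possible outlier

table(y)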
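As a worked sketch of the rule above, the code below applies it to a concrete vector. Using `mtcars$hp` as `x` is an assumption made only so that the example is runnable.

```r
# Illustrative vector: horsepower from the built-in mtcars data
x <- mtcars$hp

t <- 2                 # number of standard deviations allowed
m <- mean(x)
s <- sd(x)

b1 <- m - s * t        # lower bound
b2 <- m + s * t        # upper bound

# 0 = inside the bounds, 1 = flagged as a possible outlier
y <- ifelse(x >= b1 & x <= b2, 0, 1)
table(y)

x[y == 1]              # the flagged values
x_clean <- x[y == 0]   # the vector with flagged values removed
```

The threshold `t` is a judgment call: a larger value flags fewer points.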
Dealing with Outliers
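# boxplot method
boxplot(x)
boxplot.stats(x)   # $out lists the points beyond the whiskers

# outliers package
library(outliers)

dixon.test(x)      # Dixon's test, meant for small samples (3 to 30 values)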
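A short sketch of the boxplot method using base R only. Again `mtcars$hp` stands in for `x`, which is an assumption for illustration.

```r
x <- mtcars$hp

# boxplot.stats() returns the five-number summary ($stats) and the
# points lying beyond the whiskers ($out)
bs <- boxplot.stats(x)
bs$stats
bs$out

# Drop the flagged points
x_no_out <- x[!x %in% bs$out]
```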
Challenge (10 mins)
Using the airquality dataset in R

1) explore the dataset
2) do frequency analysis
3) plot features and correlation plot
4) view the missing values
5) substitute the missing values with the mean
6) remove any outliers
Module 4
Data
Preprocessing
Data structure

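library(tidyverse)   # attaches dplyr, tidyr, ggplot2, purrr and tibble

glimpse(x)    # columns, types and first values
lst(x)
tbl_sum(x)    # short summary of the table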
Selecting columns

x2=select(x,col1,col2,col3,col4)
# selecting only 4 columns

x2=select(x, -col1, -col2)
# dropping columns 1 and 2

x2=rename(x, col99=col2)
# renaming col2 to col99
Filtering rows
x2=filter(x, disease=="negative")
# keep only the negative disease rows

x2=filter(x, disease=="negative" &
          thalach>160)
# filtering on two conditions

x2=filter(x, chest_pain != "asympt")
# exclude "asympt"

x2=filter(x, chest_pain %in%
          c("asympt","angina"))
# only retain "asympt" and "angina"
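The selecting and filtering steps above can be tried on a built-in data set. In this sketch `mtcars` stands in for the course data (an assumption), and the object names are made up for illustration.

```r
library(dplyr)

# mtcars stands in for the course data set here (an assumption)
cars <- select(mtcars, mpg, cyl, hp, am)         # keep four columns
cars_renamed <- rename(cars, horsepower = hp)    # new_name = old_name

six_cyl     <- filter(cars, cyl == 6)              # single condition
strong_auto <- filter(cars, am == 0 & hp > 150)    # two conditions
not_four    <- filter(cars, cyl != 4)              # exclude a value
four_eight  <- filter(cars, cyl %in% c(4, 8))      # keep a set of values

nrow(six_cyl)
```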
Creating calculated columns

x2= mutate(x, old = age>50)
# this will give a new column with TRUE or FALSE

x2= mutate(x, chol_class=chol/20)

x2= mutate(heart, chol_class=chol/20,
           trestbps_class=trestbps/5)
# this will give two new columns
Creating calculated columns

# using if_else function in mutate

x2=mutate(x, cholLevel=
if_else(chol>250,"highrisk","normal"),
chol_class=chol/20)
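A runnable sketch of calculated columns, including `if_else()`. It uses `mtcars` instead of the heart data, and the new column names and cut-off values are hypothetical.

```r
library(dplyr)

cars2 <- mutate(mtcars,
                heavy     = wt > 3.5,                              # logical column
                hp_class  = hp / 50,                               # numeric column
                mpg_level = if_else(mpg > 25, "economical", "normal"))

head(select(cars2, wt, heavy, hp, hp_class, mpg, mpg_level))
```

`if_else()` is stricter than base `ifelse()`: both outcomes must have the same type, which catches mixed-type mistakes early.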
Counting and arranging
count(x, chest_pain, sort = TRUE)

count(x, disease, sort=TRUE)

count(x, chest_pain, disease)

distinct(x, exang) # gives only 2 levels

distinct(x, exang, disease)


# look at 2 variables at same time
Counting and arranging
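x2=arrange(x, age)
# arrange all the rows by age, ascending

x2=arrange(x, age, thalach)
# arrange by age first then thalach

x2=arrange(x, desc(age))
# descending order

x2=top_n(x, 20, age)
# the 20 rows with the largest age (ties are kept);
# without the third argument top_n() ranks by the last column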
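The sketch below contrasts "first rows in the current order" with "rows with the largest values", which are easy to confuse. It uses `mtcars` (an assumption) and `slice_max()`, which is available in dplyr 1.0.0 and later.

```r
library(dplyr)

by_mpg     <- arrange(mtcars, mpg)              # ascending
by_cyl_mpg <- arrange(mtcars, cyl, desc(mpg))   # cyl ascending, then mpg descending

first5 <- head(by_mpg, 5)                  # first 5 rows in the current order
top5   <- slice_max(mtcars, mpg, n = 5)    # rows with the 5 largest mpg values
```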
Chaining
# the "%>%" operator is used in chain operations
# to link one process to another

heart %>% select(1:5) %>%
  mutate(chol_class=chol/20,
         trestbps_class=trestbps/5)

heart %>% select(thalach) %>%
  mutate(thalach_class=thalach/15)
Joins
left_join(A,B, by="col1")
# join matching rows from B to A (keeps all rows of A)

right_join(A,B, by="col1")
# join matching rows from A to B (keeps all rows of B)

inner_join(A,B, by="col1")
# join data, retain only rows in both sets

full_join(A,B, by="col1")
# join data, retain all values, all rows
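To make the four joins concrete, here is a small sketch with two hypothetical tables that share the key column `col1`. The tables and their values are invented for illustration.

```r
library(dplyr)

# Two small hypothetical tables sharing the key column "col1"
A <- data.frame(col1 = c(1, 2, 3), a_val = c("a1", "a2", "a3"))
B <- data.frame(col1 = c(2, 3, 4), b_val = c("b2", "b3", "b4"))

left  <- left_join(A, B, by = "col1")    # keys 1, 2, 3 (b_val is NA for key 1)
right <- right_join(A, B, by = "col1")   # keys 2, 3, 4 (a_val is NA for key 4)
inner <- inner_join(A, B, by = "col1")   # keys 2, 3
full  <- full_join(A, B, by = "col1")    # keys 1, 2, 3, 4
```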
Group by
groupDisease=group_by(x, disease)
# disease is the variable by which we want to
# create groups ["positive", "negative"]

groupDisease2=group_by(x, disease, fbs)
# more groups
Summarize
# you can choose your own summary
# statistics

summarize(heart,
          count=n(),
          avgAge=mean(age, na.rm=TRUE),
          sdAge=sd(age, na.rm=TRUE),
          medAge=median(age, na.rm=TRUE),
          Q3rdAge=quantile(age, .75, na.rm=TRUE)
)
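Grouping and summarizing are usually combined. The sketch below shows that combination on `mtcars` (an assumption, standing in for the heart data), producing one summary row per group.

```r
library(dplyr)

by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarize(count  = n(),
            avgMpg = mean(mpg, na.rm = TRUE),
            sdMpg  = sd(mpg, na.rm = TRUE),
            medMpg = median(mpg, na.rm = TRUE),
            Q3Mpg  = quantile(mpg, 0.75, na.rm = TRUE, names = FALSE))

by_cyl   # one row for each value of cyl
```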
Challenge (10 mins)
Use the mtcars dataset

1) Select first 9 columns and 20 rows
2) Create calculated column for average of
   mpg and disp
3) Arrange by qsec descending
4) Group by cyl and vs
5) Do summary stats like (count, mean, max)
Module 5
Data Reshaping
Separate
# if your data contains 2 sets of
# information in 1 column you can split them
# up

Arguments
#first: dataset name,
#second: name of the column to split,
#third: new column names to split the column into
#fourth: the separator (what to split the
# column by)
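The four arguments listed above map onto a call like the one sketched here. The `team` data frame and its contents are hypothetical.

```r
library(tidyr)

# Hypothetical data: first and last name stored in one column
team <- data.frame(Full_Name = c("Ann Lee", "Bob Tan"),
                   stringsAsFactors = FALSE)

# dataset, column to split, new column names, separator
team2 <- separate(team, Full_Name, c("First_Name", "Last_Name"), sep = " ")
team2
```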
Unite
#opposite of separate, combining columns

Arguments
#first: dataset name,
#second: name of the new column to unite the
# columns into,
#third: column names to combine
#fourth: the separator in the new column

unite(team, "Full Name", c(First_Name,
      Last_Name), sep=" ")
Gather
# rearranging and re-orienting the
# columns by stacking them into a single
# year column

#first: dataset name,
#second: new column name (for the names of the
# columns we are stacking),
#third: new column name (for the values of
# the stacked columns)
#fourth: columns that we are stacking

homeruns2=gather(homeruns, year,
                 home_runs, YR2015:YR2013)
Spread
#opposite of gather, spreading out the
# columns

# first: dataset name,
# second: column whose values become the new
# column names,
# third: column whose values fill the new columns

spread(homeruns2, year, home_runs)
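A round trip through `gather()` and `spread()` can be sketched as follows. The `homeruns` table here is invented for illustration and is not the course's data.

```r
library(tidyr)

# Hypothetical wide table: one column of home runs per year
homeruns <- data.frame(player = c("A", "B"),
                       YR2015 = c(10, 20),
                       YR2014 = c(12, 18),
                       YR2013 = c(8, 25))

# Wide to long: the year columns are stacked into year / home_runs
homeruns2 <- gather(homeruns, year, home_runs, YR2015:YR2013)

# Long back to wide
homeruns3 <- spread(homeruns2, year, home_runs)
```

Recent tidyr versions (1.0.0 and later) also offer `pivot_longer()` and `pivot_wider()` for the same reshaping. `gather()` and `spread()` are kept here because the slides use them.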


Module 6
Data Visualization
Scatter Plot
ggplot(mtcars) + aes(x=wt, y=mpg) +
geom_point(size=3, color = "blue")
Scatter Plot (grouped data)
ggplot(mtcars) + aes(x=wt, y=mpg,
color = factor(cyl) ) +
geom_point(size=3)
Scatter Plot (adding a trendline)
ggplot(mtcars) + aes(x=wt, y=mpg) +
geom_point() + stat_smooth(method =
"lm")
Scatter Plot (faceting: I)
ggplot(mtcars) + aes(x=wt, y=mpg) +
geom_point() + facet_grid( am ~ .)
Scatter Plot (faceting: II)
ggplot(mtcars) + aes(x=wt, y=mpg) +
geom_point() + facet_grid( am ~ cyl)
Scatter Plot (faceting: III)
ggplot(mm) + aes(x=value, y = mpg) +
geom_point() + facet_wrap( ~variable,
scales = "free", ncol = 2)
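The slide above uses a data frame `mm` that is not defined here. The sketch below assumes it is `mtcars` in long form, with `mpg` kept as its own column and a few other variables stacked into `variable` / `value`. The choice of variables is hypothetical.

```r
library(ggplot2)
library(tidyr)

# Assumption: mm is mtcars in long form, with mpg kept as its own column
mm <- gather(mtcars[, c("mpg", "wt", "hp", "disp", "qsec")],
             variable, value, -mpg)

# One panel per stacked variable, each with its own axis scales
p <- ggplot(mm) + aes(x = value, y = mpg) +
  geom_point() +
  facet_wrap(~variable, scales = "free", ncol = 2)
p
```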
Bar Plot
ggplot(mtcars, aes(x = factor( cyl))) +
geom_bar()
Multiple Bar Plot
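ggplot(mm) + aes(x=factor(month), y=
value) + geom_col() + facet_grid( . ~
variable)
# geom_col() uses the y values as bar heights;
# geom_bar() with its default count stat fails when y is supplied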
Histogram
ggplot(mtcars, aes(x = mpg)) +
geom_histogram(binwidth = 3)
Boxplot
ggplot( mtcars, aes(x = factor( cyl), y =
mpg)) + geom_boxplot()
Challenge
Use ggplot to plot the Median value
of owner-occupied homes vs. per
capita crime rate
Module 7
Advanced Analysis
(optional)
Map functions
library(purrr)

# map() returns a list
# map_lgl() returns a logical vector
# map_int() returns an integer vector
# map_dbl() returns a double vector
# map_chr() returns a character vector
Map functions
map(x, summary)         # find a summary of each column

map_lgl(x, is.numeric)  # find columns that are numeric (returns logical)

map_chr(x, typeof)      # find the type of each column (returns character)
Apply functions
map_dbl(x, mean)   # find column means

map_dbl(x, sd)     # find column std dev

map_dbl(x, quantile, probs=c(0.05))   # find 5th percentile
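The typed map variants can be compared on a small data frame. In this sketch three columns of `mtcars` stand in for `x` (an assumption).

```r
library(purrr)

x <- mtcars[, c("mpg", "hp", "wt")]

col_types <- map_chr(x, typeof)                   # character vector
col_means <- map_dbl(x, mean)                     # double vector
col_p05   <- map_dbl(x, quantile, probs = 0.05)   # extra arguments are passed on

col_types
col_means
```

The typed variants fail loudly if a result has the wrong type or length, which makes them safer than `sapply()` in scripts.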
Apply user-defined functions
# group the heart data by chest_pain type
# nest() stores each group's rows as a tibble in a list-column

n_heart <- heart %>%
  group_by(chest_pain) %>%
  nest()
Apply user-defined functions
# create a model for each chest_pain group

mod_fun=function(x) lm(chol~ age +
                       trestbps + thalach, data=x)

# apply the model

model_heart=n_heart %>%
  mutate(model=map(data, mod_fun))
# "data" is the list-column created by nest()
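The nest-then-model pattern above can be run on a built-in data set. This sketch assumes `mtcars` grouped by `cyl` as a stand-in for the heart data, with a deliberately simple model. The names `n_cars`, `model_cars` and `slope` are made up for illustration.

```r
library(dplyr)
library(tidyr)
library(purrr)

# mtcars stands in for the heart data (an assumption)
n_cars <- mtcars %>%
  group_by(cyl) %>%
  nest()                       # one row per cyl, with a list-column "data"

mod_fun <- function(df) lm(mpg ~ wt, data = df)

model_cars <- n_cars %>%
  mutate(model = map(data, mod_fun),                    # one lm per group
         slope = map_dbl(model, ~ coef(.x)[["wt"]]))    # pull out one coefficient

model_cars
```

Keeping the models in a list-column keeps each fit next to the data it came from, so later steps can map over both.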
Logical testing
pluck(heart,"age")        # get values in "age"

old=function(x){x>50}

keep(heart$age, old)      # keep elements that pass a logical test

discard(heart$age, old)   # remove elements that pass a logical test
Summarize data
every(heart$age, old)
# do all elements pass the test?

some(heart$age, old)
# do some elements pass the test?

detect(heart$age, old)
# find the first element that passes the test

detect_index(heart$age, old)
# find the position of the first element that passes the test
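The predicate helpers above are easiest to follow on a short vector. The ages below are hypothetical values chosen so that each result can be checked by eye.

```r
library(purrr)

ages <- c(45, 62, 38, 71, 50)      # hypothetical ages
old  <- function(x) x > 50

keep(ages, old)          # 62 71
discard(ages, old)       # 45 38 50
every(ages, old)         # FALSE
some(ages, old)          # TRUE
detect(ages, old)        # 62
detect_index(ages, old)  # 2
```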
pmap
# pmap takes a list of arguments as input
# (map with multiple arguments)

n=list(5,10,20)
mu=list(1,5,10)
sd=list(0.1,1,0.1)

pmap(list(n, mu, sd), rnorm)
# calls rnorm(5, 1, 0.1), rnorm(10, 5, 1), rnorm(20, 10, 0.1)
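A variant of the pmap call above, sketched with named list elements so that each value is matched to an `rnorm()` argument by name rather than by position. The seed and the object names `args` and `samples` are illustrative choices.

```r
library(purrr)

set.seed(42)
args <- list(n    = list(5, 10, 20),
             mean = list(1, 5, 10),
             sd   = list(0.1, 1, 0.1))

# Call i is rnorm(n = n[[i]], mean = mean[[i]], sd = sd[[i]])
samples <- pmap(args, rnorm)

lengths(samples)   # 5 10 20
```

Naming the elements protects against silently passing arguments in the wrong order.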


Challenge (10 mins)
Use the mtcars dataset

1) Map summary of each column
2) Find column means
3) Group by cyl and am (nest)
4) Apply a model for each group
Summary
Parting
Message
Q&A
Feedback
https://goo.gl/EDezXH
Thank You!
Ghazaleh Babanejad
ghazaleh.babanejad@gmail.com
01123005257
